Saturday, January 7, 2012

The circle of observability

Source: http://www.youtube.com/watch?v=l7aQWoTRqKw

   +----->  PROFILING -----+       
   |         -top          |
   |         -oprofile     |
   |         -perf         |
   |                       |
   |     +---------+       |
   |     |SystemTap|       |
   |     +---------+       |
   |                       |
   |                       v
 DEBUGGING <------------- TRACING
  -gdb                    -strace
                          -ltrace
                          -ftrace

Wednesday, August 24, 2011

How to improve disk I/O

Someone called "Unnikrishnan P." from India posted the following question in the LinkedIn Linux group:

how to improve disk I/O

I have a server with 128GB RAM and I cannot find the swap memory (120G) being used even when the disk utilization is at %util =95% (using iostat -d -x 5 3). What might be the root cause? What can be done to make the server perform much better?


As is usually the case, the thread of answers grew over time. Some of them were really interesting, so I decided to post a summary here, extracting the most interesting answers and leaving out repeated comments:


Phil Quiney:

The bottleneck is going to be disk I/O performance: Offloading disk operations onto dedicated hardware (RAID controllers - proper ones, not the ones fitted to motherboards) will speed things up. Sun were doing this years ago with SCSI drives as they had enough intelligence to do disk to disk transfers without going via the host, IDE was/is too dumb to do that.



Jim Parks:

With a minimum of four spindles, you could do a RAID 10 setup. RAID 10 is a non-standard RAID level, but it provides both redundancy and higher performance, as it uses multiple striped volumes in a mirrored configuration.

From around 12 spindles on up, you're probably better off with a RAID 6 (striping with double-distributed parity) setup. There will almost certainly be people who want to argue with me about this. They have their ideas, I have mine, and I'm not going to engage in a drawn out discussion about it, but the short version is that the more spindles you have, the more RAID 6 will make sense. RAID 6 will allow you to get more usable space than RAID 10. For example, a RAID 10 array comprised of 16 1TB drives will provide 8 TB of usable space. The same 16 1 TB drives in a RAID 6 configuration would provide 14 TB of usable space -- that's 75% more useful diskspace, with the ability to survive the loss of two drives without data loss.
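
The capacity arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the formulas assume equally sized drives: RAID 10 keeps half of the raw capacity, while RAID 6 gives up two drives' worth of space to parity.

```python
def raid10_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable capacity of a RAID 10 array: every block is stored twice."""
    if drives < 4 or drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drives * drive_tb / 2


def raid6_usable_tb(drives: int, drive_tb: float) -> float:
    """Usable capacity of a RAID 6 array: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (drives - 2) * drive_tb


# The 16 x 1 TB example from the comment above.
r10 = raid10_usable_tb(16, 1.0)
r6 = raid6_usable_tb(16, 1.0)
gain_percent = (r6 - r10) / r10 * 100
print(r10, r6, gain_percent)  # 8.0 14.0 75.0
```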



Micheas Herman:

If /tmp is mounted on tmpfs then the large swap space may be reasonable, but otherwise you might try turning swap off temporarily (swapoff) to see whether that reduces the disk I/O.
If you do have millions of session cookies in /tmp and you are using tmpfs, moving the swap to a solid state drive might help a lot.

If you are trying to get the swap space to be used more aggressively you can check what the swappiness of your system is

by running: cat /proc/sys/vm/swappiness

Which will return a number from 0 to 100

If the number is less than 100 and you want to increase the swap usage as much as possible you can run the following:

echo 100 > /proc/sys/vm/swappiness

To make the change persistent across reboots run:

echo "vm/swappiness=100" >> /etc/sysctl.conf

If you want to reduce the swap usage to the absolute minimum do the same using 0 instead of 100.

If you have a database with a lot of commits, disk tuning may not be the best way to improve performance.

Removing unnecessary indexes can greatly reduce the amount of disk access. Using transactions to increase the quantity of data written at one time can also improve performance. It can also make it much worse.

If the server is a mail server there tends to be a lot that can be done but there tends to be very little general advice other than to make sure that the disk cache is not being flushed after every mail received.



David Pye

I don't see any information regarding hardware configuration here, have I missed something???

Is the issue really disk utilization - have you investigated the system as a whole?

If disk I/O is suspect, then there are a number of things which could be addressed. Correct RAID config is one; the number of I/O channels is another (how many, fibre or copper, can you get FC direct to the disk?), as is using the fastest possible disk hardware. More extreme measures include using only a small % of the outermost disk surface, via partitioning, in order to achieve less head latency. Also examine the nature of the disk activity.



James Sutherland:

Also check the filesystem in use - given where this question is posted, probably something like ext3? If so, check the journalling mode: data=ordered, data=journal and data=writeback all perform differently in different situations.



John Lauro:

With 128GB of RAM you are probably better off seeing if you can move some of that I/O to a tmpfs RAM disk, often mounted under /dev/shm for RH/Centos/SL.
...
If you don't need to record when a file is accessed (many people don't even know you can get that info), you can mount the filesystem with noatime. That can make a big difference on the I/O utilization especially for a web server that has lots of small files all over the place.

If some of your I/O is being generated by MySQL or another database, perhaps there are some temporary tables, such as those for session tracking, that you might want to consider converting to memory tables. The rows will be emptied if you have to restart the database, but generally that just means the user has to log in again.



Enrique Arizon Benito (aka, that's me!):


http://en.wikipedia.org/wiki/Hdparm
man hdparm

Also, maybe some app has "too much" log output enabled and it is just a matter of modifying the /etc/... config file. Inotify can be your friend in this case:

* http://linux.die.net/man/7/inotify

Or you can simply use something like:

find /var -mmin -1 -size +512M
(find files under /var modified in the last minute that are bigger than 512 MiB; find only accepts whole numbers for -size, so half a gigabyte has to be written as 512M)
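
The same search can be sketched in Python, which is handy when the result has to feed a monitoring script. This is only an illustration: the function name and its parameters are hypothetical, and the default threshold is half a gigabyte expressed in bytes.

```python
import os
import stat
import time


def recently_grown_files(root, max_age_s=60, min_size=512 * 1024 * 1024):
    """Yield (path, size) for regular files under root that were modified
    within the last max_age_s seconds and are larger than min_size bytes."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices...
            if now - st.st_mtime <= max_age_s and st.st_size > min_size:
                yield path, st.st_size
```

Calling recently_grown_files("/var") and printing each pair gives roughly the same list as the find command; run it as root if unreadable directories must be included.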



Bond Masuda:

first of all, you need to figure out what's consuming the I/O? get a tool like iotop to help:

http://guichaz.free.fr/iotop/
...
I usually use ballpark figures like assuming most modern spinning hard drives can do about 100MB/sec sequential I/O and about 40-60MB/sec random I/O for SATA type drives and add about 10-20% higher numbers for SAS type drives. SSDs of course can have much higher numbers. using those as "guidelines", try to figure out what disk configuration will meet your requirements and add about 20% for overhead. for example, if i have a 8 disk RAID-5 array, which gives me 7 effective spindles, I would assume this can roughly meet a sequential I/O requirement of about 583MB/sec (7x100MB/sec / 1.2 = 583).

with multi-disk arrays, you'll want to consider the "best on average" stripe size for your array... if your I/O pattern is reading large files sequentially, then a larger stripe size is in your favor. if you're doing random I/O and reading small chunks here and there, then a moderate to smaller stripe size is a better choice.

also, in situations where you are looking at really large arrays that might span multiple controllers, you'll need to consider the bottlenecks of the bus. older PCI-X buses are half-duplex and limited to about 1GB/sec. more modern servers (like yours with 128GB of RAM) will likely use PCI-E 4x or 8x buses which can do full-duplex and 2GB/sec (8x).
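
Bond's ballpark sizing rule (effective spindles times per-disk throughput, derated by 20% for overhead) is easy to turn into a small Python helper. The function name and defaults are hypothetical; the figures are the rule-of-thumb values from his comment, not measurements.

```python
def usable_throughput_mb_s(data_spindles, per_disk_mb_s=100.0, overhead=0.20):
    """Aggregate sequential throughput, derated by the overhead margin."""
    return data_spindles * per_disk_mb_s / (1.0 + overhead)


# 8-disk RAID-5 array: 7 effective spindles at about 100MB/sec each.
estimate = usable_throughput_mb_s(7)
print(round(estimate))  # 583
```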


Rob Strong:

not attempting to read all these comments, but take a look at the "%iowait" field in the output of iostat. If this number starts climbing, say past 20% or more, then you possibly have an issue with slow disks and/or a bad RAID setup.



Oleksiy Dovzhanytsya:

...
http://www.xaprb.com/blog/2009/08/23/how-to-find-per-process-io-statistics-on-linux/



Unnikrishnan P. (original author of the question) added some (important) details about his setup:

This is an NFS server used by other servers, which run tools used by the developers. I have now moved some users to a dedicated server and found that performance on this server has increased. Previously there were 4 servers accessing this NFS share, with a lot of reads/writes happening. I was using RAID 5 (LSI controller) on this server. Maybe the throughput of RAID 5 is the problem, as it won't support heavy read/write operations.



James Sutherland:

That information helps - yes, RAID5 can be a bottleneck with heavy write activity, particularly scattered writes, since it has to read the corresponding blocks from the other disks to recalculate parity every time.

What filesystem is it running, ext3? You may find changing the data journalling, or switching to a separate journal device, will help here: mount with data=writeback (risky, if you don't have a very good UPS!) or data=journal. Also see if your LSI card is doing write back caching or not, that can help significantly.



John Lauro:

What I/O scheduler are you running on your disks? If you have at least 128MB on your controller, make sure you don't use cfq (often the default) as the I/O scheduler, as it is the worst with a RAID controller...

for a in /sys/block/*/queue/scheduler ; do echo $a : `<$a` ; done
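
For scripting, the same check can be sketched in Python. The kernel lists the available schedulers in each sysfs file and marks the active one with square brackets (for example "noop deadline [cfq]"); the two function names below are hypothetical.

```python
import glob
import os
import re


def active_scheduler(text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def schedulers_by_device(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in glob.glob(pattern):
        # .../<device>/queue/scheduler -> <device>
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            continue  # device went away or the file is not readable
    return result
```

Printing schedulers_by_device() on the server shows at a glance which disks are still on cfq.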







Jeremy Page:

I'm reasonably certain relatime is now the default for most mount commands.

cat /proc/mounts and see what options your clients are using. I'm using squeeze so probably not a good test but (edited out my personal stuff)
home.site.company.com:/home/pagej /home/pagej nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.28.17.214,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=10.48.33.214 0 0


John Lauro:

relatime is standard on EL6 and nearly as good as noatime for reducing I/O, but it is not the default for EL5, which is still common enough that I wouldn't assume someone isn't running it.


Wayne Monfries:

As mentioned above by Bond Masuda, you may obtain some easy gains by increasing the read ahead buffer size (probably most helpful for sequential loads)
blockdev --setra 2048 /dev/sda # increase readahead buffer size
Another area that is often overlooked with external RAID arrays is that they are capable of processing a much greater number of I/O transactions than a single disk - so check the queue depth (qdepth) on your adapters and devices.


Monday, August 22, 2011

Some tips to make XML and HTML+JS files easier to read and maintain

Here are a couple of tips/conventions to make editing XML easier. Anyone who has used Ant's over-engineered, parameterized config files and then moved to Maven's beautiful convention-over-configuration files will already know how useful conventions are. For the rest of you, believe me, conventions are the best friend of anyone writing XML files.

Convention 1: Use upper case for the ids of global/shared constants, that is, those constants that are normally defined at the top of the file and then used "everywhere". That's a standard convention in most programming languages.

Convention 2: (This is the important one). XML tags that are referenced by other XML tags are usually identified by an id="arbitrary_name" attribute. This arbitrary name is later used to refer to the tag element. The second convention tells us to apply the following rule to the arbitrary name:
Prefix it with "type_", where "type_" usually equals the name of the tag that "owns" the id attribute, or a real type if the tag name is not descriptive enough / too generic.


Below are two (Ant) XML files, before and after applying the conventions:
BEFORE APPLYING CONVENTIONS          ||   AFTER APPLYING CONVENTIONS
                                     ||                                                             
<project name=....>                  || <project name=....>
...                                  || ...
<property name="src.java.dir"        || <property name="DIR_JAVA_SRC"
   value="src" />                    ||    value="src" />
<property name="lib.dir"             || <property name="DIR_LIB"
   value="../../../lib" />           ||    value="../../../lib" />
<property name="build.dir"           || <property name="DIR_BUILD"
   value="bin" />                    ||    value="bin" />
                                     || 
<path id="project.classpath">        || <path id="PATH_PROJECT">
  <fileset dir="${lib.dir}">         ||   <fileset dir="${DIR_LIB}">
    <include name="**/*.jar" />      ||     <include name="**/*.jar" />
  </fileset>                         ||   </fileset>
</path>                              || </path>
                                     || 
<patternset id="conf">               || <patternset id="PATTERNSET_CONF">
  <include name="**/*.xml" />        ||   <include name="**/*.xml" />
  <include name="**/*.properties" /> ||   <include name="**/*.properties" />
  <include name="**/*.conf" />       ||   <include name="**/*.conf" />
</patternset>                        || </patternset>
                                     || 
<patternset id="images">             || <patternset id="PATTERNSET_IMAGES">
  <include name="**/*.png" />        ||   <include name="**/*.png" />
  <include name="**/*.jpg" />        ||   <include name="**/*.jpg" />
  <include name="**/*.gif" />        ||   <include name="**/*.gif" />
</patternset>                        || </patternset>
...                                  || 
<target name="copyconf">             || <target name="copyconf">
  <mkdir dir="${build.dir}" />       ||   <mkdir dir="${DIR_BUILD}" />
  <copy todir="${build.dir}">        ||   <copy todir="${DIR_BUILD}">
    <fileset dir="${src.java.dir}">  ||     <fileset dir="${DIR_JAVA_SRC}">
      <patternset refid="conf" />    ||       <patternset refid="PATTERNSET_CONF" />
    </fileset>                       ||     </fileset>
  </copy>                            ||   </copy>
</target>                            || </target>
                                     || 
</project>                           || </project>

The first great advantage of using the second convention is that now we can use word/code completion in our favourite text editor.


Notice the following points in the previous example:

- Once the conventions are applied, if we want to edit a dir value we just write:
<tagName dir="${DIR_
and word completion in our editor (for example Ctrl+X Ctrl+N in vim) will show a list of the available dir constants.
Notice that <property name="arbitrary_name" ...> is a little bit asymmetric (property is actually used as a meta-tag to define constant string values for later replacement in our XML text file). It doesn't follow the convention:
<tagName id="arbitrary_name" ...>
since it uses "name", not "id", for the tag attribute. Also, we can't use the tag name, "property", as the prefix, since it doesn't tell us the type of the object (it's always a string value). We will use something like "DIR_", as in the previous example, to indicate that the property value is a directory. Other useful prefixes could be "PATH_", "CLASSPATH_", "URL_", ... where the "PREFIX_" is a string describing the actual type of its value.

- For the "patternset", which follows the standard '<tagName id="arbitrary_name" ', we will write:
<patternset refid="PATTERNSET_ (Ctrl+X Ctrl+N for autocompletion)
and word/code completion will now offer a list of the available candidates.

- We could now continue to use code completion with any defined PATH_... element or any other type.

- The convention is also exceptionally useful when mixing HTML (a sort of pseudo-XML) and JavaScript. For example, instead of identifying tables like:
<table id="results" .... >
...
<table id="buttons" .... >
we can use:
<table id="table_results" .... >
...
<table id="table_buttons" .... >
Again, code completion will be at our disposal when handling the HTML element (table, div, form, ...) through JavaScript, so the editor can now help us write our "risky and error-prone" JavaScript code.
Basically, the "TYPE_" prefix convention adds manual type safety to our non-type-safe XML/HTML+JS files.

"Magically", we are now half-way between the non-type-safe languages and the compile-checked, type-safe ones. We don't yet have a compile-time check to warn us of coding mistakes, but at least the editor now helps us with code completion (which is itself an indirect safety check).

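A small script can stand in for part of the missing compile-time check. The Python sketch below lists every element whose id does not start with its own tag name; the function name is hypothetical, and the check is deliberately stricter than the convention, since it does not know about "real type" prefixes such as DIR_ or PATH_ chosen for generic tags.

```python
import xml.etree.ElementTree as ET


def ids_breaking_convention(xml_text):
    """Return (tag, id) pairs whose id does not start with 'TAGNAME_'
    (compared case-insensitively)."""
    offenders = []
    for element in ET.fromstring(xml_text).iter():
        ident = element.get("id")
        if ident is None:
            continue  # elements without an id are not covered by the rule
        prefix = element.tag.upper() + "_"
        if not ident.upper().startswith(prefix):
            offenders.append((element.tag, ident))
    return offenders


SAMPLE = """
<project name="demo">
  <path id="PATH_PROJECT" />
  <patternset id="PATTERNSET_CONF" />
  <patternset id="images" />
</project>
"""

print(ids_breaking_convention(SAMPLE))  # [('patternset', 'images')]
```
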







Saturday, August 6, 2011

This is not a joke, this is Java EE.

Extracted from the Book "Real World Java EE Patterns":
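// Imports assumed here: the Java EE 5/6 javax.* packages.
import javax.ejb.Local;
import javax.ejb.Stateful;
import javax.ejb.TransactionAttribute;
import javax.ejb.TransactionAttributeType;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import javax.persistence.PersistenceContextType;

@Stateful
@Local(BookFacade.class)
@TransactionAttribute(TransactionAttributeType.NEVER)
public class BookFacadeBean implements BookFacade {
    @PersistenceContext(type=PersistenceContextType.EXTENDED)
    private EntityManager em;
    private Book currentBook;

    public Book find(long id) {
        this.currentBook = this.em.find(Book.class, id);
        return this.currentBook;
    }

    public void create(Book book) {
        this.em.persist(book);
        this.currentBook = book;
    }

    public Book getCurrentBook() { return currentBook; }

    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void save() {
        // nothing to do here
    }
}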


The transaction setting of the BookFacade may appear a little esoteric at first glance.
On the class level, TransactionAttributeType.NEVER is used. Therefore it is guaranteed that no method of the bean will be invoked in an active transaction. The method save, however, overrides this rule with the setting TransactionAttributeType.REQUIRES_NEW at the method level, which in turn starts a new transaction every time. The method create, in turn, was not annotated with TransactionAttributeType.REQUIRES_NEW even though the EntityManager#persist method is invoked inside. This invocation will not fail with the TransactionRequiredException. It will not fail because on an EntityManager with an EXTENDED persistence context, the persist, remove, merge, and refresh methods may be called regardless of whether a transaction is active or not. The effects of these operations will be committed to the database when the extended persistence context is enlisted in a transaction and the transaction commits. It actually happens in the empty method save.

So, in order to avoid a few extra lines of intuitive, standard Java code for transaction demarcation, the author has developed counter-intuitive code that only works when an empty function is called (and only if the methods are invoked in the correct order). What's worse, the author looks really excited about this way of working.

I'm completely sure that no one, except those in charge of developing Java EE containers, will think that calling an empty function decorated with metadata is the proper way to work.

Fortunately for the rest of us, we have Spring.

Sunday, May 29, 2011

Xvfb+x11vnc outperforms Xvnc

(update 2011/08/29: NX offers a much better approach to remote X control).

If you have ever tried to run a remote X desktop using Xvnc, you were probably quite disappointed. The performance is poor and the response to user actions is annoyingly slow. Do not despair. There is a much better alternative: just combine Xvfb (the "virtual in-memory" X server) with x11vnc (the VNC server that attaches to an existing X server). The following commands are taken from Wikipedia:


export DISPLAY=:1
Xvfb :1 -screen 0 1024x768x16 &
fluxbox &
x11vnc -display :1 -bg -nopw -listen localhost -xkb


(Another proof of the "Do one thing and do it right!" Unix philosophy)

Sunday, May 15, 2011

KILL THE NULL, KILL 'EM ALL

Some will argue that null values are useful and even needed when writing software. Nulls allow us to declare a variable without a defined value, and that's something many people think is needed when we don't know in advance the real value of an entity.

Imagine, for example, a database whose schema is upgraded by adding a new column "second name" to a table. Since we don't know in advance what value "second name" should have for each existing row, and since many new rows will possibly have no "second name" either, it's quite easy to let it default to NULL.

In fact, I will try to show that using NULL values is always a bad idea. The first trouble many people will have when trying to get rid of nulls is solving the problem in the previous paragraph. If I don't know the value of a variable, or if a variable has no value at all (maybe because performance reasons forced us to denormalize a database), there seems to be no other way than assigning NULL.

Right? Wrong. It's always possible to assign a default value that tags a field as "null". In the previous example we could simply assign the string "NULL" to indicate a non-existent second name. It looks like an arbitrary decision that doesn't improve our code at all. In fact, it does make our application safer. Let's see why.
Imagine a typical web app in which a form sends the "second name" to our server in a GET/POST request. Something like second_name=Armstrong.
Then our PHP server code will look something like:

$second_name = $_REQUEST["second_name"];

Look now at what happens if we accidentally misspell second_name in the HTML form or the PHP code:
$second_name will be assigned a NULL value, not the "NULL" string we want when "second_name" is not used.
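
The same trap exists outside PHP. In the Python sketch below a plain dict stands in for the request parameters (a made-up example, not a real framework API): a misspelled key read with .get() silently produces None, while a small helper, hypothetical as well, turns the mistake into an immediate error.

```python
request = {"second_name": "Armstrong"}  # hypothetical parsed form data

# A misspelled key read with .get() silently yields None,
# much like the PHP case above.
silent = request.get("second_nmae")


def required(params, key):
    """Fetch a parameter, failing loudly when the key is absent."""
    if key not in params:
        raise KeyError(f"missing request parameter: {key!r}")
    return params[key]


try:
    required(request, "second_nmae")
    outcome = "accepted"
except KeyError:
    outcome = "rejected"

print(silent, outcome)  # None rejected
```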

The previous example is just one of the many possible ways in which a variable can silently be set to a wrong NULL value. As you can see, NULL values are quite risky, since languages tend to leave variables in a "null" state when something goes wrong. For example Java, probably the most widely used programming language around, has a big design mistake:
try-catch exception demarcation bounds are also used as visibility (scope) bounds. That means that we cannot write code like:

try {
    MyClass myObject = new MyClass();
    ...
} catch (Exception e) {
    ...
}
// Does not compile: myObject is out of scope here.
myPreparedStatement.setObject(1, myObject);
myPreparedStatement.execute();

The Java language syntax forces us to do something similar to:

MyClass myObject = null;
try {
    myObject = new MyClass();
    ...
} catch (Exception e) {
    ...
}
...
// myObject is still null here if the constructor threw.
myPreparedStatement.setObject(1, myObject);
myPreparedStatement.execute();

Great! If an exception is thrown at object initialization and we are not cautious enough, we will insert a false null into the database.

(Note: In the previous example it's possible to move the last two lines inside the try block, which avoids propagating a null to the database. In a larger code base that could be impossible, and even when it is possible, it's up to the person writing the code to do so; we are looking for a way of writing code that doesn't depend on the programmer's skills and experience.)

With C#, the situation is even worse, since the C# designers were drunk enough not only to copy the same Java pitfalls, but also to encourage the use of nulls "everywhere".

In fact, NULL values are the biggest mistake ever made in software history, and as time passes, the damage grows.

The problem with NULL is that NULL actually is/means/maps to *nothing*. Null is not "unknown", Null is not "not applicable", not even "not available", not even "right" or "wrong". Null bears no information at all. If we were physicists, NULL would compare to a singularity in our equations, a clear symptom that something is really, really wrong.

In the database example we can assign the string "NULL" to indicate an unknown value. In fact, we can assign the string "Not known" to indicate that we don't know its value, even if it exists, and we can also assign the string "Not applicable" when the row, due to a break of normalization, doesn't use that value. Our code will now be able to compare the string and apply one piece of business logic or another, since the value now carries information and state.

Convention lets us replace "null" values with a given value or, even better, with a given set of values ("not known", "not applicable", "" (empty), ...). That's probably what we really want. It means a little bit of extra coding, since we will need to create default rows/objects for each entity in our UML diagram, but the extra work is always worth the effort when our code grows beyond a thousand lines. Magically, null pointers will vanish. Errors will be detected much earlier and the correct exceptions thrown.
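
In application code the idea might look like the following Python sketch. The Missing enum, the Person class and the greeting function are all hypothetical names; the point is only that each "missing" state is an explicit value the code can branch on.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class Missing(Enum):
    """Explicit stand-ins for the states a bare null would blur together."""
    NOT_KNOWN = "Not known"
    NOT_APPLICABLE = "Not applicable"


@dataclass
class Person:
    first_name: str
    second_name: Union[str, Missing] = Missing.NOT_KNOWN


def greeting(person):
    """Pick the business rule according to the state carried by the value."""
    if person.second_name is Missing.NOT_APPLICABLE:
        return f"Hello {person.first_name}"
    if person.second_name is Missing.NOT_KNOWN:
        return f"Hello {person.first_name} (second name pending)"
    return f"Hello {person.first_name} {person.second_name}"


print(greeting(Person("Neil", "Armstrong")))  # Hello Neil Armstrong
```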

And last but not least, here comes a trick to avoid NULLable values in a database in a safe way:

  • Imagine you have a set of "core" tables in your schema, for example "client" and "product".

  • Imagine now that, due to performance issues or for whatever reason, you have another non-normalized table "notNormalizedTable" with foreign key columns "fk_client_id" and "fk_product_id" that often have no value set for client, product, or both.

  • If you want to avoid the risky NULLs, the first step is to create default client and product rows. For example:

    client:
    =======
    id | name
    01 | "Not Known"
    02 | "Not Applicable"
    ...
    product:
    =======
    id | name
    01 | "Not Known"
    02 | "Not Applicable"
    ...


  • A new problem arises now. It's possible to accidentally delete the client or product "meta-rows" that represent the "Not applicable"/"Not known"/... values, and then our strategy for NULL replacement would come to an end.
    Nonetheless, the solution is quite easy. Just create a new meta-table "nullKiller" with a foreign key for each table for which we want "meta-rows", using the "ON DELETE NO ACTION ON UPDATE NO ACTION" clause, like:

    -- Column types are an assumption; use the type of the referenced id columns.
    CREATE TABLE `nullKiller` (
      `fk_client_id`  INT NOT NULL,
      `fk_product_id` INT NOT NULL,
      ...
      `fk_entityN_id` INT NOT NULL,
      CONSTRAINT `nullKiller_const1`
        FOREIGN KEY (`fk_client_id`)
        REFERENCES `client` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION,
      CONSTRAINT `nullKiller_const2`
        FOREIGN KEY (`fk_product_id`)
        REFERENCES `product` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION,
      ...
      CONSTRAINT `nullKiller_constN`
        FOREIGN KEY (`fk_entityN_id`)
        REFERENCES `entityN` (`id`)
        ON DELETE NO ACTION ON UPDATE NO ACTION
    );

    then just insert the new rows:

    INSERT INTO `nullKiller`
      (fk_client_id, fk_product_id, ..., fk_entityN_id)
      VALUES (1, 1, ..., 1), (2, 2, ..., 2);


    Once the `nullKiller` rows are inserted, we can be sure that a set of ids is reserved for the default values of each entity thanks to the "ON UPDATE NO ACTION" clause, and that those rows will not be deleted accidentally thanks to the "ON DELETE NO ACTION" clause.
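
To watch the trick work without a database server, here is a minimal sketch that uses SQLite through Python's sqlite3 module instead of the MySQL-style DDL above. That swap is an assumption made for the demo, as are the column-level REFERENCES syntax, the try_delete helper and the 'ACME' and 'Anvil' rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE client  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE nullKiller (
        fk_client_id  INTEGER NOT NULL
            REFERENCES client (id) ON DELETE NO ACTION ON UPDATE NO ACTION,
        fk_product_id INTEGER NOT NULL
            REFERENCES product (id) ON DELETE NO ACTION ON UPDATE NO ACTION
    );
    INSERT INTO client  VALUES (1, 'Not Known'), (2, 'Not Applicable'), (3, 'ACME');
    INSERT INTO product VALUES (1, 'Not Known'), (2, 'Not Applicable'), (3, 'Anvil');
    INSERT INTO nullKiller VALUES (1, 1), (2, 2);
""")


def try_delete(table, row_id):
    """Return True if the row was deleted, False if a foreign key blocked it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"DELETE FROM {table} WHERE id = ?", (row_id,))
        return True
    except sqlite3.IntegrityError:
        return False


protected = try_delete("client", 1)  # meta-row referenced by nullKiller
ordinary = try_delete("client", 3)   # regular row, nothing references it
print(protected, ordinary)  # False True
```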


So, you know what to do now: "KILL THE NULL, KILL 'EM ALL"