Wednesday, August 24, 2011

How to improve disk I/O

Someone called "Unnikrishnan P." from India posted the following question in the LinkedIn Linux group:

how to improve disk I/O

I have a server with 128GB RAM and I cannot find the swap memory (120G) being used even when the disk utilization is at %util =95% (using iostat -d -x 5 3). What might be the root cause? What can be done to make the server perform much better?

As is usually the case, the thread of answers grew over time. Some of them were really interesting, so I decided to post a summary here, extracting the most interesting answers and leaving out repeated comments:

Phil Quiney:

The bottleneck is going to be disk I/O performance: Offloading disk operations onto dedicated hardware (RAID controllers - proper ones, not the ones fitted to motherboards) will speed things up. Sun were doing this years ago with SCSI drives as they had enough intelligence to do disk to disk transfers without going via the host, IDE was/is too dumb to do that.

Jim Parks:

With a minimum of four spindles, you could do a RAID 10 setup. RAID 10 is a non-standard RAID level, but it provides both redundancy and higher performance, as it uses multiple striped volumes in a mirrored configuration.

From around 12 spindles on up, you're probably better off with a RAID 6 (striping with double-distributed parity) setup. There will almost certainly be people who want to argue with me about this. They have their ideas, I have mine, and I'm not going to engage in a drawn out discussion about it, but the short version is that the more spindles you have, the more RAID 6 will make sense. RAID 6 will allow you to get more usable space than RAID 10. For example, a RAID 10 array comprised of 16 1TB drives will provide 8 TB of usable space. The same 16 1 TB drives in a RAID 6 configuration would provide 14 TB of usable space -- that's 75% more useful diskspace, with the ability to survive the loss of two drives without data loss.
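The capacity arithmetic in Jim's example can be checked with a small sketch. The function name `usable_tb` is hypothetical, and the formulas are the textbook ones (half the drives for RAID 10, all but two for RAID 6), ignoring filesystem and formatting overhead:

```python
def usable_tb(level: str, drives: int, size_tb: float) -> float:
    """Usable capacity in TB for RAID 10 or RAID 6 (overhead ignored)."""
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return drives / 2 * size_tb       # every block is stored twice
    if level == "raid6":
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * size_tb     # two drives' worth of parity
    raise ValueError(f"unsupported level: {level}")


r10 = usable_tb("raid10", 16, 1.0)
r6 = usable_tb("raid6", 16, 1.0)
print(r10, r6, f"{(r6 / r10 - 1) * 100:.0f}% more")  # 8.0 14.0 75% more
```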

Micheas Herman:

If /tmp is mounted on tmpfs then the large swap space may be reasonable, but otherwise you might try turning swap off temporarily (swapoff) to see whether that reduces the disk I/O.
If you do have millions of session cookies in /tmp and you are using tmpfs, moving the swap to a solid state drive might help a lot.

If you are trying to get the swap space to be used more aggressively, you can check the swappiness of your system by running:

cat /proc/sys/vm/swappiness

which will return a number from 0 to 100.

If the number is less than 100 and you want to increase the swap usage as much as possible you can run the following:

echo 100 > /proc/sys/vm/swappiness

To make the change persistent across reboots run:

echo "vm.swappiness=100" >> /etc/sysctl.conf

If you want to reduce the swap usage to the absolute minimum do the same using 0 instead of 100.

If you have a database with a lot of commits, disk tuning may not be the best way to improve performance.

Removing unnecessary indexes can greatly reduce the amount of disk access. Using transactions to increase the quantity of data written at one time can also improve performance. It can also make it much worse.

If the server is a mail server there tends to be a lot that can be done but there tends to be very little general advice other than to make sure that the disk cache is not being flushed after every mail received.

David Pye:

I don't see any information regarding hardware configuration here, have I missed something???

Is the issue really disk utilization - have you investigated the system as a whole?

If disk I/O is suspect, then there are a number of things which could be addressed. Correct RAID config is one. The number of I/O channels is another: how many, fibre or copper, can you get FC direct to the disk? Is it the fastest possible disk hardware? A more extreme option is to use only a small % of the outermost disk surface via partitioning in order to reduce head latency. Also examine the nature of the disk activity.

James Sutherland:

Also check the filesystem in use - given where this question is posted, probably something like ext3? If so, check the journalling mode: data=ordered, data=journal and data=writeback all perform differently in different situations.

John Lauro:

With 128GB of RAM you are probably better off seeing if you can move some of that I/O to a tmpfs RAM disk, often mounted under /dev/shm for RH/Centos/SL.
If you don't need to record when a file is accessed (many people don't even know you can get that info), you can mount the filesystem with noatime. That can make a big difference on the I/O utilization especially for a web server that has lots of small files all over the place.

If some of your I/O is being generated by mysql or another database, perhaps there are some temporary tables, such as for session tracking, that you might want to consider converting to memory tables. The rows will be emptied if you have to restart the database, but generally that just means the user has to log in again.

Enrique Arizon Benito (aka, that's me!):
man hdparm

Also, maybe some app has "too much" log output enabled and it is just a matter of modifying the /etc/... config file. Inotify can be your friend in this case:


Or you can simply use something like:

find /var -mmin -1 -size +512M
(find files 'inside' /var changed/modified in the last minute that are bigger than 0.5 GBytes; GNU find only accepts whole numbers for -size, so half a gigabyte is written as 512M)
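The same search can be sketched in Python, which makes it easier to sort or post-process the hits. This is only a rough equivalent of the find one-liner above: the function name and its parameters are hypothetical, and it looks at regular files only.

```python
import os
import stat
import time


def recently_grown_files(root, max_age_s=60, min_bytes=512 * 1024**2):
    """Yield (path, size) for regular files under root that were modified
    within the last max_age_s seconds and are larger than min_bytes."""
    cutoff = time.time() - max_age_s
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # the file vanished or is not readable
            if (stat.S_ISREG(st.st_mode)
                    and st.st_mtime >= cutoff
                    and st.st_size > min_bytes):
                yield path, st.st_size
```

Calling `sorted(recently_grown_files("/var"), key=lambda hit: hit[1], reverse=True)` would list the biggest offenders first.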

Bond Masuda:

first of all, you need to figure out what's consuming the I/O. get a tool like iotop to help.
I usually use ballpark figures like assuming most modern spinning hard drives can do about 100MB/sec sequential I/O and about 40-60MB/sec random I/O for SATA type drives and add about 10-20% higher numbers for SAS type drives. SSDs of course can have much higher numbers. using those as "guidelines", try to figure out what disk configuration will meet your requirements and add about 20% for overhead. for example, if i have a 8 disk RAID-5 array, which gives me 7 effective spindles, I would assume this can roughly meet a sequential I/O requirement of about 583MB/sec (7x100MB/sec / 1.2 = 583).
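Bond's rule of thumb is easy to turn into a tiny calculator. The function name is hypothetical; the default of 100MB/sec per spindle and the 20% overhead are his ballpark figures, not measurements:

```python
def array_sequential_mb_s(effective_spindles: int,
                          per_disk_mb_s: float = 100.0,
                          overhead: float = 0.20) -> float:
    """Rough sequential throughput an array can be expected to sustain."""
    return effective_spindles * per_disk_mb_s / (1 + overhead)


# 8-disk RAID-5: one disk's worth of parity leaves 7 effective spindles.
print(int(array_sequential_mb_s(8 - 1)))  # 583
```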

with multi-disk arrays, you'll want to consider the "best on average" stripe size for your array... if your I/O pattern is reading large files sequentially, then a larger stripe size is in your favor. if you're doing random I/O and reading small chunks here and there, then a moderate to smaller stripe size is a better choice.

also, in situations where you are looking at really large arrays that might span multiple controllers, you'll need to consider the bottlenecks of the bus. older PCI-X buses are half-duplex and limited to about 1GB/sec. more modern servers (like yours with 128GB of RAM) will likely use PCI-E 4x or 8x buses which can do full-duplex and 2GB/sec (8x).

Rob Strong: not attempting to read all these comments, but take a look at the output of iostat "%iowait" field. If this number starts getting up, say past... 20% or more then you possibly have an issue with slow disk and/or bad raid setup.
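For a quick check without iostat, the iowait share can be estimated from two samples of the aggregate "cpu" line in /proc/stat. This is a sketch under assumptions: the function name is hypothetical, it only uses the first eight counters (user through steal), and iostat's own calculation may differ in detail.

```python
def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two samples of the
    aggregate 'cpu' line from /proc/stat."""
    def counters(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # user nice system idle iowait irq softirq steal
        return [int(x) for x in parts[1:9]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Hypothetical samples taken a few seconds apart.
t0 = "cpu 100 0 50 800 50 0 0 0 0 0"
t1 = "cpu 150 0 100 1450 300 0 0 0 0 0"
print(iowait_percent(t0, t1))  # 25.0
```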

Oleksiy Dovzhanytsya:


Unnikrishnan P. (original author of the question) added some (important) details about his setup:

This is an NFS server which is used by other servers which run tools used by the developers. I have now moved some users to a dedicated server and found that performance on this server has increased. Previously there were 4 servers accessing this NFS share, with a lot of reads/writes happening. I was using RAID 5 (LSI controller) on this server. Maybe the throughput of RAID 5 is the problem, as it won't support heavy read/write operations.

James Sutherland:

That information helps - yes, RAID5 can be a bottleneck with heavy write activity, particularly scattered writes, since it has to read the corresponding blocks from the other disks to recalculate parity every time.
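Why a small RAID 5 write costs extra reads can be seen with plain XOR parity. The sketch below is a toy model, not how any particular controller is implemented, and all names are hypothetical. It shows the two usual ways of updating parity: recomputing it from every data block in the stripe, or read-modify-write using only the old data block and the old parity. Either way the controller has to read before it can write.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(blocks) -> bytes:
    """Parity of a stripe: XOR of all its data blocks."""
    parity = bytes(len(blocks[0]))  # all zeros
    for block in blocks:
        parity = xor_bytes(parity, block)
    return parity


def small_write_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Read-modify-write: new parity = old parity XOR old data XOR new data."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


stripe = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks on three disks
parity = full_parity(stripe)            # stored on a fourth disk

new_block = b"ZZZZ"                     # overwrite the block on disk 1
new_parity = small_write_parity(stripe[1], new_block, parity)
stripe[1] = new_block
print(new_parity == full_parity(stripe))  # True
```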

What filesystem is it running, ext3? You may find changing the data journalling, or switching to a separate journal device, will help here: mount with data=writeback (risky, if you don't have a very good UPS!) or data=journal. Also see if your LSI card is doing write back caching or not, that can help significantly.

John Lauro:

What IO scheduler are you running on your disks? If you have at least 128MB of cache on your controller, make sure you don't use cfq (often the default) as the IO scheduler, as it is the worst with a RAID controller...

for a in /sys/block/*/queue/scheduler ; do echo $a : `<$a` ; done
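The kernel marks the active scheduler with square brackets (for example "noop deadline [cfq]"). A small sketch that pulls out just the active one per device, assuming the same /sys/block layout as the shell loop above; the function names are hypothetical:

```python
import glob
import os
import re


def active_scheduler(text: str):
    """Return the bracketed entry, e.g. 'cfq' from 'noop deadline [cfq]'."""
    match = re.search(r"\[([^\]]+)\]", text)
    if match:
        return match.group(1)
    return text.strip() or None  # some devices list a single value without brackets


def schedulers(pattern: str = "/sys/block/*/queue/scheduler") -> dict:
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in glob.glob(pattern):
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as fh:
                result[device] = active_scheduler(fh.read())
        except OSError:
            continue  # device went away or is not readable
    return result
```

On a Linux box, `print(schedulers())` gives one entry per block device; elsewhere it simply returns an empty dict.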

Unnikrishnan P, adding some more info:

Jeremy Page:

I'm reasonably certain relatime is now the default for most mount commands.

Run cat /proc/mounts and see what options your clients are using. I'm using squeeze so probably not a good test but (edited out my personal stuff) /home/pagej nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr= 0 0

John Lauro:

relatime is standard on EL6 and nearly as good as noatime for reducing I/O, but it is not the default for EL5, which is still common enough that I wouldn't assume someone isn't running it
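To see at a glance which mounts still update access times on every read, the options column of /proc/mounts can be scanned. This is a sketch; the function names and the sample lines are made up for illustration, and it assumes that a mount showing neither noatime nor relatime is doing full atime updates:

```python
def atime_mode(options: str) -> str:
    """Classify a comma-separated mount option string."""
    opts = set(options.split(","))
    for mode in ("noatime", "relatime"):
        if mode in opts:
            return mode
    return "atime"  # neither option listed: every read updates the inode


def parse_mounts(text: str):
    """Yield (mountpoint, fstype, atime mode) for each line of /proc/mounts."""
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue
        _device, mountpoint, fstype, options = parts[:4]
        yield mountpoint, fstype, atime_mode(options)


# Hypothetical sample in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw,relatime,errors=remount-ro 0 0
/dev/sdb1 /data ext3 rw,data=ordered 0 0
server:/export /home nfs rw,noatime,vers=3 0 0
"""

for row in parse_mounts(SAMPLE_MOUNTS):
    print(row)
```

On a real system, pass `open("/proc/mounts").read()` instead of the sample text.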

Wayne Monfries:

As mentioned above by Bond Masuda, you may obtain some easy gains by increasing the read ahead buffer size (probably most helpful for sequential loads)
blockdev --setra 2048 /dev/sda # increase readahead buffer size
Another area that is often overlooked with external raid arrays is that they are capable of processing much greater numbers of IO transactions than a single disk - so check out the qdepth on your adapters and devices.

Monday, August 22, 2011

Some tips to make XML and HTML+JS files easier to read and maintain

Here are a couple of tips/conventions to make editing XML easier. Anyone who has used Ant's over-engineered, parametrized config files and then moved to the beautiful convention-over-configuration files of Maven will already know how useful conventions are. For the rest of you, believe me, conventions are the best friend of anyone writing XML files.

Convention 1: Use upper case for the ids of global/shared constants, that is, those constants that are normally defined at the top of the file and then used "everywhere". That's a standard convention in most programming languages.

Convention 2: (This is the important one). XML tags that are referenced by other XML tags are usually identified by an id="arbitrary_name" attribute. This arbitrary name is later used to refer to the tag element. This second convention tells us to use the following rule for the arbitrary name:
Prefix it with "type_", where "type_" usually equals the name of the tag that "owns" the id attribute, or a real type if the tag name is not descriptive enough / too generic.

Below are two (Ant) XML files, before and after applying the conventions:
BEFORE APPLYING CONVENTIONS          || AFTER APPLYING CONVENTIONS
<project name=....>                  || <project name=....>
...                                  || ...
<property name=""                    || <property name="DIR_JAVA_SRC"
   value="src" />                    ||    value="src" />
<property name="lib.dir"             || <property name="DIR_LIB"
   value="../../../lib" />           ||    value="../../../lib" />
<property name="build.dir"           || <property name="DIR_BUILD"
   value="bin" />                    ||    value="bin" />
<path id="project.classpath">        || <path id="PATH_PROJECT">
  <fileset dir="${lib.dir}">         ||   <fileset dir="${DIR_LIB}">
    <include name="**/*.jar" />      ||     <include name="**/*.jar" />
  </fileset>                         ||   </fileset>
</path>                              || </path>
<patternset id="conf">               || <patternset id="PATTERNSET_CONF">
  <include name="**/*.xml" />        ||   <include name="**/*.xml" />
  <include name="**/*.properties" /> ||   <include name="**/*.properties" />
  <include name="**/*.conf" />       ||   <include name="**/*.conf" />
</patternset>                        || </patternset>
                                     ||
<patternset id="images">             || <patternset id="PATTERNSET_IMAGES">
  <include name="**/*.png" />        ||   <include name="**/*.png" />
  <include name="**/*.jpg" />        ||   <include name="**/*.jpg" />
  <include name="**/*.gif" />        ||   <include name="**/*.gif" />
</patternset>                        || </patternset>
...                                  || ...
<target name="copyconf">             || <target name="copyconf">
  <mkdir dir="${build.dir}" />       ||   <mkdir dir="${DIR_BUILD}" />
  <copy todir="${build.dir}">        ||   <copy todir="${DIR_BUILD}">
    <fileset dir="${}">              ||     <fileset dir="${DIR_JAVA_SRC}">
      <patternset refid="conf" />    ||       <patternset refid="PATTERNSET_CONF" />
    </fileset>                       ||     </fileset>
  </copy>                            ||   </copy>
</target>                            || </target>
</project>                           || </project>

The first great advantage of using the second convention is that now we can use word/code completion in our favourite text editor.

Notice the following points in the previous example:

- Once the conventions are applied, if we want to edit a dir value we just write:
<tagName dir="${DIR_
and word completion in our editor (for example Ctrl+X Ctrl+N in vim) will show a list of the available dir constants already defined.
Notice that <property name="arbitrary_name" ...> is a little bit asymmetric (property is actually used as a meta-tag to define constant string values for later replacement in our XML text file). It doesn't follow the convention:
<tagName id="arbitrary_name" ...>
since it uses "name", not "id", for the tag attribute. Also, we can't use the tag name, "property", as the prefix, since it doesn't by itself provide the type of the object (it's always a string value). We will use something like "DIR_", as in the previous example, to indicate that the property value is a directory. Other useful and descriptive prefixes could be "PATH_", "CLASSPATH_", "URL_", ... where the "PREFIX_" is a string describing the actual type of its value.

- For the "patternset", which follows the standard '<tagName id="arbitrary_name" ', we will write:
<patternset refid="PATTERNSET_ (Ctrl+X Ctrl+N for autocompletion)
And word/code completion will now offer a list of the available candidates.

- We could now continue to use code completion with any defined PATH_... element or any other type.

- The convention used is also exceptionally useful when mixing HTML (sort of pseudo-XML) and Javascript. For example, instead of identifying tables like:
<table id="results" .... >
<table id="buttons" .... >
we can use:
<table id="table_results" .... >
<table id="table_buttons" .... >
Again, code completion will be at our disposal when handling the html element (table, div, form,...) through Javascript. Now the editor can help us write our "risky and error prone" javascript code, as the next screenshot proves:
Basically the "TYPE_" prefix convention is adding manual type safety to our non-type-safe XML/HTML+JS files.

"Magically" we are now half-way between the non-type-safe languages and the compile-checked type-safe ones. We don't yet have a compile-time check to warn us about code mistakes, but at least the editor now helps us with code completion (which is actually an indirect safety check).
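A poor man's version of that missing check can be scripted. The sketch below is only an illustration under assumptions: it applies a strict reading of convention 2 (the id must start with the upper-cased tag name plus "_", whereas the convention also allows a more descriptive type), it additionally reports refid values that point at no defined id, and the function name and the sample build file are made up.

```python
import xml.etree.ElementTree as ET


def check_build_file(xml_text: str):
    """Return (ids that break the TAGNAME_ prefix rule, refids with no matching id)."""
    root = ET.fromstring(xml_text)
    elements = list(root.iter())
    ids = {e.get("id") for e in elements if e.get("id")}
    bad_prefix = [(e.tag, e.get("id")) for e in elements
                  if e.get("id") and not e.get("id").startswith(e.tag.upper() + "_")]
    dangling = [e.get("refid") for e in elements
                if e.get("refid") and e.get("refid") not in ids]
    return bad_prefix, dangling


# Hypothetical build file with one badly named id and one broken reference.
SAMPLE_BUILD = """
<project name="demo">
  <path id="PATH_PROJECT"/>
  <patternset id="conf"/>
  <target name="copyconf">
    <patternset refid="PATTERNSET_CONF"/>
  </target>
</project>
"""

print(check_build_file(SAMPLE_BUILD))
# ([('patternset', 'conf')], ['PATTERNSET_CONF'])
```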

Saturday, August 6, 2011

This is not a joke, this is Java EE.

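Extracted from the book "Real World Java EE Patterns":
@Stateful // assumed: an EXTENDED persistence context requires a stateful session bean
@TransactionAttribute(TransactionAttributeType.NEVER)
public class BookFacadeBean implements BookFacade {
    @PersistenceContext(type = PersistenceContextType.EXTENDED)
    private EntityManager em;
    private Book currentBook;

    public Book find(long id) {
        this.currentBook = this.em.find(Book.class, id);
        return this.currentBook;
    }

    public void create(Book book) {
        this.em.persist(book); // no active transaction: the change waits in the extended context
        this.currentBook = book;
    }

    public Book getCurrentBook() { return currentBook; }

    @TransactionAttribute(TransactionAttributeType.REQUIRES_NEW)
    public void save() {
        // nothing to do here
    }
}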

The transaction setting of the BookFacade may appear a little esoteric at first glance.
On the class level the TransactionAttributeType.NEVER is used. Therefore it is guaranteed that no method of the bean will be invoked in an active transaction. The method save, however, overrides this rule with the setting TransactionAttributeType.REQUIRES_NEW at the method level, which in turn starts a new transaction every time. The method create in turn was not annotated with TransactionAttributeType.REQUIRES_NEW even though the EntityManager#persist method is invoked inside. This invocation will not fail with the TransactionRequiredException. It will not fail because on an EntityManager with an EXTENDED persistence context, the persist, remove, merge, and refresh methods may be called regardless of whether a transaction is active or not. The effects of these operations will be committed to the database when the extended persistence context is enlisted in a transaction and the transaction commits. It actually happens in the empty method save.

So, in order to avoid a few extra lines of intuitive, standard Java code in charge of transaction demarcation, the author has developed counter-intuitive code that only works when an empty function is called (and only if the methods are invoked in the correct order). What's worse, the author looks really excited about this way of working.

I'm completely sure no one except those in charge of developing Java EE containers will think calling an empty function with metadata is the proper way to work.

Fortunately for the rest of us, we have Spring.