Wednesday, August 24, 2011

How to improve disk I/O

Someone called "Unnikrishnan P." from India posted the following question in the LinkedIn Linux group:

how to improve disk I/O

I have a server with 128GB RAM and I cannot find the swap memory (120G) being used even when the disk utilization is at %util =95% (using iostat -d -x 5 3). What might be the root cause? What can be done to make the server perform much better?

As is usually the case, the thread of answers grew over time. Some of them were really interesting, so I decided to post a summary here, extracting the most interesting answers and leaving out repeated comments:

Phil Quiney:

The bottleneck is going to be disk I/O performance: Offloading disk operations onto dedicated hardware (RAID controllers - proper ones, not the ones fitted to motherboards) will speed things up. Sun were doing this years ago with SCSI drives as they had enough intelligence to do disk to disk transfers without going via the host, IDE was/is too dumb to do that.

Jim Parks:

With a minimum of four spindles, you could do a RAID 10 setup. RAID 10 is a nested (non-standard) RAID level, but it provides both redundancy and higher performance, as it stripes data across multiple mirrored pairs.

From around 12 spindles on up, you're probably better off with a RAID 6 (striping with double distributed parity) setup. There will almost certainly be people who want to argue with me about this. They have their ideas, I have mine, and I'm not going to engage in a drawn out discussion about it, but the short version is that the more spindles you have, the more RAID 6 will make sense. RAID 6 will allow you to get more usable space than RAID 10. For example, a RAID 10 array made up of 16 1 TB drives will provide 8 TB of usable space. The same 16 1 TB drives in a RAID 6 configuration would provide 14 TB of usable space -- that's 75% more usable disk space, with the ability to survive the loss of any two drives without data loss.
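
The capacity arithmetic in Jim's example can be checked with a few lines of shell. This is only a rough sketch: raid_usable is a hypothetical helper name, it assumes identical drives and ignores filesystem and controller overhead.

```shell
#!/bin/sh
# raid_usable LEVEL DRIVES SIZE_TB  (hypothetical helper, for illustration only)
# Prints the usable capacity in TB for an array of identical drives.
raid_usable() {
    level=$1 drives=$2 size_tb=$3
    case "$level" in
        10) echo $(( drives / 2 * size_tb )) ;;    # half of the drives hold mirror copies
        6)  echo $(( (drives - 2) * size_tb )) ;;  # two drives' worth of parity
        5)  echo $(( (drives - 1) * size_tb )) ;;  # one drive's worth of parity
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

r10=$(raid_usable 10 16 1)
r6=$(raid_usable 6 16 1)
echo "RAID 10: ${r10} TB, RAID 6: ${r6} TB"
echo "RAID 6 gain over RAID 10: $(( (r6 - r10) * 100 / r10 ))%"
```

With 16 drives of 1 TB this prints 8 TB, 14 TB and a 75% gain, matching the figures quoted above.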

Micheas Herman:

If /tmp is mounted on tmpfs then the large swap space may be reasonable, but otherwise you might try turning swap off temporarily (swapoff -a) to see whether that reduces the disk I/O.
If you do have millions of session cookies in /tmp and you are using tmpfs, moving the swap to a solid state drive might help a lot.

If you are trying to get the swap space to be used more aggressively, you can check your system's swappiness by running:

cat /proc/sys/vm/swappiness

This returns a number from 0 to 100.

If the number is less than 100 and you want to increase the swap usage as much as possible you can run the following:

echo 100 > /proc/sys/vm/swappiness

To make the change persistent across reboots run:

echo "vm/swappiness=100" >> /etc/sysctl.conf

If you want to reduce the swap usage to the absolute minimum, do the same using 0 instead of 100.
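
One caveat on the persistence step: running the echo ... >> /etc/sysctl.conf line more than once appends a duplicate entry each time. The sketch below is one possible way around that, not something from the thread. set_sysctl_conf is a hypothetical helper name, it uses the dotted spelling vm.swappiness (sysctl accepts either separator, but the helper compares keys literally, so stick to one spelling), and the demo runs against a scratch file rather than the real /etc/sysctl.conf.

```shell
#!/bin/sh
# set_sysctl_conf FILE KEY VALUE  (hypothetical helper, not a standard tool)
# Rewrites FILE so that it ends up with exactly one "KEY=VALUE" line for KEY.
set_sysctl_conf() {
    file=$1 key=$2 value=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Copy every line whose key (the text before '=', blanks stripped) differs from KEY.
    awk -F= -v k="$key" '{ name = $1; gsub(/[ \t]/, "", name); if (name != k) print }' "$file" > "$tmp"
    echo "${key}=${value}" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch file; only point it at /etc/sysctl.conf (as root) once you trust it.
conf=$(mktemp)
printf 'vm.swappiness=60\nkernel.sysrq=0\n' > "$conf"
set_sysctl_conf "$conf" vm.swappiness 100
set_sysctl_conf "$conf" vm.swappiness 100   # running it twice does not add a second line
cat "$conf"
```

As with the plain echo, the stored value is only applied at boot or when the file is reloaded with sysctl -p.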

If you have a database with a lot of commits, disk tuning may not be the best way to improve performance.

Removing unnecessary indexes can greatly reduce the amount of disk access. Using transactions to increase the quantity of data written at one time can also improve performance. It can also make it much worse.

If the server is a mail server there tends to be a lot that can be done but there tends to be very little general advice other than to make sure that the disk cache is not being flushed after every mail received.

David Pye:

I don't see any information regarding hardware configuration here, have I missed something???

Is the issue really disk utilization - have you investigated the system as a whole?

If disk I/O is suspect, then there are a number of things which could be addressed. Correct RAID config is one; the number of I/O channels is another: how many, fibre or copper, can you get FC direct to the disk? Is this the fastest possible disk hardware? A more extreme measure is to use only a small percentage of the outermost disk surface, via partitioning, in order to achieve lower head latency. Also examine the nature of the disk activity.

James Sutherland:

Also check the filesystem in use - given where this question is posted, probably something like ext3? If so, check the journalling mode: data=ordered, data=journal and data=writeback all perform differently in different situations.

John Lauro:

With 128GB of RAM you are probably better off seeing if you can move some of that I/O to a tmpfs RAM disk, often mounted under /dev/shm for RH/CentOS/SL.
If you don't need to record when a file is accessed (many people don't even know you can get that info), you can mount the filesystem with noatime. That can make a big difference on the I/O utilization especially for a web server that has lots of small files all over the place.

If some of your I/O is being generated by MySQL or another database, perhaps there are some temporary tables, such as those for session tracking, that you might want to consider converting to memory tables. The rows will be emptied if you have to restart the database, but generally that just means the user has to log in again.

Enrique Arizon Benito (aka, that's me!):
man hdparm

Also, maybe some app has "too much" log output enabled and it is just a matter of modifying the /etc/... config file. Inotify can be your friend in this case:


Or you can simply use something like:

find /var -mmin -1 -size +512M
(finds files under /var modified in the last minute that are bigger than 512 MB; find's -size option only takes whole numbers, so half a gigabyte is written as 512M rather than 0.5G)
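
The same idea can be wrapped in a small function so the directory, time window and size threshold are easy to vary. This is a sketch only: recent_big_files is a hypothetical name, and the demo uses a scratch directory with a freshly written 2 MiB file instead of scanning /var.

```shell
#!/bin/sh
# recent_big_files DIR MINUTES SIZE  (hypothetical helper name)
# Lists regular files under DIR modified less than MINUTES ago and larger than SIZE.
# SIZE uses find's integer syntax, for example 512M or 1024k.
recent_big_files() {
    find "$1" -type f -mmin "-$2" -size "+$3" 2>/dev/null
}

# Demo in a scratch directory: the 2 MiB file should be listed, the empty one should not.
demo=$(mktemp -d)
dd if=/dev/zero of="$demo/big.log" bs=1024 count=2048 2>/dev/null
: > "$demo/small.log"
found=$(recent_big_files "$demo" 1 1024k)
echo "$found"
```

On a real server the call would look like recent_big_files /var 1 512M, run as root so that unreadable directories are not silently skipped.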

Bond Masuda:

first of all, you need to figure out what's consuming the I/O. get a tool like iotop to help.

I usually use ballpark figures like assuming most modern spinning hard drives can do about 100MB/sec sequential I/O and about 40-60MB/sec random I/O for SATA type drives and add about 10-20% higher numbers for SAS type drives. SSDs of course can have much higher numbers. using those as "guidelines", try to figure out what disk configuration will meet your requirements and add about 20% for overhead. for example, if i have an 8-disk RAID-5 array, which gives me 7 effective spindles, I would assume this can roughly meet a sequential I/O requirement of about 583MB/sec (7x100MB/sec / 1.2 = 583).
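
Bond's rule of thumb is easy to turn into a one-line calculation. The sketch below assumes his ballpark figures and uses a hypothetical function name, estimate_seq_mbps; it is a planning aid, not a benchmark.

```shell
#!/bin/sh
# estimate_seq_mbps SPINDLES PER_DISK_MBPS OVERHEAD_PERCENT  (hypothetical helper)
# Sustainable requirement = spindles * per-disk rate / (1 + overhead), integer MB/sec.
estimate_seq_mbps() {
    echo $(( $1 * $2 * 100 / (100 + $3) ))
}

# 8-disk RAID-5 = 7 effective spindles, 100 MB/sec each, 20% overhead
estimate_seq_mbps 7 100 20
```

This prints 583, the same figure as in the example above; swapping in 50 MB/sec per disk gives a rough random I/O estimate for the same array.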

with multi-disk arrays, you'll want to consider the "best on average" stripe size for your array... if your I/O pattern is reading large files sequentially, then a larger stripe size is in your favor. if you're doing random I/O and reading small chunks here and there, then a moderate to smaller stripe size is a better choice.

also, in situations where you are looking at really large arrays that might span multiple controllers, you'll need to consider the bottlenecks of the bus. older PCI-X buses are half-duplex and limited to about 1GB/sec. more modern servers (like yours with 128GB of RAM) will likely use PCI-E 4x or 8x buses which can do full-duplex and 2GB/sec (8x).

Rob Strong:

not attempting to read all these comments, but take a look at the "%iowait" field in the output of iostat. If this number starts getting up, say past... 20% or more, then you possibly have an issue with slow disks and/or a bad RAID setup.

Oleksiy Dovzhanytsya:


Unnikrishnan P. (original author of the question) added some (important) details about his setup:

This is an NFS server which is used by other servers which run tools used by the developers. I have now moved some users to a dedicated server and found that performance on this server has increased. Previously there were 4 servers accessing this NFS share, with a lot of reads and writes happening. I was using RAID 5 (LSI controller) on this server. Maybe the throughput of RAID 5 is the problem, as it won't support heavy read/write operations.

James Sutherland:

That information helps - yes, RAID5 can be a bottleneck with heavy write activity, particularly scattered writes, since it has to read the corresponding blocks from the other disks to recalculate parity every time.
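
To see why a small write costs extra I/O on RAID 5, here is a toy sketch of the parity arithmetic. It assumes the common read-modify-write approach (read old data and old parity, then write new data and new parity) and shrinks each "block" to a single byte; real controllers work on whole stripes and may pick a different strategy.

```shell
#!/bin/sh
# Toy RAID 5 stripe: three data "blocks" of one byte each (0xA5, 0x3C, 0x0F).
d1=165 d2=60 d3=15

# Parity is the XOR of all data blocks in the stripe.
parity=$(( d1 ^ d2 ^ d3 ))

# Small write: only d2 changes. The controller first reads the old d2 and the
# old parity, then writes the new d2 and the new parity (two reads, two writes).
new_d2=255
new_parity=$(( parity ^ d2 ^ new_d2 ))

# Cross-check against recomputing parity from the whole stripe.
full=$(( d1 ^ new_d2 ^ d3 ))
echo "old parity=$parity new parity=$new_parity full recompute=$full"

# The same XOR lets a lost block be rebuilt from the survivors.
rebuilt_d1=$(( new_parity ^ new_d2 ^ d3 ))
echo "rebuilt d1=$rebuilt_d1"
```

Every scattered write therefore turns into several disk operations, which is the bottleneck James describes.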

What filesystem is it running, ext3? You may find changing the data journalling, or switching to a separate journal device, will help here: mount with data=writeback (risky, if you don't have a very good UPS!) or data=journal. Also see if your LSI card is doing write back caching or not, that can help significantly.

John Lauro:

What IO scheduler are you running on your disks? If you have at least 128MB of cache on your controller, make sure you don't use cfq (often the default) as the IO scheduler, as it is the worst choice with a RAID controller...

for a in /sys/block/*/queue/scheduler; do echo "$a: $(cat "$a")"; done
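
Each scheduler file lists the available schedulers with the active one in square brackets, for example "noop deadline [cfq]". A small sketch that prints only the active one per device follows; active_scheduler is a hypothetical helper name and the sample line is made up for the demo.

```shell
#!/bin/sh
# active_scheduler LINE  (hypothetical helper)
# Prints the name enclosed in square brackets, or nothing if there is none.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Demo on a sample line
active_scheduler "noop deadline [cfq]"

# Report the active scheduler for every block device that exposes one
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    echo "$f: $(active_scheduler "$(cat "$f")")"
done
```

Switching schedulers is done by writing one of the listed names into the same file as root; that step is left out here because the right choice depends on the controller.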

Unnikrishnan P, adding some more info:

Jeremy Page:

I'm reasonably certain relatime is now the default for most mount commands.

Run cat /proc/mounts and see what options your clients are using. I'm using squeeze, so probably not a good test, but (with my personal details edited out):

/home/pagej nfs rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr= 0 0

John Lauro:

relatime is standard on EL6 and nearly as good as noatime for reducing I/O, but it is not the default for EL5, which is still common enough that I wouldn't assume someone isn't running it.

Wayne Monfries:

As mentioned above by Bond Masuda, you may obtain some easy gains by increasing the read ahead buffer size (probably most helpful for sequential loads)
blockdev --setra 2048 /dev/sda # increase readahead buffer size
Another area that is often overlooked with external raid arrays is that they are capable of processing much greater numbers of IO transactions than a single disk - so check out the qdepth on your adapters and devices.
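
A note on units for the readahead command: blockdev --setra counts in 512-byte sectors, so 2048 means 1 MiB of readahead. The conversion is sketched below with a hypothetical helper name, ra_sectors_to_kib; the current value of a device can be read with blockdev --getra.

```shell
#!/bin/sh
# ra_sectors_to_kib SECTORS  (hypothetical helper)
# Converts a readahead value in 512-byte sectors to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kib 2048   # the value used in the blockdev example
ra_sectors_to_kib 256    # a commonly seen default
```

These print 1024 and 128 respectively, so the example raises readahead from 128 KiB to 1 MiB on a system that starts from that default.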
