Re-comparing file systems

The previous attempt at comparing file systems based on the ability to allocate large files and zero them met with some interesting feedback. I was asked why I didn’t add reiserfs to the tests and also if I could test with larger files.

The test itself had a few problems, making the results unfair:

– I used a different partition for each file system, so the hard drive geometry and seek times would play a part in the results

– One can never be sure that the data requested to be written actually reached the disk unless the partition is unmounted

– Data already in the cache before the test started could still be getting written out to disk, which could also interfere with the results

All these have been addressed in the newer results.

There are a few more goodies too:
– A gnuplot script to ease charting the data
– A script to automate testing on various file systems
– A fix for a big bug that affected the results for the chunk-writing cases (4k and 8k): it had existed right from when I first wrote the test and was the result of using the wrong parameter to calculate the chunk size. This was spotted by Mike Galbraith on lkml.

Browse the sources here, or git-clone them:

git clone

So in addition to ext3, ext4, xfs and btrfs, I’ve added ext2 and reiserfs, and expanded the ext3 test to cover three journalling modes: ordered, writeback and guarded. guarded is the new mode being proposed (it’s not yet in the Linux kernel); it aims to provide the speed of writeback with the consistency of ordered.

I’ve also run these tests twice: once with a user logged in and a full desktop session running, to measure the times a user will actually see when working on the system and some app tries allocating files.

The second run was in single-user mode, with no background services running, so the effect of other processes on the timings is removed. The fragmentation will of course remain more or less the same either way; that’s not a property of system load.

It’s also important to note that I created this test suite mainly to find out how fragmented files get when allocated using different methods on different file systems. The performance comparison is a side-effect. This test is also not useful for any kind of file system stress-testing; there are other suites that do a good job of that.

That said, the results suggest that btrfs, xfs and ext4 are the best at keeping fragmentation to a minimum; reiserfs looks really bad in these tests. Time-wise, the file systems that support the fallocate() syscall perform the best, using almost no time to allocate files of any size. ext4, xfs and btrfs support this syscall.

On to the tests. I created a 4GiB file for each test. The tests are: posix_fallocate(), mmap() + memset(), writing 4KiB-sized chunks and writing 8KiB-sized chunks. The tests are all run inside the same 20GiB partition; the script reformats the partition with the appropriate file system before each run.
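
For reference, here is a minimal sketch of what the four allocation methods boil down to (illustrative only, not the actual test program; error handling is trimmed and a 64-bit off_t is assumed):

    /* Sketch of the four allocation methods; not the real test program. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SIZE (4ULL << 30)                /* 4 GiB test file */

    /* 1. posix_fallocate(): kernel-assisted allocation where available */
    static int alloc_fallocate(int fd) {
        return posix_fallocate(fd, 0, SIZE);
    }

    /* 2. mmap() + memset(): map the file and touch every page */
    static int alloc_mmap(int fd) {
        if (ftruncate(fd, SIZE) < 0)
            return -1;
        void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return -1;
        memset(p, 0, SIZE);
        return munmap(p, SIZE);
    }

    /* 3 and 4. write zero-filled chunks; chunk is 4096 or 8192 bytes */
    static int alloc_chunks(int fd, size_t chunk) {
        static char buf[8192];               /* zero-initialised by C */
        for (unsigned long long done = 0; done < SIZE; done += chunk)
            if (write(fd, buf, chunk) != (ssize_t)chunk)
                return -1;
        return 0;
    }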

The results:

The first 4 columns show the times (in seconds) and the last four columns show the fragments resulting from the corresponding test.

The results, in text form, are:

# 4GiB file
# Desktop on
# first four data columns: time in seconds; last four: number of fragments
filesystem       posix-fallocate  mmap  chunk-4096  chunk-8192   posix-fallocate  mmap  chunk-4096  chunk-8192
ext2                          73    96          77          80                34    39          39          36
ext3-writeback                89   104          89          93                34    36          37          37
ext3-ordered                  87    98          89          92                34    35          37          36
ext3-guarded                  89   102          90          93                34    35          36          36
ext4                           0    84          74          79                 1    10           9           7
xfs                            0    81          75          81                 1     2           2           2
reiserfs                      85    86          89          93               938    35         953         956
btrfs                          0    85          79          82                 1     1           1           1

# 4GiB file
# Single-user mode
# first four data columns: time in seconds; last four: number of fragments
filesystem       posix-fallocate  mmap  chunk-4096  chunk-8192   posix-fallocate  mmap  chunk-4096  chunk-8192
ext2                          71    85          73          77                33    37          35          36
ext3-writeback                84    91          86          90                34    35          37          36
ext3-ordered                  85    85          87          91                34    34          37          36
ext3-guarded                  84    85          86          90                34    34          38          37
ext4                           0    74          72          76                 1    10           9           7
xfs                            0    72          73          77                 1     2           2           2
reiserfs                      83    75          86          91               938    35         953         956
btrfs                          0    74          76          80                 1     1           1           1


Fig. 1: number of fragments. reiserfs performs really badly here.
Fig. 2: the same results, without reiserfs.
Fig. 3: time results, with the desktop on.

Fig. 4: time results, without the desktop, in single-user mode.

So in conclusion, as noted above, btrfs, xfs and ext4 are the best at keeping fragmentation to a minimum, and reiserfs looks really bad in these tests. Time-wise, the file systems that support the fallocate() syscall perform the best, using almost no time to allocate files of any size. ext4, xfs and btrfs support this syscall.

The fallocate() Story Continues

Having apps use the fallocate() syscall, instead of writing zeroes, is the preferred way to initialise a file with all zeroes. I was pleasantly surprised to find that ktorrent already does this (though via a non-default config option).

I would like them to make posix_fallocate() the default, where available on the target system. posix_fallocate() already uses fallocate() if the filesystem supports it, and otherwise falls back to writing zeroes block by block. My last post compared various file allocation methods, the performance of the file systems and the fragmentation each method causes.
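
As a rough sketch, this is about all an application with a known final file size needs to do (the function name and error handling here are mine, for illustration):

    /* Sketch: preallocating a file of known final size, as a torrent
       client or download manager could. Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int preallocate(const char *path, off_t size) {
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;
        /* Uses fallocate() on supporting kernels/filesystems and falls
           back to zero-filling otherwise; returns an errno value. */
        int err = posix_fallocate(fd, 0, size);
        if (err)
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return err ? -1 : 0;
    }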

Reading that post again, it looks like it could’ve been written much better and could’ve used a couple of editing rounds. So I’ve decided to do a second post with better results and more file systems added to the fray. I’ve updated the test to calculate the numbers more reliably and have rerun everything with more file systems, taking factors like hard disk geometry and seek times out of the equation. The git tree is already updated with the new code, so you can try it out yourself. In any case, stay tuned for the results.

Comparison of File Systems And Speeding Up Applications

Update: I’ve done a newer article on this subject that removes some of the deficiencies in the tests mentioned here and has newer, more accurate results along with some new file systems.

How should one allocate disk space for a file that will be written to later? ftruncate() (or lseek() followed by a write()) creates a sparse file, which is not what is needed. The traditional way is to write zeroes to the file until it reaches the desired size. Doing things this way has a few drawbacks (see the sketch after this list):

  • Slow, as small chunks are written one at a time by the write() syscall
  • Lots of fragmentation
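
To make the contrast concrete, here’s a minimal sketch (file names and sizes are placeholders): ftruncate() merely updates the file size and leaves a hole, while the traditional approach issues hundreds of thousands of write() calls.

    /* Sketch: ftruncate() produces a sparse file; the traditional
       approach zero-fills with many small writes. Illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        /* Sparse: 1 GiB apparent size, (almost) no blocks allocated */
        int fd = open("sparse.img", O_CREAT | O_RDWR, 0644);
        ftruncate(fd, 1 << 30);
        struct stat st;
        fstat(fd, &st);
        printf("size=%lld blocks=%lld\n",
               (long long)st.st_size, (long long)st.st_blocks);
        close(fd);

        /* Traditional: ~262144 write() calls to fill 1 GiB with zeroes */
        int fd2 = open("full.img", O_CREAT | O_WRONLY, 0644);
        static char zeroes[4096];            /* zero-initialised by C */
        for (long long done = 0; done < (1LL << 30); done += sizeof zeroes)
            write(fd2, zeroes, sizeof zeroes);
        close(fd2);
        return 0;
    }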

posix_fallocate() is a library call that handles all this chunked writing in one call; the application need not code its own block-by-block writes. But this still happens in userspace.

Linux 2.6.23 introduced the fallocate() system call. The allocation is then done in kernel space and hence is faster. New file systems that support extents make this call very fast indeed: a single extent is simply marked as allocated on disk, where traditionally each block had to be marked as ‘used’. Fragmentation too is reduced, as file systems now keep track of extents instead of smaller blocks.

posix_fallocate() will internally use fallocate() if the syscall exists in the running kernel.
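
On Linux, glibc also exposes the raw syscall directly, so the kernel-level allocation can be requested explicitly; a brief sketch (mode 0 allocates the blocks and extends the file size, which is effectively what posix_fallocate() requests):

    /* Sketch: calling the Linux-specific fallocate(2) directly.
       Returns 0 on success, -1 with errno set on failure (e.g.
       EOPNOTSUPP on filesystems without fallocate support). */
    #define _GNU_SOURCE
    #include <fcntl.h>

    int kernel_preallocate(int fd, off_t size) {
        return fallocate(fd, 0, 0, size);
    }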

So I thought it would be a good idea to make libvirt use posix_fallocate(), so that systems with the newer file systems benefit directly when allocating disk space for virtual machines. I wasn’t sure what method libvirt already used to allocate the space; it turned out to allocate blocks in 4KiB-sized chunks.

So I sent a patch to the libvir-list to convert to posix_fallocate(), and danpb asked me what the benefits of this approach were, and about alternative approaches to writing in 4K chunks. I didn’t have any data to back up my claims of “this approach will be fast and will result in less fragmentation, which is desirable”, so I set out to do some benchmarking. To do that, though, I first had to free up some disk space to create a few file systems of sufficiently large sizes. Hunting for a test machine with spare disk space proved futile, so I resized my ext3 partition to create about 15 GB of free disk space. I intended to test ext3, ext4, xfs and btrfs. I could have used my existing ext3 partition for the testing, but that would not give honest results about fragmentation: an existing file system may already be fragmented, so big new files would almost surely end up fragmented, whereas on a fresh fs I don’t run that risk.

Even creating separate partitions on rotating storage and testing file system performance won’t give perfectly honest results, but I figured that if the percentage difference in the results was quite high, it wouldn’t matter. I grabbed the latest Linus tree and the latest development trees of the userspace utilities for all the file systems, and created a roughly 5GB partition for each fs.

I then wrote a program that created a file, allocated disk space for it, closed it, and calculated the time taken to do so. This was done multiple times for different allocation methods: posix_fallocate(), mmap() + memset(), and writing zeroes in 4096-byte and 8192-byte chunks.
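
The timing itself is straightforward; roughly something like this sketch (not the actual program, which is linked below):

    /* Sketch: timing one allocation method with a monotonic clock. */
    #include <time.h>

    double time_alloc(int (*method)(int fd), int fd) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        method(fd);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }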

So I had four methods of allocating files and a 5G partition per file system, and I decided to check performance by creating a 1GiB file with each allocation method.

The program is here. The results, here. The git tree is here.

I was quite surprised to see poor performance for posix_fallocate() on ext4. On digging a bit, I realised mkfs.ext4 hadn’t created the file system with extents enabled. I reformatted the partition, but that data was valuable to have as well: it shows how much better a file system fares with extents support.

Graphically, it looks like this:
Notice that ext4, xfs and btrfs take only a few microseconds to complete posix_fallocate().

The number of fragments created:

btrfs doesn’t yet have the ioctl implemented for calculating fragments.
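
For reference, fragment counts like these are usually obtained the way filefrag obtains them, via the FIEMAP ioctl; here is a sketch under that assumption (I’m not showing the exact ioctl my test used):

    /* Sketch: counting a file's extents (fragments) via FIEMAP, the
       mechanism filefrag uses. Requesting 0 extents makes the kernel
       just report the total count in fm_mapped_extents. */
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    int count_fragments(int fd) {
        struct fiemap fm = {
            .fm_start        = 0,
            .fm_length       = ~0ULL,     /* map the whole file */
            .fm_flags        = FIEMAP_FLAG_SYNC,
            .fm_extent_count = 0,         /* count only, no extent data */
        };
        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
            return -1;                    /* fs may not support FIEMAP */
        return (int)fm.fm_mapped_extents;
    }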

The results were very impressive, and the final patches to libvirt were finalised pretty quickly. They’re now in libvirt’s development branch. Coming soon to a virtual machine management application near you.

Use of posix_fallocate() will benefit programs that know in advance the size of the file being created: torrent clients, ftp clients, browsers, download managers, and so on. The benefit isn’t speed, since the data is only written as it is downloaded, but keeping fragmentation as low as possible.

Making Suspend Safer for File Systems

I saw the file system freezing patches that got merged into Linus’ tree yesterday and instantly thought they could be used to freeze file systems before going into a suspended state. I recently met Christoph Hellwig, and one of the things we discussed was how he would never trust any file system to be in a consistent state after attempting suspend-to-disk.

The freezing patches are aimed at snapshotting for now; extending the suspend routines to make use of them is something I still have to look at. Working with file systems isn’t entirely new to me (I worked on something called the Mosix File System earlier), but it’s been a really long time. It’ll be quite interesting to work on this.

I had a brief chat with hch about this idea. While he says it still won’t convince him to suspend to disk, it could be a good thing for suspend-to-RAM: when a laptop runs out of power, the fs could still be left in a good state. I agree, though I’d like to use s-t-d with this!

I’ve had many ideas slip by without blogging about them, only to see them implemented by others later. In this case, even if I don’t end up implementing something, I’d at least have the satisfaction of having penned it down first.