Reviews of distributed filesystems

I have a lot of data to work with, and I want to do it with just my mismatched bunch of servers, desktops, SSDs, and spinning disks. My equipment is old, so I want a filesystem that is robust not only to the failure of any drive, but also to the failure of any one machine. My preference is to build a hyper-converged system, where each machine hosts data in addition to working on compute jobs. Following are reviews I found on the main open-source distributed filesystems out there:

Gluster

Avoid GlusterFS.

Jonathan Dieter

Not Gluster: it requires low latency and is difficult to autoscale, so true high availability is difficult to achieve.

hi11111

As someone who watched a product launch fail, probably 90% because of Gluster’s inability to scale in size without massive hardware upgrades: stay away.

netburnr2

GlusterFS is latency dependent. Since self-heal checks are done when establishing the FD and the client connects to all the servers in the volume simultaneously, high-latency (multi-zone) replication is not normally advisable. Each lookup will query both sides of the replica. A simple directory listing of 100 files across a 200ms connection would require 400 round trips, totaling 80 seconds. A single Drupal page could take around 20 minutes.

Joe Julian
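(To sanity-check the arithmetic in Joe Julian’s comment, here is a quick back-of-the-envelope sketch of my own; the four round trips per file are simply implied by his 400-round-trips-for-100-files figure.)

```python
# Back-of-the-envelope check of the GlusterFS latency claim above.
# Assumption (from the quote): a 100-file directory listing costs
# about 400 sequential round trips, i.e. ~4 per file, because each
# lookup queries both sides of the replica.
rtt_seconds = 0.200                  # 200 ms cross-zone round-trip time
files = 100
round_trips_per_file = 4             # implied by 400 trips / 100 files
total_trips = files * round_trips_per_file
listing_seconds = total_trips * rtt_seconds
print(f"{total_trips} round trips -> {listing_seconds:.0f} s")   # 400 -> 80 s
```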

I only have experience running Lustre and Gluster. Gluster was a mess, to be honest, though that was several years ago. Lustre (and we run Intel Enterprise Lustre) has been pretty solid. Most HPC outfits run Lustre over ZFS, actually, so you get the benefits of both.

Ceph and Gluster can’t achieve the kind of performance required for HPC scratch. Hadoop can, but if you are managing a general-purpose HPC cluster, nearly all software out there is expecting to write to a POSIX filesystem.

zhynn

People may use Gluster for HPC; I don’t know. But it really isn’t suited.

I run two Gluster clusters for more boring enterprisey reliability.

It is still kind of a mess – there are rough edges all over. Log messages make zero sense by themselves – you either need to learn what they’re indicative of or read the code. And frequently, a typoed configuration command (configuration is mostly driven by imperative command-line tools, not configuration files) will fail yet still make changes, which you then have to unwind before you can try again, etc.

All that said, Gluster is pretty solid for our use – distributed, fault-tolerant POSIX storage. But I wouldn’t use it in an HPC environment.


__jal

Currently I would give MooseFS, OrangeFS, or LizardFS a try.

I have experience with Ceph and GlusterFS, and I can’t recommend either.

poelzi

I tried GlusterFS a few years ago, and it took a good while to figure out the right setup, but in the end it had disappointing small-file performance.


eis

I also tried Ceph and Gluster before settling on MooseFS a couple of years ago. Gluster was slow for filesystem operations on a lot of files, and it would get into a state where some files weren’t replicated properly, with seemingly no problems with the network or the physical servers.

mattbillenstein

Actually, GlusterFS is close to my needs; however, the “easy to install” story and the actually bad performance on many small files are a roadblock for us. That said, it runs well, even on just 3 servers.

merb

GlusterFS seems good on paper, but it is really inherently unstable and isn’t designed for what you are intending to do. Also, high availability has too many moving parts to really call it highly available.

joe8mofo

Can’t use random drives. That’s why I didn’t run GlusterFS originally. You need to add drives in bunches, and that’s just not going to be practical. I’ve got literally dozens of drives ranging from 450GB to 8TB, with every size in between.

wintersdark

Lustre

Lustre began at CMU and appears to have been owned in some sense by Cluster File Systems, Sun Microsystems, Oracle, Xyratex, Seagate, Whamcloud, Intel, and DataDirect Networks. Lustre is complex. I decided it wasn’t for me after watching some Lustre 101 course videos.

While companies like DDN can make Lustre easier to install and maintain, it is by its very nature a difficult beast to tame.

Timothy Prickett Morgan

Lustre is a purebred racehorse. It is not something the general masses should ever take a ride on, or even get up close and personal with. It will bite you; kick you; piss on you; crap on you; stomp on you; and generally try to kill you. It is cranky. It is not user friendly. It doesn’t play nice with others.

But if you do know how to ride it, you have a track to ride it on, and it decides to let you ride, you are in for a real treat. It is a rush that can’t be beat, even though after that ride your body will ache from the beating it’s been given.

In the right workloads Lustre can’t be beat. I’ve seen it completely transform large HPC clusters from dogs to Kentucky Derby winners. But as mentioned by many, it is a complex beast that needs the right people, right hardware, and right workloads to make it run.

We used it for many years for large geospatial big-data workloads: huge images to many thousands of HPC nodes. It worked great. But it was highly sensitive to the technical staff running it, and ultimately it became too “cranky” to use in production mode without Gandalf and his wizard army to keep it from eating itself.

So we now run on NFS over IB and happily give up some performance for 24*365 uptime.

sanguy

In my experience, Lustre on a typical medium-size (several hundred node) cluster has been more reliable than NFS (at least once the cluster vendor was out of the way), and you can actually use it for MPI-IO, which is crucial. It’s been quite straightforward to operate, at least with a simple LNET setup. I don’t understand this stuff about racehorses. NFS has been easily flattened in various circumstances, too.

gnufx

Actually Lustre gives good “out of the box” performance… I’ve used a few HPC clusters and have almost never felt the need to issue commands to configure striping, etc.

frozenport

Lustre is for large HPC clusters. It’s not for your desktop; it’s not for your video editing suite. It is the only way to provide a filesystem that scales to support, for instance, a compute job of 100k ranks, all writing checkpoints and snapshots periodically. No, Hadoop doesn’t do it. Ceph is for performance-and-scaling-insensitive cloud installs.

Lustre is for when you expect to saturate 100 100Gb IB links to storage. It works remarkably well for its use case (though even on HPC clusters, MDS performance can be a problem).

markhahn
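(To put markhahn’s “100 100Gb IB links” figure in perspective, a rough conversion of my own, using decimal units and ignoring protocol overhead.)

```python
# Rough aggregate bandwidth implied by saturating 100 x 100Gb IB links
# (decimal units; real InfiniBand goodput would be somewhat lower).
links = 100
link_gbps = 100                        # gigabits per second per link
aggregate_gbps = links * link_gbps     # 10,000 Gb/s
aggregate_gb_per_s = aggregate_gbps / 8
print(f"~{aggregate_gb_per_s:.0f} GB/s aggregate")   # ~1250 GB/s, i.e. ~1.25 TB/s
```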

Well, NFS isn’t really cache coherent, and hence not POSIX compliant. Lustre is, but pays for it with an amazingly complicated architecture.

jabl

The only serious open-source competitors to Lustre are Ceph and GlusterFS. But Ceph is too unstable, and GlusterFS 3.0 is based on distributed hash tables and so is not strongly consistent.

reality_czech

We used Lustre, pNFS and GPFS at Stanford on HPC gear (DDN, Panasas and some enterprise COTS). Lustre has a lot of moving parts and config. Most folks tend to use Puppet/Chef and/or the Rocks distro to deploy clusters in a somewhat sane manner. (Sometimes AFS too, but not much.)

These days, Ceph/Gluster might work, but Lustre’s proven.

anon263626

Lustre is a monstrosity – badly designed, poorly implemented, very hard to configure and keep running, or even to get adequate performance from, outside of a single limited use case.

Good riddance to bad rubbish.

pinewurst

Lustre was definitely a beast (in a good sense), but we’d occasionally get bit by a workload high in metadata operations – which would bring Lustre to its knees due to latency.

lo_stronzo

Right, so they took a vertical that no one else wanted and served it kind of badly. I don’t know about the last few years, but Lustre used to be famous for eating up a huge amount of time in the field, even given that it’s only supposed to be relevant for really large files.


convolvatron

BeeGFS/FhGFS

BeeGFS can use InfiniBand. It doesn’t support erasure coding. It uses a “buddy mirroring” system for redundancy. It can have multiple metadata servers, which removes that as a bottleneck.
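(A toy sketch of my own to illustrate the buddy-mirroring idea in the abstract; this is not BeeGFS code or its internal data layout, just the general scheme of pairing storage targets into buddy groups and writing each chunk to both members, so the loss of either single target in a group leaves the data available.)

```python
# Toy illustration of buddy-group mirroring (not BeeGFS internals):
# storage targets are paired into buddy groups, and each chunk is
# stored on both members of its group.
targets = ["target1", "target2", "target3", "target4", "target5", "target6"]
buddy_groups = [tuple(targets[i:i + 2]) for i in range(0, len(targets), 2)]

def place_chunk(chunk_id: int) -> tuple:
    """Pick a buddy group for a chunk; the chunk lives on both members."""
    return buddy_groups[chunk_id % len(buddy_groups)]

for chunk in range(4):
    primary, mirror = place_chunk(chunk)
    print(f"chunk {chunk}: primary={primary}, mirror={mirror}")
```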

Actually, Lustre isn’t the only game in town for this. BeeGFS (http://beegfs.com) does a very good job at this as well, has better small-to-large scaling, error messages understandable by mere mortals, and doesn’t require a specific (ancient) kernel or distro for the server…

hpcjoe

We used BeeGFS, since back when it was still called FhGFS, on a cluster of 32 nodes with a total capacity ranging from 256TB to 1PB. BeeGFS is very easy to configure and maintain, outperforming Lustre on every task we threw at it. In fact, we never transitioned Lustre into production. Lustre was often slower than plain NFS for the cases where all storage could be mounted by a single node(!).

usernam

BeeGFS is great. I originally evaluated it when it was FhGFS and am still involved with a large production BeeGFS cluster deployment.

Any knucklehead can install it and it works transparently with excellent performance. It’s currently not enterprise-grade fault tolerant – that’s the worst thing I can say about it.

pinewurst

BeeGFS v6 has metadata mirroring via buddy groups, and data mirroring the same way. A few things are still not there, but it is rapidly getting closer.

Being open source helps as well.

Definitely recommend it, and have deployed it and supported it many places in my previous life

hpcjoe

MooseFS and BeeGFS don’t really seem to be in the same category of filesystem though.

BeeGFS comes from the HPC world where it is all about performance, while MooseFS seems more focused on high reliability even in the face of entire machines coming and going.

BeeGFS stores files striped over multiple machines and can use InfiniBand natively, which gives us a system where bandwidth to individual files can reach a bit over 1GB/s (best case) and aggregate bandwidth can reach 30GB/s from a very mixed and unoptimized bioinformatics workload, and a lot more when just testing the raw bandwidth.

Since BeeGFS uses an underlying filesystem on each storage target, you can of course run RAID, ZFS or whatever it takes to make you comfortable that the individual storage targets aren’t going to be lost – which is what it takes for data to become unavailable.

If you want some extra reliability in BeeGFS, it also supports mirroring, so you only lose data if you fully lose two storage targets. We can’t really afford to run with full mirroring for our 3PB though.

We are very happy with it for our HPC environment but I’m not sure how well it works in an AWS setup.


birc_a

The company I work for has been running BeeGFS for 6 years with no issues. Write and read speeds on Dell storage hardware match or exceed the NetApp appliances we use.

We do occasionally have issues where ALL storage nodes must be rebooted, but other than that it works great.

shiftpgdn

I will say that I know two people that have independently tried fairly large trials (400TB+) of Lustre, GlusterFS, and BeeGFS, and BeeGFS was their eventual favorite.

craigyk

Parallel File Systems with exceptional metadata handling such as BeeGFS are perhaps the best option for TensorFlow on HPC.

Exxact Corporation

If good I/O performance is also important, then definitely also take a look at BeeGFS.

netzvolk

OrangeFS

I also looked at OrangeFS (originally PVFS2), but it doesn’t seem to provide replication, and, in preliminary testing, it was over ten times slower than the alternatives.

Jonathan Dieter

LizardFS/MooseFS

MooseFS development seemed to stagnate some years ago. LizardFS was a fork that got most of the new development. LizardFS has been the free version with all the features, and MooseFS has only offered those features (like erasure coding and hot backup metadata servers) to those who get the professional license. However, in recent years, MooseFS seems to be the one with all the updates and bugfixes.

In both of these, all the metadata is stored in RAM.
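(A back-of-the-envelope consequence of the metadata-in-RAM design, my own sketch: the roughly 300 bytes of RAM per filesystem object is an assumption in the ballpark of figures the MooseFS documentation has suggested, so treat the numbers as approximate.)

```python
# Rough RAM sizing for an in-memory metadata master.
# ASSUMPTION: ~300 bytes of master RAM per filesystem object
# (file/directory), roughly in line with figures the MooseFS docs
# have quoted; adjust to your own measurements.
bytes_per_object = 300
objects = 100_000_000            # e.g. 100 million files and directories
ram_gib = objects * bytes_per_object / 2**30
print(f"~{ram_gib:.0f} GiB of master RAM for {objects:,} objects")   # ~28 GiB
```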

So far I have heard a lot about GlusterFS and Ceph being great, but so far I have only seen MooseFS in production.

MooseFS seems to do its job well; the only problem is that by default it lacks HA. The master server is very important, since it keeps track of where all data is located, so if it is gone your data is gone. You can set up one or more metalogger nodes, whose sole purpose is to back up the metadata information. You can also manually promote a metalogger to master.

They have a commercial offering that includes HA, but I never used it, so it’s hard to comment on it.

You could also set up HA yourself using Corosync and Pacemaker, but that’s a bit of a challenging task.

takeda
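(A purely conceptual sketch of my own of what a metalogger buys you; this is not MooseFS’s actual tooling or file formats, just the idea that the master appends every metadata change to a changelog, a metalogger keeps a copy, and promotion amounts to replaying that log to rebuild the metadata map.)

```python
# Conceptual sketch of master/metalogger metadata replication
# (illustrative only, not MooseFS's actual tooling or formats).
changelog = []                   # master's append-only metadata change log

def master_apply(op: str, path: str, chunk_locations: list) -> None:
    """The master records every metadata change in the changelog."""
    changelog.append((op, path, chunk_locations))

def metalogger_replay(log) -> dict:
    """A metalogger (or new master) rebuilds the path -> chunk-location map."""
    metadata = {}
    for op, path, locations in log:
        if op == "create":
            metadata[path] = locations
        elif op == "delete":
            metadata.pop(path, None)
    return metadata

master_apply("create", "/data/a.bin", ["chunkserver1", "chunkserver2"])
master_apply("create", "/data/b.bin", ["chunkserver2", "chunkserver3"])
master_apply("delete", "/data/a.bin", [])
print(metalogger_replay(changelog))   # {'/data/b.bin': ['chunkserver2', 'chunkserver3']}
```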

MooseFS is pragmatic – written in pretty tight C, performant, and the web UI for seeing the state of the cluster and how files are replicating is very nice. We have a small cluster with maybe a dozen nodes running for almost 2 years now with no hiccups…

mattbillenstein

So, six months on, LizardFS has served us well, and will hopefully continue to serve us for the next (few? many?) years.

Jonathan Dieter

MooseFS seems miles ahead of LizardFS today, waaay faster read/write (same hardware as LizardFS), and way better speeds for reading small files, which finally makes it possible to use as home folders in our studio.

hradec

I used MooseFS (the free version of which is now called LizardFS) in a POC with about 50TB, and it performed VERY well and was simple to maintain.

The POC compared it to GlusterFS and CephFS. MooseFS won the POC but wasn’t selected because of vendor support. For a small, non-production workload, LizardFS would be my choice.

r3dk0w

LizardFS has got decent OSX support; we’re using it right now. Installation is not THAT easy (I used this: https://github.com/lizardfs/lizardfs/blob/master/create-osx-package.sh). There should probably be some guide or something on their site. Anyway, it just rocks on OSX, Ubuntu, and some Windows machines we have (via NFS; the native client is closed source). Constantly expanding, nice tiering (hot data on SSD, colder on HDD). It is about twice as fast as Gluster in our tests.

affidavitdonda

After working with Ceph for 11 months, I came to the conclusion that it utterly sucks, so I suggest avoiding it. I tried XtreemFS, RozoFS and QuantcastFS but found them not good enough either.

I wholeheartedly recommend LizardFS which is a fork of now proprietary MooseFS. LizardFS features data integrity, monitoring and superior performance with very few dependencies.

2019 update: the situation has changed, and LizardFS is not actively maintained any more.

MooseFS is stronger than ever and free from most LizardFS bugs. MooseFS is well maintained and it is faster than LizardFS.

Onlyjob

Ceph/CephFS

After CephFS’s amazing performance in the single-client mode, I was looking forward to some incredible results, but it really didn’t scale as well as I had hoped, though it was still competitive with the other distributed filesystems. Once again, LizardFS has shown that when it comes to metadata operations, it’s really hard to beat, but its aggregate read and write performance were disappointing. And, once again, GlusterFS really struggled with the test. I wish it had worked with the small-file performance tuning enabled, as we might have seen better results.

Jonathan Dieter

Take, for example, my favorite storage system, Ceph. As I understand it, it was originally going to be CephFS, with multiple metadata servers and lots of distributed POSIX goodness. However, in the 10+ years it’s been in development, the parts that have gotten tons of traction and widespread use are seemingly one-off side projects on top of the underlying storage system: object storage and the RBD block device interfaces. Only in the past 12 months has CephFS become production ready, and only with a single metadata server; multiple metadata servers are still being debugged.

With Ceph, part of the timing issue is that the markets for object stores and network-based block devices dwarf the market for distributed POSIX. But I bring it up to point out that distributed POSIX is also just a really, really hard problem, with limited use cases. It’s super convenient for getting an existing Unix executable to run on lots of machines at once. But that convenience may not be worth the challenges it imposes on the infrastructure.

epistasis

CephFS simply didn’t gain as much traction because it made sense to just store objects in many cases, and let something else worry about what is stored and where. A massive distributed file system is not nearly as necessary as people make it out to be for a lot of different workloads.

X-Istence

I’ve played around with Ceph, I really like its features and strong consistency.

But its weakness is that it is a complex beast to set up and maintain. If you think that configuring and maintaining a Hadoop cluster is hard, then Ceph is twice as hard. Ceph has really good documentation, but all the knobs you have to read about and play around with are still too much.

olavgg

I’ll echo that. We had a project that at one time, even with a couple of experienced Ceph admins, suffered meltdown after meltdown on Ceph, until we just took the time to move the project and workload over to HDFS. Our admins learned to set up and administer HDFS from scratch and have had far fewer issues.

bane

I’ve run a Ceph deployment for about 18 months now; it’s good. Ops-wise it’s like anything: there are a few quirks to learn, but it’s very solid. I played around with a couple of Puppet modules and ended up on ceph-deploy as well, which makes life quite easy.

angrygoat

Doing some work with the Ceph code in the (G)iant timeframe, almost two years ago now, my assessment was that the code was under heavy revision and strong project oversight wasn’t obviously present. I wouldn’t have been in any hurry to use it in production. Today the project says the latest release(s) are of production grade.

pdp10

Ceph is an object-based, scale-out distributed storage platform with geo-replication capabilities. Ceph provides a filesystem interface and a block storage interface. It is complicated to set up, but works very well as a distributed object store (Dropbox replacement) or filesystem. I would recommend Ceph for your use cases, but it will take some time to architect and set up properly.

joe8mofo

Probably not a drama for you but Ceph falls apart above 200 nodes. It’s a great tool but if you want to scale it big, be warned.

dreadpiratewombat

Tip: Ceph prefers uniform hardware across pools. If you are adding drives of dissimilar size, you can adjust their weights. However, for best performance, consider a CRUSH hierarchy with drives of the same type/size.

ceph.com
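(For the mixed-drive situation mentioned earlier (450GB to 8TB), here is a small sketch of how one might derive per-OSD CRUSH weights from raw capacity. Weighting by capacity in TiB is the usual Ceph convention, but the drive sizes and OSD names below are illustrative only, and the printed `ceph osd crush reweight` commands should be double-checked against your own cluster before running them.)

```python
# Sketch: derive per-OSD CRUSH weights from raw drive capacity.
# Ceph's usual convention is weight ~= capacity in TiB; the drive
# sizes and OSD names below are illustrative only.
drive_bytes = {
    "osd.0": 450 * 10**9,     # 450 GB drive
    "osd.1": 4 * 10**12,      # 4 TB drive
    "osd.2": 8 * 10**12,      # 8 TB drive
}

for osd, size in drive_bytes.items():
    weight = size / 2**40     # bytes -> TiB
    print(f"ceph osd crush reweight {osd} {weight:.3f}")
```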

Tahoe-LAFS

Erasure coding, focus on security with encrypted data
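(A quick sketch of my own showing what k-of-n erasure coding buys you; the 3-of-10 parameters are the commonly cited Tahoe-LAFS defaults, so treat them as an assumption rather than gospel.)

```python
# k-of-n erasure coding arithmetic: any k of the n shares are enough
# to reconstruct the file. 3-of-10 is the commonly cited Tahoe-LAFS
# default (shares.needed=3, shares.total=10); treat as an assumption.
k, n = 3, 10
expansion = n / k               # storage overhead factor
tolerated_losses = n - k        # shares you can lose and still recover
print(f"expansion ~{expansion:.2f}x, survives loss of {tolerated_losses}/{n} shares")
```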

I use Tahoe-LAFS for all of my distributed FS stuff, and I really do love it. One minor downside is that the introducer server is a SPOF, but they can be backed up/spun up and distributed introducers are on the roadmap for the next year.


thraway2016

Almost to my bewilderment, I learned that the Tahoe-LAFS distributed filesystem is written in Python.

commandline_be

RozoFS

All I can find are press releases.

Additional info

Marian Marinov gave a presentation entitled “Comparison of FOSS distributed filesystems.” His slides were very useful to read, although he says that BeeGFS uses FUSE, which I don’t believe is correct.