Sunday, August 20, 2006

Writing a distributed filesystem in 24 hours

So you want to run Flickr, YouTube out of business with your brand new site. Whether it happens or not - you need a decent distributed filesystem to store the millions of files your users will upload.

You don't need full filesystem functionality (i.e. seeking files, reading/writing parts of files, using the filesystem as a normal Linux FS) for this purpose, so GFS is probably not for you. Besides, it seems to be kernel-based (and I hate Linux kernel-based solutions for their instability) and it also seems to be RedHat only, so it's not really a choice for you - the true hacker. There is also Coda and AFS but both seem to be immature and I wouldn't bet any money on any of those.

What you need are simple put/get file operations. This narrows your choice down to MogileFS. MogileFS has some bad things about it:
  1. It's Perl-based (yuck!)
  2. It requires a central database to store the locations of files, and when your load rises the database gets more and more hits possibly becoming the bottleneck of the whole deployment.
If you can cope with these then you're all set. But if you want to achieve infinite scalability and get rid of the central database - read on.

What you need in order to hack a fully fledged distributed FS:
  • Python, the best language a business can use without risking the lack of programmers (Common Lisp is better, but I don't know anyone who knows it except for me)
  • Some means of communication between the client and the FS - I chose HTTP (using Python's builtin urllib on the client side and Django on the server side), but you can use Twisted's PB or any other communications protocol.
Well.. that's it. No database required, no kernel hacks, no hidden costs.

So how DOES it work?
Let's say you have 9 servers to use for the filesystem, each with 400GB of storage (2 SATA drives in RAID 0). We will divide those into 3 clusters, each cluster consisting of 3 servers. So we will replicate every file in the FS to 3 independent machines. This way the machines can have cheap SATA-based hard drives running in RAID 0, and when a machine fails we still have 2 other computers storing the same files.

On each of the servers there is a HTTP server running which channels all requests to a Python program supporting two operations - putFile and getFile. It can be a single Django view, or two Python methods exported by PB. Whatever it is it's simple and you can implement it in a matter of minutes.

The whole magic is inside the client. What the client does is it treats our server clusters as hashtable buckets. So basing on the name of the file we want to put or get from the FS, it calculates the hash of it (MD5 for example which is fast and available as a Python library) and divides that modulo the number of clusters. As a result we get the number of the cluster to which this file belongs. If it's a putFile operation we're doing - we execute this operation on ALL machines in this cluster, and if it's a getFile - we execute it on a random one (and if it's not working - we try another one and so on).

Simple, easy to implement and effective. There is of course a bunch of problems with this implementation (adding clusters, lack of transactions, etc) but all these can be overcome. One thing is certain - you can have it up and running within 24 hours.

If you have any questions please post them in the comments, and perhaps I will continue to part two of this tutorial with more details and perhaps code snippets. And if you are interested in buying such a solution together with all the software required to grow and maintain it - let me know.


cratuki said...

I'm very interested to know how you go with this. One of the things I need to write as a home project at the moment is a file system which encrypts as it saves, and also puts the saved file into a queue to be copied via rsync to a remote host. This saves me having to worry about RAID.

I was going to use something like LUFS-Python to implement it. I've written a blog entry around this topic recently and you might be interested in it.

cratuki said...

Another idea for a cool filesystem: a software development platform for a language like lisp. Every time you save it interprets and runs the contents of the file. Thus, you could basically check out your source repository to the file system and then just switch it on. You wouldn't need an interactive console although you may choose to connect a log viewer to it. You'd be able to use any editor.

Wojtek Sobczuk said...

LUFS-Python seems like a nice idea if you want to bind yourself to Linux and not require a client for the filesystem but treat it as a local one.

I don't think that this solution gives a lot of benefits though if you only plan on using the FS from your code. If you want other applications to use it then yes, it's a good idea.

I don't quiet get your other idea regarding lisp and software development?

cratuki said...

> I don't quiet get your other idea
> regarding lisp and software development?

Oh - well the filesystem is also a lisp interpreter. So every time you save a file, the event first saves the file as requested, and then injects the source into the lisp interpreter. In this way the new code comes into effect as a result of the save.

Wojtek Sobczuk said...

Oh I get it now. That's an exotic way to implement this idea ;)

Toby said...

Your idea sounds a very close to GFS (Google's, not RedHat's), if a lot more simplistic. They also do not have transactions, not really requiring them at that level. If you're interested, check out how they implement atomic, concurrent append without a distributed lock manager. Its kinda neat.