This is my first post as an English-language blogger, so be gentle :). In recent months, I’ve had my fair share of disputes with several storage vendors and system integrators. Starting from the simple task of selecting a flexible network-attached storage option for our small company, we had a hard time bouncing between enterprise-class integrators and consumer vendors. Our Goldilocks approach didn’t seem to find any momentum: there simply isn’t the right choice in the market.
What’s so special about our storage?
Well… nothing, really. What I was looking for was a cost-effective system, capable of performing the following simple tasks:
a. Backup. While we intended to keep live data on our main server, we needed some sort of very large, cost-effective and fast mirror of our stuff. That’s really storage 101, nothing special.
b. Live data access. Later we reviewed our architecture, as we thought it would be a good idea to split some tasks (and failover mechanisms) between multiple small servers, thus creating the need for a shared storage array. In plain English, we wanted to be able to access our data either through some networking protocols (SMB, AFP) or through WAN protocols like FTP or WebDAV.
c. Redundancy, integrity and high availability. Don’t think of the Manhattan Project; we just wanted our services to survive common hardware failures, on all levels if possible, and our data to survive common data corruption events. That’s the most basic assumption in designing storage architectures: everything that COULD fail WILL fail at the worst possible moment. With hardware being relatively cheap these days, we thought that was the easy part…
d. Large capacity. As you might know, backup doesn’t do very much on its own. Read a nice story by Robin Harris about how he lost data in a multi-layered backup environment. The same thing happened to us in the past. What we wanted was the storage capacity to do a “hot” archive, keeping incremental backups of our data on well-managed disk arrays with plenty of redundancy.
Here’s what we got:
1. Enterprise-class storage integrators, we found, live in a completely parallel universe. As we are a small company, even though we are doing alright by Eastern European standards, there was no way in hell we could afford a SAN array with some hardcore servers that would probably cost more than our operational profit over the last couple of years. AND there was also no way in hell I would have bought that crap even if I had won the lottery, as we have found there are some good options out there.
2. Consumer vendors were selling, at best, some fragile and low-performance NAS units, which might have made a fairly sensible purchase, with all their shortcomings, if we hadn’t pushed the Goldilocks envelope. After all, a cheap Atom box with limited performance and a lot of proprietary limitations, providing at least some decent storage capacity, is not something I would easily spend $2k – $3k on.
What’s my problem?
In a world where raw storage costs $0.07/GB (and I’m being generous here), how difficult is it to get a decent and reliable storage solution for, let’s say, $0.30/GB? Is it that hard? Well, it is. Here’s why:
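To put some rough numbers on that gap (the drive price below is my own rounded assumption, not a quote), here is a quick back-of-the-envelope check of how much markup a $0.30/GB system leaves on the table:

```shell
# Back-of-the-envelope check: raw disk cost vs. a "fair" finished-system cost per GB.
# Assumed figure: a 2 TB consumer SATA drive at ~$140 (my assumption, not a real quote).
raw=$(awk 'BEGIN { printf "%.2f", 140 / 2000 }')            # $/GB for bare disks
target=0.30                                                  # $/GB target for a complete system
markup=$(awk -v r="$raw" -v t="$target" 'BEGIN { printf "%.1f", t / r }')
echo "raw: \$${raw}/GB, target: \$${target}/GB, budget: ${markup}x raw cost"
```

Even at four times the bare-disk price, there is plenty of room for a chassis, a board, RAM and a margin.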
1. Everybody is selling RAID. Well, I hate to break it to you, but if you are told that “RAID” equals “safety”, by any means, somebody is selling you plain ol’ snake oil. There is a delicate balance between reliability and fault tolerance (please read more about that here): RAID is brilliant at adding fault tolerance, but not at increasing reliability. Besides that, as disks get bigger, the rate of UREs (unrecoverable read errors) dangerously threatens the feasibility of single-parity RAID rebuilds. And things don’t look good in the future either.
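The URE problem is easy to quantify. Assuming the commonly quoted consumer-drive error rate of one unrecoverable read error per 10^14 bits, and a single-parity array that must read ~12 TB of surviving data during a rebuild (say, a degraded 7 x 2 TB RAID-5 — my illustrative layout), the odds of the rebuild hitting at least one URE are sobering:

```shell
# Rough odds of at least one URE while rebuilding a single-parity array.
# Assumptions: URE rate of 1 per 1e14 bits read; 12 TB of surviving data to read.
p=$(awk 'BEGIN {
  bits = 12e12 * 8            # 12 TB of surviving data, in bits
  rate = 1e-14                # unrecoverable read errors per bit (consumer spec)
  printf "%.0f", (1 - exp(-bits * rate)) * 100   # Poisson approximation
}')
echo "chance of at least one URE during rebuild: ~${p}%"
```

A roughly three-in-five chance of a failed rebuild is not what most people picture when they hear “RAID equals safety”.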
2. Vendors are keen on selling hardware RAID and explaining why software RAID sucks so much. While I understand that any system integrator is happy to sell expensive (read: overpriced) hardware, there is a tiny issue here: there is no “hardware” RAID. There is only software RAID embedded in hardware. Which is also usually pretty old. But it works, we’ve got to give them that.
3. In almost all cases the client has absolutely no idea what he is buying. Sadly, more often than not, the vendor also has very little clue what he’s selling. So most discussions go like: “here’s what we got. no clue what’s in the box, but some smart people I’ve never met say it’s hot. want it or not?” Well, not.
So, what to do to have decent storage?
First, we should do the thinking part. What we need first is the actual storage itself. Luckily, there aren’t many options left, so by “storage” one can only mean disks. And by “disks”, I mean large-capacity, consumer-grade SATA disks. Why not some “exotic” SAS-like options, with impressive specs like 15k rpm and large throughput? Well, here’s why:
– All of Google’s infrastructure is running on cheap, consumer-grade hardware and is doing just fine. More than that, one of the world’s most comprehensive studies of HDD failure rates, performed by Google on a population of 100,000 disks, finds failure rates of consumer-grade disks at very acceptable levels. It also finds that most “embedded” early-warning systems like S.M.A.R.T. fail to predict disk failure more often than not. So, it’s not the hardware.
– I don’t buy much of the “ultra-fast 15k rpm” SAS drive argument. Not while those things are spinning 2.5″ platters with low-density data on them (enterprise safety, right?). They’re not that fast, and the bang for the buck is underwhelming at best. When measured against capacity, the energy efficiency isn’t that brilliant either. And in terms of safety, I really don’t care about low density and big blocks as long as the filesystem cannot ensure end-to-end integrity. ECC memory and smart controllers are not enough on their own. Think of a power or hardware failure while write data is still sitting in cache…
Second, we need a decent file system. While Google has its own way of handling data (see the GFS link above), we definitely don’t have the economy of scale to implement something like that, even if it weren’t proprietary. But we have ZFS. One of the world’s most brilliant and effective file systems, ZFS can provide just about everything we asked for: data integrity through a clever copy-on-write transactional model, live snapshots of our data, the ability to easily manage virtual pools of multi-terabyte drives (no volumes), and much more. I will definitely write more about ZFS in the future, but for people unfamiliar with it, here’s what you need to know: it blows away hardware RAID from every possible angle. Just software alone, on regular consumer hardware. AND it’s free.
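To give a flavor of how little ceremony all this takes, here is a minimal admin sketch (device names and dataset names are my assumptions, and it obviously needs real disks and root to run):

```shell
# Hypothetical ZFS setup on FreeBSD/OpenSolaris; da1..da6 are assumed device names.
# A double-parity pool over six disks, with compression, a snapshot and a scrub:
zpool create tank raidz2 da1 da2 da3 da4 da5 da6   # pool survives any two disk failures
zfs set compression=on tank                        # transparent compression, usually a net win
zfs create tank/backup                             # a dataset, not a fixed-size volume
zfs snapshot tank/backup@nightly                   # instant, copy-on-write snapshot
zpool scrub tank                                   # walk the pool and verify every checksum
```

Five commands, no RAID controller BIOS, no volume sizing up front: datasets grow and shrink inside the pool as needed.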
Third, we need an OS. ZFS is just a file system (and a volume manager); it has to run on something that supports it. Luckily, there aren’t many options either, thanks to the captains of the industry who worked hard at narrowing the list down for us:
a. OpenSolaris. The open-source OS developed by Sun Microsystems (who also developed ZFS) would be a pretty good choice, if it didn’t have a pretty uncertain future due to Oracle’s narrow-minded strategy, which will likely end with the termination of OpenSolaris. Sadly, the OpenSolaris community is not as strong as the BSD community and it probably won’t be capable of sustaining the project through forks.
b. Speaking of which, FreeBSD sounds like a sensible solution, as it has everything one could wish for: a great developer community, extremely mature code well anchored in the core UNIX philosophy, a great security and stability record, a wide presence in the industry (most Juniper boxes run BSD) and ZFS support. And, did I mention it’s free?
c. Mac OS could have been on this shortlist, had Apple not scrapped ZFS support over some commercial issues. Apple makes some very reasonable hardware when it comes to servers (the Mac mini server is brilliant in a low-cost environment) and I would have loved to tweak OS X with ZFS, but it probably wouldn’t have made a very flexible product.
Finally, all we have to do is read thoroughly the hardware compatibility list (HCL) of our chosen OS and build (or build-to-order) a compatible machine of our choosing: a motherboard with plenty of SATA ports (we found a few with 10–12 of them), an energy-efficient processor (more like a C2D than an Atom, though), as much RAM as we can fit in (12–16 GB would do just fine), a couple of Gigabit cards (for link aggregation) and the rest of the details. While this isn’t a walk in the park, it’s not rocket science either. Our bill of materials for a 16 TB system (Core i7, 16 GB RAM, SSD-based buffering and much more) comes to around $4,000, with off-the-shelf components. What could a baby like that do? Well, for example, handle 150 IP cameras shooting 10 fps at 5 MP resolution, without hiccups, while ensuring 10 TB of usable capacity in a triple-redundancy (N+3) configuration. Not bad for $4,000. Want a live mirror? Just build two boxes and you have 12 TB of high-availability storage (2 x N+2, enough even for the most paranoid) for under $10k. Some 1/3 to 1/2 of the price enterprise vendors are charging.
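The capacity arithmetic above checks out if you assume a layout of eight 2 TB drives in a triple-parity (raidz3, i.e. N+3) pool — my assumed configuration, not the exact bill of materials:

```shell
# Sanity check of the 16 TB raw / 10 TB usable claim, assuming 8 x 2 TB drives
# in a triple-parity (raidz3 / N+3) layout. The exact layout is my assumption.
disks=8; size_tb=2; parity=3
raw=$(( disks * size_tb ))                  # total raw capacity
usable=$(( (disks - parity) * size_tb ))    # capacity left after parity overhead
echo "raw: ${raw} TB, usable with N+${parity}: ${usable} TB"
```

The mirrored 2 x N+2 variant works out the same way: drop one parity disk per box and each pool yields 12 TB usable.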
We could build it, but who’s buying it?
Frankly? Nobody. Although arguably one of the most brilliant storage solutions around, a ZFS-based system on top of off-the-shelf hardware has no market. The reasons behind this are complex and time-consuming to explain, but, in a nutshell, the situation is this: most corporate IT managers are more worried about the risk exposure of the investment itself than about the technical feasibility of the solution. As we are still speaking of high-capacity, high-security and high-performance storage, such a product would largely address the corporate market, which very rarely shops outside well-established brands, which, of course, favor proprietary systems covered by IP that ensures a solid gross margin. And in exchange they offer a sense of safety and deliver on their promises, we’ve got to give them that. Even if they don’t promise much and charge way more than it’s worth.
Nevertheless, the world needs storage. As data infrastructure moves towards clouds (not necessarily public ones, but that’s another topic), storage is more and more a vital component of our effectiveness, our security and our lives. Most consumers don’t have decent backup strategies. Most enterprises don’t have reliable archiving strategies. And yes, most governments don’t have decent disaster recovery architectures. It’s incredible how much there is to do in the storage market. And, sadly, there is still too little going on: the market cannot embrace massive storage at top-brand gross margins, while it still needs all those features.
Some people have built upon such open-source architectures and come up with great products. But they didn’t sell the technology directly; they sold the service. Zetta is rolling out a cloud storage service that is closely based on micro-architectures like the one described here. They don’t state that explicitly but, even if it weren’t for their name (ZFS once meant Zettabyte File System), their white papers and specs are pretty straightforward (snapshots, N+3, etc.).
Err… Give it for free?
The really awkward situation is that we have the need, we have the technology, and the numbers would add up, but it’s still not happening. What about giving it away for free? After all, some people have thought of sharing storage box designs before. We have open-source beer, for crying out loud, why not open-source storage?
All it takes is a few people to go hands-on with a few designs, ranging from cheap Atom boxes to more serious Nehalem file servers, and to publish simple DIY guides and OS builds for people to build their own storage. Slowly, community support could easily substitute for a technical service and provide a feasible alternative for SMEs. There might be some people who would jump on such a wagon…
Check out my other IT-related posts on my Romanian language blog.