Reed Solomon codes are cool

Imagine wending your way through a great book on your e-reader, the world melting away, and suddenly everything comes crashing back to reality with an apologetic Sorry! Chapter 20 corrupted! message.

A few tired cells of the flash storage gave up the ghost overnight and corrupted your book.

Wouldn’t it be great if your device didn’t complain about its innards, and recovered from the problem itself?

As with all blog articles with long winded opening sections, there’s a big reveal coming. Which gives me great pleasure to welcome Reed Solomon to the stage! Come on out Reed, Solomon - you too!

Recovering from corruption

sigh. I know…

The central idea of Reed Solomon codes, is you can recover data in the event of corruption or data loss up to a tolerated level of failure. This is commonly found in data storage or signal processing systems as an extra layer of protection in the event of problems arising.

It’s actually used a lot in your day to day life, according to Wikipedia:

They have many applications, the most prominent of which include consumer technologies such as CDs, DVDs, Blu-ray Discs, QR Codes, data transmission technologies such as DSL and WiMAX, broadcast systems such as satellite communications, DVB and ATSC, and storage systems such as RAID 6.

Even if a QR code gets a bit smudged in the rain, or your CD-ROM copy of Encarta ‘98 is suffering from a bout of bit rot, you can take solace in the fact that this error correction algorithm keeps the good times rolling.

Which is kind of reassuring in a way, especially with the write-once-read-many situations where getting a backup copy might involve a lot of time and money.

The Wikipedia page doesn’t mention cloud storage, I’m pretty sure Google use it as part of their file storage systems1, so you have Reed Solomon to thank for at least one of the lines of defense that keep your cloud data durable.

It’s even being used as a safety net for storing data in DNA, which is mind blowing. If anything, it shows that if you can store something in bits, this algorithm is generic enough to provide an error correction mechanism.

What’s going on

In the storage case, you prepare your data by splitting it into fixed evenly sized data shards, then feed them to the algorithm. After some number crunching, it will return some additional parity shards. This bag of shards make up the set of data you store on disk(s).

Those parity shards, are the magic sauce that lets you recover any combination of corrupted or missing shard(s).

It just amazes me that this works! It feels like some sort of magic, but it’s actually grounded in some clever mathematics, that I won’t attempt to explain or provide a strained analogy for, other people have done it better.

It’s not limitless though, the amount of parity shards dictates the maximum number of failed shards you can recover. So you’d base this on expected level of failure.

You could say you want one parity shard for each data shard, the downside to that approach is the storage required would be double, as well as increasing the computation time needed to recover from failure. Probably overkill for not much gain.

According to this post, Backblaze uses a parity of 3 for their Vault product, which from the sounds of it is the sweet spot between balancing the risk of failure vs. processing speed.

A Vault splits data into 17 shards, and has to calculate 3 parity shards from that, so that’s the configuration we use for performance measurements. Running in a single thread on Storage Pod hardware, our library can process incoming data at 149 megabytes per second.

Yeah, seen it all before

This might be common knowledge to a lot of people but I found it an interesting detour down the Wikipedia rabbit hole. I’m not sure if I’ll ever find myself implementing something that uses this, but it’s nice to have it in the toolbox.