GreenReaper

From 2TB to 3TB - how hard could it be?

This week we expanded Inkbunny's usable submission storage by 50% without paying a penny.
This journal details why, and how, for the curious. (This diagram of RAID levels may help.)

Inkbunny is always growing - it varies depending on bulk uploads and transfers, but on average we add 1GB/day. For our donation drive we estimated that our 2TB of submission storage would last three years.
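
For anyone who wants to check the arithmetic, here's a back-of-the-envelope sketch - the "already used" figure is an illustrative round number, not an exact reading from our monitoring:

    # Back-of-the-envelope storage forecast. Assumes a constant growth rate;
    # real usage varies with bulk uploads and transfers.
    GROWTH_GB_PER_DAY = 1      # long-term average mentioned above
    USED_GB = 950              # illustrative round number, not an exact reading

    def years_left(total_gb, used_gb=USED_GB, growth=GROWTH_GB_PER_DAY):
        return (total_gb - used_gb) / growth / 365

    for total in (2000, 3000):     # 2TB before the change, 3TB after
        print(f"{total}GB usable: roughly {years_left(total):.1f} years of headroom")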

Three years sounds like a long time. But traffic is increasing, as is the number of active users. We'd like to get as far ahead of ourselves as possible with our current hardware.

Up 'til now, Inkbunny's submissions have been stored in a RAID 1+0 array of four 1TB 7200-RPM disks; its database is stored on two 64GB SSDs, in RAID 1. Both of these RAID levels perform "mirroring" - storing two copies of each bit to provide redundancy against hardware failure.

[RAID isn't a backup; it improves performance and avoids downtime in certain situations, as just happened to e621. We have separate on-site and off-site backups.]

RAID 1+0 provides great write performance. This was important before we moved the database to SSDs and reduced other writes to those disks. It's now a bit of a waste, so we decided to switch to RAID 5, which uses simple parity calculations so that a single disk's worth of storage can protect any number of data disks.
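
To illustrate the idea - this is a toy model, not what the controller's firmware actually runs - the parity block is simply the XOR of the corresponding data blocks, so any one missing disk can be recomputed from the survivors, and usable capacity is one disk less than the total:

    # Toy illustration of RAID 5 parity - not the controller's actual firmware logic.
    from functools import reduce

    def parity(blocks):
        """XOR corresponding bytes of the given blocks to produce a parity block."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    # One hypothetical stripe across our four disks: three data blocks plus parity.
    data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
    p = parity(data)

    # Lose any one data disk: XOR the survivors with the parity block to get it back.
    recovered = parity([data[0], data[2], p])
    assert recovered == data[1]

    # Usable capacity: n-1 disks' worth for RAID 5, versus n/2 for RAID 1+0 mirroring.
    disks, disk_tb = 4, 1
    print("RAID 1+0:", disks // 2 * disk_tb, "TB usable; RAID 5:", (disks - 1) * disk_tb, "TB usable")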

In theory, RAID 5 is slower than RAID 1+0 for general use. In practice, our RAID controller stores writes in 1GB of RAM and reports immediate completion, giving similar write performance regardless of RAID level. It's got a backup capacitor; if you pull the plug out, it stores in-flight data in flash until power is restored.

[RAID controllers can get very fancy. We could plug in more SSDs and cache our hard disks with them, but that'd be overkill at our size.]

RAID 5 has become less popular as disk capacity has increased faster than write speeds. Having to read every remaining disk during a rebuild stresses them; if another disk fails before the rebuild finishes, you face restoring at least part of your data from backups - and parity calculations limit rebuild performance (to around 33MB/sec in our case). But we only have four disks, and they're not too big. They've lasted several years and average a 25-35% utilization rate at 33°C, which research suggests is close to ideal.
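
At that rate, rebuilding onto a replacement disk is an all-day job. Roughly:

    # Rough rebuild-time estimate at the ~33MB/sec parity-limited rate mentioned above.
    disk_bytes = 1 * 1000**4        # one 1TB member disk
    rate = 33 * 1000**2             # ~33MB/sec
    print(f"~{disk_bytes / rate / 3600:.1f} hours to rebuild one disk")   # ~8.4 hours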

As a bonus, RAID 5 can be faster for reads. Due to their construction, the start of a hard disk is normally faster than the end - this is why they get slower as they fill up. With data striped across three disks rather than two, the "distance" from the start is reduced by 33%. We get many requests for little files, and we can't cache all of them, so read access time is important.
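
A simplified way to see it (this ignores parity rotation and the controller's real on-disk layout; only the proportions matter):

    # Simplified view of how far into each member disk a given array offset lands.
    def per_disk_offset_gb(array_offset_gb, data_disks):
        return array_offset_gb / data_disks

    for layout, data_disks in (("RAID 1+0", 2), ("RAID 5", 3)):
        gb = per_disk_offset_gb(1500, data_disks)   # e.g. a file 1500GB into the array
        print(f"{layout}: ~{gb:.0f}GB from the (fast) start of each disk")
    # Prints ~750GB for RAID 1+0 and ~500GB for RAID 5 - the 33% reduction above.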

----

That's why we moved to RAID 5. But how? On the RAID side, it took just two steps:

* Transforming from RAID 1+0 to RAID 5 (the longest part; the controller had to rewrite every bit)
* Increasing the size of the logical disk (almost instant, with some background work)

The transformation was timed to fall outside our peak period - a good thing, since it took 15 hours. The controller was now presenting a 3TB "logical disk" - and once we'd poked Linux, it knew that as well.

Unfortunately, our system was set up with a master boot record (MBR) for its partitions. MBR stores partition positions and sizes as 32-bit sector counts, so with 512-byte sectors it only works up to the 2TB mark. That's fine when your disk is 2TB, but poses problems after that.
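
The arithmetic behind that limit:

    # Why MBR tops out at 2TB: partition start and size are 32-bit sector counts.
    max_sectors = 2**32                              # largest count a 32-bit field can hold
    print(max_sectors * 512 / 1024**4, "TiB")        # 512-byte sectors -> 2.0 TiB
    print(max_sectors * 4096 / 1024**4, "TiB")       # 4KB "advanced format" sectors -> 16.0 TiB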

The newer GUID Partition Table supports partitions of almost unlimited size. GPT is usually associated with UEFI (a replacement for the BIOS), but it works fine with BIOS-based computers - at least, if you're not using Windows.

What we had to do was:

* convert the partition table to GPT
* allocate a BIOS boot partition to store our bootloader's second stage
* reinstall said bootloader
* replace the old partition record with one indicating that the partition was now larger
* reboot to get Linux to accept the new partition values
* expand the filesystem
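
For the curious, the sequence looks roughly like this - shown here with sgdisk, GRUB 2 and an ext filesystem's resize2fs; the device name and partition numbers are placeholders rather than our exact layout, and you'd want backups and a rescue plan before running anything like it:

    # Illustrative plan for an MBR-to-GPT conversion and grow - device name and
    # partition numbers are placeholders, NOT our exact layout. Have backups first.
    import subprocess

    DISK = "/dev/sda"          # hypothetical RAID logical disk
    DATA_PART = 2              # hypothetical data partition number
    DRY_RUN = True             # print the plan rather than executing it

    steps = [
        ["sgdisk", "--mbrtogpt", DISK],                            # convert the MBR to GPT in place
        ["sgdisk", "--new=3:0:+1M", "--typecode=3:EF02", DISK],    # small BIOS boot partition for GRUB's second stage
        ["grub-install", DISK],                                    # reinstall the bootloader
        ["sgdisk", f"--delete={DATA_PART}",
         f"--new={DATA_PART}:0:0", DISK],                          # recreate the data partition record over the rest of the disk
        ["reboot"],                                                # let the kernel pick up the new table
        ["resize2fs", f"{DISK}{DATA_PART}"],                       # then grow the filesystem into the new space
    ]

    for cmd in steps:
        print(" ".join(cmd))
        if not DRY_RUN:
            subprocess.run(cmd, check=True)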

As one staff member noted, it felt "like open heart and brain surgery all at once". Inkbunny's server is in a datacenter hundreds of miles away; if we'd got it wrong, we'd be left to fix it in rescue mode (the equivalent of a Live CD) over a remote connection. But it all worked out in the end.

From what we can see, there's essentially no difference in performance - we just have more space.
If anything, disk utilization has decreased. As a bonus, if we have to increase storage beyond 3TB, it'll take one more disk, rather than two. Hopefully we won't have to do that for about five years, though!
Added: 9 years, 10 months ago
 
b4818529c406
9 years, 10 months ago
And this is why I love Inkbunny.

Technical news shared with the commoners, and the staff actually know what they're doing.
Alfador
9 years, 10 months ago
=^_^= I am pleased to learn these things. Including the fact I was unaware of, that MBRs are only for 2TB drives and less. @_@
GreenReaper
9 years, 10 months ago
Technically, if you use a native "advanced format" drive with 4KB physical and logical sectors, you could get it up to 16TB, since the limit is really on the number of logical sectors (2^32), not bytes. We're not doing that, though.

16TB is also a limit for NTFS volumes on Windows XP with the default 4K clusters, for similar reasons. This isn't normally an issue, given the MBR limit, but might come up with RAID or dynamic disks.
maxinered
9 years, 10 months ago
Did you say you're using SSDs for database work? Last I heard, SSDs don't like many writes at all (given that they're essentially just large EEPROMs rated for 10,000 to 100,000 write cycles per cell). So putting a database with a lot of writes on one sounds rather counterproductive to me.
LeonHunter
9 years, 10 months ago
Actually, SSDs are preferable in environments where high IOPS is important - such as a database server. Concerns about SSD write cycles are usually unfounded outside of enterprise use cases - unless, of course, you're seriously abusing the storage in ways it was never meant to be used. Most SSDs on the market use an even-wear (wear-levelling) algorithm to ensure the individual cells wear down at a mostly even pace - though that doesn't mean much for InnoDB databases, which do not release storage space when a row is erased. Regardless, the total write endurance of an SSD may ultimately exceed 2 petabytes, depending on make and model.

I doubt Inkbunny's database will create that many writes within the hardware life-cycle.
GreenReaper
9 years, 10 months ago
In over a year, we've only made 92 TB of writes to the database filesystem - though we put swap on it as well.  The SSDs are Transcend SSD 320 2.5". Not sure it counts as "enterprise", but if one of them fails, then it shouldn't take long to mirror it back from the other - and our host will have to pay to replace it. The hope is that they don't both fail at exactly the same time!

Looking at the SMART information on the individual drives, the SSD "wear-out indicator" is 0 (raw value: 43117 cycles), but that really doesn't mean all that much - drives can fail before or after that. There are no reallocations or uncorrectable errors on any of the drives we're using.
GreenReaper
9 years, 10 months ago
As for databases - we're using PostgreSQL for the site; MySQL only for statistics. PostgreSQL works in a similar way, writing new information while keeping the old rows around. We do a VACUUM FULL every few months which completely rewrites the tables (we have TRIM/discard enabled). Of course, it's a Sandforce drive, and much of the database is compressible.
Shokuji
9 years, 10 months ago
Good work, guys. =) I also wanted to say that I appreciate how you add so many links for those who want to know more and learn a thing or two. =3
Fellarts
9 years, 10 months ago
Very nice. Keep up the awesome work guys!
Christiebunny
9 years, 10 months ago
"like open heart and brain surgery all at once"  ..... been there, done that, picked up the pieces afterwards, more than once :p  RAID is hell to deal with sometimes, but when things go right it's pretty awesome stuff :)