Inkbunny's development started almost five years ago. Most of that time was spent adding new features - and everyone loves those! But just as an artist might need to clean up around the edges, there are times we need to step back and refactor the code and the database. I'll be talking about the latter here.
As powerful as our current server is, it only has 32Gb of RAM, and we'd like to keep as much of the database in memory as possible. The enterprise-grade SSDs used to store the data are expensive, too - and we want to use your donations as efficiently as possible.
As such, we're applying transformations to reduce the size and rate of growth of the database. This could mean the difference between having to upgrade in a year or two vs. three or four. It should also improve its performance - less data to handle means faster operations.
If that's enough for you, you can stop here. If you want to know exactly what we're doing, and don't mind learning a bit about databases, read on!
Reducing the size of keys
Relational databases are organized into tables, and within each are rows and columns. Each row has information related to one instance of a particular type of thing - for example, there's one row per user in the users table, containing their username, email address, etc.
In most cases, each row is uniquely identified by one or more pieces of information. This is known as a key, and it's used in other tables to record a relation; for example, both the unread submissions and the favourites tables store one copy of the user and submission keys for each unread submission/+fav.
The primary key is often set to be a steadily-increasing number which otherwise has no meaning. This helps avoid issues with relations to other tables. (Usernames are unique, and could be used as a key, but they can also change; we don't want to have to update every table relating to a user when they do.)
Numbers are also small. Think about how fast unread submissions can rack up if you're watching a lot of users. We have members with tens of thousands of them. (This brought down FA in December.)
What we've used to date is an eight-byte integer (whole number) key. Eight bytes doesn't sound like much, but it can store 9,223,372,036,854,775,807 different positive values. It's good to be optimistic, but the reality is that Inkbunny will never have 9,223,372,036,854,775,807 users, past or present.
It's a fair bet that we'll also never have 2,147,483,647 users, the maximum for a four-byte signed integer. We can therefore use a four-byte key, saving space in the user table and every table which relates to it. This same reduction can be applied to submissions (even deviantART has only used 420,000,000 submission IDs), private messages, comments, keywords, etc.
The one thing that might not work so well is user favorites. We aleady have nine million of these, and it's not beyond the realm of possibility that we could have 250 times that in the future. But that's OK, because it turns out we can just get rid of that key!
Removing redundant keys on junction tables
We often want to represent a relationship between two types of object - for example, if and when a user +fav'd a work, and how many stars they used. These details are recorded in a separate table, using copies of the user and submission IDs (known as foriegn keys).
Because a user can only have one +fav for a single submission, the combination of the user and submission IDs (the copies I mentioned earlier) uniquely identifies the row. It's a compound key. A separate key is unnecessary - reflected by the fact that we never queried it. So we're getting rid of it.
Such tables are among the largest in the database (some people have over 20000 +favs), so slimming them down has a big impact. We expect to save almost 1GB on +favs alone, when combined with a few other things, such as . . .
Using more efficient data representations
Many items of data are stored as strings of text. Text is great - it can be used for just about anything. However, our database has more efficient data types which take up less space and are faster to search and compare to each other.
In particular, the 'inet' type is designed to store a network address, in just seven bytes for an IPv4 address. We're also planning to convert the number of +fav stars to a smaller format, and store MD5 hashes as UUIDs rather than strings, halving their size.
Removing unnecessary data
Of course, the cheapest data is that which you don't store at all. For example, we don't need to store your user ID, IP address or the date that you for every row in a character information sheet. Most of this information is stored for the submission. We just need the submission ID, the field, and the text.
The tables aren't the only thing we have to look at, though.
Removing unused indexes
You might think it natural for rows in a table to be ordered, In fact, data is generally stored in no particular order; for example, user 100 might come before user 1. This avoids the need to re-order rows on addition or deletion - besides, there are several different orders which might be used.
Instead, the database uses indexes. You can think of them as lists ordered by one or more pieces of data from the table, with each line pointing to a particular row.
For example, there might be indexes on username, user ID and email address. Without these, doing searches on such data - like finding every user whose name started with E, or checking to see that a given user ID didn't already exist - would be very slow; you'd have to read the whole table.
Indexes are crucial for performance. However, they occupy lots of space - often more than the table itself - and take time to update when entries are added or deleted, so if they're not actually used, they're a waste. We're identifying and removing unused ones.
Reordering the columns of tables to decrease padding
Unfortunately the database software we're using isn't smart enough to say "this field is big, it's got to go first". Instead, it arranges columns in the order given, and uses padding to enforce alignment.
By ordering the fields such that the larger, regularly-sized ones go first, we can reduce or eliminate padding. A few bytes can translate into tens of megabytes when you have millions of rows.
One neat trick that saves 45Mb
Reducing padding is great, but what if you end up with a tiny field at the end - say, the number of stars in a +fav - which has to be padded to a full eight bytes? Why, you get rid of it entirely!
Databases have a concept called NULL, which represents the lack of information, as opposed to any particular value. With our database software, nulls are stored in a series of bits after each row's header - if a bit is set, there is no space reserved for the corresponding field. There's normally a byte (eight bits) of padding after the header - alignment, again - so if a row has eight or fewer fields, nulls are free.
Here's the trick: It turns out that 70% of +favs use three stars, 20% are two star, 10% are one star. When it's three stars, we can set the star value to NULL, and turn it back into three stars when it's queried. This can save eight bytes on 70% of the rows in the table.
(Database purists might regard this as an abuse of NULL, but the savings are worth it.)
Whew. It's a lot, but it's worth it. In the end, we think we can get the size of a +fav down from 84 bytes to 47 - a 44% improvement - and its index sizes down by 53%. The total savings are 900Mb; a full 50%. And that's just one table. Last night's changes cut almost 1.5GB in total
These kind of data transformations take time, so we're doing them in stages. We're trying to keep it to off-peak times, but apologize for any inconvenience caused! The result should be a faster site with less downtime for site migrations, so it's worth it in the end. :-)