IB Server Upgrade - Performance Graphs

Hi everyone

More tech stuff for those interested! This post will be interesting to some people, and gibberish to others.

Just what does adding SSD disks and doubling our RAM to 32GB do for our site performance?

tl;dr: We upgraded the server and it's waaaay faster now.

We use a wonderful performance graphing program called Munin to monitor our system. It runs constantly and provides these graphs live.

And this is what we see! https://www.dropbox.com/sh/u9cwrb3r0ooi7hf/N8tZi6-7-D

The results are impressive. We have seen immediate and dramatic improvements in a number of key areas.

In these graphs the upgrade occurred around 2:30pm (CEST, Amsterdam) and the site was back online around 3:30pm. But it should be fairly obvious.. it's the point at which there's a gap (where the server was shut down and upgraded) and then all those busy lines drop away to nothing as a stressed system finally gets a much needed performance boost. :P

What is SSD? - Basically it's a bunch of memory chips stuffed into a can and used as a regular hard disk. It looks and works like a regular disk, but it's super fast and has no moving parts inside. The memory they use is not as fast as actual system RAM but it's ridiculously faster than spinning disks. And it's very expensive per GB! So you usually only buy very small ones.

Don't SSDs wear out fast? No. Or sort of yes. They used to have a bad reputation but technology has taken leaps and bounds lately. They may last less time than a spinning disk but they should last the normal lifetime you'd expect to use them for before they become outdated anyway. In our case they'll be obsolete in two years and we'll get better ones for the same rental price per month (basically a free upgrade). So they only have to last that long.

SSDs are being sold as performance desktop and server main drives now. If they fell over “after just a few thousand writes” then they wouldn't be ready for this prime time. And they wouldn't be selling them as the standard high-performance setup at one of Europe's biggest data centers. We get replacements free anyway if they break. We run the SSDs in a RAID 1 redundant configuration so we'd just get an urgent email from the system about a failure, but no data loss. Then the host is obliged to replace the broken disk for free.

And SAS? - Actually SAS isn't really a disk type but a way to join disks to a server. In our case our SAS drives are a bunch of 15kRPM disks in another enclosure (the regular kind of disks with spinning platters inside). SAS is slow compared to SSD but WAY cheaper per GB. The 15kRPM means they spin faster than your regular desktop hard drives.

SSD vs SAS - A quick performance benchmark showed the SSD clocking in at about 140MB/s, while the SAS disk pack trundles along at a sad 70MB/s. Some people familiar with both technologies may think those numbers are low, but this is the real data rate in a real system, not some ideal benchmark provided by the manufacturer or a reviewer. Combine that with the fact that SSD has barely any seek time by comparison, and you can see why it makes such a difference. These benchmarks are still not true “real world” speed on a fully loaded system either, as I ran the tests during our downtime. Add a busy site with truly random access and hundreds of processes fighting over the IO, and you get even less real speed. Either way, what it roughly says is that SSD is twice as fast as SAS in our setup.
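To put those two numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes purely sequential reads at the benchmarked rates and uses the roughly 12GB database size as the example payload. The function name is made up for illustration, and a real loaded system will be slower than this.

```python
def seconds_to_read(size_gb, rate_mb_per_s):
    """Time to read size_gb sequentially at rate_mb_per_s (1 GB = 1024 MB)."""
    return size_gb * 1024 / rate_mb_per_s

DB_SIZE_GB = 12   # approximate size of the database on disk
SSD_RATE = 140    # MB/s, from the quick benchmark
SAS_RATE = 70     # MB/s, from the quick benchmark

ssd_time = seconds_to_read(DB_SIZE_GB, SSD_RATE)
sas_time = seconds_to_read(DB_SIZE_GB, SAS_RATE)

print(f"SSD: {ssd_time:.0f}s, SAS: {sas_time:.0f}s, "
      f"SAS takes {sas_time / ssd_time:.1f}x as long")
```

This only covers raw throughput. The seek time advantage of SSD comes on top of it and is not modelled here.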

There's some argument about SAS being “better for databases” than SSD due to comparisons in certain situations with read vs write and other factors. Well.. go look at our graphs. It's all the proof we needed that SSD kicks SAS butt. In our case, SSD is super fast for the particular use we put it to, and is way better than SAS for our database.

We'd never use the SSD for deep storage like image assets or anything else the webserver needs to access directly and infrequently. The SAS disk pack is perfect for that. And at 1TB of disks in the pack, replicating that with SSD would be stupidly expensive.

Let me explain what the graphs in this link show us... Slide show time! Get some popcorn.

CPU usage – We can already see that the CPUs are under far less load since the upgrade (the blue at the bottom of the graph). The most interesting improvement we can see here is the purple stuff at the top vanishing away. That's a measure of how long the CPUs are waiting for the disks to respond with data. More purple = bad. A tiny sliver of purple you can barely see = good! :P

In some Munin graphs, the colors are stacked on top of each other to make them easier to read. So you read the total amount of a color you see per column; you don't read it from “zero” at the bottom. In this case the comparative IO wait time is stacked at the top. This measurement has gone from a hideous max of 300 down to a constant 7.95. Yay!
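For the curious, here is a minimal sketch of where an IO wait figure can come from on Linux. The kernel keeps running counters of CPU time in /proc/stat, and the share spent waiting on disks is the change in the iowait counter divided by the change in all counters. The sample lines below are invented numbers, not our real data, and the function names are just for illustration.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first seven fields).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")

def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))

def iowait_percent(before, after):
    """Percentage of CPU time spent waiting on IO between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0

# Hypothetical samples taken some seconds apart.
sample_before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
sample_after = "cpu  1100 0 550 8800 550 0 0 0 0 0"

print(iowait_percent(parse_cpu_line(sample_before),
                     parse_cpu_line(sample_after)))
```

Note this sketch reports a share of total CPU time across all cores, so the scale may differ from what a particular graph uses.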

Those regular spiky bits on the CPU graph are the scheduled processes that run at regular intervals, like mail sending and database clean up. I always think of it as IB's heartbeat!

Disk IOs Per device / IOstat - Huge and obvious drop there. The extra RAM means our PostgreSQL database is almost never needing to read data from the disk. We've upped the memory allocation for the database so it can pretty much store all of Inkbunny in memory (the DB being about 12GB on disk at the moment). The green line is the busyness of our existing 15kRPM SAS disk pack. The blue line is the new SSD. As you can see, the poor old SAS disk pack was getting hammered before the upgrade. As it was shared between the Apache webserver and PostgreSQL database, they kept fighting over disk time and getting in each other's way. Not any more!

Disk latency - pretty much the same deal, but perhaps an even more dramatic improvement. The disks get hit less often now as more of the data they need is sitting in the file cache in RAM. With the database and webserver split between two disks, the time to retrieve data when they do ask for it is cut waaaay down. You almost can't see the lines on the graph any more. Before the upgrade it was a stressed-out Wall Street executive drinking 15 cups of doubleshot espresso a day to keep up with the heavy workload. Now it's sitting around at home bored most of the time with nothing to do. I almost feel sorry for it!

Disk throughput - More data in the bigger file cache in RAM means the disks don't get accessed much at all now. We've basically halved the amount the old SAS disk pack is accessed. Once a file is read, it gets put in RAM and can be accessed again and again instantly. More RAM means more file data that can be cached this way. This caching doesn't make as dramatic a difference to writes, because everything needs to get written out to disk, even if you're caching it.

Disk utilisation per device - Wow! Well you can see here that IB was once in trouble. And now it isn't. We were hitting 100% of our disk's capability to service requests, back when we just had the SAS disk pack. That meant a slow site during backups and PHP session file cleanup (where it hits up to 100% on the graph). Now we're back to a steady 10% utilisation on the SAS pack, and almost none of the SSD disk's full potential is being used. SSD is just too fast for our little website to make so much as a blip on the graphs.

Theoretically this means our site could get something like 10 times busier before disk speed becomes an issue again. That means going from 14,000 members (not counting guests) accessing the site a day to 140,000. Not a problem we expect in the immediate future! So we now have lots and lots of room to grow.
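The arithmetic behind that estimate is simple enough to sketch. It assumes disk load grows in a straight line with the number of members visiting, which is a big simplification, and the function name is made up for illustration.

```python
def headroom_factor(utilisation_percent):
    """How many times busier a disk could get before reaching 100%."""
    return 100.0 / utilisation_percent

members_per_day = 14_000   # current daily members, not counting guests
sas_utilisation = 10       # percent, the steady figure on the SAS pack

factor = headroom_factor(sas_utilisation)
print(f"{factor:.0f}x headroom, roughly {int(members_per_day * factor)} members a day")
```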

IOstat – See “Disk IOs Per device / IOstat” above.

Load average - On a system with 8 cores, a rough rule-of-thumb is that the load measurement value should never go over 8. The load measurement takes into account CPU load but also how much time CPUs are spending waiting for slow disks. As you can see, Inkbunny was exceeding the limit of 8 many times during a day. When this happens, the site becomes noticeably slow. After the upgrades, we drop back to a very healthy load of 1 to 1.5 out of 8. That means that on average just one CPU core of our 8 is busy at any given time, rather than most or all of them! This again means we have lots more room to grow now, and a much faster, snappier site for users, especially during unexpected peak loads.
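If you want to try the rule-of-thumb on your own machine, here is a small sketch using Python's standard library. The helper names are made up for illustration, and `os.getloadavg()` only exists on Unix-like systems, so the live reading is guarded.

```python
import os

def load_per_core(load_average, cores):
    """Load average divided by core count; about 1.0 means fully busy."""
    return load_average / cores

def is_overloaded(load_average, cores):
    """Rough rule-of-thumb: trouble when the load exceeds the core count."""
    return load_average > cores

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    if hasattr(os, "getloadavg"):
        try:
            one_min, five_min, fifteen_min = os.getloadavg()
        except OSError:
            print("load average not available on this system")
        else:
            print(f"1-minute load {one_min:.2f} on {cores} cores, "
                  f"overloaded: {is_overloaded(one_min, cores)}")
```

With the numbers from the graphs, a load of 1.5 on 8 cores gives `load_per_core(1.5, 8)` of about 0.19, while anything over 8 trips `is_overloaded`.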

Memory usage - The biggest change is the massive growth in the file cache (the purple area). Now that we have 32GB of RAM, even with the database potentially eating up half of it, we have a heap of room to cache the commonly accessed files right in RAM. The benefits of this are seen in all the previous graphs!

PostgreSQL connections - We use a database connection pooler to help cut back on overhead per connection from the web server to the database. The pooler keeps a certain number of connections to the database open at all times to cope with fluctuations in requests. As you can see here, the pooler is now way less busy. The graph goes from crazy to calm. The database is answering requests so fast now that the pooler only needs to hold about 10 connections open (with about 3 active at any given time). Before the upgrade it would spike up to 95, and in the past we've seen it hit 200 (at which point the site explodes and dies). 10 is much better than 200!
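To show the general idea of pooling, here is a toy sketch. It is not the pooler we run, just an illustration: the class name is invented and `fake_connect` stands in for a real, expensive database connection. A fixed set of connections is opened once and handed out again and again.

```python
import queue

class TinyPool:
    """Toy connection pool: opens `size` connections up front and reuses them."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    def acquire(self, timeout=None):
        # Blocks until a connection is free (or raises queue.Empty on timeout).
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        # Hand the connection back so the next request can reuse it.
        self._idle.put(conn)

opened = []

def fake_connect():
    """Stand-in for a real database connect call; just records each open."""
    conn = object()
    opened.append(conn)
    return conn

pool = TinyPool(fake_connect, size=3)
for _ in range(100):          # 100 requests...
    conn = pool.acquire()
    pool.release(conn)

print(len(opened))            # ...but only 3 connections were ever opened
```

The faster each request finishes, the sooner its connection goes back in the pool, which is why a quicker database needs fewer open connections.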

Swap in/out - Swap is disk space set aside to hold the contents of RAM when it overflows, or when the OS thinks some contents of RAM aren't accessed frequently enough to keep on the faster RAM chips. Okay maybe an oversimplification but close enough! Long story short - we were using swap space on disk a lot. This was bad. Now we basically don't use it at all. This is good!
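A swap in/out rate can be worked out from two readings of the kernel's counters. On Linux, /proc/vmstat has `pswpin` and `pswpout` lines counting pages swapped in and out since boot. Here is a minimal sketch with invented sample values and made-up function names.

```python
def parse_vmstat(text):
    """Pick the swap page counters out of /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters

def swap_rates(before, after, seconds):
    """Pages per second swapped in and out between two samples."""
    return {name: (after[name] - before[name]) / seconds for name in before}

# Hypothetical samples taken 300 seconds apart.
sample_before = "pgfault 900000\npswpin 1000\npswpout 5000\n"
sample_after = "pgfault 950000\npswpin 1600\npswpout 5300\n"

rates = swap_rates(parse_vmstat(sample_before), parse_vmstat(sample_after), 300)
print(rates)
```

A healthy system shows rates at or near zero here, which is what "we basically don't use it at all" looks like in numbers.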

And thus ends the tour! I hope it is of some interest to those who are curious.

Thanks

Starling