GreenReaper

More cores, less speed? The perils of over-threading.

As the images we host get larger, I spend a lot of time figuring out how to keep things snappy.

Our main server has two CPUs, each with six physical cores… presented as 12 due to hyper-threading, for a total of 24 "virtual" cores. We're paying for this hardware, so we'd like to make the best use of it.

One of our biggest CPU hogs is the image-processing library ImageMagick - and fortunately it supports OpenMP, a framework for spreading work across multiple cores. Since it sees 24 cores, it tries to use all of them.

Perhaps unsurprisingly, this is non-optimal without tuning; but what I did find surprising was how it failed. Here's what happened when I benchmarked four iterations of one of our more expensive operations - resizing a 9MB, 3507×2480 PNG to 920×651 - using from 1 to 24 threads:

[Threads]: Iterations/sec - User time (s) - Real time (s)

[1]:          0.109ips         36.660u      36.660
[2]:          0.185ips         37.600u      21.680
[3]:          0.232ips         38.730u      17.210
[4]:          0.278ips         40.210u      14.380
[5]:          0.324ips         39.110u      12.350
[6]:          0.319ips         46.740u      12.540
[7]:          0.337ips         49.720u      11.870
[8]:          0.339ips         54.610u      11.810
[9]:          0.365ips         51.310u      10.970
[10]:        0.368ips         55.600u      10.880
[11]:        0.364ips         59.310u      10.990
[12]:        0.369ips         62.980u      10.850
[13]:        0.400ips         60.860u      10.010
[14]:        0.405ips         61.460u      9.870
[15]:        0.386ips         70.550u      10.350
[16]:        0.399ips         72.270u      10.020
[17]:        0.421ips         67.180u      9.500
[18]:        0.412ips         70.660u      9.720
[19]:        0.412ips         73.790u      9.700
[20]:        0.430ips         71.860u      9.300
[21]:        0.241ips         49.040u      16.620
[22]:        0.226ips         50.660u      17.680
[23]:        0.227ips         49.950u      17.600
[24]:        0.209ips         52.340u      19.170

[You can think of user time as measuring "CPU effort"; real time is "how long did it take?"]
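
For the curious, numbers like these come straight out of ImageMagick's built-in -bench option; a loop along these lines reproduces the shape of the test, though the file names below are placeholders rather than our exact command:

# Sketch of the benchmark loop - source.png and out.png are placeholder names.
# MAGICK_THREAD_LIMIT caps the number of OpenMP threads ImageMagick may use;
# -bench 4 repeats the operation four times and reports iterations/sec, user and real time.
for threads in $(seq 1 24); do
    echo "Threads: $threads"
    MAGICK_THREAD_LIMIT=$threads convert -bench 4 source.png -resize 920x651 out.png
done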

When ImageMagick is allowed 21 or more threads, it seems to fall back to using only around three. This is decidedly non-optimal. It gets worse: for smaller images, real time starts to increase at around 11 threads, or even 7. And the fallback carries more overhead than genuinely using three threads; compare the user time for [22] (50.660u) with [3] (38.730u).

This could be because, at those image sizes, it's trying to run 11×2=22 or 7×3=21 threads; or it could be that memory/cache access is being split across the two processors. Or it could be something completely different.

As a result, I'm limiting it to six threads on our main server, and four on our secondary - in each case, the number of physical cores on one of its CPUs. This consistently gives a 2-2.5x speedup over one thread. In some cases, it's non-optimal in terms of wall-time, but only a little - and it leaves the other cores free.
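
There's more than one way to apply that cap; any of these does the job (the file names here are just examples):

# Per-invocation, via an environment variable ImageMagick checks at startup:
MAGICK_THREAD_LIMIT=6 convert in.png -resize 920x651 out.png

# Or per-command, via the resource-limit option:
convert -limit thread 6 in.png -resize 920x651 out.png

# The same "thread" resource limit can also be set globally in ImageMagick's policy.xml.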

Going past that burns up excessive CPU cycles as overhead. If nothing else, this heats the system up, reducing its lifetime… it also means the cores can't be used to process web pages or database queries. This'll be more important once we deploy PostgreSQL 9.6, supporting parallel execution within a query.

tl;dr Upload processing of large images should be snappier now, whilst using less CPU time. Yay!
Added: 8 years, 2 months ago
 
fluffdance
8 years, 2 months ago
I miss the good ol' days, when the speed of the CPU determined the performance of the system, and good programming (or good tuning) could turn a low-end system into a functional powerhouse.  Back in '02, I had a Win2000-based media server running on a Pentium 100DX with a whopping 12MB of RAM that I came across while dumpster diving.  It wasn't quick, but it was usable and functional.

You'd think that it would be up to the operating system to sort out thread prioritization, and that there would be some sort of logic involved in what that prioritization would actually be, but I guess that's not the case.  With many cores, comes many responsibilities~
GreenReaper
8 years, 2 months ago
Linux does a good job, most of the time. But it's not possible to optimize for every condition, across every combination of application and system configuration, especially considering the server is also processing a varying number of database queries, images, and web pages at the same time.

I'm not sure that's even the problem here, though - there seems to be some kind of gating going on inside OpenMP (the multithreading library ImageMagick uses). For all I know, it's trying to prevent even worse performance which might be caused if it did try to run that many threads at once.

I do enjoy optimization challenges, like getting 60FPS H.265 decoding working on my old netbook and laptop recently with a custom HEVC library. You can push old hardware surprisingly far nowadays.
Part of that has come through increasing use of parallelism, fixed-function hardware, and instructions such as SSE, AES and AVX… because making general-purpose instructions faster hits a thermal wall. We'll be talking a bit more about that in a forthcoming journal on Inkbunny about the latest release.
Jay1743
8 years, 2 months ago
Hey, something in my wheelhouse!

My first thought was on-processor caching. But if that were the case, I'd expect to see a significant performance drop going from 12 to 13 cores (or before that, depending on the layout decisions made by OpenMP's scheduler). That doesn't appear to be the case here. OpenMP might be making bad placement decisions, but I haven't really used it enough to comment on that.

If you want to go a little deeper, strace and gprof are the tools to use.
GreenReaper
8 years, 2 months ago
Honestly, it was bizarre - it gave the appearance of hard-limiting it to 200% CPU when the thread count was unlimited by policy. I'm not sure how sophisticated IM's use of OpenMP is - they seem to use a lot of "split this up into static chunks of four loop iterations" directives.

I might look deeper if it gets to the point where we're running out of things to optimize, but my next task in that area will probably be using a single pipeline with parenthesized side processing and +clone 0 to generate all thumbnails at once, rather than one at a time from the same source file.
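
Roughly this shape of command, with the output names and sizes made up for illustration:

# Decode the source once, then branch off each thumbnail inside parentheses:
# -clone 0 copies the original into the branch, -write saves the result,
# and +delete drops the copy before the next branch.
convert source.png \
    \( -clone 0 -resize 920x651 -write large.png  +delete \) \
    \( -clone 0 -resize 300x212 -write medium.png +delete \) \
    -resize 120x85 small.png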

I foresee a big speedup with that, because image decode proved to be a significant factor for pngcrush - in the end we combined a CloudFlare fork (using CLMUL instructions) with the SSE2-optimized Paeth filtering present in recent versions of libpng; it sped the whole thing up by ~30%.
Jay1743
8 years, 2 months ago
On a lark, I decided to try it on my desktop using a 5MB test image. The results were not encouraging:

Performance[1]: 40i 0.207ips 1.000e 619.960u 3:12.840
Performance[2]: 40i 0.150ips 0.419e 840.850u 4:27.380
Performance[3]: 40i 0.136ips 0.396e 902.820u 4:54.480
Performance[4]: 40i 0.122ips 0.370e 948.380u 5:28.620
GreenReaper
8 years, 2 months ago
Yeah… I imagine it's dependent in part on what you're doing, as well as the underlying architecture. Inkbunny runs on a pair of Sandy Bridge EN server CPUs with almost twice the cache of their desktop equivalents; it helps a lot for this kind of thing.

In fairness to IM's developers, they warn that it should be tested and tuned for your application and deployment. I can't see many end-users going to the trouble, though.
imer
8 years, 2 months ago
looks like doing the resizing asynchronously with a single thread would be the most effective use of the cpu
as is, have a bunch of workers just dedicated to doing image scaling, and poll every now and then in the browser to see if it was done
writing the image details out should take the user longer than scaling the image anyways
GreenReaper
8 years, 2 months ago
It'd be a lot faster if we did a simple scale; but to preserve quality, we resize with Lanczos resampling plus an unsharp operation, after a gamma adjustment so the work happens in a linear colourspace.
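
Something along these lines - the exact ordering and parameter values here are illustrative guesses, not our production settings:

# Move to (approximately) linear light, resize with Lanczos, sharpen,
# then restore display gamma. Geometry and unsharp values are placeholders.
convert source.png \
    -gamma 0.4545 \
    -filter Lanczos -resize 920x651 \
    -unsharp 0x0.75 \
    -gamma 2.2 \
    thumbnail.png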

As you say, writing images can take a while. I think there's a significant amount of time taken up with repeated image decoding, too, as noted above. It remains to be seen whether creating a pipeline with side channels will automatically perform those resize operations within threads of their own, but cutting decode operations would be a start.

Single-threaded resizing might be the most efficient use of the CPU if we were at full load (although that discounts the potential benefit of clearing the cache faster so there's room for other data/processes). However, the user experience is also important - getting that 36 seconds down to 12 makes a big difference, because they're less likely to drift off to something else - so parallelizing a single operation matters. Ideally, it'd be closer to 2 sec; it often is, but that gets tricky if it's a 10MB+ original. :-D

We're not CPU-limited in terms of throughput - on average, we use maybe 3.5 of our 24 virtual cores (mostly for database operations, partly for page rendering, and partly for image processing).