Under the supervision of lead sysadmin Adam Maulis and director David Ritter of ELTE IIG, I requested access to the supercomputer of the National Information Infrastructure Development Institute of Hungary (NIIFI) in order to run a series of scalability tests. The project request, submitted under the title Scalability of shared memory systems, measured through a specific problem: bzip2 compression and decompression, was approved in October 2009 by Peter Stefan, Ph.D., lead of the NIIFI Department of Application Development and Operations.
The experiment aims to determine how compression and decompression throughput varies as a function of the number of concurrent compressor (or decompressor) worker threads. The highest thread count tested was 103.
The tests were run on the regina node, a Sun Fire E25K with 72 dual-core UltraSPARC-IV+ processors at 1.8 GHz. Each CPU has 32 MB of dedicated cache, and the machine has 288 GB of physical RAM.
The operating system (as reported by "uname -p -s -r -v") is SunOS 5.9 Generic_122300-13 sparc. The test application was compiled with the platform C compiler, Sun C 5.9 SunOS_sparc Patch 124867-02 2007/11/27. This platform is certified for the UNIX98 brand.
The Sun Grid Engine v5.3 schedules jobs over the supercomputer's nodes. All lbzip2 test jobs were submitted specifically to the regina node, because it performs best.
My multi-threaded bzip2 utility, lbzip2, was chosen for this purpose. At the start of the project, I compiled the then most recent version, 0.17, with "Makefile.portable", which passes the -O flag to the compiler, enabling an unspecified level of optimization. In light of the previous paragraph, my original decision to code lbzip2 exclusively against the Single UNIX® Specification, Version 2, proved very fortunate: I didn't have to change anything. (SUSv2 is the first SUS version to include threads.)
On this platform, a 32-bit binary is built by default (ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required), but by reordering the preferred UNIX98 programming environments in "lfs.sh", a 64-bit binary could be built as well (ELF 64-bit MSB executable SPARCV9 Version 1).
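The UNIX98 programming environments and their required flags can be queried with getconf, as SUSv2 specifies. The loop below is only an illustrative sketch of this mechanism, not the actual logic of "lfs.sh":

```shell
# Sketch: list which SUSv2 (UNIX98) programming environments the system
# supports, and the compiler/linker flags each one requires. "lfs.sh"
# picks from this set; preferring e.g. XBS5_LP64_OFF64 over the ILP32
# environments yields a 64-bit binary.
for pe in XBS5_ILP32_OFF32 XBS5_ILP32_OFFBIG XBS5_LP64_OFF64 XBS5_LPBIG_OFFBIG
do
  if [ "$(getconf _$pe 2>/dev/null)" = 1 ]; then
    printf '%s supported\n' "$pe"
    printf '  CFLAGS=%s LDFLAGS=%s\n' \
      "$(getconf ${pe}_CFLAGS)" "$(getconf ${pe}_LDFLAGS)"
  fi
done
```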
lbzip2 utilizes the POSIX Threads API and the low-level interface of Julian Seward's bzip2 library; I found version 1.0.4 of the latter preinstalled on regina.
(The charts can be magnified by right-clicking them and viewing them separately, or by resizing your browser window. The gnumeric worksheet, version 1.17, can be downloaded here.)
Throughput denotes the number of bytes consumed from the (plain text or compressed) input file per unit time.
Charts:
- Cumulative Throughput [MB/s] / Per-Worker Throughput [MB/s]
- 32-bit compression / 64-bit compression
- 32-bit decompression / 64-bit decompression
- VMem Peak for Whole Test Job [MB] / VMem Peak Averaged Per Worker [MB]
lbzip2 scales almost linearly. I'm especially content with the scaling of the multiple-workers decompressor, which distributes the scanning for the bit-aligned bzip2 block boundary bit-strings over the worker threads, so that the splitter can remain input-bound. As expected, the per-thread efficiency deteriorates slowly as the number of threads increases.
The sharp drop in decompressor performance from one worker thread to two worker threads is explained by the fact that lbzip2 provides a dedicated, single-worker decompressor algorithm.
The 32-bit binary performs better during compression, while the 64-bit binary performs better during decompression. I suspect the 32-bit bzip2 library is faster in both modes; however, lbzip2's multiple-workers decompressor does a lot of 64-bit shifting, and this is likely so much slower in the 32-bit binary that its decompression advantage evaporates. The 32-bit binary runs out of address space with more than 90 threads during decompression.
Bzip2 compressor utilities are very CPU-cache sensitive. Considering how each worker thread required no more than 7,600,000 bytes for the actual compression / decompression, plus some additional memory for bookkeeping, the "hot" sets fit easily into regina's huge CPU caches: each CPU has 32 MB of dedicated cache and runs two worker threads, so roughly 2 × 7.6 MB ≈ 15.2 MB of hot data shares each 32 MB cache. Judging from the "VMem Peak Averaged Per Worker" chart, which also accounts for buffered input/output blocks, there may have been very little cache contention.
In this respect, the experiment failed to work as a true shared memory test, because the cache coherence implementation of the gigantic platform effectively provided message passing under the hood (lbzip2 mostly employs the producers-consumers pattern). At least up to the tested number of worker threads, lbzip2 was unable to stress the scalability limits of the underlying platform (which is otherwise quite pleasant for lbzip2 users on such machines).
$Id: scaling.html,v 1.17 2010/03/29 21:36:49 lacos Exp $
All content is Copyright © 2009, 2010 Laszlo Ersek. CC-BY-SA v3.0