Performance tests and benchmarks

Introduction

BitMagic Library implements various optimization techniques to deliver consistent performance in various programming environments. BitMagic library minly focuses on algorithms used in search systems and databases, but it was used in visualization algorithms and desktop tools as well.

We would like to summrize the performance optimization factors we find most important in production systems operating on a petabyte scale (or thousands of transactions per second).

Performance factors

General optimization of algorithms, loop unrolling, marked pointers, use of old but true optimization tricks created back when computer programming was an art. Contemporary compilers are very good, but well written code makes compilers excel.
Cache aware algorithms. Correct structuring of algorithms to use CPU on chip memory (CPU cache). This is one of the most efficient optimization techniques allows to minimize access to memory.
Optimization of memory allocations. Again, contemporary runtime libraries are wonderfully great, but they benefit from algorithms using aligned, structured allocation patterns. Practical productions systems often suffer severe performance penalties from allocating millions of small objects and reliance on off the shelf memory management. BitMagic library allows to customize its work even for most complex situations, when other parts of the system are not exactly good neighbors and cause heap fragmentation and degradations due to allocation penalties.
SIMD and 64-bit optimizations. BitMagic automatically adapts to 64-bit compilation and can be compiled to use SSE2, SSE4.2 and AVX2 instructions. Effectiveness of SSE4.2 varies depending on situation, but we often see performance improvement of up to 30% on SSE4.2 optimized algorithms due to enabled hardware population count. AVX2 optimizations (available from 3.10.0) give even better performance.
Flexible compression for storage and network transfer. Extra compression for storage always pays off, because I/O and network are very often a bottleneck. From version to version we are trying to make serialization more compact to save performance in I/O chains.

TO BE CONTINUED.