Design objectives

BitMagic Library Design

Objectives

BitMagic library was created as a compressed inverted index and bitmap engine to speed up search system logical queries and certain database operations. Indeed all relational databases are based on principles proposed by Edgar F.Codd and Raymond F.Boyce. It is Relational Model. Relational Model in turn is based on Sets Theory which studies operations on collections of objects. Sets Theory operates on sets and defines operations as Unions, Intersections, Set Difference, etc. Manifestation of this basic principles can be seen in any, even simple logical expression used in SQL Queries (e.g., SELECT * WHERE (FieldA = 10 AND FieldB = 11) OR FieldC = 'abc')

Set theoretical operations can be implemented as Binary operations and this is where bitmaps, bit-vectors and compressed inverted lists suddenly become very useful, because they very often radical performance improvements. If result of any query is a bit-vector - all logical operations can be described as binary operations (AND, OR, NOT, XOR, SUB (AND NOT)).

The goal of BitMagic Library is to run such operations and run it fast.

Introduction to bm::bvector<>

In the backbone of BitMagic C++ Library - bit-vector container: bm::bvector<>.

Key characteristics:

  • Dynamic range of addressable space of 232-1 bits.
  • SIMD optimizatio. Bit-vectors love SIMD, BitMagic is designed to be compiled for specific Intel SIMD set (SSE2/SSE4.2/AVX2/more to come)
  • Implements several types of on the fly adaptive compression.
  • Implements random access to any bit in constant or near constant time
  • Implements all types of logical operations on elements of a set or the whole set-vector
  • Offers efficient memory management
  • Provide algorithms working with bit-vectors and integer sets
  • Provide iterator-enumerator for easy access to data, inline with common STL algorithms ()

BitMagic defines several stages, how bm::bvector<> can be used, depending on the required balance of performance, functionality and compression.

bm::bvector at Stage 1

Quick example:


int main(void)
{
    bm::bvector<>>   bv;
    cout << bv.count() << endl;

    // Set some bits.
    bv.set(10);
    bv.set(100);
    bv.set(1000000);  // <- set some bit far from the first two bits, blocks in between are not allocated

     cout << bv.count() << endl;
    return 0;
}
                            
What do we get here? Default mode, or Stage 1 offers the most basic bit-set implementation. Bits are bits, no lists of integer or block compression. Fastest access to elements of the set. It is almost std::vector<bool> only better.

This mode offers dynamic and automatic memory management. You just set a bit, anywhere within the default range of [0 - 232-1], memory allocation happens automatically. While automatic, memory allocation is not implemented as contiguous block of memory. If you set a few bits at a far distance (as illustrated in the example above) bvector takes care of some basic form of compression. One unit of memory allocation at this Stage 1 is 65535 bits or 8K bytes.

Stage 1 use is convenient when we want to fill in all the set values, need fast access, not too many vectors in RAM or expect vector(s) to be fairly random.

bm::bvector at Stage 2

This variant of of use assumes, that we need to apply on-the-fly RLE compression (we use vaiant we called Delta-GAP or DGAP).


{
    bm::bvector<>   bv(bm::BM_GAP);
}
                            

{
    bm::bvector<>   bv;
    bv.set_new_blocks_strat(bm::BM_GAP);  //  set DGAP compression mode ON
}
                            
This mode includes everything from Stage 1, plus DGAP. DGAP mode uses optimistic approach and creates blocks not as bitmaps but as arrays of integers, smaller than 8KB allocation units of Stage 1.

Allocation of GAP blocks happen in chunks (more about it later). Access to each set element in this mode is slower, but we use sorted arrays and optimized binary search, so Big-O is defined via logarithm number of GAP elements in a block (not the whole set!). There is some penalty, but it is upper bounded.

When we work in this mode we may expect, that BitMagic takes care of the optimal memory consumption automatically. And this is partially true. bvector allows us to set or clear any bits, adding more blocks or resizing blocks to fit new set values. But it only grows the size of blocks as needed and even convert blocks to Stage 1 (true bitmap) when enthropy grows beyond certain point. It does not shrink back or return memory automatically. While you modify your bit-vector by setting bits or doing logical operations with other vectors it tries to avoid potentially unnecessary compressions or de-allocations. The design goal here is predictable performance, at the expense of memory.

Bit-vector optimization

When we are done with the editing, logical operations and other forms of bit-vector content manipulations we obviously want to reclaim the memory. Method for memory optimization is bm::bvector<>::optimize()


{
    bm::bvector<&rt;   bv;
    
    ...  // bit-vector editing here
    
    
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb);         // memory optimization

By default, bm::bvector<>::optimize() will do maximum to free memory and convert bitmaps in DGAP form of optimal length. Existing DGAP blocks are optimized for best allocation (tries to fit into smaller block size). Please note that block allocation strategy is not changed. Optimization still not going to use variable block size or any other tricks, which may hurt performance of allocator.

In its default form bm::bvector<>::optimize() may still be considered kind-of slow. Plus conversion of bit-blocks into DGAPs may be undesirable. No problemo! You can use opt_free_0 or opt_free_01 modes, which will just free unused memory without attempts of compression.


{
    bm::bvector<&rt;   bv;
    
    ...  // bit-vector editing here
    
    
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb, bm::bvector<&rt;::opt_free_01); // just free some empty and full blocks

More information here.

All this we have done is not enough. Actually for any serious database or search or decision making system working on search spaces worth BitMagic - this is tragically insufficient. Does not matter, how much you compress - it will never fit in memory. Period. You have to send it to storage.

I/O and Serialization. Stage 3 compression.

Storage requires serialization into one single BLOB, the most optimal allocation and possibly more compression.

Serialized format is the most memory efficient way to store vectors, and it is also good for performance. Cost of I/O is high, it involves disk/network I/O which is a naturally slow subsystem and can be slower than slow on cloud micro-services hosted far away.

Example on serialization here.

So serialization is the most compact Stage 3 form of bit-vector. It uses various compressio encoding methods, Delta RLE compression, integer lists, Elias Gamma codes, differential codes, etc. At the end of the day it produces a portable BLOB, which can be stored or sent over network.

What you can or cannot do with serialized BLOB?

When bit-vector is Stage 3 - you obviously looses random access to elements. You cannot change a bit at random, you cannot check a bit.

Skipping the obvious that serialized BLOB can be part of I/O, we need to point that serialized BLOB can actually be part of logical oprations. BitMagic offers a special class bm::operation_deserializer<> which can take a bm::bvector<> plus a buffer and run a logical opration (OR by default) so the result gets deserialized into arg 1 (bvector).

Link.

The typicall use case for this type of operation is when we need only part of a BLOB, say bits from 10 to 2000. This use case is very common in all pagination algorithms, cursor retrieval, etc. The rest of a BLOB may not be needed to materialize at all, saving costs of memory copying, allocation, decompression. You can do that with operation deserialized. Use bm::bvector<>::set_range() method to set bits in the window of interest then run operation AND deserialization.

You can actually check a single bit that way, out of a BLOB. Operation deserializer supports set_COUNT_* operations. To check one single bit you need to set one single bit in the traget vector and run deserialization with parameter set_COUNT_AND. Operation deserializer returns target result population count (set cardinality). If it is one as in our case - we have our bit set.

So our assumption that we "cannot check a bit" in serialized BLOB is not correct. You can check a bit. It is just that you cannot check a bit in constant time. To check many bits, you still need to fully de-serialize the bvector and materialize its BLOBs for random access.

Sparse Vector and Compressed Sparse Vector

One of the latest additions to BitMagic was introduction of bm::sparse_vector<>, - container for integers which keeps elements "transposed" into a bit-matrix using bm::bvector<> Such transposition works well for cases, when dynamic range of elements is regularly lower than the elementary C type. For example if you data uses 17-bits out of 32-but range - at least on some partial interval - it would be a good fit. bm::sparse_vector<> also supports semantics of NULL values (unassigned) which makes it a good candidate for storing columnar data for database applications (requires some effort to build dictionaries of cause).

Some use cases for sparse_vector<>:

  • Storage of arays of information, signals, temporal data, asto data, mol-bio data, etc.
  • Store relationship information for 1:M (1:M or NULL) relationships, group theory algorithms, set to set transformations, unordered sets
  • Use of algorithms more efficient on bit-transposed data, raher than plain regular arrays

Sparse vector can also be serialized, BitMagic provides a growing list of algorithms adapted for this data representation format.

Conclusion

  • BitMagic implements operations with large sparse bitmaps. Small cases may be better served using std::vector<bool>.
  • bvector offers various levels of in-memory compression, but consistently favors speed over memory usage
  • To get maximum memory savings you need to serialize vector into BLOB. You can still use BLOB as an argument for set algebraic operations and binary distance operations, which describe a rather broad spectrum of what you can do with a bitmap vector.