BitMagic Library Design

Objectives

BitMagic library was created as a compressed inverted index and bitmap engine to speed up search system logical queries and certain database operations. Indeed all relational databases are based on principles proposed by Edgar F.Codd and Raymond F.Boyce. It is Relational Model. Relational Model in turn is based on Sets Theory which studies operations on collections of objects. Sets Theory operates on sets and defines operations as Unions, Intersections, Set Difference, etc. Manifestation of this basic principles can be seen in any, even simple logical expression used in SQL Queries (e.g., SELECT * WHERE (FieldA = 10 AND FieldB = 11) OR FieldC = 'abc')

Set theoretical operations can be implemented as Binary operations and this is where bitmaps, bit-vectors and compressed inverted lists suddenly become very useful, because they very often radical performance improvements. If result of any query is a bit-vector - all logical operations can be described as binary operations (AND, OR, NOT, XOR, AND NOT).

The goal of BitMagic Library is to run such operations and run it fast.

Introduction to bm::bvector<>

In the hart of BitMagic C++ Library - bit-vector container: bm::bvector<>.

Key characteristics:

  • Dynamic range of addressable space of 232-1 bits.
  • Implements several types of on the fly adaptive compression.
  • Implements random access to any bit in constant or near constant time
  • Implements all types of logical operations on elements of a set or the whole set-vector
  • Offers efficient memory management

BitMagic defines several stages, how bvector can be used, depending on the required balance of performance, functionality and compression.

bm::bvector at Stage 1

Quick example:


int main(void)
{
    bm::bvector<>>   bv;
    cout << bv.count() << endl;

    // Set some bits.
    bv.set(10);
    bv.set(100);
    bv.set(1000000);  // <- set some bit far from the first two bits, blocks in between are not allocated

     cout << bv.count() << endl;
    return 0;
}
                            
What do we get here? Default mode, or Stage 1 offers the most basic bit-set implementation. Bits are bits, no lists of integer or block compression. Fastest access to elements of the set. It is almost std::vector<bool> only better.

This mode offers dynamic and automatic memory management. You just set a bit, anywhere within the default range of [0 - 232-1], memory allocation happens automatically. While automatic, memory allocation is not implemented as contiguous block of memory. If you set a few bits at a far distance (as illustrated in the example above) bvector takes care of some basic form of compression. One unit of memory allocation at this Stage 1 is 65535 bits or 8K bytes.

Stage 1 use is convenient when we want to fill in all the set values, need fast access, not too many vectors in RAM or expect vector(s) to be fairly random.

bm::bvector at Stage 2

This variant of of use assumes, that we need to apply on-the-fly RLE compression (we use vaiant we called Delta-GAP or DGAP).


{
    bm::bvector<>   bv(bm::BM_GAP);
}
                            

{
    bm::bvector<>   bv;
    bv.set_new_blocks_strat(bm::BM_GAP);  //  set DGAP compression mode ON
}
                            
This mode includes everything from Stage 1, plus DGAP. DGAP mode uses optimistic approach and creates blocks not as bitmaps but as arrays of integers, smaller than 8KB allocation units of Stage 1.

Allocation of GAP blocks happen in chunks (more about it later). Access to each set element in this mode is slower, but we use sorted arrays and optimized binary search, so Big-O is defined via logarithm number of GAP elements in a block (not the whole set!). There is some penalty, but it is upper bounded.

When we work in this mode we may expect, that BitMagic takes care of the optimal memory consumption automatically. And this is partially true. bvector allows us to set or clear any bits, adding more blocks or resizing blocks to fit new set values. But it only grows the size of blocks as needed and even convert blocks to Stage 1 (true bitmap) when enthropy grows beyond certain point. It does not shrink back or return memory automatically. While you modify your bit-vector by setting bits or doing logical operations with other vectors it tries to avoid potentially unnecessary compressions or de-allocations. The design goal here is predictable performance, at the expense of memory.

Bit-vector optimization

When we are done with the editing, logical operations and other forms of bit-vector content manipulations we obviously want to reclaim the memory. Method for memory optimization is bm::bvector<>::optimize()


{
    bm::bvector<&rt;   bv;
    
    ...  // bit-vector editing here
    
    
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb);         // memory optimization

By default, bm::bvector<>::optimize() will do maximum to free memory and convert bitmaps in DGAP form of optimal length. Existing DGAP blocks are optimized for best allocation (tries to fit into smaller block size). Please note that block allocation strategy is not changed. Optimization still not going to use variable block size or any other tricks, which may hurt performance of allocator.

In its default form bm::bvector<>::optimize() may still be considered kind-of slow. Plus conversion of bit-blocks into DGAPs may be undesirable. No problemo! You can use opt_free_0 or opt_free_01 modes, which will just free unused memory without attempts of compression.


{
    bm::bvector<&rt;   bv;
    
    ...  // bit-vector editing here
    
    
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb, bm::bvector<&rt;::opt_free_01); // just free some empty and full blocks

More memory savings?

You can call method bm::bvector<&rt;::optimize_gap_size() - it will produce DGAP blocks sizes to save space.

What if you deal with bit-vector of lower magnitude than full 232 bit space? By default bvector keeps the full top level description table. It can be adjusted upon construction.


{
    bm::bvector<&rt;   bv(bm::BM_GAP,    // turn ON DGAP compression by default
                             bm::gap_len_table::_len, // use default block size table
                             128000 // reset the maximum length to save some RAM (be careful not to override the boundaries)
                             );
    
    ... // edit the bvector here
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb);         // memory optimization
    bv.optimize_gap_size();  // maximum memory optimization

Once we've done all that we get maximum memory savings without loosing any functionality. We can still edit the vector, we can get random access to element we can do logical operations, we can enumerate it.

More information here.

All this we have done is not enough. Actually for any serious database or search or decision making system working on search spaces worth BitMagic - this is tragically insufficient. Does not matter, how much you compress - it will never fit in memory. Period. You have to send it to storage.

I/O and Serialization. Stage 3 compression.

However well we compress bitmaps and lists in memory we would need to store it. Storage requires serialization into one single BLOB, the most optimal allocation and possibly more compression.

Serialized format is the most memory efficient way to store vectors, and it is also done for performance. Cost of I/O is very high, because it involve disk I/O which is a slow subsystem comparing to RAM or even network I/O which is even slower or cloud micro-services (may be far away).

Example on serialization here.

So serialization is the most compact Stage 3 form of bit-vector. It uses various compressio encoding methods, Delta RLE compression, integer lists, Elias Gamma codes, differential codes, etc. At the end of the day it produces a portable BLOB, which can be stored or sent over network.

What you can or cannot do with serialized BLOB?

When bit-vector is Stage 3 - you obviously loos random access to elements. You cannot change a bit, you cannot check a bit.

Skipping the obvious that serialized BLOB can be part of I/O, we need to point that serialized BLOB can actually be part of logical oprations. BitMagic offers a special class bm::operation_deserializer<> which can take a bm::bvector<> plus a buffer and run a logical opration (OR by default) so the result gets deserialized into arg 1 (bvector).

Link.

The typicall use case for this type of operation is when we need only part of a BLOB, say bits from 10 to 2000. This use case is very common in all pagination algorithms, cursor retrieval, etc. The rest of a BLOB may not be needed to materialize at all, saving costs of memory copying, allocation, decompression. You can do that with operation deserialized. Use bm::bvector<>::set_range() method to set bits in the window of interest then run operation AND deserialization.

You can actually check a single bit that way, out of a BLOB. Operation deserializer supports set_COUNT_* operations. To check one single bit you need to set one single bit in the traget vector and run deserialization with parameter set_COUNT_AND. Operation deserializer returns target result population count (set cardinality). If it is one as in our case - we have our bit set.

So our assumption that we "cannot check a bit" in serialized BLOB is not correct. You can check a bit. It is just that you cannot check a bit in constant time. To check many bits, you still need to fully de-serialize the bvector and materialize its BLOBs for random access.

Conclusion

  • BitMagic implements operations with large sparse bitmaps. Small cases may be better served using std::vector<bool>.
  • bvector offers various levels of in-memory compression, but consistently favors speed over memory usage
  • To get maximum memory savings you need to serialize vector into BLOB. You can still use BLOB as an argument for set algebraic operations and binary distance operations, which describe a rather broad spectrum of what you can do with a bitmap vector.