# BitMagic Library Design

## Objectives

BitMagic library was created as a compressed inverted index and bitmap engine to speed up logical queries in search systems and database operations. All relational databases are based on the Relational Model proposed by Edgar F. Codd and Raymond F. Boyce. The Relational Model, in turn, is based on the Algebra of Sets, which studies operations on collections of objects: unions, intersections, set differences, and negations. These basic principles underpin SQL and information retrieval systems (e.g., SELECT * WHERE (FieldA = 10 AND FieldB = 11) OR FieldC = 'abc').

Another area where bitmaps and sets are used is scientific computing and artificial intelligence. Binary fingerprints, clustering, and classification-based systems all need efficient set operations.

The goal of the BitMagic Library is to implement such operations for C++, provide a wrapper for the C language, and bridge it to other popular programming languages: Python and Java (via JNI), enabling big data science and enterprise computing.

BitMagic implements a versatile API for set operations, covering various use cases and formulas: pairwise mutable or immutable operations, operations on groups, operations between bit-vectors and serialized BLOBs, operations with arrays of integers, and more. For more information, please read the tutorial for Algebra of Sets.

## Introduction to bm::bvector<>

The backbone of the BitMagic C++ Library is its bit-vector container: bm::bvector<>.

### Key characteristics:

- Dynamic range of addressable space of 2^{32}-1 bits
- SIMD optimization. Bit-vectors love SIMD, and BitMagic is designed to be compiled for a specific Intel SIMD set (SSE2/SSE4.2/AVX2/more to come)
- Implements several types of on-the-fly adaptive compression
- Implements random access to any bit in constant or near-constant time
- Implements all types of logical operations on elements of a set or the whole set-vector
- Offers efficient memory management
- Provides algorithms working with bit-vectors and integer sets
- Provides an iterator-enumerator for easy access to data, in line with common STL algorithms

BitMagic defines several stages of how `bm::bvector<>` can be used, depending on the required balance of performance, functionality, and compression.

### bm::bvector at Stage 1

Quick example:

```
#include <iostream>
#include "bm.h"

using namespace std;

int main(void)
{
    bm::bvector<> bv;
    cout << bv.count() << endl;

    // Set some bits.
    bv.set(10);
    bv.set(100);
    bv.set(1000000); // <- set a bit far from the first two; blocks in between are not allocated

    cout << bv.count() << endl;
    return 0;
}
```

What do we get here? The default mode, or Stage 1, offers the basic sparse bit-set implementation.
Bits are bits: no lists of integers, no block compression. This is the fastest access to elements of the set.
It is almost `std::vector<bool>`, only better, because it uses block-based memory management.
The design was inspired by the simplicity of STL `set<>` and `map<>`.
Those classes require no size management: the programmer simply adds elements to the set, and all resizes happen
on the fly. BitMagic offers a similar level of convenience.

Memory management is dynamic and **automatic**.
You just set a bit anywhere within the default range of [0 - 2^{32}-1], and memory allocation happens automatically.
While automatic, memory allocation is not implemented as a contiguous block of memory. If you set a few bits at a far distance
(as illustrated in the example above), `bvector` takes care of some basic form of compression.
One unit of memory allocation at Stage 1 is 65535 bits, or 8K bytes.

An important point here is that when a random bit is set, memory is NOT allocated all at once from 0 to the requested value [0...idx]. If you set a single bit at the end of the 32-bit space, bvector allocates only one block plus some necessary control structures to find the block (we use a fixed-depth compression tree). It means you do not have to do object remappings or other forms of space compaction to actually save space.

Stage 1 use is convenient when we want to fill in all the set values, need fast access, keep relatively few vectors in RAM, or expect the vector(s) to be fairly random.

### bm::bvector at Stage 2

This variant of use assumes that we need to apply on-the-fly RLE compression (we use a variant we call Delta-GAP, or DGAP).

```
{
bm::bvector<> bv(bm::BM_GAP);
}
```

```
{
bm::bvector<> bv;
bv.set_new_blocks_strat(bm::BM_GAP); // set DGAP compression mode ON
}
```

This mode includes everything from Stage 1, plus DGAP. DGAP mode uses an optimistic approach and creates blocks not as bitmaps
but as arrays of integers, smaller than the 8KB allocation units of Stage 1.
Allocation of GAP blocks happens in chunks (more about this later). Access to each set element in this mode is slower, but we use sorted arrays and optimized binary search, so the Big-O is defined by the logarithm of the number of GAP elements in a block (not the whole set!). There is some penalty, but it is upper-bounded.

When we work in this mode, we may expect that BitMagic takes care of optimal memory consumption automatically.
And this is partially true. `bvector` allows us to set or clear any bits, adding more blocks or resizing blocks to fit new set values.
But it only grows the size of blocks as needed, and even converts blocks back to Stage 1 (true bitmap) when entropy grows beyond a certain point.
It does not shrink back or return memory automatically. While you modify your bit-vector by setting bits or doing logical operations with
other vectors, it tries to avoid potentially unnecessary compressions or de-allocations. The design goal here is predictable performance,
at the expense of memory.

### Bit-vector optimization

When we are done with the editing, logical operations, and other forms of bit-vector content manipulation,
we obviously want to reclaim the memory. The method for memory optimization is `bm::bvector<>::optimize()`:

```
{
    bm::bvector<> bv;
    // ... bit-vector editing here
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb); // memory optimization
}
```

By default, `bm::bvector<>::optimize()` will do its best to free memory and convert bitmaps into DGAP form of optimal length.
Existing DGAP blocks are optimized for best allocation (it tries to fit them into a smaller block size). Please note that the block allocation strategy
is not changed. Optimization is still not going to use variable block sizes or any other tricks which may hurt the performance of the allocator.

In its default form, `bm::bvector<>::optimize()` may still be considered kind-of slow. Plus, conversion of bit-blocks
into DGAPs may be undesirable. No problemo! You can use the `opt_free_0` or `opt_free_01` modes, which will just free
unused memory without any attempt at compression.

```
{
    bm::bvector<> bv;
    // ... bit-vector editing here
    BM_DECLARE_TEMP_BLOCK(tb)
    bv.optimize(tb, bm::bvector<>::opt_free_01); // just free some empty and full blocks
}
```

More information here.

All this is still not enough. For any serious database, search, or decision-making system working on search spaces worthy of BitMagic, this is tragically insufficient. No matter how much you compress, it will never fit in memory. Period. You have to send it to storage.

## I/O and Serialization. Stage 3 compression.

Storage requires serialization into one single BLOB, the most optimal allocation and possibly more compression.

The serialized format is the most memory-efficient way to store vectors, and it is also good for performance. The cost of I/O is high: it involves disk/network I/O, which is a naturally slow subsystem and can be slower than slow on cloud micro-services hosted far away.

Example on serialization here.

So serialization is the most compact, Stage 3 form of a bit-vector. It uses various compression encoding methods: Delta RLE compression, integer lists, Elias Gamma codes, differential codes, etc. At the end of the day, it produces a portable BLOB, which can be stored or sent over a network.

#### What can you do (and not do) with a serialized BLOB?

When a bit-vector is in Stage 3, you obviously lose random access to its elements. You cannot change a bit at random, and you cannot check a bit.

Skipping the obvious fact that a serialized BLOB can be part of I/O, we need to point out that a serialized BLOB can actually take part in
logical operations. BitMagic offers a special class, `bm::operation_deserializer<>`, which can take a `bm::bvector<>` plus a buffer and run a logical operation (OR by default), so the result gets deserialized into argument 1 (the bvector).

Link.

The typical use case for this type of operation is when we need only part of a BLOB, say bits from 10 to 2000. This use case is very common in all
pagination algorithms, cursor retrieval, etc. The rest of the BLOB may not need to be materialized at all, saving the costs of memory copying,
allocation, and decompression. You can do that with the operation deserializer. Use the `bm::bvector<>::set_range()` method to set bits in the
window of interest, then run an AND operation deserialization.

You can actually check a single bit that way, out of a BLOB. The operation deserializer supports set_COUNT_* operations.
To check one single bit, you need to set one single bit in the target vector and run deserialization with the parameter **set_COUNT_AND**.
The operation deserializer returns the target result population count (set cardinality). If it is one, as in our case, we have our bit set.

So our assumption that we "cannot check a bit" in a serialized BLOB is not correct. You can check a bit. It is just that you cannot check it in constant time. To check many bits, you still need to fully de-serialize the BLOB and materialize the bvector for random access.

## Sparse Vector and Compressed Sparse Vector

One of the latest additions to BitMagic is bm::sparse_vector<>, a container for integers which keeps its elements "transposed" into a bit-matrix of bm::bvector<> planes. Such transposition works well when the dynamic range of elements is regularly lower than that of the elementary C type. For example, if your data uses 17 bits out of a 32-bit range, at least on some partial interval, it would be a good fit. bm::sparse_vector<> also supports the semantics of NULL (unassigned) values, which makes it a good candidate for storing columnar data in database applications (this requires some effort to build dictionaries, of course).

### Some use cases for sparse_vector<>:

- Storage of arrays of information: signals, temporal data, astro data, mol-bio data, etc.
- Storing relationship information for 1:M (1:M or NULL) relationships, group theory algorithms, set-to-set transformations, unordered sets
- Use of algorithms that are more efficient on bit-transposed data than on plain regular arrays

Sparse vectors can also be serialized, and BitMagic provides a growing list of algorithms adapted for this data representation format.

## Conclusion

`bm::bvector<>` gives you the convenience of `std::vector<bool>` with automatic, block-based memory management, on-the-fly DGAP compression, and a compact serialized form for storage and I/O. Pick the stage that matches your required balance of performance, functionality, and compression.