Version 6.0.0

Jan 20, 2020

Release Notes

  1. This is a major version release; it changes the serialization format for bit-vectors. The new serialization format uses a more compact variable-byte encoding for internal technical information. The new release is backward compatible: old BLOBs do not need to be re-encoded, but old programs need to be recompiled to read newly serialized BLOBs. Compression rate improvements should be within 1%, but for large collections of bit-vectors and sparse vectors this makes a positive difference.
  2. New serialization format now supports bookmarks for efficient range deserialization.

    Bookmarks increase the serialization size, so they are not enabled by default. If you are not using range deserialization you do not need bookmarks.

    serializer<BV>::set_bookmarks(bool enable, unsigned bm_interval);

    If you use range deserialization (and want it to work faster) you need to set up bookmarking to allow the deserializer to rewind through the BLOB when a range deserialization request is processed.

    The bm_interval parameter defines how often bookmarks are added between blocks of encoding. The block size is 64K elements; bm_interval = 64 means the serializer will inject a skip reference every 64 blocks. Increasing this number keeps the BLOB smaller at the expense of rewind precision. This affects performance and size, not the result of the deserialization.

    Range deserialization works even without bookmarks.
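    As a sketch, the bookmark setup for a plain bit-vector serializer could look like the following (the set bits and the interval value of 64 are illustrative, not prescriptive):

    ```cpp
    #include <cassert>

    #include "bm.h"
    #include "bmserial.h"

    typedef bm::bvector<> bvector_type;

    // serialize bv into sbuf with bookmarks enabled
    static void serialize_with_bookmarks(const bvector_type& bv,
                                         bm::serializer<bvector_type>::buffer& sbuf)
    {
        bm::serializer<bvector_type> bvs;
        bvs.set_bookmarks(true, 64); // inject a skip reference every 64 blocks
        bvs.serialize(bv, sbuf);
    }

    int main()
    {
        bvector_type bv;
        bv.set(10); bv.set(100000); bv.set(500000); // sparse test set

        bm::serializer<bvector_type>::buffer sbuf;
        serialize_with_bookmarks(bv, sbuf);

        // bookmarks affect only BLOB size and rewind speed,
        // not the deserialized result
        bvector_type bv2;
        bm::deserialize(bv2, sbuf.buf());
        assert(bv.compare(bv2) == 0);
        return 0;
    }
    ```

    A full deserialization, as shown here, simply ignores the bookmarks; only range deserialization uses them to skip encoded blocks.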

  3. Bookmarks and range deserialization work for all bit-transposed sparse vectors. Code snippet to illustrate setting bookmarks:
    
    // serialize sparse vector
    bm::sparse_vector_serial_layout<svector_u32> sv_lay;
    {
        BM_DECLARE_TEMP_BLOCK(tb)
        sv1.optimize(tb); // optimize memory layout before serialization
    
        bm::sparse_vector_serializer<svector_u32> sv_serializer;
        sv_serializer.set_bookmarks(true, 64); // bookmark every 64 blocks
    
        sv_serializer.serialize(sv1, sv_lay);
    }
    
    const unsigned char* buf = sv_lay.buf();
    
    // deserialize only the [1..4] range of the vector
    bm::sparse_vector_deserializer<svector_u32> sv_deserial;
    sv_deserial.deserialize_range(sv2, buf, 1, 4);
    
    See svsample08.cpp for the full example.
  4. Build environment fixes for CMake; pthreads enabled. Credits to Joshua Marshall (jrmarsha at mtu.edu).
  5. Bit-transposed sparse vector: added a copy_range() method for efficient range slicing, str_sparse_vector<>::copy_range(...)
  6. Bit-transposed rank-select compressed succinct vector: added a copy_range() method for range slicing, rsc_sparse_vector<>::copy_range(...)
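    A minimal sketch of how the new copy_range() slicing might be used on a str_sparse_vector (the element values and the 32-octet template parameter here are illustrative):

    ```cpp
    #include <cassert>
    #include <cstring>

    #include "bm.h"
    #include "bmstrsparsevec.h"

    typedef bm::str_sparse_vector<char, bm::bvector<>, 32> str_sv_type;

    int main()
    {
        str_sv_type sv1;
        sv1.set(0, "one");
        sv1.set(1, "two");
        sv1.set(2, "three");
        sv1.set(3, "four");

        // slice elements [1..2] of sv1 into sv2;
        // positions outside the range remain unassigned
        str_sv_type sv2;
        sv2.copy_range(sv1, 1, 2);

        char strbuf[16];
        sv2.get(1, strbuf, sizeof(strbuf)); // read back a sliced element
        assert(std::strcmp(strbuf, "two") == 0);
        return 0;
    }
    ```

    rsc_sparse_vector<>::copy_range(...) follows the same calling pattern for the rank-select compressed container.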
  7. Fixed a serialization performance regression caused by unnecessary initialization of temporary buffers (introduced in v5.0.0)
  8. Fixed a minor corner-case crash when copying empty bit-vectors

  9. Download v 6.0.0

    Stories and use cases

    What is the motivation for, and the benefit of, a major serialization format upgrade?

    Classic information retrieval systems, which store compressed inverted lists, rarely need range deserialization. Range deserialization is essentially a paging algorithm, and search usually does not know upfront which ranges of objects/documents will contain the results, so vectors are deserialized as a whole to run logical operations. Paging may appear later, when displaying the result set.

    The BitMagic library implements bit-transposed containers, where the bit-vector is used as a workhorse for bit-slicing. Succinct containers give developers a tool for succinct representation of unsorted vectors of integers, dictionaries of identifiers, and vectors with NULL (unassigned) values using Rank-Select (NULL compression). All of these are heavyweight containers that can be efficiently serialized as collections of bit-vectors using Elias Gamma or Binary Interpolative Encoding. Binary Interpolative Encoding offers superior compression rates, but at the expense of higher CPU cost. For data-intensive (read: bandwidth-limited) applications a high compute cost is often a great trade-off, because compute resources are far ahead of bandwidth at all levels of compute architectures. Trading CPU for bandwidth makes total sense.

    Keeping a data model (or data frame) as serialized (and compressed) sparse vectors is an enabling technology only if it offers an efficient way to access compressed data pages (or ranges).

    There are two domains where this is absolutely important: columnar (vector-based) databases, where paged access is a design pattern, and bioinformatics, where data can be laid out using coordinates on the genome as vector address indexes, so a program can request data within a specified genomic range (say, from 100K to 200K on chr1). Bioinformatics is an extreme case of data-intensive applications, where compression helps to improve performance at all levels of the architecture, from algorithms to services.

    One specific example is WebAssembly. Browser-deployed WebAssembly defines its sandboxing environment as 2GB of RAM (often less), which is significant memory pressure for data-intensive applications. Fitting a data model into WebAssembly begs for compression. On the CPU side WebAssembly offers “near native performance”; in reality it can be several times slower than SIMD-enabled native code. This is not a criticism of WebAssembly, as it is great compared to the alternatives. WebAssembly also offers threads to use many-core CPUs, but browsers often disable them because of security concerns related to the Spectre/Meltdown vulnerabilities and the SharedArrayBuffer attack vector. (Hopefully this will be addressed soon and threads re-enabled.)

    Testing BitMagic in single-threaded WebAssembly without SIMD assist showed the need to implement a bookmark (or some other indexing) mechanism within the serialized format, to skip series of encoded blocks without paying the price of unnecessary decode. The use of bookmarks makes BitMagic range deserialization of containers 5x faster on chromosome-sized models, even without threads or SIMD.
