dGap compression of inverted lists

Anatoliy Kuznetsov. 2002.


In many cases, bit blocks will frequently have a non-random bit distribution pattern. Here's an example:

	0001000111001111

Patterns such as these can be represented in different ways. One of the most popular is a list of integers, each holding the position of a 1 bit. For example:
	{ 3, 7, 8, 9, 12, 13, 14, 15 }
This is a list of bit indexes stored as an ascending sequence of integers.

Another common way of representing an ascending sequence is D-Gap coding, which stores the differences between neighboring elements of the list. A variant of this D-Gap compression is used in the BitMagic library.

What is Delta Gap Compression?

In D-Gap compression, the very first integer in the sequence is always 0 or 1, and it works as a flag indicating the value of the starting bit. In the example above the first bit is 0. The integers following the flag are the lengths of the consecutive blocks (gaps) of equal bits. The sum of all elements of the sequence, excluding the starting flag, gives the total length of the block. Essentially, this coding model can be treated as a specialized variant of Run-Length Encoding (RLE). The D-Gap form of the previous example is:

	{[0], 3, 1, 3, 3, 2, 4}
This translates to "three zeroes, one one, three zeroes, three ones, two zeroes, and four ones", or
	000 1 000 111 00 1111
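The encoding above can be produced mechanically. Here is a small sketch (hypothetical helper name and notation, not BitMagic's actual API) that converts a string of bits into the `{flag, gap lengths...}` form:

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the BitMagic API): encode a bit
// string into D-Gap form {flag, gap lengths...}.
std::vector<int> dgap_encode(const std::string& bits)
{
    std::vector<int> out;
    if (bits.empty()) return out;
    out.push_back(bits[0] - '0');   // leading flag: value of the first bit
    int run = 1;
    for (size_t i = 1; i < bits.size(); ++i) {
        if (bits[i] == bits[i - 1]) {
            ++run;                  // same bit value: the gap keeps growing
        } else {
            out.push_back(run);     // bit value flipped: close the gap
            run = 1;
        }
    }
    out.push_back(run);             // close the last gap
    return out;
}
```

Feeding `"0001000111001111"` into this sketch yields `{0, 3, 1, 3, 3, 2, 4}`, matching the sequence above.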

What are the advantages of this method?

One of the advantages of the D-Gap method is that we can implement all logical operations without decompression of the block.

For example, let's take the logical operation NOT. This is the simplest case: we only need to flip the leading flag to the opposite value. All gap lengths are interpreted relative to that flag and require no changes at all. The result is a very efficient, lightning-fast inversion.
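In the sketch notation used here (a `{flag, gap lengths...}` vector, which is an assumption, not BitMagic's internal layout), inversion is essentially a one-line operation:

```cpp
#include <vector>

// Logical NOT on a D-Gap block: flip the leading flag; every gap
// length keeps its meaning relative to the new starting bit.
std::vector<int> dgap_not(std::vector<int> g)
{
    if (!g.empty())
        g[0] ^= 1;                  // 0 -> 1 or 1 -> 0
    return g;
}
```

For instance, `{[0], 3, 1, 3, 3, 2, 4}` becomes `{[1], 3, 1, 3, 3, 2, 4}`, i.e. `111 0 111 000 11 0000`.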

The implementation of logical AND is trickier and involves more manipulation of the integers. The algorithm iteratively streams through the two sequences, combining integers from both lists and comparing the lengths of the gaps. The result is a new D-Gap coded block, generally of a different length. Performance of this method varies greatly with the data. If the gaps are large and the blocks have a clear gap structure, the algorithm can demonstrate spectacular results. It can even be faster than iterating over and combining all the words of a non-compressed block.
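One possible shape of this merge, again as a sketch in the hypothetical `{flag, gap lengths...}` notation (assumed helper; both inputs are assumed non-empty and to cover the same number of bits):

```cpp
#include <algorithm>
#include <vector>

// Logical AND of two D-Gap blocks without decompression: walk both
// run lists in parallel, advancing to the nearest gap border and
// emitting runs of the combined bit value.
std::vector<int> dgap_and(const std::vector<int>& a, const std::vector<int>& b)
{
    size_t ia = 1, ib = 1;          // current run index in each input
    int ra = a[1], rb = b[1];       // bits left in each current run
    int va = a[0], vb = b[0];       // current bit value of each input
    std::vector<int> out{va & vb};  // leading flag of the result
    int run_val = va & vb, run_len = 0;
    while (ia < a.size() && ib < b.size()) {
        int step = std::min(ra, rb);            // distance to nearest border
        int v = va & vb;
        if (v == run_val) {
            run_len += step;                    // combined value unchanged
        } else {
            out.push_back(run_len);             // emit the finished run
            run_val = v;
            run_len = step;
        }
        ra -= step;
        rb -= step;
        if (ra == 0 && ++ia < a.size()) { ra = a[ia]; va ^= 1; }
        if (rb == 0 && ++ib < b.size()) { rb = b[ib]; vb ^= 1; }
    }
    out.push_back(run_len);
    return out;
}
```

ANDing the example block with an all-ones block `{1, 16}` returns the block unchanged, and ANDing it with its own inversion collapses the result to a single 16-bit gap of zeroes, `{0, 16}`. In both cases only a handful of integers are touched.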

Logical OR operations are implemented as a set of equivalent NOT AND operations. In other words,

	(X or Y)
is the same as
	not(not(X) and not(Y))
Our implementation of the NOT operation is very fast, which results in almost zero overhead.
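The identity itself (De Morgan's law) is easy to sanity-check on plain machine words before trusting it at the D-Gap level:

```cpp
#include <cstdint>

// De Morgan's law, the identity behind the OR implementation:
// x | y  ==  ~(~x & ~y)  for any bit patterns x and y.
bool or_via_not_and(std::uint16_t x, std::uint16_t y)
{
    std::uint16_t lhs = x | y;
    std::uint16_t rhs = static_cast<std::uint16_t>(
        ~(static_cast<std::uint16_t>(~x) & static_cast<std::uint16_t>(~y)));
    return lhs == rhs;
}
```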

What are the disadvantages?

The disadvantage of this method is that reading an arbitrary bit forces us to iterate through the sequence. On average we need to scan half of the block to reach the bit we want. With a plain bit vector we can always pinpoint the bit using bitwise shifts, which is the simplest and fastest method.

For stream-like operations, such as reading consecutive bits, this is not an issue, since we can always remember our current position in the vector.

Setting and clearing bits can also be an issue. Sometimes we only need to increment or decrement one or two integers, but if the changed bit falls in the middle of a gap, we have to split that gap, inserting new elements and moving a block of memory before continuing. This can be a problem for an application that actively modifies bits.

However, the BitMagic library can be adapted to handle this case. If the D-Gap compression mode is inefficient, it can be switched off, and the software will continue to work with a plain bit vector without a performance penalty. If the application doesn't need many random-access operations, we can optimize the bit vectors by converting some of the bit blocks to D-Gap blocks.

Optimized D-Gap coding

As we have seen, the major disadvantage of the method is that we need to iterate through the sequence to find a bit value. This means bit access time is proportional to the number of elements, i.e. it is O(N). If N is small we are safe, but as it grows the price of the coding tends to become high.

Let's review our example:

	{[0], 3, 1, 3, 3, 2, 4}
Isn't it the same as:
	{[0], 2, 3, 6, 9, 11, 15}
What have we done? Simply an equivalent transformation of our initial sequence into a new one that accumulates all the gap lengths from left to right. Each gap is now encoded as the coordinate of its right border. Our formula: Border(N) = Border(N-1) + Length(GAP(N)). (Reminiscent of Fibonacci numbers, isn't it?)

The most important implication is that our sequence is now sorted, so we can abandon linear search in favor of the far more efficient binary search. Binary search gives us O(log(N)) bit access time, where N is the actual array length (which cannot exceed the fixed GAP block length).
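On the border-coded form, the lookup becomes a binary search for the first border not smaller than the requested position (sketch with assumed helper names; `std::lower_bound` does the search):

```cpp
#include <algorithm>
#include <vector>

// Read bit `pos` from a border-coded block {flag, gap borders...}:
// binary-search the first border >= pos; that gap contains the bit.
int dgap_get_bit(const std::vector<int>& g, int pos)
{
    auto it = std::lower_bound(g.begin() + 1, g.end(), pos);
    if (it == g.end())
        return -1;                  // pos is past the end of the block
    int gap_index = static_cast<int>(it - (g.begin() + 1));
    return g[0] ^ (gap_index & 1);  // odd-numbered gaps invert the flag
}
```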

Is it worth it?

You bet! However, overall performance of the D-Gap method does depend on the data we're trying to compress.


Fig. 1

The diagram illustrates a hypothetical distribution of bits along the bit vector. In areas where bit density grows above a certain limit, the D-Gap representation becomes applicable. The same happens when our vector becomes sparse. And sometimes areas in between expose a non-random gap structure and also become candidates for compression.

To use this method effectively, D-Gap coded blocks need to stay short enough to avoid performance degradation. As random bits are set and cleared, the blocks will gradually become fragmented and the sequence effectively grows longer. Therefore, after a certain threshold is reached we should not keep the block in D-Gap form. The BitMagic library detects this case and automatically converts the block back into a plain bit block.