Toolkit for data mining
BitMagic implements compressed bit-vector containers and tools for set algebra. It supports the full range of logical operations
(AND, OR, XOR, NOT, MINUS) used for building inverted list indexes and accelerating queries in relational databases, full-text search
systems, geospatial search systems, chemical substructure indexing, etc. The BitMagic library offers multiple inverted list compression
options that balance memory consumption, speed and utility for fast in-memory operations, storage and network transfer.
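To make the set-algebra idea concrete, here is a minimal sketch of the logical operations on plain, uncompressed bit-vectors. It uses only the C++ standard library and is not BitMagic's API: the names BitWords, SetOp, combine, from_ids and to_ids are hypothetical and exist only for this illustration. Each posting list becomes a bit-vector in which bit i is set when document i matches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Plain (uncompressed) bit-vector: bit i lives in word i / 64.
// Illustrative type only; this is not BitMagic's container.
using BitWords = std::vector<std::uint64_t>;

enum class SetOp { And, Or, Xor, Minus };

// Combine two equally sized bit-vectors word by word.
BitWords combine(const BitWords& a, const BitWords& b, SetOp op) {
    assert(a.size() == b.size());
    BitWords out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        switch (op) {
            case SetOp::And:   out[i] = a[i] & b[i];  break;
            case SetOp::Or:    out[i] = a[i] | b[i];  break;
            case SetOp::Xor:   out[i] = a[i] ^ b[i];  break;
            case SetOp::Minus: out[i] = a[i] & ~b[i]; break; // a AND NOT b
        }
    }
    return out;
}

// Build a bit-vector of n_bits bits from a posting list of document ids.
BitWords from_ids(const std::vector<std::uint32_t>& ids, std::size_t n_bits) {
    BitWords bv((n_bits + 63) / 64, 0);
    for (std::uint32_t id : ids) {
        assert(id < n_bits);
        bv[id / 64] |= std::uint64_t{1} << (id % 64);
    }
    return bv;
}

// Convert a bit-vector back to a sorted list of ids.
std::vector<std::uint32_t> to_ids(const BitWords& bv) {
    std::vector<std::uint32_t> ids;
    for (std::size_t w = 0; w < bv.size(); ++w)
        for (unsigned b = 0; b < 64; ++b)
            if ((bv[w] >> b) & 1u)
                ids.push_back(static_cast<std::uint32_t>(w * 64 + b));
    return ids;
}
```

For example, combining from_ids({1, 5, 70, 100}, 128) with from_ids({5, 70, 99}, 128) under SetOp::And yields the ids 5 and 70, while SetOp::Minus yields 1 and 100. A compressed container changes how the words are stored, not the meaning of these operations.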
Scientific computing
BitMagic implements algorithms for high-performance scientific computing: binary similarity and clustering (Hamming, Tanimoto, etc.),
binary fingerprinting and multi-dimensional binary analysis, randomization of binary sets, lexicographical comparison of bit-vectors.
All of these are used in ranking and prediction algorithms in large-scale retrieval systems.
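As an illustration of the binary similarity measures named above, the following sketch computes the Hamming distance (the number of differing bits, i.e. the population count of A XOR B) and the Tanimoto coefficient (|A AND B| / |A OR B|) for two fingerprints. It relies only on the C++ standard library; Fingerprint, popcount64, hamming_distance and tanimoto are hypothetical names, not BitMagic functions, and returning 1.0 for two empty fingerprints is a convention chosen for this example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A binary fingerprint stored as 64-bit words (illustrative type only).
using Fingerprint = std::vector<std::uint64_t>;

// Number of set bits in one word.
std::size_t popcount64(std::uint64_t w) {
    return std::bitset<64>(w).count();
}

// Hamming distance: count of positions where the two fingerprints differ.
std::size_t hamming_distance(const Fingerprint& a, const Fingerprint& b) {
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += popcount64(a[i] ^ b[i]);
    return d;
}

// Tanimoto coefficient: |A AND B| / |A OR B|.
// Two empty fingerprints are treated as identical (1.0) by convention here.
double tanimoto(const Fingerprint& a, const Fingerprint& b) {
    assert(a.size() == b.size());
    std::size_t both = 0, either = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        both   += popcount64(a[i] & b[i]);
        either += popcount64(a[i] | b[i]);
    }
    return either == 0 ? 1.0
                       : static_cast<double>(both) / static_cast<double>(either);
}
```

For the one-word fingerprints 0xB (binary 1011) and 0x6 (binary 0110), the Hamming distance is 3 and the Tanimoto coefficient is 1/4.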
Performance
BitMagic algorithms and containers offer cross-platform yet performance-optimized code.
Maximum performance is achieved with 256-bit AVX2 builds (SSE2 and SSE4.2 are also supported).
BitMagic uses CPU cache-friendly algorithms, cache blocking, memory alignment, prefetch
and other bandwidth optimization techniques. It also uses memory pools and allocation
optimizations that help in large-scale systems where many actors and
threads actively use the heap.
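The memory pool idea can be sketched as follows. This is a deliberately simple, single-threaded illustration under our own assumptions, not BitMagic's allocator: BlockPool and its members are hypothetical names. Released blocks are cached and handed out again, so repeated acquire/release cycles stop touching the heap.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Pool of fixed-size blocks of 64-bit words (illustrative only).
// Not thread-safe: a real pool would need locking or per-thread caches.
class BlockPool {
public:
    using Block = std::unique_ptr<std::uint64_t[]>;

    explicit BlockPool(std::size_t words_per_block) : words_(words_per_block) {}

    // Hand out a zeroed block, reusing a cached one when available.
    Block acquire() {
        if (free_.empty()) {
            ++heap_allocations_;
            return Block(new std::uint64_t[words_]()); // value-initialized to 0
        }
        Block block = std::move(free_.back());
        free_.pop_back();
        std::fill(block.get(), block.get() + words_, std::uint64_t{0});
        return block;
    }

    // Return a block to the pool instead of freeing it.
    void release(Block block) {
        assert(block != nullptr);
        free_.push_back(std::move(block));
    }

    // How many times the pool had to go to the heap.
    std::size_t heap_allocations() const { return heap_allocations_; }

private:
    std::size_t words_;
    std::size_t heap_allocations_ = 0;
    std::vector<Block> free_;
};
```

Acquiring a block, releasing it and acquiring again results in a single heap allocation; the second acquire returns the recycled block, cleared to zero.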
Licensing
BitMagic is a free open source library. You can use this software in any commercial or non-commercial project, free of charge.
The only requirement is that you explicitly mention this project in any derivative product, its website, published materials,
articles or any other work derived from this project or based on our code or know-how.
Powerful Tool for Big Data Problems and Logical Inference
The BitMagic library helps you develop high-throughput intelligent search systems.
It combines hardware optimizations with
on-the-fly compression to fit inverted indexes and binary fingerprints into memory and to minimize disk and network footprint.
Functions
- compressed bit-vector container with random-access methods,
a range of set-algebraic functions, rank, find and traversal methods,
and STL-style iterators
- set-algebraic operations: AND, OR, XOR, MINUS for bit-vectors and integer sets.
Interoperable with low-level C arrays and STL-compatible containers (via iterators).
- serialization/hibernation of containers into compressed BLOBs for database persistence
or in-memory compression
- memory management focused on reducing (or avoiding) allocations and deallocations,
minimizing heap fragmentation, and custom allocators.
- set algebraic operations on compressed bit-vector BLOBs
- statistical engine to efficiently construct binary similarity and distance
metrics (Tversky, Hamming, Tanimoto, Dice or your own)
- containers for sparse vectors and collections for native integer types.
These work through bit-transposition and compression of each separate bit-plane.
They support NULL semantics and can be used for memory-compressed vector/columnar
search systems with a focus on memory efficiency.
- algorithms on sparse vectors: dynamic range clipping (work in progress!)
- functional operations on integer sets (theory of groups): translations between sets,
mathematical images (work in progress!).
- binary compressed matrices for ER-operations, one-to-many and many-to-many
relationships, materialized RDBMS joins, graphs, etc.
(work in progress!)
- portable C-library layer as a bridge to Python, Java, .Net (work in progress!)
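The bit-transposition behind the sparse vector containers in the list above can be sketched as follows. This is an uncompressed illustration under our own assumptions, not BitMagic's implementation: BitPlanes, transpose, get and used_planes are hypothetical names. A vector of 32-bit integers is split into 32 bit-planes, where plane p holds bit p of every element. Planes that contain no set bits are left empty, which is where small-valued data already saves memory before any per-plane compression is applied.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit-transposed form of a vector of 32-bit integers (illustrative only).
// planes[p] holds bit p of every element; an empty plane means "all zeros".
struct BitPlanes {
    std::size_t size = 0;
    std::array<std::vector<std::uint64_t>, 32> planes;
};

// Split each value into its bits and scatter them across the planes.
BitPlanes transpose(const std::vector<std::uint32_t>& values) {
    BitPlanes bp;
    bp.size = values.size();
    const std::size_t words = (values.size() + 63) / 64;
    for (std::size_t i = 0; i < values.size(); ++i) {
        for (unsigned p = 0; p < 32; ++p) {
            if ((values[i] >> p) & 1u) {
                auto& plane = bp.planes[p];
                if (plane.empty()) plane.assign(words, 0); // allocate on first set bit
                plane[i / 64] |= std::uint64_t{1} << (i % 64);
            }
        }
    }
    return bp;
}

// Random access: gather bit i from every non-empty plane.
std::uint32_t get(const BitPlanes& bp, std::size_t i) {
    assert(i < bp.size);
    std::uint32_t v = 0;
    for (unsigned p = 0; p < 32; ++p) {
        const auto& plane = bp.planes[p];
        if (!plane.empty() && ((plane[i / 64] >> (i % 64)) & 1u))
            v |= std::uint32_t{1} << p;
    }
    return v;
}

// Number of planes that actually needed storage.
std::size_t used_planes(const BitPlanes& bp) {
    std::size_t n = 0;
    for (const auto& plane : bp.planes)
        if (!plane.empty()) ++n;
    return n;
}
```

For the values {3, 0, 5, 1, 7} only the three lowest planes are allocated, and get returns each original value unchanged.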
C and C++
The BitMagic C++ template library offers STL-friendly containers and iterators, portable yet heavily optimized at a low level.
Our templates are header-only, designed for easy integration into large projects. We provide a lean (no RTTI, no STL, no exceptions)
mapping to the C language (JNI bindings for Java and Scala are a work in progress).
Storage and communications
Efficient serialization algorithms for saving containers. Serialization tools are provided for all containers; you can use
them with embedded database systems (like Berkeley DB), large-scale RDBMS systems (Oracle, MS SQL, MySQL) or NoSQL stores (memcached).
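To show what serializing a container into a compact BLOB can look like, here is a minimal sketch that stores a sorted list of set-bit positions as delta-encoded variable-length integers. This is a generic technique chosen for illustration and not BitMagic's actual serialization format; Blob, put_varint, serialize_ids and deserialize_ids are hypothetical names. Because the format is byte-oriented, the resulting BLOB does not depend on the byte order of the machine that wrote it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Blob = std::vector<std::uint8_t>;

// Append v as a variable-length integer: 7 payload bits per byte,
// high bit set on every byte except the last.
void put_varint(Blob& out, std::uint32_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(v | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Store the element count, then the gap between consecutive sorted ids.
Blob serialize_ids(const std::vector<std::uint32_t>& sorted_ids) {
    Blob out;
    put_varint(out, static_cast<std::uint32_t>(sorted_ids.size()));
    std::uint32_t prev = 0;
    for (std::uint32_t id : sorted_ids) {
        assert(id >= prev); // input must be sorted
        put_varint(out, id - prev);
        prev = id;
    }
    return out;
}

// Inverse of serialize_ids: read the count, then accumulate the gaps.
std::vector<std::uint32_t> deserialize_ids(const Blob& blob) {
    std::size_t pos = 0;
    auto get_varint = [&]() {
        std::uint32_t v = 0;
        unsigned shift = 0;
        for (;;) {
            assert(pos < blob.size() && shift <= 28);
            const std::uint8_t byte = blob[pos++];
            v |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
            if (!(byte & 0x80)) break;
            shift += 7;
        }
        return v;
    };
    const std::uint32_t n = get_varint();
    std::vector<std::uint32_t> ids;
    ids.reserve(n);
    std::uint32_t prev = 0;
    for (std::uint32_t i = 0; i < n; ++i) {
        prev += get_varint();
        ids.push_back(prev);
    }
    return ids;
}
```

The ids {3, 4, 10, 1000, 1000000} occupy 20 bytes as raw 32-bit integers and 9 bytes in this encoding. The BLOB can be stored in a database column or key-value store and decoded back with deserialize_ids.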
Cross-platform
Bit-vectors can be serialized and sent over the network for cross-platform data exchange and streaming,
and used to build network middleware and micro-services.
Know-how
The mission of our project is to share tools, expertise, use cases and know-how
on search systems, bit-vectors, inverted lists, compression techniques, libraries, programming language bindings, etc.
Getting started
The BitMagic C++ library implements an easy, header-only programming model.
Public code repository
BitMagic Library is hosted on GitHub and SourceForge.
Use cases
Use cases and design patterns for various applications of compressed bit-vectors.
Design principles
Articles about design and performance optimizations.