Version 5.3.0

Nov 28, 2019

Release Notes

New method to find first mismatch between two bit-vectors. bm::bvector<>::find_first_mismatch(..) It was possible to use XOR operation to identify all mismatches, new method is works faster for cases when only first mismatch is important. New method is optimized for for SSE4.2 and AVX2. If mismatch not found it returns false which is an indication that two vectors are identical.
New method to check bit-vectors for equality. bool bm::bvector<>::equal(const bvector& bvect) const
bm::operation_deserializer<> class simplified deserialization method signatures not to require to pass an external scratch memory (temp block). Algebra of Sets sample simplified to reflect this change bvsetalgebra
Fixed compilation defects in C language mappings.

Application notes on mismatch search

Mismatch search is needed as part of use cases related to fingerprint scan algorithms where first mismatch position is needed to direct further search logic. Another case is comparison of sparse vectors where minimal mismatch between a set of pairs of vectors is a subject of evaluation-comparison to do lexicographical comparison (can be used in sort algorithms and Burrow-Wheeler Transform).

Quick example


bm::bvector<>   bv1 { 100, 200, 256 };
bm::bvector<>   bv2 { 100, 222, 256 };
bm::bvector<>::size_type pos;
bool f = bv2.find_first_mismatch(bv1, pos);
if (f)
{
    cout << "search mismatch position = " << pos << endl; // 200
}

Please note that operation is symmetrical: bv2 against bv1 or bv1 against bv2 give you the same result.

SIMD optimizations


inline
bool avx2_bit_find_first_diff(const __m256i* BMRESTRICT block1,
                                const __m256i* BMRESTRICT block2,
                                unsigned* pos)
{
    unsigned BM_ALIGN32 simd_buf[8] BM_ALIGN32ATTR;

    const __m256i* block1_end =
    (const __m256i*)((bm::word_t*)(block1) + bm::set_block_size);
    __m256i maskZ = _mm256_setzero_si256();
    __m256i mA, mB;
    unsigned simd_lane = 0;
    do
    {
        mA = _mm256_xor_si256(_mm256_load_si256(block1), _mm256_load_si256(block2));
        mB = _mm256_xor_si256(_mm256_load_si256(block1+1), _mm256_load_si256(block2+1));
        __m256i mOR = _mm256_or_si256(mA, mB);
        if (!_mm256_testz_si256(mOR, mOR)) // test 2x256 lanes
        {
            if (!_mm256_testz_si256(mA, mA))
            {
                // invert to fing (w != 0)
                unsigned mask = ~_mm256_movemask_epi8(_mm256_cmpeq_epi32(mA, maskZ));
                BM_ASSERT(mask);
                int bsf = bm::bsf_asm32(mask); // find first !=0 (could use lzcnt())
                _mm256_store_si256 ((__m256i*)simd_buf, mA);
                unsigned widx = bsf >> 2; // (bsf / 4);
                unsigned w = simd_buf[widx];
                bsf = bm::bsf_asm32(w); // find first bit != 0
                *pos = (simd_lane * 256) + (widx * 32) + bsf;
                return true;
            }
            // invert to fing (w != 0)
            unsigned mask = ~_mm256_movemask_epi8(_mm256_cmpeq_epi32(mB, maskZ));
            BM_ASSERT(mask);
            int bsf = bm::bsf_asm32(mask); // find first !=0 (could use lzcnt())
            _mm256_store_si256 ((__m256i*)simd_buf, mB);
            unsigned widx = bsf >> 2; // (bsf / 4);
            unsigned w = simd_buf[widx];
            bsf = bm::bsf_asm32(w); // find first bit != 0
            *pos = ((++simd_lane) * 256) + (widx * 32) + bsf;
            return true;
        }
        simd_lane+=2;
        block1+=2; block2+=2;

    } while (block1 < block1_end);
    return false;
}

Current approach to mismatch search is to advance forward using 2x256-bit AVX lanes which are XOR products used for comparison. Ones a long 512-bit mismatch found, it narrows it down with _mm256_testz_si256() - _mm256_cmpeq_epi32() - _mm256_movemask_epi8() - BSF.

The proposed approach makes assumption, that reading two AVX lanes is almost as cheap as one (2 load ports) and input streams expose significant similarity so 512-bit SIMD scan makes sense. This bitscan computation is quite trivial so our performance here should be limited by the memory bandwidth. When mismatch found current code has to unload the vector back to memory and do BSF detection with a scalar instruction.

SSE4.2 version is also available, uses the same vectorization approach, the only difference is that it scans two 128-bit SSE lanes at once.

Code examples

bvsample02