SSE2 vs CUDA in parallel bit stream transposition / similarity. Part 1.

Anatoliy Kuznetsov, Igor Tolstoy. Aug 16, 2009.

Introduction

One of the hottest trends today is the adoption of parallel programming techniques for GPGPU. Today's GPUs have grown into powerful parallel machines claiming teraflop capabilities at a commodity price. It is proven that graphics cards are very capable at floating point tasks, and there are attempts to use the GPU for high-bandwidth integer and logical computations applicable to databases and data mining. We decided to experiment with CUDA on a commodity nVidia GeForce 9500 GT card and create a mini application doing a lot of integer bitwise arithmetic. For comparison we took a commodity Core2 Quad 2.4GHz system and an SSE2-optimized algorithm.

Bit Transposition and Similarity Algorithm

The goal of this algorithm is to take a block of 32-bit integers and do bit slicing, transposing the array into an equivalent matrix representation where the N-th row of the matrix corresponds to the N-th bit of the input words. For an input block of N integers, each row of the transposition matrix consists of N/32 32-bit words.

[Figure: Transposition (taken from [1])]

After the transposition is accomplished we decided to compute the self-similarity of the data. We need to:

  • For each row of the matrix, compute the number of 1 bits (population count)
  • For each pair of rows, compute the Hamming distance: the population count of the XOR product of the two bit rows
  • The result of the similarity stage is a right (upper) triangular 32x32 distance matrix, where diagonal elements represent the population count of row N and elements U[i,j] are the Hamming distance between rows i and j (a minimal scalar illustration follows this list)
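
    The Hamming distance of two bit rows reduces to the population count of their XOR. A minimal scalar illustration (hamming32 is an assumed helper written for this article, not BitMagic code):

    inline unsigned hamming32(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;        // differing bits
        unsigned cnt = 0;
        for (; x; x &= x - 1)      // Kernighan's trick: clear the lowest set bit
            ++cnt;
        return cnt;
    }
    // e.g. hamming32(0xF0F0F0F0, 0xFF00FF00) == 16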

    [Figure: Distance matrix (from the Wikipedia article on triangular matrices)]

    Why do we think parallel bit streams are important?

    Parallel bit streams are an equivalent, alternative representation of a bit-vector. This transformation allows the same bit-stream operations (AND, NOT, OR, XOR and SUB), plus it allows random access: any 32-bit word can be reassembled in constant time with a controlled, predictable CPU penalty (see the sketch below). At the same time there is plenty of data not using the full 32-bit capacity, so a lightweight compression algorithm becomes possible if we know that some of the bit slices are always zero (or one) or sufficiently similar to each other. We think this data representation has a lot of potential for data-mining applications manipulating sparse vectors of all sorts.
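
    As a sketch of the constant-time random access (gather_word is an assumed helper matching the layout produced by bit_block_transpose below; it is not part of the BitMagic API), reassembling the i-th input word costs one shift/mask/or per bit plane:

    inline unsigned gather_word(
           const unsigned tmatrix[32][64],  // 32 bit planes x 64 words per plane
           unsigned i)                      // word index in the original block, 0..2047
    {
        unsigned col = i >> 5;              // i / 32 : column inside each plane
        unsigned k   = i & 31;              // i % 32 : bit position in that column
        unsigned w = 0;
        for (unsigned j = 0; j < 32; ++j)   // bit j of the result comes from plane j
            w |= ((tmatrix[j][col] >> k) & 1) << j;
        return w;
    }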

    Another example of runtime compression is the compression of limited-alphabet streams. The bit-slicing algorithm can easily compress a 5-letter DNA stream ('A', 'T', 'G', 'C' plus 'N') at 3 bits per letter, whereas other realtime compression schemes with random access often use 4 bits per letter (2 characters per byte -- easy byte alignment).
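
    A minimal sketch of such 3-bit DNA slicing (the letter codes and helper names below are our assumptions for illustration, not a BitMagic feature): three bit planes hold 32 letters each, and any position can be decoded with three shifts and masks.

    inline unsigned dna_code(char c)       // assumed codes: A=0, T=1, G=2, C=3, N=4
    {
        switch (c)
        {
        case 'A': return 0; case 'T': return 1;
        case 'G': return 2; case 'C': return 3;
        default : return 4;                // 'N' (and anything unexpected)
        }
    }

    // encode up to 32 letters into 3 bit planes (one unsigned per plane)
    void dna_encode32(const char* seq, unsigned n, unsigned planes[3])
    {
        planes[0] = planes[1] = planes[2] = 0;
        for (unsigned i = 0; i < n && i < 32; ++i)
        {
            unsigned code = dna_code(seq[i]);
            planes[0] |= ((code >> 0) & 1) << i;   // bit slice 0
            planes[1] |= ((code >> 1) & 1) << i;   // bit slice 1
            planes[2] |= ((code >> 2) & 1) << i;   // bit slice 2
        }
    }

    // random access decode of letter i
    inline unsigned dna_decode(const unsigned planes[3], unsigned i)
    {
        return  ((planes[0] >> i) & 1)        |
               (((planes[1] >> i) & 1) << 1)  |
               (((planes[2] >> i) & 1) << 2);
    }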

    C++ implementation of bit-stream transposition

    The algorithm can be adapted to any transposition basis of 8, 16 or 32 bits, but for the purpose of this article we only implement 32-bit slicing and experiment on blocks of 2048 integers (8KB).

    
    
    namespace bm {
        const unsigned set_block_size       = 2048; // 32-bit words per block
        const unsigned set_block_plain_cnt  = 32;   // bit planes (rows of the matrix)
        const unsigned set_block_plain_size = 64;   // 32-bit words per plane row
    } // namespace bm
    
    inline 
    void bit_block_transpose(
           const unsigned* block,
           unsigned        tmatrix[bm::set_block_plain_cnt][bm::set_block_plain_size])
    {
        unsigned col = 0;
        for (unsigned i = 0; i < bm::set_block_size; 
                             i+=bm::set_block_plain_cnt, 
                             block += bm::set_block_plain_cnt, 
                             ++col)
        {
            for (unsigned j = 0; j < bm::set_block_plain_cnt; ++j)
            {
                unsigned w;            
                w  =(((block[0] >> j) & 1) << 0) |
                    (((block[1] >> j) & 1) << 1) |
                    (((block[2] >> j) & 1) << 2) |
                    (((block[3] >> j) & 1) << 3) |
                    (((block[4] >> j) & 1) << 4) |
                    (((block[5] >> j) & 1) << 5) |
                    (((block[6] >> j) & 1) << 6) |
                    (((block[7] >> j) & 1) << 7) |
                    (((block[8] >> j) & 1) << 8) |
                    (((block[9] >> j) & 1) << 9) |
                    (((block[10]>> j) & 1) << 10)|
                    (((block[11]>> j) & 1) << 11)|
                    (((block[12]>> j) & 1) << 12)|
                    (((block[13]>> j) & 1) << 13)|
                    (((block[14]>> j) & 1) << 14)|
                    (((block[15]>> j) & 1) << 15)|
                    (((block[16]>> j) & 1) << 16)|
                    (((block[17]>> j) & 1) << 17)|
                    (((block[18]>> j) & 1) << 18)|
                    (((block[19]>> j) & 1) << 19)|
                    (((block[20]>> j) & 1) << 20)|
                    (((block[21]>> j) & 1) << 21)|
                    (((block[22]>> j) & 1) << 22)|
                    (((block[23]>> j) & 1) << 23)|
                    (((block[24]>> j) & 1) << 24)|
                    (((block[25]>> j) & 1) << 25)|
                    (((block[26]>> j) & 1) << 26)|
                    (((block[27]>> j) & 1) << 27)|
                    (((block[28]>> j) & 1) << 28)|
                    (((block[29]>> j) & 1) << 29)|
                    (((block[30]>> j) & 1) << 30)|
                    (((block[31]>> j) & 1) << 31);
                    
                unsigned* row = tmatrix[j];
                row[col] = w;
            } // for j
        } // for i
    }
    
    This algorithm implements three nested loops with the innermost loop completely unrolled, doing a forward 32-word lookup with bit gathering and writing the results into the rows word by word. We think this version balances readability and performance: the simple 3-loop method proved to be slower, and more sophisticated SSE2 implementations are not available to us yet. :-) Loop unrolling with a lot of variables works well because current generations of x86 processors actively use out-of-order execution and register renaming to avoid costly memory accesses for local variables.

    C++ implementation of the NxN Hamming distance:

    
    inline 
    void bit_block_tmatrix_distance(
           const unsigned  tmatrix[bm::set_block_plain_cnt][bm::set_block_plain_size],
           unsigned        distance[bm::set_block_plain_cnt][bm::set_block_plain_cnt])
    {
        for (unsigned i = 0; i < bm::set_block_plain_cnt; ++i)
        {
            distance[i][i] = 0;
            for (unsigned j = i + 1; j < bm::set_block_plain_cnt; ++j)
            {
                const unsigned* row1 = tmatrix[i];
                const unsigned* row1_end = row1 + bm::set_block_plain_size;
                const unsigned* row2 = tmatrix[j];
    #ifdef BMVECTOPT
                unsigned count = VECT_BITCOUNT_XOR(row1, row1_end, row2);
    #else
                unsigned count = 0;
                do
                {
                    BM_INCWORD_BITCOUNT(count, row1[0] ^ row2[0]);
                    BM_INCWORD_BITCOUNT(count, row1[1] ^ row2[1]);
                    BM_INCWORD_BITCOUNT(count, row1[2] ^ row2[2]);
                    BM_INCWORD_BITCOUNT(count, row1[3] ^ row2[3]);
    
                    row1+=4;
                    row2+=4;
                } while (row1 < row1_end);
    #endif
                distance[i][j] = count;
            } // for j
        } // for i
    }
    
    

    The Hamming distance problem is embarrassingly parallel and can easily be coded using SSE2 SIMD (and probably any other SIMD system). SSE2 optimizations are implemented in the BitMagic library and described here: "128-bit SSE2 optimization".

    The parallel population counting algorithm boils down to:

    
    // 64-bit parallel (SWAR) population count; the unsigned operand avoids
    // sign extension on the right shifts
    int count(unsigned long long b)
    {
         b = (b & 0x5555555555555555ULL) + ((b >> 1) & 0x5555555555555555ULL);
         b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL);
         b = (b + (b >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
         b = b + (b >> 8);
         b = b + (b >> 16);
         b = (b + (b >> 32)) & 0x0000007F;

         return (int) b;
    }
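
    For reference, here is a minimal sketch of how an SSE2 XOR + bit count loop over two 64-word rows might look. This is our illustration built on the count() routine above, not BitMagic's actual VECT_BITCOUNT_XOR implementation:

    #include <emmintrin.h>   // SSE2 intrinsics

    inline unsigned sse2_bitcount_xor(const unsigned* row1,
                                      const unsigned* row1_end,
                                      const unsigned* row2)
    {
        unsigned acc = 0;
        unsigned long long tmp[2];
        do
        {
            __m128i a = _mm_loadu_si128((const __m128i*)row1);
            __m128i b = _mm_loadu_si128((const __m128i*)row2);
            __m128i x = _mm_xor_si128(a, b);        // 128-bit XOR
            _mm_storeu_si128((__m128i*)tmp, x);
            acc += count(tmp[0]) + count(tmp[1]);   // two scalar 64-bit popcounts
            row1 += 4;
            row2 += 4;                              // 4 x 32-bit words = 128 bits
        } while (row1 < row1_end);
        return acc;
    }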
    

    The algorithmic complexity of the NxN Hamming distance is definitely higher than that of the transposition. (Some algorithmic optimization techniques can probably be applied to eliminate unnecessary scans.)

    32-bit CPU vs. SSE2

    Long story short: the SSE2 parallel algorithm wins by a wide margin. We tested on different configurations; the SSE2-optimized code is typically 2 times faster than the 32-bit code (which uses lookup-table based bit counting), while the matrix transposition code was exactly the same. An interesting note is that on the Intel Atom processor the SSE2 code wins by an even larger margin: maybe in-order execution makes the 32-bit code slow, maybe the SIMD unit on Atom is very good...
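
    For completeness, a sketch of what lookup-table based bit counting looks like (our illustration of the 32-bit baseline, not BitMagic's exact table code): an 8-bit table turns a 32-bit population count into four lookups and three additions.

    static unsigned char bit_count_table[256];     // popcount of every byte value

    void init_bit_count_table()
    {
        for (unsigned i = 1; i < 256; ++i)         // table[0] is already 0
            bit_count_table[i] = (unsigned char)((i & 1) + bit_count_table[i >> 1]);
    }

    inline unsigned popcount_lut(unsigned w)
    {
        return bit_count_table[ w        & 0xFF] +
               bit_count_table[(w >> 8)  & 0xFF] +
               bit_count_table[(w >> 16) & 0xFF] +
               bit_count_table[(w >> 24) & 0xFF];
    }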

    [Photo: Intel Core2 Quad CPU. (c) Intel Corp.]

    SSE2 vs SSE4.2

    Unfortunately we have no data at this point. The Nehalem microarchitecture implements a hardware POPCNT instruction, so it should be faster. Unfortunately it looks like Intel chose not to implement a true SIMD 128-bit version of POPCNT, so the final performance of an SSE + POPCNT mix is an open question...
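
    As a sketch of what such a mix could look like on a 64-bit Nehalem target (our assumption -- we have not benchmarked this): the XOR still runs through general purpose registers, and only the counting moves to the hardware POPCNT instruction.

    #include <nmmintrin.h>   // _mm_popcnt_u64 (requires POPCNT; Nehalem and later)

    inline unsigned popcnt_bitcount_xor(const unsigned long long* row1,
                                        const unsigned long long* row1_end,
                                        const unsigned long long* row2)
    {
        unsigned cnt = 0;
        do
        {
            cnt += (unsigned)_mm_popcnt_u64(row1[0] ^ row2[0]);
            cnt += (unsigned)_mm_popcnt_u64(row1[1] ^ row2[1]);
            row1 += 2;
            row2 += 2;       // 2 x 64-bit words = 128 bits per iteration
        } while (row1 < row1_end);
        return cnt;
    }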

    nVidia CUDA hardware

    For this experiment we used a fully CUDA 2.3 capable nVidia GeForce 9500 GT PCI-E card. This GPU is generations behind the GT200 nVidia chip but still offers adequate casual gaming performance. The card packs 32 shader processors and dissipates 60W of power.

    [Photo: nVidia GeForce 9500 GT PCI-E]

    CUDA host program (CPU)

    
    
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>
    #include "cutil_inline.h"
    #include "tm_kernel.cu"
    
    #define size_B  2048
    #define size_C  32
    
    #define mem_size_B  sizeof(unsigned int) * size_B * PROC
    #define mem_size_C sizeof(unsigned int) * size_C * size_C * PROC
    
    void randomInit(unsigned int * data, int size, unsigned int val)
    {
       for (int i = 0; i < size; ++i)
           data[i] = (i/size_B + 1);
    }
    
    void runTest(int argc, char** argv)
    {
       unsigned int * h_A;
       unsigned int * h_B;
       unsigned int * h_C;
    
       unsigned int * d_A;
       unsigned int * d_B;
       unsigned int * d_C;
    
       unsigned int timer = 0;
    
       if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )
           cutilDeviceInit(argc, argv);
       else
           cutilSafeCall(cudaSetDevice( cutGetMaxGflopsDeviceId()));
    
       cutilSafeCall(cudaSetDeviceFlags(cudaDeviceMapHost));
    
       cutilCheckError(cutCreateTimer(&timer));
       cutilCheckError(cutStartTimer(timer));
    
       cutilSafeCall(cudaHostAlloc((void **)&h_A,mem_size_B,cudaHostAllocMapped | 
                     cudaHostAllocWriteCombined | cudaHostAllocPortable));
       cutilSafeCall(cudaHostAlloc((void **)&h_B,mem_size_B,cudaHostAllocMapped | 
                     cudaHostAllocWriteCombined | cudaHostAllocPortable));
       cutilSafeCall(cudaHostAlloc((void **)&h_C,mem_size_C,cudaHostAllocMapped | 
                     cudaHostAllocWriteCombined | cudaHostAllocPortable));
    
       cutilSafeCall(cudaHostGetDevicePointer (&d_A, h_A, 0));
       cutilSafeCall(cudaHostGetDevicePointer (&d_B, h_B, 0));
       cutilSafeCall(cudaHostGetDevicePointer (&d_C, h_C, 0));
    
       for (unsigned int i = 0; i < 10000; i++)
       {
               randomInit(h_A, size_B * PROC,i);
               cutilSafeCall(cudaMemcpyToSymbol(c_A, h_A,mem_size_B,0,cudaMemcpyHostToDevice));
               dim3 trs3(32,16);
               TM<<< PROC,trs3 >>>(d_B,d_C);
               cutilSafeCall(cudaThreadSynchronize());
       }
       for (unsigned int k = 0; k < PROC; k++)
       {
           for (unsigned int i = 0; i < size_C; i++)
           {
               for (unsigned  int j = 0; j < size_C; j++)
               {
                       printf("%5d ",h_C[(k << 10) + size_C * i + j]);
               }
               printf("\n");
           }
       }
       cutilSafeCall(cudaFreeHost(h_A));
       cutilSafeCall(cudaFreeHost(h_B));
       cutilSafeCall(cudaFreeHost(h_C));
    
       cutilCheckError(cutStopTimer(timer));
       printf("Processing time: %f (ms) \n", cutGetTimerValue(timer)/(10000 * PROC));
       cutilCheckError(cutDeleteTimer(timer));
    
       cutilSafeCall(cudaThreadExit());
    }
    
    
    int main(int argc, char** argv)
    {
       runTest(argc, argv);
       cutilExit(argc, argv);
    }
    
    
    

    CUDA device program (GPU)

    
    #define PROC 8
    #ifndef _TM_KERNEL_H_
    #define _TM_KERNEL_H_
    
    __device__ __constant__ unsigned int c_A[2048 * PROC];
    
    __global__ void TM(unsigned int* B, unsigned int*C) {
           __shared__ unsigned int Bs[2048];
           int ibit   = threadIdx.x;       //32
           int iblock = threadIdx.y;       //16
           int jBlock = blockIdx.x;
           unsigned int * Ax = &c_A[(jBlock << 11) + (iblock << 7)];
           unsigned int * Bx = &Bs[(ibit << 6) + (iblock << 2)];
           unsigned int * By = &B [(jBlock << 11) + (ibit << 6) + (iblock << 2)];
           unsigned int * Cy = &C[jBlock << 10];
           By[0] = Bx[0] =
           (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) | 
           (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
           (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) | 
           (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
           (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) | 
           (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
           (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) | 
           (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
           (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) | 
           (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
           (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) | 
           (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
           (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) | 
           (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
           (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) | 
           (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31) ;
    
           Ax += 32;
    
           By[1] = Bx[1] =
           (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) | 
           (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
           (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) | 
           (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
           (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) | 
           (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
           (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) | 
           (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
           (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) | 
           (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
           (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) | 
           (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
           (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) | 
           (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
           (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) | 
           (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31) ;
    
           Ax += 32;
    
           By[2] = Bx[2] =
           (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) | 
           (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
           (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) | 
           (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
           (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) | 
           (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
           (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) | 
           (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
           (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) | 
           (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
           (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) | 
           (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
           (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) | 
           (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
           (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) | 
           (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31) ;
    
           Ax += 32;
    
           By[3] = Bx[3] =
           (((Ax[0 ] >> ibit) & 1) << 0 ) | (((Ax[1 ] >> ibit) & 1) << 1 ) | 
           (((Ax[2 ] >> ibit) & 1) << 2 ) | (((Ax[3 ] >> ibit) & 1) << 3 ) |
           (((Ax[4 ] >> ibit) & 1) << 4 ) | (((Ax[5 ] >> ibit) & 1) << 5 ) | 
           (((Ax[6 ] >> ibit) & 1) << 6 ) | (((Ax[7 ] >> ibit) & 1) << 7 ) |
           (((Ax[8 ] >> ibit) & 1) << 8 ) | (((Ax[9 ] >> ibit) & 1) << 9 ) | 
           (((Ax[10] >> ibit) & 1) << 10) | (((Ax[11] >> ibit) & 1) << 11) |
           (((Ax[12] >> ibit) & 1) << 12) | (((Ax[13] >> ibit) & 1) << 13) | 
           (((Ax[14] >> ibit) & 1) << 14) | (((Ax[15] >> ibit) & 1) << 15) |
           (((Ax[16] >> ibit) & 1) << 16) | (((Ax[17] >> ibit) & 1) << 17) | 
           (((Ax[18] >> ibit) & 1) << 18) | (((Ax[19] >> ibit) & 1) << 19) |
           (((Ax[20] >> ibit) & 1) << 20) | (((Ax[21] >> ibit) & 1) << 21) | 
           (((Ax[22] >> ibit) & 1) << 22) | (((Ax[23] >> ibit) & 1) << 23) |
           (((Ax[24] >> ibit) & 1) << 24) | (((Ax[25] >> ibit) & 1) << 25) | 
           (((Ax[26] >> ibit) & 1) << 26) | (((Ax[27] >> ibit) & 1) << 27) |
           (((Ax[28] >> ibit) & 1) << 28) | (((Ax[29] >> ibit) & 1) << 29) | 
           (((Ax[30] >> ibit) & 1) << 30) | (((Ax[31] >> ibit) & 1) << 31) ;
    
           __syncthreads();
           if (ibit == iblock)
           {
                   long long * BX = (long long *)&Bs[ibit << 6];
                   Cy[(ibit << 5) + iblock] =
                   __popcll(BX[0])  + __popcll(BX[1])  + __popcll(BX[2])  + __popcll(BX[3])  + 
                   __popcll(BX[4])  + __popcll(BX[5])  + __popcll(BX[6])  + __popcll(BX[7])  +
                   __popcll(BX[8])  + __popcll(BX[9])  + __popcll(BX[10]) + __popcll(BX[11]) + 
                   __popcll(BX[12]) + __popcll(BX[13]) + __popcll(BX[14]) + __popcll(BX[15]) +
                   __popcll(BX[16]) + __popcll(BX[17]) + __popcll(BX[18]) + __popcll(BX[19]) + 
                   __popcll(BX[20]) + __popcll(BX[21]) + __popcll(BX[22]) + __popcll(BX[23]) +
                   __popcll(BX[24]) + __popcll(BX[25]) + __popcll(BX[26]) + __popcll(BX[27]) + 
                   __popcll(BX[28]) + __popcll(BX[29]) + __popcll(BX[30]) + __popcll(BX[31]) ;
    
                   ibit = 31 - ibit;
                   iblock = 31 - iblock;
    
                   long long * BY = (long long *)&Bs[iblock << 6];
    
                   Cy[(ibit << 5) + iblock] =
                   __popcll(BY[0])  + __popcll(BY[1])  + __popcll(BY[2])  + __popcll(BY[3])  + 
                   __popcll(BY[4])  + __popcll(BY[5])  + __popcll(BY[6])  + __popcll(BY[7])  +
                   __popcll(BY[8])  + __popcll(BY[9])  + __popcll(BY[10]) + __popcll(BY[11]) + 
                   __popcll(BY[12]) + __popcll(BY[13]) + __popcll(BY[14]) + __popcll(BY[15]) +
                   __popcll(BY[16]) + __popcll(BY[17]) + __popcll(BY[18]) + __popcll(BY[19]) + 
                   __popcll(BY[20]) + __popcll(BY[21]) + __popcll(BY[22]) + __popcll(BY[23]) +
                   __popcll(BY[24]) + __popcll(BY[25]) + __popcll(BY[26]) + __popcll(BY[27]) + 
                   __popcll(BY[28]) + __popcll(BY[29]) + __popcll(BY[30]) + __popcll(BY[31]) ;
           }
           else
           {
                   if (ibit > iblock)
                   {
                           ibit = 31 - ibit;
                           iblock = 31 - iblock;
                   }
                   long long * BX = (long long *)&Bs[ibit << 6];
                   long long * BY = (long long *)&Bs[iblock << 6];
    
                   Cy[(ibit << 5) + iblock] =  Cy[(iblock << 5) + ibit] =
                   __popcll(BX[0 ]^BY[0 ]) + __popcll(BX[1 ]^BY[1 ]) + __popcll(BX[2 ]^BY[2 ]) + 
                   __popcll(BX[3 ]^BY[3 ]) + __popcll(BX[4 ]^BY[4 ]) + __popcll(BX[5 ]^BY[5 ]) + 
                   __popcll(BX[6 ]^BY[6 ]) + __popcll(BX[7 ]^BY[7 ]) +
                   __popcll(BX[8 ]^BY[8 ]) + __popcll(BX[9 ]^BY[9 ]) + __popcll(BX[10]^BY[10]) + 
                   __popcll(BX[11]^BY[11]) + __popcll(BX[12]^BY[12]) + __popcll(BX[13]^BY[13]) + 
                   __popcll(BX[14]^BY[14]) + __popcll(BX[15]^BY[15]) +
                   __popcll(BX[16]^BY[16]) + __popcll(BX[17]^BY[17]) + __popcll(BX[18]^BY[18]) + 
                   __popcll(BX[19]^BY[19]) + __popcll(BX[20]^BY[20]) + __popcll(BX[21]^BY[21]) +  
                   __popcll(BX[22]^BY[22]) + __popcll(BX[23]^BY[23]) +
                   __popcll(BX[24]^BY[24]) + __popcll(BX[25]^BY[25]) + __popcll(BX[26]^BY[26]) + 
                   __popcll(BX[27]^BY[27]) + __popcll(BX[28]^BY[28]) + __popcll(BX[29]^BY[29]) + 
                   __popcll(BX[30]^BY[30]) + __popcll(BX[31]^BY[31]) ;
           }
    
    }
    #endif // _TM_KERNEL_H_
    

    CUDA optimization notes

    We decided to resist the temptation to print here all 256 variants of the CUDA kernel we tried. But we definitely want to outline the final tricks we used:

  • Pass the input data array as a GPU constant. Constant memory is fast and is the rough equivalent of a CPU cache, but unlike a CPU cache, CUDA constant memory can be controlled programmatically.
  • Use shared memory to store the transposed matrix and compute the Hamming distance. Again, shared memory here is just another form of close-to-ALU cache, much faster than main GPU memory.
  • Unroll loops to use more registers. The GPU offers plenty of registers to the programmer (compiler). Typically you have to find a good balance between starting more threads and running portions of the code sequentially on registers. There is no silver bullet here: you have to experiment.
  • Use hardware bit counting; CUDA offers it (the __popc / __popcll intrinsics).
  • Use device memory mapping. It seems to be faster than cudaMemcpy. (We don't completely understand how nVidia implements memory mapping over PCI-E, but it seems like an interesting feature.)
  • Do latency optimization. A CUDA kernel call with all its memory transfers looks like a high-latency operation. All the preparations and staging can actually take longer than the call itself. This is especially important for relatively simple, high-speed integer operations. The staging overhead for 4 transposition blocks on CUDA is approximately the same as for 1 block, which effectively gives you 4 blocks for the price of one. And of course you cannot partition the GPU so that one CPU thread takes one Streaming Multiprocessor (SM) and another thread takes another: one GPU device cannot be used concurrently, so use it all or lose the remaining SM units.

    Benchmark results

    1. Core2 Quad Q6600 2.4GHz (1 core), 32-bit  -- 0.22ms
    2. Core2 Quad (1 core), SSE2                 -- 0.12ms
    3. GeForce 9500 GT (1 SM)                    -- 0.42ms
    4. GeForce 9500 GT (4 SM)                    -- 0.16ms (4 blocks at a time!)
    

    As you can see, on integer operations one 9500 GT Streaming Multiprocessor loses to one SSE-aware CPU core, while 4 SMs start operating on par with one CPU core. The raw integer performance of the GPU seems to be higher than the CPU's, but the data transfer certainly takes its lion's share. The Core 2 Quad features 4 CPU cores, so it is also capable of running 4 blocks at a time. The nVidia GT200 offers 30 SMs(?).

    Is CUDA worth it?

    The economics of CUDA development is not bad, but it is not particularly good either. The CUDA language pretends to be C, but really looks like a good macro assembler mixed with a declarative approach to compute grid configuration. The graphics card has no operating system or complex task planner/scheduler of its own, which is both good and bad. Direct access to the hardware is good when your task scales well. If you need to partition your program into non-uniform small subtasks, you cannot run many different tasks asynchronously in parallel without writing an "uber-shader" or "uber-CUDA-kernel". Uber-shaders are predictably complex and vulnerable to a combinatorial explosion of variants of subalgorithms in the GPU kernel code. So the real-life scalability of a GPU solution is limited by the affordable complexity of software tools and the CPU-GPU middleware needed to combine and execute batches of small tasks on the GPU. A lot of GPU programming reminds us of functional or declarative programming, so a functional high-level language should be very helpful for auto-parallelisation of GPU tasks.

    Larrabee?

    The economics of vector SIMD programming (SSE) also requires skills; the main issue is that the current generation of SSE is NOT Turing complete (not even close). From Michael Abrash's excellent article "A First Look at the Larrabee New Instructions (LRBni)" it seems that LRBni (the Larrabee vector instruction set) is going to be better, because it is at least vector complete (it provides vector gather/scatter), which makes it incredibly awesome (though still not Turing complete). Looking at both cases, SSE2 and CUDA, we can extrapolate (well, speculate about) both the performance and the programming complexity of the future Intel Larrabee hardware. Larrabee will have 512-bit vector registers. This should be pretty fast if your loop is branchless. If you need a branch, LRBni provides vector compare instructions (with an open question of how to do the actual branching after a VECTOR comparison?).

    The upcoming GP-GPU battle of CUDA 3 vs. Larrabee is going to be pretty interesting. If the raw performance numbers are close to equal(?), the winning factors become compilers, languages, libraries, profilers and debuggers: tools (and (manufacturing) costs).

    Conclusion

    GP-GPU integer optimization starts making sense if we want to combine a lot of computational resources in one box, growing a data-mining super-server (or a cluster of super-servers). Combining CPUs and GPUs offers an unprecedented concentration of computational resources otherwise available only in distributed systems like MPI, Google's Map-Reduce, Hadoop and other cluster-ware. Once our picture factors in distributed cloud computing on the net, GP-GPU problems like PCIe latency suddenly become very affordable (network-based cluster-ware cannot compete with a realtime GPU). This allows the creation of database super-nodes, where multiple CPU cores and GPUs are combined with fast random-access Solid State Drives, RAM drives, etc. Concentration of various silicon-based (computational) resources can save bandwidth to slow devices (read: hard drives and network storage).

    Another consideration is that "The Gigahertz Race" is now over and is shaping into "The Parallel Race". It means algorithms and systems should be adapted, revisited or re-architected to meet the new reality of hundreds of parallel threads (not necessarily pthreads), where data needs to be restructured to minimize collisions and facilitate parallel access.

    We hope this article is useful for other developers. We would be happy to see your comments, suggestions and objections at the BitMagic library BLOG.

    References

  • 1."Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle"  Robert D. Cameron, Dan Lin
  • 2."A First Look at the Larrabee New Instructions (LRBni)"  Michael Abrash
  • 3."Integer 64-Bit Optimizations" Anatoliy Kuznetsov
  • 4."CUDA, Supercomputing for the Masses" Rob Farber
    UPDATE!

    "SSE2 vs CUDA in parallel bit stream transposition / similarity. Part 2."