BitMagic-C++
xsample07.cpp File Reference

Example: use of bvector<> for k-mer fingerprints. K should be short; no minimizers are used here.

#include <assert.h>
#include <stdlib.h>
#include <iostream>
#include <vector>
#include <map>
#include <algorithm>
#include <utility>
#include <future>
#include <thread>
#include <mutex>
#include <atomic>
#include "bm64.h"
#include "bmalgo.h"
#include "bmserial.h"
#include "bmaggregator.h"
#include "bmsparsevec_compr.h"
#include "bmsparsevec_algo.h"
#include "bmundef.h"
#include "bmdbg.h"
#include "bmtimer.h"
#include "dna_finger.h"
#include "cmd_args.h"

Data Structures

class  SortCounting_JobFunctor< BV >
 Functor to process a job batch (task)
 
class  Counting_JobFunctor< DNA_Scan >
 k-mer counting job functor class using bm::aggregator<>
 

Typedefs

typedef std::vector< char > vector_char_type
 
typedef DNA_FingerprintScanner< bm::bvector<> > dna_scanner_type
 
typedef bm::sparse_vector< unsigned, bm::bvector<> > sparse_vector_u32
 
typedef bm::rsc_sparse_vector< unsigned, sparse_vector_u32 > rsc_sparse_vector_u32
 
typedef std::map< unsigned, unsigned > histogram_map_u32
 

Functions

std::atomic_ullong k_mer_progress_count (0)
 
static int load_FASTA (const std::string &fname, vector_char_type &seq_vect)
 Really simple FASTA parser (one entry per file)
 
bool get_DNA_code (char bp, bm::id64_t &dna_code)
 
bool get_kmer_code (const char *dna, size_t pos, unsigned k_size, bm::id64_t &k_mer)
 Calculate k-mer as an unsigned long integer.
 
char int2DNA (unsigned code)
 Translate integer code to DNA letter.
 
void translate_kmer (std::string &dna, bm::id64_t kmer_code, unsigned k_size)
 Translate k-mer code into ATGC DNA string.
 
void validate_k_mer (const char *dna, size_t pos, unsigned k_size, bm::id64_t k_mer)
 QA function to validate if reverse k-mer decode gives the same string.
 
template<typename VECT >
void sort_unique (VECT &vect)
 Auxiliary function to do sort + unique on a vector of ints (removes duplicate elements).
 
template<typename VECT , typename COUNT_VECT >
void sort_count (VECT &vect, COUNT_VECT &cvect)
 Auxiliary function to do sort + unique on a vector of ints and save the results in a counts vector.
 
template<typename BV >
void generate_k_mer_bvector (BV &bv, const vector_char_type &seq_vect, unsigned k_size, bool check)
 This function turns each k-mer into an integer number and encodes it in a bit-vector (presence vector). The natural limitation here is that the integer has to be less than 48 bits (a limitation of bm::bvector<>). This method builds a presence k-mer fingerprint vector which can be used for Jaccard distance comparison.
 
void count_kmers (const vector_char_type &seq_vect, unsigned k_size, rsc_sparse_vector_u32 &kmer_counts)
 k-mer counting algorithm using the reference sequence: regenerates k-mer codes, sorts them, and counts.
 
template<typename BV >
void count_kmers_parallel (const BV &bv_kmers, const vector_char_type &seq_vect, rsc_sparse_vector_u32 &kmer_counts, unsigned k_size, unsigned concurrency)
 Multi-threaded (MT) k-mer counting.
 
template<typename BV >
void count_kmers (const BV &bv_kmers, rsc_sparse_vector_u32 &kmer_counts)
 k-mer counting method using the Bitap algorithm for occurrence search. This method is significantly slower than direct regeneration of k-mer codes followed by sorting and counting.
 
template<typename BV >
void count_kmers_parallel (const BV &bv_kmers, rsc_sparse_vector_u32 &kmer_counts, unsigned concurrency)
 Runs k-mer counting in parallel.
 
static void compute_kmer_histogram (histogram_map_u32 &hmap, const rsc_sparse_vector_u32 &kmer_counts)
 Compute a map of how often each k-mer frequency is observed in the k-mer counts vector.
 
void report_hmap (const string &fname, const histogram_map_u32 &hmap)
 Save a TSV report of k-mer frequencies (reverse sorted, most frequent k-mers first).
 
template<typename BV >
void compute_frequent_kmers (BV &frequent_bv, const histogram_map_u32 &hmap, const rsc_sparse_vector_u32 &kmer_counts, unsigned percent, unsigned k_size)
 Create a vector representing the subset of high-frequency k-mers.
 
int main (int argc, char *argv[])
 

Variables

std::string ifa_name
 
std::string ikd_name
 
std::string ikd_counts_name
 
std::string kh_name
 
std::string ikd_rep_name
 
std::string ikd_freq_name
 
bool is_diag = false
 
bool is_timing = false
 
bool is_bench = false
 
unsigned ik_size = 8
 
unsigned parallel_jobs = 4
 
unsigned f_percent = 5
 
bm::chrono_taker::duration_map_type timing_map
 
dna_scanner_type dna_scanner
 

Detailed Description

Example: use of bvector<> for k-mer fingerprints. K should be short; no minimizers are used here.

Definition in file xsample07.cpp.

Typedef Documentation

◆ dna_scanner_type

typedef DNA_FingerprintScanner<bm::bvector<> > dna_scanner_type
Examples:
xsample07.cpp.

Definition at line 100 of file xsample07.cpp.

◆ histogram_map_u32

typedef std::map<unsigned, unsigned> histogram_map_u32
Examples:
xsample07.cpp.

Definition at line 103 of file xsample07.cpp.

◆ rsc_sparse_vector_u32

typedef bm::rsc_sparse_vector<unsigned, sparse_vector_u32> rsc_sparse_vector_u32
Examples:
xsample07.cpp.

Definition at line 102 of file xsample07.cpp.

◆ sparse_vector_u32

typedef bm::sparse_vector<unsigned, bm::bvector<> > sparse_vector_u32
Examples:
xsample07.cpp.

Definition at line 101 of file xsample07.cpp.

◆ vector_char_type

typedef std::vector<char> vector_char_type
Examples:
xsample07.cpp.

Definition at line 99 of file xsample07.cpp.

Function Documentation

◆ compute_frequent_kmers()

template<typename BV >
void compute_frequent_kmers ( BV & frequent_bv,
const histogram_map_u32 & hmap,
const rsc_sparse_vector_u32 & kmer_counts,
unsigned percent,
unsigned k_size
)

Create a vector representing the subset of high-frequency k-mers.

Parameters
frequent_bv - [out] bit-vector of frequent k-mers (subset of all k-mers)
hmap - histogram map of all k-mers
kmer_counts - k-mer frequency (counts) vector
percent - percent of frequent k-mers to include in the subset (e.g. 5%); the percentage is of the total number of k-mers, not of all occurrences
k_size - k-mer size
Examples:
xsample07.cpp.

Definition at line 904 of file xsample07.cpp.

References bm::sparse_vector_scanner< SV >::find_eq(), bm::rsc_sparse_vector< Val, SV >::get(), bm::rsc_sparse_vector< Val, SV >::get_null_bvector(), and bm::bvector< Alloc >::iterator_base::valid().

Referenced by main().
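
The selection rule described above (a percentage of the distinct k-mers, taken from the most frequent end) can be sketched with standard containers. This is only one plausible reading, not the code in xsample07.cpp: `frequent_kmers_sketch` is a hypothetical name, a `std::vector` indexed by k-mer code stands in for `rsc_sparse_vector_u32`, the returned index list stands in for the output bit-vector, and the rule of taking whole histogram buckets while they fit in the budget is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical sketch: counts[i] is the number of occurrences of k-mer i
// (0 means the k-mer is absent). Returns the indexes of the frequent k-mers.
inline std::vector<std::size_t>
frequent_kmers_sketch(const std::vector<unsigned>& counts, unsigned percent)
{
    assert(percent <= 100);

    // Histogram: occurrence count -> number of distinct k-mers with that count.
    std::map<unsigned, unsigned> hmap;
    std::size_t total = 0;
    for (unsigned c : counts)
    {
        if (c)
        {
            ++hmap[c];
            ++total;
        }
    }

    // Budget is a share of the distinct k-mers, not of all occurrences.
    const std::size_t budget = total * percent / 100;

    // Walk buckets from the most frequent downward while they fit (assumption).
    std::size_t taken = 0;
    unsigned threshold = 0;
    bool found = false;
    for (auto it = hmap.rbegin(); it != hmap.rend(); ++it)
    {
        if (taken + it->second > budget)
            break;
        taken += it->second;
        threshold = it->first;
        found = true;
    }

    std::vector<std::size_t> result;
    if (!found)
        return result;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        if (counts[i] >= threshold)
            result.push_back(i);
    }
    return result;
}
```

With ten distinct k-mers and `percent = 20`, the budget is two k-mers, so only the two highest-count entries are returned.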

◆ compute_kmer_histogram()

static void compute_kmer_histogram ( histogram_map_u32 & hmap,
const rsc_sparse_vector_u32 & kmer_counts
)
static

Compute a map of how often each k-mer frequency is observed in the k-mer counts vector.

Parameters
hmap - [out] histogram map
kmer_counts - [in] k-mer counts vector
Examples:
xsample07.cpp.

Definition at line 859 of file xsample07.cpp.

References bm::rsc_sparse_vector< Val, SV >::get(), and bm::rsc_sparse_vector< Val, SV >::get_null_bvector().

Referenced by main().
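
A minimal sketch of the histogram idea, assuming a plain `std::vector<unsigned>` with one entry per distinct k-mer in place of `rsc_sparse_vector_u32`. The name `kmer_histogram_sketch` is hypothetical, and clearing the map on entry is a choice made for this illustration.

```cpp
#include <cassert>
#include <map>
#include <vector>

typedef std::map<unsigned, unsigned> histogram_map_u32;

// Hypothetical sketch: counts holds one occurrence count per distinct k-mer.
// hmap[c] becomes the number of distinct k-mers that occur exactly c times.
inline void kmer_histogram_sketch(histogram_map_u32& hmap,
                                  const std::vector<unsigned>& counts)
{
    hmap.clear();
    for (unsigned c : counts)
    {
        assert(c > 0); // every stored k-mer occurs at least once
        ++hmap[c];
    }
}
```

For counts `{1, 1, 2, 7, 2, 1}` the map holds three k-mers seen once, two seen twice, and one seen seven times.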

◆ count_kmers() [1/2]

void count_kmers ( const vector_char_type & seq_vect,
unsigned k_size,
rsc_sparse_vector_u32 & kmer_counts
)
inline

k-mer counting algorithm using the reference sequence: regenerates k-mer codes, sorts them, and counts.

Examples:
xsample07.cpp.

Definition at line 408 of file xsample07.cpp.

References get_DNA_code(), get_kmer_code(), and sort_count().

Referenced by count_kmers_parallel().

◆ count_kmers() [2/2]

template<typename BV >
void count_kmers ( const BV & bv_kmers,
rsc_sparse_vector_u32 & kmer_counts
)

k-mer counting method using the Bitap algorithm for occurrence search. This method is significantly slower than direct regeneration of k-mer codes followed by sorting and counting.

Definition at line 653 of file xsample07.cpp.

References ik_size, bm::rsc_sparse_vector< Val, SV >::set(), and translate_kmer().

◆ count_kmers_parallel() [1/2]

template<typename BV >
void count_kmers_parallel ( const BV & bv_kmers,
const vector_char_type & seq_vect,
rsc_sparse_vector_u32 & kmer_counts,
unsigned k_size,
unsigned concurrency
)

Multi-threaded (MT) k-mer counting.

Examples:
xsample07.cpp.

Definition at line 594 of file xsample07.cpp.

References count_kmers(), ik_size, and bm::rank_range_split().

Referenced by main().

◆ count_kmers_parallel() [2/2]

template<typename BV >
void count_kmers_parallel ( const BV & bv_kmers,
rsc_sparse_vector_u32 & kmer_counts,
unsigned concurrency
)

Runs k-mer counting in parallel.

Definition at line 781 of file xsample07.cpp.

References count_kmers(), k_mer_progress_count(), and bm::rank_range_split().

◆ generate_k_mer_bvector()

template<typename BV >
void generate_k_mer_bvector ( BV & bv,
const vector_char_type & seq_vect,
unsigned k_size,
bool check
)

This function turns each k-mer into an integer number and encodes it in a bit-vector (presence vector). The natural limitation here is that the integer has to be less than 48 bits (a limitation of bm::bvector<>). This method builds a presence k-mer fingerprint vector which can be used for Jaccard distance comparison.

Parameters
bv - [out] target bit-vector
seq_vect - [in] DNA sequence vector
k_size - dimension (k) for k-mer generation
Examples:
xsample07.cpp.

Definition at line 306 of file xsample07.cpp.

References bm::BM_SORTED, get_DNA_code(), get_kmer_code(), sort_unique(), timing_map, and validate_k_mer().

Referenced by main().
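
To illustrate how a presence fingerprint supports Jaccard distance comparison, here is a small sketch that uses `std::vector<bool>` as a stand-in for `bm::bvector<>`. The function names, the dense 4^k table, and the 2-bit code table (A=0, T=1, G=2, C=3) are assumptions made for this example and are not taken from xsample07.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: bit i is set if the k-mer with integer code i occurs.
inline std::vector<bool> kmer_fingerprint_sketch(const std::string& seq, unsigned k)
{
    assert(k > 0 && k <= 12); // dense table of 4^k bits: small k only
    std::vector<bool> fp(std::size_t(1) << (2 * k), false);
    for (std::size_t pos = 0; pos + k <= seq.size(); ++pos)
    {
        std::size_t code = 0;
        bool valid = true;
        for (unsigned i = 0; i < k && valid; ++i)
        {
            std::size_t c = 0;
            switch (seq[pos + i])
            {
            case 'A': c = 0; break;
            case 'T': c = 1; break;
            case 'G': c = 2; break;
            case 'C': c = 3; break;
            default:  valid = false; break; // 'N' or other symbol: skip k-mer
            }
            code |= c << (2 * i);
        }
        if (valid)
            fp[code] = true;
    }
    return fp;
}

// Jaccard distance = 1 - |A AND B| / |A OR B| over two fingerprints.
inline double jaccard_distance_sketch(const std::vector<bool>& a,
                                      const std::vector<bool>& b)
{
    assert(a.size() == b.size());
    std::size_t inter = 0, uni = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        inter += (a[i] && b[i]) ? 1 : 0;
        uni   += (a[i] || b[i]) ? 1 : 0;
    }
    return uni ? 1.0 - double(inter) / double(uni) : 0.0;
}
```

For k = 2, "ATGC" yields {AT, TG, GC} and "ATGA" yields {AT, TG, GA}: two shared k-mers out of four in the union. A compressed bit-vector replaces the dense table when the k-mer code space is large.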

◆ get_DNA_code()

bool get_DNA_code ( char bp,
bm::id64_t & dna_code
)
inline

◆ get_kmer_code()

bool get_kmer_code ( const char * dna,
size_t pos,
unsigned k_size,
bm::id64_t & k_mer
)
inline

Calculate k-mer as an unsigned long integer.

Returns
true if the k-mer is valid (not 'NNNNNN')
Examples:
xsample07.cpp.

Definition at line 165 of file xsample07.cpp.

References get_DNA_code().

Referenced by count_kmers(), generate_k_mer_bvector(), and SortCounting_JobFunctor< BV >::operator()().
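
A minimal sketch of packing a k-mer into a 64-bit integer at 2 bits per base. The code table (A=0, T=1, G=2, C=3), the bit order (base i at bit offset 2*i), and the names `dna_code_sketch` and `encode_kmer_sketch` are assumptions for illustration; the actual table and layout in xsample07.cpp may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 2-bit code table (assumption): A=0, T=1, G=2, C=3.
inline bool dna_code_sketch(char bp, std::uint64_t& code)
{
    switch (bp)
    {
    case 'A': code = 0; return true;
    case 'T': code = 1; return true;
    case 'G': code = 2; return true;
    case 'C': code = 3; return true;
    default:  return false; // 'N' or any other symbol
    }
}

// Pack k_size bases starting at dna[pos] into one integer.
// Returns false (and leaves k_mer untouched) if any base is not A/T/G/C.
inline bool encode_kmer_sketch(const char* dna, std::size_t pos,
                               unsigned k_size, std::uint64_t& k_mer)
{
    assert(k_size > 0 && k_size <= 32); // 2 bits per base must fit in 64 bits
    std::uint64_t acc = 0;
    for (unsigned i = 0; i < k_size; ++i)
    {
        std::uint64_t code = 0;
        if (!dna_code_sketch(dna[pos + i], code))
            return false;
        acc |= code << (2 * i);
    }
    k_mer = acc;
    return true;
}
```

Under these assumptions "ATGC" encodes as 0 + (1 << 2) + (2 << 4) + (3 << 6) = 228, while any window containing 'N' is rejected.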

◆ int2DNA()

char int2DNA ( unsigned  code)
inline

Translate integer code to DNA letter.

Examples:
xsample07.cpp.

Definition at line 192 of file xsample07.cpp.

Referenced by translate_kmer(), and validate_k_mer().

◆ k_mer_progress_count()

std::atomic_ullong k_mer_progress_count ( 0 )

◆ load_FASTA()

static int load_FASTA ( const std::string & fname,
vector_char_type & seq_vect
)
static

Really simple FASTA parser (one entry per file)

Examples:
xsample07.cpp.

Definition at line 116 of file xsample07.cpp.

References timing_map.

Referenced by main().
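
A single-entry FASTA reader can be sketched as below. This version reads from a `std::istream` rather than a file name so it is easy to try; the names `load_fasta_sketch` and `load_fasta_from_string` and the return convention (0 on success, non-zero if no header line was found) are assumptions for this example, not the behavior of load_FASTA().

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: read the first FASTA entry, skipping the '>' header
// and concatenating the sequence lines into seq_vect.
inline int load_fasta_sketch(std::istream& in, std::vector<char>& seq_vect)
{
    seq_vect.clear();
    std::string line;
    bool header_seen = false;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;
        if (line[0] == '>')
        {
            if (header_seen)
                break; // second entry: stop, one entry per file
            header_seen = true;
            continue;
        }
        for (char ch : line)
        {
            if (ch != '\r') // tolerate Windows line endings
                seq_vect.push_back(ch);
        }
    }
    return header_seen ? 0 : 1;
}

// Convenience wrapper for in-memory text.
inline int load_fasta_from_string(const std::string& text, std::vector<char>& seq_vect)
{
    std::istringstream in(text);
    int res = load_fasta_sketch(in, seq_vect);
    assert(res == 0 || res == 1);
    return res;
}
```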

◆ main()

int main ( int  argc,
char *  argv[] 
)

◆ report_hmap()

void report_hmap ( const string & fname,
const histogram_map_u32 & hmap
)

Save a TSV report of k-mer frequencies (reverse sorted, most frequent k-mers first).

Examples:
xsample07.cpp.

Definition at line 880 of file xsample07.cpp.

Referenced by main().
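
The reverse-sorted TSV output can be sketched by iterating the map backwards. This sketch writes to any `std::ostream` instead of opening a file; the column order (frequency, then number of k-mers with that frequency) and the names `report_hmap_sketch` and `hmap_to_tsv` are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical sketch: one "frequency<TAB>k-mer count" line per bucket,
// highest frequency first (std::map is ordered by key, so walk it in reverse).
inline void report_hmap_sketch(std::ostream& out,
                               const std::map<unsigned, unsigned>& hmap)
{
    for (auto it = hmap.rbegin(); it != hmap.rend(); ++it)
    {
        assert(it->second > 0); // a bucket exists only if some k-mer fell in it
        out << it->first << '\t' << it->second << '\n';
    }
}

// Convenience wrapper that returns the report as a string.
inline std::string hmap_to_tsv(const std::map<unsigned, unsigned>& hmap)
{
    std::ostringstream os;
    report_hmap_sketch(os, hmap);
    return os.str();
}
```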

◆ sort_count()

template<typename VECT , typename COUNT_VECT >
void sort_count ( VECT &  vect,
COUNT_VECT &  cvect 
)

Auxiliary function to do sort + unique on a vector of ints and save the results in a counts vector.

Examples:
xsample07.cpp.

Definition at line 268 of file xsample07.cpp.

Referenced by count_kmers(), and SortCounting_JobFunctor< BV >::operator()().
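
A sort-then-count pass can be sketched as a run-length scan over the sorted vector. The output layout used here (after the call `vect` holds the unique values and `cvect[i]` holds the count of `vect[i]`) is an assumption for this illustration, and `sort_count_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: sort vect, collapse equal runs in place,
// and record the length of each run in cvect.
template<typename VECT, typename COUNT_VECT>
void sort_count_sketch(VECT& vect, COUNT_VECT& cvect)
{
    cvect.clear();
    if (vect.empty())
        return;
    std::sort(vect.begin(), vect.end());

    std::size_t out = 0;        // next write position for a unique value
    std::size_t run_start = 0;  // first index of the current run
    for (std::size_t i = 1; i <= vect.size(); ++i)
    {
        if (i == vect.size() || vect[i] != vect[run_start])
        {
            cvect.push_back(static_cast<typename COUNT_VECT::value_type>(i - run_start));
            vect[out++] = vect[run_start];
            run_start = i;
        }
    }
    vect.resize(out);
    assert(vect.size() == cvect.size());
}
```

For the input `{5, 1, 5, 3, 5, 1}` this leaves `vect = {1, 3, 5}` and `cvect = {2, 1, 3}`.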

◆ sort_unique()

template<typename VECT >
void sort_unique ( VECT &  vect)

Auxiliary function to do sort + unique on a vector of ints (removes duplicate elements).

Examples:
xsample07.cpp.

Definition at line 256 of file xsample07.cpp.

Referenced by generate_k_mer_bvector().
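
The sort + unique idiom with the standard library looks roughly like this; `sort_unique_sketch` is a hypothetical name and the sketch is not necessarily identical to the function in xsample07.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: sort, then erase the tail left behind by std::unique.
template<typename VECT>
void sort_unique_sketch(VECT& vect)
{
    std::sort(vect.begin(), vect.end());
    vect.erase(std::unique(vect.begin(), vect.end()), vect.end());
    // Postcondition: no two adjacent elements are equal.
    assert(std::adjacent_find(vect.begin(), vect.end()) == vect.end());
}
```

A sorted, duplicate-free list of k-mer codes is a convenient input for bulk-loading a bit-vector.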

◆ translate_kmer()

void translate_kmer ( std::string &  dna,
bm::id64_t  kmer_code,
unsigned  k_size 
)
inline

Translate k-mer code into ATGC DNA string.

Parameters
dna - target string
kmer_code - k-mer code
k_size - k-mer size
Examples:
xsample07.cpp.

Definition at line 207 of file xsample07.cpp.

References int2DNA().

Referenced by count_kmers(), and Counting_JobFunctor< DNA_Scan >::operator()().
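
Decoding reverses the 2-bit packing: take the low two bits, map them to a letter, shift, and repeat. The sketch below assumes the code table A=0, T=1, G=2, C=3 with the first base in the lowest bits; both the table and the names `code_to_base_sketch` and `decode_kmer_sketch` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical inverse of the assumed 2-bit table: 0=A, 1=T, 2=G, 3=C.
inline char code_to_base_sketch(unsigned code)
{
    static const char lut[] = { 'A', 'T', 'G', 'C' };
    assert(code < 4);
    return lut[code];
}

// Unpack k_size bases from kmer_code, lowest two bits first.
inline std::string decode_kmer_sketch(std::uint64_t kmer_code, unsigned k_size)
{
    std::string dna(k_size, 'N');
    for (unsigned i = 0; i < k_size; ++i)
    {
        dna[i] = code_to_base_sketch(static_cast<unsigned>(kmer_code & 3u));
        kmer_code >>= 2;
    }
    return dna;
}
```

Under these assumptions the code 228 with k = 4 decodes to "ATGC", which is the kind of round trip a validation function can check against the original sequence.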

◆ validate_k_mer()

void validate_k_mer ( const char *  dna,
size_t  pos,
unsigned  k_size,
bm::id64_t  k_mer 
)
inline

QA function to validate if reverse k-mer decode gives the same string.

Examples:
xsample07.cpp.

Definition at line 224 of file xsample07.cpp.

References int2DNA().

Referenced by generate_k_mer_bvector().

Variable Documentation

◆ dna_scanner

dna_scanner_type dna_scanner
Examples:
xsample07.cpp.

Definition at line 109 of file xsample07.cpp.

◆ f_percent

unsigned f_percent = 5
Examples:
xsample07.cpp.

Definition at line 91 of file xsample07.cpp.

Referenced by main().

◆ ifa_name

std::string ifa_name
Examples:
xsample07.cpp.

Definition at line 80 of file xsample07.cpp.

Referenced by main().

◆ ik_size

unsigned ik_size = 8

◆ ikd_counts_name

std::string ikd_counts_name
Examples:
xsample07.cpp.

Definition at line 82 of file xsample07.cpp.

Referenced by main().

◆ ikd_freq_name

std::string ikd_freq_name
Examples:
xsample07.cpp.

Definition at line 85 of file xsample07.cpp.

Referenced by main().

◆ ikd_name

std::string ikd_name
Examples:
xsample07.cpp.

Definition at line 81 of file xsample07.cpp.

Referenced by main().

◆ ikd_rep_name

std::string ikd_rep_name
Examples:
xsample07.cpp.

Definition at line 84 of file xsample07.cpp.

Referenced by main().

◆ is_bench

bool is_bench = false
Examples:
xsample07.cpp.

Definition at line 88 of file xsample07.cpp.

◆ is_diag

bool is_diag = false
Examples:
xsample07.cpp.

Definition at line 86 of file xsample07.cpp.

Referenced by main().

◆ is_timing

bool is_timing = false
Examples:
xsample07.cpp.

Definition at line 87 of file xsample07.cpp.

Referenced by main().

◆ kh_name

std::string kh_name
Examples:
xsample07.cpp.

Definition at line 83 of file xsample07.cpp.

Referenced by main().

◆ parallel_jobs

unsigned parallel_jobs = 4
Examples:
xsample07.cpp.

Definition at line 90 of file xsample07.cpp.

Referenced by main().

◆ timing_map

bm::chrono_taker::duration_map_type timing_map
Examples:
xsample07.cpp.

Definition at line 108 of file xsample07.cpp.

Referenced by generate_k_mer_bvector(), load_FASTA(), and main().