The swKmerLookup module identifies potential kmer matches in the database and groups them into High Scoring Groups (HSGs). It uses the MASS algorithm for fast similarity search.
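
To make the idea of a similarity search concrete, the sketch below computes a z-normalized Euclidean distance profile between one query kmer and every window of a database reactivity profile. It assumes that MASS refers to Mueen's Algorithm for Similarity Search, which produces this kind of profile far more efficiently via FFT. The naive loop, the function names, and the toy values are illustrative and are not taken from swKmerLookup.

```python
import math

def znorm(values):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return [0.0] * n
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]

def distance_profile(query, target):
    """Naive O(n*m) distance profile: one distance per target window."""
    m = len(query)
    q = znorm(query)
    profile = []
    for start in range(len(target) - m + 1):
        w = znorm(target[start:start + m])
        profile.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w))))
    return profile

# Toy data (hypothetical): the query shape recurs at offset 3 of the
# target, scaled by a factor of 2.
query = [0.1, 0.9, 0.2, 0.7]
target = [0.5, 0.5, 0.4, 0.2, 1.8, 0.4, 1.4, 0.3]
profile = distance_profile(query, target)

# Index of the window most similar to the query
best = min(range(len(profile)), key=profile.__getitem__)
```

Because both windows are z-normalized before comparison, a window with the same shape as the query scores as a close match even when its absolute values differ.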

Usage

To list the available parameters, run:

$ swKmerLookup --help
| Parameter | Type | Description |
| --- | --- | --- |
| -o or --output | string | Output file (in Perl's Storable format) to store the identified HSGs (Default: hsg.sto) |
| --debug | | Identified HSGs are printed to screen |
| --db | string | Path to a database folder generated with swBuildDb |
| --query | string | A comma-separated list of reactivities of the query |
| --seq | string | Nucleotide sequence of the query |
| --threads | int | Number of processors to use (Default: 1) |
| --maxReactivity | float | Maximum value to which reactivities will be capped (Default: 1) |
| --kmerLen | int | Length (in nt) of the kmers (Default: 15) |
| --minKmers | int | Minimum number of kmers required to form a High Scoring Group (HSG; Default: 2) |
| --maxKmerDist | int | Maximum distance between two kmers to be merged into an HSG (Default: 30) |
| --matchKmerSeq | | The sequence of a query kmer and the corresponding database match must differ by no more than --kmerMaxSeqDist |
| --kmerMaxSeqDist | float | Maximum allowed sequence distance to retain a kmer match (requires --matchKmerSeq; Default: 0). Note: when >= 1, this is interpreted as the absolute number of bases that are allowed to differ between the kmer and the matching region. When < 1, this is interpreted as a fraction of the kmer's length |
| --matchKmerGCcontent | | The sequence of a query kmer and the corresponding database match must have GC% contents differing by no more than --kmerMaxGCdiff |
| --kmerMaxGCdiff | float | Maximum allowed GC% difference to retain a kmer match (requires --matchKmerGCcontent). Note: the default value is automatically determined based on the chosen kmer length |
| --kmerOffset | int | Sliding offset for extracting candidate kmers from the query (Default: 1) |
| --kmerMinComplexity | float | Minimum complexity (measured as Gini coefficient) of candidate kmers (Default: 0.3) |
| --kmerMaxMatchEveryNt | int | A kmer is allowed to match a database entry, on average, once every this many nt (Default: 200) |
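
The two interpretations of --kmerMaxSeqDist can be sketched as follows. This assumes that sequence distance means the number of mismatched positions (Hamming distance) between two equal-length kmers, and that a fractional threshold is rounded down. Neither the rounding rule nor the function names below come from swKmerLookup.

```python
import math

def allowed_mismatches(kmer_len, max_seq_dist):
    """Translate a --kmerMaxSeqDist value into a number of bases.

    Values >= 1 are an absolute count; values < 1 are a fraction of the
    kmer length (rounded down here, which is an assumption).
    """
    if max_seq_dist >= 1:
        return int(max_seq_dist)
    return math.floor(max_seq_dist * kmer_len)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def keep_match(query_kmer, db_kmer, max_seq_dist):
    """True if the match passes the sequence-distance filter."""
    limit = allowed_mismatches(len(query_kmer), max_seq_dist)
    return hamming(query_kmer, db_kmer) <= limit

# Hypothetical 15-nt kmers that differ at two positions
query_kmer = "ACGUACGUACGUACG"
db_kmer = "ACGUACGAACGUACC"

kept_absolute = keep_match(query_kmer, db_kmer, 2)    # up to 2 bases may differ
kept_fraction = keep_match(query_kmer, db_kmer, 0.1)  # 10% of 15 nt, rounded down
```

With the default of 0, only database matches whose sequence is identical to the query kmer would pass this filter.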

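How --kmerLen, --kmerOffset, --maxReactivity and --kmerMinComplexity could interact when candidate kmers are drawn from the query is sketched below. It assumes that the Gini coefficient is computed over the capped reactivity values of each kmer, using the standard mean-absolute-difference definition. The exact formula used by swKmerLookup is not stated here, and all names are illustrative.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly uniform)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)

def candidate_kmers(reactivities, kmer_len=15, offset=1,
                    max_reactivity=1.0, min_complexity=0.3):
    """Yield (start, kmer) for windows whose Gini passes the threshold."""
    # Cap reactivities first, mirroring --maxReactivity
    capped = [min(r, max_reactivity) for r in reactivities]
    for start in range(0, len(capped) - kmer_len + 1, offset):
        kmer = capped[start:start + kmer_len]
        if gini(kmer) >= min_complexity:
            yield start, kmer

# Hypothetical query; short kmers are used to keep the example readable
reactivities = [0.0, 0.0, 2.5, 0.0, 0.5, 0.5, 0.5, 0.5]
starts = [s for s, _ in candidate_kmers(reactivities, kmer_len=4, offset=2)]
```

In this toy input, the flat window at the end of the query has a Gini coefficient of 0 and is discarded, which is the kind of low-information kmer a complexity filter is meant to remove.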

Output

The output generated by the module is an array of HSGs identified for the given query/database entry pair. The output file is generated in Perl's Storable format, to enable rapid processing by SHAPEwarp. When the --debug parameter is specified, the content of the array is also printed to screen:

{ dbId => 16S, db => [185,205], query => [0,20] }
{ dbId => 16S, db => [495,511], query => [0,16] }
{ dbId => 16S, db => [1006,1022], query => [0,16] }
{ dbId => 16S, db => [447,463], query => [1,17] }
{ dbId => 16S, db => [172,189], query => [2,19] }
{ dbId => 16S, db => [252,268], query => [4,20] }
{ dbId => 16S, db => [589,606], query => [6,23] }
{ dbId => 16S, db => [1174,1222], query => [6,54] }
{ dbId => 16S, db => [539,555], query => [7,23] }
{ dbId => 16S, db => [1486,1502], query => [7,23] }
{ dbId => 16S, db => [1013,1035], query => [9,31] }
{ dbId => 16S, db => [1236,1253], query => [9,26] }
{ dbId => 16S, db => [741,758], query => [12,29] }
{ dbId => 16S, db => [431,447], query => [13,29] }
{ dbId => 16S, db => [146,177], query => [14,45] }
{ dbId => 16S, db => [368,385], query => [14,31] }


where:

| Field | Description |
| --- | --- |
| dbId | Database entry ID |
| db | Start-end position of the HSG in the database entry |
| query | Start-end position of the HSG in the query |
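
For quick inspection without reading the Storable file, the lines printed by --debug can be parsed with a small script. The sketch below assumes the debug lines follow exactly the layout shown above; the pattern and the function name are illustrative and are not part of SHAPEwarp.

```python
import re

# Pattern for one line of --debug output, in the layout shown above
HSG_LINE = re.compile(
    r"\{\s*dbId\s*=>\s*(?P<dbId>.+?),\s*"
    r"db\s*=>\s*\[\s*(?P<db_start>\d+)\s*,\s*(?P<db_end>\d+)\s*\],\s*"
    r"query\s*=>\s*\[\s*(?P<q_start>\d+)\s*,\s*(?P<q_end>\d+)\s*\]\s*\}"
)

def parse_hsg(line):
    """Parse one debug line into a dict, or return None if it does not match."""
    m = HSG_LINE.search(line)
    if m is None:
        return None
    return {
        "dbId": m.group("dbId"),
        "db": (int(m.group("db_start")), int(m.group("db_end"))),
        "query": (int(m.group("q_start")), int(m.group("q_end"))),
    }

hsg = parse_hsg("{ dbId => 16S, db => [1174,1222], query => [6,54] }")

# Length covered on each side of the HSG
db_span = hsg["db"][1] - hsg["db"][0]
query_span = hsg["query"][1] - hsg["query"][0]
```

In the listing above, every HSG covers the same number of positions in the database entry and in the query (for this record, 1222 - 1174 = 54 - 6 = 48).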