The swKmerLookup module identifies potential kmer matches in the database and groups them into high scoring groups (HSGs). The module takes advantage of the MASS algorithm for fast similarity search.
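To make the grouping step concrete, here is a minimal Python sketch of how kmer matches could be chained into HSGs. It is not SHAPEwarp's implementation. It assumes that matches are chained only when they lie on the same diagonal (same database-minus-query offset), that the distance between two kmers is measured between their start positions, and that the end coordinate is the last kmer start plus the kmer length. The function name `group_kmer_matches` and its arguments are hypothetical; the defaults mirror the documented defaults of --kmerLen, --minKmers and --maxKmerDist.

```python
from collections import defaultdict


def group_kmer_matches(matches, kmer_len=15, min_kmers=2, max_kmer_dist=30):
    """Chain (query_start, db_start) kmer matches into candidate HSGs.

    Illustrative only: the same-diagonal rule, the start-to-start distance
    and the end-coordinate convention are assumptions of this sketch.
    """
    # Bucket matches by diagonal, i.e. db_start - query_start.
    by_diagonal = defaultdict(list)
    for q_start, db_start in matches:
        by_diagonal[db_start - q_start].append(q_start)

    hsgs = []
    for diagonal, q_starts in by_diagonal.items():
        q_starts.sort()
        run = [q_starts[0]]
        runs = []
        for q in q_starts[1:]:
            if q - run[-1] <= max_kmer_dist:
                run.append(q)      # close enough: extend the current run
            else:
                runs.append(run)   # too far: close the run, start a new one
                run = [q]
        runs.append(run)

        # Keep only runs supported by enough kmers.
        for r in runs:
            if len(r) >= min_kmers:
                q_from, q_to = r[0], r[-1] + kmer_len
                hsgs.append({
                    "query": [q_from, q_to],
                    "db": [q_from + diagonal, q_to + diagonal],
                })
    return sorted(hsgs, key=lambda h: h["query"])
```

In this sketch an isolated match never becomes an HSG on its own, which is the role --minKmers plays in the parameter table.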
Usage
To list the required parameters, simply type:
$ swKmerLookup --help
Parameter | Type | Description |
---|---|---|
-o or --output | string | Output file (in Perl's Storable format) to store the identified HSGs (Default: hsg.sto) |
--debug | | Identified HSGs are printed to screen |
--db | string | Path to a database folder generated with swBuildDb |
--query | string | A comma-separated list of reactivities of the query |
--seq | string | Nucleotide sequence of the query |
--threads | int | Number of processors to use (Default: 1) |
--maxReactivity | float | Maximum value to which reactivities will be capped (Default: 1) |
--kmerLen | int | Length (in nt) of the kmers (Default: 15) |
--minKmers | int | Minimum number of kmers required to form a High Scoring Group (HSG; Default: 2) |
--maxKmerDist | int | Maximum distance between two kmers to be merged in a HSG (Default: 30) |
--matchKmerSeq | | The sequence of a query kmer and the corresponding database match must differ by no more than --kmerMaxSeqDist |
--kmerMaxSeqDist | float | Maximum allowed sequence distance to retain a kmer match (requires --matchKmerSeq; Default: 0). Note: when >= 1, this is interpreted as the absolute number of bases that are allowed to differ between the kmer and the matching region. When < 1, this is interpreted as a fraction of the kmer's length |
--matchKmerGCcontent | | The sequence of a query kmer and the corresponding database match must have GC% contents differing by no more than --kmerMaxGCdiff |
--kmerMaxGCdiff | float | Maximum allowed GC% difference to retain a kmer match (requires --matchKmerGCcontent). Note: the default value is automatically determined based on the chosen kmer length |
--kmerOffset | int | Sliding offset for extracting candidate kmers from the query (Default: 1) |
--kmerMinComplexity | float | Minimum complexity (measured as Gini coefficient) of candidate kmers (Default: 0.3) |
--kmerMaxMatchEveryNt | int | A kmer is allowed to match a database entry on average every this many nt (Default: 200) |
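The two interpretations of --kmerMaxSeqDist (absolute count when >= 1, fraction of the kmer length when < 1) can be illustrated with a short Python sketch. The helper names below are hypothetical. Measuring sequence distance as a Hamming distance and rounding the fractional threshold down are assumptions of this sketch, not documented behaviour of swKmerLookup.

```python
def max_allowed_mismatches(kmer_len, kmer_max_seq_dist):
    """Translate a --kmerMaxSeqDist-style value into a number of bases."""
    if kmer_max_seq_dist >= 1:
        # Values >= 1 are read as an absolute number of differing bases.
        return int(kmer_max_seq_dist)
    # Values < 1 are read as a fraction of the kmer length.
    # Rounding down is an assumption of this sketch.
    return int(kmer_max_seq_dist * kmer_len)


def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def passes_seq_filter(query_kmer, db_kmer, kmer_max_seq_dist=0):
    """Return True if the match would be retained under this threshold."""
    limit = max_allowed_mismatches(len(query_kmer), kmer_max_seq_dist)
    return hamming_distance(query_kmer, db_kmer) <= limit
```

With the default of 0, only identical sequences pass. For a 15 nt kmer, a value of 2 tolerates two mismatches, and a value of 0.5 tolerates seven under the rounding assumed here.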
Output
The output generated by the module is an array of HSGs identified for the given query/database entry pair. The output file is generated in Perl's Storable format, to enable rapid processing by SHAPEwarp. When the --debug parameter is specified, the content of the array is also printed to screen:
{ dbId => 16S, db => [185,205], query => [0,20] }
{ dbId => 16S, db => [495,511], query => [0,16] }
{ dbId => 16S, db => [1006,1022], query => [0,16] }
{ dbId => 16S, db => [447,463], query => [1,17] }
{ dbId => 16S, db => [172,189], query => [2,19] }
{ dbId => 16S, db => [252,268], query => [4,20] }
{ dbId => 16S, db => [589,606], query => [6,23] }
{ dbId => 16S, db => [1174,1222], query => [6,54] }
{ dbId => 16S, db => [539,555], query => [7,23] }
{ dbId => 16S, db => [1486,1502], query => [7,23] }
{ dbId => 16S, db => [1013,1035], query => [9,31] }
{ dbId => 16S, db => [1236,1253], query => [9,26] }
{ dbId => 16S, db => [741,758], query => [12,29] }
{ dbId => 16S, db => [431,447], query => [13,29] }
{ dbId => 16S, db => [146,177], query => [14,45] }
{ dbId => 16S, db => [368,385], query => [14,31] }
where:
Field | Description |
---|---|
dbId | Database entry ID |
db | Start-end position of the HSG in the database entry |
query | Start-end position of the HSG in the query |
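For quick inspection outside of Perl, the --debug lines can be parsed into plain records. The Python sketch below assumes the lines look exactly like the example above (one HSG per line, fields in the order dbId, db, query). The function name `parse_hsg_line` is hypothetical, and the Storable output file itself is not read here.

```python
import re

# Pattern for one --debug line, e.g.
# { dbId => 16S, db => [185,205], query => [0,20] }
HSG_LINE = re.compile(
    r"\{\s*dbId\s*=>\s*(?P<dbId>\S+?)\s*,"
    r"\s*db\s*=>\s*\[\s*(?P<db_start>\d+)\s*,\s*(?P<db_end>\d+)\s*\]\s*,"
    r"\s*query\s*=>\s*\[\s*(?P<q_start>\d+)\s*,\s*(?P<q_end>\d+)\s*\]\s*\}"
)


def parse_hsg_line(line):
    """Parse one debug line into a dict, or return None if it does not match."""
    m = HSG_LINE.search(line)
    if m is None:
        return None
    return {
        "dbId": m["dbId"],
        "db": (int(m["db_start"]), int(m["db_end"])),
        "query": (int(m["q_start"]), int(m["q_end"])),
    }


def parse_hsg_output(text):
    """Parse every HSG line in a block of debug output, skipping other lines."""
    parsed = (parse_hsg_line(line) for line in text.splitlines())
    return [hsg for hsg in parsed if hsg is not None]
```

Lines that do not match the pattern are skipped, so the parser tolerates any other text that happens to be mixed into the captured output.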