SHAPEwarp uses an approach inspired by the BLAST heuristic algorithm to identify regions of RNA structure similarity in a database of chemical reactivity profiles generated by SHAPE (Selective 2'-hydroxyl acylation analyzed by primer extension).
Given a user-provided query, the following analysis steps are performed:
-
All the possible query kmers of a user-defined length are enumerated, filtering out those corresponding to low structure complexity regions (i.e. those characterized by a Gini coefficient below a certain cutoff).
-
Kmers passing the complexity cutoff are then looked up in a database of reactivity profiles via MASS, a highly optimized algorithm for time series subsequences all-pairs-similarity search (TSAPSS) in just O(n2). Additionally, the percent GC content of the kmer and its matches is evaluated and matches having GC content deviating by more than a certain percentage from that of the query kmer, are discarded.
-
Kmer matches lying on the same diagonal, within a maximum distance from each other, are then grouped into high scoring groups (HSGs).
-
Ungapped HSGs are extended upstream and downstream by using a banded semi-global alignment algorithm, that incorporates both features of the Gotoh’s Smith-Waterman algorithm with affine gap penalties and of DTW. The extension process is stopped when the score drops below a certain threshold for more than a certain number of residues.
-
A null model is then built by searching the same query in a database of randomly shuffled reactivity profiles and used to estimate the E-value of each database match.
-
For those matches passing the user-defined E-value threshold, the aligned reactivity profiles can then be used to model the underlying structure by using the RNAalifold algorithm of the ViennaRNA package.
Usage
To list the required parameters, simply type:
$ SHAPEwarp --help
Parameter | Type | Description |
---|---|---|
-db or --database | string | Path to a database folder generated with swBuildDb |
-q or --query | string | Path to the query file Note: each entry should contain (one per row) the sequence id, the nucleotide sequence and a comma-separated list of SHAPE reactivities (see SHAPEwarp query files section below) |
-o or --output | string | Output directory (Default: sw_out/) |
-ow or --overwrite | Overwrites the output directory (if the specified path already exists) | |
--threads | int | Number of processors to use (Default: 1) |
--maxReactivity | float | Maximum value to which reactivities will be capped (Default: 1) |
--maxAlignOverlap | float | If two significant alignments overlap by more than this value, the least significant one (the one with the lowest alignment score) will be discarded (Default: 0.5) |
--nullHSGs | int | Number of HSGs in the shuffled database to be extended to build the null model (Default: 10000) |
--incE or --inclusionEvalue | float | E-value threshold to consider an alignment significant (Default: 0.01) |
--repE or --reportEvalue | float | E-value threshold to report a match (Default: 0.1 |
--makePlot | Generates plots of aligned SHAPE reactivity (or probability) profiles Note: plots are generated only for matches below the inclusion E-value cutoff |
|
--reportAln | string | Reports sequence alignments in the specified format ([f]asta or [s]tockholm) Note: alignments are reported only for matches below the inclusion E-value cutoff |
--foldQuery | Query SHAPE profile is first used to calculate a base-pairing probability profile, that is then used to search into the db Note: the db must have been generated with the --foldDb option |
|
Folding options (require --foldQuery or --evalAlignFold ) |
||
--slope | float | Slope for SHAPE reactivities conversion into pseudo-free energy contributions (Default: 1.8) |
--intercept | float | Intercept for SHAPE reactivities conversion into pseudo-free energy contributions (Default: -0.6) |
--maxBPspan | int | Maximum allowed base-pairing distance (Default: 600) |
--noLonelyPairs | Disallows lonely pairs (helices of 1 bp) | |
--noClosingGU | Dissalows G:U wobbles at the end of helices | |
--temperature | float | Folding temperature (Default: 37.0) |
Query folding-specific options (require --foldQuery ) |
||
--winSize | int | Size (in nt) of the sliding window for partition function calculation (Default: 800) |
--offset | int | Offset (in nt) for partition function window sliding (Default: 200) |
--winTrim | int | Number of bases to trim from both ends of partition function windows to avoid terminal biases (Default: 50) |
--ignoreReact | SHAPE reactivity is ignored when folding the query | |
Kmer lookup options | ||
--minKmers | int | Minimum number of kmers required to form a High Scoring Group (HSG; Default: 2) |
--maxKmerDist | int | Maximum distance between two kmers to be merged in a HSG (Default: 30) |
--kmerLen | int | Length (in nt) of the kmers (Default: 15) |
--kmerOffset | int | Sliding offset for extracting candidate kmers from the query (Default: 1) |
--matchKmerGCcontent | The sequence of a query kmer and the corresponding database match must have GC% contents differing no more than --kmerMaxGCdiff |
|
--kmerMaxGCdiff | float | Maximum allowed GC% difference to retain a kmer match (requires --matchKmerGCcontent )Note: the default value is automatically determined based on the chosen kmer length |
--matchKmerSeq | The sequence of a query kmer and the corresponding database match must differ no more than --kmerMaxSeqDist |
|
--kmerMaxSeqDist | float | Maximum allowed sequence distance to retain a kmer match (requires --matchKmerSeq ; Default: 0)Note: when >= 1, this is interpreted as the absolute number of bases that are allowed to differ between the kmer and the matching region. When < 1, this is interpreted as a fraction of the kmer's length |
--kmerMinComplexity | float | Minimum complexity (measured as Gini coefficient) of candidate kmers (Default: 0.3) |
--kmerMaxMatchEveryNt | int | A kmer is allowed to match a database entry on average every this many nt (Default: 200) |
Alignment options | ||
--alignMatchScore | float,float | Minimum and maximum score reactivity differences below 0.5 will be mapped to (Default: 0,2) |
--alignMismatchScore | float,float | Minimum and maximum score reactivity differences above 0.5 will be mapped to (Default: -6,-0.5) |
--alignGapOpenPenal | float | Gap open penalty (Default: -14) |
--alignGapExtPenal | float | Gap extension penalty (Default: -5) |
--alignMaxDropOffRate | float | An alignment is allowed to drop by maximum this fraction of the best score encountered so far, before extension is interrupted (Default: 0.8) |
--alignMaxDropOffBases | int | An alignment is allowed to drop below the best score encountered so far * --alignMaxDropOffRate by this number of bases, before extension is interrupted (Default: 8) |
--alignLenTolerance | float | The maximum allowed tollerated length difference between the query and db sequences to look for the ideal alignment along the diagonal (measured as a fraction of the length of the shortest sequence between the db and the query) (Default: 0.1) |
--alignScoreSeq | Sequence matches are rewarded during the alignment | |
--alignSeqMatchScore | float | Score reward for matching bases (Default: 0.5) |
--alignSeqMismatchScore | float | Score reward for matching bases (Default: -2) |
Alignment folding evaluation options (see also "Folding options") | ||
--evalAlignFold | Alignments passing the --inclusionEvalue threshold, are further evaluated for the presence of a conserved RNA structure by using RNAalifold |
|
--shufflings | Number of shufflings to perform for each alignment (Default: 100) | |
--blockSize | int | Size (in nt) of the blocks for shuffling the alignment (Default: 3) |
--inBlockShuffle | Besides shuffling blocks, residues within each block will be shuffled as well | |
--minBpSupport | float | Minimum fraction of base-pairs of the RNAalifold-inferred structure that should be supported by both query and db sequence to retain a match (Default: 0.75) |
--ribosumScoring | Use RIBOSUM scoring matrix | |
--alignFoldPvalue | float | P-value threshold to consider signficant an RNA structure predicted by RNAalifold (Default: 0.05) |
--RNAalifold | string | Path to RNAalifold executable (Default: assumes RNAalifold is in PATH) |
Understanding the algorithm
While it is advisable for most users to run SHAPEwarp with its default parameters, as these are the results of a careful and thorough calibration, it might be useful to adjust the analysis on a case-by-case basis.
First of all, as SHAPE reactivities can have a variable range, depending on the normalization scheme that has been adopted, both query and database reactivities are capped to a maximum value of maxReactivity
. During the kmer lookup phase, the algorithm enumerates all the possible kmers of size kmerLen
in the user-provided query, filtering out those:
- encompassing NaN values
- having Gini coefficient <
kmerMinComplextiy
, corresponding to regions of low structural complexity (expected to match most transcripts)
Kmers are matched against the database using the MASS algorithm. Given a kmer and a database reactivity profile, MASS returns an array of distances of that kmer to each position of the database profile. Matches having a distance < μ - 3 × σ, where μ and σ are respectively the mean and the standard deviation of the distances, are retained. Kmer-match pairs are then filtered out on the basis of the following criteria:
- when
matchGCcontent
is enabled and the difference between the GC% content of the kmer and that of its database match is >kmerMaxGCdiff
- when
matchKmerSeq
is enabled and the sequence of the kmer and that of its database match differ by more thanmaxKmerSeqDist
- when the ratio between the length of the database transcript and the number of matches for a given kmer in that transcript is <
kmerMaxMatchEveryNt
The retained kmers are then grouped into high-scoring groups (HSGs). HSGs are made of kmers residing on the same diagonal (so, corresponding to regions of the query and database that can be theoretically aligned without opening gaps). Each HSG is made of ≥ minKmers
kmers, residing within a maximum distance of maxKmerDist
from each other.
Each HSG constitutes the seed from which an alignment extension will be attempted, both upstream and downstream. The seed extension uses a dynamic programming algorithm that incorporates features of both dynamic time warping (DTW) and of Gotoh's Smith-Waterman implementation with affine gap penalties. In brief, if the reactivity difference between two bases is < 0.5, the corresponding alignment score is calculated by linearly mapping the reactivity difference to the alignMatchScore
range. Similarly, if the reactivity difference is ≥ 0.5, the corresponding alignment score is calculated by linearly mapping the reactivity difference to the alignMismatchScore
range. Affine gap penalties are controlled via the alignGapOpenPenal
, the gap open penalty, and alignGapOpenExt
, the gap extension penalty. During the matrix-filling stage of the algorithm, the alignment score is allowed to drop by not more than alignMaxDropOffRate
× the best alignment score encountered so far, for no more than alignMaxDropOffBases
. While the algorithm is sequence-agnostic by default, sequence can be optionally taken into account by enabling the alignScoreSeq
option. When enabled, sequence matches are rewarded by alignSeqMatchScore
and mismatches are penalized by alignSeqMismatchScore
. To further speed-up the alignment phase, the alignment matrix is only filled in a band of size b around the diagonal, where b is defined as min(10, alignLenTolerance
× min(q, d)), where q and d are respectively the length of the query and of the database entry. Therefore, a minimum band size of 10 nt is always granted. The final alignment score S is calculated as the sum of the scores of the HSG (the seed), plus the two upstream and downstream extensions.
After having extended any possible HSG in the database, the same procedure is repeated by searching the query in a database of shuffled reactivity profiles. A maximum of nullHSGs
HSGs identified in the shuffled database are extended, and the resulting alignment scores are used to build the null model, hence allowing to calculate a p-value for each alignment from the real database. The p-value is the probability of identifying by chance a match having a score ≥ S. The E-value corresponds to the number of expected database matches having a score ≥ S and it is thus calculated as p-value × the number of extended HSGs. It might happen, especially with long queries, that multiple HSGs will be identified for the same query-database entry pair, hence resulting in multiple overlapping alignments. To prevent duplicate matches from being reported, if multiple alignments overlap by more than maxAlignOverlap
, only the one with the lowest E-value is reported.
Optionally, if the evalAlignFold
option is enabled, alignments passing the incE
E-value cutoff can be evaluated for the presence of a consensus secondary structure, via RNAalifold. RNAalifold uses both the sequence alignment and the corresponding SHAPE reactivity information. SHAPE reactivities are converted into pseudo free energy contributions by using the approach from Deigan et al., 2009:
ΔGSHAPE(i) = slope
· ln[Reactivity(i) + 1] + intercept
If less than minBpSupport
base-pairs from the predicted consensus structure are supported for either of the two aligned RNAs, the match is discarded. Furthermore, the probability of predicting a structure with a free energy ≤ to that of the consensus structure inferred from the alignment is evaluated by randomly shuffling shufflings
times the columns of the original alignment in blockSize
-long blocks, followed by RNAalifold analysis. A z-score is calculated and the corresponding p-value determined. If the p-value is < alignFoldPvalue
, the alignment and the inferred structure are reported.
Using SHAPEwarp with non-SHAPE chemical probing data
To further enable the search of reactivity profiles derived by probing with nucleobase-specific reagents (i.e. DMS, CMCT, etc.), SHAPEwarp implements the possibility of converting reactivities into base-pairing probabilities (or, more precisely, unpaired probabilities). This option is activated by enabling the foldQuery
parameter. The partition function is calcolated for the query in sliding winSize
-long windows, slid with an offset of offset
nucleotides. To avoid end-biases due to folding being computed in windows, winTrim
bases are trimmed from both ends of the partition function window. The conversion of reactivities into pseudo free energy contributions follows the same approach described above for RNAalifold (optimal slope and intercept can be set by using the same slope
and intercept
parameters). From the base-pairing probabilities then the probability of each base of being unpaired is determined and used to search in the target database. The database must have been generated by enabling the foldDb
option of swBuildDb
.
SHAPEwarp query files
SHAPEwarp query files are plain-text files, containing the unique identifier, nucleotide sequence and reactivity profile, one per line, for each query to be searched against the database. The typical SHAPEwarp query file has the following structure:
Query_1
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,0.172,0.017,0.034,0.224,0.301,0.135,0.513,0.866,0.816,1.305,0.267,0.149,0.023,0.160,0.226,0.130,0.389,0.114,0.082,0.299,0.803,0.034,0.047,0.046,0.426,0.309,0.603,1.189,0.216,0.067,0.048,0.047,0.055,0.056,0.027,0.348,0.098,0.313,0.887,0.631,0.119,0.322,0.045,0.379,0.604,0.889,0.341,0.286,0.247,0.780,0.816,0.698,0.494,0.877,0.481,0.483,NaN,0.814,0.624,0.880,0.286,0.382,0.141,0.148,0.125,0.226,0.051,0.075,NaN,NaN,0.070,0.278,0.826,1.836,2.690,0.802,0.648,0.619,NaN,NaN,0.273,0.270,NaN,1.163,0.081,0.163,0.466,0.259,0.283,0.539,0.490,0.327,0.610,1.171,2.621,0.453,0.142,0.165,0.156,0.308,0.361,0.537,0.571,0.545,0.492,0.704,0.514,1.624,1.210,2.453,3.827,0.433,0.216,0.024,0.046,0.013,0.028,0.116,0.176,0.638,NaN,0.281,0.808,0.595,0.085,0.103,0.278,0.083,0.056,0.165,0.049,0.267,0.471,0.282,0.772,0.068,0.069,0.295,0.058,0.506,0.359,0.157,0.090,0.180,0.140,0.098,0.166,0.124,0.186,0.173,0.352,0.392,0.197,0.079,0.072,0.013,0.027,0.040,0.081,0.151,0.272,0.451,1.250,0.953,0.045,0.151,2.557,0.899,0.667,1.237,0.914,0.609,NaN,0.042,0.074,0.108,0.459,1.154,0.419,0.498,0.113,0.053,0.100,0.364,0.157,0.487,0.608,2.693,0.679,0.306
Query_2
AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
0.490,0.327,0.610,1.171,2.621,0.453,0.142,0.165,0.156,0.308,0.361,0.537,0.571,0.545,0.492,0.704,0.514,1.624,1.210,2.453,3.827,0.433,0.216,0.024,0.046,0.013,0.028,0.116,0.176,0.638,NaN,0.281,0.808,0.595,0.085,0.103,0.278,0.083,0.056,0.165,0.049,0.267,0.471,0.282,0.772,0.068,0.069,0.295,0.058,0.506,0.359,0.157,0.090,0.180,0.140,0.098,0.166,0.124,0.186,0.173,0.352,0.392,0.197,0.079,0.072,0.013,0.027,0.040,0.081,0.151,0.272,0.451,1.250,0.953,0.045,0.151,2.557,0.899,0.667,1.237,0.914,0.609,NaN,0.042,0.074,0.108,0.459,1.154,0.419,0.498,0.113,0.053,0.100,0.364,0.157,0.487,0.608,2.693,0.679,0.306,0.124,0.103,0.209,0.631,1.584,0.590,0.836,NaN,1.426,2.272,0.990,0.501,0.437,0.072,0.056,0.117,0.116,0.165,0.582,0.407,0.007,0.029,0.034,0.031,0.048,NaN,0.127,0.236,0.447,0.212,0.331,0.386,0.041,0.051,0.004,0.058,0.073,0.165,0.735,0.025,0.055,0.145,0.604,0.394,0.898,0.306,0.121,0.170,0.557,NaN,1.113,1.024,NaN,0.779,0.025,0.086,0.029,0.163,0.020,0.121,0.228,0.389,0.666,0.485,0.900,0.385,0.044,0.033,0.011,0.026,0.052,0.000,NaN,0.309,0.023,0.066,0.088,0.076,0.641,0.954,0.327,0.174,0.108,0.046,0.012,0.069,0.189,0.310,0.073,0.030,0.064,0.058,0.182,0.110,0.197,0.362,1.155,1.123,0.400,1.363
SHAPEwarp output
The typical SHAPEwarp's output has the following structure:
Query DB Qstart Qend Dstart Dend Qseed Dseed Score P-value E-value
16S_600 16S 0 199 608 807 60-114 668-722 190.53 4.47e-11 6.39e-09 !
16S_500 16S 0 196 508 704 160-186 668-694 203.47 2.30e-10 3.06e-08 !
16S_200 16S 9 199 216 406 120-183 327-390 187.13 3.13e-10 3.41e-08 !
16S_700 16S 0 161 708 869 65-129 773-837 159.63 3.19e-09 5.07e-07 !
16S_1300 16S 0 180 1308 1488 10-112 1318-1420 199.02 3.94e-09 6.19e-07 !
16S_900 16S 7 170 916 1079 16-100 925-1009 173.76 2.23e-08 1.16e-06 !
16S_400 16S 19 199 427 607 97-144 505-552 159.77 1.09e-07 1.42e-05 !
16S_300 16S 4 194 311 501 20-83 327-390 146.14 2.70e-07 3.40e-05 !
16S_800 16S 0 199 808 1008 73-199 882-1008 137.66 5.85e-07 6.67e-05 !
16S_1100 16S 49 199 1157 1307 68-96 1176-1204 132.35 9.51e-07 9.70e-05 !
16S_0 16S 10 161 17 168 85-102 92-109 87.73 6.57e-05 4.66e-03 !
16S_1200 16S 0 199 1208 1407 110-199 1318-1407 131.31 3.52e-05 5.39e-03 !
16S_1100 16S 40 194 55 209 170-185 185-200 80.03 4.52e-04 0.05 ?
16S_400 23S 25 155 1252 1382 74-92 1301-1319 87.09 3.55e-04 0.05 ?
16S_800 23S 47 164 1608 1725 64-81 1625-1642 80.20 4.64e-04 0.05 ?
16S_800 16S 46 161 30 145 56-71 40-55 76.82 6.86e-04 0.08 ?
where:
Field | Description |
---|---|
Query | Query ID |
DB | Database entry's ID |
Qstart | First aligned base of the query |
Qend | Last aligned base of the query |
Dstart | First aligned base of the database entry |
Dend | Last aligned base of the database entry |
Qseed | HSG position in the query |
Dseed | HSG position in the database entry |
Score | Alignment score |
P-value | The probability of finding by chance an alignment with the same (or greater) score |
E-value | The number of expected matches by chance, with the same (or greater) score, in a database of the same size |
If the evalAlignFold
option has been enabled, three additional fields will be reported:
Field | Description |
---|---|
Alifold p-value | The probability of obtaining by chance a consensus structure with the same (or lower) free energy |
Dbpsupport | Fraction of base-pairs in the consensus structure, that are supported by the database entry's sequence |
Qbpsupport | Fraction of base-pairs in the consensus structure, that are supported by the query sequence |