SHAPEwarp uses an approach inspired by the BLAST heuristic algorithm to identify regions of RNA structure similarity in a database of chemical reactivity profiles generated by SHAPE (Selective 2'-hydroxyl acylation analyzed by primer extension).

SHAPEwarp Workflow

Given a user-provided query, the following analysis steps are performed:

  1. All the possible query kmers of a user-defined length are enumerated, filtering out those corresponding to low structure complexity regions (i.e. those characterized by a Gini coefficient below a certain cutoff).

  2. Kmers passing the complexity cutoff are then looked up in a database of reactivity profiles via MASS, a highly optimized algorithm for time series subsequences all-pairs-similarity search (TSAPSS) in just O(n2). Additionally, the percent GC content of the kmer and its matches is evaluated and matches having GC content deviating by more than a certain percentage from that of the query kmer, are discarded.

  3. Kmer matches lying on the same diagonal, within a maximum distance from each other, are then grouped into high scoring groups (HSGs).

  4. Ungapped HSGs are extended upstream and downstream by using a banded semi-global alignment algorithm, that incorporates both features of the Gotoh’s Smith-Waterman algorithm with affine gap penalties and of DTW. The extension process is stopped when the score drops below a certain threshold for more than a certain number of residues.

  5. A null model is then built by searching the same query in a database of randomly shuffled reactivity profiles and used to estimate the E-value of each database match.

  6. For those matches passing the user-defined E-value threshold, the aligned reactivity profiles can then be used to model the underlying structure by using the RNAalifold algorithm of the ViennaRNA package.


Usage

To list the required parameters, simply type:

$ SHAPEwarp --help
Parameter Type Description
-db or --database string Path to a database folder generated with swBuildDb
-q or --query string Path to the query file
Note: each entry should contain (one per row) the sequence id, the nucleotide sequence and a comma-separated list of SHAPE reactivities (see SHAPEwarp query files section below)
-o or --output string Output directory (Default: sw_out/)
-ow or --overwrite Overwrites the output directory (if the specified path already exists)
--threads int Number of processors to use (Default: 1)
--maxReactivity float Maximum value to which reactivities will be capped (Default: 1)
--maxAlignOverlap float If two significant alignments overlap by more than this value, the least significant one (the one with the lowest alignment score) will be discarded (Default: 0.5)
--nullHSGs int Number of HSGs in the shuffled database to be extended to build the null model (Default: 10000)
--incE or --inclusionEvalue float E-value threshold to consider an alignment significant (Default: 0.01)
--repE or --reportEvalue float E-value threshold to report a match (Default: 0.1
--makePlot Generates plots of aligned SHAPE reactivity (or probability) profiles
Note: plots are generated only for matches below the inclusion E-value cutoff
--reportAln string Reports sequence alignments in the specified format ([f]asta or [s]tockholm)
Note: alignments are reported only for matches below the inclusion E-value cutoff
--foldQuery Query SHAPE profile is first used to calculate a base-pairing probability profile, that is then used to search into the db
Note: the db must have been generated with the --foldDb option
Folding options (require --foldQuery or --evalAlignFold)
--slope float Slope for SHAPE reactivities conversion into pseudo-free energy contributions (Default: 1.8)
--intercept float Intercept for SHAPE reactivities conversion into pseudo-free energy contributions (Default: -0.6)
--maxBPspan int Maximum allowed base-pairing distance (Default: 600)
--noLonelyPairs Disallows lonely pairs (helices of 1 bp)
--noClosingGU Dissalows G:U wobbles at the end of helices
--temperature float Folding temperature (Default: 37.0)
Query folding-specific options (require --foldQuery)
--winSize int Size (in nt) of the sliding window for partition function calculation (Default: 800)
--offset int Offset (in nt) for partition function window sliding (Default: 200)
--winTrim int Number of bases to trim from both ends of partition function windows to avoid terminal biases (Default: 50)
--ignoreReact SHAPE reactivity is ignored when folding the query
Kmer lookup options
--minKmers int Minimum number of kmers required to form a High Scoring Group (HSG; Default: 2)
--maxKmerDist int Maximum distance between two kmers to be merged in a HSG (Default: 30)
--kmerLen int Length (in nt) of the kmers (Default: 15)
--kmerOffset int Sliding offset for extracting candidate kmers from the query (Default: 1)
--matchKmerGCcontent The sequence of a query kmer and the corresponding database match must have GC% contents differing no more than --kmerMaxGCdiff
--kmerMaxGCdiff float Maximum allowed GC% difference to retain a kmer match (requires --matchKmerGCcontent)
Note: the default value is automatically determined based on the chosen kmer length
--matchKmerSeq The sequence of a query kmer and the corresponding database match must differ no more than --kmerMaxSeqDist
--kmerMaxSeqDist float Maximum allowed sequence distance to retain a kmer match (requires --matchKmerSeq; Default: 0)
Note: when >= 1, this is interpreted as the absolute number of bases that are allowed to differ between the kmer and the matching region. When < 1, this is interpreted as a fraction of the kmer's length
--kmerMinComplexity float Minimum complexity (measured as Gini coefficient) of candidate kmers (Default: 0.3)
--kmerMaxMatchEveryNt int A kmer is allowed to match a database entry on average every this many nt (Default: 200)
Alignment options
--alignMatchScore float,float Minimum and maximum score reactivity differences below 0.5 will be mapped to (Default: 0,2)
--alignMismatchScore float,float Minimum and maximum score reactivity differences above 0.5 will be mapped to (Default: -6,-0.5)
--alignGapOpenPenal float Gap open penalty (Default: -14)
--alignGapExtPenal float Gap extension penalty (Default: -5)
--alignMaxDropOffRate float An alignment is allowed to drop by maximum this fraction of the best score encountered so far, before extension is interrupted (Default: 0.8)
--alignMaxDropOffBases int An alignment is allowed to drop below the best score encountered so far * --alignMaxDropOffRate by this number of bases, before extension is interrupted (Default: 8)
--alignLenTolerance float The maximum allowed tollerated length difference between the query and db sequences to look for the ideal alignment along the diagonal (measured as a fraction of the length of the shortest sequence between the db and the query) (Default: 0.1)
--alignScoreSeq Sequence matches are rewarded during the alignment
--alignSeqMatchScore float Score reward for matching bases (Default: 0.5)
--alignSeqMismatchScore float Score reward for matching bases (Default: -2)
Alignment folding evaluation options (see also "Folding options")
--evalAlignFold Alignments passing the --inclusionEvalue threshold, are further evaluated for the presence of a conserved RNA structure by using RNAalifold
--shufflings Number of shufflings to perform for each alignment (Default: 100)
--blockSize int Size (in nt) of the blocks for shuffling the alignment (Default: 3)
--inBlockShuffle Besides shuffling blocks, residues within each block will be shuffled as well
--minBpSupport float Minimum fraction of base-pairs of the RNAalifold-inferred structure that should be supported by both query and db sequence to retain a match (Default: 0.75)
--ribosumScoring Use RIBOSUM scoring matrix
--alignFoldPvalue float P-value threshold to consider signficant an RNA structure predicted by RNAalifold (Default: 0.05)
--RNAalifold string Path to RNAalifold executable (Default: assumes RNAalifold is in PATH)


Understanding the algorithm

While it is advisable for most users to run SHAPEwarp with its default parameters, as these are the results of a careful and thorough calibration, it might be useful to adjust the analysis on a case-by-case basis.

First of all, as SHAPE reactivities can have a variable range, depending on the normalization scheme that has been adopted, both query and database reactivities are capped to a maximum value of maxReactivity. During the kmer lookup phase, the algorithm enumerates all the possible kmers of size kmerLen in the user-provided query, filtering out those:

  1. encompassing NaN values
  2. having Gini coefficient < kmerMinComplextiy, corresponding to regions of low structural complexity (expected to match most transcripts)

Kmers are matched against the database using the MASS algorithm. Given a kmer and a database reactivity profile, MASS returns an array of distances of that kmer to each position of the database profile. Matches having a distance < μ - 3 × σ, where μ and σ are respectively the mean and the standard deviation of the distances, are retained. Kmer-match pairs are then filtered out on the basis of the following criteria:

  1. when matchGCcontent is enabled and the difference between the GC% content of the kmer and that of its database match is > kmerMaxGCdiff
  2. when matchKmerSeq is enabled and the sequence of the kmer and that of its database match differ by more than maxKmerSeqDist
  3. when the ratio between the length of the database transcript and the number of matches for a given kmer in that transcript is < kmerMaxMatchEveryNt

The retained kmers are then grouped into high-scoring groups (HSGs). HSGs are made of kmers residing on the same diagonal (so, corresponding to regions of the query and database that can be theoretically aligned without opening gaps). Each HSG is made of ≥ minKmers kmers, residing within a maximum distance of maxKmerDist from each other.

Each HSG constitutes the seed from which an alignment extension will be attempted, both upstream and downstream. The seed extension uses a dynamic programming algorithm that incorporates features of both dynamic time warping (DTW) and of Gotoh's Smith-Waterman implementation with affine gap penalties. In brief, if the reactivity difference between two bases is < 0.5, the corresponding alignment score is calculated by linearly mapping the reactivity difference to the alignMatchScore range. Similarly, if the reactivity difference is ≥ 0.5, the corresponding alignment score is calculated by linearly mapping the reactivity difference to the alignMismatchScore range. Affine gap penalties are controlled via the alignGapOpenPenal, the gap open penalty, and alignGapOpenExt, the gap extension penalty. During the matrix-filling stage of the algorithm, the alignment score is allowed to drop by not more than alignMaxDropOffRate × the best alignment score encountered so far, for no more than alignMaxDropOffBases. While the algorithm is sequence-agnostic by default, sequence can be optionally taken into account by enabling the alignScoreSeq option. When enabled, sequence matches are rewarded by alignSeqMatchScore and mismatches are penalized by alignSeqMismatchScore. To further speed-up the alignment phase, the alignment matrix is only filled in a band of size b around the diagonal, where b is defined as min(10, alignLenTolerance × min(q, d)), where q and d are respectively the length of the query and of the database entry. Therefore, a minimum band size of 10 nt is always granted. The final alignment score S is calculated as the sum of the scores of the HSG (the seed), plus the two upstream and downstream extensions.

After having extended any possible HSG in the database, the same procedure is repeated by searching the query in a database of shuffled reactivity profiles. A maximum of nullHSGs HSGs identified in the shuffled database are extended, and the resulting alignment scores are used to build the null model, hence allowing to calculate a p-value for each alignment from the real database. The p-value is the probability of identifying by chance a match having a score ≥ S. The E-value corresponds to the number of expected database matches having a score ≥ S and it is thus calculated as p-value × the number of extended HSGs. It might happen, especially with long queries, that multiple HSGs will be identified for the same query-database entry pair, hence resulting in multiple overlapping alignments. To prevent duplicate matches from being reported, if multiple alignments overlap by more than maxAlignOverlap, only the one with the lowest E-value is reported.

Optionally, if the evalAlignFold option is enabled, alignments passing the incE E-value cutoff can be evaluated for the presence of a consensus secondary structure, via RNAalifold. RNAalifold uses both the sequence alignment and the corresponding SHAPE reactivity information. SHAPE reactivities are converted into pseudo free energy contributions by using the approach from Deigan et al., 2009:

ΔGSHAPE(i) = slope · ln[Reactivity(i) + 1] + intercept

If less than minBpSupport base-pairs from the predicted consensus structure are supported for either of the two aligned RNAs, the match is discarded. Furthermore, the probability of predicting a structure with a free energy ≤ to that of the consensus structure inferred from the alignment is evaluated by randomly shuffling shufflings times the columns of the original alignment in blockSize-long blocks, followed by RNAalifold analysis. A z-score is calculated and the corresponding p-value determined. If the p-value is < alignFoldPvalue, the alignment and the inferred structure are reported.

Using SHAPEwarp with non-SHAPE chemical probing data

To further enable the search of reactivity profiles derived by probing with nucleobase-specific reagents (i.e. DMS, CMCT, etc.), SHAPEwarp implements the possibility of converting reactivities into base-pairing probabilities (or, more precisely, unpaired probabilities). This option is activated by enabling the foldQuery parameter. The partition function is calcolated for the query in sliding winSize-long windows, slid with an offset of offset nucleotides. To avoid end-biases due to folding being computed in windows, winTrim bases are trimmed from both ends of the partition function window. The conversion of reactivities into pseudo free energy contributions follows the same approach described above for RNAalifold (optimal slope and intercept can be set by using the same slope and intercept parameters). From the base-pairing probabilities then the probability of each base of being unpaired is determined and used to search in the target database. The database must have been generated by enabling the foldDb option of swBuildDb.

SHAPEwarp query files

SHAPEwarp query files are plain-text files, containing the unique identifier, nucleotide sequence and reactivity profile, one per line, for each query to be searched against the database. The typical SHAPEwarp query file has the following structure:

Query_1
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,0.172,0.017,0.034,0.224,0.301,0.135,0.513,0.866,0.816,1.305,0.267,0.149,0.023,0.160,0.226,0.130,0.389,0.114,0.082,0.299,0.803,0.034,0.047,0.046,0.426,0.309,0.603,1.189,0.216,0.067,0.048,0.047,0.055,0.056,0.027,0.348,0.098,0.313,0.887,0.631,0.119,0.322,0.045,0.379,0.604,0.889,0.341,0.286,0.247,0.780,0.816,0.698,0.494,0.877,0.481,0.483,NaN,0.814,0.624,0.880,0.286,0.382,0.141,0.148,0.125,0.226,0.051,0.075,NaN,NaN,0.070,0.278,0.826,1.836,2.690,0.802,0.648,0.619,NaN,NaN,0.273,0.270,NaN,1.163,0.081,0.163,0.466,0.259,0.283,0.539,0.490,0.327,0.610,1.171,2.621,0.453,0.142,0.165,0.156,0.308,0.361,0.537,0.571,0.545,0.492,0.704,0.514,1.624,1.210,2.453,3.827,0.433,0.216,0.024,0.046,0.013,0.028,0.116,0.176,0.638,NaN,0.281,0.808,0.595,0.085,0.103,0.278,0.083,0.056,0.165,0.049,0.267,0.471,0.282,0.772,0.068,0.069,0.295,0.058,0.506,0.359,0.157,0.090,0.180,0.140,0.098,0.166,0.124,0.186,0.173,0.352,0.392,0.197,0.079,0.072,0.013,0.027,0.040,0.081,0.151,0.272,0.451,1.250,0.953,0.045,0.151,2.557,0.899,0.667,1.237,0.914,0.609,NaN,0.042,0.074,0.108,0.459,1.154,0.419,0.498,0.113,0.053,0.100,0.364,0.157,0.487,0.608,2.693,0.679,0.306
Query_2
AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
0.490,0.327,0.610,1.171,2.621,0.453,0.142,0.165,0.156,0.308,0.361,0.537,0.571,0.545,0.492,0.704,0.514,1.624,1.210,2.453,3.827,0.433,0.216,0.024,0.046,0.013,0.028,0.116,0.176,0.638,NaN,0.281,0.808,0.595,0.085,0.103,0.278,0.083,0.056,0.165,0.049,0.267,0.471,0.282,0.772,0.068,0.069,0.295,0.058,0.506,0.359,0.157,0.090,0.180,0.140,0.098,0.166,0.124,0.186,0.173,0.352,0.392,0.197,0.079,0.072,0.013,0.027,0.040,0.081,0.151,0.272,0.451,1.250,0.953,0.045,0.151,2.557,0.899,0.667,1.237,0.914,0.609,NaN,0.042,0.074,0.108,0.459,1.154,0.419,0.498,0.113,0.053,0.100,0.364,0.157,0.487,0.608,2.693,0.679,0.306,0.124,0.103,0.209,0.631,1.584,0.590,0.836,NaN,1.426,2.272,0.990,0.501,0.437,0.072,0.056,0.117,0.116,0.165,0.582,0.407,0.007,0.029,0.034,0.031,0.048,NaN,0.127,0.236,0.447,0.212,0.331,0.386,0.041,0.051,0.004,0.058,0.073,0.165,0.735,0.025,0.055,0.145,0.604,0.394,0.898,0.306,0.121,0.170,0.557,NaN,1.113,1.024,NaN,0.779,0.025,0.086,0.029,0.163,0.020,0.121,0.228,0.389,0.666,0.485,0.900,0.385,0.044,0.033,0.011,0.026,0.052,0.000,NaN,0.309,0.023,0.066,0.088,0.076,0.641,0.954,0.327,0.174,0.108,0.046,0.012,0.069,0.189,0.310,0.073,0.030,0.064,0.058,0.182,0.110,0.197,0.362,1.155,1.123,0.400,1.363


SHAPEwarp output

The typical SHAPEwarp's output has the following structure:

Query     DB   Qstart  Qend   Dstart  Dend    Qseed   Dseed      Score   P-value   E-value
16S_600   16S  0       199    608     807     60-114  668-722    190.53  4.47e-11  6.39e-09  !
16S_500   16S  0       196    508     704     160-186 668-694    203.47  2.30e-10  3.06e-08  !
16S_200   16S  9       199    216     406     120-183 327-390    187.13  3.13e-10  3.41e-08  !
16S_700   16S  0       161    708     869     65-129  773-837    159.63  3.19e-09  5.07e-07  !
16S_1300  16S  0       180    1308    1488    10-112  1318-1420  199.02  3.94e-09  6.19e-07  !
16S_900   16S  7       170    916     1079    16-100  925-1009   173.76  2.23e-08  1.16e-06  !
16S_400   16S  19      199    427     607     97-144  505-552    159.77  1.09e-07  1.42e-05  !
16S_300   16S  4       194    311     501     20-83   327-390    146.14  2.70e-07  3.40e-05  !
16S_800   16S  0       199    808     1008    73-199  882-1008   137.66  5.85e-07  6.67e-05  !
16S_1100  16S  49      199    1157    1307    68-96   1176-1204  132.35  9.51e-07  9.70e-05  !
16S_0     16S  10      161    17      168     85-102  92-109     87.73   6.57e-05  4.66e-03  !
16S_1200  16S  0       199    1208    1407    110-199 1318-1407  131.31  3.52e-05  5.39e-03  !
16S_1100  16S  40      194    55      209     170-185 185-200    80.03   4.52e-04  0.05      ?
16S_400   23S  25      155    1252    1382    74-92   1301-1319  87.09   3.55e-04  0.05      ?
16S_800   23S  47      164    1608    1725    64-81   1625-1642  80.20   4.64e-04  0.05      ?
16S_800   16S  46      161    30      145     56-71   40-55      76.82   6.86e-04  0.08      ?

where:

Field Description
Query Query ID
DB Database entry's ID
Qstart First aligned base of the query
Qend Last aligned base of the query
Dstart First aligned base of the database entry
Dend Last aligned base of the database entry
Qseed HSG position in the query
Dseed HSG position in the database entry
Score Alignment score
P-value The probability of finding by chance an alignment with the same (or greater) score
E-value The number of expected matches by chance, with the same (or greater) score, in a database of the same size

If the evalAlignFold option has been enabled, three additional fields will be reported:

Field Description
Alifold p-value The probability of obtaining by chance a consensus structure with the same (or lower) free energy
Dbpsupport Fraction of base-pairs in the consensus structure, that are supported by the database entry's sequence
Qbpsupport Fraction of base-pairs in the consensus structure, that are supported by the query sequence