Development and Validation of the JSI Splice Site Prediction Tool

Introduction


The JSI splice site prediction tool predicts changes in the quality of splice sites at, and in close proximity to, a site of genetic variation. To do so, it calculates several scores. Some of these take into account not only the overall quality of the known splice motif but also the probability that the respective sequence occurs as a known splice site anywhere in the genome. The tool has been trained on approximately 200 000 splice sites, i.e. the known splice sites throughout the whole genome (GRCh37).

Additionally, the MaxEntScan scoring tool for human splice sites has been integrated, and its scores are displayed alongside the JSI scores for comparison. We would like to thank Gene Yeo for kindly permitting us to integrate the MaxEntScan scoring algorithm into our software.

Calculated Scores


5' spliceSITE scores
Score Total score indicating the overall quality of a (possible) splice site.
A positive value predicts a functional splice site.
A negative value predicts that this position is not a functional splice site. Negative values below -1000 are shown as "neg.".
BCS Base Consensus Score: summed up likelihood for single bases 3 bp upstream to 4 bp downstream of GT
MCS Motif Consensus Score: calculation based on the likelihood and frequency of the complete 9mer sequence of the donor splice site (5'<3bp>GT<4bp>3')
5' MaxEntScan scores
ENT Maximum Entropy Model by MaxEntScan.
MDD Maximum Dependence Decomposition Model by MaxEntScan.
MM First-order Markov Model by MaxEntScan.
WMM Weight Matrix Model by MaxEntScan.
3' spliceSITE scores
Score Total score indicating the overall quality of a (possible) splice site.
A positive value predicts a functional splice site.
A negative value predicts that this position is not a functional splice site.
Negative values below -1000 are shown as "neg.".
BCS Base Consensus Score: summed up likelihood for single bases upstream of AG.
MCS Motif Consensus Score: the sequence upstream of AG is divided into 6mer oligos. The score is calculated based on the frequency of each 6mer at a given position.
CAGG Base Consensus Score: indicates the splice site quality by summing up the likelihood of the occurrence of the bases flanking the AG.
BP Branch Point: the branch point sequence should match yTnAy.
The branch point must be located in the AGEZ (AG Exception Zone).
BPPos Branch Point Position: optimum is 23 bp upstream of 3' splice site.
PPT Polypyrimidine tract score, indicates C/T content of the polypyrimidine tract between Branch Point and 3' splice site.
U2BE U2 binding energy score, indicates the binding capacity of the branch point motif with respect to the U2 spliceosomal unit.
3' MaxEntScan scores
ENT Maximum Entropy Model by MaxEntScan.
MM First-order Markov Model by MaxEntScan.
WMM Weight Matrix Model by MaxEntScan.
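
As an illustration of two of the 3' features listed above, the following sketch searches an intron end for the yTnAy branch point consensus (y = C or T, n = any base) and computes the C/T fraction of a stretch of sequence. The function names and the example sequence are made up for illustration, and the plain C/T fraction is only a stand-in: how the tool actually derives the BP and PPT scores is not described here.

```python
import re

# Consensus from the score list: y = C or T, n = any base.
# The lookahead lets overlapping motif occurrences be reported as well.
BRANCH_POINT_RE = re.compile(r"(?=([CT]T[ACGT]A[CT]))")

def find_branch_point_candidates(intron_seq):
    """Return (0-based start, motif) for every yTnAy match in the sequence."""
    seq = intron_seq.upper()
    return [(m.start(), m.group(1)) for m in BRANCH_POINT_RE.finditer(seq)]

def pyrimidine_fraction(seq):
    """Fraction of C/T bases in seq (illustrative stand-in for a PPT score)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "CT" for base in seq) / len(seq)

# Hypothetical intron end finishing in AG (not taken from any gene in this document).
intron_end = "GGACTAACTTTCTCTTTCCTTTCAG"
candidates = find_branch_point_candidates(intron_end)

# Stretch between the end of the first candidate motif and the terminal AG.
start, motif = candidates[0]
tract = intron_end[start + len(motif):len(intron_end) - 2]
print(candidates, tract, pyrimidine_fraction(tract))
```

A real implementation would additionally have to check the position of the candidate relative to the 3' splice site and the AGEZ requirement, which this sketch leaves out.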

Accuracy Test 1: wildtype splice sites


Splice sites were predicted for the wild-type genomic reference sequences of several genes.

For every GT and AG in the sequence, the scores for a 5' or 3' splice site were calculated, respectively. A score above zero indicates that the site is predicted to be a splice site; a score below zero indicates that it is not.
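
The scanning step can be pictured with a minimal sketch that lists every GT together with its 9mer donor window (3 bp upstream, GT, 4 bp downstream, as defined for the MCS) and every AG position. The function names and the example sequence are hypothetical, and skipping GTs too close to the sequence ends is an assumption of this sketch; the scoring itself is not reproduced.

```python
def donor_windows(seq):
    """Return (GT index, 9mer) for every GT that has 3 bp upstream and 4 bp downstream."""
    seq = seq.upper()
    windows = []
    for i in range(len(seq) - 1):
        # The window spans seq[i-3 : i+6]: 3 bases, GT, 4 bases.
        if seq[i:i + 2] == "GT" and i >= 3 and i + 6 <= len(seq):
            windows.append((i, seq[i - 3:i + 6]))
    return windows

def acceptor_positions(seq):
    """Return the 0-based index of the A of every AG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]

# Hypothetical sequence, for illustration only.
example = "CAGGTAAGTAG"
print(donor_windows(example))
print(acceptor_positions(example))
```

Each window or position found this way would then be handed to the 5' or 3' scoring functions.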

As the authentic and alternative splice sites of these genes are known, the predictions can then be compared with the actual splice sites.

Additionally, the MaxENT algorithm was tested in the same manner to compare the predicted results. For MaxENT, a score above 3.5 was interpreted as a predicted splice site.
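
The two decision rules can be written down as a short sketch. The thresholds are the ones stated above; the function name is hypothetical, and treating a score exactly equal to the threshold as "not predicted" is an assumption. The example values are the scores listed in appendix 1 for the authentic TP53 5' splice site IVS6+1.

```python
JSI_THRESHOLD = 0.0      # JSI: a score above zero counts as a predicted splice site
MAXENT_THRESHOLD = 3.5   # MaxENT: a score above 3.5 counts as a predicted splice site

def predicted_as_splice_site(score, threshold):
    """A site counts as predicted when its score lies strictly above the threshold."""
    return score > threshold

# TP53 IVS6+1 (5'), scores as listed in appendix 1.
jsi_score, maxent_score = 1171, 2.59
jsi_call = predicted_as_splice_site(jsi_score, JSI_THRESHOLD)
maxent_call = predicted_as_splice_site(maxent_score, MAXENT_THRESHOLD)
print(jsi_call, maxent_call)  # True False
```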

For authentic (including alternative) splice sites, the JSI tool correctly predicted more (usually all) splice sites than MaxENT. In cases where MaxENT also predicted all authentic splice sites of a gene correctly, the JSI tool's results were the same.
Hence, the JSI splice site prediction tool showed a lower false negative rate in this test.

Conversely, the JSI splice site tool is configured to be more sensitive, so it assigns positive scores to more cryptic splice sites than MaxENT does. The higher detection rate reduces the chance of missing authentic splice sites. The resulting higher number of false positive calls is compensated for in a subsequent step: the scores of each called splice site are compared between the reference and the altered sequence. Calls of non-authentic splice sites with positive scores can thus easily be filtered out if their score is the same for the reference and the altered sequence.
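
A minimal sketch of this filtering step, assuming the scores are available as dictionaries keyed by site. The function name, the `tolerance` parameter and all scores are hypothetical and only illustrate the comparison described above.

```python
def changed_calls(ref_scores, alt_scores, tolerance=0.0):
    """Keep only sites whose score differs between reference and altered sequence.

    Sites present in only one of the two score sets are kept as well,
    with None standing in for the missing score.
    """
    changed = {}
    for site in sorted(set(ref_scores) | set(alt_scores)):
        ref = ref_scores.get(site)
        alt = alt_scores.get(site)
        if ref is None or alt is None or abs(ref - alt) > tolerance:
            changed[site] = (ref, alt)
    return changed

# Hypothetical scores: site_b is a positive-scoring cryptic site unaffected by the variant.
reference = {"site_a": 850.0, "site_b": 120.0, "site_c": -300.0}
altered = {"site_a": -40.0, "site_b": 120.0, "site_c": 75.0}
print(changed_calls(reference, altered))
```

Here `site_b` drops out because its score is identical in both sequences, while `site_a` and `site_c` remain for interpretation.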

Please refer to Appendix 1 for the detailed result data of five example genes.


Accuracy Test 2: validated splice site variants


For the second test, the JSI Splice Site Prediction Tool was used to predict the effects of known splicing variants. All obvious splicing variants with alterations of the GT (5') or AG (3') of authentic splice sites are always displayed.

An intensive literature search was then conducted for thoroughly validated splicing variants that are not caused by alterations of the highly conserved GT or AG bases. Compared to the variants affecting GT or AG, fewer validated variants were found. This may be because variants not directly affecting the GT or AG of a splice site were often not suspected to be splice-altering and were therefore not examined for such an effect, and because a thorough evaluation is generally costly.

The JSI Splice Site Prediction Tool proved to be very accurate when used to predict the effects of variants not altering the consensus GT or AG.
Especially for 3' splice sites, the prediction was more accurate than that of MaxENT. This is because the JSI Splice Site Prediction Tool analyses a longer sequence than MaxENT. Whilst MaxENT only analyses 20 bases upstream of the AG, the JSI Splice Site Prediction Tool includes a wider range of upstream bases to cover the additional features of a 3' splice site. It is thereby able to detect changes in the polypyrimidine tract, branch point and AG exception zone.

Moreover, the JSI Splice Site Prediction Tool compares the scores of all possible splice sites surrounding the variant. Thus, it can predict when a new or cryptic splice site becomes a competitor of an authentic splice site.
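
The comparison of neighbouring sites can be sketched as follows. The function name, the site labels and the scores are hypothetical; which sites the tool considers as "surrounding" and how it ranks them is not specified here.

```python
def competing_sites(scores, authentic_site):
    """Return sites with a positive score that exceeds the authentic site's score."""
    authentic_score = scores[authentic_site]
    return sorted(
        site
        for site, score in scores.items()
        if site != authentic_site and score > 0 and score > authentic_score
    )

# Hypothetical scores on the altered sequence.
altered_scores = {
    "authentic": 120.0,
    "cryptic_1": 430.0,   # positive and above the authentic site: a competitor
    "cryptic_2": -75.0,   # negative: not predicted as a splice site
    "cryptic_3": 90.0,    # positive but weaker than the authentic site
}
competitors = competing_sites(altered_scores, "authentic")
print(competitors)  # ['cryptic_1']
```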

Furthermore, the tool proved capable of detecting alterations in splice site strength (example 8 in Appendix 2). There, the variant decreased the score, but not below the scores of surrounding cryptic splice sites. The prediction of this "medium decrease" in score matches the actual effect: a weakened splice site that results in lowered amounts of mRNA transcript.
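
One plausible way to express such a decrease is the relative change of the score between reference and altered sequence. This is only an assumption: the document reports a percentage for example 8 but neither the underlying scores nor the formula, and the cut-offs that separate a "medium" from other decreases are not given. The function name and the scores below are hypothetical.

```python
def relative_change_percent(ref_score, alt_score):
    """Relative change of a score in percent; negative values mean a decrease."""
    if ref_score == 0:
        raise ValueError("relative change is undefined for a reference score of zero")
    return (alt_score - ref_score) / abs(ref_score) * 100.0

# Hypothetical scores chosen so that the decrease is about 17.2 %.
change = relative_change_percent(500.0, 414.0)
print(round(change, 2))
```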

Appendix 2 contains examples from the list of validated variants that were used to compare predicted and actual effect. To view the detailed scores, also of MaxENT, please click the given link on the gene name, which will open another browser tab with the splice site prediction query for the respective variant.


Appendix 1: Results of the Accuracy Test 1


Procedure

Splice sites were predicted for the wild-type genomic reference sequences of several genes.

For every GT and AG in the sequence, the scores for a 5' or 3' splice site were calculated, respectively. A score above zero indicates that the site is predicted to be a splice site; a score below zero indicates that it is not.

As the authentic and alternative splice sites of these genes are known, the predictions can then be compared with the actual splice sites.

Additionally, the MaxENT algorithm was tested in the same manner to compare the predicted results. For MaxENT, a score above 3.5 was interpreted as a predicted splice site.

Results


Gene: TP53

Exons: 11
Genomic Sequence: 25722 bp

5' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        10        9
Falsely predicted authentic SS           0        1*
Correctly predicted cryptic SS        1058     1124
Falsely predicted cryptic SS           184      118

3' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        10        9
Falsely predicted authentic SS           0        1*
Correctly predicted cryptic SS        1674     1831
Falsely predicted cryptic SS           328      171

* Details of false predictions for authentic splice sites:

Splice Site Type  Position        JSI SSP Score  MaxENT Score
5'                IVS6+1                   1171          2.59
3'                IVS7-2                    888          3.21

Gene: BRCA1

Exons: 24
Genomic Sequence: 81189 bp

5' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        23       22
Falsely predicted authentic SS           0        1*
Correctly predicted cryptic SS        3818     4006
Falsely predicted cryptic SS           581      393

3' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        22       21
Falsely predicted authentic SS           1*       2*
Correctly predicted cryptic SS        4894     5500
Falsely predicted cryptic SS          1219      613

* Details of false predictions for authentic splice sites:

Splice Site Type  Position        JSI SSP Score  MaxENT Score
5'                IVS6+1                   1160          3.23
3'                IVS1-2 (5'UTR)           -465          4.90
3'                IVS7-2                    725          2.82
3'                IVS13-2                   797          1.93

Gene: COL1A1

Exons: 51
Genomic Sequence: 18351 bp

5' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        50       50
Falsely predicted authentic SS           0        0
Correctly predicted cryptic SS         771      798
Falsely predicted cryptic SS            89       62

3' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        50       50
Falsely predicted authentic SS           0        0
Correctly predicted cryptic SS         981     1054
Falsely predicted cryptic SS           216      143

Gene: RAD51C

Exons: 9
Genomic Sequence: 41770 bp

5' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS         8        7
Falsely predicted authentic SS           0        1*
Correctly predicted cryptic SS        1965     2046
Falsely predicted cryptic SS           271      190

3' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS         8        8
Falsely predicted authentic SS           0        0
Correctly predicted cryptic SS        2139     2475
Falsely predicted cryptic SS           730      394

* Details of false predictions for authentic splice sites:

Splice Site Type  Position        JSI SSP Score  MaxENT Score
5'                IVS8+1                    761          1.98

Gene: ATM

Exons: 63
Genomic Sequence: 146619 bp

5' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        59       58
Falsely predicted authentic SS           2*       3*
Correctly predicted cryptic SS        7104     7327
Falsely predicted cryptic SS           854      631

3' Splice Sites                    JSI SSP   MaxENT
Correctly predicted authentic SS        61       58
Falsely predicted authentic SS           1*       4*
Correctly predicted cryptic SS        7748     8842
Falsely predicted cryptic SS          2432     1338

* Details of false predictions for authentic splice sites:

Splice Site Type  Position        JSI SSP Score  MaxENT Score
5'                IVS32+1                  -500         -2.26
5'                IVS35+1                  -207          0.90
5'                IVS58+1                   677          3.39
3'                IVS12-2                   -40          2.57
3'                IVS33-2                   676          2.46
3'                IVS39-2                   605          3.03
3'                IVS49-2                   650          1.87
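
To compare the two tools on a common scale, the counts in the tables above can be turned into sensitivity and specificity. The sketch below does this for the ATM 5' splice sites. It assumes the four table rows correspond, in order, to true positives (authentic sites predicted), false negatives (authentic sites missed), true negatives (non-authentic GT/AG sites scored as non-sites) and false positives (non-authentic sites scored as splice sites). This reading is consistent with the main text and with the row sums but is not stated explicitly. The class and variable names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Counts:
    tp: int  # "Correctly predicted authentic SS"
    fn: int  # "Falsely predicted authentic SS"
    tn: int  # "Correctly predicted cryptic SS"
    fp: int  # "Falsely predicted cryptic SS"

    @property
    def sensitivity(self):
        """Share of authentic splice sites that were predicted."""
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self):
        """Share of non-authentic sites that were correctly not predicted."""
        return self.tn / (self.tn + self.fp)

# ATM, 5' splice sites, counts copied from the table above.
atm_5prime = {
    "JSI SSP": Counts(tp=59, fn=2, tn=7104, fp=854),
    "MaxENT": Counts(tp=58, fn=3, tn=7327, fp=631),
}

for tool, counts in atm_5prime.items():
    print(f"{tool}: sensitivity={counts.sensitivity:.3f}, specificity={counts.specificity:.3f}")
```

Under this reading, the JSI tool shows the higher sensitivity and MaxENT the higher specificity for this gene, in line with the trade-off described for Accuracy Test 1.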

Appendix 2: Results of the Accuracy Test 2


Procedure

The JSI Splice Site Prediction Tool was used to predict the effects of known splicing variants.

Results


# Gene HGVS Correct prediction JSI SS Prediction summary Actual Effect Disease Reference
1 CDKN2A ⇒ NM_000077
HGVS: c.457+1G>T
JSI SS prediction summary: Possible loss of function for authentic splice site at c.457+1. Score for cryptic splice site now highest score at c.384.
Actual effect: ss abolished
Disease: Pancreatic cancer/melanoma syndrome
Reference: Mucaki EJ, Shirley BC, Rogan PK: Prediction of Mutant mRNA Splice Isoforms by Information Theory-Based Exon Definition. Hum Mutat. 2013; 34(4): 557–565.

2 BRCA1 ⇒ NM_007294
HGVS: c.212+1G>A
JSI SS prediction summary: Possible loss of function for authentic splice site at c.212+1, score for cryptic splice site at c.212+13 now highest score. Alternative splice site may be activated at c.191.
Actual effect: ss abolished, cryptic ss 22 nt upstream activated, deletion of 22 nucleotides from exon 4
Disease: Breast Cancer
Reference: Mucaki EJ, Ainsworth P, Rogan PK: Comprehensive prediction of mRNA splicing effects of BRCA1 and BRCA2 variants. Hum Mutat. 2011; 32(7): 735–742.

3 POLH ⇒ NM_006502
HGVS: c.490G>T
JSI SS prediction summary: Possible loss of function for authentic 5' splice site at c.490+1. (3' splice site at c.490+11: Score became positive.)
Actual effect: ss abolished
Disease: Xeroderma pigmentosum, variant type
Reference: Inui H, et al.: Xeroderma pigmentosum-variant patients from America, Europe, and Asia. J Invest Dermatol. 2008; 128(8): 2055–2068.

4 PLP1 ⇒ NM_000533
HGVS: c.173A>G
JSI SS prediction summary: Possible new splice site at c.173 with higher score than authentic splice site at c.191+1.
Actual effect: ss abolished, cryptic ss 19 nt upstream activated, deletion of 19 nucleotides from exon 3
Disease: Pelizaeus-Merzbacher disease
Reference: Bonnet-Dupeyron MN, Combes P, Santander P, et al.: PLP1 splicing abnormalities identified in Pelizaeus-Merzbacher disease and SPG2 fibroblasts are associated with different types of mutations. Hum Mutat. 2008; 29(8): 1028–1036.

5 PLP1 ⇒ NM_000533
HGVS: c.454-10A>G
JSI SS prediction summary: Score for cryptic splice site at c.454-43 now highest score. Possible new splice site at c.454-11 with higher score than authentic splice site at c.454-2. Possible loss of function for authentic 3' splice site at c.454-2. (5' splice site at c.454-14: Score became positive.)
Actual effect: retention of last 9bp of exon due to cryptic exon 4 skipping, intron retention
Disease: Pelizaeus-Merzbacher disease
Reference: Bonnet-Dupeyron MN, Combes P, Santander P, et al.: PLP1 splicing abnormalities identified in Pelizaeus-Merzbacher disease and SPG2 fibroblasts are associated with different types of mutations. Hum Mutat. 2008; 29(8): 1028–1036.

6 XPC ⇒ NM_004628
HGVS: c.413-9T>A
JSI SS prediction summary: Score for cryptic splice site at c.413-61 now highest score. Possible new splice site at c.413-9 with higher score than authentic splice site at c.413-2. Possible loss of function for authentic splice site at c.413-2.
Actual effect: ss abolished, de novo ss created
Disease: Xeroderma pigmentosum
Reference: Khan SG, Metin A, Gozukara E, et al.: Two essential splice lariat branchpoint sequences in one intron in a xeroderma pigmentosum DNA repair gene: mutations result in reduced XPC mRNA levels that correlate with cancer risk. Hum Mol Genet. 2004; 13(3): 343–352.

7 CERS3 ⇒ NM_001290343
HGVS: c.609+1G>T
JSI SS prediction summary: Score for cryptic splice site at c.540 now highest score. Possible loss of function for authentic splice site at c.609+1.
Actual effect: ss abolished
Disease: Autosomal recessive congenital ichthyosis 9
Reference: Radner et al.: Mutations in CERS3 cause autosomal recessive congenital ichthyosis in humans. PLoS Genet. 2013 Jun; 9(6).

8 SMAD4 ⇒ NM_005359
HGVS: c.1448-6T>C
JSI SS prediction summary: Medium decrease of score for authentic splice site at c.1448-2 (17.20%)
Actual effect: ss weakened, reduced amount of transcript
Disease: Hereditary pulmonary arterial hypertension
Reference: Nasim MT, Ogo T, Ahmed M, et al.: Molecular genetic characterization of SMAD signaling molecules in pulmonary arterial hypertension. Hum Mutat. 2011; 32(12): 1385–1389.

9 ABCA1 ⇒ NM_005502
HGVS: c.4465-34A>G
JSI SS prediction summary: Possible new splice site at c.4465-35 with higher score than authentic splice site at c.4465-2.
Actual effect: 33bp of intron 31 included in transcript
Disease: Tangier disease
Reference: Fasano T, Pisciotta L, Bocchi L, et al.: Lysosomal lipase deficiency: molecular characterization of eleven patients with Wolman or cholesteryl ester storage disease. Mol Genet Metab. 2012; 105(3): 450–456.

10 ABCA1 ⇒ NM_005502
HGVS: c.1195-27G>A
JSI SS prediction summary: Possible new splice site at c.1195-27 with higher score than authentic splice site at c.1195-2. Possible loss of function for authentic splice site at c.1195-2.
Actual effect: 25bp of intron 10 included in transcript
Disease: Tangier disease
Reference: Fasano T, Pisciotta L, Bocchi L, et al.: Lysosomal lipase deficiency: molecular characterization of eleven patients with Wolman or cholesteryl ester storage disease. Mol Genet Metab. 2012; 105(3): 450–456.

11 CFTR ⇒ NM_006846
HGVS: c.1820+53G>A
JSI SS prediction summary: Score at c.1820+55 becomes higher than score for authentic splice site at c.1820+1.
Actual effect: retention of 54bp of intron 19 with normal protein expression
Disease: Netherton Syndrome
Reference: Lacroix M, Lacaze-Buzy L, Furio L, et al.: Clinical expression and new SPINK5 splicing defects in Netherton syndrome: unmasking a frequent founder synonymous mutation and unconventional intronic mutations. J Invest Dermatol. 2012; 132(3 Pt 1): 575–582.

12 SPINK5 ⇒ NM_006846
HGVS: c.1820+53G>A
JSI SS prediction summary: Score at c.1820+55 becomes higher than score for authentic splice site at c.1820+1.
Actual effect: retention of 54bp of intron 19 with normal protein expression
Disease: Netherton Syndrome
Reference: Lacroix M, Lacaze-Buzy L, Furio L, et al.: Clinical expression and new SPINK5 splicing defects in Netherton syndrome: unmasking a frequent founder synonymous mutation and unconventional intronic mutations. J Invest Dermatol. 2012; 132(3 Pt 1): 575–582.
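
For readers who want to process the variants listed above programmatically, the following sketch splits a simple coding-DNA substitution such as c.457+1G>T into its parts. It is a deliberately simplified pattern covering only the notation used in this appendix (exonic position, optional intronic offset, single-base substitution). It is not a full HGVS parser, and the function and field names are hypothetical.

```python
import re

# Simplified pattern: c.<position>[<+/-offset>]<ref>><alt>
HGVS_SUBSTITUTION_RE = re.compile(
    r"^c\.(?P<position>\d+)(?P<offset>[+-]\d+)?(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)

def parse_substitution(hgvs):
    """Split a simple c. substitution into position, intronic offset, ref and alt base."""
    match = HGVS_SUBSTITUTION_RE.match(hgvs)
    if match is None:
        raise ValueError(f"unsupported variant notation: {hgvs!r}")
    offset = int(match.group("offset")) if match.group("offset") else 0
    return {
        "position": int(match.group("position")),
        "offset": offset,            # 0 for exonic variants
        "ref": match.group("ref"),
        "alt": match.group("alt"),
        "intronic": offset != 0,
    }

# Variants taken from the list above.
for variant in ("c.457+1G>T", "c.490G>T", "c.454-10A>G"):
    print(variant, parse_substitution(variant))
```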