Scientific Abstract
Background
It is estimated that up to 50% of all disease-causing genetic variants disrupt splicing. Because splicing is complex, our ability to predict which variants disrupt it is limited, which leads to missed diagnoses for patients. The emergence of machine learning for targeted medicine holds great potential to improve the prediction of splice-disrupting variants. The recently published SpliceAI algorithm uses deep neural networks and has been reported to be more accurate than other commonly used methods.
Methods and findings
The original SpliceAI was trained on splice sites from primary isoforms combined with novel junctions observed in GTEx data, which might introduce noise and weaken the correlation between the model's input and its output. Restricting the training data to validated, manually annotated primary and alternatively spliced GENCODE sites may improve predictive ability. All isoforms of each gene were collapsed (aggregated into one pseudo-isoform) and the SpliceAI architecture was retrained (CI-SpliceAI). Predictive performance on a newly curated dataset of 1,316 functionally validated variants from the literature was compared against the original SpliceAI, MMSplice, MaxEntScan, and SQUIRLS.
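The collapsing step described above can be illustrated with a minimal sketch. It is not the CI-SpliceAI implementation: the `Transcript` type, the `collapse_isoforms` function, the forward-strand assumption, and the convention that a donor site is the last base of an exon and an acceptor site the first base of the next exon are all assumptions made for illustration.

```python
from typing import Iterable, NamedTuple


class Transcript(NamedTuple):
    """Hypothetical record for one annotated isoform on the forward strand."""
    gene: str
    exons: tuple  # tuple of (start, end) pairs, 1-based inclusive


def collapse_isoforms(transcripts: Iterable[Transcript]) -> dict:
    """Union the splice sites of all isoforms of a gene into one pseudo-isoform.

    Returns {gene: {"donor": set of positions, "acceptor": set of positions}}.
    """
    genes = {}
    for tx in transcripts:
        sites = genes.setdefault(tx.gene, {"donor": set(), "acceptor": set()})
        exons = sorted(tx.exons)
        # Assumed convention: the end of every exon except the last is a donor.
        for _, end in exons[:-1]:
            sites["donor"].add(end)
        # Assumed convention: the start of every exon except the first is an acceptor.
        for start, _ in exons[1:]:
            sites["acceptor"].add(start)
    return genes
```

With two isoforms of the same gene that differ only in one alternative acceptor, the pseudo-isoform carries both acceptors, so a model trained on it sees the primary and the alternatively spliced site as positive labels.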
Both SpliceAI algorithms outperformed the other methods, with the original SpliceAI achieving an accuracy of ~91%, and CI-SpliceAI showing an improvement at ~92% overall. Predictive accuracy increased in the majority of curated variants.
Conclusions
We show that including only manually annotated alternatively spliced sites in training data improves prediction of clinically relevant variants, and highlight avenues for further performance improvements.
Technical Abstract
I developed and open-sourced this project. The deep CNN models were trained on the IRIDIS HPC provided by the University of Southampton. Both the training and inference modules of CI-SpliceAI are released as open-source software that supports offline computation. A second open-source code base was developed to compare CI-SpliceAI with other software in the field.
A core piece of CI-SpliceAI is the online annotation website, which runs inference free of charge. Because deep learning is not easy to set up for less technical researchers in the field, the website lets researchers upload their variant data in a common format without requiring any technical knowledge. Predictions for variants are computed on a Google Cloud Functions backend and cached in a MySQL database. The Google workers are Docker containers running Python Flask servers, which in turn run the prediction model and annotation code.
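The compute-then-cache flow described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the portal's code: `sqlite3` stands in for the MySQL database so the example is self-contained, and the table layout, the `open_cache` and `annotate` functions, the variant key format, and the `predict` callable are all hypothetical.

```python
import json
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache (sqlite3 as a stand-in for MySQL) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions ("
        "variant TEXT PRIMARY KEY, scores TEXT NOT NULL)"
    )
    return conn


def annotate(conn: sqlite3.Connection, variant: str,
             predict: Callable[[str], dict]) -> dict:
    """Return cached scores for a variant, running the model only on a miss."""
    row = conn.execute(
        "SELECT scores FROM predictions WHERE variant = ?", (variant,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no model invocation
    scores = predict(variant)      # cache miss: run the (hypothetical) model
    conn.execute(
        "INSERT INTO predictions (variant, scores) VALUES (?, ?)",
        (variant, json.dumps(scores)),
    )
    conn.commit()
    return scores
```

Keying the cache on the variant means a repeated submission of the same variant costs a database lookup rather than a model invocation, which matters when compute is billed per invocation.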
The budget is monitored through a Google Pub/Sub service that notifies the backend when the monthly budget is depleted; the backend then shuts down all computation and prevents new invocations of the service. New data may still be submitted and will be processed by a cron job as soon as the new month starts and the budget is replenished.
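The budget logic above amounts to a small state machine, sketched below in memory. This is not the deployed code: the `BudgetGate` class, its method names, and the `cost` and `budget` parameters are hypothetical, and a real deployment would receive the notification from Pub/Sub and persist the queue in the database rather than in a Python object.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BudgetGate:
    """Hypothetical in-memory model of the budget shutdown and monthly replay."""
    budget_depleted: bool = False
    pending: deque = field(default_factory=deque)
    processed: list = field(default_factory=list)

    def on_budget_message(self, cost: float, budget: float) -> None:
        """Handle a budget notification; block computation once the budget is spent."""
        if cost >= budget:
            self.budget_depleted = True

    def submit(self, job: str) -> str:
        """Accept a job; run it now, or queue it while the budget is depleted."""
        if self.budget_depleted:
            self.pending.append(job)
            return "queued"
        self.processed.append(job)
        return "processed"

    def on_new_month(self) -> int:
        """Cron entry point: reset the gate and run queued jobs in arrival order."""
        self.budget_depleted = False
        count = 0
        while self.pending:
            self.processed.append(self.pending.popleft())
            count += 1
        return count
```

Submissions are never rejected in this sketch; they are only deferred, which mirrors the behaviour described above where data uploaded after the shutdown is picked up at the start of the next month.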
The tech stack is somewhat dated due to constraints imposed by the web host. The CI-SpliceAI portal uses PHP with Twig rendering and a MySQL database.
Access
Read the full text published in PLOS ONE, or visit the CI-SpliceAI web portal to access online predictions and all data and code.