Adaptive sequencing using nanopores and deep learning of mitochondrial DNA

From PubMed

Adaptive sequencing using nanopores and deep learning of mitochondrial DNA

Introduction

 

Next-generation sequencing
Next-generation sequencing has revolutionized DNA sequencing and laid the foundation for a plethora of scientific and clinical opportunities. One recent emerging sequencing technology uses nanopore sequencing (e.g. those developed by Oxford Nanopore Technologies, ONT) [1]. In this study, we used ONT’s portable MinION sequencer, which was released in 2014. The sequencing is performed by measuring changes in ionic current produced by individual nucleic acids as single DNA strands that pass through an array of protein nanopores. These changes are detected by a sensor and are saved on a computer for later analysis [2]. The recorded ionic current, known as the ‘raw signal’ or ‘squiggle’, is mainly used for basecalling by translating the raw signal into nucleotides. To date, the vast majority of studies that used nanopore sequencers ignored the raw signal after using it to generate a nucleotide sequence. A few studies, however, used this signal for other tasks, such as improving the accuracy of a consensus sequence, or for investigating chemical modifications on the DNA [3–5].

 

Deep learning
Deep learning is a subset of machine learning methods that have gained increased popularity in recent years, after overtaking other methods in the field of image classification [6]. Deep learning has been applied to fields such as image, video, audio and natural language processing, for performing tasks such as classification, generation, prediction and detection [6–8]. This raises the possibility that similar deep learning approaches can be applied to nanopore sequencing data analysis. These approaches include methods that have been used for audio signal analysis, such as convolutional neural network (CNN) and recurrent neural network (RNN) architectures [6, 9–11]. Initially, raw nanopore signals were translated to nucleotides using a Hidden Markov Model [12, 13]. However, the discovery that deep learning can perform the task better led to its use in translating a raw nanopore signal into a nucleotide sequence [14, 15]. Deep learning is also used to perform such tasks as predicting DNA methylation [16] and simulating a raw signal based on a reference genome [17]. These findings support our proposition that deep learning can be used to classify reads based on their raw signal. Hence, we tested several commonly used deep learning architectures that were previously applied on similar data, with the aim of selecting the preferable one for our analysis.

 

Traditional selective sequencing (targeted sequencing)
Selective sequencing (or sequencing of targeted genomic regions) is a widespread technique used in many applications designed to sequence specific portions of a DNA molecule from a larger pool of genetic material. The targeting of only a part of the DNA can save resources, time and money. Selective sequencing is traditionally based on physically isolating or amplifying desired parts of the total DNA during the library preparation steps and prior to sequencing [18–20]. Recently, this approach was also implemented during standard nanopore library preparation [21, 22]. Traditional selective methods, however, have been found to introduce bias to the output, such as lack of evenness of coverage and divergent results from different library preparation kits [23]. Additionally, all current targeted sequencing techniques rely on amplifying a region of interest that is based solely on the nucleotide sequence. This prompts the need for an alternative method.

 

Nanopore adaptive sequencing
With the introduction of nanopore sequencing, an exciting new feature, ‘Read Until’, enables selectively ‘rejecting’ DNA molecules before the entire molecule has been completely sequenced [24]. The rejection of molecules based on the initial portion of the DNA molecule potentially saves time and reagents by sequencing only selected DNA molecules. This establishes a unique method of selective sequencing on the nanopore sequencing platform. Several studies have demonstrated real-time selective sequencing using the nanopore Read Until feature. Notably, in a first published study, Loose et al. [24] demonstrated the ability to perform selective sequencing with the genome of Lambda phage; dynamic time warping was used to determine the DNA molecules to be sequenced. This approach stipulated the length of the possible target and reference sequences. In another study, Edwards et al. [25] performed real-time selective sequencing by online basecalling the start of the molecules and then selecting the molecules to sequence by mapping them to a reference library using the LAST aligner. Similarly, Payne et al. [26] mapped the base-called nucleotides to a reference genome using minimap2. This approach removed the constraints imposed by the dynamic time warping algorithm; however, it introduced two distinct steps (basecalling and mapping) into the decision process. Another study used the same concept of basecalling and mapping the reads to a reference genome, but for a different purpose, namely, to achieve more uniform coverage [27]. Finally, in a more recent work, Kovaka et al. probabilistically decoded the raw signal into k-mers that could be represented by the signal, and later used to align the signal to a reference. They compared this k-mer composition to a k-mer representation of a known reference. This led to an enrichment of 4.46 [28].

Overall, the above methods, de facto, comprised processing the raw signal and subsequently comparing it to a known reference. Our unique method enables classification of a raw signal without direct reliance on a reference or prior knowledge at the time of decision making (classification of the raw signal from the DNA molecule currently at the nanopore). The method only requires using the trained model for classification, which learns the intrinsic patterns of each group of different reads.

 

Our contribution
Here we apply selective sequencing on nanopore sequencing via a unique deep learning approach. We began by developing a deep learning model capable of accepting only the first 2000 values of a raw signal, which equates to roughly 200 base pairs as input. We chose to perform selective sequencing on mitochondrial DNA, which is a cellular organelle within eukaryotic cells, containing about 16 K base pairs and encoding 13 proteins. Mitochondrial DNA has been sequenced many times, has high coverage in publicly available nanopore datasets and is of biological and medical significance in the analysis of human sequencing data [29]. We trained the model to classify sequencing reads as ‘mitochondrial’ or ‘genomic’ based on the signal. Analysis of the raw signal directly bypasses the basecalling step, which is potentially error prone; while enabling the deep learning model to incorporate additional information present in the raw signal, such as DNA modifications [4, 5]. Although this experiment relied on information produced by the basecalling process, the classification models can likely be trained without basecalling at all, but rather from two groups on nanopore signals, separated by methods other than mapping. This approach can be further improved, and eventually surpass the current basecalling models. The upshot is potentially increased accuracy of DNA classification by eliminating data analysis steps and increasing the information volume for the deep learning model. Unlike previous attempts at real-time selective sequencing, our method requires neither a nucleotide reference nor a generated signal reference at the time of sequencing. Bypassing a reference decreases the run time and complexity restriction as the reference database expands. This is because in using other reference-based approaches, searching and parsing the data add size and complexity challenges. Moreover, our model can potentially use information that is not used by simple mapping algorithms, such as methylation and nucleotide composition. This can facilitate the rapid differentiation of complex DNA samples that are limited in their detection capacity without complete processing (mapping and referencing). The method might pose an advantage for clinical applications. We tested several deep learning architectures for sequence analysis and used several datasets of nanopore signal data, and applied them to classifying reads of a number of DNA origins. We selected the model with the highest classification accuracy and combined it with the Read Until API. The goal was to perform a sequencing experiment that used our model to successfully select and sequence mitochondrial DNA. The general workflow of this method is illustrated in Figure 1. Finally, we compared our results to theoretically expected results, based on the official information provided by ONT [30]. Our model achieved the expected enrichment results considering our experimental parameters.

 

Overall, the development of a new real-time selective sequencing method will not only alleviate the challenges posed by the additional steps during library preparation for targeted sequencing, but will enable classification of entire DNA molecules based on the raw signal. The latter is impossible by the alternative adaptive sequencing method of translation and mapping signals. Our method has the potential to increase accuracy and to accelerate the sequencing process, and can eventually be applied to any clinical setting in which time-sensitive DNA sequencing is of the essence.

 

Read More

Tel Aviv University makes every effort to respect copyright. If you own copyright to the content contained
here and / or the use of such content is in your opinion infringing, Contact us as soon as possible >>