Enrichment by hybridisation of long DNA fragments for Nanopore sequencing

Enrichment of DNA by hybridisation is an important tool which enables users to gather target-focused next-generation sequence data in an economical fashion. Current in-solution methods capture short fragments of around 200–300 nt, potentially missing key structural information such as recombination or translocations often found in viral or bacterial pathogens. The increasing use of long-read third-generation sequencers requires methods and protocols to be adapted for their specific requirements. Here, we present a variation of the traditional bait–capture approach which can selectively enrich large fragments of DNA or cDNA from specific bacterial and viral pathogens, for sequencing on long-read sequencers. We enriched cDNA from cultured influenza virus A, human cytomegalovirus (HCMV) and genomic DNA from two strains of Mycobacterium tuberculosis (M. tb) from a background of cell line or spiked human DNA. We sequenced the enriched samples on the Oxford Nanopore MinION™ and the Illumina MiSeq platform and present an evaluation of the method, together with analysis of the sequence data. We found that unenriched influenza A and HCMV samples had no reads matching the target organism due to the high background of DNA from the cell line used to culture the pathogen. In contrast, enriched samples sequenced on the MinION™ platform had 57 % and 99 % best-quality on-target reads respectively.


Introduction
While the cost of next-generation sequencing has been falling continuously in recent years, the enrichment of specific DNA regions or whole genomes from microorganisms allows the multiplexing of several samples per run whilst maintaining a high depth of coverage over the regions of interest. The capture of viral and bacterial organisms from mixed samples by in-solution bait hybridisation, followed by high-throughput sequencing, is advantageous for the evaluation of variant frequency and deconvolution of PCR duplicates, compared with the sequencing of PCR-generated amplicons (Samorodnitsky et al., 2015). This enrichment method can be used in a clinical setting to aid and refine timely diagnosis (Wlodarska et al., 2015), for example from extensively or totally drug-resistant pathogens in a time of antibiotic overuse (Carlet, 2015). Data from whole-genome sequencing provides a wealth of information such as identification of resistance markers carried by the infecting agent(s), allowing for rapid, targeted and personalised treatment. Previous studies have shown that it is possible to bypass the traditional culture-based diagnosis and obtain information by sequencing metagenomic samples, but the throughput is low and the method prohibitively costly for routine use (e.g. Doughty et al., 2014; Loman et al., 2013). A potentially disruptive diagnostic platform to sequence enriched bacterial and viral pathogens directly from clinical samples has been previously described by Brown et al. (2015) and Christiansen et al. (2014). This approach employs custom baits to capture genomic material from the target organisms, thereby reducing the amount of human and commensal DNA in the clinical samples and allowing greater throughput of samples. However, this method is optimised for short-read sequencers such as the Illumina MiSeq and the Ion PGM, and is unsuitable for long-read sequencers.
Information from long-read platforms could be used, for example, to resolve highly repetitive regions such as those found in cytomegalovirus (Masse et al., 1992), detect large structural variations (Jiang et al., 2015) or provide evidence of recombination events such as those seen in Chlamydia trachomatis (Joseph & Read, 2012). Enrichment of specific genomic fragments by PCR-generated baits for sequencing on an Oxford Nanopore MinION sequencer was demonstrated by Karamitros & Magiorkinis (2015).
Here, we present an adaptation of the method used by Brown et al. (2015) and Christiansen et al. (2014) for the enrichment of DNA fragments of between 1 and 15 kb for sequencing on long-read platforms. We joined the Oxford Nanopore Technologies (ONT) MinION Access Program to assess the suitability of this platform used in combination with the targeted enrichment method. We compared sequence data from unenriched and enriched cultured influenza virus A and HCMV samples, run on the MinION and Illumina MiSeq platforms. We also mixed cultured Mycobacterium tuberculosis (M. tb) genomic DNA from two different strains with human DNA to evaluate the efficiency of enrichment by hybridisation for longer bacterial DNA fragments. We found that the long genomic fragments were readily purified from a background of the cell line used for producing the viruses, or, in the case of M. tb, from the human DNA with which they were mixed.
DNA from HCMV strain Merlin grown in fibroblast cell culture (6.75×10⁶ copies ml⁻¹, determined by qPCR) was a kind gift of R. Milne at the Department for Virology, UCL Medical School, Royal Free Campus, London, UK.

Methods
Sample preparation and long-fragment hybridisation.
The different workflows for this study are outlined in Fig. 1. HCMV (500 ng, 6.7×10⁷ copies) and M. tb samples (500 ng) were diluted in TE to an end volume of 80 µl, and sheared in Covaris g-TUBEs (#520079, Covaris) with two passages at 7200 r.p.m./4200 g for 1 min in a desktop centrifuge (#5242, Eppendorf). The HCMV genomic DNA was subjected to PreCR (#M0309, New England Biolabs) enzymatic repair according to the manufacturer's recommendations after shearing (Table 1). Influenza virus A samples were not sheared since the cDNA fragments were size-compatible with Nanopore sequencing. The equivalent of 1×10¹² TCID₅₀ was used for the library preparation from the enriched influenza virus A cDNA.
Concentrations and fragment sizes were determined with a Qubit fluorometer (dsDNA BR Assay Kit #Q32850, Life

Impact Statement
Our work describes a method for the selective enrichment of known viral or bacterial pathogen DNA from a background of host DNA for sequencing on the Oxford Nanopore MinION long-read sequencer. We developed a protocol for enriching large DNA fragments (>1 kb) by in-solution hybridisation, as contrasted to short fragments (200-300 bp) used for second-generation sequencing. In this proof-of-principle experiment, we enriched long DNA fragments of influenza virus A, human cytomegalovirus and Mycobacterium tuberculosis from their culture cell line or from a laboratory-made mixture of bacterial and human genomic DNA. We believe our method and evaluation of the results will be of interest to the growing group of users of long-read sequencers (Oxford Nanopore, Pacific Biosciences). For example, this method could be used in the pathogen field for the whole-genome sequencing of small target organisms in mixed/clinical samples and in the identification of structural variants such as translocations in small or large genomes.
Biotinylated custom RNA baits for the target organisms influenza virus A (49190 baits), HCMV (33809 baits) and M. tb (224612 baits) were designed with an in-house Perl script (Depledge et al., 2011), using a database of 4968 H1N1 and 2966 H3N2 influenza virus A genomes, 115 partial and complete HCMV genomes and the M. tb strain H37Rv reference genome (NC_018143.2), respectively, and manufactured by Agilent. Sheared genomic DNA (HCMV,
A second round of end repair and dA-tailing was performed on 500 ng of enriched, amplified PCR product using SureSelect XT reagents as described above, but without purification after dA-tailing. Instead, leader/hairpin ligation and sample clean-up were performed according to the ONT protocols for kit SQK-MAP003 (used in the M. tb strain H37Rv experiments only) or SQK-MAP004. In detail, dA-tailed sample, blunt/TA ligase master mix (#M0367, NEB), tethered adapter mix and hairpin adapters (ONT) were incubated for 10 min at room temperature in protein LoBind tubes (#0030108116, Eppendorf) for ligation. Libraries processed according to the ONT SQK-MAP003 protocol were cleaned up with AMPure XP beads; those made according to the SQK-MAP004 method were purified using Dynabeads for His-Tag isolation and pulldown (#10103D, Life Technologies) (Fig. 1a). Libraries were eluted from the beads by incubation for 10 min at room temperature in elution buffer (ONT). Library concentrations were typically 2-10 ng µl⁻¹, as assessed by Qubit fluorometer.
The influenza virus A control sample that did not undergo hybridisation (75 ng, the equivalent of 2.7×10¹¹ TCID₅₀) was end-repaired, dA-tailed and amplified with LongAmp Taq polymerase as described above. Samples (500 ng) of this PCR product were processed as recommended in the ONT Genomic DNA sequencing protocol SQK-MAP004. For the non-hybridised HCMV sample, 500 ng (4.2×10⁷ copies) were used directly for Nanopore library preparation (SQK-MAP004) without amplification, as enough material was available to proceed directly to sequencing (Fig. 1b).
Before each MinION run, flowcells were quality-tested with the script MAP_Platform_QC (MinKNOW software version 0.46.2.8 to 0.49.2.9), then loaded with 12-60 ng of prepared library, library fuel mix and EP buffer (ONT) as per the manufacturer's instructions, and run with script MAP_48Hr_Sequencing_Run, for an average of 26 h.
Reads were analysed by the Metrichor 2D basecalling (versions 2.19 to 2.29) cloud-based platform, and the resulting fast5 files ('pass' quality, both strands read while passing through the nanopore, resulting in higher confidence; and 'fail', where only one strand is read) converted to fasta format with Poretools. BLASR (Chaisson & Tesler, 2012) and LAST (Kiełbasa et al., 2011) were used to align reads to the pathogen reference sequences (HCMV herpesvirus HHV-5 GU179001. . 1c) using Agilent reagents and SureSelect XT protocol steps as before. Briefly, samples were end-repaired, dA-tailed, had adapters ligated and were PCR-amplified (six cycles) as described in the protocol. Following sample purification, the PCR products were re-amplified using post-capture indexed PCR2 primers for a further 15 cycles. Sequencing (2×300 nt read length) was performed on an Illumina MiSeq instrument with paired-end 600V3 kits (#MS-102-3003) with automatic adapter trimming. Results from the Illumina MiSeq runs were aligned to the respective references with Bowtie version 1.1.1 (http://bowtie-bio.sourceforge.net/index.shtml). Additional alignment metrics from the bam files were obtained using the Picard CollectMultipleMetrics (http://broadinstitute.github.io/picard/) tool, which generates metrics such as percentage of reads aligned to a given target as well as coverage data.

Results
Comparison of Nanopore library size and read length
Table 1 shows the peak sizes of the DNA samples after shearing, as determined on an Agilent TapeStation. The size distribution of the influenza virus A RNA and cDNA prior to processing showed distinct peaks at 160 nt, 320 nt, 500 nt, 670 nt, 900 nt, 1.2 kb and 3 kb (Fig. S1a, b, available in the online Supplementary Material), with fragments up to 15 kb. These were presumably short fragments of the eight influenza virus A segments NC_002016 to NC_002023, and residual dog cell line DNA. The sizes of fragments pre- and post-reverse transcription were broadly similar (Fig. S1).
Due to the shortness of the fragments, influenza virus A samples were not sheared.
The HCMV sample (g-TUBE-sheared and PreCR-treated) had a tight range of fragment sizes of around 12.8 kb. After PCR amplification, a broad range of fragment sizes was observed, both within and between individual reactions. In general, the products were about half the size of the original DNA before hybridisation, ranging between 1.6 kb and 5.6 kb. One exception was strain M. tb C, which had shorter (median size 1.5 kb) PCR products.
The Nanopore reads (Table 1) were similarly variable in length, reflecting the input material, as indicated by the standard deviations in Table 1. Sequenced reads were shorter on average than the PCR products, but with a wide range. Reads classified as 'pass' quality by the Metrichor platform were longer than 'fail' quality reads. Non-hybridised samples had longer read lengths than enriched samples, either due to DNA damage during the hybridisation and wash processes, or preferential amplification of shorter fragments during PCR.
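The read-length summaries discussed above (mean and standard deviation per library) can be reproduced from any FASTA file of reads. The following is a minimal sketch, not the procedure used for Table 1: the function names `read_lengths` and `summarise` are hypothetical, and the toy sequences are invented purely for illustration.

```python
import io
import statistics


def read_lengths(fasta_handle):
    """Yield the sequence length of each record in a FASTA stream."""
    length = None
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarise(lengths):
    """Return count, mean, sample standard deviation and maximum length."""
    lengths = list(lengths)
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "max": max(lengths),
    }


# Toy input (invented): three reads of 8, 12 and 2 bases.
example = io.StringIO(">read1\nACGTACGT\n>read2\nACGT\nACGT\nACGT\n>read3\nAC\n")
print(summarise(read_lengths(example)))
```

Running the same summary separately on 'pass' and 'fail' FASTA files would allow the two read classes to be compared directly.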

Comparison of BLASR and LAST aligners
We used BLASR (Chaisson & Tesler, 2012) and LAST (Kiełbasa et al., 2011), with the settings used in Quick et al. (2014) for the alignment of Nanopore reads to their respective references (pathogen and human/dog cell line). Table 2 shows statistics for the similarities to the target references obtained with the two aligners. We found that BLASR alignment of reads showed slightly higher identity to the references, shorter aligned regions and lower standard deviation. The LAST aligner produced longer alignments with lower identity and higher standard deviation. This is similar to the observations of Kilianski et al. (2015). A percentage of reads (10-35 %) aligned to the reference by LAST are not aligned by BLASR, and vice versa, indicating that neither aligner works optimally for aligning Nanopore reads to the reference.
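The overlap between the two aligners can be expressed as simple set arithmetic on the identifiers of aligned reads. The sketch below is an illustration only and assumes that the read IDs have already been extracted from each aligner's output; the function name `aligner_concordance` and the example IDs are hypothetical.

```python
def aligner_concordance(blasr_ids, last_ids):
    """Compare the sets of read IDs aligned by two aligners."""
    blasr_ids, last_ids = set(blasr_ids), set(last_ids)
    both = blasr_ids & last_ids
    blasr_only = blasr_ids - last_ids
    last_only = last_ids - blasr_ids

    def pct(part, whole):
        # Percentage of `whole` that falls in `part`; 0.0 for an empty set.
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return {
        "both": len(both),
        "blasr_only": len(blasr_only),
        "last_only": len(last_only),
        "pct_last_missed_by_blasr": pct(last_only, last_ids),
        "pct_blasr_missed_by_last": pct(blasr_only, blasr_ids),
    }


# Invented read IDs for illustration.
result = aligner_concordance(
    blasr_ids=["r1", "r2", "r3", "r4"],
    last_ids=["r2", "r3", "r4", "r5", "r6"],
)
print(result)
```

Reporting the two percentages separately keeps the asymmetry visible, since each aligner may miss a different share of the other's alignments.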

Comparison of enriched and non-enriched Nanopore libraries
A total of 13 Nanopore sequencing runs were included in our datasets. The average starting pore count per flowcell was 215. Most 'pass' quality reads aligned to either the target organism or the respective cell line, whereas most 'fail' quality reads did not match the target, the cell line (Table 3) or sequences in the PubMed Nucleotide database (November 2015). This has been reported elsewhere (e.g. Greninger et al., 2015; Kilianski et al., 2015). Regions of alignment were shorter than read length, possibly due to regional increases in the error rate within reads.
Analysis of the 42 261 reads obtained from one non-enriched, PCR-amplified influenza virus A cDNA library run on the Nanopore MinION™ found 98.9 % 'pass' and 25.1 % 'fail' reads aligned to the MDCK dog cell line used for cultivation of the virus, whilst only one read aligned to the influenza virus A reference H1N1. After hybridisation and amplification, 57.2 % of 'pass' and 9.5 % of 'fail' reads (34 211 reads in total) from one Nanopore run could be aligned to influenza virus A. This amounts to an average read depth of the influenza virus A genome of 62.9×. Fig. 2 shows uneven distribution of reads per fragment, with distinct peaks of increased coverage. This probably reflects the size distribution of the input RNA (Fig. S1a) rather than effects of reverse transcription, hybridisation or PCR bias. The frequency of cell line reads in influenza virus A-enriched samples dropped to 28.4 % ('pass') and 2.9 % ('fail') (Table 3).
The unenriched HCMV library (a total of 432 reads from one flowcell) produced four reads (0.2 % of total) matching the HCMV reference HHV-5, while 47 reads (10.9 % of total) matched the human_g1k_v37 reference. After enrichment of the DNA with the HCMV-specific bait set, we obtained 37 589 reads from three runs, with almost all (98.7 %) 'pass' reads and 35 % of 'fail' reads aligning to the HCMV reference (Table 3). This amounts to an average read depth of 87.6× of the HCMV genome. Fig. 3(a) shows the coverage of all Nanopore reads aligned to the reference.
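The two quantities reported in this section, the percentage of reads matching a reference and the average read depth, follow from simple ratios. The sketch below assumes that average depth is taken as the total number of aligned bases divided by the reference length; the text does not state how depth was computed, and the function names and the toy alignment lengths are hypothetical.

```python
def percent_of_reads(matching_reads, total_reads):
    """Percentage of all reads that match a given reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * matching_reads / total_reads


def mean_depth(aligned_lengths, reference_length):
    """Average depth: total aligned bases divided by reference length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return sum(aligned_lengths) / reference_length


# Counts from the unenriched HCMV library: 47 of 432 reads matched human.
print(round(percent_of_reads(47, 432), 1))  # 10.9

# Invented toy values: three alignments over a 10 kb reference.
print(mean_depth([1000, 2500, 1500], 10_000))  # 0.5
```

A per-position depth profile, as plotted in the coverage figures, would require the alignment coordinates rather than only the aligned lengths.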
A comparison of the consensus sequence generated from the enriched HCMV reads aligned to the HCMV HHV-5 reference using the genomic similarity search tool YASS (Noé & Kucherov, 2005) 194 363-194 698 and 195 851-195 977. The last two regions of difference coincide with inverted repeat regions (194 344-195 667, 195 090-197 626) (Masse et al., 1992). A number of mismatches to the reference HHV-5 were identified upstream of base 1270; these were due to low coverage of this region by Nanopore reads. We found that regions with low (<5×) coverage had a high number of mismatches compared with the reference, but areas of greater coverage matched near-perfectly.

Sequencing of enriched long fragments on the Illumina MiSeq
To assess the success of the long fragment hybridisation, Illumina libraries were generated from the remaining half of the hybridised material, and sequenced on a MiSeq instrument (results shown in Table 4). A high percentage of influenza virus A and HCMV reads from long enriched fragments aligned to the target reference in both Illumina and Nanopore 'pass' reads.
Illumina-generated reads showed higher percentages of alignment than Nanopore reads, presumably due to the lower error rates. Illumina libraries generated from the hybridisation of long fragments, particularly the independently generated, 10 % M. tb H37Rv libraries 1-4 in Table 4, show successful enrichment of mycobacterial DNA, with 56-96 % of reads aligning to the H37Rv genome. Results for M. tb strain C show a relatively low rate of alignment of reads to the H37Rv genome in both Nanopore and Illumina experiments. This could be due to less successful enrichment, and an imperfect match of the M. tb strain C reads aligned to strain H37Rv, which has 98.9 % identity to
Results from the enriched influenza virus A (Fig. S1c) show concordance with the coverage by Nanopore results (Fig. 2). The unevenness of the coverage is presumably a result of the prevalence of short fragments in the original RNA sample (Fig. S1a), reverse-transcribed to cDNA (Fig. S1b). Illumina reads (Fig. 3b) generally show less even coverage of the HCMV genome compared with Nanopore reads (Fig. 3a). Fifteen (out of a set of 23 525) aligned Nanopore reads span the repetitive replication origin oriLyt at position 94 488-94 588 (Chen et al., 1999) (Fig. S2c, d). The complete (3.5 M aligned reads) Illumina dataset (Table 4) has a 100 bp gap in the alignment at this repetitive position (Fig. S2a, b). Two Nanopore reads cover the inverted repeat region 194 293-195 565, while no Illumina reads aligned in this gap, and almost all Illumina reads in the adjacent 2.5 kb region show a mapping quality equal to zero, when visualised in the IGV (Fig. S2g, h). A similar outcome has been observed for a comparison of Nanopore and 454 reads for human herpesvirus type 1 (Karamitros et al., 2016). Areas with increased coverage can also be observed in Nanopore- and Illumina-generated datasets (Fig. 4) in M. tb. Here, this is presumably due to the redundancy of transposase-encoding sequences, which could result in localised increased aligning of reads.

Discussion
This study explores the capture of specific, long DNA fragments for sequencing on a long-read platform, the Oxford Nanopore MinION instrument. We demonstrate that our method can be used to enrich large, specific regions of interest in mixed samples. Previous work by Greninger et al. (2015) has shown that detection of moderate to high titres of pathogen DNA (chikungunya virus, Ebola and hepatitis C virus) from human blood samples is possible using Nanopore sequencing. However, this direct sequencing approach is inefficient if the region of interest is a small subset of the total DNA, the target is of low titre, or if high coverage is required for strain typing and variant identification. In our Nanopore sequencing experiments with un-enriched influenza virus A and HCMV DNA (from cell cultures), we detected very low numbers of reads from the pathogen compared with those from the host cell line. In contrast, sequencing data from enriched DNA produced good coverage of the influenza virus A and HCMV genomes and partial coverage of the M. tb genome. Control experiments using Illumina sequencing to assess the quality of enrichment (Table 4, Fig. 4) showed good overall and minimum coverage, similar to the sample enriched by short-fragment hybridisation (Brown et al., 2015;Christiansen et al., 2014, sample 9 and M. tb strain C sample 2 in Table 4), indicating that the enrichment of long fragments does not introduce bias. Preferential enrichment of certain regions (Fig. 4) seems to be due to redundancy of the captured sequence, in this case the transposases.
The drawbacks of our method, compared with the high-throughput protocol used by Brown et al. (2015) and Christiansen et al. (2014), were lower target coverage and throughput. Enrichment and library preparation take approximately 28 h and include a 16 h hybridisation step and 3-4 h of long-range PCR. In the future, the enrichment step could be shortened to 4 h by using a different hybridisation protocol, and PCR amplification could be replaced with whole-genome amplification. Addition of molecular barcodes would allow pooling of several samples to be run simultaneously on one MinION flowcell. This, coupled with increasing speed, accuracy and throughput of MinION reads (e.g. results in Norris et al., 2016), will reduce the time and number of reads necessary for strain and variant identification, making this method amenable for diagnostic purposes. The relatively inexpensive and small-footprint MinION sequencers have been used in settings where conventional Illumina sequencing would be difficult (Quick et al., 2016).
We see the main application of our method of enriching long fragments in the detection of structural variants and in generating comprehensive coverage of specific target regions by long-read sequencing. Nanopore sequencing has previously been used to detect structural variants in pathogenic bacteria (Ashton et al., 2014), human DNA samples (Ammar et al., 2015) or human cancer cell lines (Norris et al., 2016); we believe our method could be employed as a non-amplicon-based alternative for this application, improving library complexity and uniformity of the sample, and aiding the detection of single-nucleotide variants (Samorodnitsky et al., 2015). As the enrichment approach is platform-agnostic, it could also be used to generate libraries compatible with the other long-read sequencers, benefitting the field of research into structural variation.