Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology

Translating the Oxford Nanopore MinION sequencing technology into medical microbiology requires on-going analysis that keeps pace with technological improvements to the instrument and release of associated analysis software. Here, we use a multidrug-resistant Enterobacter kobei isolate as a model organism to compare open source software for the assembly of genome data, and relate this to the time taken to generate actionable information. Three software tools (PBcR, Canu and miniasm) were used to assemble MinION data and a fourth (SPAdes) was used to combine MinION and Illumina data to produce a hybrid assembly. All four had a similar number of contigs and were more contiguous than the assembly using Illumina data alone, with SPAdes producing a single chromosomal contig. Evaluation of the four assemblies to represent the genome structure revealed a single large inversion in the SPAdes assembly, which also incorrectly integrated a plasmid into the chromosomal contig. Almost 50 %, 80 % and 90 % of MinION pass reads were generated in the first 6, 9 and 12 h, respectively. Using data from the first 6 h alone led to a less accurate, fragmented assembly, but data from the first 9 or 12 h generated similar assemblies to that from 48 h sequencing. Assemblies were generated in 2 h using Canu, indicating that going from isolate to assembled data is possible in less than 48 h. MinION data identified that genes responsible for resistance were carried by two plasmids encoding resistance to carbapenem and to sulphonamides, rifampicin and aminoglycosides, respectively.


Introduction
The Oxford Nanopore MinION is a commercially available long-read sequencer that connects to a personal computer through a USB port. It is able to generate relatively small amounts of data, making it ideally suited to working with microbes such as bacteria and viruses. To date, the technology has shown promise for microbiological applications, including the delineation of position and structure of bacterial antibiotic-resistance islands (Ashton et al., 2014), assembly of bacterial genomes (Loman et al., 2015; Risse et al., 2015) and tracking of viral outbreaks (Quick et al., 2016; Zika Real time Sequencing Consortium, 2016). This has been supported by the development of analysis tools for MinION data.
MinION data has been shown to be of sufficient quality to accurately detect the presence of antimicrobial-resistance genes (Bradley et al., 2015; Judge et al., 2015; Cao et al., 2015), but these studies focused on mapping long-read data to an existing reference to detect them. Here, we evaluate the performance of four open-access software tools in creating de novo assemblies of genomic data, including plasmids, for a multidrug-resistant isolate of Enterobacter kobei. We consider factors key to medical microbiology including accuracy, time taken to generate assemblies and whether the assemblies were of sufficient quality to provide information on the presence and structure of plasmids carrying clinically relevant antimicrobial-resistance genes.

Methods
Microbiology. A multidrug-resistant E. kobei isolate was cultured from untreated wastewater in the United Kingdom in 2015 (unpublished data). A freezer vial was prepared from a single colony, maintained at −80 °C and re-grown from frozen stock for antimicrobial susceptibility testing and DNA extraction. Susceptibility testing was performed using the N206 card on the Vitek 2 instrument (bioMérieux) calibrated against European Committee on Antimicrobial Susceptibility Testing (EUCAST) breakpoints.
Illumina sequencing and bioinformatic analyses. DNA extraction and library preparation were performed as previously described (Quail et al., 2012). In brief, 0.5 µg DNA was sheared, end-prepped, A-tailed and adapter ligated according to the Illumina protocol. The library was amplified with six cycles of PCR using Kapa HiFi 2× mastermix (KK2601, Kapa Biosystems). The mean insert size of the library was approximately 200 bp. DNA libraries were sequenced using the HiSeq platform (Illumina) to generate 100 bp paired-end reads. Reads were trimmed using Trimmomatic (Bolger et al., 2014) to remove adapter sequences and regions of low quality, and overlapping reads were merged using PEAR (Zhang et al., 2014), with the reverse reads reverse complemented using fastaq. De novo assemblies were generated using Velvet (Zerbino & Birney, 2008), creating several assemblies by varying the kmer size. The assembly with the best N50 (the length L such that 50 % of all bases in the assembly lie in sequences of length L or greater) was chosen, and contigs smaller than 300 bases were removed. The scaffolding software SSPACE was employed (Boetzer et al., 2010) and assemblies were further improved using 120 iterations of GapFiller (Boetzer & Pirovano, 2012). Species identification was based on analysis of hsp60 and rpoB, as previously described (Hoffmann & Roggenkamp, 2003). To detect acquired genes encoding antimicrobial resistance, the de novo assembly was compared by BLAST to a manually curated version of the ResFinder database (compiled in 2012) (Zankari et al., 2012) as described previously (Reuter et al., 2013).
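The N50 criterion used to choose between the Velvet assemblies can be illustrated with a minimal sketch. The function name and the example contig lengths are hypothetical, and this is not the code used in the study; it only shows the standard definition, in which contigs are sorted from longest to shortest and the N50 is the length of the contig at which the running total first reaches half of the assembly size.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that takes the running total to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))
```

When comparing assemblies built with different kmer sizes, a higher N50 indicates that more of the assembly sits in long contigs, which is why it serves as a simple selection criterion.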
MinION sequencing and bioinformatic analysis. DNA was extracted using the QiaAMP DNA Mini kit (Qiagen), and quantified using the Qubit fluorimeter (Life Technologies). Sample preparation was carried out using the Genomic DNA Sequencing Kit SQK-MAP-006 (Oxford Nanopore Technologies) following the manufacturer's instructions, including the optional NEBNext FFPE DNA repair step (NEB). A 6 µl aliquot of pre-sequencing mix was combined with 4 µl Fuel Mix (Oxford Nanopore), 75 µl running buffer (Oxford Nanopore) and 66 µl water and added to the flow cell. The 48 h genomic DNA sequencing script was run in MinKNOW V0.50.2.15 using the 006 workflow. Metrichor V2.33.1 was used for base calling. The flow cell was reloaded at 24 h with the pre-sequencing mix prepared as above. MinION and Illumina sequence data have been deposited in the European Nucleotide Archive (Data citation 1).

Impact Statement
The Oxford Nanopore MinION sequencing technology has several advantages for pathogen sequencing in medical microbiology, but ongoing analysis needs to keep abreast of technological improvements to the instrument and release of new analysis software. Clinical use also requires the generation of data in a timeframe that can inform medical decisions. We compared the performance of four open-access software tools in assembling genome data generated by MinION for a multidrug-resistant isolate of Enterobacter kobei. We determined the optimal software in terms of accuracy and speed, and showed how sequence data can be used as early as 9 h into the sequencing run to generate assembled whole genomes. Sequence data detected the genes responsible for resistance to numerous clinically important antibiotics, and assemblies allowed these to be assigned to one of two mobile elements (plasmids). Our findings are relevant to biotechnologists working in medical practice, and to those working in the field of molecular epidemiology who study mobile elements that spread antimicrobial resistance within and between bacterial species of medical importance. Entire bacterial genomes can now be assembled without access to compute clusters or expensive sequencing hardware.
Basecalled MinION reads were converted from FAST5 to FASTQ format using the Python script fast52fastq.py. Read mapping was carried out to assess the quality of data and coverage using the BWA-MEM algorithm of BWA v0.7.12 with the flag -x ont2d (Li, 2013). Output SAM files from BWA-MEM were converted to sorted BAM files using SAMtools v0.1.19-44428cd (Li et al., 2009). Assembly using MinION data only was undertaken using PBcR (Koren et al., 2012), Canu (Berlin et al., 2015) and miniasm (Li, 2016). Canu version 1.0 was run using the options maxThreads=8 maxMemory=16 useGrid=0 -nanopore-raw. The PBcR pipeline with CA version 8.3rc2 was run using the options length 500, partitions 200 and the spec file shown in Supplementary Text 1, available in the online Supplementary Material. Minimap and miniasm were run as specified (Li, 2016). The resulting assembly was polished using Nanopolish v0.4.0 with settings as specified (Loman et al., 2015), with Poretools (Loman & Quinlan, 2014) used to extract fasta sequences from fast5 files in the format required by nanopolish using the option fasta. Hybrid assemblies were generated using SPAdes 3.8.1 (Bankevich et al., 2012) using the option '--careful', then filtered to exclude contigs of less than 1 kb. All assemblies were assessed against the manually finished assembly using QUAST (Gurevich et al., 2013) version 3.2 (Table S1, available in the online Supplementary Material). Assemblies were annotated using Prokka (Seemann, 2014).
Manually finished genome. Assemblies were generated using Canu and SPAdes, as before. A gap5 database was made using corrected MinION pass reads from the Canu pipeline and Illumina reads. Manual finishing was undertaken using gap5 (Bonfield & Whitwham, 2010) version 1.2.14 (Fig. S1), giving one chromosome and eleven confirmed plasmids. Icorn2 (Otto et al., 2010) was run on this for five iterations. The start positions of the chromosome and plasmids were fixed using circlator (Hunt et al., 2015) 1.2.0 using the command circlator fixstart. This assembly was annotated using Prokka (Seemann, 2014). Where the Canu and SPAdes assemblies did not match with regard to suspected integration of a plasmid into the chromosome, this was additionally investigated using long-range PCR. The assembly and annotation are available online (Data citation 3).

Results
Our analyses were based on a multidrug-resistant E. kobei isolate cultured from sewage. This was selected as a model organism on the basis of its multidrug-resistant phenotype (including resistance to the carbapenem drugs), and because of the additional challenge of working with an organism for which there was no available assembled whole-genome sequence and so reflecting a real-life scenario.
Raw data on the E. kobei genome from a single flow cell was initially analysed using the Oxford Nanopore base calling software and defined as pass or fail based on a threshold set at approximately 85 % accuracy (Q9) and including only 2D reads, where data is generated from both the forward and reverse strand of DNA as it passes through the nanopore. The error rate of MinION pass data exceeded that of the Illumina data (0.048 insertions, 0.027 deletions and 0.089 substitutions per base for MinION, compared with 5.8×10⁻⁶ insertions, 9.2×10⁻⁶ deletions and 0.0025 substitutions for Illumina). Three tools [PBcR (Koren et al., 2012), Canu (Berlin et al., 2015) and miniasm (Li, 2016)] were used to assemble MinION pass reads alone, and a fourth [SPAdes (Bankevich et al., 2012)] was used on the combination of MinION pass data and Illumina data to produce a hybrid assembly. PBcR and Canu perform a self-correction step on reads before generating an assembly, whereas miniasm assembles the reads as provided.
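The relationship between the Q9 pass threshold and per-base accuracy can be checked with a short sketch of the standard Phred conversion, Q = −10 log10(P_error). The function names are hypothetical and the snippet is illustrative rather than part of the analysis pipeline. Under this definition Q9 corresponds to an expected accuracy of roughly 87 %, in line with the "approximately 85 %" threshold quoted above.

```python
import math


def phred_to_accuracy(q):
    """Convert a Phred quality score to the expected per-base accuracy."""
    return 1.0 - 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Convert an expected per-base accuracy (0 to 1, exclusive of 1) to a Phred score."""
    return -10 * math.log10(1.0 - accuracy)


# Expected accuracy at the Q9 pass threshold, as a percentage.
print(round(phred_to_accuracy(9) * 100, 1))
```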
All four assemblies had a similar number of contigs and were more contiguous than the assembly using Illumina data alone, with SPAdes producing a single chromosomal contig (Table 1). We ran QUAST (Gurevich et al., 2013) to assess the quality of the assemblies, but found that it could not report all statistics for the miniasm assembly as this fell below the cut-offs for this tool. We used nanopolish (Loman et al., 2015) to correct the miniasm assembly using the raw current signal (pre-base calling) to obtain higher accuracy. The QUAST results showed that the miniasm and nanopolish assembly had a similar number of indels per kb to Canu, although it still had more mismatches per kb (Table 1). Small indels and mismatches were more common in the MinION-only assemblies than the hybrid or Illumina-only assemblies. Assemblies were annotated (Seemann, 2014) and the annotation searched for the housekeeping genes rpoB and hemB (Hoffmann & Roggenkamp, 2003). These were present in all assemblies with the exception of miniasm, where hemB could not be identified. However, the miniasm assembly after nanopolishing had both genes present.
The four assemblies were compared to evaluate their ability to reflect the genome structure. A manually finished assembly was produced and used as a reference, from which a single large inversion between the SPAdes assembly and the manually finished assembly was identified (Fig. 1). SPAdes also incorrectly integrated a plasmid into the chromosomal contig, caused by false joins. PBcR made a number of rearrangements compared with Canu (Fig. 1), consistent with Canu being an improvement over its predecessor PBcR.
We then evaluated the assembly of all (pass and fail) MinION reads using miniasm and Canu to determine whether adding additional (lower-quality) data would improve the assembly. Adding fail data increased the number of reads by almost 50 % (64 497 versus 43 260) but reduced the mean read length from 5221 bp to 4687 bp. Miniasm run on all reads produced the same number of contigs and a similar mean contig size as when run on pass reads. The longest contig produced with Canu was smaller when using all reads versus pass reads alone (Table S1). With Canu, using pass reads alone led to more reads at the correction step compared with using all reads (35 913 versus 30 728), indicating that working with all reads could cause good-quality data to be discarded during the read correction process. In both cases, using all reads did not produce a single chromosomal contig. We concluded from this that adding fail data did not consistently improve assembly.
We considered the time taken to generate sequence data, together with memory requirements to compute the assembly (Table 1). Almost 50 % of pass reads were generated in the first 6 h, almost 80 % within 9 h and 90 % within 12 h. This gave a theoretical coverage of 20×, 32× and 37×, respectively. Only 31 pass reads were generated in the final 12 h of the 48 h run (<0.1 %). Using pass reads from the first 6 h alone led to a less accurate, fragmented assembly, but subsets of pass reads taken from the first 9 or 12 h of the run generated similar assemblies to pass data from the full 48 h run (Table S1). We also compared speed of data analysis. Miniasm completed assembly within 2 min, but the trade-off from using this alone was lower accuracy (Table 1). Nanopolish improved the quality of the miniasm assembly but took over three days to run; Canu took 2 h and produced similar results to the miniasm assembly after nanopolish. With current methods, going from isolate to assembled data in less than 48 h is realistic.
Finally, we evaluated whether these assemblies could be used to identify the presence and position of genes associated with clinically significant drug resistance in the E. kobei genome. HiSeq data had detected blaOXA-48 encoding carbapenem resistance on a 2.5 kb contig and additional antimicrobial-resistance genes in a separate 8.7 kb contig (sul1, arr, aac3 and aac6′-IIc, which encode resistance to sulphonamides, rifampicin and aminoglycosides, respectively), but it was unclear whether these were on the same plasmid, on two different plasmids or chromosomally integrated. All assemblies using MinION data identified the carbapenemase blaOXA-48 on a contig with plasmid genes. The other resistance genes were identified in proximity to each other on a single large contig along with heavy-metal-resistance genes and plasmid genes. However, the SPAdes assembly misassembled this region into the chromosomal contig (5 Mb). We concluded that there are two separate plasmids carrying resistance determinants of interest.
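The theoretical coverage values above follow from the usual estimate of read count multiplied by mean read length, divided by genome size and scaled by the fraction of reads available at each time point. The sketch below uses the pass-read count (43 260) and mean read length (5221 bp) reported in the Results; the genome size of 5.5 Mb is an assumption made purely for illustration and is not stated in the text, and the function name is hypothetical.

```python
def theoretical_coverage(n_reads, mean_read_length, genome_size, fraction=1.0):
    """Estimate fold coverage from a fraction of the reads in a run."""
    return fraction * n_reads * mean_read_length / genome_size


PASS_READS = 43260            # pass reads over the full 48 h run (from the text)
MEAN_READ_LENGTH = 5221       # mean pass read length in bp (from the text)
ASSUMED_GENOME_SIZE = 5.5e6   # assumption for illustration, not from the text

# Approximate fractions of pass reads available after 6, 9 and 12 h.
for hours, fraction in [(6, 0.5), (9, 0.8), (12, 0.9)]:
    cov = theoretical_coverage(PASS_READS, MEAN_READ_LENGTH,
                               ASSUMED_GENOME_SIZE, fraction)
    print(f"{hours} h: about {cov:.1f}x")
```

With the assumed genome size the estimates fall close to the reported 20×, 32× and 37×; a different genome size would shift all three values proportionally.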

Conclusion
MinION data alone could be used with the software described above to generate highly contiguous bacterial assemblies. Canu gave the best results overall, combining low error rate with a highly contiguous assembly. Miniasm created a similar assembly, although the error rate was considerably higher. This means that it has utility in generating an extremely rapid draft answer, but should not be relied upon for high accuracy without additional error correction steps such as nanopolish. SPAdes gave a better accuracy for mismatches and small indels, but created a false join that incorrectly integrated a plasmid into the chromosome. However, SPAdes may be useful where coverage of the genome with MinION data is too low to successfully assemble using MinION data alone. MinION-only assemblies were of sufficient quality to detect and characterise antimicrobial resistance and could be generated rapidly during an outbreak investigation. Whilst other sequencing technologies such as the PacBio RS II generate high-quality long-read sequence data, the portability of the MinION is a potential advantage for medical microbiology.