Region-specific diversification of the highly virulent serotype 1 Streptococcus pneumoniae

Serotype 1 Streptococcus pneumoniae is a leading cause of invasive pneumococcal disease (IPD) worldwide, with the highest burden in developing countries. We report the whole-genome sequencing analysis of 448 serotype 1 isolates from 27 countries worldwide (including 11 in Africa). The global serotype 1 population shows a strong phylogeographic structure at the continental level, and within Africa there is further region-specific structure. Our results demonstrate that region-specific diversification within Africa has been driven by limited cross-region transfer events, genetic recombination and antimicrobial selective pressure. Clonal replacement of the dominant serotype 1 clones circulating within regions is uncommon; however, here we report on the accessory gene content that has contributed to a rare clonal replacement event of ST3081 with ST618 as the dominant cause of IPD in the Gambia.


Introduction
Streptococcus pneumoniae is a commensal bacterium commonly isolated from the human nasopharynx (Turner et al., 2012). It is a major cause of morbidity and mortality worldwide, manifesting as a range of clinical infections, from sinusitis and acute otitis media to meningitis, septicaemia and pneumonia, with the highest disease burden in developing countries (O'Brien et al., 2009). S. pneumoniae is conventionally subdivided into more than 90 different serotypes (Bentley et al., 2006), which have differing disease associations (Hausdorff et al., 2005;Yildirim et al., 2010). For a century, serotype 1 has ranked among the most prevalent pneumococcal serotypes causing invasive pneumococcal disease (IPD) worldwide (Harboe et al., 2010). In Africa, serotype 1 is the second most common cause of IPD (proportion of IPD: 11.7 %) after serotype 14 (Johnson et al., 2010).
Serotype 1 has distinct characteristics compared with other pneumococcal serotypes; it shows even distribution of incidence across all age ranges, is linked to outbreaks in closed communities and is associated with epidemic outbreaks in West Africa (Antonio et al., 2008;Dagan et al., 2000;Ritchie et al., 2012). Furthermore, serotype 1 is rarely associated with antimicrobial resistance (Brueggemann et al., 2003;Hausdorff et al., 2005). Short duration and low densities of serotype 1 during colonization may explain low resistance levels because recombination with other streptococci in the nasopharynx is a major source of resistance in pneumococci (Brueggemann & Spratt, 2003;Williams et al., 2012).
The high burden of disease caused by serotype 1 was the impetus for clinical trials of the nine-valent conjugate vaccine (PCV9) in the Gambia and South Africa, the first conjugate vaccine to incorporate serotype 1 capsular polysaccharide (CPS) (Cutts et al., 2005;Klugman et al., 2003). PCV10 and PCV13 are currently being rolled out across Africa. Both incorporate serotype 1 CPS (Klugman et al., 2008); however, comprehensive impact evaluation data for these vaccines are not yet available (Blumental et al., 2015).
Genetic characterization of a global collection of 166 serotype 1 pneumococci using MLST identified three lineages each associated with different regions of the world, with lineage B associated with Africa (Brueggemann & Spratt, 2003). Within regions there was further suggestion of phylogeographic structure but the sample size and low resolution of MLST limited the conclusions that could be made. In recent years, whole genome sequencing has emerged as a practical method for studying bacterial genetics across large numbers of samples, allowing reconstruction of highresolution phylogenies and correlation of complete gene catalogues with phenotypic and clinical data (Croucher et al., 2011;Harris et al., 2010;Köser et al., 2012).

Impact Statement
Streptococcus pneumonia serotype 1 is a leading cause of pneumococcal pneumonia and meningitis globally, with the highest burden of disease in the developing world. In this study we sequenced a global collection of S. pneumoniae serotype 1 isolates and show that serotype 1 in Africa is genetically distinct from serotype 1 in the developed world. Within Africa, serotype 1 has disseminated and diversified between countries in response to region-specific selective pressures. Recombination and antibiotic usage have contributed to this diversification. Of interest, serotype 1 is subject to the same level of recombination as other serotypes commonly associated with nasopharyngeal carriage. This contradicts the view that short duration of carriage limits the opportunity for serotype 1 to recombine and acquire antibiotic resistance mechanisms. We demonstrate that long-range transmission of serotype 1 is rare, with locally circulating clones in countries remaining stable with little impact from imported clones. In a rare example of clone replacement, the serotype 1 clone ST3081 replaced ST618 as the dominant cause of invasive pneumococcal disease in the Gambia in 2006. We report on the virulence factors unique to ST3081, which have likely driven this replacement. This is the largest reported sequencing collection of a single S. pneumoniae serotype to date. Our analysis shows how country-specific selective pressures have driven the evolution and diversification of this important pathogen within Africa.
Here we present a whole genome phylogeny of 448 serotype 1 pneumococcal isolates, recovered from 27 countries worldwide (including 11 in Africa) in order to investigate the genetic diversity of this important pathogenic serotype and relate this to clinical phenotype. We also investigate the evolutionary mechanisms and specific selective pressures that have driven the diversification of serotype 1 within Africa.

Methods
Whole genome sequencing. S. pneumoniae was cultured and genomic DNA extractions were performed as described elsewhere (Everett et al., 2011). Multiplex DNA sequencing using the Illumina Genome Analyser GAII (Illumina) was performed, as previously described (Harris et al., 2010). The sequence reads generated were deposited in the European Nucleotide Archive (ENA) (http://www.ebi.ac. uk/ena/) under study number ERP000156, a full list of accession numbers is available in Table S1 (available in the online Supplementary Material). All isolates had previously tested positive as serotype 1 pneumococci by PCR using a standard protocol (Pai et al., 2005). To confirm all of the study isolates were serotype 1 we also employed an additional in silico serotyping method, whereby the sequence reads were redundantly aligned against the concatenated sequences of 94 pneumococcal CPS loci using BWA (Li & Durbin, 2009). The CPS locus with the highest proportion of its length covered by mapped reads was taken as the serotype (Croucher et al., 2011). We observed 100 % concordance between the PCR and in silico serotyping results; all of the study isolates exhibited the highest mapping to the serotype 1 CPS locus. The seven MLST loci were extracted from the assembled sequence reads and compared with the MLST.net pneumococcal database using the short read sequence typing (SRST) tool (Inouye et al., 2012).
Phylogeny reconstruction. Sequence reads were mapped against serotype 1 S. pneumoniae P1041 (accession number: FQ312030) using SMALT (http://www.sanger.ac.uk/resources/ software/smalt), giving, on average, 185| depth of coverage for more than 94.4 % of the reference genome ( Fig S1). SNPs were identified as described by Harris et al. (2010). A phylogenetic tree was reconstructed for all SNP sites in the genomes using RAxML v.7.0.4 (Stamatakis, 2006). A General Time-Reversible (GTR) model with gamma correction for among-site variation was used with 10 starting trees. To assess support for nodes, 100 random bootstrap replicates were performed. Recombinant sites in the lineage B isolates were identified as previously described and the phylogeny was reconstructed for all SNP sites independent of recombinant blocks (Croucher et al., 2011).
Nucleotide substitution rate. Rates of single nucleotide substitution for the five lineage B clades, which featured African isolates (clades i, ii, iii, v and vi) were calculated with Bayesian Evolutionary Analysis by Sampling Trees (BEAST), using 2 000 000 Markov Chain Monte Carlo (MCMC) iterations sampled every 1000 steps (Drummond et al., 2012). The nucleotide substitution rate between clades was compared using the Kruskal-Wallis test. Rates of recombination for each clade were calculated as previously described (Croucher et al., 2011) and compared using the Kruskal-Wallis Test.
Genome assembly and annotation. Genomes for the lineage B isolates were de novo assembled using a pipeline, which iteratively ran Velvet (Zerbino & Birney, 2008) (with k-mer size ranging between 60 and 90 % of the read length), SMALT and SSPACE (Boetzer et al., 2011). This pipeline gave on average a total length of 2 063 265 bp, with average contig length of 19 660 bp and average N50 of 9733 bp. Assembly statistics are summarized in supplementary Table S1. De novo gene prediction was performed using Glimmer (Delcher et al., 1999) and annotation was transferred using PROKKA (Seemann, 2014).
Antimicrobial resistance analysis. Individual gene trees for folA and folP were reconstructed with RAxML v.7.0.4 (Stamatakis, 2006) using a GTR model with GAMMA correction using 100 bootstraps. Display and manipulation of the single gene phylogenetic trees was performed using the online interactive tree of life (Letunic & Bork, 2011). On the basis of the prediction of recombination in isolates from the four main study sites (Malawi, The Gambia, South Africa and Niger) for which complete antimicrobial resistance data were available, isolates undergoing folA and folP recombination and their phenotypic resistance to co-trimoxazole were compared with strains with no recombination events detected at these sites. A Kruskall-Wallis test was used to estimate the statistical difference between the two groups.
Core genome analysis (lineage B). Annotated genes were translated to protein sequences and assigned to orthologous gene 'clusters' using OrthoMCL, with BLASTP E-value cut-off 1e 25 and inflation index 1.5 (Li et al., 2003). A stringent quality control process was applied to the genome assemblies, to ensure that poor assembly of small genomic regions did not result in an underestimate of the core genome size (Table S1). The resulting orthologous clusters were organized into a matrix of genome content using bespoke Perl scripts, with orthologues from the same genome arranged in columns and rows identifying annotated orthologues of similar function. A blank ('-') was inserted where an orthologue was absent. Genomes (the matrix columns) were randomly sampled in an arithmetic progression fashion: The number of random sampling events, S N , was established using the least number of genome(s) under consideration, a (i.e. 1 genome in this study), the total number of genomes under consideration (i.e. the dataset size), N, and the common difference of successive genomes, d, (i.e. 1) to be used during sampling. During random sampling, the total number of genomes, N, was initialized as one genome for the first event, and increased by one unit for each of the subsequent events, until it was equivalent to the total number of genomes in the dataset in the final event. The random genomes were only sampled once during each event and the total number of orthologues shared by all genomes was counted once for each cluster, thereby excluding paralogous counts. This was iterated 100 times resulting in 100 input orders for each event; the arithmetic mean core genome size was computed for each event to enable the mean core genome size to be related to the number of genomes sampled (Fig.  S2). Comparisons of core genomes from different datasets were achieved using Perl scripts (https://github.com/fy2/ pneumoscript).

Results
Global population structure of serotype 1 Whole genome sequencing was performed for 448 serotype 1 isolates collected from 27 countries between 1994 and 2009 (see supplementary material for sampling strategy, Fig. S3, Table S2, Table S3). Short reads for each isolate were mapped against the genome of serotype 1 S. pneumoniae P1041 and SNPs were identified. A maximum-likelihood phylogeny based on 58 718 SNPs is presented in Fig. 1. The study isolates grouped into four lineages (A-D). Lineages A to C have been previously reported based on MLST (Brueggemann & Spratt, 2003); here we report a previously unrecognized clade (lineage D), composed of Asian isolates. We observed a striking continental clustering, indicating significant geographical structure in the global population. The African isolates all grouped within a single lineage (lineage B The North African isolates showed clear separation from the southern African isolates (Fig. S4). The majority (61 %, 88/145) lay within the second most diverse clade, iii (based on average terminal branch lengths; Fig. S5), which also contained isolates from Asia. Despite the high number of countries represented in clade iii there was a relatively high level of country-specific clustering, implying that inter-country transfer events were infrequent. However, a single South African isolate grouped within clade iii, inferring a cross-region transfer event.
The Gambia isolates were an anomaly, only a subset (6 %, 3/50) grouped within clade iii with the other West African isolates. The majority clustered in clades i and v. The Gambia isolates in clade i (14 %, 7/50) and v (72 %, 36/ 50) belonged to ST618 and ST3081, respectively (Fig. S6). In the Gambia, ST618 previously dominated, causing w70 % of serotype 1 IPD from 1995 to 2006 (Antonio et al., 2008). ST3081, first detected in the Gambia in 2006, has replaced ST618 as the dominant cause of IPD. ST3081 has only been previously reported in Oman in 2004 (MLST.net). Consistent with the expansion of a newly emergent clone, clade v was the least diverse within the phylogeny (Fig. S5). A further three Gambia isolates (6 %) grouped within clade vi, separated from the Malawian and Mozambique subclade by a long branch length, suggesting a historical cross-region transfer event.
Clade v consisted solely of isolates from Asia and was excluded from subsequent analyses.

Recombination and resistance
We next investigated the evolutionary mechanisms that contributed to the diversification of serotype 1, lineage B in Africa. Consistent with the substitution rate reported elsewhere, (Croucher et al., 2011(Croucher et al., , 2013 the mean estimated substitution rates of the lineage B African clades ranged from 1.714|10 26 to 5.980|10 26 per site per year (P54.6|10 29 ). Of 38 187 SNPs identified, 30 597 (79.9 %) were introduced by 650 recombination events, ranging in size from 3 bp to 82 054 bp (Fig. S7).
We found a significant difference in the rec/m (ratio of homologous recombination events to point mutations) between clades (P53.2|10 24 ), which ranged from 0.03 in clade ii to 0.14 in clade i (Fig. 3). This is the first demonstration of varying recombination rates within a single pneumococcal lineage.
The majority of recombination events detected were on branches between clades (ancestral recombination) (426/ 650, 66 %) (Fig. 2). Analysis of recombination events on branches within clades (recent recombination) showed a number of genes recombined multiple times in some clades, but did not undergo recombination in others. These recombination events may have been driven by clade-specific selective pressures (Table S4) (Fig. 4). Unique to clade iii isolates from the Gambia, Ghana, Niger and Togo (and a subset of Asian isolates), were recombination events involving pbp1a, pbp1b and pbp2a, allelic variants of which confer beta-lactam resistance (Cornick & Bentley, 2012). Likewise, gyrA and parC, allelic variants of which confer fluoroquinolone resistance, were only subject to recombina-tion in Malawi isolates (clade vi). Insufficient antimicrobial resistance data were available to correlate recombination to resistance; however, it is probable that recombination in these areas reflects high antimicrobial consumption.
The gene subject to the highest number of unique recombination events in clades i and iii was folP (dihydropteroate [ii] [iii] [iv] [v] [vi] synthase); folA (dihydrofolate reductase) was also subject to multiple recombination events in clades iii and iv. Allelic variants of these two genes confer resistance to co-trimoxazole and sulfadoxine/pyrimethamine (SP) (Cornick et al., 2014) (Fig. 5, Table S5). We investigated if there was an association between recent recombination and antimicrobial resistance for the 210 isolates from the four main study sites where resistance data were available; (the Gambia n543, Malawi n564, Niger n531, South Africa n558). The isolates that underwent recent folA recombination (10 %, 25/210) were phenotypically more resistant than those that had not (average MIC 4.0 vs 2.1 mg ml 21 , P51.9|10 24 ), in contrast to a previous report (Chewapreecha et al., 2014). However, isolates that had undergone a recombination at folP (33 %, 40/210) were phenotypically less resistant than those that had not (average MIC 0.7 vs 2.6 mg ml 21 , P59.5|10 215 ).
Recent recombination events in folA were unique to a subset of the Malawian isolates and a single isolate from Niger. All of the Malawi isolates were co-trimoxazole resistant; recent recombination had introduced the resistant folA genotype (Ile-100-Leu of DHFR). All of the Malawi isolates also possessed the folP resistance genotype (a 1-2 aa insertion in DHPS). The folP phylogeny (Fig. 5) showed that the folP sequences clustered based on country of isolation, suggesting that the folP resistance genotype was introduced through a historical recombination event or spontaneous mutation and disseminated within the population. SP was extensively used as a first-line treatment for uncomplicated malaria in Malawi from 1993 to 2007. In 2005, co-trimoxazole preventative therapy (CPT) for individuals with HIV became national policy. This represents a potentially strong selective pressure for pneumococci to acquire and maintain co-trimoxazole resistance. Recent recombination events at folP were unique to a subset of South African isolates and four Gam-bian isolates. Despite extensive CPT use and evidence of recent folP recombination in the South African isolates, no resistance was observed in these isolates. South Africa, however, has a low incidence of malaria and therefore low SP consumption. The recent recombination events in folP in the South African isolates may be a result of the introduction of CPT placing a 'new' selective pressure and in time a recombination event may introduce the resistance genotype.

Resistance due to mobile genetic elements
Amongst the subset of 210 isolates we found that 49 % (102) were tetracycline-resistant and 31 % (65) were chloramphenicol-resistant (Table S5). The presence of tetM and cat, which confer tetracycline and chloramphenicol resistance, respectively, was assessed. TetM was identified in 53 % (112) and cat in 33 % (70) of isolates (Fig. 6). Two allelic variants of TetM dominated in the dataset, one associated with north-west Africa and the other southern Africa, both alleles were associated with a MIC of w8 mg ml 21 . All but one of the Malawi (clade vi) isolates harboured TetM; conversely none of the South Africa (clade ii) isolates encoded TetM. TetM was identified in all three of the north-west African clades, including 84 %, (26/31) of the Niger isolates and four Gambia isolates (9 %, 4/43). Allelic variants of Cat were also associated with different African regions. CatO was only identified in southern Africa; consistent with high phenotypic resistance, all of the Malawi isolates encoded CatO. The CatQ protein was unique to north-west Africa; however, it was only identified in three Gambia isolates (clade i), none of which displayed phenotypic resistance, and a single resistant Niger isolate. Tetracycline and chloramphenicol are readily available across Africa and it is unclear why these resistance mechanisms are only present in specific countries. The findings contrast with other studies, that report low levels of antimicrobial resistance in serotype 1 (Williams et al., 2012).

Region-and lineage-specific gene content may explain differences in disease phenotype
We defined the core and accessory serotype 1 genome for the six African lineage B clades. We identified 2305 orthologous gene clusters in the dataset. Fifty-nine per cent (59 %) (1358 core genes) were present in all of the study isolates. The remaining 842 (37 %) genes were present in two or more isolates (accessory genes), whilst 105 genes (5 %) were unique to a single isolate.
We identified 53 accessory genes that were absent in at least one clade but present in 100 % of isolates in at least one clade (Fig. 7, Table S6). Unique to the predominantly North African clades (i and iii) was a bacteriocin 972 family protein. This protein has been shown to contribute to the infectivity of Streptococcus iniae (Li et al., 2014). A peptidylprolyl isomerase (PpiA) protein was present in Region-specific diversification of serotype 1 S. pneumoniae the Malawi and Mozambique isolates in clade vi and a subset of North African isolates (clade iii). A homologue of PpiA reduces phagocytosis in Streptococcus mutans (Iyer et al., 2001). Three genes from a phosphotransferase system (PTS) were present in all of the isolates in the North African clades (i and ii). PTS plays a key role in pneumococcal colonization (Mukouhara et al., 2011). The distribution of these accessory genes may reflect host-specific selective pressures between regions and explain differences in disease severity between these regions.
As discussed earlier, the lineage B phylogeny (Fig. 2) indicates that there were two distinct, genetically unrelated groups of isolates from the Gambia, ST618 and ST3081. In 2006, ST3081 replaced ST618 as the dominant cause of IPD in the Gambia. MLST does not have the resolution to distinguish the genes exclusive to ST3081, which have driven sequence type (ST) replacement. To look at large accessory elements, rather than absence/presence of individual genes, we took a semi-manual approach to the cre-ation of the Gambia accessory genome (see Methods) ( Fig. 8). Accessory regions present in all ST3081 isolates but absent in ST618 isolates included a 6.2 kb phage element, encoding an integrase site-specific recombinase (XerD), a cell division protein, FtsK/SpoIIIE, and a fucose utilization operon first characterized in serotype 3 (SP3-BS71) (Higgins et al., 2009). Unique to all of the ST618 isolates was a type 2 fucose utilization operon, first characterized in S. pneumoniae TIGR4 (Chan et al., 2003) (Fig. S8). Fucose utilization operons have been shown to be essential to pneumococcal virulence in models of pneumonia and otitis media (Embry et al., 2007). It is possible that the SP3-BS71-type operon found in ST3081 allows ST3081 to outcompete ST618 isolates, which possess the type 2 operon. Of the remaining proteins unique to ST3081, XerD has previously been implicated in pneumococcal virulence (Chalker et al., 2000). FtsK has not previously been implicated in pneumococcal virulence; however, it interacts with Xer proteins (Le [ii] [iii] [iv] [v] [vi] Bourgeois et al., 2007), suggesting that FtsK may play an indirect role in the regulation of pneumococcal virulence.

Discussion
The global serotype 1 population shows a strong phylogeographic structure at a continental level with further country-specific phylogeographic structure evident within Africa. The data suggest that serotype 1 has disseminated within Africa and diversified independently within the continent, likely due to country-and population-specific selective pressures. There is limited evidence of intercountry transfer events within the African continent, arguably due to the dominance of locally circulating clones, limiting the scope of new clones to become established within the population, and the short carriage  Region-specific diversification of serotype 1 S. pneumoniae duration of serotype 1 limiting opportunity for intercountry transfer. Yet, the relatively high rec/m reported here suggests that this serotype is carried for periods long enough to allow extensive recombination. We report levels of recombination in serotype 1 in line with other serotypes that are commonly associated with carriage and antimicrobial resistance (Chewapreecha et al., 2014;Croucher et al., 2011). Our data suggest that recombination and antimicrobial consumption have contributed to the diversification of serotype 1 in Africa. We observed recombination events in the targets of co-trimoxazole or SP and to a lesser extent fluoroquinolone and beta-lactam antimicrobials. The varying rates of recombination and frequency of recombination encompassing these genes between lineage B African clades show evidence of adaptation to different environmental pressures within geographically isolated clades. The distribution of transposable resistance elements in the accessory genome conferring tetracycline and chloramphenicol resistance further supports that antimicrobial usage has shaped diversification. The accessory genome of serotype 1 is highly variable between regions; region/ clade-specific genes are implicated in virulence and evasion of the host immune response, suggesting that difference in host genetics between regions is also driving diversification. Furthermore, the phylogeny shows that long-range transmission of serotype 1 is rare. This lack of long-range transmission has led to the divergence of geographically isolated clades, which have remained stable with little impact from imported clones. In summary, our data show that serotype 1 is genetically distinct in Africa relative to other continents and furthermore is distinct between regions of Africa. Clonal replacement of the dominant serotype 1 clones circulating within regions is rare; however, we report the accessory gene content that has likely driven decisive serotype clonal replacement in the Gambia.