Another thing to keep in mind is that cosmogenic neutrinos are signatures of hadronic .... survive deployment and freeze-in completely; another 1% are impaired, but usable (usually, they have lost their local coincidence ...... One of the most import
Jul 11, 2016 - Setup and operation protocol of the platform is described in detail in .... FD and CB were supplied in square waves (300 s out of sync) to the left. (FD) and right (CB) channels (Fig. ..... tally.36 To calculate the diffusion coefficie
Jul 23, 2010 - AbstractâAs more and more multi-tier services are developed from commercial off-the-shelf components or heterogeneous middleware without source code available, both developers and administrators need a request tracing tool to (1) exa
Mar 17, 2015 - choices here: either use a word type that is only half the processor word size, so the product can be stored in a full processor word, or use some special function to get both the high and low part of a full sized multiplication in two
Feb 19, 2017 - to demystify the constitution of a graph embedding algorithm and analyze the function of each building block. In what follows, we discuss each ...... abs/1610.09950, 2016.  S. Brin and L. Page, âThe anatomy of a large-scale hyper
Aug 20, 2010 - l=1 Nl. Proposition 3: M = âªg l=1Ml contains full-rank, R- linearly independent matrices which satisfy. BH. 1 B2 + BH. 2 B1 = 0 for B1 â Ml1 ,B2 â Ml2 , for 1 â¤ l1 < l2 â¤ g. Proof: The fact that ......  Zafar Ali Khan Md.,
Jan 27, 2016 - Another challenge is to incorporate the increasing amount of epidemiologically relevant data into the models . It is therefore desirable to have simulation tools that are flexible to various disease spread models yet efficient to ha
Aug 10, 2017 - Vector space models excel at determining similarity between concepts, but they are severely ... an end-to-end pipeline grounded in the cybersecurity informatics domain. KEYWORDS. Knowledge Representation ... It involves designing compu
AbstractâGiven a dataset, the task of learning a transform that allows sparse representations of the data bears the name of dictionary learning. In many applications, these learned dictionaries represent the data much better than the static well-kn
The detailed analysis of the membrane formation ... figure 6. The increase in radius of the travelling contact lines is very fast ~1 mm/s and linear in time (â ~â 1,1) for all tested lipids. ... represented by three dielectrics in series (lipid
Jan 27, 2016 - is animal trade, creating a temporal network of contacts between farms . It has been shown that the ... crete Event simulation (DES), a general approach to evolve dynamical systems consisting of discrete ... as all the necessary in
Dec 27, 2016 - University of Padua, IT [email protected], [email protected] â . University of Oxford, UK ... within six seconds (90% within four seconds) after they leave their workstation's vicinity. .... for device free localiz
Mar 17, 2015 - 2 Overview of Established Algorithms. 4. 2.1 Representation of ..... However, FÃ¼rer conjectured that his new method only becomes faster than SchÃ¶nhage and. Strassen's algorithm for .... Lastly, Chapter 5 sums up the results and shows
Jun 23, 2016 - Abstract. There is an urgent need for compact, fast, and power-efficient hardware implementa- tions of state-of-the-art artificial intelligence. Here we propose a power-efficient approach for real-time inference, in which deep neural n
Research Questions. The objective of this study is to introduce, access and compare two new methods for national research impact indicators and to assess them for individual subjects. ... Electrochemistry; Computational Theory and Mathematics; Manage
This is to certify that the thesis entitled Fast computation of PERCLOS and Saccadic. Ratio submitted by ... I consider it worthy of consideration for the award of the degree of Master of Science by Research of the Institute. ... I have conformed to
Aug 19, 2015 - results prominently confirm its promising performance and advantages over the existing methods. Index Terms ... Consider the Discrete Fourier Transform (DFT) matrix D â CNÃN . From the separability of DFT, the 2D DFT of .... Fast Fo
Facebook offer certificates for their onion addresses issued by DigiCert. This helps ensure that users are ... how to create a PGP key and signature is taken into account, the time investment is dramatically less than ... attacks through certificate
Jul 20, 2016 - Abstract: We present a new framework for robust estimation and infer- ence on second-order stationary time series and random fields. This frame- work is based on the Generalized Method of Wavelet Moments which uses the wavelet variance
Dec 17, 2016 - 2.38 (see ), although as of December 2016 it still has not been decreased below 2.37, that is, in the last 3 ... In Section 9 we still pay tribute to the lasting interest to the exponent of infeasible MM, but ... of numerical and s
Jan 16, 2012 - objective function â schedulers such as those based on utility maximization, maximum weight ... We first consider a channel model where the channel gain in the frequency domain is flat and formulate the resource allocation problem as
estimation using the Fast Fourier Transform procedure and maximum decay sidelobes windows. The most ... Keywords: control of power, grid signal, amplitude and phase estimation, renewable energy, interpolated DFT, maximum ... under/over voltage protec
Dec 17, 2016 - e.g., ).6 Our personal communication in 2015 with the authors of  seems to help: the label ... of fast MM directed to minimizing communication cost and improving parallel implementation. The ...... though never serious for fast
Aug 9, 2010 - cessed to create an oracle that can answer (2k â 1)-approximate distance queries between any two vertices .... 2O(Î»)nlog2 n time and answers (1 + Îµ)-approximate distance queries in 2O(Î») time. .... we define a parent-child relation
Taxator-tk: Fast and Precise Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods J. Dröge1,2, I. Gregor1,2 and A. C. McHardy1,2,3* 1
Department for Algorithmic Bioinformatics, Heinrich Heine University, Universitätsstraße 1, 40225 Düsseldorf, Germany 2 Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, University Campus E1 4, 66123 Saarbrücken, Germany 3 Computational Biology of Infection Research, Helmholtz Center for Infection Research, Inhoffenstraße 7, 38124 Braunschweig, Germany
Abstract Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows us to identify the sequenced community members and to reconstruct taxonomic bins with sequence data for the individual taxa. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignments by fast approximate determination of evolutionary neighbors from sequence similarities. Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples becauseit identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with ten CPU cores and microbial RefSeq as the genomic reference data.
Introduction Metagenomics allows us to study microbial communities from natural environments without the need to obtain pure cultures for the individual member species (Hugenholtz, 2002; Riesenfeld et al., 2004). Metagenome sequencing of microbial community DNA with current shotgun techniques generates reads that range from less than 100 to several thousand nucleotides (Dröge and McHardy, 2012; Klumpp et al., 2012). By computational analyses of metagenome samples, we can estimate the abundances of different taxa for the sampled communities, known as taxonomic profiling, characterize of the functional and metabolic potential based on the predicted proteins and resolve the contributions of individual taxa to the latter by reconstructing taxonomic ‘bins’ of unassembled or assembled sequences originating from a common taxon. 1
A taxonomic profile of a microbial community can be inferred by either targeted sequencing of taxonomic marker genes or from metagenome shotgun data-sets (Lindner and Renard, 2013; Sunagawa et al., 2013). Most metagenome profiling methods classify reads based on predefined taxon-specific (Segata et al., 2012) or ‘universal’ marker genes (Darling et al., 2014), or directly estimate a taxonomic profile for the underlying microbial community from their k-mer composition (Koslicki et al., 2013). Frequently used phylogenetic placement programs within such frameworks are pplacer (Matsen et al., 2010) or EPA/RAxML (Berger et al., 2011), which both operate in a probabilistic framework to position a query gene sequence in a pre-computed reference phylogeny of a particular gene family. If this gene tree is an approximate representation of the respective species tree – or reference taxonomy – this can be used to assign a taxonomic identifier (ID) to the query sequence (Stark et al., 2010; Matsen and Gallagher, 2012). Taxon abundances are then derived from the individual read counts or gene frequencies within each taxonomic group. Metagenome binning places arbitrary shotgun sequences of a metagenome into bins representing the different taxa of the sampled microbial community. If a bin represents a low-ranking taxon, such as the species, then the set of reads or contigs of an individual taxonomic bin serves as a draftgenome reconstruction for a community member (Pope et al., 2011). Binning methods are either based on clustering or classification. Clustering methods group sequences into bins without consideration of external reference sequences or taxonomic information. Instead, bins are inferred based on similarities in GC-content, oligomer frequencies, the abundance of genes or contig coverage within one or multiple samples (Baran and Halperin, 2012; Carr et al., 2013; Albertsen et al., 2013; Alneberg et al., 2013), or by using a combination of these sequence features (Iverson et al., 2012). This allows draft genome recovery from deep lineages for sequences of sufficient length. Taxonomic classification, like profiling, uses the resemblance of a sequence to known taxa in either global sequence composition or local sequence similarity to assign a taxonomic ID. For the human gut microbiome, extensive genome sequencing of isolate cultures allowed species-level taxonomic binning for a substantial portion (approx. 40%) of a metagenome sample (Schloissnig et al., 2013) by mapping the reads to isolate genomes, which exist for the majority of abundant species (Sunagawa et al., 2013). However, this procedure is not suitable for environments in which most species are from deep-branching lineages without available reference genomes. Taxonomic binning of these requires more sophisticated similarity-based or composition-based taxonomic assignment methods (McHardy et al., 2007; Huson et al., 2011; Brady and Salzberg, 2011). Classification by sequence composition also allows draft genome recovery from deep-branching lineages, based on limited amounts of sequences for the individual taxa. Such programs also achieve linear classification times regarding metagenome sample size whereas similarity-based methods require considerably more computational resources due to homology searches in large reference sequence collections. Nevertheless, they are more accurate in the assignment of sequences shorter than approx. 1 kb (Patil et al., 2011). As described, a common problem of both taxonomic profiling and binning is the identification of known taxa. As ‘taxonomic profiling’ estimates only taxon abundances, it can be performed by analysis of smaller sets of marker genes alone, though one needs to account for variations in read counts among different clade-specific markers within taxa (Lindner and Renard, 2013). Taxonomic 2
binning assigns taxonomic IDs to arbitrary sequences, to allow taxon-specific functional and metabolic analyses, where as a side-effect, a bin can also be used to quantify the corresponding taxon abundance based on individual read counts assigned to these bins. From a methodological standpoint, the differences between the mentioned phylogenetic-placement-based methods for profiling and alignment-score-based binning methods, such as MEGAN4 (Huson et al., 2011), CARMA3 (Gerlach and Stoye, 2011) or SOrT-ITEMS (Monzoorul Haque et al., 2009) are that the latter lack a well-motivated evolutionary framework. However, they have the advantages of being computationally lightweight and applicable to arbitrary genes. This is necessary for binning because pre-computing or even inferring de-novo trees for non-marker genes on a metagenome-wide scale is too computationally demanding – particularly with next-generation sequencing (NGS) data. Our taxator toolkit (taxator-tk) is a software package for the taxonomic assignment of genomic sequences with application to metagenome profiling and binning. Conceptually, it lies between sequence-similarity-based programs which use local sequence alignment scores and those using trees. Taxator-tk extends the alignment-score-based approach by approximating phylogenetic gene trees and thereby provides more accurate taxonomic assignments, without assuming universal, rank or clade-specific gene conservation levels as parameters. We improve in terms of applicability to large data sets compared to phylogenetic methods by assigning arbitrary genomic sequences without the computationally demanding steps of de-novo multiple sequence alignment (MSA) and tree inference. Taxator-tk determines a subset of homologs, which represent the approximate evolutionary neighbors for a query sequence with a linear number of pairwise sequence comparisons with regard to the number of considered homologs and then assigns a taxonomic ID using a reference taxonomy based on the taxonomic identities of these neighbors. We have furthermore reduced the run-time by limiting the analysis to distinct homology-supported regions of the query sequence, which we termed query segmentation. Our method can be applied to arbitrary nucleotide sequences including large assembled or unassembled metagenome sequence samples. The software is released as an open-source package (GPLv3) and can be downloaded from http://algbio.cs.uni-duesseldorf.de/software/.
Methods (Algorithm and Evaluation) Taxator-tk’s workflow for taxonomic assignment The workflow for the taxonomic assignment of a query sequence comprises three stages (Fig. 1a–c). The first stage uses a (nucleotide) local sequence aligner to identify similar regions from a reference sequence collection, such as microbial RefSeq (mRefSeq) (Sayers et al., 2009). The program supports different aligners and we used NCBI BLAST (Camacho et al., 2009) and LAST (Frith et al., 2010). Before applying the taxator algorithm in stage two, overlapping regions on the query, each defined by local alignment to a reference sequence, are merged into larger subsequences called segments (Supplementary Fig. 24). These query segments are flanked by regions without similarity to any reference data (Supplementary Fig. 16) and are not considered further. This step reduces the overall number of positions in the following alignment computations and improves the taxonomic assignment of queries that have undergone genome rearrangements, resulting in a different order of 3
these segments. The reference sequence regions corresponding to the local alignments are extended at both sides by the missing number of nucleotides to match to the corresponding query segment with respect to its length. We refer to these as reference segments. Each independent set of homologous segments is the input to core algorithm in the program taxator in stage two (Fig. 1b), which calculates a taxonomic ID independently and in parallel for the corresponding query segment. In the third stage (Fig. 1c), multiple segments belonging to the same query are considered and their IDs are combined in the program binner, to derive a consensus taxonomic ID. The corresponding algorithm weights the individual segment assignments by the number of positions matching the closest reference sequence and assigns to the entire query the taxonomic ID supported by the majority (default = 70%) of weighted assignments with a minimum number of matching positions (default = 50 bp) (Supplementary Methods, “Consensus binning algorithm”). Binner has optional parameters to specify additional constraints for binning (minimum sequence identity, minimum. sample abundance) but these were not applied in our analysis. If the taxonomic information is limited or contradictory, taxator and binner assign identifiers to higher ranking taxa in a conservative fashion to obtain maximum reliability for the resulting taxonomic assignments. The taxonomic assignment algorithm (taxator) The input to the algorithm is a segment q of the original query sequence from an (unknown) taxon Q and a set of homologous segments with known taxonomic IDs. The term “segment” refers to a gap-less subsequence of either thequery or a reference sequence. Given that for the set of homologs we know the correct underlying tree of taxa (Fig. 2a), we can see that for our query taxon Q, the closest evolutionary neighbors would be A, B and S. If we simply assign X, the parental taxon of A, B and S, as a taxonomic identifier, this would be inaccurate, as A, B and S are more closely related to each other than to Q. Instead, the correct taxonomic assignment would be a parent of X, Q and at least one additional outgroup taxon (O) in the reference tree, such that Q becomes a descendant of the identified parent (R in Fig. 2a). If we therefore can identify the taxa A, B, S and O in the reference tree, we can determine the taxonomic ID of R as the lowest common ancestor (LCA) of these taxa and assign it to Q (and q). Assuming that the underlying gene tree for a set of homologs is similar to the species tree, a natural procedure to identify the segments corresponding to the leaf taxa within R among the homologs would be to construct a MSA for the segment and a phylogenetic tree with a corresponding subtree as in Fig. 2a. However, the computational effort for this approach is superlinear with respect to the number of homologs being compared and substantial for all the query segments in a large sample, even using fast techniques for MSA construction and tree inference. The taxator algorithm attempts to identify these segments with a linear number of pair-wise segment comparisons. Let us consider the evolutionary distances between pairs of segments within the underlying tree to be represented in an undirected graph with the nodes representing the segments (tree leaves) and the edges scaled by evolutionary distances between pairs of segments (Fig. 2b). In this graph, a monophyletic group in the species tree is a subgraph. For all pairs of subgraph nodes, the following inequality is true, given that the segments have evolved with a constant rate of evolution (i.e. the segment tree is 4
ultrametric): The distance between any two subgraph nodes is smaller than that to any other node outside the subgraph. The relationship becomes clearer when thinking of the evolutionary distance between two nodes as the divergence time from their most recent ancestor. Members of a monophyletic group derive from a single common ancestor and thus there is a maximum distance for all possible pairs. If one member’s distance to an outside node is smaller than this maximum, both must share a more recent common ancestor and the corresponding group is not monophyletic by definition. The stated inequality can be used to augment an incomplete group or corresponding subgraph iteratively by taking an internal distance, ideally close to the maximum, as a threshold and adding outside nodes to the group which have a smaller distance to some internal node. In this manner, taxator-tk searches for the leaf node taxa of clade R among all segments based on a linear number of sequence comparisons between the input segments and adds them to an (initially empty) working set M. 0. A ranking by alignment scores from the input local alignments is used at the beginning to identify the reference segment s that is most similar to the query q. The working set M is then augmented in two passes: 1. In the first pass, all segments are aligned to s using fast nucleotide alignment and the edit distance. All segment taxa with a distance less than or equal to the threshold, distance(s,q), are added to M (Fig. 2c). 2. The outgroup segment o is determined as the first segment for which distance(s,o) larger than distance(s,q). In the second pass, all segments are then aligned to o and segment taxa with distances smaller than or equal to distance(o,q) are added to M as well (Fig. 2d). This procedure requires approximately 2n alignments, where n is the number of reference segments. 3. The resulting set M of taxa (implicit in the partially resolved tree in Fig. 2e) is used to determine the taxonomic ID for q, corresponding to the LCA of these taxa in a reference taxonomy, such as the NCBI taxonomy. If no outgroup could be determined or if M is so diverse that the LCA corresponds to the taxonomy root, q is left unassigned. The algorithm requires at least two homologous segments (s and o) to determine a valid taxonomic ID. The taxa in M become more diverse if the alignment scores are inaccurate ultrametric distance estimates, if the species subtree’s topology deviates from the respective part of the taxonomy or if the gene tree’s topology deviates from the species tree, for instance due to varying rates of evolution or the inclusion of non-homologous segments in the analysis. The robustness of the algorithm in avoiding incorrect assignments under these circumstances depends on the number of taxa in M and the subsequent LCA operation. Further details relating to the robustness of the implementation are given in the Supplementary Methods, “Taxonomic assignment of sequence segments”.
Evaluation procedures Before evaluating all the methods, we removed the smallest predicted bins (1%) as likely errors. We used the macro-precision and macro-recall as measures of assignment performance (Supplementary Methods, “Performance measures”). The macro-precision specifies the fraction of correct assignments per predicted bin (precision), averaged over all such bins, while the macro-recall measures the fraction of correctly recovered sequence data per truly existing bin (recall), averaged over all such bins. To account for strong differences in bin size, we also pooled the species, genus and family assignments, and reported the overall precision for these ranks as the total fraction of correct assignments. We tested the assignment performance of different methods using three simulated short read datasets, a simulated 16S rRNA dataset, three simulated assembled metagenome samples and a cow rumen metagenome sample. For every simulated dataset, we performed seven cross-validation experiments (Supplementary Methods, “Cross-validation”). In each experiment, we created a partition for every query sequence considered, simulating a specific taxonomic distance between this query and the reference sequences. For the first scenario, all reference data, including the species genome from which the query had been sampled, were made available to the method for assigning a single query sequence as an idealized test case. In the other six scenarios, all reference data belonging to the species, genus, family, order, class or phylum of the query sequence, respectively, were made inaccessible for the method. We added the sequence assignmentsfrom these experiments to characterize a method’s assignment performance across the entire range of taxonomic distances. For evaluation with the cow rumen metagenome sample, for which no true taxonomic labels were known, we divided the assembled contig sequences into multiple ‘chunks’ and characterized the consistency of taxonomic assignments for chunks originating from same contig (Supplementary Methods, “Consistency Analysis”).
Results Evaluation with unassembled data We first evaluated the performance of taxator-tk for classification of the most widely used taxonomic marker in bacterial diversity studies, the 16S rRNA gene (Supplementary Material, Supplementary Fig. 1). This served as a proof of concept, as taxator-tk classifies arbitrary sequence regions including taxonomic marker genes. We did not expect it to perform better than sophisticated phylogenetic models for this task, but wanted to confirm a satisfactory performance. The macroprecision for the taxonomic assignment of 7176 16S rRNA genes was constantly above 92% (Fig. 3a) in the combined cross-validation (Methods), using the whole-genome reference sequences in mRefSeq47 (Supplementary Fig. 19), not just the 16S genes. More precisely, the average error rate per bin (one minus precision) was 7.4% at the species level and 4.6% at the order level. Next, we simulated 100,000 reads at 100, 500 and 1000 bp by subsampling randomly from 1729 species in mRefSeq47 and evaluated taxator-tk with these three datasets using (combined) crossvalidation. The performance was very similar for the different fragment sizes (Fig 2b, 2c and Supplementary Fig. 2–4a). Overall, taxator-tk showed high precision in simulated read assignment: The macro-precision for all short read lengths remained above 74% and was 82–99% for the genus 6
to kingdom ranks, about 10% lower on average than for the 16S data. This was still good for the assignment of short sequence fragments from arbitrary genomic regions compared to marker genes. Longer reads had a slightly higher macro-recall than the shorter ones. At genus level, the macrorecall was 19–23 % (~33% genera recovered) if genomes of the same species as the query sequence were provided in the reference (Supplementary Fig. 2–4b) and as low as 5–7 % (~16% genera recovered) otherwise (Supplementary Fig. 2–4c). The macro-recall must decrease when removing reference data for cross-validation. For example, if all reference data at genus level are removed, then no correct assignments to the genus rank should be possible. The macro-recall scale also very much depends on the validation sample. Generally, it increases with higher genome coverage or lower organismal sample complexity. Here, the macro-recall was low due to the large number of sample taxa and their uneven sampling caused by the taxonomic bias in mRefSeq47. Since longer sequences yield better recall and because overlapping reads contain redundant information, leading to more alignment computations, we recommend applying taxator-tk to (partially) assembled data. We observed that on longer query sequences, we were more likely to find segments for processing and therefore assign a larger portion of the sample. Evaluation with simulated metagenome samples For the tests on assembled simulated samples, we compared taxator-tk to CARMA3 and MEGAN4 using the same taxonomy and the same nucleotide alignments against mRefSeq54 (Supplementary Fig. 20). We used the recommended parameter settings (Supplementary Methods, “Program parameters”) and cross-validation, as before (Methods, “Evaluation procedures”). Both CARMA3 and MEGAN4 allow the use of protein or nucleotide alignments. We exclusively used faster nucleotide alignments because of the large size of the metagenome datasets and because we did not observe an improvement in performance when using protein instead of nucleotide local alignments. We used the SimMC/AMD and SimHC/soil simulated metagenome datasets of the FAMeS collection (Mavromatis et al., 2007) for our evaluation. These metagenome datasets were generated by Sanger sequencing in the year 2006 and are several orders of magnitude smaller than datasets generated with the current NGS technologies (Dröge and McHardy, 2012). The medium complexity SimMC/AMD consists of ~17 Mb/7307 contigs and the high complexity SimHC/soil sample comprises ~1 Mb/578 contigs. We evaluated MEGAN4 and taxator-tk with SimMC and SimHC (Supplementary Fig. 21, 22), but omitted CARMA3 due to its long run-time for each of the seven cross-validation experiments (Methods, “Evaluation procedures”). Due to the small sample sizes and because the FAMeS data could have been used for the method development, we created an additional simulated NGS metagenome dataset (simArt49e, composition in Supplementary Fig. 23) for our evaluation. This sample included 49 equally abundant species (51 strains) and was created by Illumina paired read simulation with pIRS (Hu et al., 2012), followed by SOAPdenovo assembly (Luo et al., 2012). Around 160 Mb or 267,178 contigs remained after removal of 0.03% chimeric sequences. On the FAMeS datasets, taxator-tk produced fewer errors for all taxonomic ranks than MEGAN4, which was accompanied by a moderate reduction in macro-recall (Supplementary Fig. 5-8) throughout all individual experiments and in the combined cross-validation experiments: For 7
SimMC, the macro-precision was three to four times as large as MEGAN4's for species to order, with higher macro-recall (Supplementary Fig. 5-6). The species to family overall precision was ~91% for taxator-tk (~59% for MEGAN4) and taxator-tk estimated 54 species bins (MEGAN4 188) for the 47 actual species in SimMC. Similarly, for SimHC, taxator-tk achieved a higher macroprecision for all ranks, which was most pronounced for class and phylum (Supplementary Fig. 7–8). By contrast, the macro-recall was slightly reduced and both methods underestimated the 96 existing species in SimHC. Our simulated metagenome sample simArt49e was difficult to assign for all methods when the sequences from the corresponding species and genus were removed from the reference (Supplementary Fig. 9–11d). In this case, all methods showed a reduced macro-precision for the assignment at the family rank in comparison to the FAMeS datasets. Still, taxator-tk was the most precise, though it had a lower recall than the other methods (taxator-tk: 56% family macroprecision, 60% overall precision for species to family, 10% family macro-recall; CARMA3: 13%, 27% and 20%; MEGAN4: 22%, 27% and 31%). Similar to this individual experiment, in the combined cross-validation (Supplementary Fig. 9–11a), most sequences were assigned to bacteria or archaea by all methods or, in the case of CARMA3, remained unassigned. For the data assigned to the species to family ranks, taxator-tk had 91% correct assignments compared to 52% for CARMA3 and 59% for MEGAN4. The macro-precision was substantially higher for taxator-tk than for the other methods, e.g. 61% at the species level (taxator-tk), compared to 3% (CARMA3) and 5% (MEGAN4). The low macro-precision observed for CARMA3 and MEGAN4 is largely due to the prediction of many small bins with many false assignments (Supplementary Methods, “Performance measures”). Likewise, consistent with the results for FAMeS, more species bins were predicted by CARMA3 (1672) and MEGAN4 (824) than by taxator-tk (65), with 49 species being actually present in the sample. When simulating novel families (Supplementary Fig. 9–11e), MEGAN4 predicted 69 distinct orders to be present, CARMA3 81 and taxator-tk only 27, compared to the existing 32 in simArt49e (Fig. 4). Taxonomic assignments of taxator-tk were considerably rarer to false taxa at low ranks than with the other methods, and instead were to higher-ranking existing taxa. The other two methods assigned a substantial amount of sequence data incorrectly to bins at the family level or below, which can be a seriously misleading result, depending on its further use. The prediction of much fewer taxa which are not truly present in the sample makes taxator-tk more suitable as a tool for determining microbial community members, in addition to the plain recovery of taxonomic sequence bins from shotgun datasets. To investigate the reason for the observed differences between overall and macro-precision, we plotted the per-bin precision at the family level in the combined cross-validation as a function of predicted size with a k-nearest-neighbor (kNN) estimate of macro-precision (Fig. 5; for all ranks, see Supplementary Fig. 17). Overall, the bins predicted by taxator-tk were smaller, more precise and much more likely to represent truly existing taxa than those predicted by the other two programs; larger bins tended to be more accurate for all methods. CARMA3 and MEGAN4 predicted a substantial number of mostly smaller-sized incorrect bins with zero precision. Even though the size-dependent kNN precision curve should be unaffected by these bins, for larger bin sizes, they never reached 70% (CARMA3) or 80% (MEGAN4), whereas the taxator-tk curve reached almost 8
100%. For the smallest bins, taxator-tk’s kNN precision was ~20% whereas bins below 500 kb for CARMA3 and MEGAN4 were practically indistinguishable from noise. When comparing to the composition-based program PhyloPythiaS (Patil et al., 2011), we could not apply cross-validation due to the computational effort of training a large number of models. Therefore we evaluated this using the published evaluation scenario with SimMC (Patil et al., 2011), in which all genome sequences of the SimMC genera were removed from the reference sequenced genomes. All programs were provided with the remaining sequenced genomes and an additional 100 kb of reference data for each of the three dominant strains. The latter could be used by PhyloPythiaS to infer a corresponding species model but were less helpful for the similaritybased classifiers. We generated assignments with taxator-tk, CARMA3 and MEGAN4 under equivalent conditions and compared them to the published PhyloPythiaS assignments (Supplementary Fig. 12). The results for the similarity-based programs were consistent with the previous evaluations with SimMC concerning the error distributions. The PhyloPythiaS results showed that composition-based classification with additional training data correctly assigned most data at the genus and family levels (species assignments were not given in the original publication), which were either rarely assigned (taxator-tk) or mostly incorrectly assigned (MEGAN4, CARMA3) by the other programs. However, PhyloPythiaS predicted fewer families (6), compared to 29 underlying families versus 14 (taxator-tk), 50 (CARMA3) and 17 (MEGAN4). Apart from an increased macro-recall with PhyloPythiaS, the macro-precision (~50% at genus to order level) was also higher than it was with MEGAN4 (~9–30%) or CARMA3 (~7–24%) but less than with taxatortk (~50–68%). However, unlike the other programs, for PhyloPythiaS the modeled taxa should be specified a priori to achieve optimal performance. It is therefore best applied when the taxonomic composition of a microbial community has already been determined and sufficient training data are available for the identified taxa. Evaluation with a real metagenome sample For microbial communities in many environments, only distantly related reference genomes are available. We analyzed a medium complex metagenome sample of such a microbial community from a cow rumen (Hess et al., 2011) with taxator-tk, CARMA3, MEGAN4 and PhyloPythiaS (the general model with the 100 most abundant species among sequenced prokaryotes). For this particular sample, we considered scaffolds to be unreliable compared to contigs, which we reconstructed by splitting the available scaffolds at gaps of more than 200 positions (A. Sczyrba, personal communication). We subsequently divided contigs longer than 10 kb into sequence ‘chunks’ of 2 kb (minimum 5), resulting in a 319 Mb dataset, which we used to assess the assignment consistency for chunks originating from the same contig. The chunk sequences were assigned independently with taxator-tk, CARMA3 and MEGAN4 (identical alignments) and with PhyloPythiaS. As the standard of truth for each contig, we determined the taxon minimizing the inconsistency between all corresponding chunk assignments (Gregor et al., unpublished). A chunk assignment was considered consistent if it was to the same taxon as the one for entire contig, and was inconsistent otherwise. The consistency of a taxonomic bin is the fraction of chunk data with matching contig assignments and the macro-consistency is the consistency averaged over all predicted taxa, comparable to the macro-precision. 9
In agreement with the tests on the simulated metagenome datasets, the taxator-tk results were more consistent than those of the other methods (Supplementary Fig. 13): 76–78% macro-consistency at species to order level, in comparison to MEGAN4 (34–40%) and PhyloPythiaS (56–65%). The overall consistency (analogous to overall precision) for species to family levels was 97% with taxator-tk, 39% with CARMA3, 62% with MEGAN4 and 82% with PhyloPythiaS. Likewise, taxator-tk assigned less data at species to family level, with a total of 12.8 Mb being consistent compared to CARMA3 (8kb), MEGAN4 (46.9 Mb) or PhyloPythiaS (13.8 Mb). Different methods again identified different numbers of taxa to be present: CARMA3 identified 572 genera with a macro-consistency of 53%, MEGAN4 264 (34%), PhyloPythiaS 33 genera (63%) and taxator-tk found 110 genera (76%). These results are in agreement with the results for simulated samples and suggest that taxator-tk is a precise taxonomic classifier for metagenomic shotgun sequences. Run-time analyses The run-time for taxonomic assignment of a metagenome sample consists of the time to find homologs and to process them for assigning taxonomic IDs to all sequences. We evaluated the runtimes of all methods using the same set of alignments generated with either BLAST or LAST, so the run-time for the initial similarity search was identical with all methods. We determined the time for the taxonomic assignment of simArt49e for all methods when performing a cross-validation with the data of the families present in the test dataset removed from the reference data. This took less than one hour for MEGAN4 (interactive mode), less than 6 hours for taxator-tk (~10 CPU cores) and almost a week for CARMA3 (~20 CPU cores). On our system, the parallelization of taxator-tk led to a linear parallel speedup (with the number of CPU cores) for up to 15 cores and deteriorated with 20 cores (Supplementary Fig. 14). To provide a more specific estimate of the throughput of taxator-tk, we aligned ~1 Gb of cow rumen sequence data with BLAST against mRefSeq54 and assigned the data with taxator-tk on 10 CPU cores (AMD Opteron 6386 SE). We measured an average throughput of 5.9 Gb per day for the combined alignment and taxonomic assignment steps with this dataset. We also determined how our implementation scaled for increasing input sequence lengths and reference exclusion scenarios (Supplementary Fig. 15a). The run-time scaled approximately linearly except when the same or very similar species were among the reference genomes. In general, the greater the number of similar sequences in the reference data, the longer taxator-tk’s run-time was for the alignment of longer sequence stretches with more homologs. Simultaneously, we investigated the impact of the query segmentation on taxator-tk’s run-time (Supplementary Fig. 15b) and found that it reduced the total run-time by up to 30%.
Discussion Here, we have described taxator-tk, a taxonomic assignment software package which generates very precise taxonomic assignments (i.e. assignments with few errors) for metagenome shotgun sequencing data. To provide a fair comparison, we invested extensive effort into ensuring that we evaluated all methods under identical conditions with the same reference sequences, test datasets and background taxonomies, using their recommended settings. We evaluated taxator-tk on 16S gene sequences, simulated short reads, and with a simulated and a cow rumen assembled 10
metagenome sample across a wide range of evolutionary distances between the query and reference sequences by using cross-validation. For comparisons to other methods, we used older and new larger NGS samples. Taxator-tk was the most precise of all tested methods with the most realistic number of identified taxa overall. This property was very pronounced for lower taxonomic ranks from species to family level. However, this also means that taxator-tkassigned fewer data overall than other methods from species to family. For the small assembled SimMC dataset, it assigned fewer data, particularly in comparison to the composition-based classifier PhyloPythiaS when 100 kb of data were provided for individual community members to train species-level models. For the real cow rumen dataset, taxator-tk was the most consistent in terms of classifying multiple pieces of one contig. All results consistently indicate that taxator-tk’s strength is its high precision of assignments, which allows us to confidently assign a core of sample sequences and thereby to infer the taxonomic composition of the community. In comparison to assignments based on marker genes, it has the advantages that it makes assignments across all domains of life and that corresponding abundance estimates from shotgun sequences are less affected by copy number variations of individual genes. Such shotgun estimates are also unaffected by PCR primer amplification biases, unlike marker gene sequencing techniques, and do not require high-quality reference gene phylogenies of individual marker genes . The amount of recoverable low ranking taxonomic assignments depends on the available reference data, as it does for all taxonomic classification methods. To target draft genome reconstructions, the data assigned to individual taxonomic bins by taxator-tk can be used as training data for complementary approaches, such as composition-based methods, or as independent information in combination with recently proposed clustering methods using the abundance of genes or contigs across multiple samples. From a methodological point of view, we have introduced a method for the fast approximation of the evolutionary neighborhood of a query sequence with a run-time that increases linearly with the number of homologs. In de-novo phylogenetic inference methods, the run-time increases at least log-linearly with the number of homologs or they rely on time-consuming optimizations of parameter-rich phylogenetic models, which generate excessive computational requirements for the analysis of Gb-sized NGS samples. Our software, therefore, provides an easy to use alternative to taxonomic classification of marker genes that is applicable to any gene or gene fragment in a scalable manner. Unlike similarity-based taxonomic classifiers for shotgun data, our algorithm handles different degrees of sequence conservation without preset or user-specified parameters and without being restricted to the analysis of a number of high-quality homologs with a minimal length. At the same time, the inferred evolutionary neighborhood is extended by the identification of an outgroup, leading to more precise taxonomic assignments. Importantly, genes are evaluated with regard to their taxonomic information content in the process of assignment, which discards information in conflict with the taxonomy. We post-process independent taxonomic assignments of query segments to infer an assignment for the entire query and do this using a majority vote algorithm with a few robust default parameters. This computationally lightweight step is quickly repeated with other values for the majority and minimum support parameters, if required. In addition to the algorithmic considerations and other run-time optimizations, we implemented query sequence segmentation and program parallelization, which allow large-scale data analysis with a throughput of several Gb per day on a standard multiprocessor system. 11
We have applied taxator-tk to characterize the taxonomic composition of Bacteria, Archaea and Eukaryotes in multiple samples of the barley rhizosphere, which correlated with results from 16S rRNA profiling and showed the most notable deviations for taxa known to be affected by primer biases or having multiple copies of the 16S rRNA gene (results not shown). The program’s scope is also not limited to the taxonomic assignment of metagenomes: It can be applied to arbitrary DNA or RNA sequences. A successful in-house application was the detection of contamination in isolate sequencing data. Furthermore, the program taxator within taxator-tk provides taxonomic information for individual query segments (Supplementary Fig. 16), which could be used to identify assembly errors or regions acquired by lateral gene transfer.
References Albertsen, M. et al. (2013) Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol., 31, 533-538. Alneberg, J. et al. (2013) CONCOCT: Clustering cONtigs on COverage and ComposiTion. Preprint, arXiv:q-bio.GN/1312.4038v1. Baran, Y. and Halperin, E. (2012) Joint analysis of multiple metagenomic samples. PLoS Comput. Biol., 8, e1002373. Berger, S.A. et al. (2011) Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Syst. Biol., 60, 291–302. Brady, A. and Salzberg, S. (2011) PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat. Methods, 8, 367. Camacho, C. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421. Carr, R. et al. (2013) Reconstructing the genomic content of microbiome taxa through shotgun metagenomic deconvolution. PLoS Comput. Biol., 9, e1003292. Darling, A.E. et al. (2014) PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ, 2, e243. Dröge, J. and McHardy, A.C. (2012) Taxonomic binning of metagenome samples generated by next-generation sequencing technologies. Brief. Bioinform., 13, 646–655. Frith, M.C. et al. (2010) Parameters for accurate genome alignment. BMC Bioinformatics, 11, 80. Gerlach, W. and Stoye, J. (2011) Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res., 1–11. Hess, M. et al. (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science, 331, 463–467. Hugenholtz, P. (2002) Exploring prokaryotic diversity in the genomic era. Genome Biol., 3, 1–8. Huson, D.H. et al. (2011) Integrative analysis of environmental sequences using MEGAN4. Genome Res., 21, 1552–1560. Hu, X. et al. (2012) pIRS: Profile-based Illumina pair-end reads simulator. Bioinformatics, 28, 1533–5. Iverson, V. et al. (2012) Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science, 335, 587–590. 12
Klumpp, J. et al. (2012) Next generation sequencing technologies and the changing landscape of phage genomics. Bacteriophage, 2, 190–199. Koslicki, D. et al. (2013) Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing. Bioinformatics, 29, 2096–2102. Kunin, V. et al. (2008) A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev., 72, 557–578. Lindner, M.S. and Renard, B.Y. (2013) Metagenomic abundance estimation and diagnostic testing on species level. Nucleic Acids Res., 41, e10. Luo, R. et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience, 1, 18. Matsen, F.A. et al. (2010) pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11, 538. Matsen, F. A. and Gallagher, A. (2012) Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots. Algorithms Mol. Biol., 7, 8. Mavromatis, K. et al. (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat. Methods, 4, 495–500. McHardy, A.C. et al. (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods, 4, 63–72. Monzoorul Haque, M. et al. (2009) SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics, 25, 1722–1730. Patil, K.R. et al. (2011) Taxonomic metagenome sequence assignment with structured output models. Nat. Methods, 8, 191–192. Pope, P.B. et al. (2011) Isolation of Succinivibrionaceae implicated in low methane emissions from Tammar wallabies. Science, 333, 646–648. Riesenfeld, C.S. et al. (2004) Metagenomics: genomic analysis of microbial communities. Annu. Rev. Genet., 38, 525–552. Sayers, E.W. et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 37, D5–15. Schloissnig, S. et al. (2013) Genomic variation landscape of the human gut microbiome. Nature, 493, 45–50. Segata, N. et al. (2012) Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods, 1–7. Stark, M. et al. (2010) MLTreeMap – accurate maximum likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies. BMC Genomics, 11, 461. Sunagawa, S. et al. (2013) Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods, 10, 1196–1199.
Figure 1: Workflow diagram for the taxonomic assignment of a query sequence with taxatortk. Taxonomic assignment with taxator-tk includes three steps. In step (a), local alignments between a query sequence and sequences from a reference collection are identified using e.g. NCBI’s BLAST or LAST. In step (b), the program taxator separates the query sequence into distinct segments with homologs and determines a taxonomic ID for each segment. In step (c), the program binner determines a taxonomic ID for the entire query sequence based on the taxonomic assignments of the individual segments.
Figure 2: Algorithm for taxonomic labeling of query segments (realignment placement algorithm) Realignment placement algorithm (RPA) for assigning a taxon ID to a query segment q. Panel (a): Exemplary phylogenetic tree with query taxon Q and reference taxa A, B, C, D, O and S. Panel (b): Approximate graph representation of the tree metric (pair-wise distances between taxa in the tree). The subgraph corresponding to the clade X is highlighted. Panels (c and d) show the two augmentation passes of the RPA in which segment taxa are added to an (empty) set M. (c) In the first pass, all segments are aligned to s, which is the segment with the best initial local alignment score (program input). The resulting scores are ordered and the taxa are added to M if they have an equal or smaller score than the threshold, distance(s,q). The outgroup segment o is the segment with the shortest distance but which is larger than the threshold. (d) In the second pass, all segments are aligned to o and the resulting scores are ranked. All taxa of segments with a score smaller than distance(o,q) are again added to M. Finally, M includes all the nearest evolutionary neighbors for the query segment q; in this example, they are the taxa corresponding to segments a, b, c, d, o and s. The taxonomic ID then assigned to q is the lowest common ancestor in a reference taxonomy such as the NCBI taxonomy of all taxa in M. Panel (e) gives an example of the constructed subtree at node R from the pair-wise segment alignments in (d and e), where the exact position of segments a, b, c and d is left unresolved (dashed lines) by the RPA. 15
(a) 16S gene
(b) 100 bp simulated reads
(c) 500 bp simulated reads
Figure 3: Taxator-tk combined cross-validation performance for unassembled sequence data. Taxonomic assignment performance over seven cross-validation experiments simulating taxonomic assignments across a range of evolutionary distances. The bars show the absolute number of correct and false assignments for the corresponding rank (x axis). The macro-precision and macro-recall shown are cumulative from low to high taxonomic ranks, such that, for instance, the family level macro-precision includes the species, genus and family assignments. Assignment performance is shown for 16S rRNA genes in mRefSeq47 with a minimum length of (a) 1000 bp and simulated reads with a length of (b) 100 bp and (c) 500 bp. 16
Figure 4: Comparison of taxonomic assignments for sequences from novel families with three different classifiers on the simulated metagenome sample (simArt49e) with 49 species Taxonomic assignments of CARMA3 (dark blue), MEGAN4 (light blue) and taxator-tk (green) for simArt49e in the cross-validation experiment, for which all sequences of the same families were removed from the reference sequence data (Supplementary Fig. 9–11e). Only assignments at the order level or above can therefore be correctly assigned, and all family level assignments (shaded in light red) are incorrect. For further details on the precision and recall of the methods, see Supplementary Fig. 9–11e and Fig. 5. Due to the large number of taxa additionally predicted by CARMA3 and MEGAN4, only the 32 existing order level taxa of simArt49e are shown, as well as the predicted families of these orders, where -assignments below family level were included in counts at the family level.
Figure 5: Family-level bin precision for the simulated metagenome sample with 49 species (simArt49e) Comparison of the assignment precision to bin sizes (logarithmic scale) for individual family-level bins of simArt49e sample. We show the precision for the seven cross-validation experiments assessing taxonomic assignment performance combined across a range of evolutionary distances between the query sequence and the reference sequences. Correct assignments at ranks below family were also considered correct at the family level. (a–c) show the corresponding data, excluding the 1% (in total bp) of least abundant bins for (a) CARMA3, (b) MEGAN4 and (c) taxator-tk. We added a smoothed k-nearest-neighbor estimate of the mean precision using the R function wapply (width=0.3) followed by smooth.spline (df=10). CARMA3 and MEGAN4 identified substantially more small taxonomic bins which were not present in the analyzed dataset than taxator-tk. Panel (d) gives the number of correct, false and undetermined family-level assignments for the different classifiers on the dataset.
Supplementary Methods for ‘Taxator-tk: Fast and Reliable Taxonomic Assignment of Metagenomes by Approximating Evolutionary Neighborhoods’ Taxonomic Assignment of Sequence Segments Here we describe in detail the individual steps and the run-time properties of the algorithm which is implemented in the program taxator, the second stage of the overall binning workflow using taxator-tk (Fig. 2b). We propose the re-alignment placement algorithm (RPA) for the taxonomic assignment of a query segment q, which can be any subsequence of the full query sequence (i.e. read, contig or scaffold). The algorithm constitutes two pair-wise alignment passes and in each, q is aligned to segments of nucleotide reference sequences. It aims at identifying as many as possible taxa of the prediction clade (node R in Fig. 2a) in a set M without explicitly resolving its phylogenetic structure. 1.
Among the given set of homologous segments constructed from overlapping
alignments before application of the RPA, we define s to be the most similar segment to q, i.e. the one with the best local alignment score of all reference segments. In the first pass, all segments are aligned against s (
alignments). The resulting pair-wise scores, our
implementation uses the edit distance (mismatches + gaps), define an ordering among all segments or their corresponding taxa. The distinction between segments and associated taxa will be neglected in the following for better readability. All taxa which are less distant to s than q, including s itself, are added to M which holds all identified taxa of the prediction clade. The first more distant taxon than q is defined to be the outgroup segment o (Fig. 1) and used as the alignment target in the following second and last pass in which similar taxa to o are added M. 2.
We align all segments, including q, against o and rank the resulting scores. Then we
add all taxa to M which have a lower score than q. With some fine-tuning, we chose to also add taxa with a higher score than q, within a small range accounting for erroneous scores, because o and q can be very distant homologs with noisy alignment. The width of this error band is determined on a per-segment basis as a linear score function of the taxonomic disorder in the alignment scores and not a universal or configurable run-time parameter. We interpret a rank disorder (e.g. a known family member of o being more
Taxator-tk Supplementary Material Page 1 of 48
similar to o than a corresponding species member segment) as a discordance between gene tree and taxonomy and proportionally scale the effective score of q to enlarge M by taxa which are slightly more distant to o than q. This second pass requires
alignments, or less if some segments are identical to either q or s. If multiple best references (s) or outgroup segments (o) were present in these two passes with identical alignment scores, the calculations are repeated for every such segment in order to produce stable output. We reduced the additional computational effort in our implementation by detecting frequent identical segments and uninformative homologs. The final assignment taxon ID of q is the LCA of the taxa in M, or none if no outgroup had been found. The theoretical run-time of the segment assignment algorithm is in
. Segments with an excessive number of homologs, most often short segments of abundant and uninformative regions, have a negative impact on the program run-time. Therefore we currently limit the number of homologs per query to the top-scoring 50 by default (configurable run-time parameter in program alignments-filter or directly in the local alignment search program), before passing them to taxator. Other tested values gave similar results and the parameter, if changed, should be chosen based on hardware limitations. If this parameter is set lower, then the number of reference segments drops below a critical value such that no outgroup can be determined for some q and which therefore remain unassigned (but without impacting the taxon ID of other segments). Consensus Binning Algorithm Due to sparse segments and taxonomic assignment thereof with taxator in stage two of the workflow (Fig. 2b), a final processing step (Fig. 2c) is required to determine a taxon ID for the entire query sequence. Therefore we have implemented a simplistic, weighted consensus assignment scheme in the program binner, which optionally permits to apply custom constraints, e.g. the minimum percentage identity (PID) for classification at the species level or the removal of taxa with low counts in the whole sample. However, there are currently only two mandatory run-time parameters to control the actual post-processing consensus algorithm. First we define the support of a query segment to be the number of total matching positions to the best reference segment. The first run-time parameter specifies the minimum combined support at any rank (50 positions by default) and serves to ignore false predictions caused by short and often noisy segments. The other parameter specifies the
Taxator-tk Supplementary Material Page 2 of 48
minimum percentage of the summed support (70% by default) to allow a majority taxon to outvote a contradicting minority. Inconsistent taxa below this support are resolved by the LCA operation until the threshold is reached. Probably due to the conservative nature of the RPA, we found those two parameters to have minimal impact on the binning results in practice. The actual output of taxator is very detailed, allowing diverse information to be taken into account. We provide Python language bindings for processing with other applications. Taxonomy and Phylogeny Taxator-tk assumes that the NCBI taxonomy used for the assignment correctly captures the evolutionary process of speciation, although we know that the categorization of some taxa might be inconsistent with their evolution. If the phylogenetic information inferred from similarity scores disagrees with the taxonomic structure, assignments are made to a consistent higher rank. For instance horizontal gene transfer and upstream sequence misassembly can cause multiple similar copies of a sequence to be distributed across unrelated taxa. In case a query sequence cannot be traced by the algorithm to have evolved with either copy, it is usually assigned to the LCA of these clades. However, if the donor clade is unknown, the query may also be assigned to the recipient clade and the horizontal transfer or missassembly can go undetected. Thus assignment errors caused by the evolution of genes, upstream technical errors or taxonomy cannot always be eliminated in this framework. It remains to be assessed whether the use of an alternative microbial taxonomy such as the GreenGenes or the SILVA taxonomy would improve on the taxonomic assignment. Comparison and Innovations Taxator-tk shares some ideas with previous programs: Starting with MEGAN1, which uses local alignments scores to derive a "neighborhood of related sequences" and the taxonomic estimate is the LCA of the corresponding taxa. This neighborhood threshold is a percentage of the local alignment score and can be interpreted to reflect the rate of evolution within a taxonomic group. Its value is empirical and lacks stronger justification. The threshold selection has been improved in taxator-tk and other programs. To our knowledge, SOrtITEMS2 was the first algorithm to use the logic of re-alignment to the best reference (termed reciprocal similarity) for read assignment but is restricted to protein level alignment and is
Taxator-tk Supplementary Material Page 3 of 48
implemented as a wrapper around NCBI BLAST 3. Protein alignment in general triples the runtime of the local alignment step (translation into three frame shifts) and cannot make use of faster nucleotide aligners. SOrt-ITEMS also uses fixed similarity thresholds in terms of percentage identity to define universal levels of conservation within taxonomic groups assuming the same rate of evolution for different genetic regions and clades. Furthermore SOrt-ITEMS was primarily designed for reads and if it performs well for longer sequences, its run-time increases proportionally with input sequence lengths. Both, taxator-tk and CARMA34, adopt the logic of reciprocal alignment and extends it and remove the assumption of universal conservation levels. CARMA3 accounts for a heterogenous rate of evolution for different genetic regions. The initial identification of similar sequences in the reference can be based on nucleotide or protein BLAST search or profile Hidden Markov Models with HMMER5. In BLAST mode, CARMA3, like SOrt-ITEMS, uses one reciprocal alignment search and then extra or interpolates alignment scores to select a taxonomic rank for prediction. It therefore assumes a parametric model for the conservation level at a taxonomic rank, a linear function which is fitted to local alignment scores. With taxator-tk, we use a non-parametric score ranking algorithm, instead. Also, to our knowledge, we provide the first algorithm to determine an outgroup and to sparsify the input data being able to assign distinct regions on the query sequence to possibly different taxonomic groups. Also, we at most assume segment-wise constant rates of evolution (equally long branches from a common ancestor). This makes the major algorithmic component parameter-less and robust in itself, independent of the segment size. Through the sparsification procedure it can deal with structural rearrangements among distant relatives and scales better with the length of the input sequences. The individual segment assignments allow for a robust consensus voting scheme for the assignment of entire sequence fragments. The segment-specific classification could also be used for the detection of horizontal gene transfers events (HGTs) and assembly errors. Different from most previous approaches, taxator-tk was developed and tested using fast nucleotide sequence local alignments instead of protein sequence alignments, although for the local alignments in stage 1 of the workflow both can be used, but with our data we did not find advantages in using protein alignments as input. Thus, taxonomic binning of a metagenome sample with taxator-tk requires no more than specification of reference sequences, their taxonomic affiliations and an aligner like BLAST or LAST6. On the implementation side, all workflow steps
Taxator-tk Supplementary Material Page 4 of 48
for taxonomic assignment with taxator-tk are designed in a modular way making it easy to save, compress, reuse or recompute results. The computation intensive classification of segments in taxator can be run in parallel on many CPU cores in which we make use of the open source C++ algorithm library SeqAn7 for fast alignment. Performance Measures As metagenome data sets can have varying taxonomic composition in terms of which taxa are present and their relative abundances, this needs to be taken into consideration in evaluating taxonomic assignment methods. If an algorithm performs better for some clades than for others at a given rank we call it taxonomically biased. Oftentimes a classifier is biased, if it uses parameters that fit one clade better than another. This can be the case if the parameters were chosen to give good overall assignment accuracy (low total number of false predictions) on training data with uneven taxonomic composition. Such a method will not generalize well when applied to a sample of different taxonomic structure and abundances. To account for uneven taxonomic composition in evaluation data sets and to obtain comparable performance estimates across data sets of different taxonomic composition, we used as the primary evaluation measure the bin-averaged precision (or positive predictive value), also known as macro-precision. (Equation 1) where
is number of all predicted bins with a single bin precision (Equation 2)
True positives (
) are the correct assignments to the
bin and false positives (
incorrect assignments to the same bin. The macro-precision is the fraction of correct sequence assignments over all assignments to a given taxonomic bin, averaged over all predicted bins for a given rank. For falsely predicted bins which do not occur in the data, the precision is therefore zero. This value reflects how trustworthy the bin assignments are on average from a user’s perspective, as it is averaged overall predicted bins. In addition to the macro-precision, we report the raw numbers of true and false predictions for every cross-validation, as well as a quick overall precision for pooled ranks.
Taxator-tk Supplementary Material Page 5 of 48
This overall precision is most informative for species+genus+family and reports the fraction of true classifications among the predictions for all these ranks in a single pooled bin. (Equation 3) We measure the taxonomic bias of a method in terms of the standard deviation over all individual bin precisions. (Equation 4) where (Equation 5) The standard deviation is small if all predicted bins have a similar precision. A universally good method should have a high macro-precision with a low taxonomic bias. The recall (or sensitivity) is a measure of completeness of a predicted bin and, analogously, the macro-recall is the fraction of correctly assigned sequences of all sequences belonging to a certain bin, averaged over all existing bins in the test data8. (Equation 6) where
is a set of all actually existing bins in the test data and
is a single such bin. (Equation 7)
False negatives (
) are the assignments belonging to the
bin but which where classified
to another bin or left unassigned. The macro-recall reflects how well the classifier works more from a developer’s perspective than from the user perspective, as it is usually not known which predicted bins correspond to existing ones and which do not. Low-abundance Filtering The number of predicted bins at each rank can be quite large, at most the number of known taxa in the taxonomy and reference sequence data. When noise is considered to
Taxator-tk Supplementary Material Page 6 of 48
occur evenly distributed across this large output space, bins with few assigned sequences are more likely to be falsely identified, than larger bins (the chance to independently classify the same bin by chance n times is
. Since the macro precision is
an average over all predicted bins, it is heavily affected by bins with few sequences assigned. As a result, classifiers that predict clades present at low frequencies in the sample score badly under this measure. To correct for this effect, we define a truncated average precision ignoring the least abundant predicted bins and consider only the largest predicted bins constituting a minimum fraction
of the total assignments (equal size bins are also
included). This modification acts as a noise filter and accounts for different behavior of classifiers without explicitly considering the size of the model space or the number of existing species in the actual sample. We set
to 0.99 for our evaluations.
Cross-validation Despite the limitations of simulated metagenomes, which incorporate assumptions about sequencing error rates or species abundance distributions, it is very informative to evaluate taxonomic assignment methods on simulated sequence data as real metagenome samples lack taxon IDs for evaluation. Our canonical way of evaluating a method on simulated data is a version of leave-one-out cross-validation: Each query sequence is classified by removing all identical or related sequences up to a given rank from the reference collection: For example, to assess the performance in assigning query sequences from a new species, all sequences belonging to this species are removed from the reference sequence collection for the classifier. Performance measures (macro-recall, macro-precision), along with other statistics (true/false/unassigned data, overall precision, bin counts) which are available in the coupled tables, were normally calculated for number of assigned basepairs or for the number of assigned sequences, if these had comparable lengths. These values were calculated for all ranks (species, genus, family, order, class, phylum, domain/superkingdom) for seven simulations: either all reference data was used (per query) or all data from the query species, genus, family, order, class or phylum was removed from the reference data prior to classification. The assignments of these seven cross-validation experiments were averaged for a combined performance summary with standard measures. Consistency Analysis In order to evaluate the predictions of taxator-tk for real metagenome samples where
Taxator-tk Supplementary Material Page 7 of 48
no underlying correct taxon IDs are known for the sequences, we assigned sequences linked by assembly and calculate a measure of assignment consistency. We split long contigs into multiple pieces and classified each piece independently. Assuming that the sequence assembly was correct in the first place, contradicting assignments of pieces that originate from the same contig represent false assignments. This unveils part of the errors made by a particular method but some, if not the majority, will go undetected because the actual ID stays unknown and the assignments for a contig can be consistently wrong. Hence these results are generally more difficult to interpret than those from simulated data. Nucleotide Alignment In the course of evaluation we created many local alignments as input to the taxonomic assignment programs CARMA3, MEGAN4 and taxator-tk. These were usually generated using the fast alignment program LAST because it ran faster without noticeable differences in the output alignments than BLAST nucleotide alignments. For short sequence length evaluation (Supplementary Fig. 2-4), evaluation of a published SimMC scenario (Supplementary Fig. 12) and evaluation of a simulated metagenome sample with 49 species (Fig. 3, Supplementary Fig. 9-11), standard BLAST search was chosen. We used the standard alignment parameters and scoring schemes with each aligner. The generated alignments were converted into BLAST tabular format to work with CARMA3 and MEGAN4. Program Parameters For taxonomic assignment with MEGAN4 we used minscore=2, toppercent=20, minsupport=5 and mincomplexity=0.44 parameters. In CARMA3, we used the standard parameters in the contained configuration file. Taxator-tk was restricted the 50 best scoring local alignments to avoid long run-times for some of the query sequences. This was purely a convenient filter at the current state of development and is meant to be replaced by an adaptive per-segment heuristic. 16S Evaluation We evaluated the performance of taxator-tk in classifying the most widely used taxonomic marker gene in studies of microbial diversity, the 16S rRNA gene, as a proof of concept. For our evaluation, we extracted 7,175 annotated 16S rRNA genes each with a minimum length of 1 kb from mRefSeq47 (Suppl. Fig. 19, 20). The sequences were assigned with taxator-tk using the entire mRefSeq as reference, not just 16S genes. The cross-
Taxator-tk Supplementary Material Page 8 of 48
validation assesses the performance of 16S gene assignment in a wide range of situations. The performance statistics were calculated based on the number of assigned sequences, as all have comparable length. When using the complete reference sequences, 87% of sequences were assigned to the ranks of species, genus and family with 100% accuracy (Supplementary Fig. 1b), the remaining 13% were correctly assigned at higher ranks. This is an ideal situation showing the rank depth baseline on our data set. In more realistic simulations, when we tested assignment of genes from novel species or novel higher-level clades, assignments were accordingly made to higher ranks in most cases. For instance, when simulation novels species, 2678 contigs were assigned to the correct genera, while 491 erroneous species and genus assignments were made. The macro-precision in the combined cross-validation (Fig. 2) was always above 92%, with standard deviations from 10 to 25%, which demonstrates a good and even performance of taxator-tk for all clades in the case of 16S rRNA data. Supplementary Files The submission includes the files which are necessary to reproduce the results which are shown in the article. A more complete benchmark data-set can be downloaded from the software download web page. References 1. Huson, D. H., Auch, A. F., Qi, J. & Schuster, S. C. MEGAN analysis of metagenomic data. Genome Res. 17, 377–86 (2007). 2. Monzoorul Haque, M., Ghosh, T. S., Komanduri, D. & Mande, S. S. SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 25, 1722–30 (2009). 3. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009). 4. Gerlach, W. & Stoye, J. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Res. 1–11 (2011). doi:10.1093/nar/gkr225 5. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39 Suppl 2, W29–37 (2011). 6. Frith, M. C., Hamada, M. & Horton, P. Parameters for accurate genome alignment. BMC Bioinformatics 11, 80 (2010). 7. SeqAn. at
Taxator-tk Supplementary Material Page 9 of 48
8. McHardy, A. C., Martín, H. G., Tsirigos, A., Hugenholtz, P. & Rigoutsos, I. Accurate phylogenetic classification of variable-length DNA fragments. Nat. Methods 4, 63–72 (2007).
Taxator-tk Supplementary Material Page 10 of 48
Supplementary Figure 1: 16S gene assignment with taxator-tk true
Supplementary Figure 14: Parallel speedup of program taxator
t axator Assignment Speedup 14 12
6 4 2 0 1
number or CPU cores Execution time analysis with taxator for parallelized processing with multiple CPU cores. Taxonomic placement of sequence segments with taxator on input alignments for sequences of length 1000 bp (syn1000 data set aligned against mRefSeq47 with LAST). The speedup was calculated using wall clock time for a parallelized run relative to serial execution with one CPU thread. With multiple threads there is always one producer thread (consumer-producer model), thus for more than two threads, multiple consumers work on the input data in parallel. Approximate linear scale-up was observed up to 15 threads and saturation effects appear when using 20 CPU cores on our system.
, with : serial execution time : execution time using p threads and CPU cores
Taxator-tk Supplementary Material Page 35 of 48
Supplementary Figure 15: Effect of input sequence length and segmentation on taxator-tk processing time. (a)
t axator processing t ime time in minutes (wall clock)
with segmentation all reference new species new genus new family new order new class new phylum
180 150 120 90 60 30 0 100
individual sequence length in bp
t axator processing t ime time in minutes (wall clock)
without segmentation all reference new species new genus new family new order new class new phylum
180 150 120 90 60 30 0 100
individual sequence length in bp
We processed approximately the same number of sequences of length 100, 500 and 1000 bp with taxator-tk (syn100,syn500,syn1000), once with the segmentation procedure being enabled (a) and once with segmentation disabled (b). The time increases for both cases approximately linear with the input length, with the slope depends on the completeness of the reference sequence data. With all reference data available, the time increases slightly more than linear, as there is no segmentation of queries during computations. For all other cases, segmentation substantially decreses the execution time.
Taxator-tk Supplementary Material Page 36 of 48
Supplementary Figure 16: Taxonomic assignment of segments
Four long contigs of the SimMC data set. Colored boxes show segments that where assigned by taxator, when all species reference data was removed (new species simulation). Light regions in between had no region aligned by a local alignment search and have therefore no homologs for assignment. All assigned regions in this example are consistently assigned. The segments are used by the program binner to derive consistent whole-sequence taxonomic assignments as used in our evaluations.
Taxator-tk Supplementary Material Page 37 of 48
Supplementary Figure 17: Bin precision plots for 49 species simulated metagenomic sample (simArt49e)
Taxator-tk Supplementary Material Page 38 of 48
Supplementary Figure 17: Bin precision plots for 49 species simulated metagenomic sample (simArt49e)
Taxator-tk Supplementary Material Page 39 of 48
Supplementary Figure 17: Bin precision plots for 49 species simulated metagenomic sample (simArt49e)
Taxator-tk Supplementary Material Page 40 of 48
Supplementary Figure 17: Bin precision plots for 49 species simulated metagenomic sample (simArt49e) Comparison of assignment quality of CARMA3, MEGAN4 and taxator-tk for a simulated metagenome sample from a 49 species microbial community. Values are shown for the summary scenario (sum of all seven cross-validation scenarios), for assignments to the (a) species, (b) genus, (c) family, (d) order, (e) class and (f) phylum ranks, respectively. The first of each panels shows the precision and size for every predicted bin (after removing low abundance bins). The colored line shows a smoothed k-nearest-neighbor estimate of the mean precision as a function of predicted bin size using the R function wapply (width=0.3) followed by smooth.spline (df=10). The second panel for each rank shows bin precisions relative to recall. The F-score partitioning helps to identify similar quality bins if precision and recall are equally weighted, however we consider precision more important than recall. The third panel illustrates the total number of true (blue) and false (red) and unassigned (grey) portion of assignments at the respective ranks. Note that partially incorrect assignments are considered incorrect for the low ranking false part of the assignment and correct for the higher-ranks.
Taxator-tk Supplementary Material Page 41 of 48
Supplementary Figure 18: Taxonomic composition of microbial RefSeq 47
Taxonomic composition down to family level of the microbial (Bacteria, Archaea, Viruses) portion of the RefSeq47 sequence data collection using Krona (http://krona.sourceforge.net). An interactive version can be found in the supplementary files (RefSeq47.krona.html). Abundance is measured in terms of accumulated sequence lengths per clade.
Taxator-tk Supplementary Material Page 42 of 48
Supplementary Figure 19: Taxonomic composition of 16S genes extracted from RefSeq47
Taxonomic composition down to genus level of the 16S benchmark data set using Krona (http://krona.sourceforge.net). The data set was simulated by extracting every annotated 16S gene in RefSeq47 which was at least 1000 bp long. An interactive version can be found in the supplementary files (refseq-16S.krona.html). Abundance is measured as the number of 16S genes.
Taxator-tk Supplementary Material Page 43 of 48
Supplementary Figure 20: Taxonomic composition of microbial RefSeq54
Taxonomic composition down to family level of the microbial (Bacteria, Archaea, Viruses) portion of the RefSeq54 sequence data collection using Krona (http://krona.sourceforge.net). An interactive version can be found in the supplementary files (RefSeq54.krona.html). Abundance is measured in terms of accumulated sequence lengths per clade.
Taxator-tk Supplementary Material Page 44 of 48
Supplementary Figure 21: Taxonomic composition of SimMC/AMD
Methanococcoides burtonii 0.01%
Taxonomic composition of the FAMeS simulated metagenome sample SimMC/AMD using Krona (http://krona.sourceforge.net). An interactive version can be found in the supplementary files (SimMC.krona.html). Abundance is measured in terms of accumulated contigs lengths.
Taxator-tk Supplementary Material Page 45 of 48
Supplementary Figure 22: Taxonomic composition of SimHC/soil
Taxonomic composition of the FAMeS simulated metagenome sample SimHC/soil using Krona (http://krona.sourceforge.net). An interactive version can be found in the supplementary files (SimHC.krona.html). Abundance is measured in terms of accumulated contigs lengths.
Taxator-tk Supplementary Material Page 46 of 48
Supplementary Figure 23: Taxonomic composition of simArt49e
Taxonomic composition of the simulated metagenome sample simArt49e using Krona (http://krona.sourceforge.net). An interactive version can be found in the supplementary files (simArt49e.krona.html). Abundance is measured in terms of accumulated contigs lengths. The data set was simulated using equal coverage for every strain, so differences in the data proportions result from a variable genome size and assembly bias.
Taxator-tk Supplementary Material Page 47 of 48
Supplementary Figure 24: Query sequence segmentation and segment splicing
Query and corresponding reference segments from local alignment region extension and splicing. Blue bars correspond to original local alignment regions on reference nucleotide sequences which are positionally aligned to the query nucleotide sequence in red. These alignments are generated by a local (nucleotide) sequence aligner such as BLAST or LAST before running taxator. If alignments overlap on the query, they are joined into query segments which are flanked by regions without detected similarity to any known reference sequence. Reference segments are constructed from the original alignment reference regions (blue) by extension (grey bars) with the same number of nucleotides which are missing to match the length of the query segment. The corresponding sets of homologs are the input to the core taxonomic assignment algorithm in taxator.