Microbiota II: From raw sequences to complete dataset

All reads from all samples are reported together in no specific order. Thus, raw reads were first demultiplexed, and the sequencing centres removed index primers and genomic primers. Reads were then quality trimmed, overlapped and merged using FLASH (concept illustrated in Figure 8).¹⁹⁹ The V3-V4 region (Paper I) is approximately 430 base pairs long, providing an overlap of 170 base pairs. The V4 region (Paper III) is entirely covered by both reads, yielding increased quality.^197,202 The merged reads were subsequently quality filtered by truncating reads at three consecutive low-quality base calls (phred score < 25) and discarding reads with a truncated length of < 75% of their original length.
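The quality filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation: the function names `truncated_length` and `passes_quality_filter` are hypothetical, and it assumes that a read is cut at the first base of the first run of three consecutive calls with a phred score below 25.

```python
def truncated_length(quals, min_phred=25, bad_run=3):
    """Return how many leading bases to keep.

    Assumption: the read is truncated at the start of the first run of
    `bad_run` consecutive base calls with phred score < `min_phred`.
    """
    run = 0
    for i, q in enumerate(quals):
        if q < min_phred:
            run += 1
            if run == bad_run:
                return i - bad_run + 1
        else:
            run = 0
    return len(quals)


def passes_quality_filter(quals, min_fraction=0.75):
    """Keep a read only if at least 75% of its original length survives."""
    return truncated_length(quals) >= min_fraction * len(quals)


# Hypothetical phred scores for two merged reads.
good_read = [30] * 10 + [10, 10, 10]        # 10 of 13 bases kept (77%)
bad_read = [30] * 8 + [10, 10, 10] + [30]   # 8 of 12 bases kept (67%)
print(passes_quality_filter(good_read), passes_quality_filter(bad_read))
```

A run of only two low-quality calls does not trigger truncation in this sketch, so isolated sequencing errors do not shorten the read.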

The methods used in the rest of this section are identical for Paper I and III, unless specifically noted. We used the Quantitative Insights Into Microbial Ecology (QIIME) platform (version 1.8.0) for quality control and further sequence processing.²⁰⁴ QIIME is a community developed, open source bioinformatics pipeline. It constitutes a framework that to some degree standardises the post-sequencing workflow. It incorporates bioinformatics software from several different developers, at the same time assuring a certain degree of quality control.

Operational taxonomic unit (OTU) picking was performed on reads left after quality control. Sequences were clustered together into OTUs with 97% sequence similarity using a closed-reference approach, with mapping against the GreenGenes database (v13_8).²⁰⁵ An OTU is thus an artificial construct, and the 97% threshold is merely based on convention and may be set otherwise. In addition to the closed-reference approach, de novo and open-reference approaches can be used, as shown in Figure 11. In closed-reference OTU picking, reads that do not get a match in the pre-clustered database are discarded. A proportion of reads (~3-6% in our studies), possibly originating from less well-described species, may therefore be lost.

However, the method has the advantage of being more specific when assigning taxonomy, is faster to perform and facilitates better comparisons between studies, especially if different primers are used.⁸³ The GreenGenes database has the advantage of very low chimera levels, but one must be aware that all the available databases may contain errors in sequences and their taxonomic assignment.⁷⁷ Chimeras could make up to 45% of sequences in one run and can be found in many 16S databases.⁷⁷
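The closed-reference logic can be illustrated with a toy sketch. It is not how QIIME or its clustering tools work internally: real tools use sequence alignment and heuristics, whereas this example assumes equal-length sequences and a simple position-by-position identity. The names `identity` and `closed_reference_pick`, and all sequences, are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; toy stand-in for alignment identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_pick(reads, reference, threshold=0.97):
    """Assign each read to its best-matching reference OTU, or discard it."""
    otu_counts = {}
    discarded = []
    for read_id, seq in reads.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otu_counts[best_id] = otu_counts.get(best_id, 0) + 1
        else:
            discarded.append(read_id)  # no reference match: read is lost
    return otu_counts, discarded


# Hypothetical reference OTUs and reads (100 bases each).
reference = {"ref1": "A" * 100, "ref2": "C" * 100}
reads = {
    "r1": "A" * 98 + "GG",      # 98% identical to ref1: assigned
    "r2": "A" * 95 + "G" * 5,   # 95% identical to ref1: discarded
    "r3": "C" * 100,            # identical to ref2: assigned
}
otu_counts, discarded = closed_reference_pick(reads, reference)
print(otu_counts, discarded)
```

The discarded list corresponds to the reads lost in the closed-reference approach. In an open-reference workflow these reads would instead be passed on to de novo clustering.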


Figure 11. Workflow for different approaches for making an OTU table in QIIME.²⁰⁴

(A) De novo: sequences are compared internally and then clustered together depending on a similarity-threshold. A representative sequence from each OTU cluster is subsequently matched to a reference database and taxonomy assigned to the OTU cluster accordingly. It is considerably slower and more computationally intensive than the other methods, but without sequence-loss.

(B) Closed-reference (used in Paper I and III): sequences are compared directly to representative sequences from a pre-clustered reference database and discarded if no match is found. Taxonomy is directly inherited from the matching reference OTU. It is computationally fast and facilitates comparison between studies where different methods are used, but is biased toward the reference database.

(C) Open reference: here the de novo and closed-reference approaches are combined.⁸³

OTU, operational taxonomic unit; Rep-seq, representative sequence. Figure is inspired by Navas-Molina et al.⁸³ Pictures used in the figure are licenced under the Creative Commons Zero licence.

Taxonomy assignment was done based on the GreenGenes database, which also provides a phylogenetic tree of reference OTUs. The phylogenetic tree is necessary for downstream analyses, like UniFrac. Finally, an OTU table is generated and used in subsequent analyses.

To reduce the number of comparisons and greatly reduce the problem of spurious OTUs, OTUs containing <0.005% of the total number of sequences were discarded at this stage in Paper I, as is also recommended in the literature.⁸³ However, by doing this we risk missing less prevalent, but potentially important OTUs. In Paper III, OTUs present in only a single sample from one sample site in each experiment were discarded. Also in Paper III, OTUs mapping to the mitochondria family and chloroplast class, which are misclassified under the kingdom Bacteria in GreenGenes, were removed.
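The abundance filter used in Paper I can be sketched as follows. This is a minimal illustration under the assumption that an OTU table is held as a dictionary mapping OTU identifiers to per-sample read counts. The function name `filter_low_abundance` and the counts are hypothetical.

```python
def filter_low_abundance(otu_table, min_fraction=0.00005):
    """Drop OTUs holding less than 0.005% of all sequences in the table."""
    total = sum(sum(counts) for counts in otu_table.values())
    return {
        otu: counts
        for otu, counts in otu_table.items()
        if sum(counts) >= min_fraction * total
    }


# Hypothetical table with two samples and 100,000 reads in total,
# so the cut-off corresponds to 5 reads.
otu_table = {
    "OTU_a": [60000, 39990],
    "OTU_b": [4, 0],   # 4 reads: below the cut-off, discarded
    "OTU_c": [3, 3],   # 6 reads: above the cut-off, kept
}
filtered = filter_low_abundance(otu_table)
print(sorted(filtered))
```

Note that the cut-off is relative to the whole table, not to each sample, so an OTU that is rare overall but abundant in one sample is still retained if its total count is high enough.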

After evaluating rarefaction curves, samples with <8000 reads were discarded in Paper I, while this was not necessary in Paper III, due to high coverage. This resulted in a mean sequencing depth of 34,490 and 242,046 reads for Paper I and III, respectively.

Because sequencing depth is not equal in all samples, α-diversity (Chao1 bacterial richness estimate [Chao1], Shannon diversity index and phylogenetic diversity) and β-diversity (unweighted UniFrac) were calculated on rarefied OTU tables. This is the main argument for discarding samples with a low read count as described above, as the sample with the lowest read count decides the rarefaction level.
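Rarefying a sample to an even depth amounts to drawing a fixed number of reads at random without replacement. The sketch below illustrates the idea for a single sample. The function name `rarefy`, the fixed seed and the counts are hypothetical, and the approach of expanding counts into a pool of reads is chosen for clarity rather than efficiency.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample one sample to `depth` reads without replacement.

    `counts` maps OTU identifiers to read counts. Samples with fewer
    than `depth` reads cannot be rarefied and are returned as None.
    """
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


# Hypothetical sample with 100 reads, rarefied to 40 reads.
sample = {"OTU_a": 50, "OTU_b": 30, "OTU_c": 20}
rarefied = rarefy(sample, depth=40)
print(sum(rarefied.values()))
```

Because a sample shallower than the chosen depth has to be dropped, one low-count sample would otherwise force a low depth on all others, which is why such samples are discarded beforehand.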

Chao1 has the advantage of being a simple estimate of community richness.²⁰⁶ Chao1 estimates the total OTU-count one could expect in a sample with infinite sampling. Shannon diversity has the advantage of being frequently reported in the literature, and takes both richness and evenness into account.⁸² Phylogenetic diversity is recommended by several authors and exploits phylogenetic information.^83,207
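The two non-phylogenetic α-diversity measures can be written out in a short sketch. It assumes the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs observed exactly once and twice, and a Shannon index with logarithm base 2. Both choices are assumptions made for illustration and may differ from the settings used in the analyses. The function names and counts are hypothetical.

```python
import math


def chao1(counts):
    """Bias-corrected Chao1 richness estimate for one sample."""
    observed = [n for n in counts if n > 0]
    singletons = sum(1 for n in observed if n == 1)
    doubletons = sum(1 for n in observed if n == 2)
    return len(observed) + singletons * (singletons - 1) / (2 * (doubletons + 1))


def shannon(counts, base=2):
    """Shannon diversity index: -sum(p * log(p)) over observed OTUs."""
    observed = [n for n in counts if n > 0]
    total = sum(observed)
    return -sum((n / total) * math.log(n / total, base) for n in observed)


# Hypothetical OTU counts for one sample: 5 observed OTUs,
# 2 singletons and 1 doubleton.
counts = [1, 1, 2, 5, 10, 0]
print(chao1(counts), shannon(counts))
```

Chao1 exceeds the observed OTU count only when there are at least two singletons, reflecting the idea that many rare OTUs suggest further unobserved ones. Shannon diversity is highest when reads are spread evenly across OTUs.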

Generally, β-diversity metrics have the advantage of being robust to noise and low sequence counts, although the latter is less important today.⁷⁸ β-diversity metrics can be quantitative (using sequence abundance), like Bray-Curtis and weighted UniFrac, or qualitative (considering only presence-absence), like binary Jaccard or unweighted UniFrac. We used UniFrac in Paper I and III since it is phylogeny based and has also been shown to outperform other metrics in community comparisons.^78,84
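The difference between quantitative and qualitative metrics can be illustrated with the two non-phylogenetic examples named above. The sketch covers Bray-Curtis and binary Jaccard only. UniFrac additionally requires a phylogenetic tree and is not reproduced here. The function names and counts are hypothetical, and both samples are assumed to list their OTU counts in the same order.

```python
def bray_curtis(a, b):
    """Quantitative dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def binary_jaccard(a, b):
    """Qualitative distance: 1 - shared OTUs / OTUs present in either sample."""
    present_a = {i for i, n in enumerate(a) if n > 0}
    present_b = {i for i, n in enumerate(b) if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


# Hypothetical counts for three OTUs in two rarefied samples.
sample_1 = [10, 0, 5]
sample_2 = [5, 5, 0]
print(bray_curtis(sample_1, sample_2), binary_jaccard(sample_1, sample_2))
```

Changing the count of a shared OTU alters the Bray-Curtis value but leaves binary Jaccard unchanged, since the latter only registers whether an OTU is present.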
