Analysis of sequencing data - The potential role of tap water bacteria in inflammatory bowel di

2.4.1 Analysis in QIIME

Raw sequence data from the water samples and the biopsies were uploaded and processed separately in QIIME by the co-supervisor. Sequences were initially demultiplexed and filtered to ensure that only sequences of satisfactory quality were used for downstream analysis. For this, the minimum sequence length and the E-value were set to 350 nucleotides and 0.2, respectively. For sequences from both biopsies and water samples, a cut-off value of 3000 sequences per sample was set and served as the basis for the subsequent assignment of sequences to OTUs. This was performed using usearch with the UPARSE algorithm and a closed OTU-picking strategy, with the Greengenes database as the reference. Sequences were screened for potential chimeras using ChimeraSlayer. The OTUs were then subjected to several diversity estimates using the command core_diversity_analysis in QIIME. Phylogenetic diversity (whole tree), observed species, Shannon, Simpson and Chao1 served as indices of alpha diversity, while weighted and unweighted UniFrac, Jaccard and Bray-Curtis indices were used to estimate beta diversity. These were visualized through rarefaction and PCoA plots, respectively. Charts showing the relative abundance of taxonomic groups in the microbial communities were also included.
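As an illustration of the non-phylogenetic alpha diversity indices named above, the following is a minimal sketch in plain Python, not the QIIME implementation. The function names are hypothetical, the Shannon index is assumed to use log base 2, Simpson is assumed to be reported as 1 minus dominance, and Chao1 is assumed to use the bias-corrected form; the settings actually used in the workflow may differ.

```python
import math

def observed_otus(counts):
    """Number of OTUs with at least one sequence in the sample."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index in bits (log base 2 is an assumption)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson index expressed as 1 - sum(p_i^2) (an assumption)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def chao1(counts):
    """Bias-corrected Chao1 richness estimate (an assumption).

    f1 = number of singleton OTUs, f2 = number of doubleton OTUs.
    """
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return observed_otus(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical OTU counts for a single sample (not data from this study).
sample = [10, 5, 2, 1, 1, 0]
print(observed_otus(sample), shannon(sample), simpson(sample), chao1(sample))
```

Each function takes the per-OTU sequence counts of one sample, so the indices can be computed sample by sample after the sequences have been assigned to OTUs.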

2.4.2 Statistical analysis of datasets

To test for differences in the abundance of a given OTU between the groups of the biopsy and water sample data sets, the command OTU_category_significance was incorporated into the QIIME workflow by the co-supervisor, using the Kruskal-Wallis test. The resulting p-values were corrected with the Bonferroni approach to further reduce the chance of false positives.
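The logic of a per-OTU Kruskal-Wallis test followed by Bonferroni correction can be sketched as follows. This is a plain Python illustration under stated assumptions, not the QIIME code: the function names are hypothetical, the p-value is taken from the chi-square approximation with k - 1 degrees of freedom, and the Bonferroni factor is assumed to be the number of OTUs tested.

```python
import math

def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)) - 3 * (n + 1)
    return h / (1.0 - tie_term / (n ** 3 - n))

def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2 * x / math.pi) * math.exp(-half) * total)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical abundances of one OTU in three groups (not study data).
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
p = chi2_sf(h, df=len(groups) - 1)
print(h, p, bonferroni([p, 0.2, 0.001]))
```

Because the correction multiplies every p-value by the number of tests, an OTU stays significant at the 5% level only if its uncorrected p-value is below 0.05 divided by the number of OTUs tested.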

A Principal Component Analysis (PCA), including a score plot and a loading plot, was performed on the biopsy OTUs. The dataset was thereby reduced to a smaller and more manageable set of variables, referred to as principal components, which facilitated the downstream statistical analysis. To test for potential interactions between the independent variables and their impact on the dependent variables, ANOVA-simultaneous component analysis (ASCA) was performed using PLS Toolbox (Eigenvector Inc, Washington, USA). A significance level of 5% was used for all statistical tests. To further identify whether intragroup differences were present in the OTUs indicated to be significant for potential interactions between age and diagnosis, a Kruskal-Wallis test was performed using SYSTAT13 (Systat Software Inc, California, USA). As this method does not detect where the intragroup differences occur, the Conover-Inman test for pairwise comparisons was also applied to the median percentage prevalence of the OTU, again using SYSTAT13. The latter analysis does not, however, indicate the direction of the difference in each pair. For this, the median values of the tested OTU in the two groups of each pair were compared. To test for significant differences in the prevalence of OTUs across different combinations of inflammation category and age, Kruskal-Wallis followed by Conover-Inman was used. This was performed on all of the enrolled patients. Tissue of unknown category was excluded from this analysis to prevent the introduction of possible biases. All of these analyses were performed by the supervisor.

In situations where further identification of OTUs at a lower taxonomic level was of interest, the reference sequence generated by QIIME during the assignment of sequences to the respective OTU was uploaded to BLAST by the student. The 16S ribosomal RNA sequence database was used for identification. Only the suggested taxonomic annotations with the most suitable query cover, identity and E-value were presented and discussed.

2.4.3 Analysis of associations between OTUs in water and biopsies

Identifying and selecting matches

In order to reveal any potential transmission of OTUs from water to mucosa, the reference sequences of the OTUs in the biopsy and water sample data sets were first mapped against each other by a postdoc from the department using MATLAB® (MathWorks, Natick). A threshold of >97% sequence similarity was employed for the identification of potential OTU matches, using the following Jukes-Cantor model for sequence divergence estimates: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where d represents the evolutionary distance between two sequences and p is the proportion of substitutions across the sequence alignment, i.e. the sequence distance (Xiong 2006). This was initially performed without taking the prevalence of the OTUs into consideration. Each match was then given a taxonomic identity.
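To make the Jukes-Cantor correction above concrete, the following is a minimal Python sketch, not the MATLAB code used in the study. The function names are hypothetical, the sequences are assumed to be pre-aligned and of equal length, and positions containing a gap character are assumed to be skipped. How the 97% threshold was applied to the corrected distance is not specified in the text, so no threshold check is included.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') are skipped
    (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared

def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4/3 * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the log to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy aligned sequences (hypothetical, not OTU reference sequences).
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jukes_cantor(p))
```

For small p the corrected distance is only slightly larger than p itself; for example, p = 0.03 gives d of roughly 0.0306, so near the 97% similarity level the correction has little effect.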

Owing to the complexity of identifying potential associations for all matches, only selected matches from the water sample data set, the biopsy data set and the Jukes-Cantor data set were submitted to further analysis by the student. In the first two data sets, only matches showing significant Bonferroni-corrected p-values with respect to diagnosis in the Kruskal-Wallis testing in QIIME were chosen for this purpose. As the aim was to search for potential transmission of OTUs from water to mucosa, matches in the Jukes-Cantor data set were instead selected based on the prevalence of the water OTUs.

To reduce the chance of analysing water OTUs present by mere coincidence, matches were narrowed down to those connected to the 50 most prevalent OTUs in the water sample data set. An overview of their taxonomic affiliation was made to evaluate whether further narrowing of the matches was needed prior to subsequent analysis. To account for the possibility that spurious OTUs might still make up part of the remaining matches between the data sets, the water OTUs were plotted against their percentage prevalence, and a threshold was established at the point where a change in the rate of decline could be observed. OTUs from the water sample data set showing an average prevalence above this threshold were then submitted to further analysis of potential transmission using Fisher's exact test.
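The two-step narrowing described above (keep the most prevalent water OTUs, then keep those above a prevalence threshold) can be sketched as follows. This is an illustrative Python sketch only: the function names, the OTU identifiers and the numeric values are hypothetical, "prevalence" is assumed to mean the mean percentage abundance across water samples, and in the study the threshold was read from a plot rather than computed.

```python
def mean_prevalence(otu_table):
    """Mean percentage abundance per OTU.

    otu_table maps an OTU id to a list of per-sample percentages.
    """
    return {otu: sum(vals) / len(vals) for otu, vals in otu_table.items()}

def select_otus(otu_table, top_n, threshold):
    """Keep the top_n most prevalent OTUs whose mean exceeds threshold."""
    means = mean_prevalence(otu_table)
    ranked = sorted(means, key=lambda otu: means[otu], reverse=True)[:top_n]
    return [otu for otu in ranked if means[otu] > threshold]

# Hypothetical water OTU table (percent of sequences per sample).
water = {
    "OTU_A": [10.0, 20.0],
    "OTU_B": [1.0, 3.0],
    "OTU_C": [0.1, 0.3],
    "OTU_D": [5.0, 5.0],
}
print(select_otus(water, top_n=3, threshold=1.0))
```

With the study's settings, top_n would be 50 and threshold would be the value chosen by inspecting the prevalence plot.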

Statistical testing with Fisher's exact test

Selected matches from the biopsy, water sample and Jukes-Cantor data sets were subjected to statistical analysis of any plausible associations between OTU matches in water and biopsies. This was performed with Fisher's exact test by the student, with the rationale that plausible associations could potentially be used for further evaluation of OTU transmission from water to mucosa. A match was considered to be present in both samples if >1 sequence(s) from each of the OTUs in the match could be detected in both the water sample and the biopsy. In cases where a patient had two water samples or more than one biopsy, of which only one of the respective samples contained the OTU of interest, the OTU was considered present. The level of significance was set to 0.05.
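The presence rule and the test described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the procedure actually run: the function names are hypothetical, each patient is assumed to contribute one row to a 2x2 table of water presence against biopsy presence, the per-sample presence flags are assumed to have been derived beforehand from the sequence-count criterion, and a two-sided test is assumed because the text does not state the sidedness.

```python
from math import comb

def contingency(patients):
    """Build a 2x2 table [[both, water only], [biopsy only, neither]].

    patients is a list of (water_flags, biopsy_flags); each element is
    a list of booleans, one per sample. An OTU counts as present for a
    patient if any of that patient's samples contains it.
    """
    table = [[0, 0], [0, 0]]
    for water_flags, biopsy_flags in patients:
        in_water = any(water_flags)
        in_biopsy = any(biopsy_flags)
        table[0 if in_water else 1][0 if in_biopsy else 1] += 1
    return table

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9)))

# Hypothetical presence flags for three patients (not study data).
patients = [([True, False], [False, True]), ([False], [False]), ([True], [False])]
table = contingency(patients)
print(table, fisher_exact_two_sided(table))
```

An OTU match would then be carried forward to the diagnosis-specific round of testing only if the resulting p-value fell below the 0.05 level.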

Characteristics of the samples, such as age group and diagnosis, were not taken into consideration, as the primary aim was to search for potential associations regardless of origin.

Matches with a Fisher's exact p-value below the level of significance were submitted to an additional round of Fisher's exact testing by the student to see whether the associations could be attributed to particular diagnosis groups. The data set for each significant OTU match was divided into Non, IBD, CD and UC groups, where the IBD group encompassed patients from the latter two groups and IBDU. Patients whose diagnosis status was marked as possible or unknown were excluded from this final analysis to prevent the introduction of possible biases.

In document The potential role of tap water bacteria in inflammatory bowel disease (pages 43-46)