
2.4.6 Data treatment and statistical analysis

To extract information from the data the analyses have provided, it is necessary to perform some statistics. First, it is useful to obtain descriptive statistics such as the sample mean, median, range of the values, and the standard deviation. The mean is the sum of all values divided by the number of samples (65). The median is the middle value of the dataset when the values are sorted in increasing or decreasing order. An important feature is the standard deviation, which is a measure of the spread of the dataset. The standard deviation is the deviation any observation is expected to have from the calculated mean value. A high standard deviation indicates a large spread of the values in the dataset (65).
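For reference, these quantities follow the standard definitions: for n observations $x_1, \dots, x_n$, the sample mean and the sample standard deviation are

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} .
\]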

Environmental samples are taken to represent the true environment that is investigated. The mean value of the samples is not necessarily the same as the true mean. A larger number of samples will generally give a better estimate of the true mean (µ), but in theory an infinite number of samples would be needed to obtain the true value. It is therefore useful to include a confidence interval, often set at 95%. This is an interval that, based on the dataset, contains the true value, for example the true mean, with 95% certainty. In other words, there is a 5% chance that the true value lies outside the given interval. It is important to mention that the data obtained from the analyses also come with uncertainty: even though an analysis can provide high precision in its measurements, that is no guarantee of high accuracy.
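Assuming approximately normally distributed data, a 95% confidence interval for the true mean is commonly calculated from the sample mean $\bar{x}$, the sample standard deviation $s$ and the number of samples $n$ as

\[
\bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} ,
\]

where $t_{0.975,\,n-1}$ is the 97.5th percentile of the t-distribution with $n-1$ degrees of freedom.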

To test a hypothesis about a probable cause-and-effect relationship, it is necessary to perform a statistical test. A statistical test aims to determine whether the differences observed in the data are random or reflect significant, real differences, and it gives a probability for this in the form of the p-value (66). The p-value is the probability of obtaining a difference at least as large as the one observed if the null hypothesis is true, so a low p-value is evidence against the null hypothesis. When conducting a hypothesis test, a null hypothesis and an alternative hypothesis are formulated, typically that the group means are equal and that the group means are unequal, respectively.
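As a minimal sketch of such a test, assuming two hypothetical groups of measurements and a significance level of 0.05, a two-sample t-test could be run in Python with SciPy as follows (the values are made up for illustration):

import numpy as np
from scipy import stats

# Hypothetical measurements for two groups (illustrative values only)
group_a = np.array([12.1, 13.4, 11.8, 12.9, 13.1])
group_b = np.array([14.2, 15.0, 13.8, 14.6, 15.3])

# Null hypothesis: the group means are equal; alternative: they are unequal (two-tailed)
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Reject the null hypothesis if the p-value is below the chosen significance level
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < alpha}")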

In order to choose the correct test, it is important to know whether the data comply with the underlying assumptions of the tests. The most important assumption is whether the data are normally distributed or not. A Shapiro-Wilk test can be used to test whether the distribution of the data differs significantly from a normal distribution (67). If the data are normally distributed, parametric tests can be used. A relevant test in this regard is the Student's t-test, which tests whether two means are significantly different from each other (66). If the data are not normally distributed, a non-parametric test can be used instead. For the comparison of two groups, the Mann-Whitney U test (MWU test) is then an option. The t-test tests the difference in means between two groups and takes the two sample sizes and the sample standard deviations into account. It can be either one-tailed or two-tailed, depending on whether the hypothesis is only that the mean is larger or smaller, or whether both directions are tested. The MWU test works in a somewhat different way, where a random value from one group is compared against a random value from the other group. For more than two groups, ANOVA can be used for normally distributed data, or Kruskal-Wallis for non-normally distributed data (66, 68).
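A sketch of this test selection in Python with SciPy, using the same kind of hypothetical groups as above and an assumed 0.05 threshold for the Shapiro-Wilk test, could look like this:

import numpy as np
from scipy import stats

group_a = np.array([12.1, 13.4, 11.8, 12.9, 13.1, 12.4])
group_b = np.array([14.2, 15.0, 13.8, 14.6, 15.3, 14.1])

# Shapiro-Wilk: a low p-value means the data deviate significantly from a normal distribution
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    # Parametric comparison of the two means
    stat, p = stats.ttest_ind(group_a, group_b)
    test = "Student's t-test"
else:
    # Non-parametric alternative
    stat, p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
    test = "Mann-Whitney U test"

print(f"{test}: p = {p:.4f}")

# For more than two groups: stats.f_oneway(g1, g2, g3) for ANOVA,
# or stats.kruskal(g1, g2, g3) for the Kruskal-Wallis test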

Correlations aim to quantify whether one variable can predict the value of another. They do not, however, say anything about cause and effect, only whether there is some relationship in how the values of the two variables vary. The correlation coefficient is a number between -1 and 1, where -1 is perfectly negatively correlated and 1 is perfectly positively correlated (69). If two variables are negatively correlated, the value of one variable increases when the other decreases; if they are positively correlated, both increase when one of them increases. One such correlation measure, which also comes with a test of whether the correlation is significant, is the Pearson correlation.
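As a minimal sketch, the Pearson correlation coefficient and the p-value of its significance test can be obtained with SciPy; the two variables below are hypothetical:

import numpy as np
from scipy import stats

# Hypothetical paired observations of two variables
x = np.array([1.2, 2.4, 3.1, 4.0, 5.2, 6.3])
y = np.array([2.0, 4.1, 5.9, 8.3, 10.1, 12.4])

# r is the correlation coefficient (-1 to 1); p tests whether the correlation is significant
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.4f}")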

There is a lot of available software that can be used to perform data treatment and statistics. SPSS is a statistical software package from IBM that can be used for a wide variety of statistical analyses, from descriptive statistics and charts to hypothesis testing and correlations.


Boxplots from SPSS display the median as a horizontal line in the box, the box around it spans from the first to the third quartile, and the lines above and below the box mark the maximum and minimum (70). The first quartile (Q1) is the median of the lower half of the data, and the third quartile (Q3) is the median of the upper half of the data. The maximum and minimum are limited to a certain range based on Q1 and Q3, so outliers are plotted as single points.
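As a sketch of how these limits can be computed, assuming the common convention that the whiskers reach at most 1.5 times the interquartile range beyond the box (the default in many boxplot implementations), the calculation with hypothetical values could look like this:

import numpy as np

# Hypothetical dataset; the last value is an obvious outlier
data = np.array([1.1, 1.4, 1.8, 2.0, 2.3, 2.5, 2.9, 3.4, 8.7])

# Note: np.percentile interpolates, which can differ slightly from the
# "median of each half" definition described in the text
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range

# Whiskers extend to the most extreme data points within 1.5 * IQR from the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = data[(data < lower_limit) | (data > upper_limit)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, outliers = {outliers}")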

Principal component analysis (PCA) is a technique used to make interpretation of large datasets easier. With many variables, samples, and parameters, it can be difficult to get a proper understanding of what affects what. PCA makes this easier, since it aims to reduce the number of dimensions of the data to a 2D plot (71). At the same time, it aims to preserve as much as possible of the statistical information about the variability in the data. It does this by projecting the data onto principal components (PCs) and summarizing them in an optimal way. The different PCs are not correlated with each other (71).

The result is usually a set of 2D plots, where the x- and y-axes represent principal components (PCs) that explain the variability in the data to different degrees (71). The first plot is made up of PC1 and PC2, where PC1 explains the variability in the horizontal direction and PC2 in the vertical direction. In addition, PC1 explains more of the variation than PC2, meaning that one unit of distance between two points along the x-axis implies a larger difference between the two observations than the same distance between two points along the y-axis. Data points that are similar to each other will group together in a cluster, and clusters that are different from each other will be separated in the x and y directions, depending on which of the PCs explains their differences the most and on how much more PC1 explains compared to PC2.
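A minimal sketch of a PCA reduced to two components, using scikit-learn on standardized data (a common, but not mandatory, pre-treatment) and a hypothetical data matrix, could look like this:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical data matrix: rows are samples, columns are measured variables
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 2.1],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.3],
    [2.3, 2.7, 0.6],
])

# Standardize so that each variable contributes on a comparable scale
X_scaled = StandardScaler().fit_transform(X)

# Project the data onto the first two principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# explained_variance_ratio_ shows how much of the variability PC1 and PC2 each explain
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Scores (coordinates for the 2D plot):")
print(scores)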
