
The classification experiments use several data sets from colorectal and hepatic cancer. These data sets have few samples, but each covers several hundred to thousands of microRNAs (miRNAs). Every sample is labeled as either 'normal' or 'tumor'. The colorectal cancer samples come from different types of tissue in different parts of the colon, e.g. rectal, ascending and sigmoid. These were initially split into separate groups, but the PCA plots in Section 4.2 showed that the groups were quite comparable.

The samples are generated using different technologies. One data set is made using microarray technology and the rest are generated using RNA-sequencing technology. The two technologies are not inherently comparable, so Equation (3.4) is used to normalize the sequencing data to values comparable to the microarray data. Log-normalized values are preferred because raw sequencing values are absolute counts, which leave it unclear whether a high count means the sample was twice as large or contained twice as much miRNA. Furthermore, as sequencing technology picks up many more miRNAs, only miRNAs with a mean of at least 1.0 in normalized n_i values are kept. An overview of each data set, with miRNAs already filtered, can be found in Table 4.1.
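The normalization and filtering steps can be illustrated with a minimal sketch. Equation (3.4) is not reproduced in this section, so the formula below, log2 of reads per million plus a pseudocount of 1, is an assumption that stands in for it; the function names and the example counts are hypothetical.

```python
import math


def log_rpm(counts, pseudocount=1.0):
    """Return log2(reads per million + pseudocount) for one sample.

    `counts` maps miRNA name -> raw read count. The pseudocount keeps
    log2 defined for miRNAs with zero reads (an assumption of this sketch).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {m: math.log2(c * 1e6 / total + pseudocount) for m, c in counts.items()}


def filter_by_mean(samples, min_mean=1.0):
    """Keep miRNAs whose mean normalized value over all samples is >= min_mean."""
    kept = [
        m for m in samples[0]
        if sum(s[m] for s in samples) / len(samples) >= min_mean
    ]
    return [{m: s[m] for m in kept} for s in samples]


# Hypothetical raw counts for two sequencing samples.
raw = [
    {"miR-21": 900_000, "miR-122": 99_999, "miR-9": 1},
    {"miR-21": 500_000, "miR-122": 500_000, "miR-9": 0},
]
normalized = [log_rpm(s) for s in raw]

# miR-9 has a mean normalized value below 1.0 and is dropped.
filtered = filter_by_mean(normalized)

# Doubling every count (a "twice as large" sample) leaves the values unchanged.
doubled = log_rpm({m: 2 * c for m, c in raw[0].items()})
```

Because each count is divided by the sample total before the logarithm is taken, a sample that is simply sequenced twice as deeply maps to the same normalized values, which is the ambiguity the text describes for raw counts.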

Table 4.1: Overview of data sets. ID is the internal ID of the data set. Samples is the number of rows and MiRNAs is the number of columns in each data set. Technology refers to the technology used to generate the data set. Type refers to the disease the data set covers: HCC - hepatocellular carcinoma, CRC - colorectal cancer.

Name                        ID   Samples   MiRNAs   Technology   Type
Hepmark-Microarray          D1   146       396      Microarray   HCC
Hepmark-Tissue              D2   150       472      RNA-seq      HCC
Hepmark-Paired-Tissue       D3   37        381      RNA-seq      HCC
ColonCancer GCF-2014-295    D4   92        424      RNA-seq      CRC
GuihuaSun-PMID 26646696     D5   66        425      RNA-seq      CRC
PublicCRC GSE46622          D6   15        441      RNA-seq      CRC
PublicCRC PMID 23824282     D7   57        485      RNA-seq      CRC
PublicCRC PMID 26436952     D8   51        433      RNA-seq      CRC

Density plots give an idea of how well Equation (3.4) makes the different technologies comparable. Figure 4.1 shows such a plot for Hepmark-Microarray, Hepmark-Tissue and Hepmark-Paired-Tissue. Ideally, the lines overlap, with equal width and equal peaks. Although this is not exactly the case, they still do pair up quite well. On close inspection, the outlines of two main bodies can be seen, stretching from 0 to 15 and from 0 to 20. The first main body consists of D1, the microarray set, while the other consists of D2 and D3, the RNA-sequencing sets. The peaks of the two bodies are also quite close, at around 5 for D1 and 8 for D2 and D3, with peak densities in the range 0.12 to 0.14. The separate group of samples at -1 is due to the microarray set having its missing miRNAs filled in as -1 by the technology.
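To make the reading of such a plot concrete, the sketch below estimates a per-sample density with a Gaussian kernel and reports where it peaks. This is an illustration only: the kernel, the bandwidth of 1.0 and the example values are assumptions, not the settings used to produce Figure 4.1, and plotting is left out.

```python
import math


def gaussian_kde(values, grid, bandwidth=1.0):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def peak_location(values, grid, bandwidth=1.0):
    """Return the grid point where the estimated density is highest."""
    density = gaussian_kde(values, grid, bandwidth)
    best = max(range(len(grid)), key=lambda i: density[i])
    return grid[best]


# Same axis range as the figures: -5 to 20 in steps of 0.25.
grid = [-5.0 + 0.25 * i for i in range(101)]

# Hypothetical normalized expression values for two samples.
microarray_like = [-1.0, -1.0, 4.0, 5.0, 5.0, 6.0, 7.0]
sequencing_like = [6.0, 7.0, 8.0, 8.0, 9.0, 10.0]

peak_array = peak_location(microarray_like, grid)
peak_seq = peak_location(sequencing_like, grid)
```

Comparing `peak_array` with `peak_seq` gives a numeric counterpart to comparing the peaks of two lines by eye; the two -1 entries in the first sample form a small secondary bump of the kind described for the microarray set.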

One important problem is that the sets of features in these data sets do not match. Initially, the missing features were filled in as -1, because the absence of certain miRNAs can itself be a biomarker for tumor. However, this filling worsened the overall classification performance for both of the combined data sets, and thus all features that were missing from one or more of the data sets were dropped from the combined set. The baseline ROC curves can be found in the Appendix, on page 73 for filled features and on page 71 for the intersection of features.
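The two ways of combining data sets with mismatched features can be sketched as follows. The representation (each sample as a dictionary from miRNA name to value) and the function names are hypothetical and chosen only for illustration; the source code in the Appendix may organize this differently.

```python
def combine_with_fill(datasets, fill_value=-1.0):
    """Union of features; values a sample lacks are set to `fill_value`."""
    all_features = sorted({f for ds in datasets for s in ds for f in s})
    return [
        {f: s.get(f, fill_value) for f in all_features}
        for ds in datasets
        for s in ds
    ]


def combine_with_intersection(datasets):
    """Keep only the features that are present in every sample of every data set."""
    shared = sorted(set.intersection(*(set(s) for ds in datasets for s in ds)))
    return [{f: s[f] for f in shared} for ds in datasets for s in ds]


# Hypothetical data sets with partly different miRNAs.
d1 = [{"miR-21": 5.0, "miR-122": 7.0}]
d2 = [{"miR-21": 8.0, "miR-9": 3.0}]

filled = combine_with_fill([d1, d2])
intersected = combine_with_intersection([d1, d2])
```

With filling, every sample keeps all three miRNAs but carries -1 wherever its data set did not measure one; with the intersection, only `miR-21` survives, which corresponds to the approach the text settles on.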

This had a couple of important complications. When considering the individual data sets in Figure 4.2, the density at the peak differs slightly from Figure 4.1. For D1 alone the peak has a slightly higher density, while for D2 the density is the same but the peak lies at a higher normalized expression. This is because a different feature subset is used in the combined case. The expectation is that dropping features affects the lower expressed features more than the higher expressed ones. In addition, samples that deviate strongly from the rest, such as the orange sample centered at -1 in Figure 4.1, occur because most of the features that had a value were dropped when combining the data sets, leaving the sample with mostly -1 values from the microarray technology. Density plots proved quite effective for identifying such samples. In most cases these are samples that had been contaminated while the data sets were being made, and they were simply removed from the data sets when found. The excluded samples can be found in the source code in Appendix A.2.
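The text identifies deviating samples visually from density plots. As a complementary illustration, not the method used here, a sample dominated by the fill value could also be flagged numerically. The threshold of 0.5 and the function names below are assumptions made for this sketch.

```python
def fraction_filled(sample, fill_value=-1.0):
    """Return the share of a sample's values that equal the fill value."""
    values = list(sample.values())
    return sum(1 for v in values if v == fill_value) / len(values)


def flag_mostly_filled(samples, threshold=0.5, fill_value=-1.0):
    """Return indices of samples whose filled share exceeds `threshold`."""
    return [
        i for i, s in enumerate(samples)
        if fraction_filled(s, fill_value) > threshold
    ]


# Hypothetical samples after combining data sets.
samples = [
    {"miR-21": 5.0, "miR-122": 7.0, "miR-9": -1.0},   # one of three values filled
    {"miR-21": -1.0, "miR-122": -1.0, "miR-9": 2.0},  # two of three values filled
]
suspect = flag_mostly_filled(samples)
```

A check like this only points to candidates; whether a flagged sample was actually contaminated still has to be judged from the density plot and the sample's origin.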

Ironically, the data sets are more similar in terms of density before they are combined. However, the alternative of filling in missing features created similar situations, where more samples contained mostly -1 values. The RPM normalization makes the samples comparable, but it is not the only normalization needed to make the data sets comparable. More density plots with additional feature scaling can be found in Appendix A.5.3.

[Figure: "Density Plot of Multiple Datasets". x-axis: Normalized Expression (−5 to 20); y-axis: Density (0.00 to 0.20).]

Figure 4.1: A density plot of the Hepmark data sets D1, D2 and D3. Each line represents a sample and its values; the higher the line is at some value, the more common that value is in the sample.

[Figure, two panels, both titled "Density Plot of Hepmark Tissue". First panel x-axis: Microarray Expression (−5 to 20); second panel x-axis: Normalized Expression (−5 to 20); y-axis in both: Density (0.00 to 0.20).]

Figure 4.2: A density plot of the Hepmark data sets D1 and D2. Each line represents a sample and its distribution of values; the higher the line is at a value, the more common that value is in the sample.