Experiment Discussions - Cross-Computer Malware Detection in Digital Forensics

The experiments were designed so that they could be executed on a personal computer with normal performance. This meant that at most five test computers could be run at the same time in the experiments involving link mining between the machines. However, this number of machines was enough to test whether it was possible to identify correlations among them. The results clearly show that correlations exist between all of the machines in both multi-host experiments.

The malware used for the two multi-host experiments shows how easily one can create and configure malware that employs the robot functionalities of a botnet together with spying features suitable for distributed OB attacks. Even though the Keylog Bot Malware only has one-way communication with the controller, it still shows a realistic approach to capturing keystrokes and sending them away. The Spybot malware used for the final experiment, however, has numerous C&C features suitable for malicious attacks against naive OB clients. The use of three experiments, one of which was infected in a realistic manner with real malware code, built the foundation for a solid assessment of the correlation method.

An important aspect of correlating information from multiple sources in digital forensics is to represent the timestamps consistently by adjusting their values in relation to system clocks and time zones. In our experiments, all machines were running in the same time zone with the same system clock values. This removed the issues of drifted system clocks and other changes that might affect the integrity of computer timestamps.
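When hosts do differ in time zone or clock offset, the adjustment could look roughly like the following minimal sketch. The function name, the per-host UTC offset, the clock skew parameter and the sample values are illustrative assumptions; they are not part of the tooling used in the experiments.

```python
from datetime import datetime, timedelta, timezone


def normalize_timestamp(local_time, utc_offset_hours, clock_skew_seconds=0.0):
    """Convert a naive local timestamp from one host to UTC.

    utc_offset_hours: the host's time zone offset from UTC (assumed known).
    clock_skew_seconds: how far the host clock ran ahead of a reference
    clock (positive = ahead); assumed to be measured at acquisition time.
    """
    host_zone = timezone(timedelta(hours=utc_offset_hours))
    aware = local_time.replace(tzinfo=host_zone)
    corrected = aware - timedelta(seconds=clock_skew_seconds)
    return corrected.astimezone(timezone.utc)


# Two hypothetical hosts that recorded the same event.
host_a = normalize_timestamp(
    datetime(2020, 1, 1, 14, 0, 30), utc_offset_hours=2, clock_skew_seconds=30
)
host_b = normalize_timestamp(datetime(2020, 1, 1, 12, 0, 0), utc_offset_hours=0)
```

After normalization both timestamps refer to the same instant in UTC, so file events from different machines can be compared directly.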

For the clustering task, where similar file objects were grouped, most files were found to be common across the set of computers. Because the experiments were conducted by cloning the machines, there was minimal variation in the file systems of the machines.

The temporal aspect of when the machines were used, the environment they were used in, and the imposed activities and actions are possible impact factors. Especially for the Malware from the Wild experiment, the clustering procedure suffered slightly in that the identified clusters had vague dissimilarities. However, due to the file objects' features, especially the IP, email and URL strings, along with the entropy value, timestamps and the file's content type, it was still possible to identify the anomalies clearly.

Regarding the efficiency of the method, it is hard to estimate whether the feature extraction and reduction of file objects would be affected significantly when the data volumes increase. This is because it is hard to estimate the number of string features, e.g., IP addresses, present in the files. However, the main purpose is to improve the efficiency and effectiveness of the analysis, here being the link mining. The feature file of one machine, initially having 4 GB of data, is not larger than 148 KB, so the size for 5 machines can be estimated as approximately 5 times that, or 740 KB. By the same linear estimate, the size for a data set of, e.g., 5000 machines would not be larger than 740,000 KB, or approximately 722.7 MB. This is a volume that is still possible to handle efficiently when performing link mining analysis. In such cases, an efficient implementation of the clustering algorithm would be preferable.
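The size estimate above is a simple linear extrapolation, which can be reproduced with the short sketch below. The function and constant names are illustrative, and the calculation assumes that the per-machine feature file stays at or below the 148 KB observed in the experiments, which may not hold for machines with more string features.

```python
KB_PER_MB = 1024
FEATURE_FILE_KB = 148  # upper bound per machine, taken from the experiments


def estimated_feature_size_kb(machines, per_machine_kb=FEATURE_FILE_KB):
    """Linear upper-bound estimate of the combined feature-file size in KB."""
    return machines * per_machine_kb


for machine_count in (1, 5, 5000):
    size_kb = estimated_feature_size_kb(machine_count)
    print(f"{machine_count} machines: {size_kb} KB ({size_kb / KB_PER_MB:.1f} MB)")
```

For 5000 machines this gives 740000 KB, i.e. about 722.7 MB when 1 MB is taken as 1024 KB.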

All of the analysis tasks performed depend heavily on the features representing the files for each machine. During the experiments we saw that many of the IP, URL and email strings were false positives (FPs), due to the regular expressions used by the Feature Extraction Tool.
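To illustrate how a loose pattern produces such false positives, and how some of them could be filtered, consider the following minimal sketch. The regular expression is a deliberately permissive assumption and is not the expression used by the Feature Extraction Tool; the function name and the sample string are also hypothetical.

```python
import ipaddress
import re

# Deliberately loose dotted-quad pattern of the kind that yields false
# positives; an assumption, not the Feature Extraction Tool's expression.
LOOSE_IPV4 = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def classify_ipv4_candidates(text):
    """Return (candidate, is_valid, is_private) for each dotted-quad match."""
    results = []
    for candidate in LOOSE_IPV4.findall(text):
        try:
            address = ipaddress.IPv4Address(candidate)
        except ipaddress.AddressValueError:
            # Matches the pattern but is not a possible IPv4 address.
            results.append((candidate, False, False))
            continue
        results.append((candidate, True, address.is_private))
    return results


sample = "host 192.168.1.10 dns 8.8.8.8 version 300.12.7.1"
classified = classify_ipv4_candidates(sample)
```

Validation of this kind only rejects strings that cannot be addresses at all. A four-part number such as 1.2.3.4 taken from a version string would still pass, so the private-range flag and the context of the source file remain useful when judging suspicious matches.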

We do not know for sure whether false negatives (FNs) also exist for these strings. IP addresses that are typically associated with local area networks, and with which the computers never had any contact, we suspect are FPs from another type of data source representing numbers in the same format as IP addresses. It is especially the machines' pagefile that is associated with multiple IP addresses. The access to strings in the pagefile also reflects the method's ability to extract strings from paged memory. In addition to these string features, partial overlap was identified across multiple features. For example, name_type and unallocated overlapped in that name_type had an unknown value that was always set when unallocated was set. This could have led to slightly skewed priorities, due to the redundancy between the features. In addition, the file content data was always present when the type of a file was a directory.
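An overlap of this kind, where one feature value always accompanies another, could be checked with a small helper such as the one sketched here. The helper name and the records are invented for illustration; only the feature names name_type and unallocated come from the experiments.

```python
def implies(records, antecedent, consequent):
    """True if every record satisfying `antecedent` also satisfies `consequent`."""
    return all(consequent(record) for record in records if antecedent(record))


# Hypothetical feature records; the values are invented for illustration.
records = [
    {"name_type": "unknown", "unallocated": True},
    {"name_type": "file", "unallocated": False},
    {"name_type": "directory", "unallocated": False},
    {"name_type": "unknown", "unallocated": True},
]

unallocated_implies_unknown = implies(
    records,
    antecedent=lambda record: record["unallocated"],
    consequent=lambda record: record["name_type"] == "unknown",
)
```

If the check holds across the whole data set, one of the two features carries little additional information and could be dropped or down-weighted before clustering.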

The fact that the cluster centroids took on the most common feature values reflects each feature's impact on the analysis, and shows that the features should be evaluated properly in order to improve the results. This applies especially to the MD5 cluster centroid values in the Malware from the Wild experiment.

Overall, the method has great advantages when it comes to identifying similarities between multiple machines and using this information to detect possible malware. If presented properly, the results obtained from a correlation-based method can provide useful information for further investigations and serve as a tool to assist the parties involved in a court case or in incident response.

7 Discussions and Implications

In this chapter, the implications of the proposed correlation method and the outcome of the executed experiments are presented. The project's value and its affiliation to the research area of digital and computational forensics are described to express the project's outcome. The thesis, the work involved and the answers to the research questions are summarized at the end of this chapter.
