
6.2 Experiment Execution

6.2.2 Proof-of-Concept, single drive - 1

This experiment was executed to see whether the tools used by the correlation method worked properly and whether the method was capable of detecting differences between a single uninfected and an infected machine.

Machine Configuration

The experiment was set up with one machine that had a clean (uninfected) state and an infected state. Hash values of the hard drive in the different machine states are given in Appendix G. The uninfected state served as the baseline, while the infected state represented the infected machine. Instead of configuring two individual machines, we decided to create two states, each representing one machine, as reflected in Figure 20. This was achieved using VMware's snapshot functionality.

Figure 20: Proof-of-Concept Virtual Machine states

The machine(s) had the following configurations:

• CPU: 1 core

• Hard disk size: 4GB

• Physical Memory: 512MB

• OS: Windows XP w/SP2

• Virtual Environment: VMware Workstation 7.0.0 build-203739

• examplefile.txt file in the Documents and Settings folder

The initial state of the virtual machine was created using typical settings, which made for an easy and fast installation. At installation time, VMware tools were automatically installed on the machine, followed by a restart. When the machine was up and running, a snapshot was taken to preserve the clean state (performing the steps for Clean System Hashes in Section 6.1.1). Next, the examplefile.txt file (found in Appendix F) was copied to the machine's Documents and Settings folder, and a snapshot of the infected state was taken.

This file includes special string features (IP, email and URL) in order to verify the method's extraction capabilities.

Data Collection

With the NSRL RDS hash database and the hash database of the machine's clean state in place, the data from the infected machine was collected. The following procedure was performed for the infected machine:

• A snapshot of the infected system was taken in VMware

• The snapshot disk was added to the Forensic System and started

• The clean disk and its partition were identified as a device

• A Disk Image of the clean machine's partition was created, using dcfldd if=/dev/sdb1 of=/"image file"

• The access rights for the image file were changed to read-only, using the following command: chmod 444 "image file"

• It was verified that the input partition and the output file held the same data, using md5sum
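The verification and write-protection steps above can be illustrated with a minimal Python sketch. This is not the tool chain used in the experiment, which relied on dcfldd, chmod and md5sum; the function names md5_of, make_read_only and image_matches_source are hypothetical and only show the idea of hashing both the source and the image in chunks and comparing the digests.

```python
import hashlib
import os
import stat


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file (or block device), read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large images never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path):
    """Equivalent of chmod 444: read permission for owner, group and others."""
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)


def image_matches_source(source_path, image_path):
    """True if the source and the image produce the same MD5 digest."""
    return md5_of(source_path) == md5_of(image_path)
```

In such a sketch, image_matches_source would be called with the source partition and the image file as arguments, and a False result would mean that the image has to be acquired again.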

Examination

With a copied image file of the machine’s hard disk partition, we started the examination.

In order to do this, metadata about the files on the hard disk partition was extracted:

• File Metadata Extraction (2) was performed on the Disk Image file to create Feature File 1, using fiwalk -f -A /"output location".arff /"input location"

Removing Known Files

Since we wanted to filter out all unaltered and clean data from the infected machine, the hash database files of NSRL and the machine's clean state were used. The procedure was executed as follows:

• The Hash Reduction Tool was used for Hash Filtering (3) to reduce the number of file objects and to create Feature File 2, by running python /"tool location"/*tool.py. Feature File 1 was used as input when prompted. The hash database text file and the desired name for the output file were also supplied. This task was performed twice, once for the hash database of the clean system and once for the NSRL RDS hash database.

Since the hash database of the clean system was the smallest, it was more efficient to use it before the large NSRL RDS hash database file. It was important that the file with the reduced set of file objects from the first run was used as input for the second run, so that each run reduced the set further.
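The staged filtering described above can be sketched in a few lines of Python. This is an illustrative assumption and not the Hash Reduction Tool itself: the record layout (a dict with "filename" and "md5" keys) and the function names filter_known and reduce_in_stages are hypothetical.

```python
def filter_known(records, known_hashes):
    """Keep only the records whose MD5 value is not in the known-hash set."""
    known = {value.lower() for value in known_hashes}
    return [record for record in records if record["md5"].lower() not in known]


def reduce_in_stages(records, hash_sets):
    """Apply each hash set in turn, smallest first.

    The output of one run is the input of the next run, so every stage
    only has to look at the file objects that survived the previous one.
    """
    for known_hashes in sorted(hash_sets, key=len):
        records = filter_known(records, known_hashes)
    return records


# Hypothetical example data: the MD5 values are placeholders, not real hashes.
feature_file_1 = [
    {"filename": "ntldr", "md5": "AAAA"},
    {"filename": "notepad.exe", "md5": "bbbb"},
    {"filename": "examplefile.txt", "md5": "cccc"},
]
clean_system_hashes = {"aaaa"}
nsrl_hashes = {"bbbb", "dddd", "eeee"}

feature_file_2 = reduce_in_stages(feature_file_1, [nsrl_hashes, clean_system_hashes])
print([record["filename"] for record in feature_file_2])  # ['examplefile.txt']
```

The final result does not depend on the order of the hash sets; ordering them from smallest to largest only reduces the number of lookups against the larger database.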

Feature Extraction

Having filtered out clean and known files, we extracted additional features in order to obtain the desired Feature File 3.

• The Feature Extraction Tool was executed by running python /"tool location"/*tool.py

The Disk Image and Feature File 2 were used as input, along with the desired output file name.

• Metadata supplied by the forensic case, i.e., machine number and media number, was also added before extracting the features.

Analysis

In this experiment there was no link mining, since we only considered one machine.

However, we pre-processed the extracted feature data in Weka in order to get a better representation of it. This also allowed us to verify the method's ability to reduce the number of file objects, to verify that the extracted data was a correct representation of the original source, and to find "examplefile.txt". The feature file was opened in Weka and pre-processed as shown in Table 5.

Pre-processing Task Argument Feature

Table 5: Pre-processing performed on extracted features

The features that were not pre-processed, and therefore do not appear in Table 5, were already in a suitable (numeric) format. The full set of features and their types is given in Table 6.

Features | Type
ID, Filesize, Inode, Mode, Nlink, Uid, Seq, Unalloc, Machine, Media, Entropy | Numeric
Mtime, Ctime, Atime, Filename, Name_type, Crtime, Libmagic, MD5 | Nominal
IPs, Emails, URLs | Numeric for all individual values

Table 6: All features and their type after pre-processing

In addition to the removal of superfluous attributes, pre-processing had to deal with several invalid values produced by the string to word vector filter. These invalid strings were caused by the regular expressions used for extracting IPs, emails and URLs. Examples are email addresses like "21@shell32.dll", IP addresses with values outside the valid range (four groups ranging from 0-255 [105]) and incomplete URLs. Since the number of invalid strings was low, they were removed manually.

The final, pre-processed version of Feature File 3 was stored and opened again in Weka. Weka Explorer and its Viewer functionality, along with the Visualization functionality, were used to analyze and present the machine's file objects and their features.
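The out-of-range IP addresses mentioned above illustrate why a purely syntactic regular expression over-matches. As a hedged sketch, and not the Feature Extraction Tool used in the experiment, candidates found by a regular expression can be validated with Python's standard ipaddress module. The pattern IPV4_CANDIDATE and the function names are assumptions made for this illustration.

```python
import ipaddress
import re

# Syntactic candidate: four groups of one to three digits separated by dots.
# This deliberately over-matches, e.g. it also accepts "300.1.2.3".
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def is_valid_ipv4(text):
    """True only if every group lies in the range 0-255."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


def extract_valid_ipv4(text):
    """Return the candidates in text that are also valid IPv4 addresses."""
    return [match for match in IPV4_CANDIDATE.findall(text) if is_valid_ipv4(match)]


print(extract_valid_ipv4("gateway 192.168.0.1, version 300.1.2.3"))  # ['192.168.0.1']
```

A validation step of this kind would automate only the IP part of the manual clean-up; invalid email addresses and incomplete URLs would need their own checks.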

Results and Discussions

The number of file objects extracted before filtering was 13869. After filtering out file objects based on the clean system hashes, the number of file objects was reduced to 415.

The NSRL filtering did not remove any additional file objects.
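To put these counts in proportion, the reduction can be expressed as a percentage. The short Python calculation below uses only the two numbers reported above; the function name reduction_percent is hypothetical.

```python
def reduction_percent(before, after):
    """Share of objects removed by filtering, in percent."""
    return 100.0 * (before - after) / before


objects_before = 13869  # file objects extracted before filtering
objects_after = 415     # file objects left after clean-system hash filtering

removed = reduction_percent(objects_before, objects_after)
print(f"{removed:.2f}% of the file objects were removed")  # 97.01% of the file objects were removed
```

In other words, roughly 3% of the original file objects remained for the subsequent feature extraction and analysis.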

Weka Explorer's Viewer function presented a clear overall view of the file objects (represented in rows) and their features (represented in columns). The ability to sort on the features improved the visual presentation of the file objects and their related feature values. This can be seen in conjunction with the way of looking at patterns by using different views of, e.g., users and files for forensics as presented in [12].

When analyzing the results, "examplefile.txt" was successfully located through Weka Explorer's Viewer function. Unallocated files, and examples of extracted IPs, emails and URLs associated with the files' content, were also found. Depending on the context, automatic extraction of communication-associated features reduces the manual work of finding such correlations over multiple file objects. Appendix G presents different views of the identified "examplefile.txt", unallocated objects and special string features in Weka Explorer's Viewer function.

The visualization feature of Weka gave us a visual presentation of the data, where expert knowledge of the IP address (192.168.0.1) was used to detect associated file objects.

A screenshot of Weka's Visualizer is shown in Figure 21. In the figure, file objects with the IP address in their content are assigned the value 1 on the X-axis, and the Y-axis shows the time of creation.

The colors represent raw files (blue), directories (red) and neither/unallocated (green).

Four files were identified as anomalies and associated with the particular IP address, shown in the red circle. The files were $LogFile, $MFT, Documents and Settings\Administrator\My Documents\examplefile.txt and Documents and Settings\Administrator\Local Settings\Temporary Internet Files\Content.IE5\K7Q1AF63\examplefile[1].txt. All of them are presented according to the time at which they were created. Due to the limited space available in the figure, the time stamps are represented only by the first digit, 2, of the year 2010, and not by the whole 'yyyy-MM-dd HH:mm:ss' string.

Figure 21: Visualization of files' Creation Time and IP

It is clear that this way of representing file metadata and special string features improves efficiency, because the volume of data to process is decreased. The initial image file was 4GB, while the filtered feature file of the same hard disk image was 148KB after pre-processing. The representation also makes it possible to visualize the data in order to, e.g., simplify the task of detecting anomalies.

In document Cross-Computer Malware Detection in Digital Forensics (pages 86-91)