

5.2 Practical Implementation

5.2.1 Tools and Data formats

The core of the practical implementation is its tools. Most of the correlation detection process is carried out on a configured Forensic System, a Linux machine set up with the proper conditions for running open source forensic tools and features.

The tools and the data format used to create the feature files, along with the tool used to find links between the machines, play central roles in the correlation method. It is therefore important to present them, with their advantages and limitations. The categories of tools can be seen in conjunction with Figure 14, where Data Collection and Examination

are mainly forensic tools and Link Mining consists of machine learning algorithms implemented in a software application for handling data mining challenges. The tools are presented consecutively, in the order they are used in the correlation method, based on the steps in Figure 16. However, Combining and Linking Machines are simple operations that do not require advanced tools or techniques.

ARFF

Before getting into the tools, we present the data format used to represent the detection features defined in Section 5.1.4. Attribute-Relation File Format (ARFF) is a file format that represents data sets as independent objects sharing a defined set of attributes [87, 59]. It was primarily developed for Weka, the machine learning tool.

An example file is shown in Figure 17.


%Example ARFF file with file metadata from a computer hard disk

@relation files

@attribute id NUMERIC

@attribute filename string

@attribute size { 100, 150, 300 }

@data

1, textfile, 100
2, executable, 300
3, picture, 150

Figure 17: Example ARFF file with file objects

The ARFF format has a @relation declaration in the beginning of the file, naming the file's data. This is followed by the attribute declarations. Finally there is the data section with the actual objects. Every object needs a value for every attribute (the attributes are also known as the features of the objects), and the value has to match the attribute's format. Missing values are allowed and are represented with a '?'. The attributes have to be defined in the beginning of the ARFF file and their format can be: numeric (real or integer values), nominal (representing a set of defined values), string (textual values) or date (string format, by default using ISO-8601 combined date and time: "yyyy-MM-dd'T'HH:mm:ss").
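As a small illustration of these declarations, the sketch below writes an ARFF file in plain Python. It is a hypothetical helper, not one of the thesis tools; it simply shows the @relation, @attribute and @data sections, a date attribute and a '?' used for a missing value.

# write_example_arff.py: minimal sketch of the ARFF layout described above
rows = [
    (1, "textfile", 100, "2010-04-28T12:32:00"),
    (2, "executable", 300, "?"),  # '?' marks a missing date value
]

with open("example.arff", "w") as f:
    f.write("@relation files\n\n")
    f.write("@attribute id NUMERIC\n")
    f.write("@attribute filename string\n")
    f.write("@attribute size {100, 150, 300}\n")
    f.write("@attribute mtime date \"yyyy-MM-dd'T'HH:mm:ss\"\n")
    f.write("\n@data\n")
    for obj_id, name, size, mtime in rows:
        date_field = mtime if mtime == "?" else '"%s"' % mtime
        f.write("%d, %s, %d, %s\n" % (obj_id, name, size, date_field))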

ARFF was chosen for the correlation method because of the available tools for creating and managing the format, e.g., Weka and Fiwalk (presented later in this section). ARFF has also been used successfully to represent features in feature files for data mining in other malware detection methods, e.g., in [58].

Dcfldd

Dcfldd is an open-source UNIX tool for copying raw data, developed by the U.S. Department of Defense Computer Forensics Lab. It is an improved version of GNU dd with additional features for digital forensics. In digital forensics, the tool is primarily used to create a copy of a computer's hard disk, encapsulating the raw data in an image file.

While dd works for copying data from one source to another, independent of type and context, dcfldd adds functionality such as hashing the input data on the fly, giving copy status updates, splitting the output file and verifying source and destination integrity [27, 88].

Several other imaging tools with even more features exist, but for the purpose of the correlation method, dcfldd is chosen mainly for its simplicity and its status updates during imaging. Dcfldd is the tool used for Data Collection (1) in Figure 16.
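For illustration, the imaging step can be scripted from the same Python environment that the other tools run in. The sketch below is an assumed invocation: the device path, image name, block size and exact option set are placeholders and should be checked against the installed dcfldd version.

import subprocess

# Hypothetical example: image the first disk, hashing the input on the fly and
# writing the hash to a log file. Paths and options are assumptions.
cmd = [
    "dcfldd",
    "if=/dev/sda",           # source disk (placeholder)
    "of=machine1.dd",        # raw image file
    "bs=4096",               # block size
    "hash=md5",              # hash input data on the fly
    "hashlog=machine1.md5",  # store the resulting hash
    "conv=noerror,sync",     # continue on read errors, pad failed blocks
]
subprocess.run(cmd, check=True)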

Fiwalk

Fiwalk is an open-source metadata extraction tool developed by Simson L. Garfinkel [17]. The tool was mentioned in Section 2.1 as a way of improving the efficiency and effectiveness of digital forensics, and that is exactly what it does. Given an image file of a computer's hard disk partition, it extracts file system and document metadata.

It can represent the files it extracts as objects in XML or ARFF format.

The tool is built on The Sleuth Kit (TSK) digital investigation tool, and in particular on the tsk_vs_part_walk(), tsk_fs_dir_walk() and tsk_fs_file_walk() libraries in TSK.

Due to the importance of file metadata in the correlation method, and the fact that Fiwalk is able to produce ARFF files and uses TSK libraries to acquire the data, Fiwalk was suitable as the file metadata extraction tool (covering File Metadata Extraction (2) in Figure 16).
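A possible way to drive Fiwalk from the same Python scripts is sketched below. Only the -f option is taken from the description above; the -A flag for writing ARFF output directly is an assumption about the installed Fiwalk version and should be verified against its usage message.

import subprocess

# Sketch: extract file metadata from a partition image into an ARFF feature file.
# "-f" adds file/libmagic content information (see above); "-A <file>" is an
# assumed flag for ARFF output and must be checked against `fiwalk` itself.
subprocess.run(
    ["fiwalk", "-f", "-A", "machine1_files.arff", "machine1.dd"],
    check=True,
)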

Table 3: Default Fiwalk attributes of an NTFS partition, with the -f option

The attributes generated by Fiwalk, and their types, are presented in Table 3. These are the default attributes obtained from an imaged NTFS hard disk partition, where the -f option is added to obtain file content information from the file tool and the libmagic library. The table has multiple occurrences of MD5 and SHA1 values; the interesting ones are the last two, holding a hash of the file's content.

Hash-Based Removal Tool

This is a self-developed tool for filtering out known file objects. Since Fiwalk finds all files in a target file system and extracts metadata about them, the file objects with known content can be removed to improve effectiveness (discussed in Section 2.1 as part of a forensic examination procedure). The Hash-Based Removal Tool is responsible for Hash Filtering (3) in Figure 16.

The scripting language Python is used to develop this tool (as for all the other developed tools for the correlation method), due to its easy way of including UNIX commands and modules (e.g., the ARFF module package). As stated in [17], "Python makes an excellent language for writing forensic tools because of its flexible object model, its built-in garbage collection, and its interactive prototyping environment."

The Hash-Based Removal Tool is based on hfind, a tool from TSK that can search for hash values in large hash databases. hfind was chosen because of its relatively effective searching approach: by creating an index file of the input hash set, the search for hash values is much faster than with standard keyword searches, e.g., using the grep tool. A small experiment was conducted, showing that a grep search through the NSRL hash database file was much more time consuming than using the hfind tool. hfind supports the nsrl-md5, nsrl-sha1, md5sum and hk (HashKeeper) hash representation types.
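The hfind workflow the tool builds on can be sketched as follows: the hash database is indexed once, and each MD5 value from the feature file is then looked up against that index. The database path and the way the output is interpreted are assumptions for illustration.

import subprocess

HASH_DB = "NSRLFile.txt"  # placeholder path to the hash database

# Build the hfind index once; nsrl-md5 is one of the supported database types.
subprocess.run(["hfind", "-i", "nsrl-md5", HASH_DB], check=True)

def is_known(md5_value):
    # Look the hash up in the indexed database; hfind reports misses with
    # "Hash Not Found" on the output line for that hash.
    result = subprocess.run(["hfind", HASH_DB, md5_value],
                            capture_output=True, text=True)
    return "Hash Not Found" not in result.stdout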

The Hash-Based Removal Tool is built on the basic pseudocode presented below, and the source code is found in Appendix C:

1 - Get and parse input ARFF file
2 - Get hash database file
3 - For each object, look up hash
4 - For each found hash, remove object

Hash Extraction Tool

In order to remove not only the hash values found in official hash databases, e.g., NSRL RDS, another tool was developed that extracts hash values and creates a hash database file with the same output format as md5sum. This tool makes it possible to obtain hash values from a clean system, with more or less the same configuration as the infected ones, and thereby remove even more known file objects from the feature file. The source code of this tool is found in Appendix B; it uses many of the same techniques as the Hash-Based Removal Tool, e.g., parsing an input ARFF file. A simple example output file of the Hash Extraction Tool is presented in Figure 18.

The Hash Extraction Tool requires only an ARFF file as input, and the user has to select an output filename in text format (e.g., file.txt).
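A minimal sketch of the idea, assuming the Fiwalk ARFF file contains an md5 attribute holding the content hash (as in Table 3), is to read every data row and write the hash together with an object label in md5sum-style format, as in Figure 18. The attribute name and the simple column handling are simplifying assumptions.

def extract_hashes(arff_path, out_path, md5_attr="md5"):
    # Parse the ARFF header to find the md5 column, then emit one
    # "<hash>  objectN" line per data row.
    attributes, in_data, count = [], False, 0
    with open(arff_path) as src, open(out_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line or line.startswith("%"):
                continue
            low = line.lower()
            if low.startswith("@attribute"):
                attributes.append(line.split()[1])
            elif low.startswith("@data"):
                in_data = True
                md5_index = attributes.index(md5_attr)
            elif in_data:
                values = [v.strip() for v in line.split(",")]
                count += 1
                dst.write("%s  object%d\n" % (values[md5_index], count))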

Feature Extraction Tool

The Feature Extraction Tool is developed to extract special and case-specific features that can improve the correlation method's ability to identify malware traces. The tool is used in the correlation method for Feature Extraction (4) in Figure 16. Since Fiwalk already extracts most of the desired metadata features, this tool extracts and adds the following content-based features to the feature file:

• Machine ID

• Media ID

• Entropy

• IP addresses

• Email addresses

• URLs

ad617ac3906958de35eacc3d90d31043  object1
45e02ce7d5f9f6034cb010429ce19205  object2
8c363d02d0b129277453563befb68380  object3
2466459bc0cf09beef06ba445d2f7b4e  object4
2466459bc0cf09beef06ba445d2f7b4e  object5
8e215da06984db90d45f84386e562799  object6
5ee0a1c448311ce476968856f4b706e2  object7
1690aad47a0f7c82a60041ef28eb5221  object8
7b9cf841881493c700027214e9db753d  object9
649db99f45048fa7b189fc58fe4fb850  object10

Figure 18: Example output of Hash Extraction Tool

The first two ID values are metadata specific to the case and are supplied by the user when the features are extracted. The entropy is extracted for each file, based on the file's content, using the Ent tool. IP¹, Email and URL strings are extracted with regular expressions applied to the file's content, which is extracted using the TSK icat tool. The regular expressions are important in order to extract as many relevant strings as possible; they are found in Appendix A, along with the rest of the tool's source code. The chosen regular expressions are based on existing ones, modified and tested against typical IP, URL and Email strings [89, 90].
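The thesis uses the Ent tool for the entropy value; an equivalent byte-level Shannon entropy can also be computed directly in Python, as sketched below. This is an illustrative substitute, not the implementation used in the tool.

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # Byte-level Shannon entropy in bits per byte (0.0 - 8.0),
    # comparable to the value reported by ent.
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: entropy of a file's content (high values suggest packed or encrypted data)
with open("suspect_file.bin", "rb") as f:
    print(shannon_entropy(f.read()))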

This tool has to deal with the ARFF format in another way than the Hash-Based Removal Tool and the Hash Extraction Tool, because it adds attributes to each object. This affects how the parsed input file has to be handled by the program, since all the objects are parsed in as lists. In order to comply with the requirements of the ARFF format, where each object needs the same number of attributes, the additional features have to be added to all objects. The machine and media ID are the same for all objects, while the entropy value is calculated for each file object separately. When it comes to the IP, Email and URL strings, a file can have 0-n such strings. This means that one cannot add each found string, e.g., each IP, as an extra attribute for each file, because that would give a varying number of attributes per file. What the tool does instead is to concatenate all string features of one type found in a file into one long string attribute (one attribute each for IPs, Emails and URLs). This gives a fixed number of attributes.
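A sketch of that approach is shown below, with simplified regular expressions (the tested patterns actually used are in Appendix A) and a plain space as the assumed separator inside the long string attribute: all matches of one type found in a file are concatenated into a single string value, so every object keeps the same number of attributes.

import re

# Simplified illustrative patterns; the real patterns are in Appendix A.
IP_RE    = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE   = re.compile(r"\bhttps?://[^\s\"']+", re.IGNORECASE)

def string_features(content: str) -> dict:
    # Collapse 0-n matches per type into one string attribute each.
    # The space separator is an assumption; it must match what the
    # Weka pre-processing step later splits on.
    return {
        "ips":    " ".join(IP_RE.findall(content)),
        "emails": " ".join(EMAIL_RE.findall(content)),
        "urls":   " ".join(URL_RE.findall(content)),
    }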

Since a long string attribute would make it hard to compare individual strings of several file objects with each other, the long string attribute needs to be pre-processed before any linking is applied. This pre-processing is handled by Weka, which is presented in the following section.

¹ IPv4 is the Internet Protocol version we are using.

The Feature Extraction Tool requires the following input data:

• An image file of a computer's hard disk partition

• The corresponding ARFF file generated by Fiwalk for the same partition

• Machine number

• Media number (in case of multiple media in one machine)

In addition, the user must provide the desired output filename in ARFF format (e.g., file.arff). The Feature Extraction Tool is built on the basic pseudocode presented below:

1 - Get and parse input ARFF file
2 - For each object:
    - Add forensic case supplied metadata (machine and media number)
    - Add entropy value
    - Add IPs
    - Add Emails
    - Add URLs

Weka

Weka is an open source Java-based machine learning workbench with state-of-the-art algorithms and pre-processing capabilities. The name Weka stems from Waikato Environment for Knowledge Analysis, and the tool was developed at the University of Waikato, New Zealand.

Weka consists of four applications, of which the Explorer is the one focused on in this section. The Explorer, loaded with the example ARFF file from Figure 17, is shown in Figure 19. The figure shows the preprocessing stage with a list of attributes, the values for the selected one (size) and a visualization of their weight. As reflected in the top banner, the tool can be used for most data mining problems: preprocessing, classification, clustering, regression, association rule mining and attribute selection [59, 85].

Due to all the features tied to each file object in the feature files, the pre-processing capabilities were crucial in order to perform clustering. With varying feature value types, e.g., numeric and string, the pre-processing filters in Weka make it possible to represent them properly for clustering. This is especially true for the long string attributes of IPs, Emails and URLs. It is desirable that each address string is represented by an attribute, such that each file containing the string will be marked. The concrete pre-processing filter steps used for the correlation method are presented in Table 4.
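For reference, the same filters can also be run outside the Explorer GUI through Weka's command-line interface; the sketch below (file names and class path are placeholders) applies the string-to-word-vector filter from Python.

import subprocess

# Sketch: split the long string attributes into individual word attributes with
# Weka's StringToWordVector filter. weka.jar must be on the class path and the
# file names are assumptions.
subprocess.run(
    ["java", "-cp", "weka.jar",
     "weka.filters.unsupervised.attribute.StringToWordVector",
     "-i", "features.arff", "-o", "features_vectorized.arff"],
    check=True,
)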

The Weka Explorer has an easy to use implementation of the K-means algorithm, called SimpleKMeans. The default distance function, and the one considered for the correlation method, is the Euclidean distance.
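A command-line equivalent of the Explorer's SimpleKMeans run could look like the sketch below; the cluster count and file name are placeholders, and Euclidean distance is used by default.

import subprocess

# Sketch: cluster the pre-processed feature file with SimpleKMeans.
# "-N" sets the assumed number of clusters; "-t" points to the ARFF file.
subprocess.run(
    ["java", "-cp", "weka.jar", "weka.clusterers.SimpleKMeans",
     "-t", "preprocessed.arff", "-N", "4"],
    check=True,
)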

For the correlation method, it is the pre-processing capabilities of the tool, the clustering capabilities and corresponding algorithms, along with the easy to use GUI, that make Weka the first choice for Link Mining. The tool also influenced the choice of the ARFF format, which was originally created for it. From Figure 16, Weka is responsible for Pre-processing (6) and Clustering (7).

  Figure 19: Illustration of the Weka Explorer

Remove (weka.filters.unsupervised.attribute):
    Used to remove all features represented with only one uniform value,
    overlapping and system specific features (marked with *).

Numeric to nominal filter (weka.filters.unsupervised.attribute):
    Used for numeric attributes, and especially date attributes, to make
    them nominal and possible for the clustering algorithm to handle.

String to nominal filter (weka.filters.unsupervised.attribute):
    Used to convert string features to nominal. File objects with string
    attributes will be given a 1 or 0, depending on whether the string is
    present or not.

String to word vector filter (weka.filters.unsupervised.attribute):
    Used to divide features consisting of many "-separated strings.

Table 4: Applied pre-processing steps from Weka

Finally, the Viewer and Visualization functionality of Weka's Explorer gives good insight into the data set in question, where different views of, and relations between, the data objects and their features emphasize their details.

Viscovery SOMine

Viscovery SOMine is the tool used for creating SOM diagrams. SOMine is commercial software that provides data mining tasks based on self-organizing maps (SOM) [91]. This tool is used only as part of the clustering procedure, to estimate the initial number of clusters. The tool cannot handle .arff files directly; it only accepts .xls files, special Viscovery data mart and XML files, SPSS files, and text files.
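Since SOMine does not read .arff files, a small conversion step is needed before the feature file can be imported. The sketch below dumps the data section of an ARFF file to a tab-separated text file with the attribute names as a header; it is a simplistic illustration that assumes dense ARFF data without embedded commas in values.

def arff_to_txt(arff_path, txt_path):
    # Write attribute names as a header line, then one tab-separated row per object.
    names, in_data = [], False
    with open(arff_path) as src, open(txt_path, "w") as dst:
        for line in src:
            line = line.strip()
            if not line or line.startswith("%"):
                continue
            low = line.lower()
            if low.startswith("@attribute"):
                names.append(line.split()[1])
            elif low.startswith("@data"):
                in_data = True
                dst.write("\t".join(names) + "\n")
            elif in_data:
                dst.write("\t".join(v.strip() for v in line.split(",")) + "\n")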