
2.3 Block-hashing


Traditionally, hashing of data objects is very common in the digital forensics environment, but mostly to create hashes of whole objects such as a full disk, a disk image or a file.

Block-hashing is a technique to create digital signatures from pieces of such objects [29].

In 2010, Garfinkel [13] wrote an article that focused on using block hashes to perform sub-file forensics. Sub-file forensics is the same technique as the block-hashing used in this project.

In [14] the term hash-based carving is used to describe the process of comparing equal-size blocks from files with blocks from a data medium. The term is a good description of the comparison phase, but that phase is only one part of the whole block-hashing process. In 2015 the same technique was described as "Distinct Sector Hashes for Target File Detection" [14].

In this project, the terminology block-hashing is used. The terms "sub-file" in [13] and "Distinct Sector Hashes" in [14] refer to technically the same method, but those terms are easy to relate specifically to either files or disk sectors. Block-hashing is not tied to a particular type of object and covers disks, files and other objects. In block-hashing the objects are divided into smaller pieces, and earlier work on this subject uses either 512 or 4,096 bytes as block size.


There are other methods to perform block hashing, such as fuzzy hashing, which is a method to detect likeness, especially related to malware [24]. In 2006, Kornblum [22] wrote an article that focused on identifying almost identical files using context triggered piecewise hashing (CTPH). The article also covered fuzzy hashing. This project has no intention of locating likeness, only equality. The difference is demonstrated with the expressions below, where B is an arbitrary block of data: this project depends on B1 == B2, while fuzzy hashing depends on B1 ≈ B2. A more in-depth discussion of fuzzy hashing is available in [19]. This project does not involve fuzzy hashing.
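As a minimal illustration of this equality requirement, consider the sketch below; the SHA-1 algorithm and the 512-byte blocks are assumptions for illustration and are not prescribed by the thesis.

```python
import hashlib

def block_hash(block: bytes) -> str:
    """Return the hex digest of a single block (SHA-1 chosen for illustration)."""
    return hashlib.sha1(block).hexdigest()

# Exact matching: B1 == B2 holds only when every byte is identical, so the
# hashes match. Fuzzy hashing (B1 ~ B2) would instead report a similarity
# score and is not used in this project.
b1 = b"A" * 512
b2 = b"A" * 512
b3 = b"A" * 511 + b"B"     # differs in a single byte

print(block_hash(b1) == block_hash(b2))  # True  -> equal blocks
print(block_hash(b1) == block_hash(b3))  # False -> a 1-byte change breaks equality
```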

There are two major types of block-hashing: sliding block-hashing and aligned block-hashing.

2.3.1 Sliding block-hashing

In sliding block-hashing, each block has the same size, but the blocks are not aligned to each other. This is demonstrated in Figure 2.8. Sliding block-hashing is described in [22] as "the rolling hash", which is basically the same technique.

In that example the block size is 64 bytes and the sliding gap is 32 bytes. This means that each block shares 32 bytes with the previous block. The most intensive search using this method is to set the sliding gap to one byte. Using the block size of 64 as in the figure, the first block is then hashed from offset 0-63, the next from 1-64, then 2-65, and so on until S-64 to S-1, where S is the size of the object.

Using this method can be very effective, especially when we need to locate pieces of information that are not aligned to sectors or blocks in the file system. The method is particularly suitable for locating malware, which follows no such alignment to sectors, blocks or even files.

The downside of this method is the number of block-hashes, which increases by a factor of block size divided by sliding gap. In our example from Figure 2.8 this gives a multiple of 64/32 = 2. A block size of 512 and a sliding gap of 1 byte gives a multiple of 512/1 = 512.
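A small sketch of sliding block-hashing under the assumptions above, with a configurable block size and sliding gap; the function name and the SHA-1 choice are illustrative, not taken from the thesis.

```python
import hashlib

def sliding_block_hashes(data: bytes, block_size: int = 64, gap: int = 32):
    """Hash overlapping windows of `data` starting at offsets 0, gap, 2*gap, ...

    With gap == 1 this hashes every possible window, producing roughly
    block_size / gap times as many hashes as the same data hashed with
    non-overlapping blocks.
    """
    hashes = []
    for offset in range(0, len(data) - block_size + 1, gap):
        window = data[offset:offset + block_size]
        hashes.append((offset, hashlib.sha1(window).hexdigest()))
    return hashes

data = bytes(range(256)) * 4                     # 1,024 bytes of sample data
print(len(sliding_block_hashes(data, 64, 32)))   # 31 hashes with a 32-byte gap
print(len(sliding_block_hashes(data, 64, 1)))    # 961 hashes with a 1-byte gap
```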

2.3.2 Aligned block-hashing

Block-hashing using alignment is the most obvious way of performing block-hashing on a file system, where all data is at minimum aligned to the sector size (512 or 4,096 bytes). The only data that is not aligned in a file system is the data inside objects such as files. In that category, we also define files copied into containers as not aligned to the file system boundary.

Even when operating with alignment to a 512 or 4,096 byte sector size, there is no problem using multiples of these sizes and also fragments of these sizes, such as 512/2^n bytes.

The principle of aligned block-hashing is shown in Figure 2.9, where the block size is 64 bytes.

Figure 2.9: Principle sketch of aligned block-hashing
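A minimal sketch of aligned block-hashing and the comparison against a disk image follows; the 512-byte block size, SHA-1, the function names and the file names are assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 512  # sector-aligned block size; 4,096 is the other common choice

def aligned_block_hashes(path: str, block_size: int = BLOCK_SIZE) -> set:
    """Hash a file in consecutive, aligned blocks of `block_size` bytes."""
    hashes = set()
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if len(block) < block_size:   # ignore a short trailing block
                break
            hashes.add(hashlib.sha1(block).hexdigest())
    return hashes

def find_reference_blocks(image_path: str, reference_hashes: set,
                          block_size: int = BLOCK_SIZE):
    """Yield offsets in a disk image whose aligned blocks match the reference set."""
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            block = img.read(block_size)
            if len(block) < block_size:
                break
            if hashlib.sha1(block).hexdigest() in reference_hashes:
                yield offset
            offset += block_size

# Hypothetical usage:
# refs = aligned_block_hashes("reference_file.bin")
# for hit in find_reference_blocks("disk.img", refs):
#     print(f"reference block found at offset {hit}")
```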

The articles [14, 15] diverge between different block sizes, and entropy is also mentioned, but not in depth. The articles have no focus on determining a bias for the number of blocks/contiguous blocks in conjunction with entropy. In the thesis of Kristina Foster [12] a large amount of work was done on similar challenges, but that thesis was not focused on determining a bias for entropy and the number of hits needed to identify an object. In the two papers [14] and [15] the block size is discussed and the conclusion is rather vague, but it seems like a block size of 4,096 bytes is preferable to 512-byte blocks.

2.3.3 Data Reduction in large datasets

Data reduction is a method to reduce the amount of data, usually in large datasets [25].

Typically, this is a method used in digital forensics to exclude known files from a set of files in a disk image. This is done by hashing the files in a dataset and comparing the hash values with hashes of known files (files from clean operating system installations, clean installations of software, or files matching other criteria that reduce the number of files). Another method to perform data reduction is to remove all files in a dataset with zero bytes (empty files).
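A sketch of this file-level reduction is shown below, assuming a pre-built set of hashes of known files; the helper that loads those hashes is hypothetical.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA-1 of a whole file, read in chunks to handle large files."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def reduce_dataset(files, known_hashes: set):
    """Keep only regular, non-empty files whose hash is not in the known set."""
    remaining = []
    for path in files:
        if not path.is_file():
            continue
        if path.stat().st_size == 0:          # drop empty files
            continue
        if file_hash(path) in known_hashes:   # drop known files
            continue
        remaining.append(path)
    return remaining

# Hypothetical usage:
# known = load_known_hashes("known_files.txt")   # hypothetical helper, e.g. hashes of a clean OS install
# candidates = reduce_dataset(Path("evidence").rglob("*"), known)
```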

This method could also be implemented in block-hashing forensics by reducing the number of hash values in the database. This could typically be hashes from blocks with certain known content, such as blocks filled with equal bytes (blocks containing only 0x00, 0xff, etc.). Data reduction could also involve other criteria related to block-hashing. By eliminating all blocks with a very low entropy, the dataset could easily be reduced by a huge number of records in the database of hash values. A combination of the above examples is also possible. In [25], different theoretical methods to perform data reduction are covered.
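A sketch of such block-level reduction follows, dropping constant-byte blocks and blocks below an assumed entropy threshold; the threshold of 2.0 bits per byte is illustrative and not taken from the thesis.

```python
import hashlib
import math
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a block in bits per byte (0.0 .. 8.0)."""
    counts = Counter(block)
    size = len(block)
    ent = -sum((c / size) * math.log2(c / size) for c in counts.values())
    return ent if ent > 0 else 0.0

def reduced_block_hashes(blocks, min_entropy: float = 2.0) -> set:
    """Hash only blocks that survive the reduction criteria."""
    kept = set()
    for block in blocks:
        if len(set(block)) == 1:                   # constant-byte block (0x00, 0xff, ...)
            continue
        if shannon_entropy(block) < min_entropy:   # low-entropy block
            continue
        kept.add(hashlib.sha1(block).hexdigest())
    return kept

print(shannon_entropy(b"\x00" * 512))               # 0.0 -> always excluded
print(len(reduced_block_hashes([b"\xff" * 512])))   # 0   -> constant block removed
```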

In [1], the method of using hash values to perform data reduction is covered, with a brief introduction to the method and a reference to the NIST collection of hash databases. The NIST NSRL (National Software Reference Library) is a large collection of hashes of known files. None of the data sets cover hash values from pieces of data (block hashes).

Another method to perform block-hash data reduction is to eliminate all hashes from allocated data in a large data set of files on a disk or disk image. Under normal circumstances, we can assume that many files on a disk or disk image have been copied, that the original or the copy has been erased, and that other forms of duplicates may have been erased at some point. Usually, when using the block-hashing technique to discover pieces of a reference file in storage, we primarily search inside the unallocated area [2], described as an area on a disk not used by any file system object, which could contain information from previously existing files or folders.

The allocated area is the opposite of the unallocated area and contains files, folders and other objects registered in the file system. This method of performing data reduction is not found in any reference material, but it is an extension of the commonly used "known files" hash data reduction. None of the papers or the master's thesis in [12, 13, 15] discuss the data reduction topic in any context related to block-hashing.
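One possible way to realise this reduction, assuming block hashes have already been collected separately for the whole image and for the allocated files (for example with the aligned hashing sketch in Section 2.3.2), is plain set subtraction:

```python
def remove_allocated_hashes(all_block_hashes: set, allocated_block_hashes: set) -> set:
    """Keep only block hashes that never occur in allocated data.

    Blocks that also appear in allocated files (copies, duplicates, files
    later erased) are removed, so the remaining hashes are the ones worth
    searching for in the unallocated area.
    """
    return all_block_hashes - allocated_block_hashes

# Hypothetical usage:
# unallocated_candidates = remove_allocated_hashes(image_hashes, allocated_hashes)
```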
