Abstract:
As the volume of potential digital evidence increases, digital forensics investigators are
challenged to find the best allocation of their limited resources. While automation will
continue to partially mitigate this problem, the preliminary question of which media should
be examined, whether by human or machine, remains largely unsolved.
Prior work has established various methods for assessing digital media similarity that
may aid in prioritization decisions. Similarity measures may also be used to establish links
between media and, by extension, links between the individuals or organizations associated
with those media. Existing similarity measures, however, have high computational costs,
which can delay identification of digital media warranting immediate attention or render
link establishment across large collections of data impractical.
In this work, I propose, develop, and validate a methodology for assessing digital media
similarity to assist with digital media triage decisions. The application of my work rests
on the premise that unexamined media is likely to be relevant and interesting to an
investigator if it is similar to other media previously determined to be relevant and
interesting. My methodology builds on prior work using sector hashing and
the Jaccard index similarity measure. I combine these methods in a novel way and
demonstrate the accuracy of my method against a test set of hard disk images with known
ground truth. My method, Jaccard Index with Normalized Frequency (JINF), calculates a
similarity measure between two disk images by normalizing the frequency of their
distinct sectors.
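As a rough illustration of how sector hashing and a Jaccard-style measure can be combined,
the Python sketch below counts distinct sectors by hash, normalizes the counts to
frequencies, and compares two images with a weighted Jaccard ratio. The exact JINF formula
is not stated here, so the weighted ratio, the 512-byte sector size, the SHA-256 hash, and
all function names are assumptions made for illustration only.

```python
import hashlib
from collections import Counter

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def sector_counts(image: bytes, sector_size: int = SECTOR_SIZE) -> Counter:
    """Count how often each distinct sector (identified by its hash) occurs."""
    counts = Counter()
    for offset in range(0, len(image), sector_size):
        sector = image[offset:offset + sector_size]
        counts[hashlib.sha256(sector).digest()] += 1
    return counts


def normalized_frequencies(counts: Counter) -> dict:
    """Convert raw sector counts into frequencies that sum to 1."""
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()} if total else {}


def weighted_jaccard(a: Counter, b: Counter) -> float:
    """Hypothetical stand-in for JINF: sum of minima over sum of maxima
    of the normalized sector frequencies of the two images."""
    fa, fb = normalized_frequencies(a), normalized_frequencies(b)
    keys = fa.keys() | fb.keys()
    numerator = sum(min(fa.get(k, 0.0), fb.get(k, 0.0)) for k in keys)
    denominator = sum(max(fa.get(k, 0.0), fb.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 0.0


def similarity(image_a: bytes, image_b: bytes) -> float:
    """Similarity in [0, 1] between two disk images held in memory."""
    return weighted_jaccard(sector_counts(image_a), sector_counts(image_b))
```

Under these assumptions, identical images score 1.0, images with no sectors in common
score 0.0, and partial overlap falls in between.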
I also developed and tested two extensions to improve performance. The first extension
randomly samples sectors from digital media under examination and applies a modified
JINF method. I demonstrate that the JINF disk similarity measure remains useful with
sampling rates as low as 5%. The second extension takes advantage of parallel processing.
This extension partitions the digital media, distributes the computation across multiple
processors, and then combines the results into an overall similarity measure that preserves
the accuracy of the original single-processor method. In experiments, it reduced
processing time by as much as 51%.
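The two extensions can be sketched in Python as follows. The first function draws a random
sample of sectors before hashing; the second partitions the image, counts sector hashes per
partition concurrently, and merges the partial counts, which gives the same totals as a
single pass. This is only a minimal sketch: the modified JINF calculation applied to
sampled data is not reproduced, the function names and the 512-byte sector size are
assumptions, and a thread pool stands in for the multiple processors to keep the example
self-contained.

```python
import hashlib
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def num_sectors(image: bytes, sector_size: int = SECTOR_SIZE) -> int:
    """Number of sectors in the image, counting a trailing partial sector."""
    return -(-len(image) // sector_size)


def count_range(image: bytes, start: int, stop: int,
                sector_size: int = SECTOR_SIZE) -> Counter:
    """Hash the sectors with index in [start, stop) and count each hash."""
    counts = Counter()
    for i in range(start, stop):
        sector = image[i * sector_size:(i + 1) * sector_size]
        counts[hashlib.sha256(sector).digest()] += 1
    return counts


def sampled_counts(image: bytes, rate: float, seed: int = 0) -> Counter:
    """Count sector hashes over a random sample of sectors (e.g. rate=0.05)."""
    n = num_sectors(image)
    k = max(1, round(n * rate)) if n else 0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    counts = Counter()
    for i in rng.sample(range(n), k):
        counts.update(count_range(image, i, i + 1))
    return counts


def partitioned_counts(image: bytes, workers: int = 4) -> Counter:
    """Split the image into contiguous partitions, count each concurrently,
    and merge the partial counts into one overall result."""
    n = num_sectors(image)
    bounds = [n * w // workers for w in range(workers + 1)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: count_range(image, r[0], r[1]),
                         zip(bounds, bounds[1:]))
        for part in parts:
            total.update(part)
    return total
```

Because the merge is a plain sum of per-partition counts, any similarity measure computed
from the merged counts matches the one computed from a single sequential pass.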
My work goes beyond matching files of interest and file fragments: I assess the
overall similarity of digital media to identify systems that might share applications and
user content, and hence be related, even if some common files of interest are encrypted,
deleted, or otherwise unavailable. In addition to supporting triage decisions, digital media
similarity may be used to infer links and associations between the disparate entities that
own or use the respective digital devices.