A Digital Media Similarity Measure for Triage of Digital Forensic Evidence



Lim, Myeong Lyel



As the volume of potential digital evidence increases, digital forensics investigators are challenged to find the best allocation of their limited resources. While automation will continue to partially mitigate this problem, the preliminary question of which media should be examined by human or machine remains largely unsolved. Prior work has established various methods to assess digital media similarity which may aid in prioritization decisions. Similarity measures may also be used to establish links between media, and by extension, links between the individuals or organizations associated with that media. Existing similarity measures, however, have high computational costs which can delay identification of digital media warranting immediate attention or render link establishment across large collections of data impractical. In this work, I propose, develop, and validate a methodology for assessing digital media similarity to assist with digital media triage decisions. The application of my work is predicated on the idea that unexamined media is likely to be relevant and interesting to an investigator if this unexamined media is similar to other media previously determined to be interesting and relevant. My methodology builds on prior work using sector hashing and the Jaccard index similarity measure. I combine these methods in a novel way and demonstrate the accuracy of my method against a test set of hard disk images with known ground truth. My method is called Jaccard Index with Normalized Frequency (JINF) and calculates the similarity measure between two disk images by normalizing the frequency of the distinct sectors. I also developed and tested two extensions to improve performance. The first extension randomly samples sectors from digital media under examination and applies a modified JINF method. I demonstrate that the JINF disk similarity measure remains useful with sampling rates as low as 5%. The second extension takes advantage of parallel processing. 
This extension partitions the digital media, distributes the computation across multiple processors, and then combines the partial results into an overall similarity measure that preserves the accuracy of the original single-processor method. In experiments, it reduced processing time by as much as 51%. My work goes beyond matching interesting files and file fragments; rather, I assess the overall similarity of digital media to identify systems which might share applications and user content, and hence be related, even if some common files of interest are encrypted, deleted, or otherwise not available. In addition to triage decisions, digital media similarity may be used to infer links and associations between the disparate entities owning or using the respective digital devices.
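
The abstract does not give the exact JINF formula, so the following is only an illustrative sketch of the general idea it describes: hash each sector, normalize the frequency of each distinct sector hash, and compare two images with a Jaccard-style ratio. The 512-byte sector size, the choice of SHA-256, the function names, and the use of a weighted (min/max) Jaccard index are all assumptions made for illustration and may differ from the method developed in the thesis.

```python
import hashlib
from collections import Counter

SECTOR_SIZE = 512  # assumed sector size in bytes (not stated in the abstract)


def sector_hash_frequencies(image: bytes, sector_size: int = SECTOR_SIZE) -> dict:
    """Hash every sector and return each distinct hash's share of all sectors."""
    counts = Counter(
        hashlib.sha256(image[offset:offset + sector_size]).digest()
        for offset in range(0, len(image), sector_size)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {digest: count / total for digest, count in counts.items()}


def weighted_jaccard(freq_a: dict, freq_b: dict) -> float:
    """Weighted Jaccard index: sum of minima over sum of maxima.

    This is an assumed stand-in for JINF, not the thesis's definition.
    """
    keys = freq_a.keys() | freq_b.keys()
    numerator = sum(min(freq_a.get(k, 0.0), freq_b.get(k, 0.0)) for k in keys)
    denominator = sum(max(freq_a.get(k, 0.0), freq_b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 0.0


def image_similarity(image_a: bytes, image_b: bytes) -> float:
    """Similarity in [0, 1] between two in-memory disk images."""
    return weighted_jaccard(
        sector_hash_frequencies(image_a),
        sector_hash_frequencies(image_b),
    )
```

For two toy two-sector images that share exactly one sector, each image assigns a normalized frequency of 0.5 to its two distinct sectors, so the sketch yields 0.5 / 1.5, or one third. Identical images score 1 and images with no sectors in common score 0. A real implementation would stream sectors from disk rather than hold whole images in memory.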



Jaccard index, Link discovery, Sampling, Drive similarity, Sector hash, Parallel computation