Docovery: Toward Generic Automatic Document Recovery

Application crashes and errors that occur while loading a document are among the most visible defects of consumer software. Documents become corrupted in various ways, from storage media failures to incompatibility across applications to malicious modifications, but the root cause is always that the code cannot handle certain types of unusual input data.

Docovery is a novel document recovery technique based on symbolic execution that makes it possible to fix broken documents without any prior knowledge of the file format. Starting from the code path executed by a broken document, Docovery explores alternative paths that avoid the error, and makes small changes to the document in order to force the application to follow one of these alternative paths.
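To make the idea concrete, here is a minimal sketch in Python. It is not Docovery's implementation: where Docovery uses symbolic execution to reason about alternative paths, this toy simply tries every single-byte edit, nearest values first, until the loader accepts the document. The two-byte header format and the names `load`, `loads_ok` and `recover` are hypothetical and chosen only for illustration.

```python
class LoadError(Exception):
    """Raised by the toy loader when it rejects a document."""


def load(doc: bytes) -> bytes:
    """Hypothetical loader: byte 0 is a magic value, byte 1 the payload length."""
    if len(doc) < 2:
        raise LoadError("truncated header")
    if doc[0] != 0x7F:
        raise LoadError("bad magic")
    length = doc[1]
    if length > len(doc) - 2:
        raise LoadError("length field exceeds payload")
    return doc[2:2 + length]


def loads_ok(doc: bytes) -> bool:
    """Return True if the loader follows a non-error path on this document."""
    try:
        load(doc)
        return True
    except LoadError:
        return False


def recover(doc: bytes):
    """Look for a one-byte edit that steers load() onto a non-error path.

    Brute-force stand-in (an assumption of this sketch) for the constraint
    solving a symbolic execution engine would do. Values closest to the
    original byte are tried first, so the change stays small.
    """
    if loads_ok(doc):
        return doc
    for pos in range(len(doc)):
        original = doc[pos]
        for val in sorted(range(256), key=lambda v: abs(v - original)):
            candidate = doc[:pos] + bytes([val]) + doc[pos + 1:]
            if loads_ok(candidate):
                return candidate
    return None  # no single-byte fix exists


# A document whose length field (200) exceeds its 3-byte payload.
broken = bytes([0x7F, 200]) + b"ABC"
fixed = recover(broken)
```

In this sketch the search needs to see only the loader's accept or reject behaviour, not a description of the format, which mirrors the claim that no prior knowledge of the file format is required. Enumeration does not scale beyond toy inputs, which is where constraints gathered along the executed path come in.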

We implemented our approach in a prototype tool based on the modern symbolic execution engine KLEE. A preliminary case study shows that Docovery can successfully recover broken documents processed by several popular applications, such as the e-mail client pine, the pagination tool pr, and the binary file utilities dwarfdump and readelf.

Tomasz Kuchta is a second-year PhD student in the Department of Computing at Imperial College London, supervised by Dr Cristian Cadar. He received a Master’s degree in Computer Science from Cracow University of Technology and then worked in the telecommunications industry as a software engineer. In his PhD research, Tomasz works on a technique for fixing corrupt documents that either crash the application or fail to load.