
Overview

Application crashes and errors that occur while loading a document are among the most visible defects of consumer software. Documents become corrupted in various ways (from storage media failures to incompatibility across applications to malicious modifications), but the root cause is always that the code cannot handle certain types of unusual input data.

Docovery is a novel document recovery technique based on symbolic execution that makes it possible to fix broken documents without any prior knowledge of the file format. Starting from the code path executed by a broken document, Docovery explores alternative paths that avoid the error, and makes small changes to the document in order to force the application to follow one of these alternative paths.
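To make the idea concrete, here is a minimal sketch of "change the document as little as possible so that loading takes a non-failing path". It is not how Docovery works internally: the toy file format, `load`, `recover` and `ParseError` are all hypothetical names invented for this illustration, and a brute-force search over byte substitutions stands in for symbolic execution, which would instead collect the constraints along the failing path and solve for inputs that follow an alternative one.

```python
from itertools import combinations, product


class ParseError(Exception):
    """Raised by the toy loader when a document cannot be processed."""


def load(doc: bytes) -> bytes:
    # Hypothetical toy format: one length byte, then exactly that many
    # payload bytes. Any mismatch is treated as a "broken document".
    if not doc:
        raise ParseError("empty document")
    length, payload = doc[0], doc[1:]
    if length != len(payload):
        raise ParseError("length field does not match payload size")
    return payload


def recover(doc: bytes, max_changes: int = 2):
    """Return a loadable variant of doc that differs in as few bytes as
    possible (up to max_changes), or None if no such variant is found.

    Brute-force stand-in for path exploration: candidates are tried in
    order of increasing number of modified bytes, so the first one that
    loads is a minimal-change fix.
    """
    for k in range(max_changes + 1):
        for positions in combinations(range(len(doc)), k):
            for values in product(range(256), repeat=k):
                candidate = bytearray(doc)
                for pos, val in zip(positions, values):
                    candidate[pos] = val
                try:
                    load(bytes(candidate))
                except ParseError:
                    continue  # this candidate still hits the error path
                return bytes(candidate)
    return None


broken = b"\x09hello"      # length field says 9, payload has 5 bytes
fixed = recover(broken)    # only the length byte needs to change
```

In this example the search repairs the document by rewriting the single length byte and leaves the payload untouched. Exhaustive search becomes infeasible very quickly as documents grow, which is the kind of cost a constraint-based approach is meant to avoid.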

We implemented our approach in a prototype tool based on the modern symbolic execution engine KLEE. We performed a preliminary case study, which shows that Docovery can successfully recover broken documents processed by several popular applications such as the e-mail client pine, the pagination tool pr and the binary file utilities dwarfdump and readelf.

This is a joint project with Dr Miguel Castro and Dr Manuel Costa of Microsoft Research.

Docovery VM

Docovery is now available in binary form as a downloadable VM. Please follow this link for more details about the VM and our benchmarks.

Broken documents wanted!

Like every research project, we need good real-world benchmarks. If you have a document, an image, a video or a sound file that doesn’t load or crashes your favourite application, please consider uploading it using this website.

Research Support

This research is generously supported by Microsoft Research through its PhD Scholarship Programme and by EPSRC through an Early-Career Fellowship.

Publications

  • Docovery: Toward Generic Automatic Document Recovery

    Tomasz Kuchta, Cristian Cadar, Miguel Castro, Manuel Costa

    International Conference on Automated Software Engineering (ASE 2014)