Docovery: Toward Generic Automatic Document Recovery

Abstract

Application crashes and errors that occur while loading a document are one of the most visible defects of consumer software. While documents become corrupted in various ways—from storage media failures to incompatibility across applications to malicious modifications—the underlying reason they fail to load in a certain application is that their contents cause the application logic to exercise an uncommon execution path which the software was not designed to handle, or which was not properly tested.

We present Docovery, a novel document recovery technique based on symbolic execution that makes it possible to fix broken documents without any prior knowledge of the file format. Starting from the code path executed when opening a broken document, Docovery explores alternative paths that avoid the error, and makes small changes to the document in order to force the application to follow one of these alternative paths.

We implemented our approach in a prototype tool based on the symbolic execution engine KLEE. We present a preliminary case study, which shows that Docovery can successfully recover broken documents processed by several popular applications such as the e-mail client pine, the pagination tool pr and the binary file utilities dwarfdump and readelf.