A study of the correctness of PDF documents and PDF readers, based on clustering the errors emitted by readers and detecting visual inconsistencies.

Overview

Electronic documents are widely used to store and share information such as bank statements, contracts, articles, maps and tax information. Many different applications exist for displaying a given electronic document, and users rightfully assume that a document will be rendered the same way regardless of the application used. However, this is not always the case, and these inconsistencies, whether caused by bugs in the application or in the document file, can become critical sources of miscommunication.

We present a study on the correctness of PDF documents and readers. We start by manually investigating a large number of real-world PDF documents to understand the frequency and characteristics of cross-reader inconsistencies, and find that such inconsistencies are common: 13.5% of PDF files are rendered inconsistently by at least one popular reader. We then propose an approach to detect and localize the source of such inconsistencies automatically.

We evaluate our automatic approach on a large corpus of over 230K documents using 11 popular readers. Our experiments detected 30 unique bugs in these readers and files, some of which have already been confirmed or fixed by developers.
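
To make the idea of cross-reader visual inconsistency concrete, the following is a minimal sketch and not the implementation used in the study. It assumes each reader's rendering of a page is already available as a small grayscale pixel grid, and it flags a reader whose output differs from the majority of the other readers. The function names, the threshold value and the reader names are all hypothetical and chosen only for illustration.

```python
def pixel_distance(a, b):
    """Mean absolute difference between two grayscale images (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 255.0  # assumption: mismatched dimensions count as maximally different
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count if count else 0.0


def inconsistent_readers(renderings, threshold=10.0):
    """Return the readers whose rendering differs from most of the other readers.

    renderings: dict mapping a (hypothetical) reader name to its pixel grid.
    threshold:  illustrative cut-off on the mean absolute pixel difference.
    """
    names = sorted(renderings)
    flagged = []
    for name in names:
        others = [o for o in names if o != name]
        # Count how many other readers disagree with this one.
        disagreements = sum(
            1 for o in others
            if pixel_distance(renderings[name], renderings[o]) > threshold
        )
        # Flag the reader only if it disagrees with a strict majority.
        if others and disagreements > len(others) / 2:
            flagged.append(name)
    return flagged


# Hypothetical example: three readers agree, one renders the page inverted.
page = [[0, 0], [255, 255]]
inverted = [[255, 255], [0, 0]]
renderings = {
    "reader_a": page,
    "reader_b": page,
    "reader_c": inverted,
    "reader_d": page,
}
print(inconsistent_readers(renderings))  # → ['reader_c']
```

A majority rule is used here so that a single outlier does not cause every other reader to be flagged as well. A real pipeline would need to tolerate benign differences such as anti-aliasing and font hinting, which a plain pixel difference does not.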

Results

Project results are available under this link.

Research Support

This research is generously supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), by Microsoft Research through its PhD Scholarship Programme, and by EPSRC through an Early-Career Fellowship.