Artefact for Combining Static Analysis Error Traces with Dynamic Symbolic Execution (Experience Paper) -Software Reliability Group

This artefact provides a Docker image that contains:

source code and binaries of our instrumentation/bug injection tools and KLEE extension
all benchmarks with their bitcode files
scripts to reproduce our experiments
static analysis reports for investigated real-world applications

Software Versions

KLEE: based on git commit #04f5031c
LLVM/Clang Static Analyzer (CSA): 11.0.1
Infer: 1.0.0
Z3: 4.8.8
Ubuntu: 18.04 for Docker image and host machines

Setup

Install Docker based on your system configuration: Get Docker.
Download the artefact: kleesa_artefact.tgz (Version 1.1 - 26.07.2022).
Import the artefact into Docker: docker load --input kleesa_artefact.tgz

Run the Docker image:

docker run -it --ulimit nofile=1000000:1000000 klee-sa/artefact /bin/bash

In our cluster setup we use a less restricted Docker environment by adding --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined. We expect that the experiments have a similar outcome without these flags. Alternatives like Podman should work as well but were not tested by us.

Midnight commander (mc), nano and tmux are pre-installed, as well as a complete build setup for all tools and benchmarks. Other packages can be installed as usual via apt update && apt install <package>.

Getting Started

The aim of our paper is to confirm true positive reports from off-the-shelf static analysers (SA) with dynamic symbolic execution (DSE) by guiding a DSE engine along the generated SA traces.

We give a short overview of how our tooling combines Clang Static Analyzer with the KLEE symbolic execution engine, how to use Infer to generate traces, and finally how bugs can be injected into arbitrary C code. A more detailed description of the tools can be found in the following sections.

The instrumentation/injection tools are research tools and after our initial negative results we did not develop them further to standalone tools. Expect them to fail whenever the use-case differs from our evaluation.

Clang Static Analyzer (CSA) (Figure 1)

First, we apply Clang Static Analyzer to the source code from Figure 1a:

$ cd /project/example
$ cp figure1.c figure1_original.c # create backup
$ scan-build -o csa_report clang figure1.c -o figure1 # run CSA

Its report can be read in a browser (replace fields in <> with actual values) and should contain the use-after-free bug from Figure 1b.

$ elinks csa_report/<report>/report-<tag>.html # exit with q

Afterwards, we convert that report into a simpler format (uaf.csa) and instrument the original program:

$ html_parser.py csa_report/<report>/report-<id>.html > uaf.csa
$ instrumentation figure1.c uaf.csa 5 -- -I/project/root/lib/clang/11.0.0/include # instrument source file

figure1.c now contains calls to instrumentation functions (INSTR_LINE_*) which are defined in inst.c (Figure 1c).

$ less -SN figure1.c
$ less -SN inst.c

To run programs with the symbolic execution engine, in our case KLEE, they have to be compiled to LLVM bitcode (.bc):

$ clang -emit-llvm -fno-discard-value-names -Wall -Wextra -g -O1 -Xclang -disable-llvm-passes -D__NO_STRING_INLINES  -D_FORTIFY_SOURCE=0 -U__OPTIMIZE__ -c figure1_original.c # original
$ clang -emit-llvm -fno-discard-value-names -Wall -Wextra -g -O1 -Xclang -disable-llvm-passes -D__NO_STRING_INLINES  -D_FORTIFY_SOURCE=0 -U__OPTIMIZE__ -I /project/root-top/include -c figure1.c # instrumented

First run the original program:

$ klee --exit-on-error-location=figure1_original.c:47 --only-output-states-covering-new --posix-runtime --libc=uclibc figure1_original.bc --sym-stdin 8
...
KLEE: ERROR: figure1_original.c:47: memory error: out of bound pointer (asm: 733)
KLEE: NOTE: now ignoring this error at this location
KLEE: halting execution, dumping remaining states                         

KLEE: done: total instructions = 3803911
KLEE: done: completed paths = 7175
KLEE: done: generated tests = 15

Then run the instrumented code together with our targeted searcher. This combination reduces the number of instructions and hence the run-time significantly:

$ klee --search=step-over-minnextdistpath --exit-on-error-location=figure1.c:49 --only-output-states-covering-new --posix-runtime --libc=uclibc figure1.bc --sym-stdin 8
...
KLEE: ERROR: figure1.c:49: memory error: out of bound pointer (asm: 8849)
KLEE: NOTE: now ignoring this error at this location
KLEE: halting execution, dumping remaining states

KLEE: done: total instructions = 57784
KLEE: done: completed paths = 43
KLEE: done: generated tests = 7

Keep in mind there is non-determinism in KLEE and the numbers above might vary slightly. KLEE’s statistics for the two runs can be displayed with:

$ klee-stats --print-all klee-out-0 klee-out-1 | less -S
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   Path   |  Instrs|  Time(s)|  TUser(s)|  ICov(%)|  BCov(%)|  ICount|  TSolver(s)|  TSolver(%)|  States|  AvgStates|  Mem(MB)|  Queries|  AvgQC|  Tcex(%)|  Tfork(%)|  TResolve(%)|  QCexCMisses|  QCexCHits|  KStates|  NumAssume|  MaxAssume|  MaxReachedAssume|  ReachedAssume|  SkippedAssume|  ExitOnLocation|  SearcherTime|  CoveredInstructions|  FinalAssumeInExitHandler|  FinalAssumeStates|  FitNegFail|  FitNegSAT|  FitNegUNSAT|  FullBranches|  NumBranches|  NumQueryConstructs|  PartialBranches|  QueryTime|  RelCexCacheTime|  RelForkTime|  RelResolveTime|  RelSearcherTime|  RelUserTime|  UncoveredInstructions|  UnreachableAssumeStates|  avgMem|  maxMem|  maxState|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|klee-out-0| 3803911|    42.17|     13.79|    23.80|    14.70|   13204|       28.86|       68.44|       0|    2960.12|    35.65|     1115| 107.00|    27.22|      0.77|         0.23|         1115|       5146|        0|          0|          0|                 0|              0|              0|               1|          0.19|                 3143|                         0|                  0|           0|          0|            0|            60|          891|              119470|              142|      26.97|            64.54|         1.83|            0.55|             0.45|        32.70|                  10061|                        0|  120.02|     217|      6606|
|klee-out-1|   57784|     2.22|      1.01|    23.81|    14.09|   13294|        2.00|       89.89|       0|       8.50|    35.89|      143|  63.00|     1.97|      0.01|         0.00|          143|        329|        0|         11|         10|                10|            102|              0|               1|          0.00|                 3165|                         0|                  0|           0|          0|            0|            55|          891|                9014|              141|       1.96|            88.50|         0.32|            0.16|             0.02|        45.34|                  10129|                        0|   35.86|      37|        21|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Total (2) | 3861695|    44.39|     14.80|    23.81|    14.39|   26498|       30.86|       79.17|       0|    1484.31|    71.54|     1258|  85.00|    14.59|      0.39|         0.12|         1258|       5475|        0|         11|         10|                10|            102|              0|               2|          0.19|                 6308|                         0|                  0|           0|          0|            0|           115|         1782|              128484|              283|      28.93|           153.04|         2.15|            0.71|             0.48|        78.04|                  20190|                        0|  155.88|     254|      6627|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Infer (Figure 2)

During our experiments we observed that Infer sometimes generates infeasible traces. It is demonstrated by an example in the paper. In this program, Infer’s trace iterates the loop once (i = 1, x is even at the end), enters the loop a second time (i = 2, x = x + 2) but then claims that x % 2 == 0 is false, which is a contradiction as x is even. You can generate and inspect the trace with:

$ rm -rf infer-out
$ cat inconsistent.c # show source code
$ infer run --no-filtering --biabduction -- gcc inconsistent.c -o inconsistent
$ jq '.[] | select(.bug_type == "BIABDUCTION_MEMORY_LEAK")' infer-out/report.json | less -S

Parsing Infer reports is a little more complex and we refer to the container-driver.sh script in /tmp. On a high level, you have to extract the relevant report from infer-out/report.json with jq and then use our JSON-parser as seen in the script (alias json_parser works in the terminal). The json parser currently only works with injected bugs and does not support traces with multiple parts/sections (e.g. figure1.c above). For our evaluation and the generated traces this functionality was not needed.

Bug Injection

The Docker container provides two programs to inject bugs into C code: null-deref and use-after-free. Given an arbitrary C program:

$ cd /project/example
$ nl -b a injection.c
     1  #include<stdio.h>
     2  #include<stdlib.h>
     3
     4  int main(int argc, char *argv[]) {
     5          int i = argc + 5;
     6
     7          if (i == 8) {
     8                  --i;
     9                  puts("hello world");
    10          } else {
    11                  ++i;
    12          }
    13
    14          return i;
    15  }

a bug, here null dereference, can be injected at a specific non-empty line (11):

$ cp injection.c injection_original.c # create backup
$ null-deref injection.c 11 -- -I/project/root/lib/clang/11.0.0/include # inject bug
$ nl -b a injection.c
     1  #include<stdio.h>
     2  #include<stdlib.h>
     3
     4  int wtmp = 1, *xtmp = &wtmp;
     5  int main(int argc, char *argv[]) {
     6          int i = argc + 5;
     7
     8          if (i == 8) {
     9                  --i;
    10                  puts("hello world");
    11          } else {
    12                  {xtmp = NULL; printf("%d", *xtmp);} ++i;
    13          }
    14
    15          return i;
    16  }

or a sequence of lines (5, 11):

$ cp injection_original.c injection.c # restore backup
$ null-deref injection.c 5 11 -- -I/project/root/lib/clang/11.0.0/include
$ nl -b a injection.c
#include<stdio.h>
#include<stdlib.h>

int wtmp = 1, *xtmp = &wtmp;
int main(int argc, char *argv[]) {
        xtmp = NULL; int i = argc + 5;

        if (i == 8) {
                --i;
                puts("hello world");
        } else {
                {printf("%d", *xtmp);} ++i;
        }

        return i;
}

Detailed Description

Instrumenting Benchmarks (§2.1 and §3.2)

A static trace generated by an analyser provides a subset of events of an execution path that leads to a reported bug. We parse traces to extract relevant information such as the number of steps in the path and their corresponding locations in the source code along with the message associated with each step. The following instructions instrument a source file, provided CSA or Infer generates a trace for it:

$ instrumentation <src_file> <trace> 5

The ominous 5 at the end selects the instrumentation mode. We only use that one in the paper.

For more complex programs compilation flags can be forwarded after --. For instance, Coreutils require additional include paths:

$ cd /project/src/coreutils-8.31/src
$ instrumentation <src_file> <trace> 5 -- -I./lib -I./obj-llvm/lib -I./obj-llvm/src -I/project/root/lib/clang/11.0.0/include

There could be multiple steps in the trace associated with a given location in the source file:

Line no.	Step no.	Message
28	2	Loop condition is true
28	6	Loop condition is false

For example, line number 28 is an entry point of a while-loop. (We show only the steps associated with line number 28 in the source code.) Step 2 of the trace indicates that the condition of the while-loop holds and the control should flow into the body of the loop, whereas step 6 indicates the termination of the loop.

The source code is instrumented as follows for the above two steps of the trace:

Original code snippet	Instrumented code snippet	inst.c
28: while(cond) 29: { 30: ... 67: }	01: #include "inst.c" ... 30: while(INSTR_LINE_28(cond)) 31: { 33: ... 69: }	bool INSTR_LINE_28(bool COND) { assume_sa(2, COND); assume_sa(6, !COND); return COND; }

The traces associated with other control-flow elements like if-then-else and switch are handled in a similar fashion.

Instrumentation for a step with a message type concretizing a variable:

Line no.	Step no.	Message
46	4	Assuming `x` is equal to 0

is performed below:

Original code	Instrumented code	inst.c
46: x = getCode();	01: #include "inst.c" ... 48: x = getCode(); INSTR_LINE_46(x);	void INSTR_LINE_46(int X) { assume_sa(4, X == 0); }

Fault Injection (§5.2)

As described in the paper, we struggled to find real-world bugs with CSA and Infer and decided to inject artificial bugs for our evaluation. In order to inject a multi-event null dereference bug, specify the source file in which you want to inject the bug and the sequence of locations at which you would like to add the events for the bug. The below instruction modifies the source file by injecting a null dereference bug.

$ null-deref <src_file> <sequence_of_locations>

Similarly, you can inject a use-after-free bug:

$ use-after-free <src_file> <sequence_of_locations>

It is possible to forward compilation flags as well and we do this in the Coreutils experiment to prevent build errors. Depending on the application, an invocation example is:

/project/src/coreutils-8.31/src> null-deref <src_file> <sequence_of_locations> -- -I./lib -I./obj-llvm/lib -I./obj-llvm/src -I/project/root/lib/clang/11.0.0/include

Benchmark Selection (§5.1)

We chose 10 applications from GNU Coreutils 8.31 for fault injection experiments: comm, csplit, cut, env, join, ln, nl, od, split, and uniq.

The inclusion criteria for these experiments are based on the following factors:

No interference with the test setup.
KLEE supports all LLVM intrinsics and throws no assertion in the Z3 frontend.
KLEE produces reasonably deterministic results.
KLEE covers new code in the main source file after 10 mins (this is important for non-trivial fault injection).

Our initial coverage runs are in /project/coverage_results. There is a .csv file for each application (and a second one in cov1/ for the first run). kill_60.csv lists excluded tools and the reason for exclusion. The format for the coverage logs is:

elapsed time (µs), assembly line, source file, source line, no. of executed instructions

Injecting Faults and Instrumenting Coreutils (§5.2)

The injection/instrumentation script is located in /tmp. To re-run the (time-consuming) instrumentation process, (re)move the existing instrumentation directory and run the script to inject bugs at lines that were covered by KLEE in 10-120 minutes:

$ cd /tmp
$ mv /project/instrumentation /project/instrumentation_original
$ ./container-driver.sh

The APPS and BUG_TYPE variables in the script header can be modified to only instrument a subset of applications and bug classes. We provide this script only for the sake of completeness. This step is not required to re-run our evaluation, as the Docker image already contains all benchmarks.

Bugs were injected at locations that are hard to reach for KLEE without any static guidance. We used the coverage information of KLEE for our selected Coreutils (see above section) and randomly selected a number of program statements for fault injection provided KLEE took more than 10 minutes to cover that program statement for the first time.

This allowed us to inject 297 1-event bugs, 632 2-event, 478 3-event, and 357 4-event bugs across all 10 Coreutils. The number of distinct 1-event bugs is lower than for other counts because a single statement may be reachable by multiple execution paths, catering for numerous multi-event bugs.

All injected bugs for Coreutils 8.31 can be found in the sub-directories /project/null-dereference-bugs and /project/use-after-free-bugs. For instance, a 4-step null-dereference bug in lines 284, 194, 385, 243 for comm is located in /project/null-dereference-bugs/comm/comm-4/comm-284-194-385-243/fault/comm.c. Simply replace the original file in the Coreutils repository with this file and run the analysis phase of a static analyser. The results highly depend on the version of the analyser, its command line flags and if the program under test is built in-tree or out-of-tree. A large percentage of 1-event and 2-event bugs were found by CSA and Infer but only a few 3- and 4-event bugs were found by the analysers. For the Clang Static Analyzer we noticed that an under-approximation of the error() function lead to missed bugs.

For the KLEE runs in §5.3 we manually selected 55 distinct bugs across all the 10 Coreutils based on the traces generated by CSA and Infer. The details about selected bugs can be found in /project/instrumentation-info.

Targeted KLEE

We extended KLEE in several ways and list the most relevant flags and statistics below. Be aware, that in our implementation the function assume_sa from the paper is named klee_assume_sa.

Command-Line Flags

--search=step-over-minnextdistpath: enable the targeted searcher (we use it in combination with -search=random-path)
-skip-step-when-unreachable: skip intermediate steps that become unreachable (enabled in all experiment runs)
-target-only-last-step: target only the klee_assume_sa call with the highest step number (TargetLast in paper)
--assume-sa-mode=(noop|fit|strict): strategies to handle klee_assume_sa calls, we use different names in the paper (noop=ignore, fit=try, strict=require)
--assume-sa-order=(sequential|max): keep step order or continue with max. reached step, we only use sequential in our experiments
--exit-on-early-termination: terminate KLEE when it starts terminating states due to memory pressure (used in initial runs to determine the maximum number of states for threshold)
--exit-on-error-location=<location>: exit when error location reached, location is a file name or a file name and line number separated by a colon (e.g. uniq.c:653)
--test-fit-condition: determine if the negated constraint in klee_assume_sa is feasible (see statistics below)
--state-limit=<n>: set a limit of n active states (checked every 65k instructions)
--uncovered-update-after-instructions=<n> re-calculate coverage information after n instructions (to improve determinism across runs)

Statistics

Relevant new statistics are:

ExitOnLocation: number of states terminated at error location (means bug found)
MaxAssume: largest step number in klee_assume_sa call in bitcode
MaxReachedAssume: largest step number reached in klee_assume_sa call
NumAssume: number of klee_assume_sa call sites
ReachedAssume: number of executed klee_assume_sa calls
SkippedAssume: number of klee_assume_sa calls that were unreachable and hence skipped
UnreachableAssumeStates: number of states terminated due to unreachable next step

When condition checks (--test-fit-condition) for klee_assume_sa calls are enabled, the following statistics are populated:

FitNegFail: solver failure during test
FitNegSAT: negated condition satisfiable
FitNegUNSAT: negated condition unsatisfiable

Evaluation (§5.3)

The raw data for Figure 6 can be found in /project/results/figure_6.csv and /project/results/figure_6_targetlast.csv. We repeated all runs 5 times and each experiment is named

<app>_<step>(-<step>)*_(ignore|require|try)_(default|targeted)_(Infer|CSA)_<run>(_targetlast)?

in the Path column.

For instance, ln_194-351_ignore_targeted_CSA_0_targetlast describes the first (0) run of ln with steps in lines 194, 351, ignore strategy, and the targeted searcher which in this case only targets the last step (targetlast). The trace from CSA was used for this experiment.

For Section §5.3 we did an additional run, summarised in /project/results/fitneg.csv, to evaluate the percentage of constraints in klee_assume_sa that were already implied: in 875,886 cases the negated constraint has been infeasible (FitNegUNSAT) and only in 192 cases (FitNegSAT) the negated constraint could have been added as well with the given path constraints. That means in 99.98% of the cases the provided constraint did not (significantly) restrict the search space further.

Reproducing Results (§5.3)

We provide two scripts that can be used to reproduce our results:

/tmp/run-exp.sh: Runs 55 experiments with five different seeds to initialise KLEE’s random number generator, three strategies to handle klee_assume_sa, and both searchers. On our machines all 1650 runs finished in 812 machine hours and required 13GB of disk space.
/tmp/run-exp-targetlast.sh: Runs 55 experiments with five seeds, the ignore strategy and the targeted searcher that in this case only targets the last step. On our machines all 275 runs finished in 45 machine hours and required 3GB of disk space.

KLEE’s output directories will be written to /tmp/results. The directory names contain the tool name, relevant source lines, klee_assume_sa strategy, searcher, static analyser used to generate the trace, the seed id (the counter for the experiment repetitions), and if the searcher only targeted the last step, e.g. uniq_499-513-538-654_noop_targeted_CSA_4_targetonlylast. KLEE’s klee-stats tool can be used to generate a summary of all runs:

/tmp/results> klee-stats --print-all --table-format=csv . > summary.csv

or inspect the results in the terminal:

/tmp/results> klee-stats --print-all . | less -S

Source Code

The source code of our KLEE version is located in /project/src/klee-src-top. It can be modified and then easily rebuild from /project/src/klee-build-top. For the sake of brevity, we omit a description of all modifications and point to the two core components:

The targeted searcher is internally named StepOverMinNextPathSearcher and can be found in lib/Core/Searcher.cpp.
klee_assume_sa calls are handled in lib/Core/SpecialFunctionHandler.cpp by the handleAssumeSA method.

Static Analysis Traces (Table 3)

For Table 3 we used Clang Static Analyzer and Infer with their default flags. We provide all reports in:

/project/sa-reports/csa: reports are html files and can be inspected with elinks
/project/sa-reports/infer: report summaries as json and txt files

Unfortunately, the build directories and esp. Infer’s SQLite databases require several gigabytes of disk space and we refrained from adding them to the Docker image. To reproduce our results, the applications can be downloaded from the respective project pages and build with CSA or Infer. We used the following configure flags:

Application	Configure flags
apr-1.5.2	`--disable-threads --disable-posix-shm --disable-dso --disable-lfs`
flex-2.5.39	`--disable-nls`
gawk-4.1.2	`--disable-mpfr --disable-largefile --disable-nls --disable-extensions`
bc-1.06
binutils-2.25.1	`--disable-nls --disable-largefile --disable-gdb --disable-sim --disable-readline --disable-libdecnumber --disable-libquadmath --disable-libstdcxx --disable-ld --disable-gprof --disable-intl --disable-etc`
combine-0.4.0	`--disable-nls --disable-threads`
coreutils-8.24	`--disable-nls --without-gmp --disable-threads --disable-acl --disable-libsmack --disable-xattr --disable-libcap --without-selinux`
datamash-1.0.6	`--disable-nls`
diffutils-3.3	`--disable-nls --disable-largefile`
findutils-4.4.2	`--disable-nls --disable-largefile --disable-threads`
grep-2.21	`--disable-nls --disable-largefile --disable-threads`
gzip-1.6	`--disable-threads --disable-largefile`
libtasn1-4.5	`--disable-doc`
m4-1.4.17	`--disable-threads --with-included-regex --disable-largefile`
make-4.1	`--disable-nls --disable-largefile --disable-job-server --disable-load`
libosip2-4.1.0
sed-4.2	`--disable-largefile --disable-threads --disable-acl --disable-nls --disable-i18n --without-selinux -disable-threadsafe`
trueprint-5.4
ImageMagick6-6.9.4-8	`--enable-shared=no -disable-shared --disable-openmp --without-threads --without-magick-plus-plus --disable-hdri --without-zstd --without-bzlib --without-jbig --disable-cipher --without-fftw --without-flif --without-fpx --without-djvu --without-fontconfig --without-freetype --without-raqm --without-lcms --without-openjp2 --without-lqr --without-openexr --without-pango --without-webp --without-xml`
jasper-1.900.1
jpeg-9a
libtiff-Release-v3-9-7	`--disable-largefile --disable-zlib --disable-jpeg --disable-jbig --disable-lzma --disable-zstd --disable-webp --disable-cxx`
libxml2-2.9.2	`--without-catalog --without-debug --without-docbook --without-ftp --without-http --without-iconv --without-legacy --without-output --without-python --without-threads --without-modules --without-zlib --without-lzma`
tcpdump-4.7.4	`--without-sandbox-capsicum --without-crypto --without-cap-ng --without-smi`
vorbistools-1.4.0	`--disable-nls --disable-largefile --disable-threads --disable-shared --disable-ogg123 --without-flac --without-speex --without-kate --without-curl --disable-curltest --without-curl --disable-oggtest --disable-vorbistest --disable-curltest`