docker load --input moklee_artefact.tgz
docker run -it --ulimit nofile=1000000:1000000 moklee /bin/bash
This should work in most scenarios. However, in our setup we use:
docker run -it --ulimit nofile=1000000:1000000 --privileged --security-opt seccomp=unconfined --security-opt apparmor=unconfined moklee /bin/bash
Be aware that the --security-opt options and --privileged remove some security restrictions and allow access to /dev on the host!
MoKlee, our memoization framework, is based on KLEE, a symbolic execution engine for C and C++ programs. It builds on KLEE version 1.4 with multiple changes back-ported from version 2.0. For simplicity, we provide MoKlee as a pre-compiled binary; however, we will propose our changes for incorporation into upstream KLEE.
MoKlee is compiled and linked against the following components:
Cap’n Proto | 0.6.1  |
CVC4        | 1.7    |
LLVM        | 3.8.1  |
Snappy      | 1.1.7  |
SQLite      | 3.29.0 |
STP         | 2.3.3  |
TCMalloc    | 2.7    |
Z3          | 4.8.4  |
Although this document should be self-contained, additional resources can be found on the KLEE website.
We placed all of the relevant information in the /project directory:
- evaluation/: our experimental data and the scripts to compute the figures from the paper
- experiments/: the scripts to reproduce our experiments
- binutils/, coreutils/, diffutils/, findutils/, grep/, libspng/, tcpdump/: the benchmark applications with their versions (/project/VERSIONS and config), patches (patches) and bitcode files (bc/)
- the driver.c for libspng
To keep the Docker image small, we unfortunately do not include Clang or wllvm.
If you want to run your own benchmarks, you need to compile them with Clang 3.8 into bitcode files and use docker cp to copy them into the container.
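For example, a hypothetical single-file benchmark could be prepared like this (mytool.c and <container> are placeholders; for multi-file projects you would normally use wllvm):

clang-3.8 -emit-llvm -c -g mytool.c -o mytool.bc   # on the host, produces LLVM bitcode
docker cp mytool.bc <container>:/project/mytool.bc # container name/ID as shown by `docker ps`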
MoKlee modifies KLEE in numerous ways. One notable change is the -max-depth handling (KLEE continues branching for some branch types).
To execute MoKlee, start the Docker image as described in step 4 of the setup section. Inside the container, you can start MoKlee:
klee [MoKlee arguments] [application] [application arguments]
where the application must be an LLVM bitcode file (.bc).
As an example, run the following inside the container:
klee -posix-runtime -libc=uclibc /project/coreutils/bc/echo.bc "Hello ISSTA!"
An output similar to the following should be displayed:
KLEE: load program '/project/coreutils/bc/echo.bc'...
...
KLEE: output directory is "/project/coreutils/bc/klee-out-0"
KLEE: Using STP solver backend
...
Hello ISSTA!
...
KLEE: done: total instructions = 127616
KLEE: done: skipped instructions = 0
KLEE: done: completed paths = 1
KLEE: done: generated tests = 0
As part of the output, the destination directory for the results of this run is printed:
KLEE: output directory is "/project/coreutils/bc/klee-out-0"
Use klee-stats with one or more result directories to display analysis results.
For our example use the following command:
klee-stats /project/coreutils/bc/klee-out-0
The output should look similar to:
-----------------------------------------------------------------------------------------------------------------------------------
|Path                            | Instrs| Time(s)| ICov(%)| BCov(%)| ICount| TSolverChain(%)| QueryTime(%)| CachesTime(%)|
-----------------------------------------------------------------------------------------------------------------------------------
|/project/coreutils/bc/klee-out-0| 127616|    0.42|   16.72|  100.00|  20572|            0.68|         0.67|          0.01|
-----------------------------------------------------------------------------------------------------------------------------------
For more details, use the additional --print-all flag; to avoid line-wrapping, pipe the output through less -S.
klee-stats --print-all /project/coreutils/bc/klee-out-0 | less -S
To convert the tabular data into comma-separated values, you can use the stats2csv alias:
klee-stats --print-all /project/coreutils/bc/klee-out-0 | stats2csv > echo.csv
The Docker image includes csvkit. We use it heavily in our scripts, but it may also help you analyse .csv files with SQL queries, for instance:
csvsql --query "SELECT Path,[Time(s)] FROM echo WHERE Instrs > 1000" echo.csv
You may notice that MoKlee tracks much more information than KLEE; the most relevant statistics are explained where they are used in the evaluation sections below.
Of course, the most interesting feature of MoKlee is its memoization capability.
In the following example, we run grep with symbolic input (--sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout) for 1 minute (--max-time) with depth-first search (DFS) exploration (-search), store the execution tree to disk (-write-ptree), enable divergence detection (-hash-constants -hash-structs -disable-sequential-hashes), use a fixed set (here empty) of environment variables (-environ) to reduce divergences, track instruction coverage (-coverage-instruction), and write the results into the record subdirectory (--output-dir).
klee -max-time=1min -search=dfs -write-ptree -hash-constants -hash-structs -disable-sequential-hashes -environ=/project/experiments/empty.env -coverage-instruction -posix-runtime -libc=uclibc --output-dir=record /project/grep/bc/grep.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout
This example shows the minimal number of arguments to record an execution tree.
Nonetheless, we recommend additional arguments to limit the memory usage (--max-memory=2000) and the number of written test cases (-dump-states=error), to set a solver timeout (-max-solver-time=10s), or to continue branching close to the memory limit (-max-memory-inhibit=FALSE).
The complete set of arguments for our experiments is documented in the shell scripts in the Evaluation section.
Afterwards, we replay the recording run from above (--continue-with=record), use the same exploration strategy during replay (-replay-search=dfs), store the execution tree to disk, and stop when we have finished replaying the recording run (-stop-after-replay).
In this run we do not prune any paths, as we -restart-all-states.
klee --continue-with=record -replay-search=dfs -search=dfs -stop-after-replay -restart-all-states -write-ptree -hash-constants -hash-structs -disable-sequential-hashes -environ=/project/experiments/empty.env -coverage-instruction -posix-runtime -libc=uclibc --output-dir=replay /project/grep/bc/grep.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout
Now let’s omit the -restart-all-states flag and enable path-pruning.
Let’s also enable the default RndCov strategy by omitting the searcher flags, so that different paths are explored.
Instead of only replaying the whole tree and stopping afterwards as before, we now re-use the execution tree (--continue-with=record) and continue for 1 minute.
Keep in mind that all replay runs modify the recorded process tree file (ptree.db) directly and do not create a copy of that file.
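If you want to keep a pristine copy of the recording run, back it up before replaying (record_backup is an arbitrary name):

cp -r record record_backup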
klee -max-time=1min --continue-with=record -write-ptree -hash-constants -hash-structs -disable-sequential-hashes -environ=/project/experiments/empty.env -coverage-instruction -posix-runtime -libc=uclibc --output-dir=resume /project/grep/bc/grep.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout
The command
klee-stats --print-all record replay resume | less -S
should now display something like
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Path     |  Instrs| Time(s)| ICov(%)| BCov(%)| ICount| QTime(%)| TSolver(%)| maxStates| avgStates| Mem(MB)| maxMem(MB)| avgMem(MB)| Queries| AvgQC| Tcex(%)| Tfork(%)| TResolve(s)| QCexCMisses| QCexCHits| Irepl| Iskip| Inew| Ilost| IlostT| ST| SD| SDSolver| STrepl| SDrepl| SPrepl| SEDErepl| SDiv| SDivT| SDivP| Qconst| Qrepl| Qunrepl| Qrec| Qunrec| QVrec| PTTime(s)| ATime(s)| FTime(s)| Nodes| States| KStates| KStatesIntr| KExtCalls| CoverableBB| CoveredBB| CoverableInst| CoveredInst| TSearch(%)| DMemUsage|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|record   |  564757|   61.41|   17.18|  100.00|  83555|    97.99|      98.37| 25| 21.00| 103.90| 115.88| 113.45| 849| 54| 98.12| 0.03| 0.01| 849| 775| 0| 0| 564757| 0| 0| 52| 23| 0| 0| 0| 0| 0| 0| 0| 0| 290825| 0| 0| 853| 62| 29| 0.00| 0.02| 0.00| 149| 75| 0| 0| 0| 0| 0| 0| 0| 0.00| 0|
|replay   |  564757|    3.41|   17.18|  100.00|  83555|    79.01|      79.70| 23| 12.60| 102.98| 114.98| 101.23| 15| 299| 79.05| 0.41| 2.69| 15| 43| 564757| 0| 0| 0| 0| 52| 23| 0| 0| 23| 0| 1| 0| 0| 0| 290825| 853| 62| 0| 0| 0| 0.00| 0.02| 0.00| 149| 75| 0| 0| 0| 0| 0| 0| 0| 0.02| 0|
|resume   | 1960268|   62.35|    6.94|  100.00|  83555|    58.76|      67.87| 8417| 4440.68| 123.72| 302.90| 213.05| 4275| 105| 61.60| 2.65| 1.69| 4275| 3803| 152354| 0|1807914| 0| 0| 6|8417| 0| 2| 5| 5| 1| 0| 0| 0| 964094| 7| 6| 22129| 332| 76| 0.26| 0.02| 0.00| 16849| 8425| 0| 0| 0| 0| 0| 0| 0| 1.82| 0|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Total (3)| 3089782|  127.17|   13.77|  100.00| 250665|    78.25|      82.91| 8465| 4474.28| 330.60| 533.75| 427.73| 5139| 97| 79.71| 1.33| 4.39| 5139| 4621| 717111| 0|2372671| 0| 0| 110|8463| 0| 2| 28| 5| 2| 0| 0| 0| 1545744| 860| 68| 22982| 394| 105| 0.26| 0.06| 0.00| 17147| 8575| 0| 0| 0| 0| 0| 0| 0| 0.89| 0|
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The replay run re-executed all instructions in a fraction of the time (3.41s) without any divergences (SDiv), while the resume run aimed to explore different paths and, by doing so, replayed 152,354 instructions (Irepl).
Note that if you’d like to run the commands above (or variants thereof) multiple times, you need to change the names of the output directories passed with --output-dir. Similarly, you need to pass those directory names to the klee-stats command.
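For instance, a minimal sketch that repeats the recording run three times under distinct names (record_1 to record_3 are arbitrary):

for i in 1 2 3; do
  klee -max-time=1min -search=dfs -write-ptree -hash-constants -hash-structs -disable-sequential-hashes -environ=/project/experiments/empty.env -coverage-instruction -posix-runtime -libc=uclibc --output-dir=record_$i /project/grep/bc/grep.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout
done
klee-stats --print-all record_1 record_2 record_3 | less -S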
In the first part of this document, we walk the reader through inspecting the data generated by our experiments. In the second part, we show how to re-run our experiments.
In the following sections, we provide the raw data for each of our experiments in separate subdirectories. Each directory contains, where necessary, a minimal script to re-compute the relevant data. These scripts not only document our methodology but can also be used to generate and compare results when reproducing our experiments.
From our evaluation we exclude several Coreutils:
- base32, sha1sum, sha224sum, sha384sum, sha512sum, as they are similar to base64
- cat, because KLEE/MoKlee instantly ran out of memory
- chcon, chgrp, chmod, chown, chroot, dd, ginstall, kill, rm, rmdir, stdbuf, truncate, unlink, as they (sometimes) interfere with our experiment setup. For instance, chmod removed the access rights to our sandbox directory and truncate destroyed Docker’s overlay file system. Otherwise, MoKlee works fine with those tools, and if you are brave enough, you might run kill, rm, … with symbolic arguments.

Directory: /project/evaluation/early_termination
For Figure 1, we ran MoKlee for 2 hours on 87 Coreutils with the default memory limit of 2GB.
We use MoKlee’s low-overhead recording mode here so that we can re-use this run for other experiments, especially the long-running experiment in Section 4.6.
We also provide the data for three non-recording runs (not shown in the paper) with three different memory limits: 2GB, 3GB, and 10GB, showing a similar behaviour and only slightly reduced termination rates for the 10GB run.
The raw data produced by MoKlee for the four runs can be found in 2GB_record.stats, 2GB.stats, 3GB.stats, and 10GB.stats.
The relevant columns are the number of created states (States) and the number of early-terminated states (KStates).
We provide a script that outputs the data in a more readable format, e.g.:
./early_termination.sh 2GB_record.stats
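If you prefer to inspect the raw data by hand, and assuming the .stats files are in klee-stats' tabular format, a hypothetical csvkit query could compute the termination rate directly (et.csv and the Terminated column are our names, not part of the artefact):

cat 2GB_record.stats | stats2csv > et.csv
csvsql --query "SELECT Path, States, KStates, 100.0*KStates/States AS Terminated FROM et" et.csv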
Directory: /project/evaluation/replay_coreutils
In this experiment, we record 2h runs of various Coreutils with the default search strategy, a combination of random-path and coverage-guided heuristics (RndCov). Afterwards, we replay those runs with and without path-pruning enabled and with two different search strategies. The raw data is available in:
- record.stats: recorded RndCov runs
- replay_rc.stats: replayed runs with RndCov strategy
- replay_dfs.stats: replayed runs with depth-first-search (DFS) strategy
- prune_rc.stats: replayed runs with RndCov strategy and path-pruning enabled
- prune_dfs.stats: replayed runs with DFS strategy and path-pruning enabled

Please note that when a tool is fully explored in the recording stage, not a single instruction (Instrs) needs to be executed with path-pruning enabled.
For the comparisons shown in Figures 7 and 9, we had to exclude all tools that:
- diverge (SDiv > 0) and lose instructions (Ilost > 0), or
- have states terminated early (KStatesIntr > 0)
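Assuming again that the .stats files are in klee-stats' tabular format, a hypothetical csvkit query expressing these exclusion criteria could look as follows (record.csv is our name):

cat record.stats | stats2csv > record.csv
csvsql --query "SELECT Path FROM record WHERE (SDiv > 0 AND Ilost > 0) OR KStatesIntr > 0" record.csv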
We include a script that computes the excluded tools (excludes.csv), the execution times of the recording runs (record_execution_times.csv), and a summary of the execution times (execution_times_summary.csv). After running:
./execution_times.sh
you can retrieve data for Figure 7 in textual form with e.g.:
csvstat -c "Time(s)" record_execution_times.csv
The command above shows the results for the Record-RndCov bar in Figure 7, in seconds rather than minutes. The other bars can be obtained in a similar fashion by querying the appropriate .csv files.
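If you prefer minutes, a hypothetical conversion with csvsql (the table name is derived from the file name; Time(min) is our label):

csvsql --query "SELECT Path, [Time(s)]/60.0 AS [Time(min)] FROM record_execution_times" record_execution_times.csv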
The results for Figure 9 can be obtained with:
csvlook execution_times_summary.csv
Directory: /project/evaluation/replay_coreutils_dfs
This experiment is similar to the one above but uses DFS as the search strategy in the recording run. Again, we provide the raw data and a script to create the timing data.
After running:
./execution_times.sh
you can retrieve data for Figure 8 in textual form from the generated .csv files with e.g.:
csvstat -c "Time(s)" record_execution_times.csv
Again, this shows the results in seconds rather than minutes. The data for Figure 10 can be displayed with:
csvlook execution_times_summary.csv
Although we excluded the replay run with the RndCov strategy from the paper, as too many tools terminate states early, we do provide the raw data (replay_rc.stats).
We use the same data for Figure 13 in Section 4.3 to report the time spent maintaining the process tree in MoKlee relative to the overall execution time. The statistics for the relevant runs can be retrieved with:
csvstat -c "TTree(%)" record_relttime.csv
and
csvstat -c "TTree(%)" replay_dfs_relttime.csv
Directory: /project/evaluation/pruned_states
Figure 11 is based on the path-pruning experiments for Coreutils from above. We use the same data but a different naming scheme: record_<record heuristic>.stats and prune_<record heuristic>_<replay heuristic>.stats.
After running:
./compute_pruned_states.sh
the distribution of pruned states can be displayed with, e.g.:
csvstat -c "PrunedStates(%)" summary_rc_rc.csv
With the current implementation, we can only compute the number of pruned states as the difference in created states between recording and replaying runs.
When a replaying run terminates states early due to memory pressure (KStates > 0) or removes subtrees because of divergences (Ilost > 0), we have to remove it from the dataset.
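To illustrate the arithmetic (with made-up example values, not from our data):

awk 'BEGIN { record=8575; prune=8425; printf "PrunedStates(%%) = %.2f\n", 100*(record-prune)/record }'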
Directory: /project/evaluation/ptree_sizes
We used ls to determine the sizes of the recorded process trees (ptree.db) for the 2h RndCov runs.
Hence, the sizes in the corresponding .csv files include the overhead of the database file format (SQLite).
For the sake of completeness, we report all sizes and not just those of the runs reaching the 2h limit as in Figure 12.
Directory: /project/evaluation/runtime_overhead
For this experiment, we use the recorded DFS run from above (record.stats) and re-run the same tools with the same number of instructions, but without recording (norecord.stats), to compare execution times.
Obviously, to compare two executions of the same tool, we need to make sure that they execute the same instructions and create the same execution tree.
Unfortunately, we cannot use our divergence detection for the non-recording run, as we do not record the necessary data.
Therefore, we use various statistics to ensure that different executions are identical, particularly the number of:
- executed instructions (Instrs)
- solver queries (Queries)
- counterexample cache hits (QCexCHits) and misses (QCexCMisses)
- terminated (ST) and interrupted states (SD)
- covered instructions (CoveredInst)

After generating the timing data with:
./compute_overhead.sh
the reported data for the remaining 42 tools can be retrieved with:
csvstat -c "Overhead(%)" overhead.csv
Directory: /project/evaluation/lost_instructions
Figure 14 shows the instructions lost due to divergences in non-pruning runs.
For that, we select all diverging runs (SDiv > 0) that lose instructions (Ilost > 0) but do not terminate unreplayed states early due to memory pressure (KStatesIntr = 0).
After running:
./compute_lost_instructions.sh
the results can be inspected via, e.g.:
csvstat -c "LostInstructions(%)" summary_rc_rc.csv
Directory: /project/evaluation/replay_others
We provide statistics for all experiments (.stats) and a script to generate the data for Table 1:
./compute_table.sh
For the sake of completeness, we also include the path-pruning runs. All runs are shown with:
csvlook table.csv
In Section 4.4, we report that replaying find with KLEE’s deterministic memory model reduces the number of divergences significantly.
These results are also shown as find-detmem.
Directory: /project/evaluation/iterative_deepening
We use the non-Coreutils applications for this experiment and exclude find due to its many divergences.
First, we determined the maximum depth that can be computed using BFS without terminating states early (d = 22 for diff, 13 for grep, 42 for libspng, 22 for readelf, and 19 for tcpdump).
These depths were then used as starting depths (00) for iterative deepening with DFS, with and without memoization (nomem).
After 5 more levels (05), we terminated the deepening process.
./compute_times.sh
creates summary files that can be inspected with, e.g.:
csvlook diff_summary.csv
The data for the record-only runs can be found in *_deep_nomem_ptree.stats.
We use the ptree-comparison tool to compare process trees (ptree.db) of the highest depth and provide the results in tree_comparison.txt.
Directory: /project/evaluation/long_running
For this experiment, we select all Coreutils from the RndCov record run that finish before our 2h timeout because they had to terminate states early due to memory pressure. We then run those experiments repeatedly using memoization, but without a time limit, until they run out of states due to early termination. When the combined runtime for an application exceeds 7 days, we stop the experiment.
We report the replayed instructions (Irepl) in Figure 16 and the number of new instructions (Inew) in Figure 17.
You may use, e.g.:
./long_run.sh base64_resumeN.stats
to reduce the output to relevant columns.
ptree_sizes.csv contains the sizes of the final process trees.
The coverage data for every tool, as shown in Figure 18, is included in the coverage_<tool>.csv files.
For each data point, each file contains the overall line coverage (cov), which is the sum of the covered lines in KLEE’s runtime (runtimeCov), the uclibc C library (uclibcCov), and the application itself (appCov).
A helper script shows the relevant columns, e.g.:
./coverage.sh coverage_tac.csv
This section guides you through the setup of our experiments. For each experiment, we provide a script to reproduce our results. All scripts have a similar structure, and it should be easy to spot configuration changes or adapt them if need be. The final section explains how to generate the evaluation data for your own experiments.
We ran our experiments on a carefully-configured homogeneous cluster of machines with Intel Core i7-4790 CPUs at 3.6 GHz, 16 GiB of RAM, and 1TB HDDs. Please be aware that these experiments are time-consuming and resource-intensive and you might want to reduce the set of benchmarks.
The Docker image comes with nano and the more user-friendly nice editor (ne) pre-installed.
Feel free to install other editors (e.g. apt update && apt install vim) to configure and adapt our scripts.
Directory: /project/experiments/early_termination
In our introduction we compute the percentages of states terminated due to memory pressure. You can re-run this experiment with:
./run_experiments.sh
and configure the selection of applications (e.g. tools=( "base64" "grep" )), the timeout (e.g. max_time=60min), and the memory limits (memory_limits) in the configuration section of our script.
The output directories are located at ./<limit>/<tool>.
Directory: /project/experiments/replay
For all benchmarks (Binutils, Coreutils, Diffutils, Findutils, grep, libspng, and tcpdump) we offer a single script:
./run_experiments.sh
to reproduce our replaying experiments. For each application, it runs MoKlee 10 times: two recording runs (RndCov and DFS), four replaying runs, and four pruning runs (one for each combination of recording and replaying strategy).
Running 93 applications 10 times with an initial timeout of 2h is, of course, time-consuming. You can alter the set of applications, select the search strategies for recording and replaying runs, and reduce the timeout in the configuration section of the script.
After successfully running the script you can find MoKlee’s output directories in the following subdirectories:
./record/<strategy>/<tool>
./replay/<record strategy>/<replay strategy>/<tool>
./prune/<record strategy>/<replay strategy>/<tool>
Directory: /project/experiments/overhead
This experiment compares the runtimes of a recorded run (DFS) and a non-recording run (DFS) that executes the same number of instructions (-stop-after-n-instructions=<n>).
Hence, it requires that some applications were recorded in DFS mode with the script from the previous section.
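For orientation, a sketch of what a single non-recording run might look like, assuming the grep DFS recording from the usage section above (Instrs = 564757); the script automates this for all selected tools:

klee -stop-after-n-instructions=564757 -search=dfs -environ=/project/experiments/empty.env -coverage-instruction -posix-runtime -libc=uclibc --output-dir=norecord /project/grep/bc/grep.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout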
Similar to the previous sections, you can adapt the set of applications (tools) in the configuration section and execute:
./run_experiments.sh
MoKlee’s output directories are located at ./norecord/<tool>.
Directory: /project/experiments/iterative_deepening
We first determine the largest depth that can be computed with a breadth-first strategy without terminating states due to memory pressure.
We omit those runs from the artefact.
In case you want to determine the relevant depths for other tools or different memory limits, iteratively increase -max-depth and terminate when the memory limit is reached (-terminate-when-oom).
The messages.txt file in the output directory should contain a message like:
KLEE: Reached memory limit, shutting down
Current state (12446) has depth: 22
KLEE: halting execution, dumping 141536 remaining states
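A hypothetical probe for grep’s starting depth could then look like this (we assume the breadth-first searcher is selected with -search=bfs; the remaining arguments follow the grep example from the usage section):

klee -max-depth=13 -terminate-when-oom -search=bfs -posix-runtime -libc=uclibc --output-dir=depth13 /project/grep/bc/grep.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout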
For two of the applications, diff and readelf, the runtimes became quite large (several days) for the higher depths, and we had to reduce the solver timeout (-max-solver-time) to 1s.
Again, this directory provides a script (run_experiments.sh) to execute memoized and non-memoized deepening runs.
You can change the start depths (start_depths) for each application and the number of deepening steps (iterations) in the configuration section.
With a combined run-time of more than 3 months, we consider it impractical to run this experiment in a single Docker container and provide no ready-made script in this case. The setup is identical to the replaying runs with path-pruning: iteratively replay the respective last run with RndCov, omitting the flags -stop-after-replay, -replay-kill-replayed, -max-time, -replay-search, and -search, until the combined run time per application exceeds 7 days; see the sketch below.
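A sketch of one such iteration (base64, the directory names resume1/resume2, and the symbolic arguments, which follow the grep example, are illustrative):

klee --continue-with=resume1 -write-ptree -hash-constants -hash-structs -disable-sequential-hashes -environ=/project/experiments/empty.env -coverage-instruction -posix-runtime -libc=uclibc --output-dir=resume2 /project/coreutils/bc/base64.bc --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdin 8 --sym-stdout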
Directory: /project/experiments/evaluation
For convenience, we provide a script (collect_data.sh) that gathers the data from the previous runs and builds up a directory structure similar to the one in the evaluation section.
Each subdirectory contains files with the same names and the same scripts as above.
The main difference is that we do not distinguish between the different application suites; all runs are in either replay_rc or replay_dfs, depending on the search strategy of the recording run.
Keep in mind that compute_overhead.sh filters out experiments with a runtime shorter than 2h.
This script needs to be adapted accordingly when you change the maximum runtime.