Iorga D, Wickerson J, Donaldson A, 2023, Simulating operational memory models using off-the-shelf program analysis tools, IEEE Transactions on Software Engineering, ISSN: 0098-5589
Memory models allow reasoning about the correctness of multithreaded programs. Constructing and using such models is facilitated by simulators that reveal which behaviours of a given program are allowed. While extensive work has been done on simulating axiomatic memory models, there has been less work on simulation of operational models. Operational models are often considered more intuitive than axiomatic models, but are challenging to simulate due to the vast number of paths through the model’s transition system. Observing that a similar path-explosion problem is tackled by program analysis tools, we investigate the idea of reducing the decision problem of “whether a given memory model allows a given behaviour” to the decision problem of “whether a given C program is safe”, which can be handled by a variety of off-the-shelf tools. We report on our experience using multiple program analysis tools for C for this purpose—a model checker (CBMC), a symbolic execution tool (KLEE), and three coverage-guided fuzzers (libFuzzer, Centipede and AFL++)—presenting two case-studies. First, we evaluate the performance and scalability of these tools in the context of the x86 memory model, showing that fuzzers offer performance competitive with that of RMEM, a state-of-the-art bespoke memory model simulator. Second, we study a more complex, recently developed memory model for hybrid CPU/FPGA devices for which no bespoke simulator is available. We highlight how different encoding strategies can aid the various tools and show how our approach allows us to simulate the CPU/FPGA model twice as deeply as in prior work, leading to us finding and fixing several infidelities in the model. We also experimented with applying three analysis tools that won the “falsification” category in the 2023 Annual Software Verification Competition (SV-COMP). We found that these tools do not scale to our use cases, motivating us to submit example C pr
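The path explosion the abstract describes can be illustrated in miniature with a toy operational model in the spirit of x86-TSO: each thread has a FIFO store buffer, and a simulator must explore every interleaving of instruction execution and buffer drains to decide which outcomes of the classic store-buffering (SB) litmus test are allowed. The sketch below is a simplified Python illustration, not the paper's approach of encoding the question as a C program for off-the-shelf analysis tools.

```python
# Toy operational TSO-style model: two threads, each with a FIFO store buffer.
# SB litmus test: T0: x=1; r0=y   T1: y=1; r1=x
PROGS = ([('st', 'x', 1), ('ld', 'y', 'r0')],
         [('st', 'y', 1), ('ld', 'x', 'r1')])

def swap(tup, i, x):
    """Return tup with element i replaced by x."""
    return tup[:i] + (x,) + tup[i + 1:]

def step_options(state):
    """Yield successor states: a thread executes its next instruction, or
    the oldest entry of a thread's store buffer drains to shared memory."""
    mem, bufs, pcs, regs = state
    for t in (0, 1):
        if bufs[t]:  # drain oldest buffered store to memory
            var, val = bufs[t][0]
            yield (mem | {var: val}, swap(bufs, t, bufs[t][1:]), pcs, regs)
        if pcs[t] < len(PROGS[t]):  # execute thread t's next instruction
            op, pcs2 = PROGS[t][pcs[t]], swap(pcs, t, pcs[t] + 1)
            if op[0] == 'st':
                yield (mem, swap(bufs, t, bufs[t] + ((op[1], op[2]),)), pcs2, regs)
            else:  # load: forward from own buffer if possible, else read memory
                hits = [v for (w, v) in bufs[t] if w == op[1]]
                val = hits[-1] if hits else mem[op[1]]
                yield (mem, bufs, pcs2, regs | {op[2]: val})

def allowed_outcomes():
    """Exhaustive search over all paths through the transition system --
    the explosion that bespoke simulators and analysis tools must tame."""
    init = ({'x': 0, 'y': 0}, ((), ()), (0, 0), {})
    outcomes, stack, seen = set(), [init], set()
    while stack:
        mem, bufs, pcs, regs = state = stack.pop()
        if pcs == (2, 2) and bufs == ((), ()):  # terminated, buffers drained
            outcomes.add((regs['r0'], regs['r1']))
        for s in step_options(state):
            key = (tuple(sorted(s[0].items())), s[1], s[2],
                   tuple(sorted(s[3].items())))
            if key not in seen:
                seen.add(key)
                stack.append(s)
    return outcomes

outcomes = allowed_outcomes()  # includes (0, 0), the weak behaviour TSO allows
```

Even this four-instruction test has dozens of interleavings; real litmus tests and models multiply this rapidly, which is what motivates delegating the search to model checkers, symbolic executors, and fuzzers.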
Sharma M, Yu P, Donaldson AF, 2023, RustSmith: Random Differential Compiler Testing for Rust, Pages: 1483-1486
We present RustSmith, the first Rust randomised program generator for end-to-end testing of Rust compilers. RustSmith generates programs that conform to the advanced type system of Rust, respecting rules related to borrowing and lifetimes, and that are guaranteed to yield a well-defined result. This makes RustSmith suitable for differential testing between compilers or across optimisation levels. By applying RustSmith to a series of versions of the official Rust compiler, rustc, we show that it can detect insidious historical bugs that evaded detection for some time. We have also used RustSmith to find previously-unknown bugs in an alternative Rust compiler implementation, mrustc. In a controlled experiment, we assess statement and mutation coverage achieved by RustSmith vs. the rustc optimisation test suite.
Even-Mendoza K, Sharma A, Donaldson A, et al., 2023, GrayC: Greybox fuzzing of compilers and analysers for C, International Symposium on Software Testing and Analysis, Publisher: ACM, Pages: 1219-1231
Fuzzing of compilers and code analysers has led to a large number of bugs being found and fixed in widely-used frameworks such as LLVM, GCC and Frama-C. Most such fuzzing techniques have taken a blackbox approach, with compilers and code analysers starting to become relatively immune to such fuzzers. We propose a coverage-directed, mutation-based approach for fuzzing C compilers and code analysers, inspired by the success of this type of greybox fuzzing in other application domains. The main challenge of applying mutation-based fuzzing in this context is that naive mutations are likely to generate programs that do not compile. Such programs are not useful for finding deep bugs that affect optimisation, analysis, and code generation routines. We have designed a novel greybox fuzzer for C compilers and analysers by developing a new set of mutations to target common C constructs, and transforming fuzzed programs so that they produce meaningful output, allowing differential testing to be used as a test oracle, and paving the way for fuzzer-generated programs to be integrated into compiler and code analyser regression test suites. We have implemented our approach in GrayC, a new open-source LibFuzzer-based tool, and present experiments showing that it provides more coverage on the middle- and back-end stages of compilers and analysers compared to other mutation-based approaches, including Clang-Fuzzer, PolyGlot, and a technique similar to LangFuzz. We have used GrayC to identify 30 confirmed compiler and code analyser bugs: 25 previously unknown bugs (with 22 of them already fixed in response to our reports) and 5 confirmed bugs reported independently shortly before we found them. A further 3 bug reports are under investigation. Apart from the results above, we have contributed 24 simplified versions of coverage-enhancing test cases produced by GrayC to the Clang/LLVM test suite, targeting 78 previously uncovered functions in the LLVM codebase.
Lecoeur B, Mohsin H, Donaldson AF, 2023, Program Reconditioning: Avoiding Undefined Behaviour When Finding and Reducing Compiler Bugs, PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, Vol: 7
Mitenkov G, Magkanaris I, Awile O, et al., 2023, MOD2IR: High-Performance Code Generation for a Biophysically Detailed Neuronal Simulation DSL, Pages: 203-215
Advances in computational capabilities and large volumes of experimental data have established computer simulations of brain tissue models as an important pillar in modern neuroscience. Alongside, a variety of domain specific languages (DSLs) have been developed to succinctly express properties of these models, ensure their portability to different platforms, and provide an abstraction that allows scientists to work in their comfort zone of mathematical equations, delegating concerns about performance optimizations to downstream compilers. One of the popular DSLs in modern neuroscience is the NEURON MODeling Language (NMODL). Until now, its compilation process has been split into first transpiling NMODL to C++ and then using a C++ toolchain to emit the efficient machine code. This approach has several drawbacks including the reliance on different programming models to target heterogeneous hardware, maintainability of multiple compiler back-ends and the lack of flexibility to use the domain information for C++ code optimization. To overcome these limitations, we present MOD2IR, a new open-source code generation pipeline for NMODL. MOD2IR leverages the LLVM toolchain to target multiple CPU and GPU hardware platforms. Generating LLVM IR allows the vector extensions of modern CPU architectures to be targeted directly, producing optimized SIMD code. Additionally, this gives MOD2IR significant potential for further optimizations based on the domain information available when LLVM IR code is generated. We present experiments showing that MOD2IR is able to produce on-par execution performance using a single compiler back-end implementation compared to code generated via state-of-the-art C++ compilers, and can even surpass them by up to 1.26×. Moreover, MOD2IR supports JIT-execution of NMODL, yielding both efficient code and an on-the-fly execution workflow.
Evrard H, Donaldson AF, 2023, Model Checking Futexes, Pages: 41-58, ISSN: 0302-9743
The futex Linux system call enables implementing performant inter-thread synchronisation primitives, such as mutexes and condition variables. However, the futex system call is notoriously tricky to use correctly. In this case study, we use the Spin model checker to verify safety properties of a number of futex-based mutex and condition variable implementations. We show how model checking is able to detect bugs that affected real-world implementations, and confirm current implementations are correct. The Promela models we have developed are available as open source, and may be useful as teaching material for classes that cover futex-based synchronisation primitives, and as a template on how to perform formal verification on new synchronisation primitive designs.
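Spin verifies such models by exhaustively exploring every interleaving of the threads in a Promela model. The same idea can be sketched in miniature, in Python rather than Promela, for a simple lock: an atomic compare-and-swap acquire preserves mutual exclusion, while a (hypothetical) broken variant that splits the acquire into a separate test and set admits a race. This is an illustrative stand-in, not the futex models from the paper.

```python
def bump(pcs, t, new_pc):
    """Return the program-counter tuple with thread t advanced to new_pc."""
    return pcs[:t] + (new_pc,) + pcs[t + 1:]

def check_mutex(broken=False):
    """Exhaustively explore all interleavings of two threads taking a lock,
    entering a critical section, and releasing; return True iff mutual
    exclusion holds on every path. With broken=True, acquire is modelled
    as a non-atomic test (pc 0) then set (pc 1), which admits a race."""
    init = (0, 0, (0, 0))  # lock value, threads in critical section, pcs
    stack, seen = [init], {init}
    while stack:
        lock, in_cs, pcs = stack.pop()
        if in_cs > 1:
            return False               # two threads in the critical section
        for t in (0, 1):
            succ, pc = None, pcs[t]
            if pc == 0:                # acquire
                if broken:
                    if lock == 0:      # step 1 of 2: observe lock free
                        succ = (lock, in_cs, bump(pcs, t, 1))
                elif lock == 0:        # atomic compare-and-swap
                    succ = (1, in_cs, bump(pcs, t, 2))
            elif pc == 1:              # broken step 2 of 2: set the lock
                succ = (1, in_cs, bump(pcs, t, 2))
            elif pc == 2:              # enter critical section
                succ = (lock, in_cs + 1, bump(pcs, t, 3))
            elif pc == 3:              # leave critical section, release lock
                succ = (0, in_cs - 1, bump(pcs, t, 4))
            if succ and succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return True
```

Model checking the broken variant finds the violating interleaving automatically, which is exactly the kind of subtle schedule-dependent bug that makes futex-based primitives "notoriously tricky".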
Klimis V, Clark J, Baker A, et al., 2023, Taking Back Control in an Intermediate Representation for GPU Computing, PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, Vol: 7
Donaldson AF, Clayton B, Harrison R, et al., 2023, Industrial Deployment of Compiler Fuzzing Techniques for Two GPU Shading Languages, 16th IEEE International Conference on Software Testing, Verification and Validation (ICST), Publisher: IEEE COMPUTER SOC, Pages: 374-385, ISSN: 2381-2834
Even-Mendoza K, Cadar C, Donaldson A, 2022, CsmithEdge: more effective compiler testing by handling undefined behaviour less conservatively, Empirical Software Engineering: an international journal, Vol: 27, ISSN: 1382-3256
Compiler fuzzing techniques require a means of generating programs that are free from undefined behaviour (UB) to reliably reveal miscompilation bugs. Existing program generators such as CSMITH achieve UB-freedom by heavily restricting the form of generated programs. The idiomatic nature of the resulting programs risks limiting the test coverage they can offer, and thus the compiler bugs they can discover. We investigate the idea of adapting existing fuzzers to be less restrictive concerning UB, in the practical setting of C compiler testing via a new tool, CSMITHEDGE, which extends CSMITH. CSMITHEDGE probabilistically weakens the constraints used to enforce UB-freedom, thus generated programs are no longer guaranteed to be UB-free. It then employs several off-the-shelf UB detection tools and a novel dynamic analysis to (a) detect cases where the generated program exhibits UB and (b) determine where CSMITH has been too conservative in its use of safe math wrappers that guarantee UB-freedom for arithmetic operations, removing the use of redundant ones. The resulting UB-free programs can be used to test for miscompilation bugs via differential testing. The non-UB-free programs can still be used to check that the compiler under test does not crash or hang. Our experiments on recent versions of GCC, LLVM and the Microsoft Visual Studio Compiler show that CSMITHEDGE was able to discover 7 previously unknown miscompilation bugs (5 already fixed in response to our reports) that could not be found via intensive testing using CSMITH, and 2 compiler-hang bugs that were fixed independently shortly before we considered reporting them.
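The safe math wrappers mentioned here check operands before an operation and fall back to a well-defined result when the operation would be UB in C. A simplified Python rendering of the signed 32-bit safe-addition pattern (the real wrappers are C functions in Csmith's runtime headers; this sketch only captures the idea):

```python
INT32_MAX, INT32_MIN = 2**31 - 1, -(2**31)

def safe_add_int32(a, b):
    """Csmith-style safe addition on signed 32-bit values: if a + b would
    overflow (undefined behaviour for signed int in C), skip the addition
    and return one operand, which is always well-defined."""
    if (b > 0 and a > INT32_MAX - b) or (b < 0 and a < INT32_MIN - b):
        return a        # would overflow: fall back to a defined value
    return a + b        # safe: perform the real addition
```

CsmithEdge's dynamic analysis identifies call sites whose operands can never overflow at runtime, so a guarded call like `safe_add_int32(a, b)` can be replaced by a plain `a + b`, producing less idiomatic and more diverse test programs.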
Busse F, Gharat P, Cadar C, et al., 2022, Combining static analysis error traces with dynamic symbolic execution (experience paper), International Symposium on Software Testing and Analysis (ISSTA 2022), Publisher: ACM, Pages: 568-579
This paper reports on our experience implementing a technique for sifting through static analysis reports using dynamic symbolic execution. Our insight is that if a static analysis tool produces a partial trace through the program under analysis, annotated with conditions that the analyser believes are important for the bug to trigger, then a dynamic symbolic execution tool may be able to exploit the trace by (a) guiding the search heuristically so that paths that follow the trace most closely are prioritised for exploration, and (b) pruning the search using the conditions associated with each step of the trace. This may allow the bug to be quickly confirmed using dynamic symbolic execution, if it turns out to be a true positive, yielding an input that triggers the bug. To experiment with this approach, we have implemented the idea in a tool chain that allows the popular open-source static analysis tools Clang Static Analyzer (CSA) and Infer to be combined with the popular open-source dynamic symbolic execution engine KLEE. Our findings highlight two interesting negative results. First, while fault injection experiments show the promise of our technique, they also reveal that the traces provided by static analysis tools are not that useful in guiding search. Second, we have systematically applied CSA and Infer to a large corpus of real-world applications that are suitable for analysis with KLEE, and find that the static analysers are rarely able to find non-trivial true positive bugs for this set of applications. We believe our case study can inform static analysis and dynamic symbolic execution tool developers as to where improvements may be necessary, and serve as a call to arms for researchers interested in combining symbolic execution and static analysis to identify more suitable benchmark suites for evaluation of research ideas.
Windsor M, Donaldson AF, Wickerson J, 2022, High-coverage metamorphic testing of concurrency support in C compilers, SOFTWARE TESTING VERIFICATION & RELIABILITY, Vol: 32, ISSN: 0960-0833
Lascu A, Donaldson AF, Grosser T, et al., 2022, Metamorphic Fuzzing of C++ Libraries, 15th IEEE International Conference on Software Testing, Verification and Validation (ICST), Publisher: IEEE COMPUTER SOC, Pages: 35-46, ISSN: 2381-2834
Iorga D, Donaldson A, Sorensen T, et al., 2021, The semantics of shared memory in Intel CPU/FPGA systems, Proceedings of the ACM on Programming Languages, Vol: 5, Pages: 1-28, ISSN: 2475-1421
Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own ‘litmus-test processor’ in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation – a situation that our model is able to detect – we observe that its performance improves at the cost of occasional data corruption.
Sorensen T, Salvador LF, Raval H, et al., 2021, Specifying and testing GPU workgroup progress models, Proceedings of the ACM on Programming Languages, Vol: 5, Pages: 1-30, ISSN: 2475-1421
As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test – i.e. whether it is guaranteed to eventually terminate – under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.
Lascu A, Windsor M, Donaldson A, et al., 2021, Dreaming up metamorphic relations: experiences from three fuzzer tools, 2021 IEEE/ACM 6th International Workshop on Metamorphic Testing (MET), Publisher: IEEE, Pages: 61-68
Metamorphic testing requires the availability of a suitable set of metamorphic relations (MRs) for the application domain of interest. A software testing practitioner interested in using metamorphic testing is thus blocked unless they can devise a suitable set of MRs. In this paper we offer some practical advice on sources of inspiration for MRs, based on our experience building three fuzzing tools based on metamorphic testing: MF++, which supports automated testing of C++11 libraries, C4, which tests concurrency support in C11 compilers, and spirv-fuzz, which aims to find bugs in compilers for the SPIR-V programming language (mainly used in computer graphics). The MRs we have devised have taken inspiration from three main sources: (1) careful study of specification documents related to the libraries and programming languages that these tools target, (2) consultation of prior work and discussion with domain experts, and (3) manual inspection of the results of automated code coverage analysis on the systems under test. We describe these sources of inspiration in detail, giving a range of concrete examples for each. We hope that this experience report will help to inform developers of future metamorphic testing tools as to the steps they can take to discover MRs in their domains of interest.
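For readers unfamiliar with MRs: the essential shape is that a semantics-preserving rewrite of the input must not change the output of the system under test. A toy Python illustration, using `eval` of arithmetic expressions as a stand-in for the compilers and libraries these tools target (the relation `(e) + 0 == e` is hypothetical, not one from the paper):

```python
def mr_add_zero(expr):
    """A semantics-preserving rewrite: for integer arithmetic,
    (e) + 0 has the same value as e, so outputs must agree."""
    return f"({expr}) + 0"

def metamorphic_test(expr, relation, rounds=5):
    """Check the system under test (here: Python's eval, standing in for
    a compiler) against a metamorphic relation, stacking rewrites as
    tools like C4 and spirv-fuzz do. Returns True iff no mismatch."""
    expected = eval(expr)
    for _ in range(rounds):
        expr = relation(expr)         # apply the rewrite again
        if eval(expr) != expected:    # a mismatch would expose a bug
            return False
    return True
```

The hard part in practice, and the subject of the paper, is discovering relations like `mr_add_zero` that are both genuinely semantics-preserving and likely to exercise buggy code paths.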
Donaldson A, Thomson P, Teliman V, et al., 2021, Test-case reduction and deduplication almost for free with transformation-based compiler testing, PLDI 2021, Publisher: Association for Computing Machinery (ACM), Pages: 1017-1032
Recent transformation-based approaches to compiler testing look for mismatches between the results of pairs of equivalent programs, where one program is derived from the other by randomly applying semantics-preserving transformations. We present a formulation of transformation-based compiler testing that provides effective test-case reduction almost for free: if transformations are designed to be as small and independent as possible, standard delta debugging can be used to shrink a bug-inducing transformation sequence to a smaller subsequence that still triggers the bug. The bug can then be reported as a delta between an original and minimally-transformed program. Minimized transformation sequences can also be used to heuristically deduplicate a set of bug-inducing tests, recommending manual investigation of those that involve disparate types of transformations and thus may have different root causes. We demonstrate the effectiveness of our approach via a new tool, spirv-fuzz, the first compiler-testing tool for the SPIR-V intermediate representation that underpins the Vulkan GPU programming model.
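The reduction step is standard delta debugging over the transformation sequence: repeatedly try dropping chunks of transformations, keeping any subsequence that still triggers the bug. A minimal Python sketch, where the `still_fails` predicate is a stand-in for "the compiler still miscompiles the transformed program" (the example oracle below is hypothetical):

```python
def ddmin(seq, still_fails):
    """Classic delta debugging: shrink seq to a smaller subsequence for
    which still_fails(subsequence) remains True."""
    n = 2
    while len(seq) >= 2:
        chunk = max(1, len(seq) // n)
        reduced = False
        for i in range(0, len(seq), chunk):
            candidate = seq[:i] + seq[i + chunk:]     # drop one chunk
            if candidate and still_fails(candidate):
                seq, n, reduced = candidate, max(2, n - 1), True
                break
        if not reduced:
            if chunk == 1:
                break                  # no single transformation can be removed
            n = min(len(seq), n * 2)   # retry at finer granularity
    return seq

# Hypothetical oracle: transformations 3 and 7 jointly trigger the bug.
bug = lambda ts: 3 in ts and 7 in ts
```

Because each transformation is small and independent, the surviving subsequence (here `[3, 7]`) directly names the transformations implicated in the bug, which is what enables the deduplication heuristic.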
Windsor M, Donaldson AF, Wickerson J, 2021, C4: The C Compiler Concurrency Checker, 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), Publisher: ASSOC COMPUTING MACHINERY, Pages: 670-673
Even Mendoza K, Cadar C, Donaldson A, 2020, Closer to the edge: testing compilers more thoroughly by being less conservative about undefined behaviour, IEEE/ACM International Conference on Automated Software Engineering, New Ideas and Emerging Results Track (ASE-NIER 2020), Publisher: IEEE / ACM, Pages: 1219-1223
Randomised compiler testing techniques require a means of generating programs that are free from undefined behaviour (UB) in order to reliably reveal miscompilation bugs. Existing program generators such as Csmith heavily restrict the form of generated programs in order to achieve UB-freedom. We hypothesise that the idiomatic nature of such programs limits the test coverage they can offer. Our idea is to generate less restricted programs that are still UB-free—programs that get closer to the edge of UB, but that do not quite cross the edge. We present preliminary support for our idea via a prototype tool, CsmithEdge, which uses simple dynamic analysis to determine where Csmith has been too conservative in its use of safe math wrappers that guarantee UB-freedom for arithmetic operations. By eliminating redundant wrappers, CsmithEdge was able to discover two new miscompilation bugs in GCC that could not be found via intensive testing using regular Csmith, and to achieve substantial differences in code coverage on GCC compared with regular Csmith.
MacIver DR, Donaldson AF, 2020, Test-case reduction via test-case generation: Insights from the hypothesis reducer, 34th European Conference on Object-Oriented Programming (ECOOP 2020), Publisher: DROPS, Pages: 1-27, ISSN: 1868-8969
We describe internal test-case reduction, the method of test-case reduction employed by Hypothesis, a widely-used property-based testing library for Python. The key idea of internal test-case reduction is that instead of applying test-case reduction externally to generated test cases, we apply it internally, to the sequence of random choices made during generation, so that a test case is reduced by continually re-generating smaller and simpler test cases that continue to trigger some property of interest (e.g. a bug in the system under test). This allows for fully generic test-case reduction without any user intervention and without the need to write a specific test-case reducer for a particular application domain. It also significantly mitigates the impact of the test-case validity problem, by ensuring that any reduced test case is one that could in principle have been generated. We describe the rationale behind this approach, explain its implementation in Hypothesis, and present an extensive evaluation comparing its effectiveness with that of several other test-case reducers, including C-Reduce and delta debugging, on applications including Python auto-formatting, C compilers, and the SymPy symbolic math library. Our hope is that these insights into the reduction mechanism employed by Hypothesis will be useful to researchers interested in randomized testing and test-case reduction, as the crux of the approach is fully generic and should be applicable to any random generator of test cases.
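The core idea can be sketched in a few lines: a deterministic generator consumes a sequence of "random" choices, and the reducer shrinks the choice sequence rather than the test case, re-generating after each change and keeping any sequence whose regenerated test still fails. This toy Python version (hypothetical generator and property, far simpler than Hypothesis's actual shortlex-ordered shrinking) shows why every reduced test is by construction one the generator could have produced:

```python
def generate(choices):
    """Rebuild a test case (a list of ints) deterministically from a
    sequence of choices: an odd control byte means 'emit one more
    element', and the byte after it is that element's value."""
    out, i = [], 0
    while i < len(choices) and choices[i] % 2 == 1:
        out.append(choices[i + 1] if i + 1 < len(choices) else 0)
        i += 2
    return out

def shrink(choices, fails):
    """Internal reduction: shrink the choice sequence, keeping any
    change whose regenerated test case still triggers the failure."""
    improved = True
    while improved:
        improved = False
        # pass 1: delete one (control, value) pair of choices
        for i in range(0, len(choices) - 1, 2):
            cand = choices[:i] + choices[i + 2:]
            if fails(generate(cand)):
                choices, improved = cand, True
                break
        if improved:
            continue
        # pass 2: lower a single choice to zero
        for i, c in enumerate(choices):
            if c > 0:
                cand = choices[:i] + [0] + choices[i + 1:]
                if fails(generate(cand)):
                    choices, improved = cand, True
                    break
    return choices

# Hypothetical property: a test case fails if any element is >= 5.
fails = lambda xs: any(x >= 5 for x in xs)
reduced = shrink([1, 9, 1, 3, 1, 7], fails)   # regenerates to a one-element list
```

No domain-specific reducer was needed: deleting and lowering choices works for any generator built on the same choice sequence.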
Donaldson AF, Evrard H, Thomson P, 2020, Putting randomized compiler testing into production, 34th European Conference on Object-Oriented Programming (ECOOP 2020), Publisher: DROPS, Pages: 1-29, ISSN: 1868-8969
We describe our experience over the last 18 months on a compiler testing technology transfer project: taking the GraphicsFuzz research project on randomized metamorphic testing of graphics shader compilers, and building the necessary tooling around it to provide a highly automated process for improving the Khronos Vulkan Conformance Test Suite (CTS) with test cases that expose fuzzer-found compiler bugs, or that plug gaps in test coverage. We present this tooling for test automation – gfauto – in detail, as well as our use of differential coverage and test case reduction as a method for automatically synthesizing tests that fill coverage gaps. We explain the value that GraphicsFuzz has provided in automatically testing the ecosystem of tools for transforming, optimizing and validating Vulkan shaders, and the challenges faced when testing a tool ecosystem rather than a single tool. We discuss practical issues associated with putting automated metamorphic testing into production, related to test case validity, bug de-duplication and floating-point precision, and provide illustrative examples of bugs found during our work.
Donaldson AF, Torlak E, 2020, Message from the Chairs, Pages: III-IV
Iorga D, Sorensen T, Wickerson J, et al., 2020, Slow and steady: measuring and tuning multicore interference, IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Publisher: IEEE
Now ubiquitous, multicore processors provide replicated compute cores that allow independent programs to run in parallel. However, shared resources, such as last-level caches, can cause otherwise-independent programs to interfere with one another, leading to significant and unpredictable effects on their execution time. Indeed, prior work has shown that specially crafted enemy programs can cause software systems of interest to experience orders-of-magnitude slowdowns when both are run in parallel on a multicore processor. This undermines the suitability of these processors for tasks that have real-time constraints. In this work, we explore the design and evaluation of techniques for empirically testing interference using enemy programs, with an eye towards reliability (how reproducible the interference results are) and portability (how interference testing can be effective across chips). We first show that different methods of measurement yield significantly different magnitudes of, and variation in, observed interference effects when applied to an enemy process that was shown to be particularly effective in prior work. We propose a method of measurement based on percentiles and confidence intervals, and show that it provides both competitive and reproducible observations. The reliability of our measurements allows us to explore auto-tuning, where enemy programs are further specialised per architecture. We evaluate three different tuning approaches (random search, simulated annealing, and Bayesian optimisation) on five different multicore chips, spanning x86 and ARM architectures. To show that our tuned enemy programs generalise to applications, we evaluate the slowdowns caused by our approach on the AutoBench and CoreMark benchmark suites. Our method achieves a statistically larger slowdown compared to prior work in 35 out of 105 benchmark/chip combinations.
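The measurement idea, reporting a percentile of the timing distribution together with a confidence interval rather than a single mean or worst case, can be sketched as follows. This is a generic rendering of the statistical technique (nearest-rank percentile plus a bootstrap interval), not the paper's actual harness:

```python
import random

def percentile(samples, p):
    """Nearest-rank p-th percentile over a sorted copy of the samples."""
    s = sorted(samples)
    k = round(p / 100 * (len(s) - 1))
    return s[max(0, min(len(s) - 1, k))]

def bootstrap_ci(samples, p, trials=1000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for the p-th
    percentile: resample with replacement, recompute the percentile,
    and take quantiles of the resulting estimates."""
    rng = random.Random(seed)
    est = sorted(
        percentile([rng.choice(samples) for _ in samples], p)
        for _ in range(trials)
    )
    return est[int(alpha / 2 * trials)], est[int((1 - alpha / 2) * trials) - 1]

# Example with synthetic victim timings (real use: run the victim with and
# without the enemy program and compare the intervals).
timings = list(range(100))
lo, hi = bootstrap_ci(timings, 90)
```

Non-overlapping intervals for the "with enemy" and "without enemy" runs give a reproducible, statistically defensible slowdown claim, which in turn makes the auto-tuning search signal trustworthy.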
Marcozzi M, Tang Q, Cadar C, et al., 2019, Compiler Fuzzing: How Much Does It Matter?
Despite much recent interest in randomised testing (fuzzing) of compilers, the practical impact of fuzzer-found compiler bugs on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of miscompilation bugs in a mature compiler. We follow a rigorous methodology where the bug impact over the compiled application is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) whether such changes cause regression test suite failures, or whether we can manually find application inputs that trigger execution divergence due to such changes. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 bugs found by human users compiling real code or as a by-product of formal verification efforts. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for at least one package, in which case only a very small part of the binary is typically affected, yet causing two failures when running the test suites of all the impacted packages. User-reported and formal verification bugs do not exhibit a higher impact, with a lower rate of triggered bugs and one test failure. The manual analysis of a selection of the syntactic changes caused by some of our bugs (fuzzer-found and non fuzzer-found) in package assembly code, shows that either these changes have no semantic impact or that they would require very specific runtime circumstances to trigger execution divergence.
Liew D, Cadar C, Donaldson A, et al., 2019, Just fuzz it: solving floating-point constraints using coverage-guided fuzzing, ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), Publisher: ACM, Pages: 521-532
We investigate the use of coverage-guided fuzzing as a means of proving satisfiability of SMT formulas over finite variable domains, with specific application to floating-point constraints. We show how an SMT formula can be encoded as a program containing a location that is reachable if and only if the program's input corresponds to a satisfying assignment to the formula. A coverage-guided fuzzer can then be used to search for an input that reaches the location, yielding a satisfying assignment. We have implemented this idea in a tool, Just Fuzz-it Solver (JFS), and we present a large experimental evaluation showing that JFS is both competitive with and complementary to state-of-the-art SMT solvers with respect to solving floating-point constraints, and that the coverage-guided approach of JFS provides significant benefit over naive fuzzing in the floating-point domain. Applied in a portfolio manner, the JFS approach thus has the potential to complement traditional SMT solvers for program analysis tasks that involve reasoning about floating-point constraints.
Lidbury C, Donaldson AF, 2019, Sparse record and replay with controlled scheduling, 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019), Publisher: ACM Press, Pages: 576-593
Modern applications include many sources of nondeterminism, e.g. due to concurrency, signals, and system calls that interact with the external environment. Finding and reproducing bugs in the presence of this nondeterminism has been the subject of much prior work in three main areas: (1) controlled concurrency-testing, where a custom scheduler replaces the OS scheduler to find subtle bugs; (2) record and replay, where sources of nondeterminism are captured and logged so that a failing execution can be replayed for debugging purposes; and (3) dynamic analysis for the detection of data races. We present a dynamic analysis tool for C++ applications, tsan11rec, which brings these strands of work together by integrating controlled concurrency testing and record and replay into the tsan11 framework for C++11 data race detection. Our novel twist on record and replay is a sparse approach, where the sources of nondeterminism to record can be configured per application. We show that our approach is effective at finding subtle concurrency bugs in small applications; is competitive in terms of performance with the state-of-the-art record and replay tool rr on larger applications; succeeds (due to our sparse approach) in replaying the I/O-intensive Zandronum and QuakeSpasm video games, which are out of scope for rr; but (due to limitations of our sparse approach) cannot faithfully replay applications where memory layout nondeterminism significantly affects application behaviour.
Marcozzi M, Tang Q, Donaldson A, et al., 2019, A systematic impact study for fuzzer-found compiler bugs
Despite much recent interest in compiler fuzzing, the practical impact of fuzzer-found miscompilations on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of fuzzer-found compiler bugs. We follow a novel methodology where the impact of a miscompilation bug is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) how likely such changes are to cause runtime divergences during execution. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 other bugs found by the Alive formal verification tool or human users. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for some applications, but barely affect their syntax and only cause two failures in total when running their regression test suites. Our manual analysis of a selection of bugs suggests that these bugs cannot trigger on the packages considered in the analysis, and that in general they affect only corner cases which have a low probability of occurring in practice. User-reported and Alive bugs do not exhibit a higher impact, with less frequently triggered bugs and one test failure.
Sorensen T, Pai S, Donaldson AF, 2019, Performance Evaluation of OpenCL Standard Support (and Beyond), 8th International Workshop on OpenCL (IWOCL), Publisher: ACM
Sorensen T, Pai S, Donaldson AF, 2019, One Size Doesn't Fit All: Quantifying Performance Portability of Graph Applications on GPUs, 15th IEEE International Symposium on Workload Characterization (IISWC), Publisher: IEEE, Pages: 155-166