108 results found
Even-Mendoza K, Cadar C, Donaldson A, 2022, CsmithEdge: more effective compiler testing by handling undefined behaviour less conservatively, Empirical Software Engineering: an international journal, Vol: 27, ISSN: 1382-3256
Compiler fuzzing techniques require a means of generating programs that are free from undefined behaviour (UB) to reliably reveal miscompilation bugs. Existing program generators such as CSMITH achieve UB-freedom by heavily restricting the form of generated programs. The idiomatic nature of the resulting programs risks limiting the test coverage they can offer, and thus the compiler bugs they can discover. We investigate the idea of adapting existing fuzzers to be less restrictive concerning UB, in the practical setting of C compiler testing via a new tool, CSMITHEDGE, which extends CSMITH. CSMITHEDGE probabilistically weakens the constraints used to enforce UB-freedom, so that generated programs are no longer guaranteed to be UB-free. It then employs several off-the-shelf UB detection tools and a novel dynamic analysis to (a) detect cases where the generated program exhibits UB and (b) determine where CSMITH has been too conservative in its use of safe math wrappers that guarantee UB-freedom for arithmetic operations, removing the use of redundant ones. The resulting UB-free programs can be used to test for miscompilation bugs via differential testing. The non-UB-free programs can still be used to check that the compiler under test does not crash or hang. Our experiments on recent versions of GCC, LLVM and the Microsoft Visual Studio Compiler show that CSMITHEDGE was able to discover 7 previously unknown miscompilation bugs (5 already fixed in response to our reports) that could not be found via intensive testing using CSMITH, and 2 compiler-hang bugs that were fixed independently shortly before we considered reporting them.
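The safe math wrappers and redundancy analysis described above can be illustrated with a small sketch. This is not CsmithEdge's actual implementation; it is a hypothetical Python model of the idea: a Csmith-style wrapper guards 32-bit signed addition against UB, and a dynamic check flags the wrapper as redundant when no observed execution could overflow.

```python
# Illustrative sketch (not CsmithEdge's code): a Csmith-style safe math
# wrapper for 32-bit signed addition, plus a dynamic redundancy check.

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def would_overflow(a, b):
    """True iff 32-bit signed a + b would be undefined behaviour in C."""
    return not (INT_MIN <= a + b <= INT_MAX)

def safe_add(a, b):
    """Mimics a safe wrapper: fall back to a defined result on overflow."""
    return a if would_overflow(a, b) else a + b

def wrapper_is_redundant(observed_operands):
    """Dynamic-analysis idea: if no observed execution overflows, the
    wrapper can be dropped and the raw '+' used instead."""
    return all(not would_overflow(a, b) for a, b in observed_operands)
```

A wrapper judged redundant by `wrapper_is_redundant` can be removed from the generated program while preserving UB-freedom on the observed executions.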
Busse F, Gharat P, Cadar C, et al., 2022, Combining static analysis error traces with dynamic symbolic execution (experience paper), International Symposium on Software Testing and Analysis (ISSTA 2022), Publisher: ACM, Pages: 568-579
This paper reports on our experience implementing a technique for sifting through static analysis reports using dynamic symbolic execution. Our insight is that if a static analysis tool produces a partial trace through the program under analysis, annotated with conditions that the analyser believes are important for the bug to trigger, then a dynamic symbolic execution tool may be able to exploit the trace by (a) guiding the search heuristically so that paths that follow the trace most closely are prioritised for exploration, and (b) pruning the search using the conditions associated with each step of the trace. This may allow the bug to be quickly confirmed using dynamic symbolic execution, if it turns out to be a true positive, yielding an input that triggers the bug. To experiment with this approach, we have implemented the idea in a tool chain that allows the popular open-source static analysis tools Clang Static Analyzer (CSA) and Infer to be combined with the popular open-source dynamic symbolic execution engine KLEE. Our findings highlight two interesting negative results. First, while fault injection experiments show the promise of our technique, they also reveal that the traces provided by static analysis tools are not that useful in guiding search. Second, we have systematically applied CSA and Infer to a large corpus of real-world applications that are suitable for analysis with KLEE, and find that the static analysers are rarely able to find non-trivial true positive bugs for this set of applications. We believe our case study can inform static analysis and dynamic symbolic execution tool developers as to where improvements may be necessary, and serve as a call to arms for researchers interested in combining symbolic execution and static analysis to identify more suitable benchmark suites for evaluation of research ideas.
Windsor M, Donaldson AF, Wickerson J, 2022, High-coverage metamorphic testing of concurrency support in C compilers, SOFTWARE TESTING VERIFICATION & RELIABILITY, Vol: 32, ISSN: 0960-0833
Iorga D, Donaldson A, Sorensen T, et al., 2021, The semantics of shared memory in Intel CPU/FPGA systems, Proceedings of the ACM on Programming Languages, Vol: 5, Pages: 1-28, ISSN: 2475-1421
Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own ‘litmus-test processor’ in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation – a situation that our model is able to detect – we observe that its performance improves at the cost of occasional data corruption.
Sorensen T, Salvador LF, Raval H, et al., 2021, Specifying and testing GPU workgroup progress models, Proceedings of the ACM on Programming Languages, Vol: 5, Pages: 1-30, ISSN: 2475-1421
As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test – i.e. whether it is guaranteed to eventually terminate – under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.
Lascu A, Windsor M, Donaldson A, et al., 2021, Dreaming up metamorphic relations: experiences from three fuzzer tools, 2021 IEEE/ACM 6th International Workshop on Metamorphic Testing (MET), Publisher: IEEE, Pages: 61-68
Metamorphic testing requires the availability of a suitable set of metamorphic relations (MRs) for the application domain of interest. A software testing practitioner interested in using metamorphic testing is thus blocked unless they can devise a suitable set of MRs. In this paper we offer some practical advice on sources of inspiration for MRs, based on our experience building three fuzzing tools based on metamorphic testing: MF++, which supports automated testing of C++11 libraries; C4, which tests concurrency support in C11 compilers; and spirv-fuzz, which aims to find bugs in compilers for the SPIR-V programming language (mainly used in computer graphics). The MRs we have devised have taken inspiration from three main sources: (1) careful study of specification documents related to the libraries and programming languages that these tools target, (2) consultation of prior work and discussion with domain experts, and (3) manual inspection of the results of automated code coverage analysis on the systems under test. We describe these sources of inspiration in detail, giving a range of concrete examples for each. We hope that this experience report will help to inform developers of future metamorphic testing tools as to the steps they can take to discover MRs in their domains of interest.
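To make the notion of an MR concrete, here is a toy example of the kind discussed above. It is hypothetical (not one of the paper's MRs): a semantics-preserving identity transformation on an expression should not change the result computed by the system under test, so a mismatch indicates a bug.

```python
# Toy metamorphic relation (hypothetical, for illustration only):
# adding 0 to an integer expression must preserve its value.

def system_under_test(expr):
    """Stands in for a compiler/interpreter; here it just evaluates."""
    return eval(expr)

def mr_add_zero(expr):
    """MR transformation: wrap the expression in a '+ 0' identity."""
    return f"({expr}) + 0"

def check_mr(expr):
    original = system_under_test(expr)
    transformed = system_under_test(mr_add_zero(expr))
    return original == transformed  # a mismatch would indicate a bug
```

Real tools of this kind apply much richer semantics-preserving transformations, but the oracle has the same shape: run both versions and compare.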
Donaldson A, Thomson P, Teliman V, et al., 2021, Test-case reduction and deduplication almost for free with transformation-based compiler testing, PLDI 2021, Publisher: Association for Computing Machinery (ACM), Pages: 1017-1032
Recent transformation-based approaches to compiler testing look for mismatches between the results of pairs of equivalent programs, where one program is derived from the other by randomly applying semantics-preserving transformations. We present a formulation of transformation-based compiler testing that provides effective test-case reduction almost for free: if transformations are designed to be as small and independent as possible, standard delta debugging can be used to shrink a bug-inducing transformation sequence to a smaller subsequence that still triggers the bug. The bug can then be reported as a delta between an original and minimally-transformed program. Minimized transformation sequences can also be used to heuristically deduplicate a set of bug-inducing tests, recommending manual investigation of those that involve disparate types of transformations and thus may have different root causes. We demonstrate the effectiveness of our approach via a new tool, spirv-fuzz, the first compiler-testing tool for the SPIR-V intermediate representation that underpins the Vulkan GPU programming model.
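The reduction idea above can be sketched in a few lines. Since each transformation is designed to be small and independent, a simple greedy pass (a special case of delta debugging) removes transformations one at a time, keeping only those needed to still trigger the bug; `triggers_bug` is a hypothetical oracle standing in for "compile both programs and compare results".

```python
# Minimal sketch of transformation-sequence reduction: greedily drop
# transformations while the bug still reproduces.

def reduce_transformations(transformations, triggers_bug):
    kept = list(transformations)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]   # try dropping transformation i
        if triggers_bug(candidate):
            kept = candidate                   # still buggy: drop it for good
        else:
            i += 1                             # needed: keep it and move on
    return kept

# Toy oracle: the bug fires whenever both "inline" and "hoist" are applied.
bug = lambda ts: "inline" in ts and "hoist" in ts
print(reduce_transformations(["renumber", "inline", "dead-code", "hoist"], bug))
# → ['inline', 'hoist']
```

The surviving subsequence can then double as a deduplication signature: tests reduced to disparate transformation types likely have different root causes.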
Windsor M, Donaldson AF, Wickerson J, 2021, C4: The C Compiler Concurrency Checker, 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), Publisher: ASSOC COMPUTING MACHINERY, Pages: 670-673
Even Mendoza K, Cadar C, Donaldson A, 2020, Closer to the edge: testing compilers more thoroughly by being less conservative about undefined behaviour, IEEE/ACM International Conference on Automated Software Engineering, New Ideas and Emerging Results Track (ASE-NIER 2020), Publisher: IEEE / ACM, Pages: 1219-1223
Randomised compiler testing techniques require a means of generating programs that are free from undefined behaviour (UB) in order to reliably reveal miscompilation bugs. Existing program generators such as Csmith heavily restrict the form of generated programs in order to achieve UB-freedom. We hypothesise that the idiomatic nature of such programs limits the test coverage they can offer. Our idea is to generate less restricted programs that are still UB-free: programs that get closer to the edge of UB, but that do not quite cross the edge. We present preliminary support for our idea via a prototype tool, CsmithEdge, which uses simple dynamic analysis to determine where Csmith has been too conservative in its use of safe math wrappers that guarantee UB-freedom for arithmetic operations. By eliminating redundant wrappers, CsmithEdge was able to discover two new miscompilation bugs in GCC that could not be found via intensive testing using regular Csmith, and to achieve substantial differences in code coverage on GCC compared with regular Csmith.
MacIver DR, Donaldson AF, 2020, Test-case reduction via test-case generation: Insights from the hypothesis reducer, 34th European Conference on Object-Oriented Programming (ECOOP 2020), Publisher: DROPS, Pages: 1-27, ISSN: 1868-8969
We describe internal test-case reduction, the method of test-case reduction employed by Hypothesis, a widely-used property-based testing library for Python. The key idea of internal test-case reduction is that instead of applying test-case reduction externally to generated test cases, we apply it internally, to the sequence of random choices made during generation, so that a test case is reduced by continually re-generating smaller and simpler test cases that continue to trigger some property of interest (e.g. a bug in the system under test). This allows for fully generic test-case reduction without any user intervention and without the need to write a specific test-case reducer for a particular application domain. It also significantly mitigates the impact of the test-case validity problem, by ensuring that any reduced test case is one that could in principle have been generated. We describe the rationale behind this approach, explain its implementation in Hypothesis, and present an extensive evaluation comparing its effectiveness with that of several other test-case reducers, including C-Reduce and delta debugging, on applications including Python auto-formatting, C compilers, and the SymPy symbolic math library. Our hope is that these insights into the reduction mechanism employed by Hypothesis will be useful to researchers interested in randomized testing and test-case reduction, as the crux of the approach is fully generic and should be applicable to any random generator of test cases.
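The internal reduction idea above can be sketched as follows. This is a simplified model inspired by (not taken from) Hypothesis: the generator deterministically maps a sequence of choices to a test case, and the reducer shrinks the choice sequence itself, so every reduced test case is by construction one the generator could have produced.

```python
# Simplified sketch of internal test-case reduction: shrink the sequence
# of generation choices, not the generated test case.

def generate(choices):
    """Toy generator: each choice in 0..9 becomes a list element."""
    return [c for c in choices]

def interesting(test_case):
    """Toy property: the 'bug' triggers whenever the list contains a 7."""
    return 7 in test_case

def shrink(choices):
    """Try shorter, then element-wise smaller, choice sequences."""
    current = list(choices)
    improved = True
    while improved:
        improved = False
        # First try deleting each choice in turn.
        for i in range(len(current)):
            cand = current[:i] + current[i + 1:]
            if interesting(generate(cand)):
                current, improved = cand, True
                break
        if improved:
            continue
        # Then try lowering each choice towards 0.
        for i, c in enumerate(current):
            for smaller in range(c):
                cand = current[:i] + [smaller] + current[i + 1:]
                if interesting(generate(cand)):
                    current, improved = cand, True
                    break
            if improved:
                break
    return current

print(shrink([3, 7, 9, 7, 2]))  # → [7]
```

Because reduction operates on choice sequences, it is fully generic: swapping in a different `generate` requires no new domain-specific reducer, which is the validity benefit the abstract describes.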
Donaldson AF, Evrard H, Thomson P, 2020, Putting randomized compiler testing into production, 34th European Conference on Object-Oriented Programming (ECOOP 2020), Publisher: DROPS, Pages: 1-29, ISSN: 1868-8969
We describe our experience over the last 18 months on a compiler testing technology transfer project: taking the GraphicsFuzz research project on randomized metamorphic testing of graphics shader compilers, and building the necessary tooling around it to provide a highly automated process for improving the Khronos Vulkan Conformance Test Suite (CTS) with test cases that expose fuzzer-found compiler bugs, or that plug gaps in test coverage. We present this tooling for test automation – gfauto – in detail, as well as our use of differential coverage and test case reduction as a method for automatically synthesizing tests that fill coverage gaps. We explain the value that GraphicsFuzz has provided in automatically testing the ecosystem of tools for transforming, optimizing and validating Vulkan shaders, and the challenges faced when testing a tool ecosystem rather than a single tool. We discuss practical issues associated with putting automated metamorphic testing into production, related to test case validity, bug de-duplication and floating-point precision, and provide illustrative examples of bugs found during our work.
Donaldson AF, Torlak E, 2020, Message from the Chairs, Pages: III-IV
Iorga D, Sorensen T, Wickerson J, et al., 2020, Slow and steady: measuring and tuning multicore interference, IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Publisher: IEEE
Now ubiquitous, multicore processors provide replicated compute cores that allow independent programs to run in parallel. However, shared resources, such as last-level caches, can cause otherwise-independent programs to interfere with one another, leading to significant and unpredictable effects on their execution time. Indeed, prior work has shown that specially crafted enemy programs can cause software systems of interest to experience orders-of-magnitude slowdowns when both are run in parallel on a multicore processor. This undermines the suitability of these processors for tasks that have real-time constraints. In this work, we explore the design and evaluation of techniques for empirically testing interference using enemy programs, with an eye towards reliability (how reproducible the interference results are) and portability (how interference testing can be effective across chips). We first show that different methods of measurement yield significantly different magnitudes of, and variation in, observed interference effects when applied to an enemy process that was shown to be particularly effective in prior work. We propose a method of measurement based on percentiles and confidence intervals, and show that it provides both competitive and reproducible observations. The reliability of our measurements allows us to explore auto-tuning, where enemy programs are further specialised per architecture. We evaluate three different tuning approaches (random search, simulated annealing, and Bayesian optimisation) on five different multicore chips, spanning x86 and ARM architectures. To show that our tuned enemy programs generalise to applications, we evaluate the slowdowns caused by our approach on the AutoBench and CoreMark benchmark suites. Our method achieves a statistically larger slowdown compared to prior work in 35 out of 105 benchmark/chip combinations.
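A percentile-plus-confidence-interval measurement of the kind advocated above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's exact statistical procedure: a high percentile of victim run times captures rare-but-large slowdowns, and a bootstrap confidence interval conveys how reproducible that percentile is.

```python
# Illustrative sketch: nearest-rank percentile with a bootstrap CI,
# as one might measure multicore interference effects reproducibly.

import random

def percentile(samples, q):
    """Simple nearest-rank style percentile (0 <= q <= 100)."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(q / 100 * len(s)))
    return s[idx]

def bootstrap_ci(samples, q, reps=1000, alpha=0.05, seed=0):
    """Bootstrap (1 - alpha) confidence interval for the q-th percentile."""
    rng = random.Random(seed)
    stats = sorted(
        percentile([rng.choice(samples) for _ in samples], q)
        for _ in range(reps)
    )
    lo = stats[int(alpha / 2 * reps)]
    hi = stats[int((1 - alpha / 2) * reps) - 1]
    return lo, hi

# Toy victim run times: mostly fast, occasionally heavily slowed down.
times = [1.0, 1.1, 1.0, 5.2, 1.1, 1.0, 4.8, 1.2]
p90 = percentile(times, 90)
lo, hi = bootstrap_ci(times, 90)
print(f"p90 = {p90}, 95% CI = [{lo}, {hi}]")
```

A mean over `times` would dilute the 5x outliers; the 90th percentile surfaces them, and a narrow CI across repeated runs indicates a reproducible interference measurement.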
Marcozzi M, Tang Q, Cadar C, et al., 2019, Compiler Fuzzing: How Much Does It Matter?
Despite much recent interest in randomised testing (fuzzing) of compilers, the practical impact of fuzzer-found compiler bugs on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of miscompilation bugs in a mature compiler. We follow a rigorous methodology where the bug impact over the compiled application is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) whether such changes cause regression test suite failures, or whether we can manually find application inputs that trigger execution divergence due to such changes. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 bugs found by human users compiling real code or as a by-product of formal verification efforts. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for at least one package, in which case only a very small part of the binary is typically affected, yet causing two failures when running the test suites of all the impacted packages. User-reported and formal verification bugs do not exhibit a higher impact, with a lower rate of triggered bugs and one test failure. The manual analysis of a selection of the syntactic changes caused by some of our bugs (fuzzer-found and non fuzzer-found) in package assembly code, shows that either these changes have no semantic impact or that they would require very specific runtime circumstances to trigger execution divergence.
Marcozzi M, Tang Q, Cadar C, et al., 2019, Compiler Fuzzing: How Much Does It Matter?
Despite much recent interest in compiler randomized testing (fuzzing), the practical impact of fuzzer-found compiler bugs on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of miscompilation bugs in a mature compiler. We follow a rigorous methodology where the bug impact over the compiled application is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) the extent to which such changes cause regression test suite failures or can be used to manually trigger divergences during execution. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 bugs found by human users compiling real code or by formal verification. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for some packages, but rarely affect their syntax and cause two failures in total when running their test suites. User-reported and formal verification bugs do not exhibit a higher impact, with less frequently triggered bugs and one test failure. Our manual analysis of a selection of bugs, either fuzzer-found or not, suggests that none can easily trigger a runtime divergence on the packages considered in the analysis, and that in general they affect only corner cases.
Liew D, Cadar C, Donaldson A, et al., 2019, Just fuzz it: solving floating-point constraints using coverage-guided fuzzing, ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), Publisher: ACM, Pages: 521-532
We investigate the use of coverage-guided fuzzing as a means of proving satisfiability of SMT formulas over finite variable domains, with specific application to floating-point constraints. We show how an SMT formula can be encoded as a program containing a location that is reachable if and only if the program’s input corresponds to a satisfying assignment to the formula. A coverage-guided fuzzer can then be used to search for an input that reaches the location, yielding a satisfying assignment. We have implemented this idea in a tool, Just Fuzz-it Solver (JFS), and we present a large experimental evaluation showing that JFS is both competitive with and complementary to state-of-the-art SMT solvers with respect to solving floating-point constraints, and that the coverage-guided approach of JFS provides significant benefit over naive fuzzing in the floating-point domain. Applied in a portfolio manner, the JFS approach thus has the potential to complement traditional SMT solvers for program analysis tasks that involve reasoning about floating-point constraints.
Lidbury C, Donaldson AF, 2019, Sparse record and replay with controlled scheduling, the 40th ACM SIGPLAN Conference, Publisher: ACM Press, Pages: 576-593
Modern applications include many sources of nondeterminism, e.g. due to concurrency, signals, and system calls that interact with the external environment. Finding and reproducing bugs in the presence of this nondeterminism has been the subject of much prior work in three main areas: (1) controlled concurrency-testing, where a custom scheduler replaces the OS scheduler to find subtle bugs; (2) record and replay, where sources of nondeterminism are captured and logged so that a failing execution can be replayed for debugging purposes; and (3) dynamic analysis for the detection of data races. We present a dynamic analysis tool for C++ applications, tsan11rec, which brings these strands of work together by integrating controlled concurrency testing and record and replay into the tsan11 framework for C++11 data race detection. Our novel twist on record and replay is a sparse approach, where the sources of nondeterminism to record can be configured per application. We show that our approach is effective at finding subtle concurrency bugs in small applications; is competitive in terms of performance with the state-of-the-art record and replay tool rr on larger applications; succeeds (due to our sparse approach) in replaying the I/O-intensive Zandronum and QuakeSpasm video games, which are out of scope for rr; but (due to limitations of our sparse approach) cannot faithfully replay applications where memory layout nondeterminism significantly affects application behaviour.
Marcozzi M, Tang Q, Donaldson A, et al., 2019, A systematic impact study for fuzzer-found compiler bugs
Despite much recent interest in compiler fuzzing, the practical impact of fuzzer-found miscompilations on real-world applications has barely been assessed. We present the first quantitative and qualitative study of the tangible impact of fuzzer-found compiler bugs. We follow a novel methodology where the impact of a miscompilation bug is evaluated based on (1) whether the bug appears to trigger during compilation; (2) the extent to which generated assembly code changes syntactically due to triggering of the bug; and (3) how likely such changes are to cause runtime divergences during execution. The study is conducted with respect to the compilation of more than 10 million lines of C/C++ code from 309 Debian packages, using 12% of the historical and now fixed miscompilation bugs found by four state-of-the-art fuzzers in the Clang/LLVM compiler, as well as 18 other bugs found by the Alive formal verification tool or human users. The results show that almost half of the fuzzer-found bugs propagate to the generated binaries for some applications, but barely affect their syntax and only cause two failures in total when running their regression test suites. Our manual analysis of a selection of bugs suggests that these bugs cannot trigger on the packages considered in the analysis, and that in general they affect only corner cases which have a low probability of occurring in practice. User-reported and Alive bugs do not exhibit a higher impact, with less frequently triggered bugs and one test failure.
Sorensen T, Pai S, Donaldson AF, 2019, Performance Evaluation of OpenCL Standard Support (and Beyond), 8th International Workshop on OpenCL (IWOCL), Publisher: ASSOC COMPUTING MACHINERY
Sorensen T, Pai S, Donaldson AF, 2019, One Size Doesn't Fit All: Quantifying Performance Portability of Graph Applications on GPUs, 15th IEEE International Symposium on Workload Characterization (IISWC), Publisher: IEEE, Pages: 155-166
Sorensen T, Evrard H, Donaldson AF, 2018, GPU schedulers: How fair is fair enough?, 29th International Conference on Concurrency Theory (CONCUR 2018), Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik, ISSN: 1868-8969
Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do run on today’s GPUs, reasoning about liveness properties of such applications is difficult as documentation is scarce and scattered. In this work, we aim to clarify fairness properties of semi-fair schedulers. To do this, we define a general temporal logic formula, based on weak fairness, parameterised by a predicate that enables fairness per-thread at certain points of an execution. We then define fairness properties for three GPU schedulers: HSA, OpenCL, and occupancy-bound execution. We examine existing GPU applications and show that none of the above schedulers are strong enough to provide the fairness properties required by these applications. It hence appears that existing GPU scheduler descriptions do not entirely capture the fairness properties that are provided on current GPUs. Thus, we present two new schedulers that aim to support existing GPU applications. We analyse the behaviour of common blocking idioms under each scheduler and show that one of our new schedulers allows a more natural implementation of a GPU protocol.
Ketema J, Donaldson AF, 2017, Termination analysis for GPU kernels, Science of Computer Programming, Vol: 148, Pages: 107-122, ISSN: 0167-6423
We describe a thread-modular technique for proving termination of massively parallel GPU kernels. The technique reduces the termination problem for these kernels to a sequential termination problem by abstracting the shared state, and as such allows us to leverage termination analysis techniques for sequential programs. An implementation in KITTeL is able to show termination of 94% of 604 kernels collected from various sources.
Donaldson AF, Evrard H, Lascu A, et al., 2017, Automated testing of graphics shader compilers, Object-oriented Programming, Systems, Languages, and Applications (OOPSLA), Publisher: ACM, Pages: 1-27, ISSN: 2475-1421
We present an automated technique for finding defects in compilers for graphics shading languages. A key challenge in compiler testing is the lack of an oracle that classifies an output as correct or incorrect; this is particularly pertinent in graphics shader compilers, where the output is a rendered image that is typically under-specified. Our method builds on recent successful techniques for compiler validation based on metamorphic testing, and leverages existing high-value graphics shaders to create sets of transformed shaders that should be semantically equivalent. Rendering mismatches are then indicative of shader compilation bugs. Deviant shaders are automatically minimized to identify, in each case, a minimal change to an original high-value shader that induces a shader compiler bug. We have implemented the approach as a tool, GLFuzz, targeting the OpenGL shading language, GLSL. Our experiments over a set of 17 GPU and driver configurations, spanning the main 7 GPU designers, have led to us finding and reporting more than 60 distinct bugs, covering all tested configurations. As well as defective rendering, these issues identify security-critical vulnerabilities that affect WebGL, including a significant remote information leak security bug where a malicious web page can capture the contents of other browser tabs, and a bug whereby visiting a malicious web page can lead to a "blue screen of death" under Windows 10. Our findings show that shader compiler defects are prevalent, and that metamorphic testing provides an effective means for detecting them automatically.
Donaldson AF, Ketema J, Sorensen T, et al., 2017, Forward progress on GPU concurrency, 28th International Conference on Concurrency Theory, Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik, ISSN: 1868-8969
The tutorial at CONCUR will provide a practical overview of work undertaken over the last six years in the Multicore Programming Group at Imperial College London, and with collaborators internationally, related to understanding and reasoning about concurrency in software designed for acceleration on GPUs. In this article we provide an overview of this work, which includes contributions to data race analysis, compiler testing, memory model understanding and formalisation, and most recently efforts to enable portable GPU implementations of algorithms that require forward progress guarantees.
Sorensen T, Evrard H, Donaldson AF, 2017, Cooperative kernels: GPU multitasking for blocking algorithms, 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Publisher: ACM
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploiting scheduling quirks of today's GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels, an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking.
Liew D, Schemmel D, Cadar C, et al., 2017, Floating-Point Symbolic Execution: A Case Study in N-version Programming, IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), Publisher: IEEE
Betts A, Chong N, Deligiannis P, et al., 2017, Implementing and evaluating candidate-based invariant generation, IEEE Transactions on Software Engineering, Vol: 44, Pages: 631-650, ISSN: 1939-3520
The discovery of inductive invariants lies at the heart of static program verification. Presently, many automatic solutions to inductive invariant generation are inflexible, only applicable to certain classes of programs, or unpredictable. An automatic technique that circumvents these deficiencies to some extent is candidate-based invariant generation, whereby a large number of candidate invariants are guessed and then proven to be inductive or rejected using a sound program analyser. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker of programs that run on GPUs. We study a set of 383 GPU programs that contain loops, drawn from a number of open source suites and vendor SDKs. Among this set, 253 benchmarks require provision of loop invariants for verification to succeed. We describe the methodology we used to incrementally improve the invariant generation capabilities of GPUVerify to handle these benchmarks through candidate-based invariant generation, whereby potential program invariants are speculated using cheap static analysis and subsequently either refuted or proven. We also describe a set of experiments that we used to examine the effectiveness of our rules for candidate generation, assessing rules based on their generality (the extent to which they generate candidate invariants), hit rate (the extent to which the generated candidates hold), effectiveness (the extent to which provable candidates actually help in allowing verification to succeed), and influence (the extent to which the success of one generation rule depends on candidates generated by another rule). We believe that our methodology for devising and evaluating candidate generation rules may serve as a useful framework for other researchers interested in candidate-based invariant generation. The candidates produced by GPUVerify help to verify 231 of these 253 programs. An increase in precision, however, has created sluggishness in GPUVerify because more candidates are generated.
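The guess-then-refute loop at the heart of candidate-based invariant generation can be sketched in a few lines (a toy illustration only: GPUVerify checks inductiveness of candidates for GPU kernels with an SMT-based analyser, whereas this sketch merely tests candidates against the concrete reachable states of a small loop):

```python
# Toy sketch of candidate-based invariant generation. The loop under
# analysis is: i = 0; while i < 10: i += 2. Its reachable states are
# enumerated directly; a real tool would prove inductiveness instead.

states = list(range(0, 12, 2))  # i takes the values 0, 2, ..., 10

# Speculated candidate invariants (name -> predicate), as a cheap
# static analysis might guess them from the loop's syntax.
candidates = {
    "i >= 0":    lambda i: i >= 0,
    "i <= 10":   lambda i: i <= 10,
    "i is even": lambda i: i % 2 == 0,
    "i < 5":     lambda i: i < 5,      # will be refuted by state i = 6
}

# Refute every candidate falsified by some reachable state; the
# survivors hold on all states (and would then be checked for
# inductiveness by a sound analyser).
proven = {name: p for name, p in candidates.items()
          if all(p(s) for s in states)}
print(sorted(proven))  # ['i <= 10', 'i >= 0', 'i is even']
```

The paper's "hit rate" metric corresponds to the fraction of guessed candidates that survive this refutation step.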
Magron V, Constantinides G, Donaldson AF, 2017, Certified roundoff error bounds using semidefinite programming, ACM Transactions on Mathematical Software, Vol: 43, Pages: 1-31, ISSN: 0098-3500
Roundoff errors cannot be avoided when implementing numerical programs with finite precision. The ability to reason about rounding is especially important if one wants to explore a range of potential representations, for instance for FPGAs or custom hardware implementations. The problem becomes challenging when the program does not employ solely linear operations, as non-linearities are inherent to many interesting computational problems in real-world applications. Existing approaches to such reasoning can lead either to inaccurate bounds or to high analysis times in the presence of nonlinear correlations between variables. Furthermore, while it is easy to implement a straightforward method such as interval arithmetic, more sophisticated techniques are less straightforward to implement in a formal setting. Thus there is a need for methods that output certificates that can be formally validated inside a proof assistant. We present a framework for providing upper bounds on the absolute roundoff errors of floating-point nonlinear programs. The framework is based on optimization techniques employing semidefinite programming and sums-of-squares certificates, which can be checked inside the Coq theorem prover to provide formal roundoff error bounds for polynomial programs. Our tool covers a wide range of nonlinear programs, including polynomials and transcendental operations as well as conditional statements. We illustrate the efficiency and precision of the tool on non-trivial programs coming from biology, optimization, and space control. Our tool produces more accurate error bounds for 23% of all programs and yields better performance in 66% of all programs.
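The abstract contrasts interval arithmetic with SDP-based bounds; a small sketch (hypothetical example, not the paper's tool) shows why naive interval arithmetic loses precision on correlated nonlinear expressions, which is exactly the situation the semidefinite-programming approach is designed to handle:

```python
# Sketch of interval arithmetic's weakness on correlated terms.
# Intervals are (lo, hi) pairs; this is a toy, not a verified analysis.

def i_mul(a, b):
    # Product interval: min/max over the four endpoint products.
    prods = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(prods), max(prods))

def i_sub(a, b):
    # Difference interval.
    return (a[0] - b[1], a[1] - b[0])

x = (-1.0, 1.0)
# Evaluate x*x - x*x. Interval arithmetic forgets that both factors
# are the same variable, so it cannot conclude the result is [0, 0].
r = i_sub(i_mul(x, x), i_mul(x, x))
print(r)  # (-2.0, 2.0) instead of the exact (0.0, 0.0)
```

Techniques such as the sums-of-squares certificates described above track these correlations globally, at the cost of a harder optimization problem, and emit a certificate checkable in Coq.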
Donaldson AF, Gopalakrishnan G, Chong N, et al., 2017, Formal analysis techniques for reliable GPU programming: Current solutions and call to action, Advances in GPU Research and Practice, Pages: 3-21, ISBN: 9780128037386
Graphics processing unit (GPU)-accelerated computing is being adopted increasingly in a number of areas, ranging from high-end scientific computing to mobile and embedded computing. While GPU programs routinely provide high computational throughput in a number of areas, they also prove to be notoriously difficult to write and optimize correctly, largely because of the subtleties of GPU concurrency. This chapter discusses several issues that make GPU programming difficult and examines recent progress on rigorous methods for formal analysis of GPU software. Our key observation is that, given the fast-paced advances in GPU programming, the use of rigorous specification and verification methods must be an integral part of the culture of programming and training, and not an afterthought.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.