90 results found
Liew D, Cadar C, Donaldson A, et al., 2019, Just fuzz it: solving floating-point constraints using coverage-guided fuzzing, ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’19), Publisher: ACM, Pages: 521-532
We investigate the use of coverage-guided fuzzing as a means of proving satisfiability of SMT formulas over finite variable domains, with specific application to floating-point constraints. We show how an SMT formula can be encoded as a program containing a location that is reachable if and only if the program’s input corresponds to a satisfying assignment to the formula. A coverage-guided fuzzer can then be used to search for an input that reaches the location, yielding a satisfying assignment. We have implemented this idea in a tool, Just Fuzz-it Solver (JFS), and we present a large experimental evaluation showing that JFS is both competitive with and complementary to state-of-the-art SMT solvers with respect to solving floating-point constraints, and that the coverage-guided approach of JFS provides significant benefit over naive fuzzing in the floating-point domain. Applied in a portfolio manner, the JFS approach thus has the potential to complement traditional SMT solvers for program analysis tasks that involve reasoning about floating-point constraints.
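The encoding at the heart of this approach can be illustrated with a toy sketch. The constraint, the program shape, and the naive random loop below (standing in for a coverage-guided fuzzer such as libFuzzer) are illustrative inventions, not JFS's actual output:

```python
import random
import struct

# Hypothetical example: encode the floating-point constraint
#   (x + 1.0 > 2.5) AND (x * x < 10.0)
# as a program with a location that is reachable iff the input satisfies it.
def encoded_program(data: bytes) -> bool:
    if len(data) < 8:
        return False
    (x,) = struct.unpack("<d", data[:8])  # interpret raw input bytes as a double
    if x + 1.0 > 2.5:
        if x * x < 10.0:
            return True  # "target" location: input is a satisfying assignment
    return False

# Naive random fuzzing loop; a real coverage-guided fuzzer would additionally
# use branch coverage feedback to guide input mutation toward the target.
def fuzz(max_iters=100000):
    rng = random.Random(0)
    for _ in range(max_iters):
        data = bytes(rng.randrange(256) for _ in range(8))
        if encoded_program(data):
            return struct.unpack("<d", data)[0]
    return None
```

Any input that reaches the target branch decodes directly to a satisfying assignment for the original formula.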
Lidbury C, Donaldson AF, 2019, Sparse record and replay with controlled scheduling, 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2019), Publisher: ACM Press, Pages: 576-593
Modern applications include many sources of nondeterminism, e.g. due to concurrency, signals, and system calls that interact with the external environment. Finding and reproducing bugs in the presence of this nondeterminism has been the subject of much prior work in three main areas: (1) controlled concurrency testing, where a custom scheduler replaces the OS scheduler to find subtle bugs; (2) record and replay, where sources of nondeterminism are captured and logged so that a failing execution can be replayed for debugging purposes; and (3) dynamic analysis for the detection of data races. We present a dynamic analysis tool for C++ applications, tsan11rec, which brings these strands of work together by integrating controlled concurrency testing and record and replay into the tsan11 framework for C++11 data race detection. Our novel twist on record and replay is a sparse approach, where the sources of nondeterminism to record can be configured per application. We show that our approach is effective at finding subtle concurrency bugs in small applications; is competitive in terms of performance with the state-of-the-art record and replay tool rr on larger applications; succeeds (due to our sparse approach) in replaying the I/O-intensive Zandronum and QuakeSpasm video games, which are out of scope for rr; but (due to limitations of our sparse approach) cannot faithfully replay applications where memory layout nondeterminism significantly affects application behaviour.
Sorensen T, Pai S, Donaldson AF, 2019, Performance Evaluation of OpenCL Standard Support (and Beyond), 8th International Workshop on OpenCL (IWOCL), Publisher: ACM
Sorensen T, Evrard H, Donaldson AF, 2018, GPU schedulers: How fair is fair enough?, 29th International Conference on Concurrency Theory (CONCUR 2018), Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik, ISSN: 1868-8969
Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do run on today’s GPUs, reasoning about liveness properties of such applications is difficult as documentation is scarce and scattered. In this work, we aim to clarify fairness properties of semi-fair schedulers. To do this, we define a general temporal logic formula, based on weak fairness, parameterised by a predicate that enables fairness per-thread at certain points of an execution. We then define fairness properties for three GPU schedulers: HSA, OpenCL, and occupancy-bound execution. We examine existing GPU applications and show that none of the above schedulers are strong enough to provide the fairness properties required by these applications. It hence appears that existing GPU scheduler descriptions do not entirely capture the fairness properties that are provided on current GPUs. Thus, we present two new schedulers that aim to support existing GPU applications. We analyse the behaviour of common blocking idioms under each scheduler and show that one of our new schedulers allows a more natural implementation of a GPU protocol.
Magron V, Constantinides G, Donaldson AF, 2018, Certified Roundoff Error Bounds Using Semidefinite Programming, ACM Transactions on Mathematical Software, Vol: 43, ISSN: 1557-7295
Roundoff errors cannot be avoided when implementing numerical programs with finite precision. The ability to reason about rounding is especially important if one wants to explore a range of potential representations, for instance for FPGAs or custom hardware implementations. This problem becomes challenging when the program does not employ solely linear operations, as non-linearities are inherent to many interesting computational problems in real-world applications. Existing solutions to this reasoning problem can lead to either inaccurate bounds or high analysis time in the presence of nonlinear correlations between variables. Furthermore, while it is easy to implement a straightforward method such as interval arithmetic, sophisticated techniques are less straightforward to implement in a formal setting. Thus there is a need for methods which output certificates that can be formally validated inside a proof assistant. We present a framework to provide upper bounds on absolute roundoff errors of floating-point nonlinear programs. This framework is based on optimization techniques employing semidefinite programming and sums of squares certificates, which can be checked inside the Coq theorem prover to provide formal roundoff error bounds for polynomial programs. Our tool covers a wide range of nonlinear programs, including polynomials and transcendental operations as well as conditional statements. We illustrate the efficiency and precision of this tool on non-trivial programs coming from biology, optimization and space control. Our tool produces more accurate error bounds for 23% of all programs and yields better performance in 66% of all programs.
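The "straightforward method" the abstract mentions, interval arithmetic, and the correlation problem that motivates the semidefinite-programming approach can both be seen in a tiny sketch (the Interval class below is a hypothetical illustration, not part of the paper's tool):

```python
# Minimal interval-arithmetic sketch: track each quantity as a range
# [lo, hi] and widen at every operation.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # interval product: take the extremes of all endpoint products
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval(min(ps), max(ps))

# For x in [-1, 1], the naive bound for x*x + x is [-2, 2], although the
# true range is [-0.25, 2]: interval arithmetic ignores the correlation
# between the two occurrences of x, which is exactly the kind of nonlinear
# dependency where SDP-based certificates give tighter bounds.
x = Interval(-1.0, 1.0)
r = x * x + x
```

The overestimation grows with the degree of nonlinearity, which is why more sophisticated (and harder to formalise) techniques are needed for tight certified bounds.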
Ketema J, Donaldson AF, 2017, Termination analysis for GPU kernels, Science of Computer Programming, Vol: 148, Pages: 107-122, ISSN: 0167-6423
We describe a thread-modular technique for proving termination of massively parallel GPU kernels. The technique reduces the termination problem for these kernels to a sequential termination problem by abstracting the shared state, and as such allows us to leverage termination analysis techniques for sequential programs. An implementation in KITTeL is able to show termination of 94% of 604 kernels collected from various sources.
Donaldson AF, Evrard H, Lascu A, et al., 2017, Automated Testing of Graphics Shader Compilers, Object-oriented Programming, Systems, Languages, and Applications (OOPSLA), Publisher: ACM, ISSN: 2475-1421
We present an automated technique for finding defects in compilers for graphics shading languages. A key challenge in compiler testing is the lack of an oracle that classifies an output as correct or incorrect; this is particularly pertinent for graphics shader compilers, where the output is a rendered image that is typically under-specified. Our method builds on recent successful techniques for compiler validation based on metamorphic testing, and leverages existing high-value graphics shaders to create sets of transformed shaders that should be semantically equivalent. Rendering mismatches are then indicative of shader compilation bugs. Deviant shaders are automatically minimized to identify, in each case, a minimal change to an original high-value shader that induces a shader compiler bug. We have implemented the approach as a tool, GLFuzz, targeting the OpenGL shading language, GLSL. Our experiments over a set of 17 GPU and driver configurations, spanning the seven main GPU designers, have led to us finding and reporting more than 60 distinct bugs, covering all tested configurations. As well as defective rendering, these issues identify security-critical vulnerabilities that affect WebGL, including a significant remote information leak security bug where a malicious web page can capture the contents of other browser tabs, and a bug whereby visiting a malicious web page can lead to a ``blue screen of death'' under Windows 10. Our findings show that shader compiler defects are prevalent, and that metamorphic testing provides an effective means for detecting them automatically.
Donaldson AF, Ketema J, Sorensen T, et al., 2017, Forward progress on GPU concurrency, 28th International Conference on Concurrency Theory, Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik, ISSN: 1868-8969
The tutorial at CONCUR will provide a practical overview of work undertaken over the last six years in the Multicore Programming Group at Imperial College London, and with collaborators internationally, related to understanding and reasoning about concurrency in software designed for acceleration on GPUs. In this article we provide an overview of this work, which includes contributions to data race analysis, compiler testing, memory model understanding and formalisation, and most recently efforts to enable portable GPU implementations of algorithms that require forward progress guarantees.
Sorensen T, Evrard H, Donaldson AF, 2017, Cooperative kernels: GPU multitasking for blocking algorithms, 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Publisher: ACM
There is growing interest in accelerating irregular data-parallel algorithms on GPUs. These algorithms are typically blocking, so they require fair scheduling. But GPU programming models (e.g. OpenCL) do not mandate fair scheduling, and GPU schedulers are unfair in practice. Current approaches avoid this issue by exploiting scheduling quirks of today’s GPUs in a manner that does not allow the GPU to be shared with other workloads (such as graphics rendering tasks). We propose cooperative kernels, an extension to the traditional GPU programming model geared towards writing blocking algorithms. Workgroups of a cooperative kernel are fairly scheduled, and multitasking is supported via a small set of language extensions through which the kernel and scheduler cooperate. We describe a prototype implementation of a cooperative kernel framework implemented in OpenCL 2.0 and evaluate our approach by porting a set of blocking GPU applications to cooperative kernels and examining their performance under multitasking.
Liew D, Schemmel D, Cadar C, et al., 2017, Floating-Point Symbolic Execution: A Case Study in N-version Programming, IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), Publisher: IEEE
Betts A, Chong N, Deligiannis P, et al., 2018, Implementing and evaluating candidate-based invariant generation, IEEE Transactions on Software Engineering, Vol: 44, Pages: 631-650, ISSN: 1939-3520
The discovery of inductive invariants lies at the heart of static program verification. Presently, many automatic solutions to inductive invariant generation are inflexible, only applicable to certain classes of programs, or unpredictable. An automatic technique that circumvents these deficiencies to some extent is candidate-based invariant generation, whereby a large number of candidate invariants are guessed and then proven to be inductive or rejected using a sound program analyser. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker of programs that run on GPUs. We study a set of 383 GPU programs that contain loops, drawn from a number of open source suites and vendor SDKs. Among this set, 253 benchmarks require provision of loop invariants for verification to succeed. We describe the methodology we used to incrementally improve the invariant generation capabilities of GPUVerify to handle these benchmarks, through candidate-based invariant generation, whereby potential program invariants are speculated using cheap static analysis and subsequently either refuted or proven. We also describe a set of experiments that we used to examine the effectiveness of our rules for candidate generation, assessing rules based on their generality (the extent to which they generate candidate invariants), hit rate (the extent to which the generated candidates hold), effectiveness (the extent to which provable candidates actually help in allowing verification to succeed), and influence (the extent to which the success of one generation rule depends on candidates generated by another rule). We believe that our methodology for devising and evaluating candidate generation rules may serve as a useful framework for other researchers interested in candidate-based invariant generation. The candidates produced by GPUVerify help to verify 231 of these 253 programs. An increase in precision, however, has created sluggishness in GPUVerify because more candidates are generated.
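The guess-then-check loop described above can be sketched as a Houdini-style fixpoint: refuted candidates are dropped until the surviving set is inductive as a whole. The candidate syntax and checker interface below are hypothetical stand-ins, not GPUVerify's actual rules:

```python
def generate_candidates(variables):
    # cheap, syntax-driven guesses: e.g. non-negativity of each loop variable
    return [f"{v} >= 0" for v in variables]

def houdini(candidates, holds):
    # `holds(c, assumed)` asks a sound checker whether candidate c is
    # inductive when all candidates currently in `assumed` may be assumed.
    kept = list(candidates)
    changed = True
    while changed:
        changed = False
        for c in list(kept):
            if not holds(c, kept):
                kept.remove(c)  # refuted: drop and re-check the rest
                changed = True
    return kept
```

Because dropping one candidate can invalidate another that assumed it, the loop iterates to a fixpoint; the result is the largest subset of the guesses that is inductive together.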
Lidbury C, Donaldson AF, 2017, Dynamic race detection for C++11, ACM SIGPLAN Symposium on Principles of Programming Languages, Publisher: ACM
The intricate rules for memory ordering and synchronisation associated with the C/C++11 memory model mean that data races can be difficult to eliminate from concurrent programs. Dynamic data race analysis can pinpoint races in large and complex applications, but the state-of-the-art ThreadSanitizer (tsan) tool for C/C++ considers only sequentially consistent program executions, and does not correctly model synchronisation between C/C++11 atomic operations. We present a scalable dynamic data race analysis for C/C++11 that correctly captures C/C++11 synchronisation, and uses instrumentation to support exploration of a class of non sequentially consistent executions. We concisely define the memory model fragment captured by our instrumentation via a restricted axiomatic semantics, and show that the axiomatic semantics permits exactly those executions explored by our instrumentation. We have implemented our analysis in tsan, and evaluate its effectiveness on benchmark programs, enabling a comparison with the CDSChecker tool, and on two large and highly concurrent applications: the Firefox and Chromium web browsers. Our results show that our method can detect races that are beyond the scope of the original tsan tool, and that the overhead associated with applying our enhanced instrumentation to large applications is tolerable.
Donaldson AF, Gopalakrishnan G, Chong N, et al., 2017, Formal analysis techniques for reliable GPU programming: Current solutions and call to action, Advances in GPU Research and Practice, Pages: 3-21, ISBN: 9780128037386
Graphics processing units (GPU)-accelerated computing is being adopted increasingly in a number of areas, ranging from high-end scientific computing to mobile and embedded computing. While GPU programs routinely provide high computational throughput in a number of areas, they also prove to be notoriously difficult to write and optimize correctly, largely because of the subtleties of GPU concurrency. This chapter discusses several issues that make GPU programming difficult and examines recent progress on rigorous methods for formal analysis of GPU software. Our key observation is that given the fast-paced advances in GPU programming, the use of rigorous specification and verification methods must be an integral part of the culture of programming and training, and not an afterthought.
Sorensen T, Donaldson AF, Batty M, et al., 2016, Portable Inter-workgroup Barrier Synchronisation for GPUs, The 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, Publisher: ACM, Pages: 39-58
Despite the growing popularity of GPGPU programming, there is not yet a portable and formally-specified barrier that one can use to synchronise across workgroups. Moreover, the occupancy-bound execution model of GPUs breaks assumptions inherent in traditional software execution barriers, exposing them to deadlock. We present an occupancy discovery protocol that dynamically discovers a safe estimate of the occupancy for a given GPU and kernel, allowing for a starvation-free (and hence, deadlock-free) inter-workgroup barrier by restricting the number of workgroups according to this estimate. We implement this idea by adapting an existing, previously non-portable, GPU inter-workgroup barrier to use OpenCL 2.0 atomic operations, and prove that the barrier meets its natural specification in terms of synchronisation. We assess the portability of our approach over eight GPUs spanning four vendors, comparing the performance of our method against alternative methods. Our key findings include: (1) the recall of our discovery protocol is nearly 100%; (2) runtime comparisons vary substantially across GPUs and applications; and (3) our method provides portable and safe inter-workgroup synchronisation across the applications we study.
Liew D, Cadar C, Donaldson A, 2016, Symbooglix: A symbolic execution engine for boogie programs, IEEE International Conference on Software Testing, Verification, and Validation, Publisher: IEEE
We present the design and implementation of Symbooglix, a symbolic execution engine for the Boogie intermediate verification language. Symbooglix aims to find bugs in Boogie programs efficiently, providing bug-finding capabilities for any program analysis framework that uses Boogie as a target language. We discuss the technical challenges associated with handling Boogie, and describe how we optimised Symbooglix using a small training set of benchmarks. This empirically-driven optimisation approach avoids over-fitting Symbooglix to our benchmarks, enabling a fair comparison with other tools. We present an evaluation across 3749 Boogie programs generated from the SV-COMP suite of C programs using the SMACK front-end, and 579 Boogie programs originating from several OpenCL and CUDA GPU benchmark suites, translated by the GPUVerify front-end. Our results show that Symbooglix significantly outperforms Boogaloo, an existing symbolic execution tool for Boogie, and is competitive with GPUVerify on benchmarks for which GPUVerify is highly optimised. While generally less effective than the Corral and Duality tools on the SV-COMP suite, Symbooglix is complementary to them in terms of bug-finding ability.
Sorensen T, Donaldson AF, 2016, Exposing errors related to weak memory in GPU applications, 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, Publisher: ACM, Pages: 100-113
We present the systematic design of a testing environment that uses stressing and fuzzing to reveal errors in GPU applications that arise due to weak memory effects. We evaluate our approach on seven GPUs spanning three Nvidia architectures, across ten CUDA applications that use fine-grained concurrency. Our results show that applications that rarely or never exhibit errors related to weak memory when executed natively can readily exhibit these errors when executed in our testing environment. Our testing environment also provides a means to help identify the root causes of such errors, and automatically suggests how to insert fences that harden an application against weak memory bugs. To understand the cost of GPU fences, we benchmark applications with fences provided by the hardening strategy as well as a more conservative, sound fencing strategy.
Donaldson AF, Lascu A, 2016, Metamorphic testing for (graphics) compilers, 1st IEEE/ACM International Workshop on Metamorphic Testing (MET), Publisher: ACM, Pages: 44-47
We present strategies for metamorphic testing of compilers using opaque value injection, and experiences using the method to test compilers for the OpenGL shading language.
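Opaque value injection can be illustrated outside the GLSL setting with a small sketch: a literal is rebuilt from expressions that a compiler cannot trivially constant-fold, and the metamorphic relation is that both program variants must agree on every input. The specific opaque expressions below are illustrative, not taken from the paper:

```python
import math

# Sketch of metamorphic testing via opaque value injection: rewrite a
# program so that literals are replaced by opaque expressions with
# identical values. A correct compiler must produce the same result for
# both versions; any mismatch signals a compiler bug.

def original(x):
    return x * 2.0 + 1.0

def transformed(x):
    two = math.sqrt(4.0)   # sqrt is correctly rounded, so this is exactly 2.0
    one = math.cos(0.0)    # exactly 1.0
    return x * two + one

def metamorphic_check(inputs):
    # the metamorphic relation: outputs of the two variants must agree
    return all(original(x) == transformed(x) for x in inputs)
```

In the graphics setting the "output" compared is a rendered image rather than a number, but the oracle is the same: semantically equivalent shaders must render identically.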
Sorensen T, Donaldson AF, 2016, The Hitchhiker's Guide to Cross-Platform OpenCL Application Development, 4th International Workshop on OpenCL (IWOCL '16), Publisher: ACM
One of the benefits of programming in OpenCL is platform portability. That is, an OpenCL program that follows the OpenCL specification should, in principle, execute reliably on any platform that supports OpenCL. To assess the current state of OpenCL portability, we provide an experience report examining two sets of open source benchmarks that we attempted to execute across a variety of GPU platforms, via OpenCL. We report on the portability issues we encountered, where applications would execute successfully on one platform but fail on another. We classify issues into three groups: (1) framework bugs, where the vendor-provided OpenCL framework fails; (2) specification limitations, where the OpenCL specification is unclear and where different GPU platforms exhibit different behaviours; and (3) programming bugs, where non-portability arises due to the program exercising behaviours that are incorrect or undefined according to the OpenCL specification. The issues we encountered slowed the development process associated with our sets of applications, but we view the issues as providing exciting motivation for future testing and verification efforts to improve the state of OpenCL portability; we conclude with a discussion of these.
Pflanzer M, Donaldson AF, Lascu A, 2016, Automatic Test Case Reduction for OpenCL, 4th International Workshop on OpenCL (IWOCL '16), Publisher: ACM
We report on an extension to the C-Reduce tool, for automatic reduction of C test cases, to handle OpenCL kernels. This enables an automated method for detecting bugs in OpenCL compilers, by generating large random kernels using the CLsmith generator, identifying kernels that yield result differences across OpenCL platforms and optimisation levels, and using our novel extension to C-Reduce to automatically reduce such kernels to minimal forms that can be filed as bug reports. A major part of our effort involved the design of ShadowKeeper, a new plugin for the Oclgrind simulator that provides accurate detection of accesses to uninitialised data. We present experimental results showing the effectiveness of our method for finding bugs in a number of OpenCL compilers.
Deligiannis P, McCutchen M, Thomson P, et al., 2016, Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!), 14th USENIX Conference on File and Storage Technologies, Publisher: USENIX
Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors. Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging.
Thomson P, Donaldson AF, Betts A, 2016, Concurrency testing using controlled schedulers: an empirical study, ACM Transactions on Parallel Computing, Vol: 2, ISSN: 2329-4949
We present an independent empirical study on concurrency testing using controlled schedulers. We have gathered 49 buggy concurrent software benchmarks, drawn from public code bases, which we call SCTBench. We applied a modified version of an existing concurrency testing tool to SCTBench, testing five controlled scheduling techniques: depth-first search, preemption bounding, delay bounding, a controlled random scheduler, and probabilistic concurrency testing (PCT). We attempt to answer several research questions: Which technique performs the best, in terms of bug finding ability? How effective are the two main schedule bounding techniques—preemption bounding and delay bounding—at finding bugs? What challenges are associated with applying concurrency testing techniques to existing code? Can we classify certain benchmarks as trivial or nontrivial? Overall, we found that PCT (with parameter d = 3) was the most effective technique in terms of bug finding; it found all bugs found by the other techniques, plus an additional three, and it missed only one bug. Surprisingly, we found that the naive controlled random scheduler, which randomly chooses one thread to execute at each scheduling point, performs well, finding more bugs than preemption bounding and just two fewer bugs than delay bounding. Our findings confirm that delay bounding is superior to preemption bounding and that schedule bounding is superior to an unbounded depth-first search. The majority of bugs in SCTBench can be exposed using a small schedule bound (1--2), supporting previous claims, although one benchmark requires five preemptions. We found that the need to remove nondeterminism and control all synchronization (as is required for systematic concurrency testing) can be nontrivial. There were eight distinct programs that could not easily be included in our study, such as those that perform network and interprocess communication. We report various properties about the benchmarks tested.
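The PCT technique found most effective above can be sketched as follows. This is a simplified rendering of the priority-based algorithm (all threads assumed always enabled), not the evaluated implementation:

```python
import random

# PCT sketch: give each thread a random high priority, pick d-1 random
# steps at which the running thread's priority is lowered, and always run
# the highest-priority thread. With n threads and k steps, this finds any
# bug of "depth" d with probability at least 1/(n * k^(d-1)).
def pct_schedule(num_threads, num_steps, d, seed=0):
    rng = random.Random(seed)
    # initial priorities: a random permutation of d, d+1, ..., d+n-1
    prios = dict(zip(range(num_threads),
                     rng.sample(range(d, d + num_threads), num_threads)))
    # d-1 priority change points, carrying low priorities d-1, d-2, ..., 1
    points = dict(zip(rng.sample(range(num_steps), d - 1),
                      range(d - 1, 0, -1)))
    schedule = []
    for step in range(num_steps):
        t = max(prios, key=prios.get)  # run the highest-priority thread
        schedule.append(t)
        if step in points:
            prios[t] = points[step]    # demote below all initial priorities
    return schedule
```

Each demotion forces a context switch at a randomly chosen point, which is what lets PCT hit interleavings that bounded or naive-random search can miss.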
Cadar C, Donaldson A, 2016, Analysing the Program Analyser, International Conference on Software Engineering, Visions of 2025 and Beyond Track, Publisher: ACM
It is the year 2025. FutureCorp, the private-sector defence contractor to whom the US government has outsourced its management of nuclear weapons, has just had its missile control software hijacked by terrorists. It is only a matter of hours before Armageddon. The CEO of FutureCorp, Dr F. Methods, is incredulous: “This is impossible”, he told one of our reporters. “We used program analysis to formally prove that the software was secure!”. “Ah, Dr Methods,” responded open-source developer Mr B. Door, “But did you check the analyser?”
Batty M, Donaldson AF, Wickerson JP, 2016, Overhauling SC Atomics in C11 and OpenCL, ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), Publisher: ACM, Pages: 634-648, ISSN: 0362-1340
Despite the conceptual simplicity of sequential consistency (SC), the semantics of SC atomic operations and fences in the C11 and OpenCL memory models is subtle, leading to convoluted prose descriptions that translate to complex axiomatic formalisations. We conduct an overhaul of SC atomics in C11, reducing the associated axioms in both number and complexity. A consequence of our simplification is that the SC operations in an execution no longer need to be totally ordered. This relaxation enables, for the first time, efficient and exhaustive simulation of litmus tests that use SC atomics. We extend our improved C11 model to obtain the first rigorous memory model formalisation for OpenCL (which extends C11 with support for heterogeneous many-core programming). In the OpenCL setting, we refine the SC axioms still further to give a sensible semantics to SC operations that employ a ‘memory scope’ to restrict their visibility to specific threads. Our overhaul requires slight strengthenings of both the C11 and the OpenCL memory models, causing some behaviours to become disallowed. We argue that these strengthenings are natural, and that all of the formalised C11 and OpenCL compilation schemes of which we are aware (Power and x86 CPUs for C11, AMD GPUs for OpenCL) remain valid in our revised models. Using the HERD memory model simulator, we show that our overhaul leads to an exponential improvement in simulation time for C11 litmus tests compared with the original model, making *exhaustive* simulation competitive, time-wise, with the *non-exhaustive* CDSChecker tool.
Deligiannis P, Donaldson AF, Rakamaric Z, 2015, Fast and Precise Symbolic Analysis of Concurrency Bugs in Device Drivers, 30th IEEE/ACM International Conference on Automated Software Engineering, Publisher: IEEE, Pages: 166-177
Concurrency errors, such as data races, make device drivers notoriously hard to develop and debug without automated tool support. We present Whoop, a new automated approach that statically analyzes drivers for data races. Whoop is empowered by symbolic pairwise lockset analysis, a novel analysis that can soundly detect all potential races in a driver. Our analysis avoids reasoning about thread interleavings and thus scales well. Exploiting the race-freedom guarantees provided by Whoop, we achieve a sound partial-order reduction that significantly accelerates Corral, an industrial-strength bug-finder for concurrent programs. Using the combination of Whoop and Corral, we analyzed 16 drivers from the Linux 4.0 kernel, achieving 1.5 -- 20× speedups over standalone Corral.
Baghdadi R, Beaugnon U, Cohen A, et al., 2015, PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming, 2015 International Conference on Parallel Architecture and Compilation, Publisher: IEEE, Pages: 138-149
Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99—enriched with additional language constructs—that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.
Wickerson J, Batty M, Beckmann BM, et al., 2015, Remote-Scope Promotion: Clarified, Rectified, and Verified, ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Pages: 731-747, ISSN: 1523-2867
Modern accelerator programming frameworks, such as OpenCL™, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the HERD memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon.
We address the compiler correctness problem for many-core systems through novel applications of fuzz testing to OpenCL compilers. Focusing on two methods from prior work, random differential testing and testing via equivalence modulo inputs (EMI), we present several strategies for random generation of deterministic, communicating OpenCL kernels, and an injection mechanism that allows EMI testing to be applied to kernels that otherwise exhibit little or no dynamically-dead code. We use these methods to conduct a large, controlled testing campaign with respect to 21 OpenCL (device, compiler) configurations, covering a range of CPU, GPU, accelerator, FPGA and emulator implementations. Our study provides independent validation of claims in prior work related to the effectiveness of random differential testing and EMI testing, proposes novel methods for lifting these techniques to the many-core setting and reveals a significant number of OpenCL compiler bugs in commercial implementations.
Deligiannis P, Donaldson AF, Ketema J, et al., 2015, Asynchronous programming, analysis and testing with state machines, 36th annual ACM SIGPLAN conference on Programming Language Design and Implementation, Publisher: ACM, Pages: 154-164
Programming efficient asynchronous systems is challenging because it can often be hard to express the design declaratively, or to defend against data races and interleaving-dependent assertion violations. Previous work has only addressed these challenges in isolation, by either designing a new declarative language, a new data race detection tool or a new testing technique. We present P#, a language for high-reliability asynchronous programming co-designed with a static data race analysis and systematic concurrency testing infrastructure. We describe our experience using P# to write several distributed protocols and port an industrial-scale system internal to Microsoft, showing that the combined techniques, by leveraging the design of P#, are effective in finding bugs.
Betts A, Chong N, Donaldson AF, et al., 2015, The Design and Implementation of a Verification Technique for GPU Kernels, ACM Transactions on Programming Languages and Systems, Vol: 37, ISSN: 1558-4593
We present a technique for the formal verification of GPU kernels, addressing two classes of correctness properties: data races and barrier divergence. Our approach is founded on a novel formal operational semantics for GPU kernels termed synchronous, delayed visibility (SDV) semantics, which captures the execution of a GPU kernel by multiple groups of threads. The SDV semantics provides operational definitions for barrier divergence and for both inter- and intra-group data races. We build on the semantics to develop a method for reducing the task of verifying a massively parallel GPU kernel to that of verifying a sequential program. This completely avoids the need to reason about thread interleavings, and allows existing techniques for sequential program verification to be leveraged. We describe an efficient encoding of data race detection and propose a method for automatically inferring the loop invariants that are required for verification. We have implemented these techniques as a practical verification tool, GPUVerify, that can be applied directly to OpenCL and CUDA source code. We evaluate GPUVerify with respect to a set of 162 kernels drawn from public and commercial sources. Our evaluation demonstrates that GPUVerify is capable of efficient, automatic verification of a large number of real-world kernels.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.