Soudris D, Papadopoulos L, Kessler CW, et al., 2018, EXA2PRO programming environment: Architecture and applications, Pages: 202-209
© 2018 ACM. The EXA2PRO programming environment will integrate a set of tools and methodologies that allow many exascale computing challenges to be addressed systematically, including performance, performance portability, programmability, abstraction and reusability, fault tolerance and technical debt. The EXA2PRO tool-chain will enable the efficient deployment of applications on exascale computing systems by integrating high-level software abstractions that offer performance portability and efficient exploitation of the heterogeneity of exascale systems, tools for efficient memory management, optimizations based on trade-offs between various metrics, and fault-tolerance support. Hence, by addressing various aspects of the productivity challenge, EXA2PRO is expected to have significant impact on the transition to exascale computing, as well as impact from the perspective of applications. The evaluation will be based on 4 applications from 4 different domains that will be deployed at the JUELICH supercomputing center. EXA2PRO will generate exploitable results in the form of a tool-chain that supports diverse heterogeneous exascale supercomputing centers, together with concrete improvements on various exascale computing challenges.
Cristal A, Unsal OS, Martorell X, et al., 2018, LEGaTO: First steps towards energy-efficient toolset for heterogeneous computing, Pages: 210-217
© 2018 ACM. LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.
Cristal A, Unsal OS, Martorell X, et al., 2018, LEGaTO: Towards energy-efficient, secure, fault-tolerant toolset for heterogeneous computing, Pages: 276-278
© 2018 Association for Computing Machinery. LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.
© 2018 by Taylor & Francis Group, LLC. Computer technology has become an essential driver for the financial industry in almost all of its areas. Advances in hardware and software technology, novel numerical methods, financial models, and algorithms have made computers a key technology that became essential for all financial institutions. High-performance computing (HPC) systems are widely used to price financial products or to quickly calculate the risk of complex portfolios. Often, the available computational power determines the types of problems that can be practically solved. Being able to handle a more complex problem or to obtain the results faster than all other organizations directly translates into a competitive advantage.
Voss N, Bacis M, Mencer O, et al., 2017, Convolutional Neural Networks on Dataflow Engines, 35th IEEE International Conference on Computer Design (ICCD), Publisher: IEEE, Pages: 435-438, ISSN: 1063-6404
In this paper we discuss a high-performance implementation of Convolutional Neural Network (CNN) inference on the latest generation of Dataflow Engines (DFEs). We discuss the architectural choices made during the design phase, taking into account the DFE chip properties. We then perform design space exploration, considering the memory bandwidth and resource utilisation constraints derived from the DFE used and the chosen architecture. Finally, we discuss the high-performance implementation and compare the obtained performance against other implementations, showing that our proposed design reaches 2,450 GOPS when running VGG16 as a test case.
Cooper B, Girdlestone S, Burovskiy P, et al., 2017, Quantum Chemistry in Dataflow: Density-Fitting MP2., Journal of Chemical Theory and Computation, Vol: 13, Pages: 5265-5272, ISSN: 1549-9618
We demonstrate the use of dataflow technology in the computation of the correlation energy in molecules at the Møller-Plesset perturbation theory (MP2) level. Specifically, we benchmark density-fitting (DF)-MP2 for as many as 168 atoms (in valinomycin) and show that speed-ups between 3 and 3.8 times can be achieved when compared to the MOLPRO package run on a single CPU. Acceleration is achieved by offloading the matrix multiplication steps in DF-MP2 to Dataflow Engines (DFEs). We project that the acceleration factor could be as much as 24 with the next generation of DFEs.
Koliogeorgi K, Masouros D, Zervakis G, et al., 2017, AEGLE's Cloud Infrastructure for Resource Monitoring and Containerized Accelerated Analytics, IEEE Computer Society Annual Symposium on VLSI, ISVLSI, Pages: 362-367, ISSN: 2159-3469
© 2017 IEEE. This paper presents the cloud infrastructure of the AEGLE project, which aims to integrate cloud technologies with heterogeneous reconfigurable computing in large-scale healthcare systems for Big Bio-Data analytics. AEGLE's engineering concept brings together established big-data engines and emerging acceleration technologies, laying the basis for personalized and integrated healthcare services while also promoting related research activities. We introduce the design of AEGLE's accelerated infrastructure along with the corresponding software and hardware acceleration stacks to support various big-data analytics workloads, showing that through effective resource containerization AEGLE's cloud infrastructure is able to support high heterogeneity with regard to storage types, execution engines, utilized tools and execution platforms. Special care is given to the integration of high-performance accelerators within the overall software stack of AEGLE's infrastructure, enabling efficient execution of analytics with speed-ups of up to 140× over pure software execution according to our preliminary evaluations.
© 2017 IEEE. Exascale computing is facing a gap between the ever-increasing demand for application performance and the underlying chip technology, which no longer delivers the expected exponential increases in CPU performance. The industry is now progressively moving towards dedicated accelerators to deliver high performance and better energy efficiency. However, the question of programmability remains. To address this challenge we propose a dedicated high-level accelerator programming and execution model where performance and efficiency are primary targets. Our model splits the computation into a conventional CPU-oriented part and a highly efficient, fully programmable dataflow part. We present a number of systematic transformations and optimisations targeting Maxeler dataflow systems that typically yield one to two orders of magnitude improvements in terms of both performance and energy efficiency. These significant gains are enabled by addressing fundamental algorithmic properties and on-demand numerical requirements. This approach is demonstrated by a case study from computational finance.
Ciobanu CB, Gaydadjiev G, Pilato C, et al., 2017, The Case for Polymorphic Registers in Dataflow Computing, International Journal of Parallel Programming, Pages: 1-35, ISSN: 0885-7458
© 2017 The Author(s). Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data-parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to exploit them efficiently. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput by up to 56.17× and show that the PRF-augmented system outperforms the GPU for sufficiently large mask sizes, even in bandwidth-constrained systems.
DeMara RF, Gaydadjiev G, 2017, RAW Keynote Speakers, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Publisher: IEEE
Trifunovic N, Palikareva H, Becker T, et al., 2017, Cloud deployment and management of dataflow engines
© 2017 Copyright held by the owner/author(s). Maxeler Technologies successfully commercialises high-performance computing systems based on dataflow technology. Maxeler dataflow computers have been deployed in a wide range of application domains including financial data analytics, geoscience and low-latency transaction processing. In the context of cloud computing steadily growing acceptance in new domains, we illustrate how Maxeler dataflow systems can be integrated and employed in a self-organising self-managing heterogeneous cloud environment.
Voss N, Becker T, Mencer O, et al., 2017, Rapid development of gzip with MaxJ, 13th International Symposium on Applied Reconfigurable Computing (ARC), Publisher: Springer Verlag, Pages: 60-71, ISSN: 0302-9743
Design productivity is essential for high-performance application development involving accelerators. Low-level hardware description languages such as Verilog and VHDL are widely used to design FPGA accelerators; however, they require significant expertise and considerable design effort. Recent advances in high-level synthesis have brought forward tools that relieve the burden of FPGA application development, but the achieved performance results cannot match those of designs made using low-level languages. In this paper we compare different FPGA implementations of gzip. All of them implement the same system architecture using different languages. This allows us to compare Verilog, OpenCL and MaxJ design productivity. First, we illustrate several conceptual advantages of the MaxJ language and its platform over OpenCL. Next, using our gzip implementation as an example, we show how an engineer without previous MaxJ experience can quickly develop and optimize a real, complex application. The gzip design in MaxJ presented here took only one man-month to develop and achieved better performance than the related work created in Verilog and OpenCL.
© 2016 IEEE. In this paper we present several algorithms used to construct a tool that automatically optimizes static dataflow graphs for the purpose of high level hardware synthesis. Our target is to automatically merge multiple dataflow graphs in order to create a single structure implementing all distinct operations with minimal area overhead by time-slicing hardware resources. We show that a combination of dedicated optimizations and a simple greedy approach for graph merging reduces the overall area by up to 4x compared to a naive hardware implementation.
Stroobandt D, Varbanescu AL, Ciobanu CB, et al., 2016, EXTRA: Towards the exploitation of eXascale technology for reconfigurable architectures, ReCoSoC 2016, Publisher: IEEE
To handle the stringent performance requirements of future exascale-class applications, High Performance Computing (HPC) systems need ultra-efficient heterogeneous compute nodes. To reduce power and increase performance, such compute nodes will require hardware accelerators with a high degree of specialization. Ideally, dynamic reconfiguration will be an intrinsic feature, so that specific HPC application features can be optimally accelerated, even if they regularly change over time. In the EXTRA project, we create a new and flexible exploration platform for developing reconfigurable architectures, design tools and HPC applications with run-time reconfiguration built-in as a core fundamental feature instead of an add-on. EXTRA covers the entire stack from architecture up to the application, focusing on the fundamental building blocks for run-time reconfigurable exascale HPC systems: new chip architectures with very low reconfiguration overhead, new tools that truly take reconfiguration as a central design concept, and applications that are tuned to maximally benefit from the proposed run-time reconfiguration techniques. Ultimately, this open platform will improve Europe's competitive advantage and leadership in the field.
Gu X, Angelov PP, Ali AM, et al., 2016, Online evolving fuzzy rule-based prediction model for high frequency trading financial data stream, Pages: 169-175
© 2016 IEEE. Analyzing and predicting high-frequency trading (HFT) financial data streams is very challenging due to the fast arrival times and large number of data samples. To address this problem, an online evolving fuzzy rule-based prediction model is proposed in this paper. Because this prediction model is based on evolving fuzzy rule-based systems and a novel, simpler form of data density, it can autonomously learn from the live data stream, automatically build/remove its rules and recursively update its parameters. The model responds quickly to unpredictable sudden changes in the financial data and re-adjusts itself to follow the new data pattern. Experimental results show the excellent prediction performance of the proposed approach on a real financial data stream, regardless of quick shifts in data patterns and frequent appearances of abnormal data samples.
Kachris C, Gaydadjiev G, Nguyen HN, et al., 2016, The VINEYARD project: Versatile integrated accelerator-based heterogeneous data centres
© 2016 IEEE. Emerging applications like cloud computing and big data analytics have created the need for powerful data centers hosting hundreds of thousands of servers. Currently, data centers are based on general-purpose processors that provide high flexibility but lack the energy efficiency of customized accelerators. VINEYARD aims to develop novel servers based on programmable hardware accelerators. Furthermore, VINEYARD will develop an integrated framework allowing end-users to seamlessly utilize these accelerators in heterogeneous computing systems by using typical data-center programming frameworks (i.e. Spark). VINEYARD will foster the expansion of the soft-IP core industry, currently limited to embedded systems, to the data-center market. VINEYARD plans to demonstrate the advantages of its approach in three real use cases: a) a bio-informatics application for high-accuracy brain modeling, b) two critical financial applications and c) a big-data analysis application.
Lynn T, Xiong H, Dong D, et al., 2016, CLOUDLIGHTNING: A framework for a self-organising and self-managing heterogeneous cloud, 6th International Conference on Cloud Computing and Services Science, Publisher: Scitepress Digital Library, Pages: 333-338
As clouds increase in size and as machines of different types are added to the infrastructure in order to maximize performance and power efficiency, heterogeneous clouds are being created. However, exploiting different architectures poses significant challenges. To efficiently access heterogeneous resources and, at the same time, to exploit these resources to reduce application development effort, to make optimisations easier and to simplify service deployment, requires a re-evaluation of our approach to service delivery. We propose a novel cloud management and delivery architecture based on the principles of self-organisation and self-management that shifts the deployment and optimisation effort from the consumer to the software stack running on the cloud infrastructure. Our goal is to address inefficient use of resources and consequently to deliver savings to the cloud provider and consumer in terms of reduced power consumption and improved service delivery, with hyperscale systems particularly in mind. The framework is general but also endeavours to enable cloud services for high performance computing. Infrastructure-as-a-Service provision is the primary use case, however, we posit that genomics, oil and gas exploration, and ray tracing are three downstream use cases that will benefit from the proposed architecture.
This paper describes the vision behind and the mission of the Maxeler Application Gallery (AppGallery.Maxeler.com) project. First, it concentrates on the essence and performance advantages of the Maxeler dataflow approach. Second, it reviews the support technologies that enable the dataflow approach to achieve its full potential. Third, selected examples from the Maxeler Application Gallery are presented; these examples are treated as the final achievement made possible when all the support technologies are put to work together (the internal infrastructure of AppGallery.Maxeler.com is given in a follow-up paper). Finally, the possible impact of the Application Gallery is presented and the major conclusions are drawn.
Friston S, Steed A, Tilbury S, et al., 2016, Construction and Evaluation of an Ultra Low Latency Frameless Renderer for VR., IEEE Transactions on Visualization and Computer Graphics, Vol: 22, Pages: 1377-1386, ISSN: 1941-0506
Latency - the delay between a user's action and the response to this action - is known to be detrimental to virtual reality. Latency is typically considered to be a discrete value characterising a delay, constant in time and space - but this characterisation is incomplete. Latency changes across the display during scan-out, and how it does so depends on the rendering approach used. In this study, we present an ultra-low latency real-time ray-casting renderer for virtual reality, implemented on an FPGA. Our renderer has a latency of ~1 ms from 'tracker to pixel'. Its frameless nature means that the region of the display with the lowest latency immediately follows the scan-beam. This is in contrast to frame-based systems such as those using typical GPUs, for which the latency increases as scan-out proceeds. Using a series of high- and low-speed videos of our system in use, we confirm its latency of ~1 ms. We examine how the renderer performs when driving a traditional sequential scan-out display on a readily available HMD, the Oculus Rift DK2. We contrast this with an equivalent apparatus built using a GPU. Using captured human head motion and a set of image quality measures, we assess the ability of these systems to faithfully recreate the stimuli of an ideal virtual reality system - one with a zero-latency tracker, renderer and display running at 1 kHz. Finally, we examine the results of these quality measures, and how each rendering approach is affected by velocity of movement and display persistence. We find that our system, with a lower average latency, can more faithfully draw what the ideal virtual reality system would. Further, we find that with low display persistence, the sensitivity to velocity of both systems is lowered, but that it is much lower for ours.
Kachris C, Soudris D, Gaydadjiev G, et al., 2016, The VINEYARD approach: Versatile, integrated, accelerator-based, heterogeneous data centres, 12th International Symposium, ARC 2016, Publisher: Springer, Pages: 3-13, ISSN: 0302-9743
Emerging web applications like cloud computing, Big Data and social networks have created the need for powerful data centres hosting hundreds of thousands of servers. Currently, data centres are based on general-purpose processors that provide high flexibility but lack the energy efficiency of customized accelerators. VINEYARD aims to develop an integrated platform for energy-efficient data centres based on new servers with novel, coarse-grain and fine-grain, programmable hardware accelerators. It will also build a high-level programming framework allowing end-users to seamlessly utilize these accelerators in heterogeneous computing systems by employing typical data-centre programming frameworks (e.g. MapReduce, Storm, Spark, etc.). This programming framework will further allow the hardware accelerators to be swapped in and out of the heterogeneous infrastructure so as to offer high flexibility and energy efficiency. VINEYARD will foster the expansion of the soft-IP core industry, currently limited to embedded systems, to the data-centre market. VINEYARD plans to demonstrate the advantages of its approach in three real use cases: (a) a bio-informatics application for high-accuracy brain modeling, (b) two critical financial applications, and (c) a big-data analysis application.
Becker T, Mencer O, Gaydadjiev G, 2016, Spatial programming with OpenSPL, FPGAs for Software Programmers, Pages: 81-95, ISBN: 9783319264066
© Springer International Publishing Switzerland 2016. In this chapter we present OpenSPL, a novel programming language that enables designers to describe their computational structures in space and benefit from parallelism at multiple levels. We start in Sect. 5.1 with our motivation for why spatial programming is currently among the most promising approaches for building future computing systems. In Sect. 5.2 we introduce the basic principles behind OpenSPL and illustrate them with a few simple examples targeting the first commercial Spatial Computer system offered by Maxeler Technologies. We validate the potential of Spatial Computers in Sect. 5.3 and conclude in Sect. 5.4.
Rodrigues D, Nazarian G, Moreira Á, et al., 2015, A non-conservative software-based approach for detecting illegal CFEs caused by transient faults, Pages: 221-226
© 2015 IEEE. Software-based methods for the detection of control-flow errors caused by transient faults usually consist of introducing protecting instructions at both the beginning and the end of basic blocks. These methods are conservative in nature, in the sense that they assume that all blocks have the same probability of being the target of control-flow errors. Because of that assumption they can lead to a considerable increase in both memory and performance overhead at execution time. In this paper, we propose a static analysis that provides more refined information about which basic blocks can be the target of control-flow errors caused by single-bit flips. This information can then be used to guide a program transformation in which only susceptible blocks have to be protected. We implemented the static analysis and program transformation in the context of the LLVM framework and performed an extensive fault-injection campaign. Our experiments show that this less conservative approach can potentially lead to gains in both memory usage and execution time while maintaining high fault coverage.
© 2015 Imperial College. Reconfigurable hardware has been used before for low-latency image synthesis, typically in low-level implementations with tight vertical integration. For example, in the apparatus of both Regan et al. and Ng et al., the tracker was driven by the same device performing the rendering. Reconfigurable hardware combined with the dataflow programming model can make application-specific rendering hardware cost effective. Our sprite renderer has comparable scope to both prior examples, but our dataflow graph can be adapted to other use cases with an effort comparable to GPU shader programming.
Pnevmatikatos D, Papadimitriou K, Becker T, et al., 2015, FASTER: Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration, MICROPROCESSORS AND MICROSYSTEMS, Vol: 39, Pages: 321-338, ISSN: 0141-9331
Nazarian G, Nane R, Gaydadjiev GN, 2015, Low-cost Software Control-Flow Error Recovery, 18th Euromicro Conference on Digital System Design (DSD), Publisher: IEEE, Pages: 510-517
Nazarian G, Rodrigues DG, Moreira A, et al., 2015, Bit-flip aware control-flow error detection, 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Publisher: IEEE, Pages: 215-221
Ciobanu CB, Varbanescu AL, Pnevmatikatos D, et al., 2015, EXTRA: Towards an Efficient Open Platform for Reconfigurable High Performance Computing, 18th IEEE International Conference on Computational Science and Engineering (CSE), Publisher: IEEE, Pages: 339-342
Becker T, Mencer O, Weston S, et al., 2015, Maxeler Data-Flow in Computational Finance, FPGA Based Accelerators for Financial Applications, Publisher: Springer International Publishing, Pages: 243-266, ISBN: 9783319154060
Riemens DP, Gaydadjiev GN, de Zeeuw CI, et al., 2014, Towards Scalable Arithmetic Units with Graceful Degradation, ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS, Vol: 13, ISSN: 1539-9087
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.