Imperial College London

Dr Holger Pirk

Faculty of Engineering, Department of Computing

Senior Lecturer
 
 
 

Contact

 

+44 (0)20 7594 3008 | pirk | Website

 
 

Location

 

431 Huxley Building, South Kensington Campus



 

Publications


38 results found

Mohr-Daurat H, Sun X, Pirk H, 2024, BOSS - An architecture for database kernel composition, VLDB 2024, Publisher: VLDB Endowment, Pages: 877-890, ISSN: 2150-8097

Composable database system research has yielded components such as Apache Arrow for storage, Meta’s Velox for processing and Apache Calcite for query planning. What is lacking, however, is a design for a general, efficient and easy-to-use architecture to connect them. We propose such an architecture. Our proposal is based on the ideas of partial query evaluation and a carefully designed, unified exchange format for query plans and data. We implement the architecture in a system called BOSS that combines Apache Arrow, the GPU-accelerated compute kernel ArrayFire and the CPU-oriented Velox kernel into a fully-featured relational Data Management System (DMS). We demonstrate that the architecture is general enough to incorporate practically any DMS component, easy to use and virtually overhead-free. Based on the architecture, BOSS achieves significant performance improvements over the CPU-only Velox kernel and even outperforms the highly-optimized GPU-only DMS HeavyDB for some queries.
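
To make the idea of partial query evaluation over a unified plan/data format concrete, here is a minimal Python sketch. The `Expr` format and the two toy engines are invented for illustration; they are not BOSS's actual API.

```python
from dataclasses import dataclass

@dataclass
class Expr:
    head: str     # operator name, e.g. "Join" or "Select"
    args: tuple   # sub-expressions or literal data

def partially_evaluate(expr, supported, evaluate):
    """Evaluate the sub-expressions this engine supports; leave the rest."""
    if not isinstance(expr, Expr):
        return expr  # literal data passes through unchanged
    args = tuple(partially_evaluate(a, supported, evaluate) for a in expr.args)
    expr = Expr(expr.head, args)
    if expr.head in supported and not any(isinstance(a, Expr) for a in args):
        return evaluate(expr)  # fully reducible by this engine
    return expr                # leave the remainder for the next engine

# Two toy "kernels": one only knows Plus, the other only Times.
plus_engine = lambda e: sum(e.args)
times_engine = lambda e: e.args[0] * e.args[1]

plan = Expr("Times", (Expr("Plus", (1, 2)), 4))
plan = partially_evaluate(plan, {"Plus"}, plus_engine)    # -> Times(3, 4)
plan = partially_evaluate(plan, {"Times"}, times_engine)  # -> 12
print(plan)
```

Each engine reduces the sub-expressions it understands and hands the remaining expression on, which is how heterogeneous kernels can cooperate without bespoke glue between every pair of components.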

Conference paper

Mohr-Daurat H, Pirk H, 2023, Wisent: an in-memory serialization format for leafy trees, Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW’23) — Second International Workshop on Composable Data Management Systems (CDMS’23), Publisher: CEUR Workshop Proceedings, Pages: 1-8, ISSN: 1613-0073

Efficient data exchange is one of the key ingredients for high-performance, composable data management systems. Efficient data exchange formats exist for “flat” (i.e., relational) data. For nested data, however, users must resort to formats such as XML, JSON or their binary counterparts such as BSON or CBJSON. These formats provide the flexibility to store metadata as well as actual data but lead to costly serialization and deserialization. This makes them unfit to represent flat data, forcing users to combine flat formats (Arrow, Parquet and even CSV) for data with JSON or XML documents for metadata. This practice, known as “Data Packages”, is error-prone, labor-intensive and increases system complexity. To address this problem, we propose Wisent, a new exchange format designed to represent nested data while efficiently encoding the “flat” sections of the tree. We call such trees “leafy”. To this end, Wisent serializes trees breadth-first rather than in the conventional depth-first order. We found that breadth-first serialization enables lazy decoding during tree navigation and in-place, conversion-free data processing of the flat sections using tight loops. We implemented a Wisent deserializer in C++, which outperforms state-of-the-art data serialization protocols by several orders of magnitude. Wisent is not just faster to encode and decode but also much simpler to implement: to demonstrate that aspect, we implemented Wisent decoders in Swift and Python and found that it can be implemented in roughly 100 lines of code while achieving performance that is dominated only by the overhead of the respective host language.
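
The breadth-first layout is the part that can be sketched compactly: serializing level by level keeps all siblings, and hence the “flat” leaf runs, contiguous. A minimal Python sketch, assuming a simple (label, children) tree rather than Wisent's actual wire format:

```python
from collections import deque

def serialize_bfs(root):
    """Emit (label, child_count) records level by level. Siblings end up
    contiguous, so flat leaf runs can be scanned with a tight loop
    instead of a recursive descent."""
    out, queue = [], deque([root])
    while queue:
        label, children = queue.popleft()
        out.append((label, len(children)))
        queue.extend(children)
    return out

def deserialize_bfs(records):
    it = iter(records)
    label, n = next(it)
    root = (label, [])
    queue = deque([(root, n)])
    while queue:
        node, pending = queue.popleft()
        for _ in range(pending):
            lbl, k = next(it)
            child = (lbl, [])
            node[1].append(child)
            queue.append((child, k))
    return root

tree = ("table", [("row", [(1, []), (2, [])]), ("row", [(3, []), (4, [])])])
assert deserialize_bfs(serialize_bfs(tree)) == tree
```

Because the leaves of each subtree arrive as one contiguous run, a decoder can defer materializing them until navigation actually reaches that level, which is the lazy-decoding property the abstract describes.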

Conference paper

Pirk H, 2022, Collaborative data science using scalable homoiconicity, SIGMOD Record, Vol: 51, Pages: 54-55, ISSN: 0163-5808

Motivation: Data science is increasingly collaborative. On the one hand, results need to be distributed, e.g., as interactive visualizations. On the other, collaboration in the data development process improves quality and timeliness. This can take many forms: partitioning a problem and working on aspects in parallel, exploring different solutions or reviewing someone else's work.

Journal article

Theodorakis G, Kounelis F, Pietzuch P, Pirk H et al., 2022, SCABBARD: single-node fault-tolerant stream processing, 48th International Conference on Very Large Data Bases (VLDB), Publisher: VLDB Endowment, Pages: 361-374, ISSN: 2150-8097

Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, necessitating complex cluster-based deployments. This lack of built-in fault-tolerance features has hindered the adoption of single-node SPEs. We describe Scabbard, the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth. Scabbard achieves this by integrating persistence operations with the query workload. Within the operator graph, Scabbard determines when to persist streams based on the selectivity of operators: by persisting streams after operators that discard data, it can substantially reduce the required I/O bandwidth. As part of the operator graph, Scabbard supports parallel persistence operations and uses markers to decide when to discard persisted data. The persisted data volume is further reduced using workload-specific compression: Scabbard monitors stream statistics and dynamically generates computationally efficient compression operators. Our experiments show that Scabbard can execute stream queries that process over 200 million tuples per second while recovering from failures with sub-second latencies.
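
The placement decision at the heart of this idea (persist where the stream is smallest) can be sketched in a few lines of Python. The pipeline and selectivities below are invented for illustration; Scabbard's real cost model and persistence machinery are far richer.

```python
def persistence_point(operators, input_rate):
    """Pick the pipeline position whose persisted stream is cheapest.
    `operators` is a list of (name, selectivity) in pipeline order;
    selectivity = fraction of tuples that survive the operator."""
    best_pos, best_rate = 0, input_rate   # position 0 = persist raw input
    rate = input_rate
    for pos, (name, selectivity) in enumerate(operators, start=1):
        rate *= selectivity
        if rate < best_rate:
            best_pos, best_rate = pos, rate
    return best_pos, best_rate

# A hypothetical pipeline where an early filter discards 95% of tuples.
pipeline = [("filter", 0.05), ("project", 1.0), ("enrich", 1.0)]
pos, rate = persistence_point(pipeline, input_rate=200e6)
print(f"persist after operator {pos}: ~{rate:.0f} tuples/s to disk")
```

Persisting after the filter writes a twentieth of the raw stream, which is exactly the I/O-bandwidth saving the abstract attributes to selectivity-aware persistence placement.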

Conference paper

2022, Proceedings of the British International Conference on Databases 2021, London, United Kingdom, March 28, 2022, Publisher: CEUR-WS.org

Conference paper

Mohr-Daurat H, Pirk H, 2021, Homoiconicity for End-to-End Machine Learning with BOSS, Publisher: CEUR-WS.org, Pages: 46-49

Conference paper

Papaphilippou P, Pirk H, Luk W, 2020, Accelerating the merge phase of sort-merge join, The International Conference on Field-Programmable Logic and Applications (FPL) 2019, Publisher: IEEE

We present an efficient, high-throughput and scalable hardware design for accelerating the merge phase of the sort-merge join operation. Sort-merge join is one of the fundamental join algorithms and among the most frequently executed operations in relational databases. It has been the focus of various recent pieces of research, each having different shortcomings and usually only focusing on the sort phase. In this paper, a new parallel sort-merge join architecture is developed that provides data streaming functionality and high throughput. The key idea of the paper is the use of a novel design for a co-grouping engine, with which the input data are summarised on-the-fly. In this way, the operation is performed on streams of data, preserving the linear access pattern for faster data movement and also eliminating the need for a replay buffer/cache. In contrast to related work, our approach does not make assumptions about the value distribution of the input data and applies to any input size and width. We evaluate the design on a Zynq UltraScale+ based platform and show that there is up to 3.1 times speedup over the host processor, even without accelerating the sort phase, while still taking into account the data transfers from and to main memory in Linux.
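
The algorithmic idea behind the merge phase (stream both sorted inputs once, co-grouping equal-key runs on the fly so no replay buffer is needed) can be sketched in scalar Python; the paper's contribution is realizing this in hardware.

```python
from itertools import groupby
from operator import itemgetter

def merge_join(left, right, key=itemgetter(0)):
    """Join two key-sorted sequences of tuples on their first field,
    co-grouping equal-key runs as they stream past."""
    lgroups, rgroups = groupby(left, key), groupby(right, key)
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    while lg is not None and rg is not None:
        if lk == rk:                        # co-group: emit the cross product
            lg, rg = list(lg), list(rg)
            for l in lg:
                for r in rg:
                    yield l + r[1:]
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))
        elif lk < rk:                       # advance the side that lags
            lk, lg = next(lgroups, (None, None))
        else:
            rk, rg = next(rgroups, (None, None))

left = [(1, "a"), (2, "b"), (2, "c")]
right = [(2, "x"), (2, "y"), (3, "z")]
print(list(merge_join(left, right)))
# [(2, 'b', 'x'), (2, 'b', 'y'), (2, 'c', 'x'), (2, 'c', 'y')]
```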

Conference paper

Theodorakis G, Pietzuch P, Pirk H, 2020, SlideSide: a fast incremental stream processing algorithm for multiple queries, 23rd International Conference on Extending Database Technology (EDBT), Pages: 435-438, ISSN: 2367-2005

Aggregate window computations lie at the core of online analytics in both academic and industrial applications. To efficiently compute sliding windows, state-of-the-art algorithms utilize incremental processing that avoids the recomputation of window results from scratch. In this paper, we propose a novel algorithm, called SlideSide, that extends TwoStacks for multiple concurrent aggregate queries over the same data stream. Our approach uses different yet similar processing schemes for invertible and non-invertible functions and exhibits up to 2× better throughput compared to the state-of-the-art incremental techniques in a multi-query environment.
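
For context, here is a minimal Python sketch of the TwoStacks algorithm that SlideSide extends: amortized O(1) sliding-window aggregation for any associative function, shown for max (a non-invertible aggregate). SlideSide itself shares this work across multiple concurrent queries, which the sketch does not attempt.

```python
class TwoStacks:
    def __init__(self, op):
        self.op = op
        self.front, self.back = [], []   # (value, running_aggregate) pairs

    def push(self, v):
        agg = v if not self.back else self.op(self.back[-1][1], v)
        self.back.append((v, agg))

    def pop(self):
        if not self.front:               # flip: recompute suffix aggregates
            while self.back:
                v, _ = self.back.pop()
                agg = v if not self.front else self.op(self.front[-1][1], v)
                self.front.append((v, agg))
        return self.front.pop()[0]

    def query(self):                     # assumes a non-empty window
        parts = [s[-1][1] for s in (self.front, self.back) if s]
        a = parts[0]
        for p in parts[1:]:
            a = self.op(a, p)
        return a

w = TwoStacks(max)
for x in [3, 1, 4, 1, 5]:
    w.push(x)
w.pop()            # the window slides past the 3
print(w.query())   # 5
```

Every element is moved between the stacks at most once, which is where the amortized constant cost comes from.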

Conference paper

Kowalski T, Kounelis F, Pirk H, 2020, High-Performance Tree Indices: Locality matters more than one would think, Pages: 1-7

Conference paper

Theodorakis G, Koliousis A, Pietzuch P, Pirk H et al., 2020, LightSaber: Efficient Window Aggregation on Multi-Core Processors, New York, NY, USA, Publisher: Association for Computing Machinery, Pages: 2505-2521

Window aggregation queries are a core part of streaming applications. To support window aggregation efficiently, stream processing engines face a trade-off between exploiting parallelism (at the instruction/multi-core levels) and incremental computation (across overlapping windows and queries). Existing engines implement ad-hoc aggregation and parallelization strategies. As a result, they only achieve high performance for specific queries depending on the window definition and the type of aggregation function. We describe a general model for the design space of window aggregation strategies. Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs. Its design generalizes existing approaches: (i) for parallel processing, LightSaber constructs a parallel aggregation tree (PAT) that exploits the parallelism of modern processors. The PAT divides window aggregation into intermediate steps that enable the efficient use of both instruction-level (i.e., SIMD) and task-level (i.e., multi-core) parallelism; and (ii) to generate efficient incremental code from the PAT, LightSaber uses a generalized aggregation graph (GAG), which encodes the low-level data dependencies required to produce aggregates over the stream. A GAG thus generalizes state-of-the-art approaches for incremental window aggregation and supports work-sharing between overlapping windows. LightSaber achieves up to an order of magnitude higher throughput compared to existing systems: on a 16-core server, it processes 470 million records/s with 132 μs average latency.
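
The parallel half of the design (fold each chunk into a partial aggregate, then combine the partials) can be sketched as follows. The function is illustrative only: Python threads merely show the structure, and the sketch glosses over the PAT's multi-level layout and the GAG-generated incremental code.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_aggregate(values, op, chunks=4):
    """Split the input, fold each chunk independently, combine partials."""
    step = max(1, len(values) // chunks)
    parts = [values[i:i + step] for i in range(0, len(values), step)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: reduce(op, c), parts))
    return reduce(op, partials)   # the combine step of the tree

print(parallel_aggregate(list(range(1, 101)), lambda a, b: a + b))  # 5050
```

The decomposition only requires the aggregation function to be associative, which is why the same tree shape serves SIMD lanes and cores alike.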

Conference paper

Theodorakis G, Koliousis A, Pietzuch PR, Pirk H et al., 2018, Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation, International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018, Pages: 34-41

The computation of sliding window aggregates is one of the core functionalities of stream processing systems. Presently, there are two classes of approaches to evaluating them. The first is non-incremental, i.e., every window is evaluated in isolation even if overlapping windows provide opportunities for work-sharing. While not algorithmically efficient, this class of algorithm is usually very CPU efficient. The other approach is incremental: to the extent possible, the result of one window evaluation is used to help with the evaluation of the next window. While algorithmically efficient, the inherent control dependencies in the CPU instruction stream make this highly CPU inefficient. In this paper, we analyse the state of the art in efficient incremental window processing and extend the fastest known algorithm, the Two-Stacks approach, with known as well as novel optimisations. These include SIMD-parallel processing, internal data structure decomposition and data minimalism. We find that, thus optimised, our incremental window aggregation algorithm outperforms the state-of-the-art incremental algorithm by up to 11×. In addition, it is at least competitive and often significantly (up to 80%) faster than a non-incremental algorithm. Consequently, stream processing systems can use our proposed algorithm to cover incremental as well as non-incremental aggregation, resulting in systems that are simpler as well as faster.
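
The incremental/non-incremental trade-off is easiest to see for an invertible aggregate such as sum, where the incremental variant adds the entering tuple and "inverts" the leaving one. This sketch shows the algorithmic difference only, not the SIMD and data-layout optimisations the paper contributes.

```python
def windows_recompute(xs, size):
    """Non-incremental: every window is evaluated from scratch."""
    return [sum(xs[i:i + size]) for i in range(len(xs) - size + 1)]

def windows_incremental(xs, size):
    """Incremental: carry the previous result forward in O(1) per slide."""
    acc = sum(xs[:size])
    out = [acc]
    for i in range(size, len(xs)):
        acc += xs[i] - xs[i - size]   # add new tuple, invert the evicted one
        out.append(acc)
    return out

xs = [3, 1, 4, 1, 5, 9, 2, 6]
assert windows_recompute(xs, 3) == windows_incremental(xs, 3)
```

The data dependency on `acc` is exactly the control/data-dependency chain the abstract blames for poor CPU efficiency, and what the paper's SIMD-friendly restructuring attacks.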

Conference paper

Shanbhag A, Pirk H, Madden S, 2018, Efficient top-K query processing on massively parallel hardware, SIGMOD 2018, Publisher: Association for Computing Machinery (ACM), Pages: 1557-1570, ISSN: 0730-8078

A common operation in many data analytics workloads is to find the top-k items, i.e., the k largest or smallest items according to some sort order (implemented via LIMIT or ORDER BY expressions in SQL). A naive implementation of top-k is to sort all of the items and then return the first k, but this does much more work than needed. Although efficient implementations of top-k have been explored on traditional multi-core processors, there has been no prior systematic study of top-k implementations on GPUs, despite open requests for such implementations in GPU-based frameworks like TensorFlow and ArrayFire. In this work, we present several top-k algorithms for GPUs, including a new algorithm based on bitonic sort called bitonic top-k. The bitonic top-k algorithm is up to a factor of 15x faster than sort and 4x faster than a variety of other possible implementations for values of k up to 256. We also develop a cost model to predict the performance of several of our algorithms, and show that it accurately predicts actual performance on modern GPUs.
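
The structure of bitonic top-k (repeatedly merge a block of k candidates with the current top k and discard the losing half) can be shown with ordinary sorts standing in for the bitonic sorting networks a GPU would use:

```python
def top_k(stream, k):
    """Merge-and-prune top-k: only ever holds ~2k candidates at a time."""
    buf = []          # current top-k candidates, sorted descending
    block = []
    for x in stream:
        block.append(x)
        if len(block) == k:                 # merge step, then prune to k
            merged = sorted(buf + block, reverse=True)
            buf, block = merged[:k], []
    return sorted(buf + block, reverse=True)[:k]

print(top_k([5, 1, 9, 3, 7, 2, 8, 6], k=3))   # [9, 8, 7]
```

Because each merge only touches 2k elements, the bulk of the input is read once and discarded, which is what makes this shape so much cheaper than a full sort.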

Conference paper

Palkar S, Thomas J, Narayanan D, Thaker P, Palamuttam R, Negi P, Shanbhag A, Schwarzkopf M, Pirk H, Amarasinghe S, Madden S, Zaharia M et al., 2018, Evaluating end-to-end optimization for data analytics applications in weld, Proceedings of the VLDB Endowment, Vol: 11, Pages: 1002-1015, ISSN: 2150-8097

Modern analytics applications use a diverse mix of libraries and functions. Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications. To address this problem, we proposed Weld, a common runtime for existing data analytics libraries that performs key physical optimizations such as pipelining under existing, imperative library APIs. In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications, and evaluating its impact on realistic data science workloads. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at run-time in ad-hoc workloads where no statistics are available, with sub-second overhead. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally. Our results are promising: using our optimizer, Weld accelerates data science workloads by up to 23X on one thread and 80X on eight threads, and its adaptive optimizations provide up to a 3.75X speedup over rule-based optimization. Moreover, Weld provides benefits if even just 4-5 operators in a library are ported to use it. Our results show that common runtime designs like Weld may be a viable approach to accelerate analytics.
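
One ingredient worth illustrating is the run-time, data-dependent decision making: measure candidate implementations on a small sample, then run the winner on the full input. This Python sketch is only an analogy; Weld makes such decisions inside its IR and optimizer, not by timing Python callables.

```python
import time

def adaptive_choice(candidates, data, sample=1000):
    """Probe each equivalent candidate on a sample, run the fastest."""
    probe = data[:sample]
    def cost(fn):
        t0 = time.perf_counter()
        fn(probe)
        return time.perf_counter() - t0
    best = min(candidates, key=cost)
    return best(data)

data = list(range(1_000_000))
result = adaptive_choice(
    [lambda xs: sum(x * x for x in xs),        # two equivalent plans
     lambda xs: sum(map(lambda x: x * x, xs))],
    data,
)
print(result)
```

The candidates must be semantically equivalent, so the adaptive pick only trades time, never correctness; the same invariant underpins data-dependent optimization in Weld.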

Journal article

Palkar S, Thomas J, Narayanan D, Shanbhag A, Palamuttam R, Pirk H, Schwarzkopf M, Amarasinghe S, Madden S, Zaharia M et al., 2017, Weld: Rethinking the interface between data-intensive applications

Data analytics applications combine multiple functions from different libraries and frameworks. Even when each function is optimized in isolation, the performance of the combined application can be an order of magnitude below hardware limits due to extensive data movement across these functions. To address this problem, we propose Weld, a new interface between data-intensive libraries that can optimize across disjoint libraries and functions. Weld exposes a lazily-evaluated API where diverse functions can submit their computations in a simple but general intermediate representation that captures their data-parallel structure. It then optimizes data movement across these functions and emits efficient code for diverse hardware. Weld can be integrated into existing frameworks such as Spark, TensorFlow, Pandas and NumPy without changing their user-facing APIs. We demonstrate that Weld can speed up applications using these frameworks by up to 29x.
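
A toy version of the lazily-evaluated interface makes the data-movement argument tangible: "library" calls build IR nodes instead of computing, and evaluation fuses all stages into one loop with no intermediate collections. The `Lazy` class below is invented for illustration and is not Weld's IR.

```python
class Lazy:
    """Calls record operations; nothing runs until evaluate()."""
    def __init__(self, source, ops=()):
        self.source, self.ops = source, ops
    def map(self, f):
        return Lazy(self.source, self.ops + (("map", f),))
    def filter(self, p):
        return Lazy(self.source, self.ops + (("filter", p),))
    def evaluate(self):
        # One fused loop over the source: no intermediate collections,
        # which is the cross-function data movement Weld eliminates.
        out = []
        for x in self.source:
            keep = True
            for kind, f in self.ops:
                if kind == "map":
                    x = f(x)
                elif not f(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

# Two "libraries" each contribute a stage; evaluation fuses them.
expr = Lazy(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(expr.evaluate())   # [0, 4, 16, 36, 64]
```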

Working paper

Shanbhag A, Pirk H, Madden S, 2017, Locality-Adaptive Parallel Hash Joins Using Hardware Transactional Memory, 7th International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS) / International Workshop on In-Memory Data Management (IMDM), Publisher: Springer International Publishing AG, Pages: 118-133, ISSN: 0302-9743

Previous work [1] has claimed that the best performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache-locality and synchronization-free parallelism in the build-phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle loaded in the form of transactions. In such cases, the first benefit of partitioning, increased locality, is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well. Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower.
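
Python has no hardware transactional memory, so the following only sketches the adaptive control logic the abstract describes: build optimistically, monitor the abort rate, and fall back to a partitioned build when it climbs. `build_htm_chunk` merely simulates aborts; both stubs are hypothetical.

```python
import random

def build_htm_chunk(table, chunk):
    """Stand-in for an HTM build; returns the number of aborted inserts."""
    aborts = 0
    for key, row in chunk:
        if random.random() < 0.02:   # pretend 2% of transactions abort
            aborts += 1              # a real build would retry these
        table.setdefault(key, []).append(row)
    return aborts

def radix_partitioned_build(tuples):
    """Stand-in for the partitioned fallback path."""
    table = {}
    for key, row in tuples:
        table.setdefault(key, []).append(row)
    return table

def adaptive_build(tuples, chunk_size=1024, abort_threshold=0.10):
    table, seen, aborted = {}, 0, 0
    for i in range(0, len(tuples), chunk_size):
        chunk = tuples[i:i + chunk_size]
        aborted += build_htm_chunk(table, chunk)
        seen += len(chunk)
        if aborted / seen > abort_threshold:        # low locality detected
            return radix_partitioned_build(tuples)  # dynamic fallback
    return table

rows = [(k % 100, k) for k in range(10_000)]
print(len(adaptive_build(rows)))   # 100 distinct keys
```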

Conference paper

Palkar S, Thomas J, Shanbhag A, Pirk H, Schwarzkopf M, Amarasinghe S, Zaharia M et al., 2016, Weld: A Common Runtime for High Performance Data Analytics, CIDR

Conference paper

Zeuch S, Pirk H, Freytag J-C, 2016, Non-invasive progressive optimization for in-memory databases, Proceedings of the VLDB Endowment, Vol: 9, Pages: 1659-1670, ISSN: 2150-8097

Progressive optimization introduces robustness for database workloads against wrong estimates, skewed data, correlated attributes, or outdated statistics. Previous work focuses on cardinality estimates and relies on expensive counting methods as well as complex learning algorithms. In this paper, we utilize performance counters to drive progressive optimization during query execution. The main advantages are that performance counters introduce virtually no costs on modern CPUs and their usage enables non-invasive monitoring. We present fine-grained cost models to detect differences between estimates and actual costs, which enables us to kick-start reoptimization. Based on our cost models, we implement an optimization approach that estimates the individual selectivities of a multi-selection query efficiently. Furthermore, we are able to learn properties like sortedness, skew, or correlation during run-time. In our evaluation we show that the overhead of our approach is negligible, while the performance improvements are convincing. Using progressive optimization, we improve runtime by up to a factor of three compared to average run-times and up to a factor of 4.5 compared to worst-case run-times. As a result, we avoid costly operator execution orders and thus make query execution highly robust.
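
The multi-selection use case can be sketched directly: estimate each predicate's selectivity from rows seen so far and run the most selective predicate first. The paper derives its estimates from hardware performance counters; the sketch simply counts on a sample.

```python
def run_multi_select(rows, predicates, sample=1000):
    """Order predicates by measured selectivity, then apply them all."""
    probe = rows[:sample]
    def selectivity(p):
        return sum(1 for r in probe if p(r)) / max(1, len(probe))
    # Most selective first: predicates that discard the most rows run first.
    ordered = sorted(predicates, key=selectivity)
    return [r for r in rows if all(p(r) for p in ordered)]

rows = list(range(100_000))
preds = [lambda r: r % 2 == 0,       # keeps 50%
         lambda r: r % 1000 == 0]    # keeps 0.1% -- should run first
print(len(run_multi_select(rows, preds)))   # 100
```

Reordering never changes the result of a conjunction, only how many tuples the later predicates have to touch, which is why this form of reoptimization is safe to apply mid-execution.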

Journal article

Pirk H, Moll O, Zaharia M, Madden Set al., 2016, Voodoo - a vector algebra for portable database performance on modern hardware, Proceedings of the VLDB Endowment, Vol: 9, Pages: 1707-1718, ISSN: 2150-8097

In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer --- even more so for a query compiler. To facilitate performance-oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi- or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in-memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache-conscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code generating frontend to expose parallelism to the Voodoo compiler in an abstract manner, enabling portable performance across hardware platforms. We used Voodoo to build an alternative backend for MonetDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPer and Ocelot. We also demonstrate Voodoo's usefulness when investigating hardware conscious tuning techniques, assessing their performance on different queries, devices and data.
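
The flavor of a declarative vector algebra can be conveyed with a few whole-vector operators composed into a query. The operator names are invented for illustration; Voodoo's control vectors, code generation and OpenCL backend are well beyond this sketch.

```python
# Whole-vector operators: each one describes *what* to compute over a
# vector, leaving a backend free to map it to SIMD, multi-core or GPU code.

def vec_greater(values, c):
    return [v > c for v in values]

def vec_filter(values, mask):
    return [v for v, m in zip(values, mask) if m]

def vec_sum(values):
    return sum(values)

# "SELECT SUM(x) FROM t WHERE x > 10" as a pipeline of vector operators.
x = [4, 12, 7, 25, 10, 31]
print(vec_sum(vec_filter(x, vec_greater(x, 10))))   # 68
```

Because each operator is data-parallel by construction, none of them pins down an execution strategy, which is the property that lets one algebra target both CPUs and GPUs.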

Journal article

Pirk H, Moll O, Madden S, 2016, What Makes a Good Physical Plan? - Experiencing Hardware-Conscious Query Optimization with Candomble, ACM SIGMOD International Conference on Management of Data, Publisher: Association for Computing Machinery (ACM), Pages: 2149-2152

Conference paper

Pirk H, 2015, ...like Commanding an Anthill: A Case for Micro-Distributed (Data) Management Systems, SIGMOD Record, Vol: 44, Pages: 53-58, ISSN: 0163-5808

Journal article

2015, Proceedings of the 11th International Workshop on Data Management on New Hardware, SIGMOD/PODS'15: International Conference on Management of Data, Publisher: ACM

Conference paper

Pirk H, Madden S, Stonebraker M, 2015, By their fruits shall ye know them: A Data Analyst's Perspective on Massively Parallel System Design, Publisher: ACM, Pages: 5:1-5:1

Conference paper

Kargin Y, Kersten M, Manegold S, Pirk H et al., 2015, The DBMS - your big data sommelier, Pages: 1119-1130

Conference paper

2014, Proceedings of the Tenth International Workshop on Data Management on New Hardware, SIGMOD/PODS'14: International Conference on Management of Data, Publisher: ACM

Conference paper

Pirk H, Manegold S, Kersten M, 2014, Waste Not ... Efficient Co-Processing of Relational Data, IEEE 30th International Conference on Data Engineering (ICDE), Publisher: IEEE, Pages: 508-519, ISSN: 1084-4627

Conference paper

Pirk H, Petraki E, Idreos S, Manegold S, Kersten ML et al., 2014, Database cracking: fancy scan, not poor man's sort!, Publisher: ACM, Pages: 4:1-4:1

Conference paper

Heimel M, Saecker M, Pirk H, Manegold S, Markl V et al., 2013, Hardware-oblivious parallelism for in-memory column-stores, Proceedings of the VLDB Endowment, Vol: 6, Pages: 709-720, ISSN: 2150-8097

The multi-core architectures of today's computer systems make parallelism a necessity for performance-critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: current database systems thus rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures like multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators, which are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines, while achieving competitive performance to hand-tuned systems.

We provide a proof-of-concept for this design by integrating operators written using the parallel programming framework OpenCL into the open-source database MonetDB. Following this approach, we achieve efficient, yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H derived queries and observed a performance that rivals that of MonetDB's query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.
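
The core design idea (one logical operator, multiple hardware backends chosen at run-time) can be sketched as follows, with plain-Python stand-ins replacing the paper's OpenCL operators:

```python
def scan_gt_python(column, c):
    """A plain sequential backend for the logical scan operator."""
    return [v for v in column if v > c]

def scan_gt_chunked(column, c, chunk=4):
    """Stand-in for a parallel backend: same semantics, chunked traversal."""
    out = []
    for i in range(0, len(column), chunk):
        out.extend(v for v in column[i:i + chunk] if v > c)
    return out

BACKENDS = {"cpu": scan_gt_python, "parallel": scan_gt_chunked}

def scan_gt(column, c, device="cpu"):
    return BACKENDS[device](column, c)   # one operator, any device

col = [3, 14, 15, 9, 26, 5]
assert scan_gt(col, 10, "cpu") == scan_gt(col, 10, "parallel") == [14, 15, 26]
```

The queries above the dispatch layer never mention a device, which is the hardware-obliviousness the abstract argues for.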

Journal article

2013, Selected Topics in Performance Evaluation and Benchmarking, Publisher: Springer Berlin Heidelberg, ISBN: 9783642367267

Book

Kargin Y, Pirk H, Ivanova M, Manegold S, Kersten M et al., 2013, Instant-On Scientific Data Warehouses: Lazy ETL for Data-Intensive Research, 6th BIRTE International Workshop Held at the 38th International Conference on Very Large Databases (VLDB), Publisher: Springer-Verlag Berlin, Pages: 60-75, ISSN: 1865-1348

Conference paper

Pirk H, Funke F, Grund M, Neumann T, Leser U, Manegold S, Kemper A, Kersten M et al., 2013, CPU and Cache Efficient Management of Memory-Resident Databases, 29th IEEE International Conference on Data Engineering (ICDE), Publisher: IEEE, Pages: 14-25, ISSN: 1084-4627

Conference paper

