Imperial College London


Faculty of EngineeringDepartment of Computing

Professor of Distributed Systems



+44 (0)20 7594 8314prp Website




442Huxley BuildingSouth Kensington Campus





Publication Type

145 results found

Sartakov VA, Vilanova L, Eyers D, Shinagawa T, Pietzuch Pet al., 2022, CAP-VMs: Capability-based isolation and sharing in the cloud, 16th USENIX Symposium on Operating Systems Design and Implementation, Pages: 597-612

Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the MMU enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large TCBs in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities offer new opportunities to implement isolation and sharing at a finer granularity. We describe cVMs, a new VM-like abstraction that uses memory capabilities to isolate application components while supporting efficient data sharing, all without mandating application code to be capability-aware. cVMs share a single virtual address space safely, each having only capabilities to access its own memory. A cVM may include a library OS, thus minimizing its dependency on the cloud environment. cVMs efficiently exchange data through two capability-based primitives assisted by a small trusted monitor: (i) an asynchronous read/write interface to buffers shared between cVMs; and (ii) a call interface to transfer control between cVMs. Using these two primitives, we build more expressive mechanisms for efficient cross-cVM communication. Our prototype implementation using CHERI RISC-V capabilities shows that cVMs isolate services (Redis and Python) with low overhead while improving data sharing.

Conference paper

Pietzuch P, Shamis A, Canakci B, Castro M, Fournet C, Ashton E, Chamayou A, Clebsch S, Delignat-Lavaud A, Kerner M, Maffre J, Vrousgou O, Wintersteiger CM, Costa M, Russinovich Met al., 2022, IA-CCF: Individual accountability for permissioned ledgers, 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Pages: 467-491

Permissioned ledger systems allow a consortium of members that do not trust one another to execute transactions safelyon a set of replicas. Such systems typically use Byzantinefault tolerance (BFT) protocols to distribute trust, which onlyensures safety when fewer than 1/3 of the replicas misbehave.Providing guarantees beyond this threshold is a challenge:current systems assume that the ledger is corrupt and fail toidentify misbehaving replicas or hold the members that operate them accountable—instead all members share the blame.We describe IA-CCF, a new permissioned ledger systemthat provides individual accountability. It can assign blameto the individual members that operate misbehaving replicasregardless of the number of misbehaving replicas or members.IA-CCF achieves this by signing and logging BFT protocolmessages in the ledger, and by using Merkle trees to provideclients with succinct, universally-verifiable receipts as evidence of successful transaction execution. Anyone can auditthe ledger against a set of receipts to discover inconsistenciesand identify replicas that signed contradictory statements. IACCF also supports changes to consortium membership andreplicas by tracking signing keys using a sub-ledger of governance transactions. IA-CCF provides strong disincentives tomisbehavior with low overhead: it executes 47,000 tx/s whileproviding clients with receipts in two network round trips.

Conference paper

Sartakov V, Pietzuch P, Vilanova L, 2021, CubicleOS: A library OS with software componentisation for practical isolation, Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), Publisher: ACM, Pages: 546-558

Library OSs have been proposed to deploy applications isolated inside containers, VMs, or trusted execution environments. They often follow a highly modular design in which third-party components are combined to offer the OS functionality needed by an application, and they are customised at compilation and deployment time to fit application requirements. Yet their monolithic design lacks isolation across components: when applications and OS components contain security-sensitive data (e.g., cryptographic keys or user data), the lack of isolation renders library OSs open to security breaches via malicious or vulnerable third-party components.

Conference paper

Sartakov V, O’Keeffe D, Eyers D, Vilanova L, Pietzuch Pet al., 2021, Spons & shields: practical isolation for trusted execution, ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’21), Publisher: ACM, Pages: 186-200

Trusted execution environments (TEEs) promise a cost-effective, “lift-and-shift” solution for deploying security-sensitive applications in untrusted clouds. For this, they must support rich, multi-component applications, but a large trusted computing base (TCB) inside the TEE risks that attackers can compromise application security. Fine-grained compartmentalisation can increase security through defense-in-depth, but current solutions either run all software components unprotected in the same TEE, lack efficient shared memory support, or isolate application processes using separate TEEs, impacting performance and compatibility.We describe the Spons & Shields framework (SSF) for Intel SGX TEEs, which offers intra-TEE compartmentalisation using two new abstraction, Spons and Shields. Spons and Shields generalise process, library and user/kernel isolation inside the TEE while allowing for efficient memory sharing. When users deploy unmodified multi-component applications in a TEE, SSF dynamically creates Spons (one per POSIX process or library) and Shields (to enforce a given security policy for memory accesses). Applications can be hardened with minor code changes, e.g., by using a separate Shield to isolate an SSL library. SSF uses compiler instrumentation to protect Shield boundaries, exploiting MPX instructions if available. We evaluate SSF using a complex application service (NGINX, PHP interpreter and PostgreSQL) and show that its overhead is comparable to process isolation.

Conference paper

Sartakov V, Vilanova L, Pietzuch P, 2020, CubicleOS: A Library OS with Software Componentisation for Practical Isolation

This artefact contains the library OS, two applications, isolation monitor, and scripts to reproduce experiments from the ASPLOS 2021 paper by V. A. Sartakov, L. Vilanova, R. Pietzuch — CubicleOS: A Library OS with Software Componentisation for Practical Isolation, which isolates components of a monolithic library OS without the use of message-based IPC primitives.


Mai L, Li G, Wagenlander M, Fertakis K, Brabete A-O, Pietzuch Pet al., 2020, KungFu: making training in distributed machine learning adaptive, USENIX Symposium on Operating Systems Design and Implementation (OSDI), Publisher: Usenix, Pages: 937-954

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must con-figure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe Kung Fu, a distributed ML library for Tensor-Flow that is designed to enable adaptive training. Kung Fu allows users to express high-level Adaptation Policies(APs)that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the data flowgraph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations

Conference paper

Theodorakis G, Pietzuch P, Pirk H, 2020, SlideSide: a fast incremental stream processing algorithm for multiple queries, 23rd International Conference on Extending Database Technology (EDBT), Pages: 435-438, ISSN: 2367-2005

Aggregate window computations lie at the core of online analyt-ics in both academic and industrial applications. To efficientlycompute sliding windows, the state-of-the-art algorithms utilizeincremental processing that avoids the recomputation of windowresults from scratch. In this paper, we propose a novel algorithm,calledSlideSide, that extendsTwoStacksfor multiple concur-rent aggregate queries over the same data stream. Our approachuses different yet similar processing schemes for invertible andnon-invertible functions and exhibits up to 2×better through-put compared to the state-of-the-art incremental techniques in amulti-query environment.

Conference paper

Theodorakis G, Koliousis A, Pietzuch P, Pirk Het al., 2020, LightSaber: Efficient Window Aggregation on Multi-Core Processors, New York, NY, USA, Publisher: Association for Computing Machinery, Pages: 2505–2521-2505–2521

Window aggregation queries are a core part of streaming applications. To support window aggregation efficiently, stream processing engines face a trade-off between exploiting parallelism (at the instruction/multi-core levels) and incremental computation (across overlapping windows and queries). Existing engines implement ad-hoc aggregation and parallelization strategies. As a result, they only achieve high performance for specific queries depending on the window definition and the type of aggregation function. We describe a general model for the design space of window aggregation strategies. Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs. Its design generalizes existing approaches: (i) for parallel processing, LightSaber constructs a parallel aggregation tree (PAT) that exploits the parallelism of modern processors. The PAT divides window aggregation into intermediate steps that enable the efficient use of both instruction-level (i.e., SIMD) and task-level (i.e., multi-core) parallelism; and (ii) to generate efficient incremental code from the PAT, LightSaber uses a generalized aggregation graph (GAG), which encodes the low-level data dependencies required to produce aggregates over the stream. A GAG thus generalizes state-of-the-art approaches for incremental window aggregation and supports work-sharing between overlapping windows. LightSaber achieves up to an order of magnitude higher throughput compared to existing systems-on a 16-core server, it processes 470 million records/s with 132 ?s average latency.

Conference paper

Lind J, Naor O, Eyal I, Kelbert F, Sirer EG, Pietzuch Pet al., 2019, Teechain: a secure payment network with asynchronous blockchain access, 27th ACM Symposium on Operating Systems Principles (SOSP), Publisher: ACM Press, Pages: 63-79

Blockchains such as Bitcoin and Ethereum execute payment transactions securely, but their performance is limited by the need for global consensus. Payment networks overcome this limitation through off-chain transactions. Instead of writing to the blockchain for each transaction, they only settle the final payment balances with the underlying blockchain. When executing off-chain transactions in current payment networks, parties must access the blockchain within bounded time to detect misbehaving parties that deviate from the protocol. This opens a window for attacks in which a malicious party can steal funds by deliberately delaying other parties' blockchain access and prevents parties from using payment networks when disconnected from the blockchain.We present Teechain, the first layer-two payment network that executes off-chain transactions asynchronously with respect to the underlying blockchain. To prevent parties from misbehaving, Teechain uses treasuries, protected by hardware trusted execution environments (TEEs), to establish off-chain payment channels between parties. Treasuries maintain collateral funds and can exchange transactions efficiently and securely, without interacting with the underlying blockchain. To mitigate against treasury failures and to avoid having to trust all TEEs, Teechain replicates the state of treasuries using committee chains, a new variant of chain replication with threshold secret sharing. Teechain achieves at least a 33X higher transaction throughput than the state-of-the-art Lightning payment network. A 30-machine Teechain deployment can handle over 1 million Bitcoin transactions per second.

Conference paper

Koliousis A, Watcharapichat P, Weidlich M, Mai L, Costa P, Pietzuch Pet al., 2019, Crossbow: scaling deep learning with small batch sizes on multi-GPU servers, Proceedings of the VLDB Endowment, Vol: 12, ISSN: 2150-8097

Deep learning models are trained on servers with many GPUs, andtraining must scale with the number of GPUs. Systems such asTensorFlow and Caffe2 train models with parallel synchronousstochastic gradient descent: they process a batch of training data ata time, partitioned across GPUs, and average the resulting partialgradients to obtain an updated global model. To fully utilise allGPUs, systems must increase the batch size, which hinders statisticalefficiency. Users tune hyper-parameters such as the learning rate tocompensate for this, which is complex and model-specific.We describeCROSSBOW, a new single-server multi-GPU sys-tem for training deep learning models that enables users to freelychoose their preferred batch size—however small—while scalingto multiple GPUs.CROSSBOWuses many parallel model replicasand avoids reduced statistical efficiency through a new synchronoustraining method. We introduceSMA, a synchronous variant of modelaveraging in which replicasindependentlyexplore the solution spacewith gradient descent, but adjust their searchsynchronouslybased onthe trajectory of a globally-consistent average model.CROSSBOWachieves high hardware efficiency with small batch sizes by poten-tially training multiple model replicas per GPU, automatically tuningthe number of replicas to maximise throughput. Our experimentsshow thatCROSSBOWimproves the training time of deep learningmodels on an 8-GPU server by 1.3–4×compared to TensorFlow.

Journal article

Clegg R, Landa R, Griffin D, Rio M, Hughes P, Kegel I, Stevens T, Pietzuch P, Williams Det al., 2019, Faces in the clouds: long-duration, multi-user, cloud-assisted video conferencing, IEEE Transactions on Cloud Computing, Vol: 7, Pages: 756-769, ISSN: 2168-7161

Multi-user video conferencing is a ubiquitous technology. Increasingly end-hosts in a conference are assisted by cloud-based servers that improve the quality of experience for end users. This paper evaluates the impact of strategies for placement of such servers on user experience and deployment cost. We consider scenarios based upon the Amazon EC2 infrastructure as well as future scenarios in which cloud instances can be located at a larger number of possible sites across the planet. We compare a number of possible strategies for choosing which cloud locations should host services and how traffic should route through them. Our study is driven by real data to create demand scenarios with realistic geographical user distributions and diurnal behaviour. We conclude that on the EC2 infrastructure a well chosen static selection of servers performs well but as more cloud locations are available a dynamic choice of servers becomes important.

Journal article

Pirk H, Giceva J, Pietzuch P, 2019, Thriving in the No Man's Land between compilers and databases, Conference on Innovative Data Systems Research

When developing new data-intensive applications, one faces a build-or-buy decision: use an existing off-the-shelf data management sys-tem (DMS) or implement a custom solution. While off-the-shelfsystems offer quick results, they lack the flexibility to accommo-date the changing requirements of long-term projects. Building asolution from scratch in a general-purpose programming language,however, comes with long-term development costs that may not bejustified. What is lacking is a middle ground or, more precisely,a clear migration path from off-the-shelf Data Management Sys-tems to customized applications in general-purpose programminglanguages. There is, in effect, a no man’s land that neither compilernor database researchers have claimed.We believe that this problem is an opportunity for the databasecommunity to claim a stake. We need to invest effort to transfer theoutcomes of data management research into fields of programminglanguages and compilers. The common complaint that other fieldsare re-inventing database techniques bears witness to the need forthat knowledge transfer. In this paper, we motivate the necessityfor data management techniques in general-purpose programminglanguages and outline a number of specific opportunities for knowl-edge transfer. This effort will not only cover the no man’s land butalso broaden the impact of data management research.

Conference paper

Theodorakis G, Koliousis A, Pietzuch PR, Pirk Het al., 2018, Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation, International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018, Pages: 34-41

The computation of sliding window aggregates is one of the corefunctionalities of stream processing systems. Presently, there aretwo classes of approaches to evaluating them. The first is non-incremental, i.e., every window is evaluated in isolation even ifoverlapping windows provide opportunities for work-sharing. Whilenot algorithmically efficient, this class of algorithm is usually veryCPU efficient. The other approach is incremental: to the amountpossible, the result of one window evaluation is used to help withthe evaluation of the next window. While algorithmically efficient,the inherent control-dependencies in the CPU instruction streammake this highly CPU inefficient.In this paper, we analyse the state of the art in efficient incre-mental window processing and extend the fastest known algorithm,the Two-Stacks approach with known as well as novel optimisa-tions. These include SIMD-parallel processing, internal data struc-ture decomposition and data minimalism. We find that, thus opti-mised, our incremental window aggregation algorithm outperformsthe state-of-the-art incremental algorithm by up to11×. In ad-dition, it is at least competitive and often significantly (up to80%)faster than a non-incremental algorithm. Consequently, stream pro-cessing systems can use our proposed algorithm to cover incremen-tal as well as non-incremental aggregation resulting in systems thatare simpler as well as faster.

Conference paper

O'Keeffe D, Salonidis T, Pietzuch PR, 2018, Frontier: Resilient Edge Processing for the Internet of Things, Very Large Databases (VLDB), Publisher: ACM, Pages: 1178-1191, ISSN: 2150-8097

In an edge deployment model, Internet-of-Things (IoT) applications, e.g. for building automation or video surveillance, must process data locally on IoT devices without relying on permanent connectivity to a cloud backend. The ability to harness the combined resources of multiple IoT devices for computation is influenced by the quality of wireless network connectivity. An open challenge is how practical edge-based IoT applications can be realised that are robust to changes in network bandwidth between IoT devices, due to interference and intermittent connectivity.We present Frontier, a distributed and resilient edge processing platform for IoT devices. The key idea is to express data-intensive IoT applications as continuous data-parallel streaming queries and to improve query throughput in an unreliable wireless network by exploiting network path diversity: a query includes operator replicas at different IoT nodes, which increases possible network paths for data. Frontier dynamically routes stream data to operator replicas based on network path conditions. Nodes probe path throughput and use backpressure stream routing to decide on transmission rates, while exploiting multiple operator replicas for data-parallelism. If a node loses network connectivity, a transient disconnection recovery mechanism reprocesses the lost data. Our experimental evaluation of Frontier shows that network path diversity improves throughput by 1.3×−2.8×for different IoT applications, while being resilient to intermittent network connectivity.

Conference paper

Castro Fernandez R, Culhane W, Watcharapichat P, Weidlich M, Pietzuch PRet al., 2018, Meta-dataflows: efficient exploratory dataflow jobs, ACM Conference on Management of Data (SIGMOD), Publisher: Association for Computing Machinery (ACM), ISSN: 0730-8078

Distributed dataflow systems such as Apache Spark and ApacheFlink are used to derive new insights from large datasets. While theyefficiently executeconcretedata processing workflows, expressedas dataflow graphs, they lack generic support forexploratory work-flows: if a user is uncertain about the correct processing pipeline,e.g. in terms of data cleaning strategy or choice of model parame-ters, they must repeatedly submit modified jobs to the system. This,however, misses out on optimisation opportunities for exploratoryworkflows, both in terms of scheduling and memory allocation.We describemeta-dataflows(MDFs), a new model to effectivelyexpress exploratory workflows and efficiently execute them oncompute clusters. With MDFs, users specify afamilyof dataflowsusing two primitives: (a) anexploreoperator automatically con-siders choices in a dataflow; and (b) achooseoperator assesses theresult quality of explored dataflow branches and selects a subset ofthe results. We propose optimisations to execute MDFs: a systemcan (i) avoid redundant computation when exploring branches byreusing intermediate results and discarding results from underper-forming branches; and (ii) consider future data access patterns inthe MDF when allocating cluster memory. Our evaluation showsthat MDFs improve the runtime of exploratory workflows by up to90% compared to sequential execution.

Conference paper

Aublin PRER, Kelbert FM, O'Keeffe D, Muthukumaran D, Priebe C, Lind J, Krahn R, Fetzer C, Eyers D, Pietzuch PRet al., 2018, LibSEAL: revealing service integrity violations using trusted execution, The European Conference on Computer Systems, Publisher: ACM

Conference paper

Garefalakis P, Karanasos K, Pietzuch P, Suresh A, Rao Set al., 2018, Medea: scheduling of long running applications in shared production clusters, EuroSys 2018, Publisher: ACM

The rise in popularity of machine learning, streaming, and latency-sensitiveonline applications in shared production clusters hasraised new challenges for cluster schedulers. To optimize theirperformance and resilience, these applications require precise controlof their placements, by means of complex constraints, e.g., tocollocate or separate their long-running containers across groupsof nodes. In the presence of these applications, the cluster schedulermust attain global optimization objectives, such as maximizingthe number of deployed applications or minimizing the violatedconstraints and the resource fragmentation, but without affectingthe scheduling latency of short-running containers.We present Medea, a new cluster scheduler designed for theplacement of long- and short-running containers. Medea introducespowerful placement constraints with formal semantics to captureinteractions among containers within and across applications. Itfollows a novel two-scheduler design: (i) for long-running containers,it applies an optimization-based approach that accounts forconstraints and global objectives; (ii) for short-running containers,it uses a traditional task-based scheduler for low placement latency.Evaluated on a 400-node cluster, our implementation of Medea onApache Hadoop YARN achieves placement of long-running applicationswith significant performance and resilience benefits comparedto state-of-the-art schedulers.

Conference paper

Lind J, Priebe C, Muthukumaran D, O'Keeffe D, Aublin P, Kelbert F, Reiher T, Goltzsche D, Eyers D, Kapitza R, Fetzer C, Pietzuch Pet al., 2018, Glamdring: automatic application partitioning for Intel SGX, USENIX Annual Technical Conference 2017, Publisher: USENIX, Pages: 285-298

Trusted execution support in modern CPUs, as offered byIntel SGXenclaves, can protect applications in untrustedenvironments. While prior work has shown that legacyapplications can run in their entirety inside enclaves, thisresults in a large trusted computing base (TCB). Instead,we explore an approach in which wepartitionan applica-tion and use an enclave to protect only security-sensitivedata and functions, thus obtaining a smaller TCB.We describeGlamdring, the first source-level parti-tioning framework that secures applications written inC using Intel SGX. A developer first annotates security-sensitive application data. Glamdring then automaticallypartitions the application into untrusted and enclaveparts: (i) to preserve data confidentiality, Glamdring usesdataflow analysisto identify functions that may be ex-posed to sensitive data; (ii) for data integrity, it usesback-ward slicingto identify functions that may affect sensitivedata. Glamdring then places security-sensitive functionsinside the enclave, and adds runtime checks and crypto-graphic operations at the enclave boundary to protect itfrom attack. Our evaluation of Glamdring with the Mem-cached store, the LibreSSL library, and the Digital Bitboxbitcoin wallet shows that it achieves small TCB sizes andhas acceptable performance overheads.

Conference paper

Rupprecht L, Culhane WJ, Pietzuch P, 2017, SquirrelJoin: network-aware distributed join processing with lazy partitioning, Proceedings of the VLDB Endowment, Vol: 10, Pages: 1250-1261, ISSN: 2150-8097

To execute distributed joins in parallel on compute clusters, systemspartition and exchange data records between workers. With largedatasets, workers spend a considerable amount of time transferringdata over the network. When compute clusters are shared amongmultiple applications, workers must compete for network bandwidthwith other applications. These variances in the available networkbandwidth lead tonetwork skew, which causes straggling workersto prolong the join completion time.We describeSquirrelJoin, a distributed join processing techniquethat useslazy partitioningto adapt to transient network skew inclusters. Workers maintain in-memorylazy partitionsto withhold asubset of records, i.e. not sending them immediately to other work-ers for processing. Lazy partitions are then assigned dynamicallyto other workers based on network conditions: each worker takesperiodic throughput measurements to estimate its completion time,and lazy partitions are allocated as to minimise the join completiontime. We implement SquirrelJoin as part of the Apache Flink dis-tributed dataflow framework and show that, under transient networkcontention in a shared compute cluster, SquirrelJoin speeds up joincompletion times by up to 2.9× with only a small, fixed overhead.

Journal article

Goltzsche D, Wulf C, Muthukumaran D, Rieck K, Pietzuch PR, Kapitza Ret al., 2017, TrustJS: Trusted Client-side Execution of JavaScript, EuroSec'17, Publisher: ACM

Client-side JavaScript has become ubiquitous in web applications to improve user experience and reduce server load. However, since clients are untrusted, servers cannot rely on the confidentiality or integrity of client-side JavaScript code and the data that it operates on. For example, client-side input validation must be repeated at server side, and confidential business logic cannot be offloaded. In this paper, we present TrustJS, a framework that enables trustworthy execution of security-sensitive JavaScript inside commodity browsers. TrustJS leverages trusted hardware support provided by Intel SGX to protect the client-side execution of JavaScript, enabling a flexible partitioning of web application code. We present the design of TrustJS and provide initial evaluation results, showing that trustworthy JavaScript offloading can further improve user experience and conserve more server resources.

Conference paper

Lind J, Eyal I, Pietzuch PR, Sirer EGet al., 2017, Teechan: payment channels using trusted execution environments, 4th Workshop on Bitcoin and Blockchain Research, Publisher: Springer, ISSN: 0302-9743

Blockchain protocols are inherently limited in transaction throughputand latency. Recent efforts to address performance and scale blockchainshave focused on off-chain payment channels. While such channels can achievelow latency and high throughput, deploying them securely on top of the Bitcoinblockchain has been difficult, partly because building a secure implementationrequires changes to the underlying protocol and the ecosystem.We present Teechan, a full-duplex payment channel framework that exploitstrusted execution environments. Teechan can be deployed securely on the existingBitcoin blockchain without having to modify the protocol. It: (i) achieves a highertransaction throughput and lower transaction latency than prior solutions; (ii) enablesunlimited full-duplex payments as long as the balance does not exceed thechannel’s credit; (iii) requires only a single message to be sent per payment inany direction; and (iv) places at most two transactions on the blockchain underany execution scenario.We have built and deployed the Teechan framework using Intel SGX on theBitcoin network. Our experiments show that, not counting network latencies,Teechan can achieve 2,480 transactions per second on a single channel, with submillisecondlatencies.

Conference paper

Rupprecht L, Zhang R, Owen W, Pietzuch PR, Hildebrand Det al., 2017, SwiftAnalytics: optimizing object stores for big data analytics, IEEE International Conference on Cloud Engineering (IC2E), 2017, Publisher: IEEE

Due to their scalability and low cost, object-basedstorage systems are an attractive storage solution and widelydeployed. To gain valuable insight from the data residing inobject storage but avoid expensive copying to a distributedfilesystem (e.g. HDFS), it would be natural to directly use themas a storage backend for data-parallel analytics frameworks suchas Spark or MapReduce. Unfortunately, executing data-parallelframeworks on object storage exhibits severe performance prob-lems, reducing average job completion times by up to 6.5×.We identify the two most severe performance problems whenrunning data-parallel frameworks on the OpenStack Swift objectstorage system in comparison to the HDFS distributed filesystem:(i) the fixed mapping of object names to storage nodes preventslocal writes and adds delay when objects are renamed; (ii) thecoarser granularity of objects compared to blocks reduces datalocality during reads. We propose the SwiftAnalytics objectstorage system to address them: (i) it useslocality-awarewrites tocontrol an object’s location and eliminate unnecessary I/O relatedto renames during job completion, speeding up analytics jobs byup to 5.1×; (ii) it transparently chunks objects into smaller sizedparts to improve data-locality, leading to up to 3.4×faster reads.

Conference paper

Aublin P-L, Kelbert F, O'Keffe D, Muthukumaran D, Priebe C, Lind J, Krahn R, Fetzer C, Eyers D, Pietzuch Pet al., 2017, TaLoS: secure and transparent TLS termination inside SGX enclaves, Departmental Technical Report: 17/5, Publisher: Department of Computing, Imperial College London, 17/5

We introduce TaLoS1, a drop-in replacement for existing transportlayer security (TLS) libraries that protects itself from a maliciousenvironment by running inside an Intel SGX trusted execution environment.By minimising the amount of enclave transitions andreducing the overhead of the remaining enclave transitions, TaLoSimposes an overhead of no more than 31% in our evaluation withthe Apache web server and the Squid proxy.


Pietzuch PR, Arnautov S, Trach B, Gregor F, Knauth T, Martin A, Priebe C, Lind J, Muthukumaran D, O'Keeffe D, Stillwell M, Goltzsche D, Eyers D, Rüdiger K, Fetzer Cet al., 2016, SCONE: secure Linux containers with Intel SGX, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, Publisher: USENIX, Pages: 689-703

In multi-tenant environments, Linux containers managed by Docker or Kubernetes have a lower resource footprint, faster startup times, and higher I/O performance com- pared to virtual machines (VMs) on hypervisors. Yet their weaker isolation guarantees, enforced through soft- ware kernel mechanisms, make it easier for attackers to compromise the confidentiality and integrity of applica- tion data within containers.We describe SCONE, a secure container mechanism for Docker that uses the SGX trusted execution support of Intel CPUs to protect container processes from out- side attacks. The design of SCONE leads to (i) a small trusted computing base (TCB) and (ii) a low performance overhead: SCONE offers a secure C standard library in- terface that transparently encrypts/decrypts I/O data; to reduce the performance impact of thread synchronization and system calls within SGX enclaves, SCONE supports user-level threading and asynchronous system calls. Our evaluation shows that it protects unmodified applications with SGX, achieving 0.6✓–1.2✓ of native throughput.

Conference paper

Brenner S, Wulf C, Lorenz M, Weichbrodt N, Goltzsche G, Fetzer C, Pietzuch PR, Kapitza Ret al., 2016, SecureKeeper: confidential zooKeeper using intel SGX, ACM/IFIP/USENIX International Conference on Middleware (Middleware), 2016, Publisher: ACM, Pages: 1-13

Cloud computing, while ubiquitous, still suffers from trust issues, especially for applications managing sensitive data. Third-party coordination services such as ZooKeeper and Consul are fundamental building blocks for cloud applications, but are exposed to potentially sensitive application data. Recently, hardware trust mechanisms such as Intel's Software Guard Extensions (SGX) offer trusted execution environments to shield application data from untrusted software, including the privileged Operating System (OS) and hypervisors. Such hardware support suggests new options for securing third-party coordination services.We describe SecureKeeper, an enhanced version of the ZooKeeper coordination service that uses SGX to preserve the confidentiality and basic integrity of ZooKeeper-managed data. SecureKeeper uses multiple small enclaves to ensure that (i) user-provided data in ZooKeeper is always kept encrypted while not residing inside an enclave, and (ii) essential processing steps that demand plaintext access can still be performed securely. SecureKeeper limits the required changes to the ZooKeeper code base and relies on Java's native code support for accessing enclaves. With an overhead of 11%, the performance of SecureKeeper with SGX is comparable to ZooKeeper with secure communication, while providing much stronger security guarantees with a minimal trusted code base of a few thousand lines of code.

Conference paper

Papagiannis I, Watcharapichat P, Muthukumaran D, Pietzuch PRet al., 2016, BrowserFlow: imprecise data flow tracking to prevent accidental data disclosure, Middleware '16: 17th International Middleware Conference, Publisher: ACM, Pages: 1-13

With the use of external cloud services such as Google Docs or Evernote in an enterprise setting, the loss of control over sensitive data becomes a major concern for organisations. It is typical for regular users to violate data disclosure policies accidentally, e.g. when sharing text between documents in browser tabs. Our goal is to help such users comply with data disclosure policies: we want to alert them about potentially unauthorised data disclosure from trusted to untrusted cloud services. This is particularly challenging when users can modify data in arbitrary ways, they employ multiple cloud services, and cloud services cannot be changed.To track the propagation of text data robustly across cloud services, we introduce imprecise data flow tracking, which identifies data flows implicitly by detecting and quantifying the similarity between text fragments. To reason about violations of data disclosure policies, we describe a new text disclosure model that, based on similarity, associates text fragments in web browsers with security tags and identifies unauthorised data flows to untrusted services. We demonstrate the applicability of imprecise data tracking through BrowserFlow, a browser-based middleware that alerts users when they expose potentially sensitive text to an untrusted cloud service. Our experiments show that BrowserFlow can robustly track data flows and manage security tags for documents with no noticeable performance impact.

Conference paper

Pietzuch P, Watcharapichat P, Lopez Morales V, Castro Fernandez Ret al., 2016, Ako: Decentralised Deep Learning with Partial Gradient Exchange, ACM Symposium on Cloud Computing 2016 (SoCC), Publisher: ACM, Pages: 84-97

Distributed systems for the training of deep neural networks (DNNs) with large amounts of data have vastly improved the accuracy of machine learning models for image and speech recognition. DNN systems scale to large cluster deployments by having worker nodes train many model replicas in parallel; to ensure model convergence, parameter servers periodically synchronise the replicas. This raises the challenge of how to split resources between workers and parameter servers so that the cluster CPU and network resources are fully utilised without introducing bottlenecks. In practice, this requires manual tuning for each model configuration or hardware type.We describe Ako, a decentralised dataflow-based DNN system without parameter servers that is designed to saturate cluster resources. All nodes execute workers that fully use the CPU resources to update model replicas. To synchronise replicas as often as possible subject to the available network bandwidth, workers exchange partitioned gradient updates directly with each other. The number of partitions is chosen so that the used network bandwidth remains constant, independently of cluster size. Since workers eventually receive all gradient partitions after several rounds, convergence is unaffected. For the ImageNet benchmark on a 64-node cluster, Ako does not require any resource allocation decisions, yet converges faster than deployments with parameter servers.

Conference paper

Pietzuch PR, Weichbrodt N, Kurmus A, Kurmus Ret al., 2016, AsyncShock: Exploiting Synchronisation Bugs in Intel SGX Enclaves, 21st European Symposium on Research in Computer Security (ESORICS), Publisher: Springer International Publishing, Pages: 440-457, ISSN: 0302-9743

Intel’s Software Guard Extensions (SGX) provide a new hardware-based trusted execution environment on Intel CPUs using secure enclaves that are resilient to accesses by privileged code and physical attackers. Originally designed for securing small services, SGX bears promise to protect complex, possibly cloud-hosted, legacy applications. In this paper, we show that previously considered harmless synchronisation bugs can turn into severe security vulnerabilities when using SGX. By exploiting use-after-free and time-of-check-to-time-of-use (TOCTTOU) bugs in enclave code, an attacker can hijack its control flow or bypass access control.We present AsyncShock, a tool for exploiting synchronisation bugs of multithreaded code running under SGX. AsyncShock achieves this by only manipulating the scheduling of threads that are used to execute enclave code. It allows an attacker to interrupt threads by forcing segmentation faults on enclave pages. Our evaluation using two types of Intel Skylake CPUs shows that AsyncShock can reliably exploit use-after-free and TOCTTOU bugs.

Conference paper

Pietzuch PR, Koliousis A, Weidlich M, Costa P, Wolf A, Castro Fernandez Ret al., 2016, Demo- The SABER system for window-based hybrid stream processing with GPGPUs, DEBS 2016, Publisher: Association for Computing Machinery, Pages: 354-357

Heterogeneous architectures that combine multi-core CPUs withmany-core GPGPUs have the potential to improve the performanceof data-intensive stream processing applications. Yet, a stream pro-cessing engine must execute streaming SQL queries with sufficientdata-parallelism to fully utilise the available heterogeneous proces-sors, and decide how to use each processor in the most effectiveway. Addressing these challenges, we demonstrate SABER, ahybrid high-performance relational stream processing engine forCPUs and GPGPUs. SABER executes window-based streaming SQL queries in a data-parallel fashion and employs an adaptive scheduling strategy to balance the load on the different types of processors. To hidedata movement costs, SABER pipelines the transfer of stream databetween CPU and GPGPU memory. In this paper, we review thedesign principles of SABER in terms of its hybrid stream processingmodel and its architecture for query execution. We also present aweb front-end that monitors processing throughput.

Conference paper

Kalyvianaki E, Fiscato M, Salonidis T, Pietzuch PRet al., 2016, THEMIS: Fairness in Federated Stream Processing under Overload, 2016 International Conference on Management of Data (SIGMOD '16), Publisher: ACM, Pages: 541-553

Federated stream processing systems, which utilise nodes from multiple independent domains, can be found increasingly in multi-provider cloud deployments, internet-of-things systems, collaborative sensing applications and large-scale grid systems. To pool resources from several sites and take advantage of local processing, submitted queries are split into query fragments, which are executed collaboratively by different sites. When supporting many concurrent users, however, queries may exhaust available processing resources, thus requiring constant load shedding. Given that individual sites have autonomy over how they allocate query fragments on their nodes, it is an open challenge how to ensure global fairness on processing quality experienced by queries in a federated scenario.We describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. It executes queries in a globally fair fashion and provides users with constant feedback on the experienced processing quality for their queries. THEMIS associates stream data with its source information content (SIC), a metric that quantifies the contribution of that data towards the query result, based on the amount of source data used to generate it. We provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Our evaluation shows that the BALANCE-SIC algorithm yields balanced SIC values across queries, as measured by Jain's Fairness Index. Our approach also incurs a low execution time overhead.

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.

Request URL: Request URI: /respub/WEB-INF/jsp/search-html.jsp Query String: respub-action=search.html&id=00499513&limit=30&person=true