Imperial College London

ProfessorPeterPietzuch

Faculty of EngineeringDepartment of Computing

Professor of Distributed Systems
 
 
 
//

Contact

 

+44 (0)20 7594 8314prp Website

 
 
//

Location

 

442Huxley BuildingSouth Kensington Campus

//

Summary

 

Publications

Publication Type
Year
to

173 results found

Özcan F, Tian Y, Pietzuch P, 2023, Welcome from the Chairs, SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing, Pages: v-vi

Journal article

Zhu H, Zhao B, Chen G, Chen W, Chen Y, Shi L, Yang Y, Pietzuch P, Chen Let al., 2023, MSRL: Distributed reinforcement learning with dataflow fragments, USENIX Annual Technical Conference (USENIX ATC), Publisher: USENIX Assoc, Pages: 977-993

A wide range of reinforcement learning (RL) algorithms have been proposed, in which agents learn from interactions with a simulated environment. Executing such RL training loops is computationally expensive, but current RL systems fail to support the training loops of different RL algorithms efficiently on GPU clusters: they either hard-code algorithm-specific strategies for parallelization and distribution; or they accelerate only parts of the computation on GPUs (e.g., DNN policy updates). We observe that current systems lack an abstraction that decouples the definition of an RL algorithm from its strategy for distributed execution.We describe MSRL, a distributed RL training systemthat uses the new abstraction of a fragmented dataflowgraph (FDG) to execute RL algorithms in a flexible way.An FDG is a heterogeneous dataflow representation of anRL algorithm, which maps functions from the RL trainingloop to independent parallel dataflow fragments. Fragmentsaccount for the diverse nature of RL algorithms: each fragment can execute on a different device using its own low-level dataflow implementation, e.g., an operator graph of a DNN engine, a CUDA GPU kernel, or a multi-threaded CPU process. At deployment time, a distribution policy governs how fragments are mapped to devices, without changes to the algorithm implementation. Our experiments show that MSRL exposes trade-offs between different execution strategies, while surpassing the performance of existing RL systems.

Conference paper

Sartakov VA, Vilanova L, Geden M, Eyers D, Shinagawa T, Pietzuch Pet al., 2023, ORC: Increasing cloud memory density via object reuse with capabilities, 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Publisher: USENIX Assoc, Pages: 573-587

Cloud environments host many tenants, and typically thereis substantial overlap between the application binaries andlibraries executed by tenants. Thus, memory de-duplicationcan increase memory density by allocating memory for shared binaries only once. Existing de-duplication approaches, however, either rely on a shared OS to de-deduplicate binary objects, which provides unacceptably weak isolation; or exploit hypervisor-based de-duplication at the level of memory pages, which is blind to the semantics of the objects to be shared.We describe Object Reuse with Capabilities (ORC), whichsupports the fine-grained sharing of binary objects betweentenants, while isolating tenants strongly through a smalltrusted computing base (TCB). ORC uses hardware sup-port for memory capabilities to isolate tenants, which permits shared objects to be accessible to multiple tenants safely. Since ORC shares binary objects within a single address space through capabilities, it uses a new relocation type to create per-tenant state when loading shared objects. ORC supports the loading of objects by an untrusted guest, outside of its TCB, only verifying the safety of the loaded data. Our experiments show that ORC achieves a higher memory density with a lower overhead than hypervisor-based de-deduplication.

Conference paper

Bergman S, Silberstein M, Pietzuch P, Shinagawa T, Vilanova Let al., 2023, Translation Pass-Through for Near-Native Paging Performance in VMs, Pages: 753-768

Virtual machines (VMs) are used for consolidation, isolation, and provisioning in the cloud, but applications with large working sets are impacted by the overheads of memory address translation in VMs. Existing translation approaches incur non-trivial overheads: (i) nested paging has a worst-case latency that increases with page table depth; and (ii) paravirtualized and shadow paging suffer from high hypervisor intervention costs when updating guest page tables. We describe translation pass-through (TPT), a new memory virtualization mechanism that achieves near-native performance. TPT enables VMs to control virtual memory translation from guest-virtual to host-physical addresses using one-dimensional page tables. At the same time, inter-VM isolation is enforced by the host by exploiting new hardware support for physical memory tagging in commodity CPUs. We prototype TPT by modifying the KVM/QEMU hypervisor and enlightening the Linux guest. We evaluate it by emulating the memory tagging mechanism of AMD CPUs. Our conservative performance estimates show that TPT achieves native performance for real-world data center applications, with speedups of up to 2.4× and 1.4× over nested and shadow paging, respectively.

Conference paper

Theodorakis G, Kounelis F, Pietzuch P, Pirk Het al., 2022, SCABBARD: single-node fault-tolerant stream processing, 48th International Conference on Very Large Data Bases (VLDB), Publisher: VLDB Endowment, Pages: 361-374, ISSN: 2150-8097

Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single-node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, necessitating complex cluster-based deployments. This lack of built-in fault-tolerance features has hindered the adoption of single-node SPEs.We describe Scabbard, the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth. Scabbard achieves this by integrating persistence operations with the query workload. Within the operator graph, Scabbard determines when to persist streams based on the selectivity of operators: by persisting streams after operators that discard data, it can substantially reduce the required I/O bandwidth. As part of the operator graph, Scabbard supports parallel persistence operations and uses markers to decide when to discard persisted data. The persisted data volume is further reduced using workload-specific compression: Scabbard monitors stream statistics and dynamically generates computationally efficient compression operators. Our experiments show that Scabbard can execute stream queries that process over 200 million tuples per second while recovering from failures with sub-second latencies.

Conference paper

Sartakov VA, Vilanova L, Eyers D, Shinagawa T, Pietzuch Pet al., 2022, CAP-VMs: Capability-based isolation and sharing in the cloud, 16th USENIX Symposium on Operating Systems Design and Implementation, Pages: 597-612

Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the MMU enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large TCBs in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities offer new opportunities to implement isolation and sharing at a finer granularity. We describe cVMs, a new VM-like abstraction that uses memory capabilities to isolate application components while supporting efficient data sharing, all without mandating application code to be capability-aware. cVMs share a single virtual address space safely, each having only capabilities to access its own memory. A cVM may include a library OS, thus minimizing its dependency on the cloud environment. cVMs efficiently exchange data through two capability-based primitives assisted by a small trusted monitor: (i) an asynchronous read/write interface to buffers shared between cVMs; and (ii) a call interface to transfer control between cVMs. Using these two primitives, we build more expressive mechanisms for efficient cross-cVM communication. Our prototype implementation using CHERI RISC-V capabilities shows that cVMs isolate services (Redis and Python) with low overhead while improving data sharing.

Conference paper

Pietzuch P, Shamis A, Canakci B, Castro M, Fournet C, Ashton E, Chamayou A, Clebsch S, Delignat-Lavaud A, Kerner M, Maffre J, Vrousgou O, Wintersteiger CM, Costa M, Russinovich Met al., 2022, IA-CCF: Individual accountability for permissioned ledgers, 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI), Pages: 467-491

Permissioned ledger systems allow a consortium of members that do not trust one another to execute transactions safelyon a set of replicas. Such systems typically use Byzantinefault tolerance (BFT) protocols to distribute trust, which onlyensures safety when fewer than 1/3 of the replicas misbehave.Providing guarantees beyond this threshold is a challenge:current systems assume that the ledger is corrupt and fail toidentify misbehaving replicas or hold the members that operate them accountable—instead all members share the blame.We describe IA-CCF, a new permissioned ledger systemthat provides individual accountability. It can assign blameto the individual members that operate misbehaving replicasregardless of the number of misbehaving replicas or members.IA-CCF achieves this by signing and logging BFT protocolmessages in the ledger, and by using Merkle trees to provideclients with succinct, universally-verifiable receipts as evidence of successful transaction execution. Anyone can auditthe ledger against a set of receipts to discover inconsistenciesand identify replicas that signed contradictory statements. IACCF also supports changes to consortium membership andreplicas by tracking signing keys using a sub-ledger of governance transactions. IA-CCF provides strong disincentives tomisbehavior with low overhead: it executes 47,000 tx/s whileproviding clients with receipts in two network round trips.

Conference paper

Thalheim J, Unnibhavi H, Priebe C, Bhatotia P, Pietzuch Pet al., 2021, Rkt-io: A direct I/O stack for shielded execution, Pages: 490-506

The shielding of applications using trusted execution environments (TEEs) can provide strong security guarantees in untrusted cloud environments. When executing I/O operations, today's shielded execution frameworks, however, exhibit performance and security limitations: they assign resources to the I/O path inefficiently, perform redundant data copies, use untrusted host I/O stacks with security risks and performance overheads. This prevents TEEs from running modern I/O-intensive applications that require high-performance networking and storage. We describe rkt-io (pronounced "rocket I/O"), a direct user-space network and storage I/O stack specifically designed for TEEs that combines high-performance, POSIX compatibility and security. rkt-io achieves high I/O performance by employing direct userspace I/O libraries (DPDK and SPDK) inside the TEE for kernel-bypass I/O. For efficiency, rkt-io polls for I/O events directly, by interacting with the hardware instead of relying on interrupts, and it avoids data copies by mapping DMA regions in the untrusted host memory. To maintain full Linux ABI compatibility, the userspace I/O libraries are integrated with userspace versions of the Linux VFS and network stacks inside the TEE. Since it omits the host OS from the I/O path, does not suffer from host interface/Iago attacks. Our evaluation with Intel SGX TEEs shows that rkt-io is 9×faster for networking and 7× faster for storage compared to host- (Scone) and LibOS-based (SGX-LKL) I/O approaches.

Conference paper

Sartakov V, Pietzuch P, Vilanova L, 2021, CubicleOS: A library OS with software componentisation for practical isolation, Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), Publisher: ACM, Pages: 546-558

Library OSs have been proposed to deploy applications isolated inside containers, VMs, or trusted execution environments. They often follow a highly modular design in which third-party components are combined to offer the OS functionality needed by an application, and they are customised at compilation and deployment time to fit application requirements. Yet their monolithic design lacks isolation across components: when applications and OS components contain security-sensitive data (e.g., cryptographic keys or user data), the lack of isolation renders library OSs open to security breaches via malicious or vulnerable third-party components.

Conference paper

Sartakov V, O’Keeffe D, Eyers D, Vilanova L, Pietzuch Pet al., 2021, Spons & shields: practical isolation for trusted execution, ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’21), Publisher: ACM, Pages: 186-200

Trusted execution environments (TEEs) promise a cost-effective, “lift-and-shift” solution for deploying security-sensitive applications in untrusted clouds. For this, they must support rich, multi-component applications, but a large trusted computing base (TCB) inside the TEE risks that attackers can compromise application security. Fine-grained compartmentalisation can increase security through defense-in-depth, but current solutions either run all software components unprotected in the same TEE, lack efficient shared memory support, or isolate application processes using separate TEEs, impacting performance and compatibility.We describe the Spons & Shields framework (SSF) for Intel SGX TEEs, which offers intra-TEE compartmentalisation using two new abstraction, Spons and Shields. Spons and Shields generalise process, library and user/kernel isolation inside the TEE while allowing for efficient memory sharing. When users deploy unmodified multi-component applications in a TEE, SSF dynamically creates Spons (one per POSIX process or library) and Shields (to enforce a given security policy for memory accesses). Applications can be hardened with minor code changes, e.g., by using a separate Shield to isolate an SSL library. SSF uses compiler instrumentation to protect Shield boundaries, exploiting MPX instructions if available. We evaluate SSF using a complex application service (NGINX, PHP interpreter and PostgreSQL) and show that its overhead is comparable to process isolation.

Conference paper

Sartakov V, Vilanova L, Pietzuch P, 2020, CubicleOS: A Library OS with Software Componentisation for Practical Isolation

This artefact contains the library OS, two applications, isolation monitor, and scripts to reproduce experiments from the ASPLOS 2021 paper by V. A. Sartakov, L. Vilanova, R. Pietzuch — CubicleOS: A Library OS with Software Componentisation for Practical Isolation, which isolates components of a monolithic library OS without the use of message-based IPC primitives.

Software

Mai L, Li G, Wagenlander M, Fertakis K, Brabete A-O, Pietzuch Pet al., 2020, KungFu: making training in distributed machine learning adaptive, USENIX Symposium on Operating Systems Design and Implementation (OSDI), Publisher: Usenix, Pages: 937-954

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must con-figure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe Kung Fu, a distributed ML library for Tensor-Flow that is designed to enable adaptive training. Kung Fu allows users to express high-level Adaptation Policies(APs)that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the data flowgraph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations

Conference paper

Goeschka KM, Oliveira R, Pietzuch P, Russello Get al., 2020, Editorial message: Special track on dependable, adaptive and trustworthy distributed systems, Proceedings of the ACM Symposium on Applied Computing, Pages: 230-231

Journal article

Theodorakis G, Pietzuch P, Pirk H, 2020, SlideSide: a fast incremental stream processing algorithm for multiple queries, 23rd International Conference on Extending Database Technology (EDBT), Pages: 435-438, ISSN: 2367-2005

Aggregate window computations lie at the core of online analyt-ics in both academic and industrial applications. To efficientlycompute sliding windows, the state-of-the-art algorithms utilizeincremental processing that avoids the recomputation of windowresults from scratch. In this paper, we propose a novel algorithm,calledSlideSide, that extendsTwoStacksfor multiple concur-rent aggregate queries over the same data stream. Our approachuses different yet similar processing schemes for invertible andnon-invertible functions and exhibits up to 2×better through-put compared to the state-of-the-art incremental techniques in amulti-query environment.

Conference paper

Wagenländer M, Mai L, Li G, Pietzuch Pet al., 2020, Spotnik: Designing distributed machine learning for transient cloud resources

To achieve higher utilisation, cloud providers offer VMs with GPUs as lower-cost transient cloud resources. Transient VMs can be revoked at short notice and vary in their availability. This poses challenges to distributed machine learning (ML) jobs, which perform long-running stateful computation in which many workers maintain and synchronise model replicas. With transient VMs, existing systems either require a fixed number of reserved VMs or degrade performance when recovering from revoked transient VMs. We believe that future distributed ML systems must be designed from the ground up for transient cloud resources. This paper describes SPOTNIK, a system for training ML models that features a more adaptive design to accommodate transient VMs: (i) SPOTNIK uses an adaptive implementation of the all-reduce collective communication operation. As workers on transient VMs are revoked, SPOTNIK updates its membership and uses the all-reduce ring to recover; and (ii) SPOTNIK supports the adaptation of the synchronisation strategy between workers. This allows a training job to switch between different strategies in response to the revocation of transient VMs. Our experiments show that, after VM revocation, SPOTNIK recovers training within 300 ms for ResNet/ImageNet.

Conference paper

Shillaker S, Pietzuch P, 2020, FAASM: Lightweight isolation for efficient stateful serverless computing, Pages: 419-433

Serverless computing is an excellent fit for big data processing because it can scale quickly and cheaply to thousands of parallel functions. Existing serverless platforms isolate functions in ephemeral, stateless containers, preventing them from directly sharing memory. This forces users to duplicate and serialise data repeatedly, adding unnecessary performance and resource costs. We believe that a new lightweight isolation approach is needed, which supports sharing memory directly between functions and reduces resource overheads. We introduce Faaslets, a new isolation abstraction for high-performance serverless computing. Faaslets isolate the memory of executed functions using software-fault isolation (SFI), as provided by WebAssembly, while allowing memory regions to be shared between functions in the same address space. Faaslets can thus avoid expensive data movement when functions are co-located on the same machine. Our runtime for Faaslets, FAASM, isolates other resources, e.g. CPU and network, using standard Linux cgroups, and provides a low-level POSIX host interface for networking, file system access and dynamic loading. To reduce initialisation times, FAASM restores Faaslets from already-initialised snapshots. We compare FAASM to a standard container-based platform and show that, when training a machine learning model, it achieves a 2× speed-up with 10× less memory; for serving machine learning inference, FAASM doubles the throughput and reduces tail latency by 90%.

Conference paper

Theodorakis G, Koliousis A, Pietzuch P, Pirk Het al., 2020, LightSaber: Efficient Window Aggregation on Multi-Core Processors, New York, NY, USA, Publisher: Association for Computing Machinery, Pages: 2505–2521-2505–2521

Window aggregation queries are a core part of streaming applications. To support window aggregation efficiently, stream processing engines face a trade-off between exploiting parallelism (at the instruction/multi-core levels) and incremental computation (across overlapping windows and queries). Existing engines implement ad-hoc aggregation and parallelization strategies. As a result, they only achieve high performance for specific queries depending on the window definition and the type of aggregation function. We describe a general model for the design space of window aggregation strategies. Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs. Its design generalizes existing approaches: (i) for parallel processing, LightSaber constructs a parallel aggregation tree (PAT) that exploits the parallelism of modern processors. The PAT divides window aggregation into intermediate steps that enable the efficient use of both instruction-level (i.e., SIMD) and task-level (i.e., multi-core) parallelism; and (ii) to generate efficient incremental code from the PAT, LightSaber uses a generalized aggregation graph (GAG), which encodes the low-level data dependencies required to produce aggregates over the stream. A GAG thus generalizes state-of-the-art approaches for incremental window aggregation and supports work-sharing between overlapping windows. LightSaber achieves up to an order of magnitude higher throughput compared to existing systems-on a 16-core server, it processes 470 million records/s with 132 ?s average latency.

Conference paper

Brito A, Fetzer C, Köpsell S, Pietzuch P, Pasin M, Felber P, Fonseca K, Rosa M, Gomes L, Riella R, Prado C, Rust LF, Lucani DE, Sipos M, Nagy L, Fehér Met al., 2019, Secure end-to-end processing of smart metering data, Journal of Cloud Computing, Vol: 8

Cloud computing considerably reduces the costs of deploying applications through on-demand, automated and fine-granular allocation of resources. Even in private settings, cloud computing platforms enable agile and self-service management, which means that physical resources are shared more efficiently. Cloud computing considerably reduces the costs of deploying applications through on-demand, automated and fine-granular allocation of resources. Even in private settings, cloud computing platforms enable agile and self-service management, which means that physical resources are shared more efficiently. Nevertheless, using shared infrastructures also creates more opportunities for attacks and data breaches. In this paper, we describe the SecureCloud approach. The SecureCloud project aims to enable confidentiality and integrity of data and applications running in potentially untrusted cloud environments. The project leverages technologies such as Intel SGX, OpenStack and Kubernetes to provide a cloud platform that supports secure applications. In addition, the project provides tools that help generating cloud-native, secure applications and services that can be deployed on potentially untrusted clouds. The results have been validated in a real-world smart grid scenario to enable a data workflow that is protected end-to-end: from the collection of data to the generation of high-level information such as fraud alerts.

Journal article

Garefalakis P, Karanasos K, Pietzuch P, 2019, Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications, Pages: 233-245

Distributed dataflow systems allow users to express a wide range of computations, including batch, streaming, and machine learning. A recent trend is to unify different computation types as part of a single stream/batch application that combines latency-sensitive ("stream") and latency-Tolerant ("batch") jobs. This sharing of state and logic across jobs simplifies application development. Examples include machine learning applications that perform batch training and low-latency inference, and data analytics applications that include batch data transformations and low-latency querying. Existing execution engines, however, were not designed for unified stream/batch applications. As we show, they fail to schedule and execute them efficiently while respecting their diverse requirements. We present Neptune, an execution framework for stream/batch applications that dynamically prioritizes tasks to achieve low latency for stream jobs. Neptune employs coroutines as a lightweight mechanism for suspending tasks without losing task progress. It couples this fine-grained control over CPU resources with a locality-And memory-Aware (LMA) scheduling policy to determine which tasks to suspend and when, thereby sharing executors among heterogeneous jobs. We evaluate our open-source Spark-based implementation of Neptune on a 75-node Azure cluster. Neptune achieves up to 3x lower end-To-end processing latencies for latency-sensitive jobs of a stream/batch application, while minimally impacting the throughput of batch jobs.

Conference paper

Lind J, Naor O, Eyal I, Kelbert F, Sirer EG, Pietzuch Pet al., 2019, Teechain: a secure payment network with asynchronous blockchain access, 27th ACM Symposium on Operating Systems Principles (SOSP), Publisher: ACM Press, Pages: 63-79

Blockchains such as Bitcoin and Ethereum execute payment transactions securely, but their performance is limited by the need for global consensus. Payment networks overcome this limitation through off-chain transactions. Instead of writing to the blockchain for each transaction, they only settle the final payment balances with the underlying blockchain. When executing off-chain transactions in current payment networks, parties must access the blockchain within bounded time to detect misbehaving parties that deviate from the protocol. This opens a window for attacks in which a malicious party can steal funds by deliberately delaying other parties' blockchain access and prevents parties from using payment networks when disconnected from the blockchain.We present Teechain, the first layer-two payment network that executes off-chain transactions asynchronously with respect to the underlying blockchain. To prevent parties from misbehaving, Teechain uses treasuries, protected by hardware trusted execution environments (TEEs), to establish off-chain payment channels between parties. Treasuries maintain collateral funds and can exchange transactions efficiently and securely, without interacting with the underlying blockchain. To mitigate against treasury failures and to avoid having to trust all TEEs, Teechain replicates the state of treasuries using committee chains, a new variant of chain replication with threshold secret sharing. Teechain achieves at least a 33X higher transaction throughput than the state-of-the-art Lightning payment network. A 30-machine Teechain deployment can handle over 1 million Bitcoin transactions per second.

Conference paper

Mai L, Koliousis A, Li G, Brabete AO, Pietzuch Pet al., 2019, Taming hyper-parameters in deep learning systems, Operating Systems Review (ACM), Vol: 53, Pages: 52-58, ISSN: 0163-5980

Deep learning (DL) systems expose many tuning parameters (“hyper-parameters”) that affect the performance and accuracy of trained models. Increasingly users struggle to configure hyper-parameters, and a substantial portion of time is spent tuning them empirically. We argue that future DL systems should be designed to help manage hyper-parameters. We describe how a distributed DL system can (i) remove the impact of hyper-parameters on both performance and accuracy, thus making it easier to decide on a good setting, and (ii) support more powerful dynamic policies for adapting hyper-parameters, which take monitored training metrics into account. We report results from prototype implementations that show the practicality of DL system designs that are hyper-parameter-friendly.

Journal article

Koliousis A, Watcharapichat P, Weidlich M, Mai L, Costa P, Pietzuch Pet al., 2019, Crossbow: scaling deep learning with small batch sizes on multi-GPU servers, Proceedings of the VLDB Endowment, Vol: 12, ISSN: 2150-8097

Deep learning models are trained on servers with many GPUs, andtraining must scale with the number of GPUs. Systems such asTensorFlow and Caffe2 train models with parallel synchronousstochastic gradient descent: they process a batch of training data ata time, partitioned across GPUs, and average the resulting partialgradients to obtain an updated global model. To fully utilise allGPUs, systems must increase the batch size, which hinders statisticalefficiency. Users tune hyper-parameters such as the learning rate tocompensate for this, which is complex and model-specific.We describeCROSSBOW, a new single-server multi-GPU sys-tem for training deep learning models that enables users to freelychoose their preferred batch size—however small—while scalingto multiple GPUs.CROSSBOWuses many parallel model replicasand avoids reduced statistical efficiency through a new synchronoustraining method. We introduceSMA, a synchronous variant of modelaveraging in which replicasindependentlyexplore the solution spacewith gradient descent, but adjust their searchsynchronouslybased onthe trajectory of a globally-consistent average model.CROSSBOWachieves high hardware efficiency with small batch sizes by poten-tially training multiple model replicas per GPU, automatically tuningthe number of replicas to maximise throughput. Our experimentsshow thatCROSSBOWimproves the training time of deep learningmodels on an 8-GPU server by 1.3–4×compared to TensorFlow.

Journal article

Clegg R, Landa R, Griffin D, Rio M, Hughes P, Kegel I, Stevens T, Pietzuch P, Williams Det al., 2019, Faces in the clouds: long-duration, multi-user, cloud-assisted video conferencing, IEEE Transactions on Cloud Computing, Vol: 7, Pages: 756-769, ISSN: 2168-7161

Multi-user video conferencing is a ubiquitous technology. Increasingly end-hosts in a conference are assisted by cloud-based servers that improve the quality of experience for end users. This paper evaluates the impact of strategies for placement of such servers on user experience and deployment cost. We consider scenarios based upon the Amazon EC2 infrastructure as well as future scenarios in which cloud instances can be located at a larger number of possible sites across the planet. We compare a number of possible strategies for choosing which cloud locations should host services and how traffic should route through them. Our study is driven by real data to create demand scenarios with realistic geographical user distributions and diurnal behaviour. We conclude that on the EC2 infrastructure a well chosen static selection of servers performs well but as more cloud locations are available a dynamic choice of servers becomes important.

Journal article

Pirk H, Giceva J, Pietzuch P, 2019, Thriving in the No Man's Land between compilers and databases, Conference on Innovative Data Systems Research

When developing new data-intensive applications, one faces a build-or-buy decision: use an existing off-the-shelf data management sys-tem (DMS) or implement a custom solution. While off-the-shelfsystems offer quick results, they lack the flexibility to accommo-date the changing requirements of long-term projects. Building asolution from scratch in a general-purpose programming language,however, comes with long-term development costs that may not bejustified. What is lacking is a middle ground or, more precisely,a clear migration path from off-the-shelf Data Management Sys-tems to customized applications in general-purpose programminglanguages. There is, in effect, a no man’s land that neither compilernor database researchers have claimed.We believe that this problem is an opportunity for the databasecommunity to claim a stake. We need to invest effort to transfer theoutcomes of data management research into fields of programminglanguages and compilers. The common complaint that other fieldsare re-inventing database techniques bears witness to the need forthat knowledge transfer. In this paper, we motivate the necessityfor data management techniques in general-purpose programminglanguages and outline a number of specific opportunities for knowl-edge transfer. This effort will not only cover the no man’s land butalso broaden the impact of data management research.

Conference paper

Goeschka KM, Oliveira R, Pietzuch P, Russello Get al., 2019, Special track on dependable, adaptive and trustworthy distributed systems, Proceedings of the ACM Symposium on Applied Computing, Vol: Part F147772, Pages: 266-267

Journal article

Segarra C, Delgado-Gonzalo R, Lemay M, Aublin PL, Pietzuch P, Schiavoni Vet al., 2019, Using trusted execution environments for secure stream processing of medical data: (Case study paper), Pages: 91-107, ISSN: 0302-9743

Processing sensitive data, such as those produced by body sensors, on third-party untrusted clouds is particularly challenging without compromising the privacy of the users generating it. Typically, these sensors generate large quantities of continuous data in a streaming fashion. Such vast amount of data must be processed efficiently and securely, even under strong adversarial models. The recent introduction in the mass-market of consumer-grade processors with Trusted Execution Environments (TEEs), such as Intel SGX, paves the way to implement solutions that overcome less flexible approaches, such as those atop homomorphic encryption. We present a secure streaming processing system built on top of Intel SGX to showcase the viability of this approach with a system specifically fitted for medical data. We design and fully implement a prototype system that we evaluate with several realistic datasets. Our experimental results show that the proposed system achieves modest overhead compared to vanilla Spark while offering additional protection guarantees under powerful attackers and threat models.

Conference paper

Theodorakis G, Koliousis A, Pietzuch PR, Pirk Het al., 2018, Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation, International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018, Pages: 34-41

The computation of sliding window aggregates is one of the corefunctionalities of stream processing systems. Presently, there aretwo classes of approaches to evaluating them. The first is non-incremental, i.e., every window is evaluated in isolation even ifoverlapping windows provide opportunities for work-sharing. Whilenot algorithmically efficient, this class of algorithm is usually veryCPU efficient. The other approach is incremental: to the amountpossible, the result of one window evaluation is used to help withthe evaluation of the next window. While algorithmically efficient,the inherent control-dependencies in the CPU instruction streammake this highly CPU inefficient.In this paper, we analyse the state of the art in efficient incre-mental window processing and extend the fastest known algorithm,the Two-Stacks approach with known as well as novel optimisa-tions. These include SIMD-parallel processing, internal data struc-ture decomposition and data minimalism. We find that, thus opti-mised, our incremental window aggregation algorithm outperformsthe state-of-the-art incremental algorithm by up to11×. In ad-dition, it is at least competitive and often significantly (up to80%)faster than a non-incremental algorithm. Consequently, stream pro-cessing systems can use our proposed algorithm to cover incremen-tal as well as non-incremental aggregation resulting in systems thatare simpler as well as faster.

Conference paper

Goltzsche D, Ruesch S, Nieke M, Vaucher S, Weichbrodt N, Schiavoni V, Aublin P-L, Costa P, Fetzer C, Felber P, Pietzuch P, Kapitza Ret al., 2018, ENDBOX: scalable middlebox functions using client-side trusted execution, 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), Publisher: IEEE, Pages: 386-397, ISSN: 1530-0889

Many organisations enhance the performance, security, and functionality of their managed networks by deploying middleboxes centrally as part of their core network. While this simplifies maintenance, it also increases cost because middlebox hardware must scale with the number of clients. A promising alternative is to outsource middlebox functions to the clients themselves, thus leveraging their CPU resources. Such an approach, however, raises security challenges for critical middlebox functions such as firewalls and intrusion detection systems. We describe EndBox, a system that securely executes middlebox functions on client machines at the network edge. Its design combines a virtual private network (VPN) with middlebox functions that are hardware-protected by a trusted execution environment (TEE), as offered by Intel's Software Guard Extensions (SGX). By maintaining VPN connection endpoints inside SGX enclaves, EndBox ensures that all client traffic, including encrypted communication, is processed by the middlebox. Despite its decentralised model, EndBox's middlebox functions remain maintainable: they are centrally controlled and can be updated efficiently. We demonstrate EndBox with two scenarios involving (i) a large company; and (ii) an Internet service provider that both need to protect their network and connected clients. We evaluate EndBox by comparing it to centralised deployments of common middlebox functions, such as load balancing, intrusion detection, firewalling, and DDoS prevention. We show that EndBox achieves up to 3.8x higher throughput and scales linearly with the number of clients.

Conference paper

Pietzuch P, 2018, Message from the ICDCS 2018 program chair, Proceedings - International Conference on Distributed Computing Systems, Vol: 2018-July

Journal article

Brito A, Fetzer C, Kopsell S, Pietzuch P, Pasin M, Felber P, Fonseca K, Rosa M, Gomes L, Riella R, Prado C, Da Costa Carmo LFR, Lucani D, Sipos M, Nagy L, Feher Met al., 2018, Cloud challenge: Secure end-To-end processing of smart metering data, Pages: 19-20

Cloud computing considerably reduces the costs of deploying applications through on-demand, automated, and fine-granular allocation of resources. Even in private settings, cloud computing platforms enable agile and self-service management, which means that physical resources are shared more efficiently. Nevertheless, using shared infrastructures also creates more opportunities for attacks and data breaches. In this paper, we describe the SecureCloud approach. The SecureCloud project aims to enable confidentiality and integrity of data and applications running in potentially untrusted cloud environments. The project leverages technologies such as Intel SGX, OpenStack and Kubernetes to provide a cloud platform that supports secure applications. In addition, the project provides tools that help generating cloud-native, secure applications and services to be deployed on potentially untrusted clouds. The results have been validated in the smart grid scenario and enabled a data workflow that is protected end-To-end: from the collection of data to the generation of high-level information such as fraud alerts.

Conference paper

This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.

Request URL: http://wlsprd.imperial.ac.uk:80/respub/WEB-INF/jsp/search-html.jsp Request URI: /respub/WEB-INF/jsp/search-html.jsp Query String: respub-action=search.html&id=00499513&limit=30&person=true