137 results found
Sartakov V, Pietzuch P, Vilanova L, 2021, CubicleOS: A Library OS with Software Componentisation for Practical Isolation, Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21)
Sartakov V, O’Keeffe D, Eyers D, et al., 2021, Spons & shields: practical isolation for trusted execution, ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’21), Publisher: ACM, Pages: 186-200
Trusted execution environments (TEEs) promise a cost-effective, “lift-and-shift” solution for deploying security-sensitive applications in untrusted clouds. For this, they must support rich, multi-component applications, but a large trusted computing base (TCB) inside the TEE risks that attackers can compromise application security. Fine-grained compartmentalisation can increase security through defense-in-depth, but current solutions either run all software components unprotected in the same TEE, lack efficient shared memory support, or isolate application processes using separate TEEs, impacting performance and compatibility.We describe the Spons & Shields framework (SSF) for Intel SGX TEEs, which offers intra-TEE compartmentalisation using two new abstraction, Spons and Shields. Spons and Shields generalise process, library and user/kernel isolation inside the TEE while allowing for efficient memory sharing. When users deploy unmodified multi-component applications in a TEE, SSF dynamically creates Spons (one per POSIX process or library) and Shields (to enforce a given security policy for memory accesses). Applications can be hardened with minor code changes, e.g., by using a separate Shield to isolate an SSL library. SSF uses compiler instrumentation to protect Shield boundaries, exploiting MPX instructions if available. We evaluate SSF using a complex application service (NGINX, PHP interpreter and PostgreSQL) and show that its overhead is comparable to process isolation.
Mai L, Li G, Wagenlander M, et al., 2020, KungFu: making training in distributed machine learning adaptive, USENIX Symposium on Operating Systems Design and Implementation (OSDI), Publisher: Usenix, Pages: 937-954
When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must con-figure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe Kung Fu, a distributed ML library for Tensor-Flow that is designed to enable adaptive training. Kung Fu allows users to express high-level Adaptation Policies(APs)that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the data flowgraph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations
Theodorakis G, Pietzuch P, Pirk H, 2020, SlideSide: a fast incremental stream processing algorithm for multiple queries, 23rd International Conference on Extending Database Technology (EDBT), Pages: 435-438, ISSN: 2367-2005
Aggregate window computations lie at the core of online analyt-ics in both academic and industrial applications. To efficientlycompute sliding windows, the state-of-the-art algorithms utilizeincremental processing that avoids the recomputation of windowresults from scratch. In this paper, we propose a novel algorithm,calledSlideSide, that extendsTwoStacksfor multiple concur-rent aggregate queries over the same data stream. Our approachuses different yet similar processing schemes for invertible andnon-invertible functions and exhibits up to 2×better through-put compared to the state-of-the-art incremental techniques in amulti-query environment.
Theodorakis G, Koliousis A, Pietzuch P, et al., 2020, LightSaber: Efficient Window Aggregation on Multi-Core Processors, New York, NY, USA, Publisher: Association for Computing Machinery, Pages: 2505–2521-2505–2521
Window aggregation queries are a core part of streaming applications. To support window aggregation efficiently, stream processing engines face a trade-off between exploiting parallelism (at the instruction/multi-core levels) and incremental computation (across overlapping windows and queries). Existing engines implement ad-hoc aggregation and parallelization strategies. As a result, they only achieve high performance for specific queries depending on the window definition and the type of aggregation function. We describe a general model for the design space of window aggregation strategies. Based on this, we introduce LightSaber, a new stream processing engine that balances parallelism and incremental processing when executing window aggregation queries on multi-core CPUs. Its design generalizes existing approaches: (i) for parallel processing, LightSaber constructs a parallel aggregation tree (PAT) that exploits the parallelism of modern processors. The PAT divides window aggregation into intermediate steps that enable the efficient use of both instruction-level (i.e., SIMD) and task-level (i.e., multi-core) parallelism; and (ii) to generate efficient incremental code from the PAT, LightSaber uses a generalized aggregation graph (GAG), which encodes the low-level data dependencies required to produce aggregates over the stream. A GAG thus generalizes state-of-the-art approaches for incremental window aggregation and supports work-sharing between overlapping windows. LightSaber achieves up to an order of magnitude higher throughput compared to existing systems-on a 16-core server, it processes 470 million records/s with 132 ?s average latency.
Lind J, Naor O, Eyal I, et al., 2019, Teechain: a secure payment network with asynchronous blockchain access, 27th ACM Symposium on Operating Systems Principles (SOSP), Publisher: ACM Press, Pages: 63-79
Blockchains such as Bitcoin and Ethereum execute payment transactions securely, but their performance is limited by the need for global consensus. Payment networks overcome this limitation through off-chain transactions. Instead of writing to the blockchain for each transaction, they only settle the final payment balances with the underlying blockchain. When executing off-chain transactions in current payment networks, parties must access the blockchain within bounded time to detect misbehaving parties that deviate from the protocol. This opens a window for attacks in which a malicious party can steal funds by deliberately delaying other parties' blockchain access and prevents parties from using payment networks when disconnected from the blockchain.We present Teechain, the first layer-two payment network that executes off-chain transactions asynchronously with respect to the underlying blockchain. To prevent parties from misbehaving, Teechain uses treasuries, protected by hardware trusted execution environments (TEEs), to establish off-chain payment channels between parties. Treasuries maintain collateral funds and can exchange transactions efficiently and securely, without interacting with the underlying blockchain. To mitigate against treasury failures and to avoid having to trust all TEEs, Teechain replicates the state of treasuries using committee chains, a new variant of chain replication with threshold secret sharing. Teechain achieves at least a 33X higher transaction throughput than the state-of-the-art Lightning payment network. A 30-machine Teechain deployment can handle over 1 million Bitcoin transactions per second.
Koliousis A, Watcharapichat P, Weidlich M, et al., 2019, Crossbow: scaling deep learning with small batch sizes on multi-GPU servers, Proceedings of the VLDB Endowment, Vol: 12, ISSN: 2150-8097
Deep learning models are trained on servers with many GPUs, andtraining must scale with the number of GPUs. Systems such asTensorFlow and Caffe2 train models with parallel synchronousstochastic gradient descent: they process a batch of training data ata time, partitioned across GPUs, and average the resulting partialgradients to obtain an updated global model. To fully utilise allGPUs, systems must increase the batch size, which hinders statisticalefficiency. Users tune hyper-parameters such as the learning rate tocompensate for this, which is complex and model-specific.We describeCROSSBOW, a new single-server multi-GPU sys-tem for training deep learning models that enables users to freelychoose their preferred batch size—however small—while scalingto multiple GPUs.CROSSBOWuses many parallel model replicasand avoids reduced statistical efficiency through a new synchronoustraining method. We introduceSMA, a synchronous variant of modelaveraging in which replicasindependentlyexplore the solution spacewith gradient descent, but adjust their searchsynchronouslybased onthe trajectory of a globally-consistent average model.CROSSBOWachieves high hardware efficiency with small batch sizes by poten-tially training multiple model replicas per GPU, automatically tuningthe number of replicas to maximise throughput. Our experimentsshow thatCROSSBOWimproves the training time of deep learningmodels on an 8-GPU server by 1.3–4×compared to TensorFlow.
Clegg R, Landa R, Griffin D, et al., 2019, Faces in the clouds: long-duration, multi-user, cloud-assisted video conferencing, IEEE Transactions on Cloud Computing, Vol: 7, Pages: 756-769, ISSN: 2168-7161
Multi-user video conferencing is a ubiquitous technology. Increasingly end-hosts in a conference are assisted by cloud-based servers that improve the quality of experience for end users. This paper evaluates the impact of strategies for placement of such servers on user experience and deployment cost. We consider scenarios based upon the Amazon EC2 infrastructure as well as future scenarios in which cloud instances can be located at a larger number of possible sites across the planet. We compare a number of possible strategies for choosing which cloud locations should host services and how traffic should route through them. Our study is driven by real data to create demand scenarios with realistic geographical user distributions and diurnal behaviour. We conclude that on the EC2 infrastructure a well chosen static selection of servers performs well but as more cloud locations are available a dynamic choice of servers becomes important.
Pirk H, Giceva J, Pietzuch P, 2019, Thriving in the No Man's Land between compilers and databases, Conference on Innovative Data Systems Research
When developing new data-intensive applications, one faces a build-or-buy decision: use an existing off-the-shelf data management sys-tem (DMS) or implement a custom solution. While off-the-shelfsystems offer quick results, they lack the flexibility to accommo-date the changing requirements of long-term projects. Building asolution from scratch in a general-purpose programming language,however, comes with long-term development costs that may not bejustified. What is lacking is a middle ground or, more precisely,a clear migration path from off-the-shelf Data Management Sys-tems to customized applications in general-purpose programminglanguages. There is, in effect, a no man’s land that neither compilernor database researchers have claimed.We believe that this problem is an opportunity for the databasecommunity to claim a stake. We need to invest effort to transfer theoutcomes of data management research into fields of programminglanguages and compilers. The common complaint that other fieldsare re-inventing database techniques bears witness to the need forthat knowledge transfer. In this paper, we motivate the necessityfor data management techniques in general-purpose programminglanguages and outline a number of specific opportunities for knowl-edge transfer. This effort will not only cover the no man’s land butalso broaden the impact of data management research.
Theodorakis G, Koliousis A, Pietzuch PR, et al., 2018, Hammer Slide: Work- and CPU-efficient Streaming Window Aggregation, International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures, ADMS@VLDB 2018, Rio de Janeiro, Brazil, August 27, 2018, Pages: 34-41
The computation of sliding window aggregates is one of the corefunctionalities of stream processing systems. Presently, there aretwo classes of approaches to evaluating them. The first is non-incremental, i.e., every window is evaluated in isolation even ifoverlapping windows provide opportunities for work-sharing. Whilenot algorithmically efficient, this class of algorithm is usually veryCPU efficient. The other approach is incremental: to the amountpossible, the result of one window evaluation is used to help withthe evaluation of the next window. While algorithmically efficient,the inherent control-dependencies in the CPU instruction streammake this highly CPU inefficient.In this paper, we analyse the state of the art in efficient incre-mental window processing and extend the fastest known algorithm,the Two-Stacks approach with known as well as novel optimisa-tions. These include SIMD-parallel processing, internal data struc-ture decomposition and data minimalism. We find that, thus opti-mised, our incremental window aggregation algorithm outperformsthe state-of-the-art incremental algorithm by up to11×. In ad-dition, it is at least competitive and often significantly (up to80%)faster than a non-incremental algorithm. Consequently, stream pro-cessing systems can use our proposed algorithm to cover incremen-tal as well as non-incremental aggregation resulting in systems thatare simpler as well as faster.
O'Keeffe D, Salonidis T, Pietzuch PR, 2018, Frontier: Resilient Edge Processing for the Internet of Things, Very Large Databases (VLDB), Publisher: ACM, Pages: 1178-1191, ISSN: 2150-8097
In an edge deployment model, Internet-of-Things (IoT) applications, e.g. for building automation or video surveillance, must process data locally on IoT devices without relying on permanent connectivity to a cloud backend. The ability to harness the combined resources of multiple IoT devices for computation is influenced by the quality of wireless network connectivity. An open challenge is how practical edge-based IoT applications can be realised that are robust to changes in network bandwidth between IoT devices, due to interference and intermittent connectivity.We present Frontier, a distributed and resilient edge processing platform for IoT devices. The key idea is to express data-intensive IoT applications as continuous data-parallel streaming queries and to improve query throughput in an unreliable wireless network by exploiting network path diversity: a query includes operator replicas at different IoT nodes, which increases possible network paths for data. Frontier dynamically routes stream data to operator replicas based on network path conditions. Nodes probe path throughput and use backpressure stream routing to decide on transmission rates, while exploiting multiple operator replicas for data-parallelism. If a node loses network connectivity, a transient disconnection recovery mechanism reprocesses the lost data. Our experimental evaluation of Frontier shows that network path diversity improves throughput by 1.3×−2.8×for different IoT applications, while being resilient to intermittent network connectivity.
Castro Fernandez R, Culhane W, Watcharapichat P, et al., 2018, Meta-dataflows: efficient exploratory dataflow jobs, ACM Conference on Management of Data (SIGMOD), Publisher: Association for Computing Machinery (ACM), ISSN: 0730-8078
Distributed dataflow systems such as Apache Spark and ApacheFlink are used to derive new insights from large datasets. While theyefficiently executeconcretedata processing workflows, expressedas dataflow graphs, they lack generic support forexploratory work-flows: if a user is uncertain about the correct processing pipeline,e.g. in terms of data cleaning strategy or choice of model parame-ters, they must repeatedly submit modified jobs to the system. This,however, misses out on optimisation opportunities for exploratoryworkflows, both in terms of scheduling and memory allocation.We describemeta-dataflows(MDFs), a new model to effectivelyexpress exploratory workflows and efficiently execute them oncompute clusters. With MDFs, users specify afamilyof dataflowsusing two primitives: (a) anexploreoperator automatically con-siders choices in a dataflow; and (b) achooseoperator assesses theresult quality of explored dataflow branches and selects a subset ofthe results. We propose optimisations to execute MDFs: a systemcan (i) avoid redundant computation when exploring branches byreusing intermediate results and discarding results from underper-forming branches; and (ii) consider future data access patterns inthe MDF when allocating cluster memory. Our evaluation showsthat MDFs improve the runtime of exploratory workflows by up to90% compared to sequential execution.
Aublin PRER, Kelbert FM, O'Keeffe D, et al., 2018, LibSEAL: revealing service integrity violations using trusted execution, The European Conference on Computer Systems, Publisher: ACM
Garefalakis P, Karanasos K, Pietzuch P, et al., 2018, Medea: scheduling of long running applications in shared production clusters, EuroSys 2018, Publisher: ACM
The rise in popularity of machine learning, streaming, and latency-sensitiveonline applications in shared production clusters hasraised new challenges for cluster schedulers. To optimize theirperformance and resilience, these applications require precise controlof their placements, by means of complex constraints, e.g., tocollocate or separate their long-running containers across groupsof nodes. In the presence of these applications, the cluster schedulermust attain global optimization objectives, such as maximizingthe number of deployed applications or minimizing the violatedconstraints and the resource fragmentation, but without affectingthe scheduling latency of short-running containers.We present Medea, a new cluster scheduler designed for theplacement of long- and short-running containers. Medea introducespowerful placement constraints with formal semantics to captureinteractions among containers within and across applications. Itfollows a novel two-scheduler design: (i) for long-running containers,it applies an optimization-based approach that accounts forconstraints and global objectives; (ii) for short-running containers,it uses a traditional task-based scheduler for low placement latency.Evaluated on a 400-node cluster, our implementation of Medea onApache Hadoop YARN achieves placement of long-running applicationswith significant performance and resilience benefits comparedto state-of-the-art schedulers.
Lind J, Priebe C, Muthukumaran D, et al., 2018, Glamdring: automatic application partitioning for Intel SGX, USENIX Annual Technical Conference 2017, Publisher: USENIX, Pages: 285-298
Trusted execution support in modern CPUs, as offered byIntel SGXenclaves, can protect applications in untrustedenvironments. While prior work has shown that legacyapplications can run in their entirety inside enclaves, thisresults in a large trusted computing base (TCB). Instead,we explore an approach in which wepartitionan applica-tion and use an enclave to protect only security-sensitivedata and functions, thus obtaining a smaller TCB.We describeGlamdring, the first source-level parti-tioning framework that secures applications written inC using Intel SGX. A developer first annotates security-sensitive application data. Glamdring then automaticallypartitions the application into untrusted and enclaveparts: (i) to preserve data confidentiality, Glamdring usesdataflow analysisto identify functions that may be ex-posed to sensitive data; (ii) for data integrity, it usesback-ward slicingto identify functions that may affect sensitivedata. Glamdring then places security-sensitive functionsinside the enclave, and adds runtime checks and crypto-graphic operations at the enclave boundary to protect itfrom attack. Our evaluation of Glamdring with the Mem-cached store, the LibreSSL library, and the Digital Bitboxbitcoin wallet shows that it achieves small TCB sizes andhas acceptable performance overheads.
Rupprecht L, Culhane WJ, Pietzuch P, 2017, SquirrelJoin: network-aware distributed join processing with lazy partitioning, Proceedings of the VLDB Endowment, Vol: 10, Pages: 1250-1261, ISSN: 2150-8097
To execute distributed joins in parallel on compute clusters, systemspartition and exchange data records between workers. With largedatasets, workers spend a considerable amount of time transferringdata over the network. When compute clusters are shared amongmultiple applications, workers must compete for network bandwidthwith other applications. These variances in the available networkbandwidth lead tonetwork skew, which causes straggling workersto prolong the join completion time.We describeSquirrelJoin, a distributed join processing techniquethat useslazy partitioningto adapt to transient network skew inclusters. Workers maintain in-memorylazy partitionsto withhold asubset of records, i.e. not sending them immediately to other work-ers for processing. Lazy partitions are then assigned dynamicallyto other workers based on network conditions: each worker takesperiodic throughput measurements to estimate its completion time,and lazy partitions are allocated as to minimise the join completiontime. We implement SquirrelJoin as part of the Apache Flink dis-tributed dataflow framework and show that, under transient networkcontention in a shared compute cluster, SquirrelJoin speeds up joincompletion times by up to 2.9× with only a small, fixed overhead.
Lind J, Eyal I, Pietzuch PR, et al., 2017, Teechan: payment channels using trusted execution environments, 4th Workshop on Bitcoin and Blockchain Research, Publisher: Springer, ISSN: 0302-9743
Blockchain protocols are inherently limited in transaction throughputand latency. Recent efforts to address performance and scale blockchainshave focused on off-chain payment channels. While such channels can achievelow latency and high throughput, deploying them securely on top of the Bitcoinblockchain has been difficult, partly because building a secure implementationrequires changes to the underlying protocol and the ecosystem.We present Teechan, a full-duplex payment channel framework that exploitstrusted execution environments. Teechan can be deployed securely on the existingBitcoin blockchain without having to modify the protocol. It: (i) achieves a highertransaction throughput and lower transaction latency than prior solutions; (ii) enablesunlimited full-duplex payments as long as the balance does not exceed thechannel’s credit; (iii) requires only a single message to be sent per payment inany direction; and (iv) places at most two transactions on the blockchain underany execution scenario.We have built and deployed the Teechan framework using Intel SGX on theBitcoin network. Our experiments show that, not counting network latencies,Teechan can achieve 2,480 transactions per second on a single channel, with submillisecondlatencies.
Rupprecht L, Zhang R, Owen W, et al., 2017, SwiftAnalytics: optimizing object stores for big data analytics, IEEE International Conference on Cloud Engineering (IC2E), 2017, Publisher: IEEE
Due to their scalability and low cost, object-basedstorage systems are an attractive storage solution and widelydeployed. To gain valuable insight from the data residing inobject storage but avoid expensive copying to a distributedfilesystem (e.g. HDFS), it would be natural to directly use themas a storage backend for data-parallel analytics frameworks suchas Spark or MapReduce. Unfortunately, executing data-parallelframeworks on object storage exhibits severe performance prob-lems, reducing average job completion times by up to 6.5×.We identify the two most severe performance problems whenrunning data-parallel frameworks on the OpenStack Swift objectstorage system in comparison to the HDFS distributed filesystem:(i) the fixed mapping of object names to storage nodes preventslocal writes and adds delay when objects are renamed; (ii) thecoarser granularity of objects compared to blocks reduces datalocality during reads. We propose the SwiftAnalytics objectstorage system to address them: (i) it useslocality-awarewrites tocontrol an object’s location and eliminate unnecessary I/O relatedto renames during job completion, speeding up analytics jobs byup to 5.1×; (ii) it transparently chunks objects into smaller sizedparts to improve data-locality, leading to up to 3.4×faster reads.
Pietzuch PR, Arnautov S, Trach B, et al., 2016, SCONE: secure Linux containers with Intel SGX, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, Publisher: USENIX, Pages: 689-703
In multi-tenant environments, Linux containers managed by Docker or Kubernetes have a lower resource footprint, faster startup times, and higher I/O performance com- pared to virtual machines (VMs) on hypervisors. Yet their weaker isolation guarantees, enforced through soft- ware kernel mechanisms, make it easier for attackers to compromise the confidentiality and integrity of applica- tion data within containers.We describe SCONE, a secure container mechanism for Docker that uses the SGX trusted execution support of Intel CPUs to protect container processes from out- side attacks. The design of SCONE leads to (i) a small trusted computing base (TCB) and (ii) a low performance overhead: SCONE offers a secure C standard library in- terface that transparently encrypts/decrypts I/O data; to reduce the performance impact of thread synchronization and system calls within SGX enclaves, SCONE supports user-level threading and asynchronous system calls. Our evaluation shows that it protects unmodified applications with SGX, achieving 0.6✓–1.2✓ of native throughput.
Papagiannis I, Watcharapichat P, Muthukumaran D, et al., 2016, BrowserFlow: imprecise data flow tracking to prevent accidental data disclosure, Middleware '16: 17th International Middleware Conference, Publisher: ACM, Pages: 1-13
With the use of external cloud services such as Google Docs or Evernote in an enterprise setting, the loss of control over sensitive data becomes a major concern for organisations. It is typical for regular users to violate data disclosure policies accidentally, e.g. when sharing text between documents in browser tabs. Our goal is to help such users comply with data disclosure policies: we want to alert them about potentially unauthorised data disclosure from trusted to untrusted cloud services. This is particularly challenging when users can modify data in arbitrary ways, they employ multiple cloud services, and cloud services cannot be changed.To track the propagation of text data robustly across cloud services, we introduce imprecise data flow tracking, which identifies data flows implicitly by detecting and quantifying the similarity between text fragments. To reason about violations of data disclosure policies, we describe a new text disclosure model that, based on similarity, associates text fragments in web browsers with security tags and identifies unauthorised data flows to untrusted services. We demonstrate the applicability of imprecise data tracking through BrowserFlow, a browser-based middleware that alerts users when they expose potentially sensitive text to an untrusted cloud service. Our experiments show that BrowserFlow can robustly track data flows and manage security tags for documents with no noticeable performance impact.
Brenner S, Wulf C, Lorenz M, et al., 2016, SecureKeeper: confidential zooKeeper using intel SGX, ACM/IFIP/USENIX International Conference on Middleware (Middleware), 2016, Publisher: ACM, Pages: 1-13
Cloud computing, while ubiquitous, still suffers from trust issues, especially for applications managing sensitive data. Third-party coordination services such as ZooKeeper and Consul are fundamental building blocks for cloud applications, but are exposed to potentially sensitive application data. Recently, hardware trust mechanisms such as Intel's Software Guard Extensions (SGX) offer trusted execution environments to shield application data from untrusted software, including the privileged Operating System (OS) and hypervisors. Such hardware support suggests new options for securing third-party coordination services.We describe SecureKeeper, an enhanced version of the ZooKeeper coordination service that uses SGX to preserve the confidentiality and basic integrity of ZooKeeper-managed data. SecureKeeper uses multiple small enclaves to ensure that (i) user-provided data in ZooKeeper is always kept encrypted while not residing inside an enclave, and (ii) essential processing steps that demand plaintext access can still be performed securely. SecureKeeper limits the required changes to the ZooKeeper code base and relies on Java's native code support for accessing enclaves. With an overhead of 11%, the performance of SecureKeeper with SGX is comparable to ZooKeeper with secure communication, while providing much stronger security guarantees with a minimal trusted code base of a few thousand lines of code.
Pietzuch P, Watcharapichat P, Lopez Morales V, et al., 2016, Ako: Decentralised Deep Learning with Partial Gradient Exchange, ACM Symposium on Cloud Computing 2016 (SoCC), Publisher: ACM, Pages: 84-97
Distributed systems for the training of deep neural networks (DNNs) with large amounts of data have vastly improved the accuracy of machine learning models for image and speech recognition. DNN systems scale to large cluster deployments by having worker nodes train many model replicas in parallel; to ensure model convergence, parameter servers periodically synchronise the replicas. This raises the challenge of how to split resources between workers and parameter servers so that the cluster CPU and network resources are fully utilised without introducing bottlenecks. In practice, this requires manual tuning for each model configuration or hardware type.We describe Ako, a decentralised dataflow-based DNN system without parameter servers that is designed to saturate cluster resources. All nodes execute workers that fully use the CPU resources to update model replicas. To synchronise replicas as often as possible subject to the available network bandwidth, workers exchange partitioned gradient updates directly with each other. The number of partitions is chosen so that the used network bandwidth remains constant, independently of cluster size. Since workers eventually receive all gradient partitions after several rounds, convergence is unaffected. For the ImageNet benchmark on a 64-node cluster, Ako does not require any resource allocation decisions, yet converges faster than deployments with parameter servers.
Pietzuch PR, Weichbrodt N, Kurmus A, et al., 2016, AsyncShock: Exploiting Synchronisation Bugs in Intel SGX Enclaves, 21st European Symposium on Research in Computer Security (ESORICS), Publisher: Springer International Publishing, Pages: 440-457, ISSN: 0302-9743
Intel’s Software Guard Extensions (SGX) provide a new hardware-based trusted execution environment on Intel CPUs using secure enclaves that are resilient to accesses by privileged code and physical attackers. Originally designed for securing small services, SGX bears promise to protect complex, possibly cloud-hosted, legacy applications. In this paper, we show that previously considered harmless synchronisation bugs can turn into severe security vulnerabilities when using SGX. By exploiting use-after-free and time-of-check-to-time-of-use (TOCTTOU) bugs in enclave code, an attacker can hijack its control flow or bypass access control.We present AsyncShock, a tool for exploiting synchronisation bugs of multithreaded code running under SGX. AsyncShock achieves this by only manipulating the scheduling of threads that are used to execute enclave code. It allows an attacker to interrupt threads by forcing segmentation faults on enclave pages. Our evaluation using two types of Intel Skylake CPUs shows that AsyncShock can reliably exploit use-after-free and TOCTTOU bugs.
Pietzuch PR, Koliousis A, Weidlich M, et al., 2016, Demo- The SABER system for window-based hybrid stream processing with GPGPUs, DEBS 2016, Publisher: Association for Computing Machinery, Pages: 354-357
Heterogeneous architectures that combine multi-core CPUs withmany-core GPGPUs have the potential to improve the performanceof data-intensive stream processing applications. Yet, a stream pro-cessing engine must execute streaming SQL queries with sufficientdata-parallelism to fully utilise the available heterogeneous proces-sors, and decide how to use each processor in the most effectiveway. Addressing these challenges, we demonstrate SABER, ahybrid high-performance relational stream processing engine forCPUs and GPGPUs. SABER executes window-based streaming SQL queries in a data-parallel fashion and employs an adaptive scheduling strategy to balance the load on the different types of processors. To hidedata movement costs, SABER pipelines the transfer of stream databetween CPU and GPGPU memory. In this paper, we review thedesign principles of SABER in terms of its hybrid stream processingmodel and its architecture for query execution. We also present aweb front-end that monitors processing throughput.
Kalyvianaki E, Fiscato M, Salonidis T, et al., 2016, THEMIS: Fairness in Federated Stream Processing under Overload, 2016 International Conference on Management of Data (SIGMOD '16), Publisher: ACM, Pages: 541-553
Federated stream processing systems, which utilise nodes from multiple independent domains, can be found increasingly in multi-provider cloud deployments, internet-of-things systems, collaborative sensing applications and large-scale grid systems. To pool resources from several sites and take advantage of local processing, submitted queries are split into query fragments, which are executed collaboratively by different sites. When supporting many concurrent users, however, queries may exhaust available processing resources, thus requiring constant load shedding. Given that individual sites have autonomy over how they allocate query fragments on their nodes, it is an open challenge how to ensure global fairness on processing quality experienced by queries in a federated scenario.We describe THEMIS, a federated stream processing system for resource-starved, multi-site deployments. It executes queries in a globally fair fashion and provides users with constant feedback on the experienced processing quality for their queries. THEMIS associates stream data with its source information content (SIC), a metric that quantifies the contribution of that data towards the query result, based on the amount of source data used to generate it. We provide the BALANCE-SIC distributed load shedding algorithm that balances the SIC values of result data. Our evaluation shows that the BALANCE-SIC algorithm yields balanced SIC values across queries, as measured by Jain's Fairness Index. Our approach also incurs a low execution time overhead.
Alim A, Clegg RG, Mai L, et al., 2016, FLICK: developing and running application-specific network services, 2016 USENIX Annual Technical Conference (USENIX ATC '16), Publisher: USENIX Association
Data centre networks are increasingly programmable,with application-specific network services proliferating,from custom load-balancers to middleboxes providingcaching and aggregation. Developers must currently implementthese services using traditional low-level APIs,which neither support natural operations on applicationdata nor provide efficient performance isolation.We describe FLICK, a framework for the programmingand execution of application-specific network serviceson multi-core CPUs. Developers write network servicesin the FLICK language, which offers high-level processingconstructs and application-relevant data types.FLICK programs are translated automatically to efficient,parallel task graphs, implemented in C++ on top of auser-space TCP stack. Task graphs have bounded resourceusage at runtime, which means that the graphsof multiple services can execute concurrently withoutinterference using cooperative scheduling. We evaluateFLICK with several services (an HTTP load-balancer,a Memcached router and a Hadoop data aggregator),showing that it achieves good performance while reducingdevelopment effort.
Koliousis A, Weidlich M, Fernandez R, et al., 2016, Saber: window-based hybrid stream processing for heterogeneous architectures, 2016 ACM SIGMOD/PODS Conference, Publisher: ACM, Pages: 555-569
Modern servers have become heterogeneous, often combining multicoreCPUs with many-core GPGPUs. Such heterogeneous architectureshave the potential to improve the performance of data-intensivestream processing applications, but they are not supported by currentrelational stream processing engines. For an engine to exploit aheterogeneous architecture, it must execute streaming SQL querieswith sufficient data-parallelism to fully utilise all available heterogeneousprocessors, and decide how to use each in the most effectiveway. It must do this while respecting the semantics of streamingSQL queries, in particular with regard to window handling.We describe SABER, a hybrid high-performance relational streamprocessing engine for CPUs and GPGPUs. SABER executes windowbasedstreaming SQL queries in a data-parallel fashion using allavailable CPU and GPGPU cores. Instead of statically assigningquery operators to heterogeneous processors, SABER employs anew adaptive heterogeneous lookahead scheduling strategy, whichincreases the share of queries executing on the processor that yieldsthe highest performance. To hide data movement costs, SABERpipelines the transfer of stream data between different memory typesand the CPU/GPGPU. Our experimental comparison against state-ofthe-artengines shows that SABER increases processing throughputwhile maintaining low latency for a wide range of streaming SQLqueries with small and large windows sizes.
Cherkasova L, Pietzuch P, Wang CL, 2016, Message from the IC2E 2016 program commitee co-chairs, IC2E, Pages: x-xi
Ogden, Thomas D, Pietzuch P, 2016, AT-GIS: highly parallel spatial query processing with associative transducers, ACM SIGMOD International Conference on Management of Data 2016, Publisher: ACM, Pages: 1041-1054
Users in many domains, including urban planning, transportation,and environmental science want to execute analytical queries overcontinuously updated spatial datasets. Current solutions for largescalespatial query processing either rely on extensions to RDBMS,which entails expensive loading and indexing phases when thedata changes, or distributed map/reduce frameworks, running onresource-hungry compute clusters. Both solutions struggle with thesequential bottleneck of parsing complex, hierarchical spatial dataformats, which frequently dominates query execution time. Ourgoal is to fully exploit the parallelism offered by modern multicoreCPUs for parsing and query execution, thus providing theperformance of a cluster with the resources of a single machine.We describe AT-GIS, a highly-parallel spatial query processingsystem that scales linearly to a large number of CPU cores. ATGISintegrates the parsing and querying of spatial data using a newcomputational abstraction called associative transducers(ATs). ATscan form a single data-parallel pipeline for computation withoutrequiring the spatial input data to be split into logically independentblocks. Using ATs, AT-GIS can execute, in parallel, spatial queryoperators on the raw input data in multiple formats, without anypre-processing. On a single 64-core machine, AT-GIS provides 3×the performance of an 8-node Hadoop cluster with 192 cores forcontainment queries, and 10× for aggregation queries.
This data is extracted from the Web of Science and reproduced under a licence from Thomson Reuters. You may not copy or re-distribute this data in whole or in part without the written consent of the Science business of Thomson Reuters.