Imperial College London

Dr Lluis Vilanova

Faculty of Engineering, Department of Computing

Senior Lecturer
Contact

 

+44 (0)20 7594 8328

Location

 

556, Huxley Building, South Kensington Campus



Publications


24 results found

Bergman S, Silberstein M, Pietzuch P, Shinagawa T, Vilanova L et al., 2023, Translation pass-through for near-native paging performance in VMs, 2023 USENIX Annual Technical Conference (USENIX ATC '23), Publisher: USENIX, Pages: 753-768

Virtual machines (VMs) are used for consolidation, isolation, and provisioning in the cloud, but applications with large working sets are impacted by the overheads of memory address translation in VMs. Existing translation approaches incur non-trivial overheads: (i) nested paging has a worst-case latency that increases with page table depth; and (ii) paravirtualized and shadow paging suffer from high hypervisor intervention costs when updating guest page tables. We describe translation pass-through (TPT), a new memory virtualization mechanism that achieves near-native performance. TPT enables VMs to control virtual memory translation from guest-virtual to host-physical addresses using one-dimensional page tables. At the same time, inter-VM isolation is enforced by the host by exploiting new hardware support for physical memory tagging in commodity CPUs. We prototype TPT by modifying the KVM/QEMU hypervisor and enlightening the Linux guest. We evaluate it by emulating the memory tagging mechanism of AMD CPUs. Our conservative performance estimates show that TPT achieves native performance for real-world data center applications, with speedups of up to 2.4× and 1.4× over nested and shadow paging, respectively.
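To make the translation-cost argument concrete, the following back-of-the-envelope C sketch compares worst-case page-walk memory accesses under nested (two-dimensional) paging with the one-dimensional walk that TPT restores. The arithmetic is the standard radix-walk model, not a figure from the paper's evaluation:

/* Worst-case page-walk accesses: nested (2D) paging vs. TPT's
 * one-dimensional walk. Illustrative arithmetic only. */
#include <stdio.h>

/* Each of the g guest page-table levels holds a guest-physical
 * address that must itself be translated through all h host levels,
 * plus the final data address: (g+1)*(h+1) - 1 accesses. */
static int nested_walk(int g, int h) { return (g + 1) * (h + 1) - 1; }

/* With TPT the guest walks a single guest-virtual -> host-physical
 * table; isolation comes from memory tagging, not extra levels. */
static int tpt_walk(int levels) { return levels; }

int main(void) {
    int g = 4, h = 4; /* x86-64 4-level tables in both dimensions */
    printf("nested: %d accesses, TPT: %d accesses\n",
           nested_walk(g, h), tpt_walk(g));
    return 0;
}

With four levels in both dimensions, the nested walk touches 24 page-table entries in the worst case, against 4 for the one-dimensional walk.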

Conference paper

Sartakov VA, Vilanova L, Geden M, Eyers D, Shinagawa T, Pietzuch P et al., 2023, ORC: Increasing cloud memory density via object reuse with capabilities, 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Publisher: USENIX Association, Pages: 573-587

Cloud environments host many tenants, and typically there is substantial overlap between the application binaries and libraries executed by tenants. Thus, memory de-duplication can increase memory density by allocating memory for shared binaries only once. Existing de-duplication approaches, however, either rely on a shared OS to de-duplicate binary objects, which provides unacceptably weak isolation; or exploit hypervisor-based de-duplication at the level of memory pages, which is blind to the semantics of the objects to be shared. We describe Object Reuse with Capabilities (ORC), which supports the fine-grained sharing of binary objects between tenants, while isolating tenants strongly through a small trusted computing base (TCB). ORC uses hardware support for memory capabilities to isolate tenants, which permits shared objects to be accessible to multiple tenants safely. Since ORC shares binary objects within a single address space through capabilities, it uses a new relocation type to create per-tenant state when loading shared objects. ORC supports the loading of objects by an untrusted guest, outside of its TCB, only verifying the safety of the loaded data. Our experiments show that ORC achieves a higher memory density with a lower overhead than hypervisor-based de-duplication.
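The per-tenant relocation is the interesting mechanism here. As a rough illustration (names and layout invented; CHERI capabilities modeled as plain pointers), shared read-only code can reach its writable globals through a per-tenant base rather than an absolute address:

/* Sketch: shared code resolves its writable state via a per-tenant
 * base installed by the loader, so one code copy serves all tenants. */
#include <stdio.h>
#include <stdlib.h>

#define TENANT_STATE_SIZE 64

/* One writable segment per tenant; in ORC this would be reached
 * through a memory capability scoped to the tenant's own state. */
typedef struct { unsigned char data[TENANT_STATE_SIZE]; } tenant_state;

/* A "relocated" access: an offset into the per-tenant segment,
 * applied to whichever tenant base the loader installed. */
static int *counter_at(tenant_state *base, size_t offset) {
    return (int *)(void *)(base->data + offset);
}

int main(void) {
    tenant_state *a = calloc(1, sizeof *a); /* tenant A's private state */
    tenant_state *b = calloc(1, sizeof *b); /* tenant B's private state */
    size_t counter_off = 0;                 /* assigned at load time */

    (*counter_at(a, counter_off))++;        /* tenant A increments... */
    printf("A=%d B=%d\n",                   /* ...B's copy is untouched */
           *counter_at(a, counter_off), *counter_at(b, counter_off));
    free(a);
    free(b);
    return 0;
}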

Conference paper

Sartakov VA, Vilanova L, Eyers D, Shinagawa T, Pietzuch P et al., 2022, CAP-VMs: Capability-based isolation and sharing in the cloud, 16th USENIX Symposium on Operating Systems Design and Implementation, Pages: 597-612

Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the MMU enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large TCBs in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities offer new opportunities to implement isolation and sharing at a finer granularity. We describe cVMs, a new VM-like abstraction that uses memory capabilities to isolate application components while supporting efficient data sharing, all without mandating application code to be capability-aware. cVMs share a single virtual address space safely, each having only capabilities to access its own memory. A cVM may include a library OS, thus minimizing its dependency on the cloud environment. cVMs efficiently exchange data through two capability-based primitives assisted by a small trusted monitor: (i) an asynchronous read/write interface to buffers shared between cVMs; and (ii) a call interface to transfer control between cVMs. Using these two primitives, we build more expressive mechanisms for efficient cross-cVM communication. Our prototype implementation using CHERI RISC-V capabilities shows that cVMs isolate services (Redis and Python) with low overhead while improving data sharing.
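As a rough sketch of how the two primitives might look to a programmer (all names and signatures are hypothetical, not the paper's interface; the trusted monitor is reduced to in-process stand-ins):

/* Compilable sketch of the two cross-cVM primitives: (i) read/write
 * against a monitor-shared buffer, (ii) a call gate into another cVM. */
#include <stdio.h>
#include <string.h>

#define SHARED_BUF_SIZE 256

/* Stand-in for a buffer the monitor shared between two cVMs. */
static char shared_buf[SHARED_BUF_SIZE];

/* (i) read/write interface to a buffer shared between cVMs */
static int cvm_buf_write(const void *src, size_t len) {
    if (len > SHARED_BUF_SIZE) return -1;
    memcpy(shared_buf, src, len);
    return 0;
}
static int cvm_buf_read(void *dst, size_t len) {
    if (len > SHARED_BUF_SIZE) return -1;
    memcpy(dst, shared_buf, len);
    return 0;
}

/* (ii) call interface: transfer control to another cVM's handler,
 * which the real system mediates via a capability-protected gate. */
typedef int (*cvm_handler)(const char *msg);
static int cvm_call(cvm_handler target, const char *msg) {
    return target(msg);
}

static int service_handler(const char *msg) { /* e.g., a Redis cVM */
    printf("cVM B received: %s\n", msg);
    return 0;
}

int main(void) {
    char out[SHARED_BUF_SIZE];
    cvm_buf_write("hello", 6);
    cvm_buf_read(out, 6);
    return cvm_call(service_handler, out);
}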

Conference paper

Bergman S, Faldu P, Grot B, Vilanova L, Silberstein M et al., 2022, Reconsidering OS memory optimizations in the presence of disaggregated memory, ACM SIGPLAN International Symposium on Memory Management, Publisher: ACM, Pages: 1-14

Tiered memory systems introduce an additional memory level with higher-than-local-DRAM access latency and require sophisticated memory management mechanisms to achieve cost-efficiency and high performance. Recent works focus on byte-addressable tiered memory architectures which offer better performance than pure swap-based systems. We observe that adding disaggregation to a byte-addressable tiered memory architecture requires important design changes that deviate from the common techniques that target lower-latency non-volatile memory systems. Our comprehensive analysis of real workloads shows that the high access latency to disaggregated memory undermines the utility of well-established memory management optimizations. Based on these insights, we develop HotBox – a disaggregated memory management subsystem for Linux that strives to maximize the local memory hit rate with low memory management overhead. HotBox introduces only minor changes to the Linux kernel while outperforming state-of-the-art systems on memory-intensive benchmarks by up to 2.25×.
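The following minimal C sketch shows the kind of hot-page accounting such a subsystem performs: scan access counters each interval, promote hot pages to local DRAM, demote cold ones to the disaggregated tier. Thresholds and data structures are invented for illustration:

/* Toy hot/cold page rebalancing loop for a two-tier memory system. */
#include <stdio.h>

#define NPAGES 8
#define HOT_THRESHOLD 4  /* accesses per scan interval; a tuning knob */

struct page { int accesses; int local; };

static void rebalance(struct page *p, int n) {
    for (int i = 0; i < n; i++) {
        if (p[i].accesses >= HOT_THRESHOLD && !p[i].local)
            p[i].local = 1;   /* promote to local DRAM */
        else if (p[i].accesses < HOT_THRESHOLD && p[i].local)
            p[i].local = 0;   /* demote to disaggregated memory */
        p[i].accesses = 0;    /* age counters for the next interval */
    }
}

int main(void) {
    struct page pages[NPAGES] = {
        {9, 0}, {1, 1}, {6, 1}, {0, 0},
        {5, 0}, {2, 1}, {7, 0}, {3, 1},
    };
    rebalance(pages, NPAGES);
    for (int i = 0; i < NPAGES; i++)
        printf("page %d -> %s\n", i, pages[i].local ? "local" : "remote");
    return 0;
}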

Conference paper

Vilanova L, Maudlej L, Bergman S, Miemietz T, Hille M, Asmussen N, Roitzsch M, Härtig H, Silberstein M et al., 2022, Slashing the disaggregation tax in heterogeneous data centers with FractOS, EuroSys '22: Seventeenth European Conference on Computer Systems, Publisher: ACM, Pages: 352-367

Disaggregated heterogeneous data centers promise higher efficiency, lower total costs of ownership, and more flexibility for data-center operators. However, current software stacks can levy a high tax on application performance. Applications and OSes are designed for systems where local PCIe-connected devices are centrally managed by CPUs, but this centralization introduces unnecessary messages through the shared data-center network in a disaggregated system. We present FractOS, a distributed OS that is designed to minimize the network overheads of disaggregation in heterogeneous data centers. FractOS elevates devices to be first-class citizens, enabling direct peer-to-peer data transfers and task invocations among them, without centralized application and OS control. FractOS achieves this through: (1) new abstractions to express distributed applications across services and disaggregated devices, (2) new mechanisms that enable devices to securely interact with each other and other data-center services, (3) a distributed and isolated OS layer that implements these abstractions and mechanisms, and can run on host CPUs and SmartNICs. Our prototype shows that FractOS accelerates real-world heterogeneous applications by 47%, while reducing their network traffic by 3×.
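A simple message-count model illustrates the disaggregation tax the abstract refers to; the counts below follow a naive host-orchestrated pipeline versus direct peer-to-peer invocation, and are a toy model rather than measurements from the paper:

/* Network messages for an n-stage accelerator pipeline. */
#include <stdio.h>

/* Host-orchestrated: request in, one host<->device round trip per
 * stage, reply out: 2n + 2 messages cross the network. */
static int cpu_centric(int n) { return 2 * n + 2; }

/* Direct peer-to-peer: request in, one device-to-device transfer
 * between consecutive stages, reply out: n + 1 messages. */
static int peer_to_peer(int n) { return n + 1; }

int main(void) {
    for (int n = 2; n <= 4; n++)
        printf("%d stages: %d vs %d messages\n",
               n, cpu_centric(n), peer_to_peer(n));
    return 0;
}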

Conference paper

Sartakov V, Pietzuch P, Vilanova L, 2021, CubicleOS: A library OS with software componentisation for practical isolation, Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), Publisher: ACM, Pages: 546-558

Library OSs have been proposed to deploy applications isolated inside containers, VMs, or trusted execution environments. They often follow a highly modular design in which third-party components are combined to offer the OS functionality needed by an application, and they are customised at compilation and deployment time to fit application requirements. Yet their monolithic design lacks isolation across components: when applications and OS components contain security-sensitive data (e.g., cryptographic keys or user data), the lack of isolation renders library OSs open to security breaches via malicious or vulnerable third-party components.

Conference paper

Sartakov V, O’Keeffe D, Eyers D, Vilanova L, Pietzuch P et al., 2021, Spons & Shields: practical isolation for trusted execution, ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’21), Publisher: ACM, Pages: 186-200

Trusted execution environments (TEEs) promise a cost-effective, “lift-and-shift” solution for deploying security-sensitive applications in untrusted clouds. For this, they must support rich, multi-component applications, but a large trusted computing base (TCB) inside the TEE risks that attackers can compromise application security. Fine-grained compartmentalisation can increase security through defense-in-depth, but current solutions either run all software components unprotected in the same TEE, lack efficient shared memory support, or isolate application processes using separate TEEs, impacting performance and compatibility. We describe the Spons & Shields framework (SSF) for Intel SGX TEEs, which offers intra-TEE compartmentalisation using two new abstractions, Spons and Shields. Spons and Shields generalise process, library and user/kernel isolation inside the TEE while allowing for efficient memory sharing. When users deploy unmodified multi-component applications in a TEE, SSF dynamically creates Spons (one per POSIX process or library) and Shields (to enforce a given security policy for memory accesses). Applications can be hardened with minor code changes, e.g., by using a separate Shield to isolate an SSL library. SSF uses compiler instrumentation to protect Shield boundaries, exploiting MPX instructions if available. We evaluate SSF using a complex application service (NGINX, PHP interpreter and PostgreSQL) and show that its overhead is comparable to process isolation.
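Conceptually, a Shield is a region bound that every instrumented memory access is checked against (the kind of check MPX accelerates in hardware). A minimal C sketch, with invented layout and names:

/* Software bounds check of the sort a compiler could insert at
 * Shield boundaries. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uintptr_t lo, hi; } shield; /* one region per Shield */

/* Instrumented access: abort if ptr escapes the current Shield. */
static void *shield_check(const shield *s, void *ptr, size_t len) {
    uintptr_t p = (uintptr_t)ptr;
    if (p < s->lo || p + len > s->hi) {
        fprintf(stderr, "shield violation\n");
        abort();
    }
    return ptr;
}

int main(void) {
    char secret[32], public_buf[32];
    shield ssl = { (uintptr_t)secret, (uintptr_t)secret + sizeof secret };
    *(char *)shield_check(&ssl, secret, 1) = 'k'; /* inside: allowed */
    /* shield_check(&ssl, public_buf, 1) would abort: outside region */
    (void)public_buf;
    return 0;
}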

Conference paper

Sartakov V, Vilanova L, Pietzuch P, 2020, CubicleOS: A Library OS with Software Componentisation for Practical Isolation

This artefact contains the library OS, two applications, the isolation monitor, and scripts to reproduce experiments from the ASPLOS 2021 paper by V. A. Sartakov, L. Vilanova, and P. Pietzuch, "CubicleOS: A Library OS with Software Componentisation for Practical Isolation", which isolates components of a monolithic library OS without the use of message-based IPC primitives.

Software

Vilanova L, Maudlej L, Hille M, Asmussen N, Roitzsch M, Silberstein M et al., 2020, Caladan: a distributed meta-OS for data center disaggregation, Systems for Post-Moore Architectures (SPMA) 2020

Data center resource disaggregation promises cost savings by pooling compute, storage and memory resources into separate, networked nodes. The benefits of this model are clear, but a closer look shows that its full performance and efficiency potential cannot be easily realized. Existing systems use CPUs pervasively to interface arbitrary devices with the network and to orchestrate communication among them, reducing the benefits of disaggregation. In this paper we present Caladan, a novel system with a trusted universal resource fabric that interconnects all resources and efficiently offloads the system and application control planes to SmartNICs, freeing server CPUs to execute application logic. Caladan offers three core services: capability-driven distributed name space, virtual devices, and direct inter-device communications. These services are implemented in a trusted meta-kernel that executes in per-node SmartNICs. Low-level device drivers running on the commodity host OS are used for setting up accelerators and I/O devices, and exposing them to Caladan. Applications run in a distributed fashion across CPUs and multiple accelerators, which in turn can directly perform I/O, i.e., access files, other accelerators or host services. Our distributed dataflow runtime runs on top of this substrate. It orchestrates the distributed execution, connecting disaggregated resources using data transfers and inter-device communication, while eliminating the performance bottlenecks of the traditional CPU-centric design.

Conference paper

Vilanova L, Amit N, Etsion Y, 2019, Using SMT to accelerate nested virtualization, International Symposium on Computer Architecture (ISCA), Pages: 750-761

IaaS datacenters offer virtual machines (VMs) to their clients, who in turn sometimes deploy their own virtualized environments, thereby running a VM inside a VM. This is known as nested virtualization. VMs are intrinsically slower than bare-metal execution, as they often trap into their hypervisor to perform tasks like operating virtual I/O devices. Each VM trap requires loading and storing dozens of registers to switch between the VM and hypervisor contexts, thereby incurring costly runtime overheads. Nested virtualization further magnifies these overheads, as every VM trap in a traditional virtualized environment triggers at least twice as many traps. We propose to leverage the replicated thread execution resources in simultaneous multithreaded (SMT) cores to alleviate the overheads of VM traps in nested virtualization. Our proposed architecture introduces a simple mechanism to colocate different VMs and hypervisors on separate hardware threads of a core, and replaces the costly context switches of VM traps with simple thread stall and resume events. More concretely, as each thread in an SMT core has its own register set, trapping between VMs and hypervisors does not involve costly context switches, but simply requires the core to fetch instructions from a different hardware thread. Furthermore, our inter-thread communication mechanism allows a hypervisor to directly access and manipulate the registers of its subordinate VMs, given that they both share the same in-core physical register file. A model of our architecture shows up to 2.3× and 2.6× better I/O latency and bandwidth, respectively. We also show a software-only prototype of the system using existing SMT architectures, with up to 1.3× and 1.5× better I/O latency and bandwidth, respectively, and 1.2-2.2× speedups on various real-world applications.
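A rough accounting of why nested traps hurt, and what the stall/resume design removes (the register count is illustrative, not a figure from the paper):

/* Trap cost model: a trap by the nested guest (L2) is first delivered
 * to the bare-metal hypervisor (L0), which reflects it into the guest
 * hypervisor (L1), so each logical trap costs at least twice as many
 * world switches; the SMT design replaces each switch's register
 * save/restore with a thread stall/resume. */
#include <stdio.h>

int main(void) {
    int regs_per_switch = 64;                  /* illustrative count */
    int switches_single = 2;                   /* VM exit + re-entry */
    int switches_nested = 2 * switches_single; /* reflected via L0   */
    printf("single-level: %d register moves/trap\n",
           switches_single * regs_per_switch);
    printf("nested:       %d register moves/trap\n",
           switches_nested * regs_per_switch);
    printf("SMT stall/resume: ~0 architectural register moves\n");
    return 0;
}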

Conference paper

Hunger C, Vilanova L, Papamanthou C, Etsion Y, Tiwari M et al., 2018, DATS - data containers for web applications, Architectural Support for Programming Languages and Operating Systems (ASPLOS), Publisher: ACM, Pages: 722-736

Data containers enable users to control access to their data while untrusted applications compute on it. However, they require replicating an application inside each container, compromising functionality, programmability, and performance. We propose DATS, a system to run web applications that retains application usability and efficiency through a mix of hardware-capability-enhanced containers and the introduction of two new primitives modeled after the popular model-view-controller (MVC) pattern. (1) DATS introduces a templating language to create views that compose data across data containers. (2) DATS uses authenticated storage and confinement to enable an untrusted storage service, such as memcached and deduplication, to operate on plain-text data across containers. These two primitives act as robust declassifiers that allow DATS to enforce non-interference across containers, taking large applications out of the trusted computing base (TCB). We showcase eight different web applications, including GitLab and a Slack-like chat, significantly reduce the worst-case overheads due to application replication, and demonstrate usable performance for common-case usage.

Conference paper

Vilanova L, Jorda M, Navarro N, Etsion Y, Valero M et al., 2017, Direct Inter-Process Communication (dIPC): Repurposing the CODOMs Architecture to Accelerate IPC, 12th European Conference on Computer Systems (EuroSys), Publisher: ACM, Pages: 16-31

Conference paper

Alvarez L, Vilanova L, Moreto M, Casas M, Gonzalez M, Martorell X, Navarro N, Ayguade E, Valero M et al., 2015, Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures, International Symposium on Computer Architecture (ISCA)

Conference paper

Alvarez L, Vilanova L, Gonzalez M, Martorell X, Navarro N, Ayguade E et al., 2015, Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories, IEEE Transactions on Computers, Vol: 64, Pages: 152-165, ISSN: 0018-9340

Journal article

Cabezas J, Vilanova L, Gelado I, Jablin TB, Navarro N, Hwu W-MW et al., 2015, Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes, 29th ACM International Conference on Supercomputing (ICS), Publisher: ACM, Pages: 3-13

Conference paper

Vilanova L, Ben-Yehuda M, Navarro N, Etsion Y, Valero M et al., 2014, CODOMs: Protecting Software with Code-centric Memory Domains, ACM/IEEE 41st Annual International Symposium on Computer Architecture (ISCA), Publisher: IEEE, Pages: 469-480, ISSN: 1063-6897

Conference paper

Cabezas J, Vilanova L, Gelado I, Jablin TB, Navarro N, Hwu W-M et al., 2014, Automatic Execution of Single-GPU Computations across Multiple GPUs, 23rd International Conference on Parallel Architectures and Compilation Techniques (PACT), Publisher: IEEE, Pages: 467-468, ISSN: 1089-795X

Conference paper

Rajovic N, Vilanova L, Villavieja C, Puzovic N, Ramirez A et al., 2013, The low power architecture approach towards exascale computing, Journal of Computational Science, Vol: 4, Pages: 439-443, ISSN: 1877-7503

Journal article

Tanasic I, Vilanova L, Jordà M, Cabezas J, Gelado I, Navarro N, Hwu WM et al., 2013, Comparison based sorting for systems with multiple GPUs, Pages: 1-11

As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9× when running on two GPUs and 3.3× when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.
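An even-split pivot selection of this kind amounts to the classic partition of two sorted arrays at a global rank k: find a cut (i, k-i) such that everything left of the cut is no greater than everything right of it. A single-threaded C sketch of that search; the paper's CUDA implementation and inter-GPU communication technique are not shown:

/* Binary-search the cut point: take a[0..i) from A and b[0..k-i)
 * from B so the left side holds exactly the k smallest elements. */
#include <stdio.h>

static int pivot(const int *a, int na, const int *b, int nb, int k) {
    int lo = k > nb ? k - nb : 0;         /* feasible range for i */
    int hi = k < na ? k : na;
    while (lo < hi) {
        int i = (lo + hi) / 2, j = k - i;
        if (a[i] < b[j - 1]) lo = i + 1;  /* too few taken from A */
        else hi = i;                      /* cut valid or too far right */
    }
    return lo;
}

int main(void) {
    int a[] = {1, 3, 5, 7}, b[] = {2, 4, 6, 8};
    int k = 4;                            /* first half goes to GPU 0 */
    int i = pivot(a, 4, b, 4, k);
    printf("GPU 0 gets %d elems from A, %d from B\n", i, k - i);
    return 0;
}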

Conference paper

Jordà M, Tanasic I, Cabezas J, Vilanova L, Gelado I, Navarro N et al., 2013, Auto-tuning of data communication on heterogeneous systems, Pages: 135-140

Heterogeneous systems formed by traditional CPUs and compute accelerators, such as GPUs, are becoming widely used to build modern supercomputers. However, many different system topologies (i.e., how CPUs, accelerators, and I/O devices are interconnected) are being deployed. Each system organization presents different trade-offs when transferring data between CPUs, accelerators, and nodes within a cluster, requiring different software implementations to achieve optimal data communication bandwidth. In this paper we explore the potential impact of two optimizations to achieve optimal data transfer bandwidth: topology-aware process placement policies, and double-buffering. We design a set of experiments to evaluate all possible alternatives, and run each of them on different hardware configurations. We show that optimal data transfer mechanisms depend on both the hardware topology and the application dataset size. Our experimental evaluation shows that auto-tuning applications to match the hardware topology, and to find the best double-buffering configuration, can improve data transfer bandwidth by up to 70% for local communication and is key to achieving optimal bandwidth in remote communication for data transfers larger than 128KB.
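A minimal C sketch of the double-buffering pattern being tuned: split a large transfer into chunks and alternate two staging buffers, so that with truly asynchronous copy engines the fill of one buffer overlaps the drain of the other. memcpy stands in for the DMA/network operations, and the chunk size is exactly the knob an auto-tuner would search over:

/* Chunked, double-buffered transfer; synchronous stand-in code. */
#include <stdio.h>
#include <string.h>

#define CHUNK (128 * 1024)     /* candidate chunk size to tune */

static char staging[2][CHUNK]; /* the two staging buffers */

static void transfer(const char *src, char *dst, size_t n) {
    size_t off = 0;
    int cur = 0;
    while (off < n) {
        size_t len = n - off < CHUNK ? n - off : CHUNK;
        memcpy(staging[cur], src + off, len); /* fill buffer `cur`     */
        memcpy(dst + off, staging[cur], len); /* drain it; with async  */
        cur ^= 1;                             /* copy engines this     */
        off += len;                           /* overlaps the next fill */
    }
}

int main(void) {
    static char src[300 * 1024], dst[300 * 1024];
    memset(src, 7, sizeof src);
    transfer(src, dst, sizeof src);
    printf("ok: %d\n", dst[0] == 7 && dst[sizeof dst - 1] == 7);
    return 0;
}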

Conference paper

Alvarez L, Vilanova L, Gonzalez M, Martorell X, Navarro N, Ayguade E et al., 2012, Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories, 25th ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Publisher: IEEE, ISSN: 2167-4329

Conference paper

Villavieja C, Karakostas V, Vilanova L, Etsion Y, Ramirez A, Mendelson A, Navarro N, Cristal A, Unsal OS et al., 2011, DiDi: Mitigating the performance impact of TLB shootdowns using a shared TLB directory, Pages: 340-349, ISSN: 1089-795X

Translation Lookaside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the forefront. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, and thereby eliminates the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
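The cost asymmetry is easy to see in a toy model: a conventional shootdown interrupts every core the address space may have run on, while a sharer directory targets only the TLBs that actually cache the mapping. The sketch below is illustrative, not the paper's mechanism in detail:

/* Toy comparison of IPI-based shootdown vs. directory-targeted
 * invalidation after a PTE change. */
#include <stdio.h>

#define NCORES 32

int main(void) {
    /* Which per-core TLBs actually cache the mapping (the directory's
     * sharer set); cores 0 and 2 in this example. */
    unsigned long long cached = 0x5ULL;

    /* Conventional shootdown: the OS over-approximates and IPIs every
     * core the address space has run on (here, all of them). */
    int ipis = 0;
    for (int c = 0; c < NCORES; c++) ipis++;

    /* Directory-based invalidation: touch only the sharer set, and
     * without any interrupt in DiDi's hardware design. */
    int invalidations = 0;
    for (int c = 0; c < NCORES; c++)
        if (cached & (1ULL << c)) invalidations++;

    printf("IPI shootdown: %d interrupts; directory: %d targeted "
           "invalidations\n", ipis, invalidations);
    return 0;
}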

Conference paper

Rajovic N, Puzovic N, Vilanova L, Villavieja C, Ramirez A et al., 2011, The low-power architecture approach towards exascale computing, Pages: 1-2

Energy efficiency is a first-order concern when deploying any computer system. From battery-operated mobile devices to data centers and supercomputers, energy consumption limits the performance that can be offered. We are exploring an alternative to current supercomputers that builds on small, energy-efficient mobile processors. We present results from a prototype system based on the ARM Cortex-A9 and make projections about the possibilities to increase energy efficiency.

Conference paper

Jimenez VJ, Vilanova L, Gelado I, Gil M, Fursin G, Navarro N et al., 2009, Predictive Runtime Code Scheduling for Heterogeneous Architectures, Editors: Seznec, Emer, O'Boyle, Martonosi, Ungerer, Publisher: Springer-Verlag Berlin, Pages: 19-33, ISBN: 978-3-540-92989-5

Book chapter

