Imperial College London

Professor Peter Pietzuch

Faculty of Engineering, Department of Computing

Professor of Distributed Systems
 
 
 

Contact

 

+44 (0)20 7594 8314 prp Website

 
 

Location

 

442 Huxley Building, South Kensington Campus



Publications

Citation

BibTeX format

@inproceedings{Castro:2018,
author = {Castro Fernandez, R and Culhane, W and Watcharapichat, P and Weidlich, M and Pietzuch, PR},
publisher = {Association for Computing Machinery (ACM)},
title = {Meta-dataflows: efficient exploratory dataflow jobs},
url = {http://hdl.handle.net/10044/1/58752},
year = {2018}
}

RIS format (EndNote, RefMan)

TY  - CPAPER
AB - Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. This, however, misses out on optimisation opportunities for exploratory workflows, both in terms of scheduling and memory allocation. We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator automatically considers choices in a dataflow; and (b) a choose operator assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results and discarding results from underperforming branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential execution.
AU - Castro Fernandez,R
AU - Culhane,W
AU - Watcharapichat,P
AU - Weidlich,M
AU - Pietzuch,PR
PB - Association for Computing Machinery (ACM)
PY - 2018///
SN - 0730-8078
TI - Meta-dataflows: efficient exploratory dataflow jobs
UR - http://hdl.handle.net/10044/1/58752
ER -