Taskerman: A Distributed Cluster Task Manager



Raghavendra Prabhu, Yelp


At Yelp, we use a wide variety of datastore technologies, including Cassandra, Elasticsearch, MySQL and Zookeeper. These distributed datastores ask to be treated like pets, but at the scale of our systems they can only be reared like cattle. Requiring engineers to give them frequent attention is neither feasible nor scalable. We need cluster automation that is powerful, reliable and, most importantly, safe. This is where Taskerman steps in.

Taskerman is a distributed cluster task manager that wears many hats to keep our clusters highly available, consistent, secure and in optimal condition. It currently automates periodic tasks such as backups, repairs, restarts, monitoring and reboots. It is built to be extensible and composable enough to solve multiple real-world problems.

The talk will cover the genesis of Taskerman inside Yelp, its architecture and evolution, its adoption of best practices from distributed systems theory, and its future challenges and roadmap.


Raghavendra Prabhu works as a Software Engineer in the Distributed Systems team at Yelp's London office. His work revolves around distributed datastores such as Cassandra, Elasticsearch and Zookeeper, their interactions, and their automation. Prior to that, he was the Product Lead of Galera-based Percona XtraDB Cluster (PXC) at Percona. He started his career at Yahoo as a Systems Engineer, working primarily with Yahoo's database stacks.

Raghavendra's main interests include databases, virtualization and containers, distributed systems, and operating systems. In his spare time, he likes to read books and technical papers/literature, listen to music, hack on FOSS, and go hiking in nature reserves.

He has previously spoken at conferences including Percona Live, FOSDEM, LinuxConfAu (LCA), Fossetcon, RICON, Highload++ and SCALE. Slides from these talks are available here: http://www.slideshare.net/slidunder