How to train your Vicuna – finetuning & evaluating LLMs in the wild
Hao Zhang, Halıcıoğlu Data Science Institute and Department of Computer Science and Engineering, UC San Diego.
Since the release of Meta’s Llama weights, the open source development of large language models (LLMs) has been seeing rapid progress almost every day. This talk will share our experience serving and evaluating 20+ LLM-based chatbots, including Vicuna, within the Chatbot Arena. I will start by briefly explaining Vicuna, an open source chatbot we finetuned from Llama, and the Chatbot Arena platform we developed to evaluate the quality of such models in the wild. I will then discuss the underlying system challenges we faced: how to serve many LLMs with high throughput and low latency, given only a limited number of university-donated GPUs. I’ll cover two key enabling techniques behind the scenes: paged attention (vLLM, SOSP’23) and statistical multiplexing with model parallelism (AlpaServe, OSDI’23). This is joint work with members of the LMSYS Org team.
Hao Zhang is an Assistant Professor in the Halıcıoğlu Data Science Institute and the Department of Computer Science and Engineering at UC San Diego. Before joining UCSD, Hao was a postdoctoral researcher at UC Berkeley working with Ion Stoica (2021 – 2023). Hao completed his Ph.D. in Computer Science at Carnegie Mellon University with Eric Xing (2014 – 2020). During his Ph.D., Hao took a leave to work at the ML startup Petuum (2016 – 2021).
Hao’s research interests lie at the intersection of machine learning and systems. His past work includes Vicuna, FastChat, Alpa, vLLM, Poseidon, and Petuum. His research has been recognized with the Jay Lepreau Best Paper Award at OSDI’21 and an NVIDIA Pioneer Research Award at NeurIPS’17. Parts of his research have been commercialized at multiple start-ups, including Petuum and AnyScale.