We’re kicking off the new academic year with a seminar by Dr Blake Miller of the LSE Data Science Institute on the Social and Ethical Implications of Data Scarcity and Data Drift in Large Language Models. The seminar is part of our ongoing Unsolved Problems Seminar Series for the DSI Squared partnership.
The Unsolved Problems Seminar Series aims to foster innovation by bridging the gap between the social sciences, computer science and STEM subjects. Speakers present unsolved problems and crowdsource solutions from experts across these fields.
Date: 05 October 2023
Location: Data Science Institute, Imperial College London, William Penney Laboratory, South Kensington Campus, SW7 2AZ
Speaker: Dr Blake Miller
In this project, I investigate how the swift introduction and widespread adoption of powerful large language model (LLM) tools has changed the behavior of data producers and providers. I examine how the use of these tools affects the quality and quantity of data produced on platforms whose content is commonly used to train these models (e.g., Wikipedia, StackOverflow, Quora). I discuss the potential challenges of data drift and domain mismatch resulting from this behavioral shift, specifically concerning safety, content moderation, and the factual accuracy of LLM outputs. This project aims to highlight the extent of behavior change among content creators and to emphasize the risk that LLMs become less reliable due to a scarcity of non-synthetic data.