
In the third instalment of the DSI Squared Unsolved Problems Seminar Series, Dr Ovidiu Șerban, Research Fellow in Intelligent Data Processing and Curation at Imperial’s Data Science Institute, will present and discuss ‘The need for “smarter” data curation methods’.

 

When your data science research hits an issue, what do you do? You can present your research problems at the DSI Squared Unsolved Problems Research Seminar to get helpful ideas or to find collaborators across disciplines who can help you break through the obstacles.

This series of seminars forms part of the DSI Squared collaboration between the LSE Data Science Institute and ICL Data Science Institute, which aims to foster innovation by bridging the social sciences with computer science and other STEM subjects.

Innovative researchers from both Institutes are invited to showcase their ideas in front of an expert audience of colleagues from both Institutes. These attendees offer their wide-ranging expertise and knowledge to crowdsource solutions to these stumbling blocks. For example, core data science experts may wish for contributions from those with knowledge in social science, and vice versa.

The need for “smarter” data curation methods

Speaker: Dr Ovidiu Șerban

Location: Fawcett House, FAW.2.04, London School of Economics Main Campus, WC2A 2AE
Date: 16 January 2023
Time: 12:30 – 14:00

Abstract: The Deep Learning community is buzzing to find the “best” and “largest” model they can train, without thinking much about the data and where it comes from. This phenomenon makes junior data scientists and students at all levels feel very uneasy with Data Curation, which is still considered an underrated topic. Throughout this talk, we will look at a few projects, their data problems and how we addressed the data curation issues to improve the Machine Learning models. In one of the projects, we will be forecasting COVID-19 cases and excess deaths using data proxies for human activity. In another project, we will look at fraudulent activity detection and the issue of generalising datasets for infrequent events. Finally, we will look at data quality issues with human-annotated data and how to estimate the quality of textual annotations beyond inter-annotator agreements.

The unsolved challenge across all these projects is improving data quality while spending little time manually curating and reviewing the data. Are there more intelligent data curation techniques available to accelerate this process?

Reading list:

  1. Molinas, R., Quilodran Casas, C., Arcucci, R., & Serban, O. A novel approach for predicting epidemiological forecasting parameters based on real-time signals and Data Assimilation. (in review) Available on request.
  2. Tuccella, J., Nadler, P., & Şerban, O. (2021). Protecting Retail Investors from Order Book Spoofing using a GRU-based Detection Model. arXiv. https://doi.org/10.48550/arXiv.2110.03687
  3. Vaghela, U., Rabinowicz, S., Bratsos, P., Martin, G., Fritzilas, E., Markar, S., Purkayastha, S., Stringer, K., Singh, H., Llewellyn, C., Dutta, D., Clarke, J. M., Howard, M., PanSurg REDASA Curators, Serban, O., & Kinross, J. (2021). Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study. Journal of Medical Internet Research, e25714.

For more information or to sign up, visit the event page.

Registration is now closed.