Imperial and SOLVE Chemistry bring chemical reaction data to machine learning

by David Silverman

Team members from SOLVE Chemistry

Researchers from Imperial and its spinout company SOLVE Chemistry have presented a chemical dataset at the prestigious AI conference NeurIPS that could help accelerate the use of machine learning to solve solvent challenges in industrial chemistry.

Industrial chemists often use prior data to help predict reaction outcomes such as how a certain solvent or temperature setting will perform in a manufacturing process. But existing datasets are patchy – for example, they typically only include certain solvents and certain temperatures. They are therefore not powerful enough to reliably predict the best way to produce a chemical.

The new dataset contains comprehensive data on one industrially relevant reaction, catechol rearrangement, that could be used to effectively train machine learning algorithms to predict which solvents and conditions will give the best yields. It could also make it possible to train models that find the highest‑yielding options within a shortlist of more sustainable solvents or at the lowest feasible temperature.

The project to acquire and test the new dataset was initiated by Professor Kim Jelfs in Imperial’s Department of Chemistry and Professor Ruth Misener in the Department of Computing, working as part of AIChemy, an EPSRC AI hub for Chemistry. SOLVE Chemistry was funded to produce the first publicly available dataset to include dense sampling of a continuous solvent and temperature space using technology developed by the company.

Imperial researchers led by PhD student Toby Boyne then demonstrated some ways this data can be used to develop and test predictive algorithms. Mr Boyne said: “Predicting the impacts of different solvents is already a significant challenge. By including a range of solvent classes, as well as mixtures of solvents, we hope this dataset inspires the machine learning community to develop models that better capture solvent effects, and are robust across a range of experimental conditions.”

To gather the data, the researchers used automated flow chemistry techniques developed by SOLVE Chemistry, which Imperial graduates Dr Linden Schrecker and Dr Jose Pablo Folch founded on the basis of their EPSRC- and BASF-funded PhD research. These techniques enabled them to gather reaction data continuously as reactions evolved over time.

In the case of temperature and residence time, this allowed them to sample data points densely enough to treat the variables as continuous rather than discrete. This could make it easier for machine learning models to detect nonlinear relationships between, for example, temperature and yield.

In the case of solvent selection, which is traditionally a categorical variable, they obtained continuous data using solvent mixtures, which allows solvent conditions to be explored in a more varied design space. For example, instead of just testing pure water and pure ethanol, they ran reactions along continuous ramps of water-ethanol mixtures and other solvent blends.
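To illustrate how a mixture can turn a categorical solvent choice into a continuous variable, here is a minimal sketch that blends per-solvent descriptor vectors by mixing fraction. It is an assumption for illustration, not the featurisation used by the team: the descriptor values, the `SOLVENT_DESCRIPTORS` table and both function names are hypothetical, and linear blending is only one simple way a mixture could be represented.

```python
from typing import Dict, List

# Placeholder descriptor vectors, made up for illustration only.
# They are NOT measured properties and NOT taken from the dataset.
SOLVENT_DESCRIPTORS: Dict[str, List[float]] = {
    "water": [1.0, 0.0],
    "ethanol": [0.0, 1.0],
}


def mixture_features(solvent_a: str, solvent_b: str, fraction_b: float) -> List[float]:
    """Linearly blend two descriptor vectors by the fraction of solvent B."""
    if not 0.0 <= fraction_b <= 1.0:
        raise ValueError("fraction_b must lie between 0 and 1")
    a = SOLVENT_DESCRIPTORS[solvent_a]
    b = SOLVENT_DESCRIPTORS[solvent_b]
    return [(1.0 - fraction_b) * x + fraction_b * y for x, y in zip(a, b)]


def solvent_ramp(solvent_a: str, solvent_b: str, steps: int) -> List[List[float]]:
    """Evenly spaced points along a ramp from pure A to pure B."""
    if steps < 2:
        raise ValueError("a ramp needs at least two steps")
    return [
        mixture_features(solvent_a, solvent_b, i / (steps - 1))
        for i in range(steps)
    ]


if __name__ == "__main__":
    # Five points from pure water to pure ethanol.
    for point in solvent_ramp("water", "ethanol", 5):
        print(point)
```

With a representation like this, a model sees pure water and pure ethanol as the two ends of a line rather than as unrelated categories, so intermediate blends become points it can interpolate between.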

“What makes a solvent good for one reaction may not be what makes it good for another reaction,” explained SOLVE Chemistry co-founder and Chief Scientific Officer, Dr Jose Pablo Folch. “We’re creating a workflow that combines cutting-edge machine learning and unique data collection to quickly uncover solvent effects on reactions of commercial interest.”

The team put together 1,220 data points, a substantial number for chemical data, which has the potential to enable an entire class of models that could not previously be tested at scale in chemistry. Following the publication at NeurIPS, Gabriel Gibberd in the Department of Chemical Engineering, a graduate of Imperial's Digital Chemistry MSc programme, has used the dataset to achieve even better machine learning performance, in work that has also been accepted at NeurIPS.

Hackathon

The NeurIPS paper is accompanied by a public hackathon, designed to give researchers early access to the dataset and challenge them to build models that predict unseen reaction outcomes as accurately as possible. “Whoever wins will have a solution that will catalyse the next frontier of research in this area,” said Dr Linden Schrecker, SOLVE Chemistry’s co-founder and CEO.
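One plausible way to test predictions on "unseen" outcomes is to hold out every measurement for one solvent at a time, train on the rest and score on the held-out solvent. The sketch below shows that splitting step only. It is an assumption about how such an evaluation could be set up, not the hackathon's actual protocol, and the record fields (`solvent`, `temperature`, `yield`) and values are made up for illustration.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, object]


def leave_one_solvent_out(
    records: List[Record],
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (held_out_solvent, train, test) with one solvent withheld per fold."""
    solvents = sorted({str(r["solvent"]) for r in records})
    for held_out in solvents:
        train = [r for r in records if r["solvent"] != held_out]
        test = [r for r in records if r["solvent"] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy records with invented values, for illustration only.
    toy = [
        {"solvent": "water", "temperature": 175.0, "yield": 0.10},
        {"solvent": "water", "temperature": 200.0, "yield": 0.20},
        {"solvent": "ethanol", "temperature": 175.0, "yield": 0.30},
        {"solvent": "methanol", "temperature": 200.0, "yield": 0.40},
    ]
    for held_out, train, test in leave_one_solvent_out(toy):
        print(held_out, len(train), len(test))
```

Splitting by solvent rather than at random keeps a model from being scored on conditions it has effectively already seen during training.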

Future developments

The team aims to make the dataset a springboard for new research in both chemistry and machine learning. By providing a unique, well‑curated dataset that combines time‑series, temperature-series and continuous solvent spaces, they give machine learning developers a realistic but tractable test bed for ideas in few‑shot learning, active learning and representation learning.

As researchers compete in the hackathon and build on the open data and code, the expectation is that new modelling strategies will emerge that can then be transferred to other reactions, other materials systems and industrial workflows.

Further reading

The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning (PDF)

Catechol Benchmark dataset

SoDaDE: Solvent Data-Driven Embeddings with Small Transformer Models (PDF)

AIChemy

Article text (excluding photos or graphics) © Imperial College London.

Photos and graphics subject to third party copyright used with permission or © Imperial College London.
