How machine learning can drive new material discovery


Dr Jacqui Cole of the University of Cambridge explained to a packed lecture theatre the many steps needed to mine data and design the optimum material for a given application.

Dr Jacqui Cole of the University of Cambridge gave an insightful IMSE Highlight Seminar on data-driven molecular engineering of functional materials.

The world needs new materials to stimulate industry in key sectors of our economy, from environment and sustainability to information storage and the efficiency of chemical processes. Yet nearly all functional materials (materials that possess particular native properties and functions of their own) are still discovered by ‘trial-and-error’. This lack of predictability creates a bottleneck for technological innovation. That is according to Dr Jacqui Cole, Head of the Molecular Engineering group at the Cavendish Laboratory, who gave her IMSE Highlight Seminar last Thursday. Dr Cole's Molecular Engineering group is a joint initiative between the Cavendish Laboratory and the Department of Chemical Engineering and Biotechnology at Cambridge, together with the ISIS Facility at RAL.

“The world needs new materials to stimulate industry in key sectors of our economy.”
Dr Jacqui Cole, Head of the Molecular Engineering group, Cavendish Laboratory, University of Cambridge

In her seminar, Dr Cole gave a detailed introduction to data-driven molecular engineering, along with examples of how the emerging field offers prospective solutions. Such approaches to materials discovery are only now becoming possible thanks to recent advances in artificial intelligence, the rapid rise in high-performance computing capacity, and changes in government legislation on open access to scientific data.

Dr Cole's Molecular Engineering research group has succeeded in encoding a given molecular design and engineering strategy into algorithms that search through massive chemical-property datasets to discover a material that suits a given application.
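
To make the idea of encoding a design strategy as a search more concrete, here is a minimal Python sketch. It is not the group's actual algorithm: the `Candidate` fields, the thresholds, and the dataset entries are all hypothetical and chosen only for illustration. It treats a design strategy as a set of property criteria, filters a small dataset against them, and ranks the survivors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One row of a chemical-property dataset (hypothetical fields)."""
    name: str
    absorption_peak_nm: float  # wavelength of maximum absorption
    molar_extinction: float    # absorption strength, in M^-1 cm^-1


def screen(candidates, min_peak_nm, max_peak_nm, min_extinction):
    """Keep candidates that meet every criterion, strongest absorbers first."""
    hits = [
        c for c in candidates
        if min_peak_nm <= c.absorption_peak_nm <= max_peak_nm
        and c.molar_extinction >= min_extinction
    ]
    return sorted(hits, key=lambda c: c.molar_extinction, reverse=True)


# Made-up entries, for illustration only.
DATASET = [
    Candidate("dye-A", 380.0, 12000.0),
    Candidate("dye-B", 520.0, 45000.0),
    Candidate("dye-C", 610.0, 8000.0),
    Candidate("dye-D", 480.0, 30000.0),
]

# Assumed criteria: absorb in the visible range, and absorb strongly.
shortlist = screen(DATASET, 450.0, 650.0, 20000.0)
print([c.name for c in shortlist])
```

A real screening pipeline would work with far larger datasets and more sophisticated criteria, but the pattern of filtering and then ranking is the same.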

The machine learning approach

The materials discovery approach uses machine learning to comb the available scientific literature and automatically extract chemical information, using a tool called ChemDataExtractor. ChemDataExtractor can extract chemical names, properties, and spectra from a journal article so that they can be imported into a database or spreadsheet.

Figure: How the ChemDataExtractor tool works in three steps.

ChemDataExtractor uses state-of-the-art natural language processing algorithms to interpret the English-language text that makes up the majority of scientific documents, and machine-learning methods to extract valuable information from each sentence. It then produces a full record containing identifiers, properties, and spectra for each unique chemical entity in the document.
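
The idea of building one record per unique chemical entity can be illustrated with a toy Python sketch. This is not how ChemDataExtractor works internally: simple regular expressions stand in for its natural language processing, and the patterns, property names, and sample text are all assumptions made for illustration. The sketch splits text into sentences, pulls a name and a value out of each matching sentence, and merges everything found for the same name into a single record.

```python
import re
from collections import defaultdict

# Hypothetical patterns, one per property; each captures a name and a value.
PATTERNS = {
    "melting_point_K": re.compile(
        r"(?P<name>[A-Z][A-Za-z0-9-]*) melts at (?P<value>\d+(?:\.\d+)?) K"
    ),
    "absorption_peak_nm": re.compile(
        r"(?P<name>[A-Z][A-Za-z0-9-]*) absorbs at (?P<value>\d+(?:\.\d+)?) nm"
    ),
}


def extract_records(text):
    """Return one merged record per chemical name found in the text."""
    merged = defaultdict(dict)
    # Naive sentence split: break after a full stop followed by whitespace.
    for sentence in re.split(r"(?<=\.)\s+", text):
        for prop, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                merged[match["name"]][prop] = float(match["value"])
    return [{"name": name, **props} for name, props in merged.items()]


# Made-up text and values, for illustration only.
SAMPLE = (
    "Dye-A melts at 450 K. "
    "In ethanol, Dye-A absorbs at 520 nm. "
    "Dye-B melts at 390.5 K."
)

records = extract_records(SAMPLE)
for record in records:
    print(record)
```

The two sentences about `Dye-A` end up in one record, which is the merging step the paragraph above describes. Each record is a plain dictionary, so a list of them can be written straight to a spreadsheet or database.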

This means that data from ChemDataExtractor can be used to predict new functional materials, which can then be experimentally validated using a range of advanced materials characterisation and device testing methods. One example of the potential of this novel approach is the discovery of new light-harvesting materials for dye-sensitized solar cells, which have been incorporated into several buildings as a source of renewably generated electricity.

ChemDataExtractor is available as an open-source Python package that you can download and use for free at


Dr Kieran Brophy
Faculty of Engineering
