Improving Chemistry Databases Using Machine Learning

A picture tells a thousand words, but that picture can be hard to come by when we’re talking about large datasets of chemical reactions.

DeepMatter own SPRESI, the world’s third-largest chemistry reaction database, with 4.5 million entries. This hand-curated dataset has been built up over 30 years and covers chemistry from the past 100 years. It also forms the backbone of our industry-leading retrosynthesis tool, ICSynth, which is used worldwide in small-molecule drug discovery.

However, because SPRESI covers such a broad field of chemistry and has been developed over such a long time, it naturally contains entries that may not be relevant to the pharmaceutical market. Ever vigilant about our data, we set out to find these reactions so that we could better focus ICSynth on chemistry relevant to small-molecule drug discovery.

One particular case was the presence of a small number of petrochemical reactions in SPRESI. These had been added slowly over decades, but we found that they were popping up inappropriately when finding routes for drug candidates. We therefore decided to try to identify them using active learning, an iterative combination of machine learning and human experts reviewing the data.

In active learning, a machine learning model is trained on a large dataset in which only some of the labels are known for certain. A human then reviews the results and adds new labels based on what they observe. The process repeats until the user is happy with the model’s accuracy.
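The loop described above can be sketched in a few lines of Python. This is an illustration only, not the pipeline run on SPRESI: the toy two-dimensional features, the nearest-centroid classifier, the smallest-margin rule for choosing which point to review, and the `ask_expert` callback standing in for the human reviewer are all assumptions made for the example.

```python
from math import dist

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def fit(labelled):
    """Toy 'model': one centroid per class label (nearest-centroid classifier)."""
    by_class = {}
    for features, label in labelled.items():
        by_class.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_class.items()}

def predict(model, features):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda label: dist(features, model[label]))

def margin(model, features):
    """Gap between the two closest centroids; a small gap means an uncertain point."""
    d = sorted(dist(features, c) for c in model.values())
    return d[1] - d[0]

def active_learning(labelled, unlabelled, ask_expert, rounds):
    """Retrain, send the most uncertain point to the expert, add the label, repeat."""
    labelled = dict(labelled)
    pool = list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        model = fit(labelled)
        query = min(pool, key=lambda f: margin(model, f))
        labelled[query] = ask_expert(query)  # the human-in-the-loop step
        pool.remove(query)
    return fit(labelled), labelled

# Hypothetical seed labels, unlabelled pool and stand-in "expert".
seed = {(0.0, 0.0): "other", (1.0, 1.0): "petrochemical"}
pool = [(0.1, 0.2), (0.9, 0.8), (0.55, 0.5), (0.4, 0.45), (0.8, 0.9), (0.2, 0.1)]
expert = lambda f: "petrochemical" if f[0] + f[1] > 1.0 else "other"

model, labels = active_learning(seed, pool, expert, rounds=3)
```

In practice the stopping rule would be the reviewer's judgement of the model rather than a fixed number of rounds, and the classifier would be whatever suits the feature set.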

In our case, we created a feature set from SPRESI, extracted labels for some known petrochemical and known non-petrochemical reactions, then built a classification model. The results were then visualised, and additional labels were derived from the classifications and from expert exploration of the data. This was done by interactively exploring the visualised space and inspecting points near known petrochemical reactions to check whether they were also petrochemical. It also allowed us to identify additional features that could be derived from the SPRESI reactions to help separate the dataset.
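The idea of inspecting points near known petrochemical reactions can be approximated by a simple nearest-neighbour ranking. The sketch below is an assumption-laden illustration rather than the tooling actually used: the function name `review_candidates`, the Euclidean distance and the toy coordinates are all hypothetical.

```python
from math import dist

def review_candidates(known_positive, unlabelled, k=3):
    """Return the k unlabelled points closest to any known positive point."""
    scored = [
        (min(dist(p, q) for q in known_positive), p)
        for p in unlabelled
    ]
    scored.sort()  # smallest distance first
    return [p for _, p in scored[:k]]

# Toy coordinates standing in for reactions in the feature space.
known = [(1.0, 1.0), (0.9, 1.1)]
pool = [(0.0, 0.1), (0.95, 1.0), (0.5, 0.5), (1.2, 1.0), (0.1, 0.0)]

shortlist = review_candidates(known, pool, k=2)
# shortlist == [(0.95, 1.0), (1.2, 1.0)]
```

The shortlist is only a prompt for the expert: each candidate still needs a human decision before it becomes a label.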

After a few active learning iterations, we had identified a few thousand entries in SPRESI as petrochemical, up considerably from the 50 we started with. Most of the petrochemical points cluster nicely around another set of reaction data, with a few others scattered throughout the image (which in turn correspond to different journals, titles, reaction conditions etc.). We now filter these petrochemical results out of ICSynth during retrosynthesis, making the product and the routes it suggests more appropriate for the pharmaceutical market. Contact us if you have any questions about how ICSynth can be used in your drug discovery process, or if you would like some help in filtering your own ELN datasets.
