Purdue researchers create new data mining framework for AI models

Chloe Kent 20 December 2019 (Last Updated December 20th, 2019 12:07)

Purdue University drug discovery researchers have created a new data mining framework for training machine learning models.

Lemon helps researchers mine the Protein Data Bank in about six minutes. Credit: Shutterstock

Purdue University drug discovery researchers have created a new data mining framework for training artificial intelligence and machine learning models.

The software, known as Lemon, helps researchers mine the Protein Data Bank (PDB), which hosts data on more than 140,000 biomolecular structures, in about six minutes.

A key challenge in using machine learning for drug development is creating a process by which a computer can extract the needed information from a data pool.

Drug scientists must pull biological data and train the software to understand how a typical human body will interact with the combination of ingredients that forms a medication.

Purdue College of Science assistant professor Gaurav Chopra said: “It can take an enormous amount of time to sort through all the accumulated data. Machine learning can help, but you still need a strong framework from which the computer can quickly analyze data to help in the creation of safe and effective drugs.”

Lemon is a fast C++11 library with Python bindings, which lets it mine the PDB with exceptional speed. Loading all traditional mmCIF files in the PDB typically takes around 290 minutes, but Lemon does this in about six minutes when applying a simple workflow on an 8-core machine.
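To put those figures in perspective, the short calculation below works out the implied speedup and throughput. It is a back-of-the-envelope sketch that uses only the approximate numbers quoted in this article (290 minutes, six minutes, 140,000 structures); the function names are illustrative and are not part of Lemon.

```python
# Approximate figures quoted in the article.
BASELINE_MINUTES = 290    # loading all mmCIF files the traditional way
LEMON_MINUTES = 6         # Lemon, simple workflow, 8-core machine
STRUCTURES = 140_000      # "more than 140,000" structures, used as a lower bound


def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    return baseline_minutes / new_minutes


def throughput_per_second(items: int, minutes: float) -> float:
    """Return the average number of items processed per second."""
    return items / (minutes * 60)


ratio = speedup(BASELINE_MINUTES, LEMON_MINUTES)
rate = throughput_per_second(STRUCTURES, LEMON_MINUTES)

print(f"Speedup: about {ratio:.1f}x")
print(f"Throughput: at least {rate:.0f} structures per second")
```

On these numbers, Lemon is roughly 48 times faster than the traditional approach and averages close to 390 structures per second.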

Lemon also allows the user to write custom functions to use as part of their software suite, as well as develop custom functions in a standard manner to generate unique benchmarking datasets for the entire scientific community.
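The general pattern described here, a user-written function applied to every entry and the results merged, can be sketched in plain Python. This is not Lemon's API: the entry data, `count_residues` and `run_workflow` are hypothetical stand-ins that use only the standard library, and the entries are made-up toy values rather than real PDB records.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for parsed entries: entry ID -> residue names (made-up data).
TOY_ENTRIES = {
    "entry_a": ["ALA", "GLY", "HEM"],
    "entry_b": ["ALA", "ALA", "SER"],
    "entry_c": ["GLY", "HEM", "HEM"],
}


def count_residues(item):
    """User-defined per-entry function: count residue names in one entry."""
    entry_id, residues = item
    return entry_id, Counter(residues)


def run_workflow(entries, per_entry, workers=4):
    """Apply per_entry to every entry using a thread pool, then merge the counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _entry_id, counts in pool.map(per_entry, entries.items()):
            total += counts
    return total


totals = run_workflow(TOY_ENTRIES, count_residues)
print(totals.most_common())
```

Swapping in a different per-entry function changes what is mined without touching the workflow code. A thread pool keeps this sketch simple; a production tool would more likely do the heavy lifting in compiled code, as Lemon does with its C++ core.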

Purdue PhD chemistry student Jonathan Fine, who worked on the platform, said: “Experimental structures deposited in PDB have resulted in several advances for structural and computational biology scientific and education communities that help advance drug development and other areas.

“We created Lemon as a one-stop-shop to quickly mine the entire data bank and pull out the useful biological information that is key for developing drugs.”

Lemon was originally designed to create benchmarking sets for drug design software and to identify biomolecular interactions in the PDB that cannot be modeled well, which are known as lemons.