Genetic Algorithm Mutations for Molecules with a Hybrid Language Model-Based GAN Architecture
Description
Drug discovery is a time-consuming, multi-stage process, often taking ~10 to ~15 years to develop candidate molecules into molecular therapeutics. In computer-aided drug discovery, new technologies are being developed to shorten the first stage of the process: screening candidates for hit molecules. Given the vast size of the chemical space from which a new drug molecule must be selected, this screening step is a challenge, and reducing the number of costly experiments required is a priority.
A desirable solution for accelerating this process while keeping costs under control is to generate drug molecules with desired properties via a virtual design-build-test cycle. AI methods and HPC resources have shown potential for leveraging widely available small-molecule libraries to generate new, optimized molecules.
Recent progress has demonstrated the advantages of generative models, specifically Transformer-based language models (LMs), which have been successfully applied to predict desired chemical properties from sequence data (1, 2). These LMs serve as powerful automated mutation operators, learning from commonly occurring chemical sequences available in databases. This calculated shift towards chemical-sequence data for model training signals a move away from the time-consuming feature engineering and curation that has long relied on molecular properties and fingerprints. As an example, our recent work illustrated a possible LM-based strategy for efficiently creating generalizable models for small target molecules and protein sequences (3).
Here we present a first-of-its-kind comparative study between an LM and a novel architecture in which the LM is efficiently deployed within a Generative Adversarial Network (GAN) platform to perform specific optimization tasks using genetic algorithm-based mutations. Fundamentally, this hybrid architecture (LM-GAN) uses a traditional generator and discriminator but takes advantage of a pre-trained LM when predicting new molecules. During training, the mutation rate is varied from 10% to 100% across four population sizes ranging from 5K to 50K. Random mutations are used to select μ parents from the population and to generate new molecules from the top 5 predictions for a given set of masks. The implemented genetic algorithm thus has a (μ+5μ) survivor selection scheme in which only novel, unique molecules are retained in the population.
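To make the evolutionary loop concrete, the following is a minimal sketch of a (μ+5μ) survivor-selection step with an LM acting as the mutation operator. It is not the authors' implementation: the function names (lm_top5_for_mask, fitness), the use of SMILES strings, and the rule for merging offspring back into the population are assumptions for illustration only.

```python
import random

def evolve(population, mutation_rate, lm_top5_for_mask, fitness, generations=10):
    """Hypothetical (mu + 5*mu) loop.

    population: list of SMILES strings
    mutation_rate: fraction in [0.1, 1.0] used to size the parent pool
    lm_top5_for_mask: assumed callable returning the LM's top-5 completions
                      for a masked copy of a parent sequence
    fitness: assumed callable scoring a molecule's optimized property
    """
    seen = set(population)  # only novel, unique molecules are retained
    for _ in range(generations):
        # Select mu parents at random; mu scales with the mutation rate.
        mu = max(1, int(mutation_rate * len(population)))
        parents = random.sample(population, mu)

        # Each parent contributes up to 5 children: the LM's top-5
        # predictions for its masked positions (assumed interface).
        children = []
        for parent in parents:
            children.extend(lm_top5_for_mask(parent))

        # Novelty filter: discard anything already seen.
        novel = [c for c in children if c not in seen]
        seen.update(novel)

        # Survivor selection (assumption: parents and novel children are
        # merged into the population and the fittest molecules survive).
        pool = population + novel
        population = sorted(pool, key=fitness, reverse=True)[:len(population)]
    return population
```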
Our results show that LM-GAN performs better with smaller populations (up to 10K), generating molecules with both better-optimized properties and a greater number of atoms, but this trend reverses as the population size increases. On the other hand, the LM performs better at generating more novel molecules. Finally, when estimating the ratio of accepted molecules to generated novel molecules with the desired optimized properties, LM-GAN performs consistently better across all population sizes.
Beyond drug and molecule discovery, in terms of HPC and AI this work paves the way for further study of the necessity of pre-training and fine-tuning, the population data requirements (type, sampling, diversity, and size), the effect of the GAN framework on LM models as the mutation rate varies, and the effect of replacing CNNs with LMs to capture non-local, long-range dependencies and to address the problem of mode collapse.
Event Type
Workshop
Time
Sunday, 13 November 2022, 10:45am - 11am CST
Location
D222
Recorded