BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20230124T171523Z
LOCATION:D222
DTSTART;TZID=America/Chicago:20221113T104500
DTEND;TZID=America/Chicago:20221113T110000
UID:submissions.supercomputing.org_SC22_sess432_ws_cafcw104@linklings.com
SUMMARY:Genetic Algorithm Mutations for Molecules with a Hybrid Language M
 odel-Based GAN Architecture
DESCRIPTION:Workshop\n\nGenetic Algorithm Mutations for Molecules with a H
 ybrid Language Model-Based GAN Architecture\n\nBhowmik, Blanchard, Lyngaas
 , Wang, Irle...\n\nDrug discovery is a time-consuming process with success
 ive stages, often taking ~10 to ~15 years to develop candidate molecules i
 nto molecular therapeutics. In computer-aided drug discovery, new tech
 nologies are being developed to shorten the first stage of the drug discov
 ery process: screening candidates for hit molecules. Given the large size 
 of chemical space from which a new drug molecule has to be selected, this 
 screening step is a challenge and reducing the number of costly experiment
 s required is a priority.  \n\nA desirable solution for accelerating this 
 process while keeping the cost under control is to generate drug molecules
  with desired properties via a virtual design-build-test cycle. AI methods
  and HPC resources have shown potential for leveraging widely available sm
 all-molecule libraries to generate new optimized molecules.\n\nRecent prog
 ress has demonstrated advantages of using generative models, specifically 
 Transformer-based language models (LM) that have been successfully impleme
 nted to predict desired chemical properties from sequence data (1, 2). The
 se LMs are applied as powerful automated mutation operators, learning from
  commonly occurring chemical sequences available in the database. This
  calculated shift toward chemical sequences for model training points to a
  revo
 lution in moving away from the time-consuming feature engineering and cura
 tion that has long relied on molecular properties and fingerprints. As an 
 example, our recent work illustrated a possible LM-based efficient strateg
 y for creating generalizable models for small target molecules and protein
  sequences (3).   \n\nHere we present a first-of-its-kind comparative stud
 y between an LM and a novel architecture in which the LM is efficiently
  deployed on a Generative Adversarial Network (GAN) platform to perform
  different specific optimization tasks using genetic algorithm-based
  mutations. Fundamentally this hybrid architecture (LM-GAN) uses a
  traditional generator and discriminator but takes advantage of a
  pre-trained LM when predicting new molecules. During training, the
  mutation rate is varied from 10% to 100% for four population sizes
  ranging from 5K to 50K. Random mu
 tations were considered to select μ parents from the population and to
  generate new molecules with the top 5 predictions for a given set of
  masks. Thus, the implemented genetic algorithm has a (μ+5μ) survivor
  selection scheme in which only novel unique molecules are retained in
  the population.\n\nOur results show t
 hat LM-GAN performs better with smaller populations (up to 10K) in gen
 erating molecules both in terms of better optimized properties and with a 
 greater number of atoms, but this trend reverses as the population size in
 creases. On the other hand, LM performs better in terms of generating more
  novel molecules. Finally, when estimating the ratio of accepted molecules
  to the generated novel molecules with desired optimized properties, LM-GA
 N performs consistently better for all population sizes.\n\nApart from dru
 g or molecule discovery, in terms of HPC and AI, this work paves the way f
 or further study in understanding the necessity of pre-training and fine-
 tuning of population data (type, sampling, diversity and size) requirement
 s, the effect of GAN framework on LM models with variation in mutation rat
 e, the effect of LM in replacing CNNs to capture non-local, long-range dep
 endencies and addressing the problem of mode collapse.\n\nSession Format: 
 Recorded\n\nRegistration Category: Workshop Reg Pass
END:VEVENT
END:VCALENDAR