Angewandte
Research Articles
Chemie
on a large amount of data that relate to the task to be
performed (“pre-training”). In the case of CLMs, this is
usually done using large collections of molecules (e.g., on the
order of 200,000 to 1,000,000[9,16,17]). Pre-training enables the
generative model to capture a) the SMILES “syntax” (i.e.,
how alphanumeric characters should be assembled to generate
strings that correspond to valid molecules, Figure 1) and
b) the properties of the pre-training dataset, such as
physicochemical features and synthesizability of the molecules in
the dataset. In the second step, the pre-trained CLM is further
trained (“fine-tuned”) with a smaller set of task-specific
molecules.[13,19,20] During this transfer learning process, the
CLM is biased towards the chemical space of interest, that is,
molecules with desired biological and physicochemical
properties. This ability to learn in a low-data regime (“few-shot”
learning[21,22]) renders CLMs particularly useful for
application to biological targets for which only a few ligands are
known. The fully trained CLM can be used to generate new
molecules in the form of SMILES strings. Such data
generation is performed by predicting one character of
a SMILES string (“token”) at a time, based on all the
previous tokens. Importantly, this process does not require
handcrafted molecule design rules, as CLMs learn solely from
the SMILES strings used for training.
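
The token-by-token view of a SMILES string described above can be illustrated with a minimal sketch. The token pattern, the start and end markers (`^`, `$`), and the function names are assumptions made for this example and are not taken from the published code. The text describes tokens as single characters; grouping `Cl`, `Br`, and bracket atoms into one token is a common convention assumed here.

```python
import re

# Assumed token pattern: bracket atoms, the two-letter halogens Cl and Br,
# two-digit ring closures (%NN), then any other single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

START, END = "^", "$"  # hypothetical start and end-of-string markers


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def next_token_pairs(smiles):
    """Return (preceding tokens, next token) pairs for one SMILES string."""
    tokens = [START] + tokenize(smiles) + [END]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Each pair is one prediction step: all previous tokens -> next token.
for context, target in next_token_pairs("CC(=O)Cl"):
    print("".join(context), "->", target)
```

Each printed line corresponds to one prediction step of the kind a CLM performs during both training and generation.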
Previous prospective applications of CLMs for de novo
molecule generation used so-called “temperature sampling”
to generate large virtual molecular libraries.[9,13,15]
Temperature sampling generates new SMILES strings
by adding tokens to the (growing) string according to the
probabilities learned by the CLM, so that the most likely
token at a given position is sampled most often
(Figure 1b). However, the generated SMILES strings might
not always be chemically meaningful (i.e., some strings are invalid), or
they might not match the feature distribution of the training
data because of the random component of temperature
sampling. Therefore, additional methods are usually needed
to select the most promising designs from the virtual
molecular libraries, e.g., based on the similarity to known
bioactive molecules, external activity prediction, or reward
functions.[9,13,15,23] Here, we use the beam search algorithm as
a model-intrinsic alternative to temperature sampling. This
method enables the CLM to simultaneously generate and
prioritize the molecular designs in an automated fashion,
without employing additional selection methods.[24,25] Beam
search scoring was successfully validated in a prospective
application aiming to generate new retinoic acid-related
orphan receptor (ROR)[26] ligands from scratch.
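
For comparison with beam search, the temperature sampling step mentioned above can be sketched as follows. This is the usual formulation, assumed here rather than taken from the cited work: the model's unnormalized scores (logits) are divided by a temperature T before the softmax, and one token is drawn at random from the resulting distribution. The function names and the example vocabulary are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical logits for three candidate tokens at one position.
vocab = ["C", "N", "O"]
logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    print(temperature, softmax_with_temperature(logits, temperature))
```

Because each token is drawn at random, repeated runs give different strings. This random component is what makes a separate ranking step necessary.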
Results and Discussion

Chemical Language Model and Beam Search Sampling for De Novo Design
We explored the beam search algorithm[33] to generate
molecules from a CLM as a potential alternative to temperature
sampling combined with an external ranking method.
Given the probabilities learned by a CLM, a vast number of
SMILES strings could in theory be sampled. As it is
computationally infeasible to enumerate all outputs, a heuristic
method such as beam search can be used to find the most likely
outputs. Here, our underlying hypothesis was that the
probability for generating a certain SMILES string correlates
with the quality of the corresponding molecule regarding the
implicit design objective as represented in the fine-tuning set
(e.g., desired bioactivity, physicochemical properties). During
molecule generation by beam search sampling, the algorithm
progressively adds tokens to a SMILES string while keeping
track of the k most likely SMILES strings. To add a new
token, the algorithm computes the conditional probability of
each possible token given the tokens in the existing string and
selects the k most likely tokens to extend the string
(Figure 1c). The k most likely extensions are selected with
a scoring function (“beam search score”), which is computed
as the product of the probabilities of the tokens in the string (Figure 1c).
This process is repeated until the SMILES string is completed
(i.e., the “end-of-string” token is added) or a predefined
maximal string length is reached. Thus, beam search can be
used to generate highly probable molecules, as computed by
(i) the underlying model and (ii) the beam search score. The
beam search score allows the de novo designs to be ranked
according to the probability of their SMILES tokens.
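
The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation. `next_token_probs` stands in for a trained CLM and `toy_model` is a hand-written probability table. The end token `$` and the handling of finished strings (they are set aside and no longer occupy a beam slot) are choices made for this example. The score is accumulated as a sum of log-probabilities, which is equivalent to the product of token probabilities but numerically more stable.

```python
import math

END = "$"  # hypothetical end-of-string token


def beam_search(next_token_probs, k=3, max_len=20):
    """Return (string, score) pairs, most probable first.

    next_token_probs(prefix) must return a dict mapping each candidate
    token to its conditional probability given the tuple `prefix`.
    """
    beams = [((), 0.0)]  # (tokens so far, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, log_score in beams:
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append(
                        (tokens + (token,), log_score + math.log(prob))
                    )
        # Keep only the k highest-scoring extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, log_score in candidates[:k]:
            if tokens[-1] == END:
                finished.append((tokens[:-1], log_score))
            else:
                beams.append((tokens, log_score))
        if not beams:
            break
    # Strings cut off at max_len are returned together with finished ones.
    results = finished + beams
    results.sort(key=lambda item: item[1], reverse=True)
    return [("".join(t), math.exp(s)) for t, s in results]


def toy_model(prefix):
    """Hand-written next-token probabilities (not a trained CLM)."""
    if len(prefix) == 0:
        return {"C": 0.6, "N": 0.4}
    if len(prefix) < 3:
        return {"C": 0.5, "O": 0.2, END: 0.3}
    return {END: 1.0}


for smiles, score in beam_search(toy_model, k=3):
    print(smiles, round(score, 4))
```

In this toy table, a beam width of 2 prunes the short string C before its end token is kept, although its score (0.18) exceeds that of the best string returned (CCC, 0.15). A width of 3 recovers it. This illustrates why beam search is a heuristic and not an exhaustive search.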
As a framework to probe beam search sampling, we
employed a recently published CLM based on a recurrent
neural network with long short-term memory cells (LSTM),
which are suited for sequence modeling.[34] The CLM was
trained with the SMILES strings of 365,063 molecules from
ChEMBL[35] to iteratively predict the next token of each
SMILES string given the preceding tokens (Figure 1b). The
training procedure was carried out over ten epochs, meaning
that each molecule used for training was seen by the CLM ten
times. This pre-trained CLM was then fine-tuned by transfer
learning using sets of known ROR ligands (Figure S1, Table S1)
to bias it towards the design objective, namely the generation
of new molecules with bioactivity on RORs.
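
The training objective is not written out in this section. A common choice for next-token prediction, assumed here for illustration, is the categorical cross-entropy (negative log-likelihood) summed over all tokens of all training strings. The symbols are introduced for this sketch only: \theta denotes the model parameters, \mathcal{D} the training set (ChEMBL molecules for pre-training, ROR ligands for fine-tuning), and t_i^{(s)} the i-th token of string s, which has n_s tokens.

```latex
P_\theta(s) = \prod_{i=1}^{n_s} p_\theta\left(t_i^{(s)} \mid t_1^{(s)}, \dots, t_{i-1}^{(s)}\right),
\qquad
\mathcal{L}(\theta) = -\sum_{s \in \mathcal{D}} \sum_{i=1}^{n_s}
  \log p_\theta\left(t_i^{(s)} \mid t_1^{(s)}, \dots, t_{i-1}^{(s)}\right)
```

Under this assumption, the product P_\theta(s) is the same quantity that the beam search score assigns to a completed SMILES string.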
Open-source code for the CLM and the beam search
algorithm, and the data used in this study are available at
RORs were chosen as molecular targets because these
receptor proteins are an attractive but not extensively studied
family of potential drug targets. They constitute a family of
ligand-activated transcription factors that mainly act as
monomers and are involved in the circadian control of energy
homeostasis[27,28] and immune system regulation,[29,30] among
other functions. RORs hold promising pharmacological
potential for various indications, specifically for autoimmune
diseases.[29,30] No ROR ligand has reached drug approval to
date, partially owing to compound-related issues such as poor
aqueous solubility, lack of selectivity, and clinical safety
concerns.[29,31,32]
Application of Beam Search Sampling to Designing Inverse RORγ Agonists
For prospective evaluation, we applied the beam search to
the design of natural product-inspired RORγ ligands. Learning
from natural products as a traditional source of inspiration
for drug discovery[36,37] may hold several advantages over
Angew. Chem. Int. Ed. 2021, 60, 2 – 8
© 2021 The Authors. Angewandte Chemie International Edition published by Wiley-VCH GmbH