De Novo Drug Design Using Artificial Intelligence Applied on SARS-CoV-2 Viral Proteins ASYNT-GAN

Computer-assisted de novo design of natural product mimetics offers a viable strategy to reduce synthetic effort and obtain natural-product-inspired bioactive small molecules, but it suffers from several limitations. Deep learning techniques can help address these shortcomings. The authors propose generating synthetic molecule structures that optimize the binding affinity to a target, leveraging important advancements in deep learning. Their approach generalizes to systems beyond the source system and generates complete structures that optimize the binding to a target unseen during training. Translating the input sub-systems into the latent space makes it possible to search for similar structures and to sample from the latent space for generation.

The architecture of the proposed solution. The model consists of an Encoder-Decoder architecture that translates the inputs into the latent space and a Generator that produces the 3D structure of the system. The authors propose a stacked Generator architecture that takes the first output and calculates regions of interest. The regions of interest are used to resample and generate a second output, which is concatenated to the first to produce the prediction of the 3D structure of the molecular system.

An outbreak of the novel coronavirus, SARS-CoV-2, has caused worldwide social and economic disruption. The scientific community has limited knowledge of the molecular details of the infection. Currently, antiviral drugs have not been commonly adopted, and there are no vaccines for prevention.

The time and effort needed to create and market a drug or vaccine that can treat a given infection can span decades and require millions in investment. Throughout the process of drug discovery, one of the main challenges is to identify a molecular structure that can attach itself to a target. The quality of the binding has a direct influence on the side effects and effectiveness of the treatment. Multiple approaches exist to find or create these structures. Once such a structure has been identified, the rest of the compound is built around it.

Approximate nearest neighbor search returns points whose distance from the query is at most c times the distance from the query to its nearest point.
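The definition above can be made concrete with a minimal sketch. It is a brute-force illustration of which points count as acceptable answers for a given c, not an approximate index and not the search used in the paper; the function names, the Euclidean metric, and the toy points are assumptions made for this example.

```python
import math


def nearest_distance(query, points):
    """Exact distance from the query to its nearest point (brute force)."""
    return min(math.dist(query, p) for p in points)


def c_approximate_neighbors(query, points, c):
    """Return every point that is a valid c-approximate nearest neighbor.

    A point qualifies when its distance from the query is at most c times
    the distance from the query to the true nearest point.
    """
    if c < 1:
        raise ValueError("c must be at least 1")
    limit = c * nearest_distance(query, points)
    return [p for p in points if math.dist(query, p) <= limit]


# Toy 2D data (assumed for illustration only).
query = (0.0, 0.0)
points = [(1.0, 0.0), (0.0, 1.5), (3.0, 4.0)]

neighbors = c_approximate_neighbors(query, points, c=2.0)
exact_only = c_approximate_neighbors(query, points, c=1.0)
```

With c = 1 only the exact nearest neighbor qualifies. Larger values of c widen the set of acceptable answers, which is the slack that practical approximate indices exploit to avoid computing every distance.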

Computer-assisted de novo design makes it possible to reduce synthetic effort and obtain natural-product-inspired bioactive molecules. Three types of in silico drug-target interaction (DTI) prediction methods have been proposed in the literature: molecular docking, similarity-based models, and deep learning-based models. Molecular docking is a simulation based on predefined human rules that aims to optimize the conformation of the ligand and the target protein. Currently applied docking methods have shown multiple limitations, in particular unsatisfactory scoring of biological activities. Investigations have shown that the performance of docking approaches depends on the type of receptor and is better on hydrophobic than on hydrophilic pockets. Additionally, in practice, the binding pose rarely carries enough information to reproduce affinities. For this reason, more computationally expensive methods, such as molecular mechanics with generalized Born and surface area continuum solvation (MM/GBSA), need to be applied to re-score compounds during virtual screening.

Proposed machine learning DTI approaches, such as KronRLS and SimBoost, concentrate on binding affinity prediction. The compound and the protein are transformed into their Simplified Molecular Input Line Entry System (SMILES) and sequence representations, respectively, and used to predict the probability of a high affinity score. With this transformation, the model sees the ligands and proteins only in their 1D representations, which results in the loss of all relative information about how the two interact in 3D space. Additionally, the underlying complexity of the protein sequence is completely ignored, as each letter in the sequence is encoded as a single unique id. However, each letter represents an amino acid that is itself a structure of atoms and bonds with a particular 3D conformation and possible interactions among themselves and with the ligand.
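The 1D encoding described above can be sketched as follows. This is an illustration of the general idea only, not the preprocessing used by KronRLS or SimBoost: the helper names, the padding id 0, the fixed length, and the example protein fragment are assumptions, and splitting SMILES character by character is a simplification (real tokenizers treat multi-character atoms such as Cl as one token).

```python
def build_vocabulary(symbols):
    """Map each distinct symbol to a unique integer id (0 is kept for padding)."""
    return {s: i for i, s in enumerate(sorted(set(symbols)), start=1)}


def encode(sequence, vocabulary, length):
    """Turn a string into a fixed-length list of ids, truncating or padding with 0."""
    ids = [vocabulary[ch] for ch in sequence[:length]]
    return ids + [0] * (length - len(ids))


# The 20 standard amino acids in one-letter code.
protein_vocab = build_vocabulary("ACDEFGHIKLMNPQRSTVWY")

protein = "MKTAYIAK"   # hypothetical protein fragment, for illustration only
smiles = "CC(=O)O"     # acetic acid
smiles_vocab = build_vocabulary(smiles)

protein_ids = encode(protein, protein_vocab, length=10)
smiles_ids = encode(smiles, smiles_vocab, length=10)
```

Each amino acid collapses to one integer, so nothing about its atoms, bonds, or position in 3D space reaches the model. This is the information loss the paragraph above points out.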

The regions of interest identified for resampling. The data resampled from those regions is concatenated to the activations from the first generator and used as input by the stacked generator.

Transformer architectures are used to leverage the advances of Natural Language Processing (NLP) and work with the SMILES representations of compounds. In addition to the above-mentioned limitations, this approach requires the compounds to be created manually or searched for in the chemical space. The chemical space has been estimated to be on the order of 10^63 organic compounds of size up to 30 atoms, which implies that iterating over the full space in search of a potential hit is computationally expensive and highly inefficient.
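A back-of-the-envelope calculation shows why exhaustive search is impractical. The size of the chemical space comes from the estimate above; the evaluation rate of one billion compounds per second is an assumption chosen purely for illustration and does not appear in the source.

```python
CHEMICAL_SPACE_SIZE = 10**63            # estimate quoted in the text
ASSUMED_RATE_PER_SECOND = 10**9         # hypothetical screening throughput
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignoring leap years


def years_to_enumerate(space_size, rate_per_second):
    """Years needed to visit every compound once at a constant rate."""
    seconds = space_size / rate_per_second
    return seconds / SECONDS_PER_YEAR


years = years_to_enumerate(CHEMICAL_SPACE_SIZE, ASSUMED_RATE_PER_SECOND)
print(f"{years:.2e} years")
```

Even at this optimistic rate the enumeration would take on the order of 10^46 years, which is why generative approaches that propose candidates directly are attractive.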

Reinforcement Learning techniques have been applied successfully by splitting compounds into meaningful molecule fragments and adding one fragment at a time, with each step scored by human-defined rules. The model, however, does not gain its insights from the training data, but rather from these predefined scoring rules.

The ability to perform vector arithmetic for search and to walk through the latent space.

To address these shortcomings, the authors propose the generation of synthetic small and sophisticated molecule structures that optimize the binding affinity to a target (ASYNT-GAN). To achieve this, they leverage three important achievements in machine learning: attention, deep learning on graphs, and generative adversarial networks. Similar to question answering in NLP, they generate a molecular architecture based on an existing target that functions as context. By exploring the latent space created by the model, they propose a novel way of searching for candidate compounds suitable for binding.
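Searching and walking through a latent space can be sketched with plain vector operations. The following is a toy illustration, not the ASYNT-GAN code: in the paper the latent vectors come from the trained Encoder, whereas here the three-dimensional vectors, the compound names, and the helper functions are all made up for the example, and cosine similarity is an assumed choice of similarity measure.

```python
import math


def add(a, b):
    """Element-wise sum of two latent vectors."""
    return [x + y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between two latent vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def most_similar(query, library):
    """Name of the library entry whose latent vector is closest to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


def interpolate(a, b, steps):
    """Walk in a straight line from a to b in `steps` points (steps >= 2)."""
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * x + t * y for x, y in zip(a, b)])
    return path


# Hypothetical latent codes; real ones would be produced by an encoder.
library = {
    "ligand_a": [1.0, 0.0, 0.0],
    "ligand_b": [0.0, 1.0, 0.0],
    "ligand_c": [0.7, 0.7, 0.0],
}

# Vector arithmetic: combine two codes, then search for the closest entry.
query = add(library["ligand_a"], library["ligand_b"])
best_match = most_similar(query, library)

# Latent walk: intermediate points that a generator could decode in turn.
path = interpolate(library["ligand_a"], library["ligand_b"], steps=3)
```

In a full pipeline each point along the path would be passed to the Generator to obtain a candidate structure, and the similarity search would typically run against a large library using an approximate nearest neighbor index.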

I. Jacobs, M. Maragoudakis, De Novo Drug Design Using Artificial Intelligence Applied on SARS-CoV-2 Viral Proteins ASYNT-GAN, Biochem. 1 (2021) 36–48. doi:10.3390/biochem1010004.