The existence of a druggable binding pocket is a prerequisite for computational drug-target interaction studies including virtual screening. Retrospective studies have shown that extended sampling methods like Markov State Modeling and mixed-solvent simulations can identify cryptic pockets relevant for drug discovery. Here, they apply a combination of mixed-solvent molecular dynamics (MD) and time-structure independent component analysis (TICA) to four retrospective case studies: NPC2, the CECR2 bromodomain, TEM-1, and MCL-1. They compare previous experimental and computational findings to their results. It is shown that the successful identification of cryptic pockets depends on the system and the cosolvent probes. They used alternative TICA internal features such as the unbiased backbone coordinates or backbone dihedrals versus biased interatomic distances. They found that in the case of NPC2, TEM-1, and MCL-1, the use of unbiased features is able to identify cryptic pockets, although in the case of the CECR2 bromodomain, more specific features are required to properly capture a pocket opening. In the perspective of virtual screening applications, it is shown how docking studies with the parent ligands depend critically on the conformational state of the targets.
Cryptic pockets correspond to hidden pockets in a protein that are revealed upon ligand binding but are not identified in the crystal structure of an apo (unliganded) protein. Such pockets can be of relevance for drug development through allosteric regulation of an active site distant from the allosteric site, by impeding protein–protein interactions, or through other mechanisms. They have been identified experimentally in various protein targets, relevant for drug discovery, with examples including HIV integrase, K-Ras, and Hsp90. An alternative binding site to the orthosteric site may offer potential benefits such as the ability to modify the protein activity without interrupting binding at a substrate site, the ability to identify entirely new chemical matter capable of binding at a distinct site, options to develop more selective compounds, and possibly alternative functional effects such as allosteric modes of action. Therefore, the challenge is how to find such sites in a prospective manner, and in this regard, computational methods can play an important role. Once identified, these sites can lead to alternative approaches to drug discovery, including improved, more precise, and accurate virtual screening studies.
Workflow corresponding to the described methodology. A series of mixed-solvent simulations are reduced in dimensionality via TICA. The TICA is performed using different features to allow clustering based on sites. Next, the conformational space is clustered into several states which are evaluated for pockets with the pocket detection tool MDpocket.
Nowadays, the identification of these pockets is an important challenge of structure-based drug design. Molecular dynamics (MD) simulations have been one of the go-to tools for the discovery of these sites. Proteins can be simulated for hundreds of nanoseconds or even several microseconds in a conventional setup, but the conformational changes that unveil cryptic pockets tend to happen around the millisecond time scale. For example, a TEM-1 β-lactamase cryptic site was unveiled via MD with a total simulation time of 81 μs, which today is feasible. Another example is the recent SARS-CoV-2 study which resulted in a total of 0.1 s of molecular simulation. Since atomistic MD is limited to integration timesteps at most a few femtoseconds, it is difficult to go into the millisecond time scale and beyond with current computers, where these new conformational states could be visited. Only a few special machines, such as Anton or the use of the Folding@home framework, are able to reach these extensive time scales. According to the transition state theory, the relationship between the time scale of a state transition and the height of an energy barrier is exponential. This means that most conformational events of interest, such as the folding process of a protein or the binding/unbinding of a substrate, often occur at a larger time scale.
Opening of the cryptic pocket in NPC2. A) Representative structures of the “closed” and the “open” conformations overlapped with their MDpocket (orange) output. Yellow structures correspond to the cluster center which is evaluated and shown, while green and salmon structures correspond to the holo (2HKA) and apo (1NEP) NPC2 structures. The ligand bound to the holo structure is shown in cyan. The MDpocket output is shown at an isovalue of 0.5. B) Zoomed-in view of the key interactions regarding the NPC2 pocket opening between Y100 and F66 in the apo (salmon) and holo (green) states and violin plots of the F66 - Y100 distance for the different probes used for mixed-solvent simulations. The horizontal lines correspond to the apo (salmon) and holo (green) PDB values. It can be observed how the use of probes induces a bigger opening of the pocket and for a longer amount of time. C) Molecular docking of the holo ligand yields a much more “concentrated” population around the cryptic binding site once evaluated in an open-like conformation. On the other hand, docking with a “closed” conformation does not find the cryptic pocket. Every gray dot corresponds to an atom of the docked conformations, and the crystallized ligand C3S is depicted in blue.
One way to overcome this issue that has been studied recently is mixed-solvent simulations. In this method, the protein of interest is simulated in a mixture of water and small molecule probes (cosolvents) such as methanol, benzene, propane, imidazole, etc. The competition between the probes and water to interact with different protein regions helps to induce different conformational changes and/or map potential allosteric binding sites. The absence of a cosolvent visiting these conformational states would require a substantial computational cost that is feasible for only a few protein systems, because the alternative conformation may simply not be stable thermodynamically, or the barrier that needs to be crossed in order to form it may be prohibitively high. Indeed, this alternative mapping strategy might be able to unveil cryptic sites that would simply never be discovered by conventional approaches. Different approaches to mixed-solvent MD simulations have been developed and include MixMD, MDMix, SILCS, CAT, and others. One of the main caveats of previous studies is that they tend to run short simulations (10–50 ns) which are often too short to observe larger conformational changes and/or identify buried pockets. Schmidt et al. showed how pockets in their study tend to open after the 100 ns mark. One of the reasons for the use of short MD simulations was concerns regarding phase separation due to clustering of the probes at longer times, but this problem can be overcome thanks to extensive tests concerning the role of cosolvent concentration and through use of repulsive terms between cosolvent molecules in some cases. Short simulations are good for the identification of superficial hotspots, but they are cryptic conformations that present deeper pockets. Furthermore, mixed-solvent methodologies tend to analyze results based on the probe population around the protein rather than pocket formation and/or identification. Although both terms should be related, they learned from previous experience with these methodologies that probes tend to visit terminal and disordered regions, which can lead to false positives.
Opening of the cryptic pocket in CECR2 BrD is only seen when key distances are used as a feature. A) Representative structure of the “closed” and “open” conformations. Colored in gray is the apo crystal structure (3NXB). B) Evaluation of the feature selection for TICA in CECR2 BrD. (Top) Backbone coordinates feature. Although the pocket corresponding to the holo structure (5V84) is identified consistently in the different states, the cryptic binding pocket is not clearly identified. (Below) Feature based on the distance of pairs of key residues. The cryptic pocket can be identified. Yellow structures correspond to the cluster center which is evaluated and shown, while green and salmon structures correspond to the ligand-bound (5V84) and apo (3NXB) structures. The ligand bound to the holo structure is shown in cyan. The MDpocket output (orange) is shown at an isovalue of 0.5. C) Violin plots describing the distance between the pairs L460-S482 and F459-I481 along the trajectories depending on the different features used. In relation to B), it can be observed how a non H-bond forming set of distances can be isolated upon using a distance-based feature for TICA, especially in the case of F459-I491. D) shows the difference from the distances of the previously mentioned residues between the PDB structure and representative structures from backbone coordinates and distance features. Runs with phenol as a mixed solvent are shown for all pictures in this figure unless differently mentioned.
The use of the sampling water interfaces through scaled Hamiltonians (SWISH) approach together with mixed solvents can overcome several of these issues. In SWISH, a Hamiltonian replica exchange approach is used where instead of changing the temperature of each replica, the nonbonded interactions of apolar carbon and sulfur atoms with water oxygen atoms are modified. In combination with hydrophobic probes such as benzene or indole, this methodology can very effectively open cryptic pockets. However, the combination of the cosolvent and of the interaction scaling can be too aggressive and lead to partial denaturing of the protein, especially for longer simulation times. This issue can be prevented with contact-map restraints to maintain a secondary structure enabling longer sampling times up to 1 μs. Another interesting aspect of the SWISH protocol is that all probes studied are modified with repulsive terms to avoid phase separation.
Opening of the cryptic pocket in TEM-1 β-lactamase. A) Representative structures of the “closed” and the “open” conformations overlapped with their MDpocket (orange) output. Yellow structures correspond to the cluster center which is evaluated, while green and salmon structures correspond to the holo (1PZO) and apo (1JWP) TEM-1 structures. The ligand bound to the holo structure is shown in cyan. The MDpocket output is shown at an isovalue of 0.5. B) Molecular docking of the holo ligand yields a much more “concentrated” population around the cryptic binding site once evaluated in an open-like conformation. On the other hand, docking with a “closed” conformation does not seem to find the crpytic pocket. Every gray dot corresponds to an atom of the docked conformations.
The Markov State Model (MSM) approach combined with MD simulations is a promising current methodology for extensive protein sampling. MSMs are kinetic models of the processes under study constructed based on many underlying MD simulations that can address sampling challenges. MSMs represent a paradigm shift in the use of simulations, away from anecdotal single-trajectory approaches to a more comprehensive statistical treatment. Many (tens up to thousands) simulations are performed in parallel, and the process of analyzing the data is handled by the MSM. First, the trajectories are structurally clustered to identify microstates via dimensionality reduction techniques such as time independent component analysis (TICA). All frames of the trajectories are assigned to the different microstates with respect to simulation time. For a given time lag (window separation), transitions between states are counted, delivering an event count matrix from which probabilities of the states can also be generated. Clearly not all states are connected within the specified time lag, hence the matrix is sparse, and convergence must be tested. Increasing the time lag leads to a more complete transition matrix, but the model may then become too coarse. Finally, solving the eigenvalues and eigenvectors of the transition matrix delivers kinetically similar states of the studied system. Once generated, there are multiple approaches for testing the model, and nowadays common approaches use tools to discretize the slow order parameters and on the fly adaptive MSM learning of the sampling space. Although MSMs were originally conceived as a method to study protein folding mechanisms, they have been applied with successful results in a variety of pocket discovery studies. Recently, a hidden transient state in all bromodomain families was identified through the means of an MSM. Furthermore, a series of MSM models of TEM1 β-lactamase identified the cryptic pocket between helices 11 and 12 as well as two additional cryptic pockets that present allosteric coupling to the active site.
However, in most cases, the identification of novel cryptic pockets tends to follow prior knowledge of the target, for instance, in relation to the dynamics of the system or the implication of certain residues in the protein’s allosteric motions (e.g., breaking of hydrogen bonds). In addition, the total simulation time scales needed in order to obtain an equilibrated ensemble tend to vary a great deal from one protein to another, ranging from ensembles of 30 μs to more than 500 μs. In summary, every case for MSM is unique and can lead to an unexpected amount of computer power required, which nowadays very few can achieve.
In this study, their aim is to develop a methodology that can sample the conformational space of a protein of interest in order to identify novel cryptic pockets. At this point, they focus on pocket identification rather than on determining its functionality. They perform unbiased mixed-solvent simulations for a longer time scale than is typical, allowing a more extensive search for cavities and longer protein-probe stability assessment. The ensemble of simulations is then analyzed using TICA methodology from the MSM approach, allowing the identification of different stable macrostates. The TICA analysis is performed using Cartesian backbone coordinates, backbone torsions, and distances between pairs of atoms as features and later divided into several states which makes pocket identification more straightforward than when it is performed for a single simulation replica or for the whole ensemble. They show that this approach is capable of identifying sites that are missed by other methods, and it does not require the use of system-specific features or of modified potentials. An important consideration for cryptic pocket identification in drug discovery is the ability to rank the identified pockets based on their druggability or equivalently to return a greater level of certainty for some pockets than for others.
Otherwise, resources are wasted by in silico and in vitro screening against uncertain or implausible pockets. In this regard, their approach is to use MDpocket as a pocket identification tool, as it is one of the few tools able to analyze an ensemble of structures. Comparison between the results obtained with MDpocket in the different analyzed states helps us to rank the observed pockets as well as to discover transient pockets that would be overlooked with a more classical approach.
Divide and Conquer. Pocket-Opening Mixed-Solvent Simulations in the Perspective of Docking Virtual Screening Applications for Drug Discovery Francesc Sabanés Zariquiey, Edgar Jacoby, Ann Vos, Herman W. T. van Vlijmen, Gary Tresadern, and Jeremy Harvey Journal of Chemical Information and Modeling 2022 62 (3), 533-543 DOI: 10.1021/acs.jcim.1c01164