The optimizers package
These are the optimizers of Biobuild. Whenever you want to improve a molecule’s conformation or generate new conformers for a molecule, you will want to use the optimizers.
The Rotatron Environments
Biobuild implements a torsional optimization system. That is, instead of “wiggling” atoms around until an energetically favorable structure is obtained, Biobuild rotates around bonds within a structure to find the most favorable conformation. This is done using the Rotatron environments, which are OpenAI Gym environments that store an evaluation function and simulate rotating a molecule around a given set of bonds by a given set of angles. There is a base “Rotatron” environment and three subclasses thereof that can be used directly for optimization. These are:
The DistanceRotatron environment evaluates conformations based on the pairwise distances between nodes in the optimized graph.
It uses two forces: a global “unfolding” force to maximize spatial separation between nodes, and a local “pushback” force to maximize distances between the closest nodes.
The evaluation is computed as:
There are multiple variations of this basic formulation available (see the functions below).
- class biobuild.optimizers.DistanceRotatron.DistanceRotatron(graph: BaseGraph, rotatable_edges: list = None, radius: float = 20, pushback: float = 3, unfold: float = 2, clash_distance: float = 0.9, crop_nodes_further_than: float = -1, n_smallest: int = 10, concatenation_function: callable = None, bounds: tuple = (-3.141592653589793, 3.141592653589793), **kwargs)[source]#
Bases: Rotatron
A distance-based Rotatron environment.
- Parameters:
graph (AtomGraph or ResidueGraph) – The graph to optimize
rotatable_edges (list) – A list of edges that can be rotated during optimization. If None, all non-locked edges are used.
radius (float) – The radius around rotatable edges to include in the distance calculation. Set to -1 to disable.
pushback (float) – Short distances between atoms are given higher weight in the evaluation using this factor.
unfold (float) – The exponent to use when computing the mean distance to others for each node. Higher values give more weight to global unfolding of the graph.
clash_distance (float) – The distance at which atoms are considered to be clashing.
crop_nodes_further_than (float) – Nodes that are further away than this factor times the radius from any rotatable edge at the beginning of the optimization are removed from the graph and not considered during optimization. This speeds up computation. Set to -1 to disable.
n_smallest (int) – The number of smallest distances to use when computing the evaluation for each node.
concatenation_function (callable) – A custom function to use when computing the evaluation for each node. This function should take the environment (self) as first argument and a 1D array of pairwise-distances from one node to all others as second argument and return a scalar.
bounds (tuple) – The bounds for the minimal and maximal rotation angles.
- eval(state)[source]#
Calculate the evaluation score for a given state
- Parameters:
state (np.ndarray) – The state of the environment
- Returns:
The evaluation for the state
- Return type:
float
- biobuild.optimizers.DistanceRotatron.concatenation_function_linear(self, x)[source]#
A concatenation function that computes the evaluation as:
Mean distance * unfold + (mean of n smallest distances) * pushback
- biobuild.optimizers.DistanceRotatron.concatenation_function_no_pushback(self, x)[source]#
A concatenation function that computes the evaluation as:
Mean distance ** unfold
- biobuild.optimizers.DistanceRotatron.concatenation_function_no_unfold(self, x)[source]#
A concatenation function that computes the evaluation as:
Mean distance + pushback * mean of n smallest distances
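The three formulas above can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not Biobuild's implementation: the real functions take the environment (self) and a 1D distance array, whereas the hypothetical helpers below receive `unfold`, `pushback` and `n_smallest` as explicit arguments, and whether a node's distance to itself is part of the array is not specified here.

```python
from statistics import mean


def linear(distances, unfold=2.0, pushback=3.0, n_smallest=10):
    """Mean distance * unfold + (mean of n smallest distances) * pushback."""
    smallest = sorted(distances)[:n_smallest]
    return mean(distances) * unfold + mean(smallest) * pushback


def no_pushback(distances, unfold=2.0):
    """Mean distance ** unfold (global unfolding term only)."""
    return mean(distances) ** unfold


def no_unfold(distances, pushback=3.0, n_smallest=10):
    """Mean distance + pushback * mean of n smallest distances."""
    smallest = sorted(distances)[:n_smallest]
    return mean(distances) + pushback * mean(smallest)


# Distances from one node to four others (arbitrary illustrative values)
distances = [1.0, 2.0, 3.0, 6.0]

score_linear = linear(distances, unfold=2.0, pushback=3.0, n_smallest=2)
score_no_pushback = no_pushback(distances, unfold=2.0)
score_no_unfold = no_unfold(distances, pushback=3.0, n_smallest=2)
```

A custom function with the same shape (environment first, distance array second, scalar result) could be supplied through the concatenation_function parameter described above.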
The OverlapRotatron is an environment that approximates molecular graphs using multivariate Gaussian distributions. The overlap between the distributions is used as the evaluation function for the environment. Hence, this environment tries to minimize the overlap between distributions in order to find favorable conformations.
As a measure for the overlap between two distributions, the Jensen-Shannon divergence is used by default. Custom overlap functions can be passed to the environment.
- biobuild.optimizers.OverlapRotatron.MVN(points)[source]#
Compute a multi-variate normal distribution for a given set of points.
- Parameters:
points (np.ndarray) – The points to compute the mean and covariance matrix for.
- Returns:
mvn – The multi-variate normal distribution for the points.
- Return type:
scipy.stats.multivariate_normal
- class biobuild.optimizers.OverlapRotatron.OverlapRotatron(graph: BaseGraph, rotatable_edges: list = None, clash_distance: float = 0.9, crop_nodes_further_than: float = -1, distance_function: callable = None, ignore_further_than: float = -1, bounds: tuple = (-3.141592653589793, 3.141592653589793))[source]#
Bases: Rotatron
A distribution overlap-based Rotatron environment.
- Parameters:
graph (AtomGraph or ResidueGraph) – The graph to optimize
rotatable_edges (list) – A list of edges that can be rotated during optimization. If None, all non-locked edges are used.
clash_distance (float) – The distance at which two atoms are considered to be clashing.
crop_nodes_further_than (float) – If greater than 0, crop nodes that are further than this distance from the rotatable edges so that they are not considered in the overlap calculation.
distance_function (callable) – A specific distance function to use for calculating the overlap. This function should take two arrays of shape (1, 3) (centers) and two arrays of shape (3, 3) (covariances) and return a scalar.
ignore_further_than (float) – If greater than 0, centroids that are further than this distance from each other are evaluated as 0 overlap automatically.
bounds (tuple) – The bounds for the minimal and maximal rotation angles.
- biobuild.optimizers.OverlapRotatron.jensen_shannon_overlap(mvn1, mvn2)[source]#
Compute the overlap between two gaussians using the Jensen-Shannon divergence.
- Parameters:
mvn1 (scipy.stats.multivariate_normal) – The first of the two Gaussians to compute the overlap for.
mvn2 (scipy.stats.multivariate_normal) – The second of the two Gaussians to compute the overlap for.
- Returns:
overlap – The overlap between the two gaussians.
- Return type:
float
The ForceFieldRotatron is a rotatron that uses RDKit’s MMFF94 force field to evaluate a given state. Consequently, this environment can only function if RDKit is installed.
Note
Because this environment uses an actual energy function to evaluate states, this environment performs very poorly with ResidueGraph inputs! ResidueGraphs are abstractions without a valid chemical structure. Consequently, even though this environment can be used with ResidueGraphs, it is not recommended.
- class biobuild.optimizers.ForceFieldRotatron.ForceFieldRotatron(graph: BaseGraph, rotatable_edges: list = None, clash_distance: float = 0.9, crop_nodes_further_than: float = -1, mmff_variant: str = 'mmff94', bounds: tuple = (-3.141592653589793, 3.141592653589793))[source]#
Bases: Rotatron
A force field based rotatron. This rotatron uses RDKit’s MMFF94 force field to evaluate the energy of a given state.
- Parameters:
graph (AtomGraph) – The graph to optimize
rotatable_edges (list) – A list of edges that can be rotated during optimization. If None, all non-locked edges are used.
clash_distance (float) – The distance at which two atoms are considered to be clashing.
crop_nodes_further_than (float) – If greater than 0, crop nodes that are further than this distance from the rotatable edges so that they are not considered in the energy evaluation.
mmff_variant (str) – The MMFF variant to use. Can be one of “mmff94”, “mmff94s”, “uff”, “mmff94splus”
bounds (tuple) – The bounds for the minimal and maximal rotation angles.
This is the basic Rotatron environment. It provides the basic functionality for preprocessing a graph into numpy arrays, masking rotatable edges, and evaluating a possible solution. All other Rotatron environments inherit from this class.
- class biobuild.optimizers.Rotatron.Rotatron(graph: BaseGraph, rotatable_edges: list = None)[source]#
Bases: Env
The base class for rotational optimization environments.
- Parameters:
graph (AtomGraph or ResidueGraph) – The graph to optimize
rotatable_edges (list) – A list of edges that can be rotated during optimization. If None, all non-locked edges are used.
- property best#
The best state that the environment has seen, the action that led there, and its evaluation
- eval(state)[source]#
Calculate the evaluation score for a given state
- Parameters:
state (np.ndarray) – The state of the environment
- Returns:
The evaluation for the state
- Return type:
float
- is_done(state)[source]#
Check whether the environment is done
- Parameters:
state (np.ndarray) – The state of the environment
- Returns:
Whether the environment is done
- Return type:
bool
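The elementary operation behind every Rotatron environment is rotating part of a structure around a bond by some angle. A minimal, standard-library sketch of that operation using Rodrigues' rotation formula is given below. The function name and argument layout are hypothetical and chosen for illustration only. The actual environments preprocess the graph into numpy arrays and mask which nodes move for each rotatable edge, which is not shown here.

```python
import math


def rotate_about_bond(point, a, b, angle):
    """Rotate `point` by `angle` (radians) about the axis running from
    atom position `a` to atom position `b` (the bond)."""
    # Unit vector along the bond
    axis = [b[i] - a[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in axis))
    k = [c / norm for c in axis]
    # Express the point relative to a point on the axis
    v = [point[i] - a[i] for i in range(3)]
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    cross = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    dot = sum(k[i] * v[i] for i in range(3))
    # Rodrigues: v*cos(t) + (k x v)*sin(t) + k*(k.v)*(1 - cos(t))
    rotated = [
        v[i] * cos_t + cross[i] * sin_t + k[i] * dot * (1.0 - cos_t)
        for i in range(3)
    ]
    # Shift back into the original coordinate frame
    return [rotated[i] + a[i] for i in range(3)]


# Quarter turn of a point around a bond lying on the z-axis
moved = rotate_about_bond([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], math.pi / 2)
# A point on the bond axis itself does not move
fixed = rotate_about_bond([0.0, 0.0, 5.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], 1.234)
```

A candidate solution in this picture is simply one angle per rotatable edge, which is why the bounds parameter of the environments defaults to the interval from -π to π.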
Optimization algorithms
The Rotatron environments are used to specify the problems to solve. The optimization algorithms are used to solve them. Biobuild implements a number of classical optimization algorithms that are tailored to work with the Rotatron environments. These are:
Particle Swarm Optimization
Particle Swarm Optimization is a classical optimization algorithm based on the behavior of a swarm of particles. Each particle has a position and a velocity. The position is a candidate solution to the problem, and the velocity is the direction in which the particle is moving. The particles are attracted to the best solution found so far and repelled by the worst solution found so far. This way, the swarm moves towards the best known solution while being less likely to get stuck in local minima.
The algorithm performs well with both small and large inputs, both with AtomGraphs and ResidueGraphs. It is also often the fastest to compute, so it is the default algorithm.
- biobuild.optimizers.algorithms.swarm_optimize(env, n_particles: int = None, max_steps: int = 30, stop_if_done: bool = True, threshold: float = 1e-06, w: float = 0.9, c1: float = 0.5, c2: float = 0.3, cooldown_rate: float = 0.99, n_best: int = 1)[source]#
Optimize a rotatron environment through a simple particle swarm optimization.
- Parameters:
env (biobuild.optimizers.environments.Rotatron) – The environment to optimize
n_particles (int, optional) – The number of particles to use. Set this to None in order to compute the number of particles based on the number of rotatable edges in the environment.
max_steps (int, optional) – The maximum number of steps to take.
stop_if_done (bool, optional) – Stop the optimization if the environment signals it is done or the solutions have converged.
threshold (float, optional) – A threshold to use for convergence of the best solution found. The algorithm will stop if the variation of the best solution evaluation history is less than this threshold.
w (float, optional) – The inertia parameter for the particle swarm optimization.
c1 (float, optional) – The cognitive parameter for the particle swarm optimization.
c2 (float, optional) – The social parameter for the particle swarm optimization.
cooldown_rate (float, optional) – The rate at which the inertia parameter is reduced. The inertia parameter is reduced by this factor every generation. E.g. 0.95 will reduce the inertia parameter by 5% every generation.
n_best (int, optional) – The number of best solutions to return at the end of the optimization.
- Returns:
The solution and evaluation for the solution
- Return type:
solution, evaluation
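How the parameters w, c1, c2 and cooldown_rate interact can be illustrated with a textbook particle swarm written against a plain function instead of a Rotatron environment. This is a sketch under assumptions, not the code behind swarm_optimize: it assumes that lower evaluations are better, it uses only the standard attraction towards personal and global bests (no repulsion term), and the name `swarm_minimize` and the toy `evaluate` function are made up for the example.

```python
import math
import random


def swarm_minimize(evaluate, n_dims, n_particles=20, max_steps=30,
                   w=0.9, c1=0.5, c2=0.3, cooldown_rate=0.99,
                   bounds=(-math.pi, math.pi), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(n_dims)] for _ in range(n_particles)]
    vel = [[0.0] * n_dims for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # personal best positions
    best_val = [evaluate(p) for p in pos]     # personal best evaluations
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best
    history = [g_val]
    for _ in range(max_steps):
        for i in range(n_particles):
            for d in range(n_dims):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]                                 # inertia
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])     # cognitive
                             + c2 * r2 * (g_pos[d] - pos[i][d]))          # social
                # Keep the angle inside the allowed bounds
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = evaluate(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
        w *= cooldown_rate  # reduce inertia every generation
        history.append(g_val)
    return g_pos, g_val, history


# Toy evaluation: periodic in each angle, minimal at the target angles
target = [1.0, -2.0]


def evaluate(angles):
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, target))


solution, evaluation, history = swarm_minimize(evaluate, n_dims=2)
```

Because the global best is only replaced by strictly better candidates, the recorded history never gets worse, which is the property a convergence threshold such as the one described above can be checked against.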
Genetic Algorithm
The Genetic Algorithm is one of the most iconic optimization algorithms. It is based on the behavior of a population of individuals. Each individual has a “genome”, which is the current solution to the problem. Each generation (optimization round) individuals are mutated (randomly change their solution), and the best individuals reproduce and make it to the next round. This way, the population will move gradually towards good solutions.
The algorithm performs well on any scale but gets exceedingly slower the larger the molecules become. Also, it works slightly better with AtomGraphs than with ResidueGraphs.
- biobuild.optimizers.algorithms.genetic_optimize(env, max_generations: int = 500.0, stop_if_done: bool = True, threshold: float = 1e-06, variation: float = 0.2, population_size: int = 50, parents: int | float = 0.25, children: int | float = 0.3, mutants: int | float = 0.3, newcomers: int | float = 0.15, variation_cooldown: float = 1, n_best: int = 1)[source]#
A simple genetic algorithm for optimizing a Rotatron environment.
- Parameters:
env (biobuild.optimizers.environments.Rotatron) – The environment to optimize
max_generations (int, optional) – The maximum number of steps to take.
stop_if_done (bool, optional) – Stop the optimization if the environment signals it is done or the solutions have converged.
threshold (float, optional) – A threshold to use for convergence of the best solution found. The algorithm will stop if the variation of the best solution evaluation history is less than this threshold.
variation (float, optional) – The variation to use for the initial action.
population_size (int, optional) – The size of the population.
parents (int or float, optional) – The number or fraction of parents (elites) to select. The parents are selected from the best solutions. Parents produce offspring and pass to the next generation.
children (int or float, optional) – The number or fraction of children to generate from the parents. Children are generated by averaging the parents and adding some noise.
mutants (int or float, optional) – The number or fraction of mutants to generate. Mutants are generated by adding noise to parents, thereby generating aberrant clones.
newcomers (int or float, optional) – Newcomers are entirely new solution candidates.
variation_cooldown (float, optional) – The rate at which the variation is reduced. The variation is reduced by this factor every generation. E.g. 0.95 will reduce the variation by 5% every generation.
n_best (int, optional) – The number of best solutions to return at the end of the optimization.
- Returns:
The angle(s) and evaluation(s) of the best solution(s) found.
- Return type:
solution, evaluation
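The roles of parents, children, mutants and newcomers described above can be sketched as a small generational loop. The following is an illustrative stand-in rather than the implementation of genetic_optimize: it assumes lower evaluations are better, accepts only fractions (not absolute counts) for the group sizes, fills the remainder of the population with newcomers, and uses Gaussian noise for variation. The function name and the toy `evaluate` function are hypothetical.

```python
import math
import random


def genetic_minimize(evaluate, n_dims, max_generations=100, population_size=50,
                     parents=0.25, children=0.3, mutants=0.3,
                     variation=0.2, variation_cooldown=1.0,
                     bounds=(-math.pi, math.pi), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    n_parents = max(2, int(parents * population_size))
    n_children = int(children * population_size)
    n_mutants = int(mutants * population_size)
    n_newcomers = population_size - n_parents - n_children - n_mutants

    def clip(x):
        return min(hi, max(lo, x))

    def random_solution():
        return [rng.uniform(lo, hi) for _ in range(n_dims)]

    population = [random_solution() for _ in range(population_size)]
    history = []
    for _generation in range(max_generations):
        population.sort(key=evaluate)
        elites = population[:n_parents]          # parents survive unchanged
        history.append(evaluate(elites[0]))
        next_gen = [e[:] for e in elites]
        for _ in range(n_children):              # average two parents, add noise
            a, b = rng.sample(elites, 2)
            next_gen.append([clip((x + y) / 2 + rng.gauss(0.0, variation))
                             for x, y in zip(a, b)])
        for _ in range(n_mutants):               # noisy clone of one parent
            parent = rng.choice(elites)
            next_gen.append([clip(x + rng.gauss(0.0, variation)) for x in parent])
        # Newcomers are entirely new random candidates
        next_gen.extend(random_solution() for _ in range(n_newcomers))
        population = next_gen
        variation *= variation_cooldown          # optionally shrink the noise
    best = min(population, key=evaluate)
    return best, evaluate(best), history


# Toy evaluation: periodic in each angle, minimal at the target angles
target = [1.0, -2.0]


def evaluate(angles):
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, target))


solution, evaluation, history = genetic_minimize(evaluate, n_dims=2, max_generations=40)
```

Since the parents are carried over unchanged, the best evaluation per generation cannot get worse. The sketch re-evaluates candidates while sorting, which is wasteful for expensive evaluation functions and would normally be cached.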
Simulated Annealing
Simulated Annealing is another optimization algorithm that has similarities to both genetic and particle swarm optimization. It explores solutions by randomly changing the current one, and accepts or rejects the new solution based on the change in energy. The algorithm is based on the annealing process in metallurgy, where a metal is heated and then slowly cooled down. This way, the metal will settle in a more stable state.
The algorithm performs better with smaller inputs but is also suitable for larger ones, with both AtomGraphs and ResidueGraphs.
- biobuild.optimizers.algorithms.anneal_optimize(env, n_particles: int = None, max_steps: int = 100, stop_if_done: bool = True, threshold: float = 1e-06, variance: float = 0.3, cooldown_rate: float = 0.98, n_best: int = 1)[source]#
Optimize a rotatron environment through a simple simulated annealing.
- Parameters:
env (biobuild.optimizers.environments.Rotatron) – The environment to optimize
n_particles (int, optional) – The number of particles to use. Set to None in order to compute the number of particles based on the number of rotatable edges in the environment, where one particle is used per two rotatable edges.
max_steps (int, optional) – The maximum number of steps to take.
stop_if_done (bool, optional) – Stop the optimization if the environment signals it is done or the solutions have converged.
threshold (float, optional) – A threshold to use for convergence of the best solution found. The algorithm will stop if the variation of the best solution evaluation history is less than this threshold.
variance (float, optional) – The variation to use for updating particle positions.
n_best (int, optional) – The number of best solutions to return at the end of the optimization.
- Returns:
The solution and evaluation for the solution
- Return type:
solution, evaluation
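The accept-or-reject rule at the heart of simulated annealing can be sketched as follows. This is a generic Metropolis-style version and makes several assumptions that the documentation above does not confirm for anneal_optimize: lower evaluations are better, each particle is an independent walker, the `variance` argument is used as the standard deviation of the random step, and cooldown_rate scales a hypothetical `temperature` parameter.

```python
import math
import random


def anneal_minimize(evaluate, n_dims, n_particles=4, max_steps=100,
                    variance=0.3, cooldown_rate=0.98, temperature=1.0,
                    bounds=(-math.pi, math.pi), seed=0):
    rng = random.Random(seed)
    lo, hi = bounds
    particles = [[rng.uniform(lo, hi) for _ in range(n_dims)]
                 for _ in range(n_particles)]
    values = [evaluate(p) for p in particles]
    i_best = min(range(n_particles), key=values.__getitem__)
    best, best_val = particles[i_best][:], values[i_best]
    history = [best_val]
    for _ in range(max_steps):
        for i in range(n_particles):
            # Propose a random change of the current solution
            candidate = [min(hi, max(lo, x + rng.gauss(0.0, variance)))
                         for x in particles[i]]
            cand_val = evaluate(candidate)
            delta = cand_val - values[i]
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                particles[i], values[i] = candidate, cand_val
                if cand_val < best_val:
                    best, best_val = candidate[:], cand_val
        temperature *= cooldown_rate
        history.append(best_val)
    return best, best_val, history


# Toy evaluation: periodic in each angle, minimal at the target angles
target = [1.0, -2.0]


def evaluate(angles):
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, target))


solution, evaluation, history = anneal_minimize(evaluate, n_dims=2)
```

Early on, the high temperature lets walkers leave poor regions by accepting worse moves. As the temperature decays, the search becomes increasingly greedy around the solutions found so far.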
Gradient-based algorithms
We implement a direct link to scipy.optimize.minimize, which provides a number of gradient-based optimization algorithms. These algorithms are usually very fast and perform well on small inputs. However, because the evaluation landscapes of larger molecules tend to be “rugged”, gradient-based methods tend to struggle with larger inputs.
Any algorithm implemented by scipy.optimize.minimize can be used. The default is the L-BFGS-B algorithm. For a complete list of available algorithms, see the SciPy documentation.
- biobuild.optimizers.algorithms.scipy_optimize(env, steps: int = 100000.0, method: str = 'L-BFGS-B', **kws)[source]#
Optimize a Rotatron environment through a simple scipy optimization
- Parameters:
env (biobuild.optimizers.environments.Rotatron) – The environment to optimize
steps (int, optional) – The number of steps to take.
method (str, optional) – The optimizer to use, by default “L-BFGS-B”. This can be any optimizer from scipy.optimize.minimize
kws (dict, optional) – Keyword arguments to pass as options to the optimizer
- Returns:
The angle(s) and evaluation(s) of the best solution(s) found.
- Return type:
solution, evaluation
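For orientation, this is roughly what a direct call to scipy.optimize.minimize with the default L-BFGS-B method looks like on a toy angle problem. The example does not use Biobuild at all: the `evaluate` function and the `target` angles are stand-ins for an environment's evaluation, and passing a step limit through the `options` dictionary is an assumption about how such a wrapper might forward its arguments.

```python
import numpy as np
from scipy.optimize import minimize

# Toy evaluation: periodic in each angle, minimal at the target angles
target = np.array([1.0, -2.0])


def evaluate(angles):
    return float(np.sum(1.0 - np.cos(angles - target)))


x0 = np.zeros(2)  # start with all rotation angles at zero
result = minimize(
    evaluate,
    x0,
    method="L-BFGS-B",
    bounds=[(-np.pi, np.pi)] * 2,   # same angle bounds the environments default to
    options={"maxiter": 1000},      # solver options, analogous to extra keyword arguments
)
solution, evaluation = result.x, result.fun
```

On this smooth two-angle landscape the solver converges quickly from the chosen start. On rugged landscapes the result depends strongly on the starting point, which is the limitation noted above for larger molecules.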
Optimization utilities
Biobuild also implements a number of utilities that can be used to make the optimization a little easier for the user by automating certain steps.
This module contains utility functions for the optimizers.
- biobuild.optimizers.utils.apply_solution(sol: ndarray, env: Rotatron.Rotatron, mol: Molecule.Molecule) Molecule.Molecule[source]#
Apply a solution to a Molecule object.
- biobuild.optimizers.utils.auto_algorithm(mol)[source]#
Decide which algorithm to use for a quick-optimize based on the molecule size.
- biobuild.optimizers.utils.optimize(mol: Molecule.Molecule, env: Rotatron.Rotatron = None, algorithm: str | callable = None, **kwargs) Molecule.Molecule[source]#
Quickly optimize a molecule using a specific algorithm.
Note
This is a convenience function that will automatically create an environment and determine edges. However, that means that the environment will be created from scratch every time this function is called. Also, the environment will likely not be tailored to any specific requirements of the situation. For better performance and control, it is recommended to create an environment manually and supply it to the function using the env argument.
- Parameters:
mol (Molecule) – The molecule to optimize. This molecule will be modified in-place.
env (Rotatron, optional) – The environment to use. This needs to be a Rotatron instance that is fully set up and ready to use.
algorithm (str or callable, optional) – The algorithm to use. If not provided, an algorithm is automatically determined, depending on the molecule size. If provided, this can be:
- “genetic”: A genetic algorithm
- “swarm”: A particle swarm optimization algorithm
- “anneal”: A simulated annealing algorithm
- “scipy”: A gradient descent algorithm (default scipy implementation, can be changed using a ‘method’ keyword argument)
- “rdkit”: A force field based optimization using RDKit (if installed)
- or some other callable that takes an environment as its first argument
**kwargs – Additional keyword arguments to pass to the algorithm
- Returns:
The optimized molecule
- Return type: