Source code for biobuild.core.Molecule

"""
The `Molecule` class is a wrapper around a biopython structure and a core part 
of biobuild functionality. It provides a convenient interface to molecular structures
and their properties, such as atoms, bonds, residues, chains, etc. 

.. note:: 

    To help with identifying individual atoms, residues, etc. biobuild uses a different identification scheme than biopython does.
    Therefore biobuild comes with its own child classes of the biopython classes
    that are used to represent the structure. These classes are called `Atom`, `Residue`, `Chain`, etc.
    and can be used as drop-in replacements for the biopython classes and should not break any existing code.
    However, in case any incompatibility is observed anyway, the classes are equipped with a `to_biopython` method
    that will remove the biobuild overhead and return pure biopython objects (this is not an in-place operation, however,
    and will return a new object). 


Making Molecules
================

The easiest way to make a new molecule is to use the toplevel `molecule` function, which will automatically
try to detect the type of user provided input and generate a molecule from it. Currently supported inputs are:
- A biopython structure
- A PDB id
- A PDB file
- A CIF file
- A SMILES string
- An InChI string
- An IUPAC name or abbreviation, or any name that matches a known compound synonym that is associated with the PubChem database

.. code-block:: python

    from biobuild import molecule

    my_glucose = molecule("GLC") # use the PDB id

    # or
    my_glucose = molecule("GLC.pdb") # use the PDB file

    # or
    my_glucose = molecule("alpha-d-glucose") # use the name

    # ...

Since the `molecule` function is a try-and-error function it is convenient but not the most efficient. Hence, the `Molecule` class
offers already a number of convenient methods to easily generate molecules directly from specified data sources. Available methods are:

- `Molecule.from_pdb` to generate a molecule from a PDB file
- `Molecule.from_cif` to generate a molecule from a CIF file
- `Molecule.from_smiles` to generate a molecule from a SMILES string
- `Molecule.from_pubchem` to generate a molecule from a PubChem entry
- `Molecule.from_compound` to generate a molecule from a PDBECompounds entry
- `Molecule.from_rdkit` to generate a molecule from an RDKit molecule object

Hence, if we know that "glucose" is already available in our local PDBECompounds database, we can generate the molecule also as follows:

.. code-block:: python

    from biobuild import Molecule

    my_glucose = Molecule.from_compound("GLC") # use the PDB id


The quickest way to query the local PDBECompounds database is to use the `PDB Id` of the desired compounds.
However, the `from_compound` accepts other inputs as well. The database is queried using the `by` parameter, which can be one of the following:
- "id" for the PDB id (default)
- "name" for the name of the compound (must match any known synonym of the iupac name)
- "formula" for the chemical formula (usually ambiguous and will therefore often raise an error)
- "smiles" for the SMILES string (also accepts InChI)

.. code-block:: python

    # create a new glucose molecule
    glc = Molecule.from_compound("alpha-d-glucose", by="name") # use the name


Saving Molecules
----------------
If a molecule is created from a large PDB file and has undergone a lot of preprocessing, it may be useful to `save` the molecule to a pickle file
which can be loaded again later to avoid repeated preprocessing. This can be done using the `save` and `load` methods.

.. code-block:: python

    # save the molecule to a pickle file
    my_molecule.save("my_molecule.pkl")

    # load the molecule from the pickle file
    my_molecule = Molecule.load("my_molecule.pkl")

    

Modifying Molecules
===================

Once a molecule is created, it can be modified in a number of ways. The most common modifications are
- adding bonds (because when loading from a PDB file, bonds are not inferred automatically!)
- adding additional residues and atoms
- removing residues and atoms
- adjusting labelling (e.g. changing the chain names and residue seqids)

Adding bonds
------------

To add a bond between two atoms, we use the `add_bond` method, which expects two arguments for the two connected atoms.
The atoms can be specified by their `full_id` tuple, their `id` string, their `serial_number` (always starting at 1) or directly (the `biopython.Atom` object).

.. code-block:: python

    glc = Molecule.from_compound("GLC")

    # add a bond between the first and second atom
    glc.add_bond(1, 2)

    # and also add a bond between "O1" and "HO1" atoms
    glc.add_bond("O1", "HO1")


Already for small molecules such as glucose with only 24 atoms, it would be very tedious to add all bonds manually. Good thing that the molecules
created using `from_compound` or `from_pdb` already contain all the default bonds!

However, in case the bonds are missing, or the PDB file did not specify any to begin with, the `Molecule` class offers two methods: `apply_standard_bonds` and `infer_bonds`. 
The former uses reference connectivity information from the PDBECompounds database or CHARMMTopology to add all bonds that are known for the compound (if it exists in the database).
The latter will use a simple distance-based approach to infer bonds between atoms that are closer than a specified threshold (default: 1.6 Å), which can be restricted
further to a min-max window.

.. code-block:: python

    # add all standard bonds for Glucose
    glc.apply_standard_bonds()

    # add all bonds that are closer than 1.6 Å
    glc.infer_bonds(bond_length=1.6)

    # add all bonds that are closer than 1.6 Å, but not closer than 1.0 Å
    glc.infer_bonds(bond_length=(1.0, 1.6))


.. note::

    By default `infer_bonds` will not attempt to add bonds between atoms that belong to different residues.
    This is because in cases of suboptimal conformations or in very large structures atoms that are close in space may not be connected in the structure.
    To override this behaviour, set the `restrict_residues` parameter to `False`.

    .. code-block:: python

        # add all bonds that are closer than 1.6 Å, even if they belong to different residues
        glc.infer_bonds(bond_length=1.6, restrict_residues=False)

        
Residue Connections
-------------------

Alternatively, instead of `infer_bonds` one may use `infer_residue_connections` to get bonds that connect different residues.
This method will infer all bonds between atoms from different residues based on the distance between them. The inferred bonds are
saved in the `Molecule` and also returned in a `list`. If the optional argument `triplet` is set to `True`, the methodd will also
return the bonds immediately adjacent to the inferred bonds.

Take the following example of a molecule with two residues `A` and `B` that are connected by a bond between `OA` and `(1)CB`:

.. code-block:: text    

     connection -->  OA     OB --- H
                    /  \\   /
     (1)CA --- (2)CA   (1)CB 
        /         \\        \\
    (6)CA         (3)CA    (2)CB --- (3)CB
        \\         /
     (5)CA --- (4)CA
 
     
If `triplet=False` the method will only return the bond between `OA` and `(1)CB`. However, if `triplet=True` it will also return the bond
between `(2)CA` and `OA` - thus forming a triplet of atoms `(2)CA`, `OA` and `(1)CB` that connect the two residues A and B.

.. code-block:: python

    # infer all bonds between atoms from different residues
    >>> glc.infer_residue_connections(triplet=False)
    [(OA, (1)CB)]
    >>> glc.infer_residue_connections(triplet=True)
    [(OA, (1)CB), ((2)CA, OA)]
    


Adding residues and atoms
-------------------------

To add one or more new residue(s), we use the `add_residues` method, which expects a number `biobuild.Residue` objects as unnamed arguments.
Similarly, to add one or more new atom(s), we use the `add_atoms` method, which expects a number of `biobuild.Atom` objects as unnamed arguments.
Both methods allow to specify the parent object (`chain` or `residue`) via an optional argument and will automatically choose the last
chain or residue if none is specified.

.. code-block:: python

    from biobuild import Residue, Atom

    new_residue = Residue("XYZ", 1, " ")
    
    # do things with the residue here
    # ...

    # add the residue to the molecule
    # (add it to the last chain, whichever that may be)
    glc.add_residues(new_residue)

    new_atom = Atom("X", [0, 0, 0])

    # add the atom to the first residue in the `glc` molecule
    glc.add_atoms(new_atom, residue=1)

Removing residues and atoms
---------------------------

In order to remove residues or atoms or bonds, we can use the `remove_residues`, `remove_atoms` and `remove_bond`(yes, singluar!) methods.
They work exactly like their `add_` counterparts, but instead of adding, they remove the specified objects.

.. code-block:: python

    # remove the first residue
    glc.remove_residues(1)

    # remove the first atom
    glc.remove_atoms(1)

    # remove the bond between the first and second atom
    glc.remove_bond(1, 2)


.. warning::

    When adding and removing atoms and residues, the `Molecule` object will not automatically update the `biopython.Structure` object as well as its internal
    connectivity graph (the `AtomGraph`). However, if the user chooses to edit the biopython structure directly, the `AtomGraph` will **not** be updated automatically!
    In this case, the user must call the `update_atom_graph` method to update the `AtomGraph` manually. 

    .. code-block:: python

        # add a new residue to the molecule
        glc.add_residues(new_residue)

        # now add atoms into the residue directly instead of via `add_atoms`
        new_residue.add(new_atom)
    
        # now update the graph
        glc.update_atom_graph()


Adjusting labelling
-------------------

Single-residue molecules that were loaded from a PDB file may not use the same atom labelling as the `PDBE` and `CHARMM` databases.
In order to quickly adjust the labelling, a method `autolabel` exists. `autolabel` uses the atom connectivity and a rule-based algorithm to infer 
the most likely atom labels. However, since this method is not perfect, it is recommended to check the labelling afterwards and adjust it manually if necessary.

If working with large molecules that follow another labelling scheme it may be more efficient to simply
define your own linkage recipies or patches (see the documentation of `linkages`) that use the the appropriate labelling scheme.

.. code-block:: python

    # load a molecule from a PDB file
    glc = Molecule.from_pdb("glucose.pdb")

    # adjust the labelling
    glc.autolabel()

    # save the molecule to a new PDB file
    glc.to_pdb("glucose_autolabelled.pdb")


Another common operation is the adjustment of chain names and residue seqids. This can be done using the `reindex` method.
This method accepts three starting values for the chain name, residue seqid and atom serial number and will then reindex all chains, residues and atoms
to ensure they are continuously numbered and labelled. Some internal methods used when connecting different molecules are reliant on a continuous
numbering scheme, so this method should be called before connecting molecules that were loaded from PDB files.

.. code-block:: python

    # load a molecule from a PDB file
    glc = Molecule.from_pdb("glucose.pdb")

    # reindex the molecule
    glc.reindex()

We can also use one molecule as a "reference" for reindexing another molecule to make sure there are now labelling conflicts between them in case we want to connect them together later (this is usually done internally by _biobuild_ automatically).

.. code-block:: python  

    # load a molecule from a PDB file
    glc = Molecule.from_pdb("glucose.pdb")

    # load another molecule from a PDB file
    cel = Molecule.from_pdb("cellulose.pdb")
    cel.reindex() # make sure we have a continuous numbering scheme

    # reindex the glucose molecule using the cellulose molecule as a reference
    cel.adjust_indexing(glc)



Connecting Molecules
====================

Since most modifications are not simply single residues but rather complex structures, the second main purpose of a `Molecule` is to be easily 
connected to other Molecules to form a larger structure. To this end, the `Molecule` class provides a number of methods to easily assemble complex structures from small single residue molecules.


Forming Polymers
----------------

The simplest way to generate a large structure is probably the `repeat` method, which will repeat the given molecule `n` times
to form a homo-polymer.

.. code-block:: python

    # create a glucose molecule
    glc = Molecule.from_compound("GLC")

    # create cellulose from glucose
    # using a 1-4 beta-beta glycosidic linkage
    # which is pre-defined in the default CHARMMTopology
    glc.repeat(10, "14bb")

    # Now we have a cellulose of 10 glucoses

In the above example we used the `repeat` method explicitly, but we could also achieve the same with the short-hand `*=`. For this to work, we need to specify 
the linkage type beforehand. We do this by setting the `patch` attribute before using any operator.

.. code-block:: python

    # specify the "default" linkage type for connecting 
    # other molecules to this glucose
    glc.linkage = "14bb"

    # now make a cellulose by multiplying glucoses
    glc *= 20

    # Now we have a cellulose of 20 glucoses

    
If we wish to keep `glc` as a single residue Glucose and still get our desired cellulose, 
we can set `inplace=False` when calling `repeat` or simply use the `*` operator, both of which 
will have the same effect of creating a new copy that houses the appropriate residues.

.. code-block:: python

    cel = glc.repeat(10, "14bb", inplace=False)

    # or (equivalently)
    glc.patch = "14bb"
    cel = glc * 10


    
Connecting different Molecules
------------------------------

What if we want to connect different molecules? For example, we may want to connect a Galactose to a Glucose to form Lactose.
This can be achieved using the `attach` method, which will attach a given molecule to to another molecule.

.. code-block:: python

    glc = Molecule.from_compound("GLC")
    gal = Molecule.from_compound("GAL")

    # attach the galactose to the glucose
    # (we want a copy, so we set inplace=False just like with 'repeat')
    lac = glc.attach(gal, "14bb", inplace=False)

    # Now we have a lactose molecule


In the above example, the `attach` method is used to attach the galactose molecule to the glucose, but for those
among us who prefer a more shorty syntax, the `+` operator will do the same thing. 

.. code-block:: python

    # specify that incoming molecules shall be
    # attached using a 1-4 beta linkage
    glc.linkage = "14bb"

    # now attach the galactose to the glucose
    lac = glc + gal


Of course, if there is a `+` operator there should also be a `+=` operator, which is simply the equivalent of `attach` with `inplace=True`.

.. code-block:: python

    glc.linkage = "14bb"
    glc += gal


    # Now 'glc' is a lactose molecule


Setting default Modifiers
-------------------------

So far, we have always worked with a 1-4 beta-beta glycosidic linkage, which we apparently could select using the string `"14bb"`. 
But what if we want to use a different linkage type? For example, a 1-4 alpha-beta glycosidic linkage?  You of course noticed, 
that `attach` and `repeat` accept an argument `link` which allows you to specify the linkage type, and that if you leave it blank
the default linkage type is used. But how do we set the default linkage type?

Let's first check what linkage types are available by default anyway. Have you noticed an argument named `_topology`
at the end of the `attach` or `repeat` methods? The `topology` refers to the underlying _CHARMM_ topology which hosts the linkage type information.
By default a topology is already loaded in _biobuild_'s framework so it is not necessary for the user to specify anything here, but we can check which linkage types are available by:

.. code-block:: python

    import biobuild as bb

    # get the default topology
    topology = bb.get_default_topology()

    print(topology.patches)


Any of these linkages can be referenced by their name, e.g. `"14bb"` or `"14ab"`. 

Wait a second, my desired linkage is not in the list! What now?! Well, you can always define a new `Linkage` to suit your needs.
Check out the documentation on :ref:`linkages` for more information on how to do this. If you have your desired linkage 
ready to go, set it as the default by:

.. code-block:: python

    my_molecule.linkage = my_linkage
    
    # or if you feel "super correct"
    my_molecule.set_linkage(my_linkage)

    # or if you feel "extra cocky"
    my_molecule % my_linkage # <- the modulo operator assignes the "modifier" to the molecule


Now any call to `attach`, `repeat`, or any of its operator proxies will use your defined linkage by default.

Setting the default Residue for attachment
------------------------------------------

When defining a `Linkage` we specify which atoms are supposed to be connected and removed, but we do not specify which residues
these atoms belong to. We specify this as arguments inside the `attach` method for instance, but we can also leave this blank, in which case the
last residue in the molecule is used by default. This is obviously not always what we want, however! Hence, if we do not want to specify the residue for attachment
at every `attach` call or if we want to use the `+` operator, we can set the default residue for attachment by setting the `attach_residue` attribute:

.. code-block:: python

    # set the default attachment residue to the first residue
    my_molecule.attach_residue = 1

    # or
    my_molecule.set_attach_residue(1)

    # or (if you feel "extra cocky")
    my_molecule @ 1 # <- the 'at' operator sets the attachment residue


    # serial number indexing also works in reverse
    # (set the second last residue as the default attachment residue)
    my_molecule.attach_residue = -2  
"""

from copy import deepcopy
import os
from typing import Union

import numpy as np
import Bio.PDB as bio

import biobuild.core.entity as entity
import biobuild.core.Linkage as Linkage
import biobuild.utils as utils
import biobuild.structural as structural
import biobuild.resources as resources
import biobuild.optimizers as optimizers

__all__ = [
    "Molecule",
    "read_pdb",
    "write_pdb",
    "read_cif",
    "write_cif",
    "read_smiles",
    "molecule",
    "connect",
    "polymerize",
    "phosphorylate",
    "make_smiles",
    "query_pubchem",
]



[docs]
def read_pdb(filename: str, id: str = None) -> "Molecule":
    """
    Read a PDB file and return a molecule.

    Parameters
    ----------
    filename : str
        The path to the PDB file
    id : str
        The id of the molecule

    Returns
    -------
    molecule : Molecule
        The molecule
    """
    return Molecule.from_pdb(filename, id=id)




[docs]
def write_pdb(mol: "Molecule", filename: str) -> None:
    """
    Write a molecule to a PDB file.

    Parameters
    ----------
    mol : Molecule
        The molecule to write
    filename : str
        The path to the PDB file
    """
    mol.to_pdb(filename)




[docs]
def read_cif(filename: str, id: str = None) -> "Molecule":
    """
    Read a CIF file and return a molecule.

    Parameters
    ----------
    filename : str
        The path to the CIF file
    id : str
        The id of the molecule

    Returns
    -------
    molecule : Molecule
        The molecule
    """
    return Molecule.from_cif(filename, id=id)




[docs]
def write_cif(mol: "Molecule", filename: str) -> None:
    """
    Write a molecule to a CIF file.

    Parameters
    ----------
    mol : Molecule
        The molecule to write
    filename : str
        The path to the CIF file
    """
    mol.to_cif(filename)




[docs]
def read_smiles(smiles: str, id: str = None) -> "Molecule":
    """
    Read a SMILES string and return a molecule.

    Parameters
    ----------
    smiles : str
        The SMILES string

    Returns
    -------
    molecule : Molecule
        The molecule
    """
    return Molecule.from_smiles(smiles, id=id)




[docs]
def make_smiles(
    mol: "Molecule", isomeric: bool = True, write_hydrogens: bool = False
) -> str:
    """
    Generate a SMILES string from a molecule.

    Parameters
    ----------
    mol : Molecule
        The molecule
    isomeric : bool
        Whether to include stereochemistry information
    write_hydrogens : bool
        Whether to include hydrogens in the SMILES string

    Returns
    -------
    smiles : str
        The SMILES string
    """
    return mol.to_smiles(isomeric, write_hydrogens)




[docs]
def query_pubchem(query: str, by: str = "name") -> "Molecule":
    """
    Query the PubChem database for a given
    query string to obtain a Molecule object.

    Parameters
    ----------
    query : str
        The query string
    by : str
        The type of query to perform. Can be one of:
        he method to search by. This can be any of the following:

        - cid
        - name
        - smiles
        - sdf
        - inchi
        - inchikey
        - formula

    Returns
    -------
    Molecule or None
        The molecule or None if no match was found
    """
    try:
        return Molecule.from_pubchem(query, by=by)
    except:
        return None




[docs]
def molecule(mol=None) -> "Molecule":
    """
    Generate a molecule from an input. If the input is a string, the string can be a PDB id, some filename, SMILES or InChI string, IUPAC name or abbreviation.
    This function will try its best to automatically generate the molecule with minimal user effort. However, using a dedicated classmethod is
    recommended for more efficient and predictable results.

    Parameters
    ----------
    mol : str or structure-like object
        An input string or structure-like object such as a BioPython Structure or RDKit Molecule, etc.
        If nothing is provided, a new empty molecule is generated.

    Returns
    -------
    molecule : Molecule
        The generated molecule

    Examples
    --------
    >>> from biobuild import molecule
    >>> mol = molecule("GLC")
    >>> mol = molecule("GLC.pdb")
    >>> mol = molecule("alpha-d-glucose")
    """
    if mol is None:
        return Molecule.empty()

    if isinstance(mol, bio.Structure.Structure) or isinstance(
        mol, entity.base_classes.Structure
    ):
        return Molecule(mol)
    elif isinstance(mol, Molecule):
        return mol.copy()
    elif "openbabel" in str(type(mol).__mro__[0]):
        return Molecule.from_pybel(mol)
    elif "rdkit" in str(type(mol).__mro__[0]):
        return Molecule.from_rdkit(mol)
    elif (
        "openmm" in str(type(mol).__mro__[0])
        and hasattr(mol, "topology")
        and hasattr(mol, "positions")
    ):
        return Molecule.from_openmm(mol.topology, mol.positions)

    if not isinstance(mol, str):
        raise ValueError("input must be a structure object or a string")

    # ------------------
    # mol may be a PDB id
    # ------------------
    if resources.has_compound(mol):
        return resources.get_compound(mol)

    if os.path.isfile(mol):
        _mol = mol.lower()
        if _mol.endswith(".pdb"):
            return Molecule.from_pdb(mol)
        elif _mol.endswith(".cif"):
            return Molecule.from_cif(mol)
        elif _mol.endswith(".pkl"):
            return Molecule.load(mol)
        elif _mol.endswith(".json"):
            return Molecule.from_json(mol)
        elif (
            _mol.endswith(".mol")
            or _mol.endswith(".mol2")
            or _mol.endswith(".sdf")
            or _mol.endswith(".sd")
        ):
            return Molecule.from_molfile(mol)

    if " " not in mol:
        try:
            return Molecule.from_smiles(mol)
        except:
            pass

    try:
        return Molecule.from_pubchem(mol)
    except:
        pass

    raise ValueError(f"Could not generate molecule from input: {mol}")




[docs]
def polymerize(
    mol: "Molecule",
    n: int,
    link: Union[str, "Linkage.Linkage"] = None,
    inplace: bool = False,
) -> "Molecule":
    """
    Polymerize a molecule

    Parameters
    ----------
    mol : Molecule
        The molecule to polymerize
    n : int
        The number of monomers to add
    link : str or Linkage
        The linkage to use for polymerization. If None, the default linkage of the molecule is used.
    inplace : bool
        Whether to polymerize the molecule in place or return a new molecule

    Returns
    -------
    Molecule
        The polymerized molecule
    """
    if link is None and mol._linkage is None:
        raise ValueError(
            "No patch or recipe provided and no default is set on the molecule"
        )
    return mol.repeat(n, link, inplace=inplace)




[docs]
def connect(
    mol_a: "Molecule",
    mol_b: "Molecule",
    link: Union[str, "Linkage.Linkage"],
    at_residue_a: Union[int, "bio.Residue.Residue"] = None,
    at_residue_b: Union[int, "bio.Residue.Residue"] = None,
    copy_a: bool = True,
    copy_b: bool = True,
    _topology=None,
    use_patch: bool = True,
) -> "Molecule":
    """
    Connect two molecules together

    Parameters
    ----------
    mol_a : Molecule
        The first (target) molecule
    mol_b : Molecule
        The second (source) molecule
    link : Linkage or str
        The linkage to use for connection. This can be either an instance of the Linkage class or a string identifier
        of a pre-defined patch in the (currently loaded default or specified) CHARMMTopology.
    at_residue_a : int or bio.PDB.Residue
        The residue of the first molecule to connect to. If an integer is provided, the seqid must be used, starting at 1.
    at_residue_b : int or bio.PDB.Residue
        The residue of the second molecule to connect to. If an integer is provided, the seqid must be used, starting at 1.
    copy_a : bool
        Whether to copy the first molecule before connecting
    copy_b : bool
        Whether to copy the second molecule before connecting.
        If False, all atoms of the second molecule will be added to the first molecule.
    _topology : CHARMMTopology
        A specific topology to use in case a pre-existing patch is used as link and only the string identifier
        is supplied.
    use_patch : bool
        If the linkage has internal coordinates available (i.e. is a "patch") these are used by default. Set this to False
        to force-use stitching and its associated conformational optimization instead.

    Returns
    -------
    Molecule
        The connected molecule
    """
    if copy_b:
        mol_b = mol_b.copy()
    new = mol_a.attach(
        mol_b,
        link,
        at_residue=at_residue_a,
        other_residue=at_residue_b,
        inplace=not copy_a,
        _topology=_topology,
        use_patch=use_patch,
    )
    return new




[docs]
def phosphorylate(
    mol: "Molecule",
    at_atom: Union[int, str, entity.base_classes.Atom],
    delete: Union[int, str, entity.base_classes.Atom] = None,
    inplace: bool = True,
) -> "Molecule":
    """
    Phosphorylate a molecule at a specific atom

    Parameters
    ----------
    mol : Molecule
        The molecule to phosphorylate
    at_atom : int or str or Atom
        The atom to phosphorylate. If an integer is provided, the atom seqid must be used, starting at 1.
    delete : int or str or Atom
        The atom to delete. If an integer is provided, the atom seqid must be used, starting at 1.
        If not provided, any Hydrogen atom attached to the phosphorylated atom will be deleted.
    inplace : bool
        Whether to phosphorylate the molecule in place or return a new molecule

    Returns
    -------
    Molecule
        The phosphorylated molecule
    """
    phos = Molecule.from_compound("PO4")
    at_atom = mol.get_atom(at_atom)
    delete = mol.get_atom(delete)
    at_residue = at_atom.get_parent()

    l = Linkage.Linkage("PHOS", "a custom phosphorylation")
    l.add_bond(at_atom.id, "P")
    if delete:
        l.add_delete(delete.id, "target")
    l.add_delete("O2", "source")

    if not inplace:
        mol = mol.copy()

    mol.attach(phos, l, at_residue=at_residue)
    return mol



def methylate(
    mol: "Molecule",
    at_atom: Union[int, str, entity.base_classes.Atom],
    delete: Union[int, str, entity.base_classes.Atom] = None,
    inplace: bool = True,
) -> "Molecule":
    """
    Methylate a molecule at a specific atom

    Parameters
    ----------
    mol : Molecule
        The molecule to methylate
    at_atom : int or str or Atom
        The atom to methylate. If an integer is provided, the atom seqid must be used, starting at 1.
    delete : int or str or Atom
        The atom to delete. If an integer is provided, the atom seqid must be used, starting at 1.
        This atom needs to be in the same residue as the atom to methylate.
        If not provided, any Hydrogen atom attached to the methylated atom will be deleted.
    inplace : bool
        Whether to methylate the molecule in place or return a new molecule
    """
    methyl = Molecule.from_compound("CH3")
    at_atom = mol.get_atom(at_atom)
    at_residue = at_atom.get_parent()
    delete = mol.get_atom(delete, residue=at_residue)

    l = Linkage.Linkage("METHYLATE")
    l.add_bond(at_atom.id, "C")
    l.add_delete("HC1", "source")
    if delete:
        l.add_delete(delete.id, "target")

    mol.attach(methyl, l, at_residue=at_residue, inplace=inplace)
    return mol



[docs]
class Molecule(entity.BaseEntity):
    """
    A molecule to add onto a scaffold.
    A molecule consists of a single chain.

    Parameters
    ----------
    structure : bio.PDB.Structure
        A biopython structure object
    root_atom : str or int or Atom
        The id or the serial number of the root atom
        at which the molecule would be attached to a another
        structure such as protein scaffold or another Molecule.
    model : int
        The model to use from the structure. Defaults to 0. This may be any
        valid identifier for a model in the structure, such as an integer or string.
    chain : str
        The chain to use from the structure. Defaults to the first chain in the structure.
    """

    def __init__(
        self,
        structure,
        root_atom: Union[str, int, entity.base_classes.Atom] = None,
        model: int = 0,
        chain: str = None,
    ):
        super().__init__(structure, model)

        if not chain or len(self._model.child_list) == 1:
            self._working_chain = self._model.child_list[0]
        else:
            self._working_chain = self._model.child_dict.get(chain)
            if not self._working_chain:
                raise ValueError("The chain {} is not in the structure".format(chain))

        if root_atom:
            self.set_root(root_atom)
            if not root_atom in self._working_chain.get_atoms():
                raise ValueError("The root atom is not in the structure")
        self._root_atom = root_atom


[docs]
    @classmethod
    def empty(cls, id: str = None) -> "Molecule":
        """
        Create an empty Molecule object

        Parameters
        ----------
        id : str
            The id of the Molecule. By default an id is inferred from the filename.

        Returns
        -------
        Molecule
            An empty Molecule object
        """
        structure = structural.make_empty_structure(id)
        return cls(structure)



[docs]
    @classmethod
    def from_pdb(
        cls,
        filename: str,
        root_atom: Union[str, int] = None,
        id: str = None,
        model: int = 0,
        chain: str = None,
    ) -> "Molecule":
        """
        Read a Molecule from a PDB file

        Parameters
        ----------
        filename : str
            Path to the PDB file
        root_atom : str or int
            The id or the serial number of the root atom (optional)
        id : str
            The id of the Molecule. By default an id is inferred from the filename.
        model : int
            The model to use from the structure. Defaults to 0. This may be any
            valid identifier for a model in the structure, such as an integer or string.
        chain : str
            The chain to use from the structure. Defaults to the first chain in the structure.

        Returns
        -------
        Molecule
            The Molecule object
        """
        if id is None:
            id = utils.filename_to_id(filename)
        struct = utils.defaults.__bioPDBParser__.get_structure(id, filename)
        new = cls(struct, root_atom, model=model, chain=chain)
        bonds = utils.pdb.parse_connect_lines(filename)
        if len(bonds) != 0:
            for b in bonds:
                new.add_bond(*b)
        return new



[docs]
    @classmethod
    def from_compound(
        cls,
        compound: str,
        by: str = "id",
        root_atom: Union[str, int] = None,
    ) -> "Molecule":
        """
        Create a Molecule from a reference compound from the PDBECompounds database

        Parameters
        ----------
        compound : str
            The compound to search for
        by : str
            The field to search by. This can be
            - "id" for the PDB id
            - "name" for the name of the compound (must match any known synonym of the iupac name)
            - "formula" for the chemical formula
            - "smiles" for the SMILES string (also accepts InChI)
        root_atom : str or int
            The id or the serial number of the root atom (optional)
        """
        mol = resources.get_default_compounds().get(compound, by=by)
        if isinstance(mol, list):
            raise ValueError(
                f"Multiple compounds found using '{by}={compound}', choose any of these ids specifically {[i.id for i in mol]}"
            )
        mol.set_root(root_atom)
        return mol



[docs]
    @classmethod
    def from_smiles(
        cls,
        smiles: str,
        id: str = None,
        root_atom: Union[str, int] = None,
        add_hydrogens: bool = True,
    ) -> "Molecule":
        """
        Read a Molecule from a SMILES string

        Parameters
        ----------
        smiles : str
            The SMILES string
        id : str
            The id of the Molecule. By default the provided smiles string is used.
        root_atom : str or int
            The id or the serial number of the root atom (optional)
        add_hydrogens : bool
            Whether to add hydrogens to the molecule

        Returns
        -------
        Molecule
            The Molecule object
        """
        obj = structural.read_smiles(smiles, add_hydrogens)
        if structural.smiles.use_rdkit:
            new = cls.from_rdkit(obj, id=id)
        elif structural.smiles.use_openbabel:
            obj.title = id
            new = cls.from_pybel(obj)
        new.set_root(root_atom)
        return new



[docs]
    @classmethod
    def from_pubchem(
        cls,
        query: str,
        root_atom: Union[str, int] = None,
        by: str = "name",
        idx: int = 0,
    ) -> "Molecule":
        """
        Create a Molecule from PubChem

        Note
        ----
        PubChem follows a different atom labelling scheme than the CHARMM force field!
        This means that atom names may not match the names required by the default patches that are
        integrated in biobuild. It is advisable to run `autolabel` or `relabel_hydrogens` on the molecule.
        Naturally, custom patches or recipies working with adjusted atom names will always work.

        Parameters
        ----------
        query : str
            The query to search for in the PubChem database
        root_atom : str or int
            The id or the serial number of the root atom (optional)
        by : str
            The method to search by. This can be any of the following:
            - cid
            - name
            - smiles
            - sdf
            - inchi
            - inchikey
            - formula
        idx : int
            The index of the result to use if multiple are found. By default, the first result is used.

        Returns
        -------
        Molecule
            The Molecule object
        """
        _compound_2d, _compound_3d = resources.pubchem.query(query, by=by, idx=idx)
        new = _molecule_from_pubchem(_compound_2d.iupac_name, _compound_3d)
        _new = cls(new.structure)
        _new.add_bonds(*(i.to_tuple() for i in new._bonds))
        new = _new
        new.id = _compound_2d.iupac_name
        new.set_root(root_atom)
        return new



[docs]
    def to_smiles(self, isomeric: bool = True, write_hydrogens: bool = False) -> str:
        """
        Convert the molecule to a SMILES string

        Parameters
        ----------
        isomeric : bool
            Whether to include stereochemistry information in the SMILES string
        write_hydrogens : bool
            Whether to include hydrogens in the SMILES string

        Returns
        -------
        str
            The SMILES string
        """
        return structural.make_smiles(
            self, isomeric=isomeric, add_hydrogens=write_hydrogens
        )



[docs]
    def get_residue_connections(
        self,
        residue_a=None,
        residue_b=None,
        triplet: bool = True,
        direct_by: str = None,
    ):
        """
        Get bonds between atoms that connect different residues in the structure
        This method is different from `infer_residue_connections` in that it works
        with the already present bonds in the molecule instead of computing new ones.

        Parameters
        ----------
        residue_a, residue_b : Union[int, str, tuple, bio.Residue.Residue]
            The residues to consider. If None, all residues are considered.
            Otherwise, only between the specified residues are considered.
        triplet : bool
            Whether to include bonds between atoms that are in the same residue
            but neighboring a bond that connects different residues. This is useful
            for residues that have a side chain that is connected to the main chain.
            This is mostly useful if you intend to use the returned list for some purpose,
            because the additionally returned bonds are already present in the structure
            from inference or standard-bond applying and therefore do not actually add any
            particular information to the Molecule object itself.
        direct_by : str
            The attribute to sort by. Can be either "serial", "resid" or "root".
            In the case of "serial", the bonds are sorted by the serial number of the first atom.
            In the case of "resid", the bonds are sorted by the residue id of the first atom.
            In the case of "root", the bonds are sorted by the root atom of the first atom.
            Set to None to not sort the bonds.

        Returns
        -------
        set
            A set of tuples of atom pairs that are bonded and connect different residues
        """
        bonds = super().get_residue_connections(
            residue_a=residue_a, residue_b=residue_b, triplet=triplet
        )
        if direct_by is not None:
            direct_connections = None
            if direct_by == "resid":
                direct_connections = super().get_residue_connections(
                    residue_a=residue_a, residue_b=residue_b, triplet=False
                )
                direct_connections1, direct_connections2 = set(
                    a for a, b in direct_connections
                ), set(b for a, b in direct_connections)
                direct_connections = direct_connections1.union(direct_connections2)
            bonds = self._direct_bonds(bonds, direct_by, direct_connections)
        return bonds



[docs]
    def repeat(self, n: int, link=None, inplace: bool = True):
        """
        Repeat the molecule n times into a homo-polymer.

        Parameters
        ----------
        n : int
            The number or units of the final polymer.
        link : str or Patch or Recipe
            The patch or recipe to use when patching individual units together.
            If noe is given, the default patch or recipe is used (if defined).
        inplace : bool
            If True the molecule is directly modified, otherwise a copy of the molecule is returned.

        Returns
        -------
        molecule
            The modified molecule (either the original object or a copy)
        """
        if not isinstance(n, int):
            raise TypeError("Can only multiply a molecule by an integer")
        if n <= 0:
            raise ValueError("Can only multiply a molecule by a positive integer")
        if not link and not self._linkage:
            raise RuntimeError("Cannot multiply a molecule without a patch defined")

        _other = self.copy()
        if not inplace:
            obj = self.copy()
        else:
            obj = self

        _patch = obj._linkage
        if link:
            obj % link

        for i in range(n - 1):
            obj += _other

        if _patch:
            obj % _patch

        return obj



[docs]
    def attach(
        self,
        other: "Molecule",
        link: Union[str, "Linkage.Linkage"] = None,
        at_residue: Union[int, "entity.base_classes.Residue"] = None,
        other_residue: Union[int, "entity.base_classes.Residue"] = None,
        use_patch: bool = True,
        inplace: bool = True,
        other_inplace: bool = False,
        _topology=None,
    ):
        """
        Attach another structure to this one using a Patch or a Recipe.

        Parameters
        ----------
        other : Molecule
            The other molecule to attach to this one
        link : str or Linkage
            Either a Patch to apply when attaching or a Recipe to use when stitching.
            If None is defined, the default patch or recipe that was set earlier on the molecule is used.
        at_residue : int or Residue
            The residue to attach the other molecule to. If None, the defined `attach_residue` is used.
        other_residue : int or Residue
            The residue in the other molecule to attach this molecule to. If None, the defined `attach_residue` of the other molecule is used.
        use_patch : bool
            If the specified linkage is a patch (has internal coordinates) it can and is by default applied as a patch. However, it can also be used as a recipe.
            Set this to false if you want to use the patch as a recipe.
        inplace : bool
            If True the molecule is directly modified, otherwise a copy of the molecule is returned.
        other_inplace : bool
            All atoms from the other molecule are integrated into this one. Hence, the other molecule is left empty. If False, a copy of the other molecule is used.
            Thus leaving the original molecule intact.
        _topology : Topology
            The topology to use when attaching. If None, the topology of the molecule is used. Only used if the patch is a string.
        """
        if not isinstance(other, Molecule):
            raise TypeError("Can only attach a Molecule to another Molecule")

        if not inplace:
            obj = self.copy()
        else:
            obj = self

        if not link:
            link = obj._linkage
            if not link:
                raise ValueError("Cannot attach a molecule without a patch defined")

        if not other_inplace:
            _other = other.copy()
        else:
            _other = other

        if isinstance(link, str):
            if not _topology:
                _topology = resources.get_default_topology()
            link = _topology.get_patch(link)

        if link.has_IC and use_patch:
            obj.patch_attach(
                _other,
                link,
                at_residue=at_residue,
                other_residue=other_residue,
                _topology=_topology,
            )
        else:
            obj.stitch_attach(
                _other,
                link,
                at_residue=at_residue,
                other_residue=other_residue,
            )
        return obj



[docs]
    def patch_attach(
        self,
        other: "Molecule",
        patch: Union["Linkage.Linkage", str] = None,
        at_residue: Union[int, bio.Residue.Residue] = None,
        other_residue: Union[int, bio.Residue.Residue] = None,
        _topology=None,
    ):
        """
        Attach another structure to this one using a Patch.

        Parameters
        ----------
        other : Molecule
            The other molecule to attach to this one
        patch : str or Linkage
            A linkage to apply when attaching. If none is given, the default link that was set earlier
            on the molecule is used. If no patch was set, an AttributeError is raised. If a string is
            given, it is interpreted as the name of a patch in the topology.
        at_residue : int or Residue
            The residue to attach the other molecule to. If None, the last residue of the molecule.
        other_residue : int or Residue
            The residue of the other molecule to attach. If None, the first residue of the other molecule.
        _topology
            A specific topology to use for referencing.
            If None, the default CHARMM topology is used.
        """

        if not patch:
            patch = self._linkage
        if not patch:
            raise AttributeError(
                "No patch was set for this molecule. Either set a default patch or provide a patch when attaching."
            )
        if isinstance(patch, str):
            if not _topology:
                _topology = resources.get_default_topology()
            patch = _topology.get_patch(patch)
        if not patch:
            raise ValueError(
                "No patch was found with the given name. Either set a default patch or provide a patch when attaching."
            )

        p = structural.__default_keep_keep_patcher__
        p.apply(patch, self, other, at_residue, other_residue)
        p.merge()
        return self



[docs]
    def stitch_attach(
        self,
        other: "Molecule",
        recipe: "Linkage.Linkage" = None,
        remove_atoms=None,
        other_remove_atoms=None,
        at_atom=None,
        other_at_atom=None,
        at_residue=None,
        other_residue=None,
    ):
        """
        Stitch two molecules together by removing atoms and connecting them with a bond. This works without a pre-defined patch.

        Parameters
        ----------
        other : Molecule
            The other molecule to attach to this one
        recipe : Recipe
            The recipe to use when stitching. If None, the default recipe that was set earlier on the molecule is used (if defined).
        remove_atoms : list of int
            The atoms to remove from this molecule. Only used if no recipe is provided.
        other_remove_atoms : list of int
            The atoms to remove from the other molecule. Only used if no recipe is provided.
        at_atom : int or str or Bio.PDB.Atom
            The atom forming the bond in this molecule. If a string is provided, an `at_residue` needs to be defined from which to get the atom. If None is provided, the root atom is used (if defined). Only used if no recipe is provided.
        other_at_atom : int or str or Bio.PDB.Atom
            The atom to attach to in the other molecule. If a string is provided, an `other_residue` needs to be defined from which to get the atom. If None is provided, the root atom is used (if defined). Only used if no recipe is provided.
        at_residue : int or Residue
            The residue to attach the other molecule to. If None, the `attach_residue` is used. Only used if a recipe is provided and the atoms
        """
        if not recipe and remove_atoms is None:
            if not self._linkage:
                raise AttributeError(
                    "No recipe was set for this molecule and no manual instructions were found. Either set a default recipe, provide a recipe when stitching, or provide the information about removed and bonded atoms directly."
                )
            recipe = self._linkage

        if recipe:
            target_atom, source_atom = recipe._stitch_ref_atoms
            return self.stitch_attach(
                other,
                remove_atoms=recipe.deletes[0],
                other_remove_atoms=recipe.deletes[1],
                at_atom=target_atom,
                other_at_atom=source_atom,
                at_residue=at_residue,
                other_residue=other_residue,
            )

        if not isinstance(other, Molecule):
            raise TypeError("Can only stitch two molecules together")

        if not at_atom:
            at_atom = self._root_atom
            if not at_atom:
                raise ValueError(
                    "No atom to attach to was provided and no root atom was defined in this molecule"
                )

        if not other_at_atom:
            other_at_atom = other._root_atom
            if not other_at_atom:
                raise ValueError(
                    "No atom to attach to was provided and no root atom was defined in the other molecule"
                )

        if not at_residue:
            at_residue = self.attach_residue

        if not other_residue:
            other_residue = other.attach_residue

        # since we are already copying before we can use the keep-keep stitcher, actually...
        p = structural.__default_keep_keep_stitcher__
        p.apply(
            self,
            other,
            remove_atoms,
            other_remove_atoms,
            at_atom,
            other_at_atom,
            at_residue,
            other_residue,
        )
        self = p.merge()
        return self



[docs]
    def optimize(
        self,
        residue_graph: bool = None,
        algorithm: str = None,
        rotatron: str = None,
        rotatron_kws: dict = None,
        algorithm_kws: dict = None,
        inplace: bool = True,
    ):
        """
        Optimize the molecule's conformation. This is a convenience method with less
        customizability than a manual optimization using the `optimizers` module.

        Parameters
        ----------
        residue_graph : bool
            Whether to use the residue graph or the full atom graph for optimization.
            The residue graph is faster but less accurate. If the molecule is larger than 100 atoms,
            the residue graph is used by default.
        algorithm : str
            The optimization algorithm to use. If not provided, an algorithm is determined based on the molecule's size.
            This can be one of the following:
            - "genetic" for a genetic algorithm
            - "scipy" for a scipy-implemented gradient-based optimization
            - "swarm" for a particle swarm optimization
            - "anneal" for a simulated annealing optimization
            - "rdkit" for an RDKit-implemented force-field-based optimization (if RDKit is installed)
        rotatron : str
            The rotatron to use. This can be one of the following:
            - "distance" for a distance-based rotatron (default)
            - "overlap" for an overlap-based rotatron
        algorithm_kws : dict
            Keyword arguments to pass to the optimization algorithm
        rotatron_kws : dict
            Keyword arguments to pass to the rotatron
        inplace : bool
            Whether to optimize the molecule in place or return a copy.

        Returns
        -------
        molecule
            The optimized molecule (either the original object or a copy)
        """
        if residue_graph is None:
            residue_graph = self.count_atoms() > 100

        if not algorithm_kws:
            algorithm_kws = {}

        if not rotatron_kws:
            rotatron_kws = {}

        algorithm = algorithm or optimizers.auto_algorithm(self)

        if algorithm == "rdkit":
            obj = self.copy() if not inplace else self
            out = optimizers.rdkit_optimize(obj)
            out.id = self.id
            return out
        elif algorithm == "scipy":
            algorithm = optimizers.scipy_optimize
        elif algorithm == "genetic":
            algorithm = optimizers.genetic_optimize
        elif algorithm == "swarm":
            algorithm = optimizers.swarm_optimize
        elif algorithm == "anneal":
            algorithm = optimizers.anneal_optimize
        else:
            raise ValueError(f"Unknown algorithm {algorithm}")

        rotatron = rotatron or "distance"
        if rotatron == "distance":
            rotatron = optimizers.DistanceRotatron
        elif rotatron == "overlap":
            rotatron = optimizers.OverlapRotatron
        else:
            raise ValueError(f"Unknown rotatron {rotatron}")

        obj = self.copy() if not inplace else self

        if residue_graph:
            graph = obj.make_residue_graph(detailed=True)
        else:
            graph = obj.make_atom_graph(_copy=False)

        edges = graph.find_rotatable_edges(
            graph.central_node, min_ancestors=1, min_descendants=1
        )

        env = rotatron(graph, edges, **rotatron_kws)
        sol, _ = algorithm(env, **algorithm_kws)
        if sol.shape[0] != len(env.rotatable_edges):
            sol = sol[0]
        out = optimizers.apply_solution(sol, env, obj)
        return out


    def __add__(self, other) -> "Molecule":
        """
        Add two molecules together. This will return a new molecule.
        """
        if isinstance(other, (tuple, list)):
            new = self + other[0]
            for o in other[1:]:
                new += o
            return new

        elif isinstance(other, Molecule):
            patch = self._linkage
            if not patch:
                patch = other._linkage
            if not patch:
                raise RuntimeError(
                    "Cannot add two molecules together without a patch, set a default patch on either of them (preferably on the one you are adding to, i.e. mol1 in mol1 + mol2)"
                )
            new = self.attach(other, patch, inplace=False, other_inplace=False)
            return new

        new = self.copy()
        if isinstance(other, entity.base_classes.Atom):
            new.add_atoms(other)

        elif isinstance(other, entity.base_classes.Residue):
            new.add_residues(other)

        elif isinstance(other, entity.base_classes.Chain):
            new._model.add(other)

        elif isinstance(other, entity.base_classes.Model):
            new._base_struct.add(other)

        return new

    def __iadd__(self, other) -> "Molecule":
        """
        Attach another molecule to this one
        """
        if isinstance(other, (tuple, list)):
            for o in other:
                self += o
            return self

        elif isinstance(other, Molecule):
            patch = self._linkage
            if not patch:
                patch = other._linkage
            if not patch:
                raise RuntimeError(
                    "Cannot add two molecules together without a patch, set a default patch on either of them (preferably on the one you are adding to, i.e. mol1 in mol1 += mol2)"
                )

            self.attach(other, patch)
            return self

        elif isinstance(other, entity.base_classes.Atom):
            self.add_atoms(other)

        elif isinstance(other, entity.base_classes.Residue):
            self.add_residues(other)

        elif isinstance(other, entity.base_classes.Chain):
            self._model.add(other)

        elif isinstance(other, entity.base_classes.Model):
            self._base_struct.add(other)

        return self

    def __sub__(self, other) -> "Molecule":
        """
        Remove atoms, residues, chains or models from the molecule
        """
        if isinstance(other, (tuple, list)):
            new = self.copy()
            for o in other:
                new -= o
            return new

        new = self.copy()
        if isinstance(other, entity.base_classes.Atom):
            new.remove_atoms(other)

        elif isinstance(other, entity.base_classes.Residue):
            new.remove_residues(other)

        elif isinstance(other, entity.base_classes.Chain):
            new._model.detach_child(other.get_id())

        elif isinstance(other, entity.base_classes.Model):
            new._base_struct.detach_child(other.get_id())

        elif isinstance(other, Molecule):
            new.remove_residues(*other.residues)

        return new

    def __isub__(self, other) -> "Molecule":
        """
        Remove atoms, residues, chains or models from the molecule
        """
        if isinstance(other, (tuple, list)):
            for o in other:
                self -= o
            return self

        elif isinstance(other, entity.base_classes.Atom):
            self.remove_atoms(other)

        elif isinstance(other, entity.base_classes.Residue):
            self.remove_residues(other)

        elif isinstance(other, entity.base_classes.Chain):
            self._model.detach_child(other.get_id())

        elif isinstance(other, entity.base_classes.Model):
            self._base_struct.detach_child(other.get_id())

        elif isinstance(other, Molecule):
            self.remove_residues(*other.residues)

        return self

    def __mul__(self, n) -> "Molecule":
        """
        Add multiple identical molecules together using the * operator (i.e. mol * 3)
        This requires that the molecule has a patch defined
        """
        if not self._linkage:
            raise RuntimeError("Cannot multiply a molecule without a patch defined")

        if not isinstance(n, int):
            raise TypeError("Can only multiply a molecule by an integer")
        if n <= 0:
            raise ValueError("Can only multiply a molecule by a positive integer")

        new = self.attach(self, self._linkage, inplace=False, other_inplace=False)
        for i in range(n - 2):
            new = new.attach(self, self._linkage, inplace=True, other_inplace=False)

        return new

    def __imul__(self, n) -> "Molecule":
        """
        Add multiple identical molecules together using the *= operator (i.e. mol *= 3)
        This requires that the molecule has a patch defined
        """
        self.repeat(n)
        return self

    def __repr__(self):
        return f"Molecule({self.id})"

    def __str__(self):
        string = f"Molecule {self.id}:\n"
        if len(self.residues) > 1:
            string += str(self.make_residue_graph())
        else:
            string += str(self.residues[0])
        return string



def _molecule_from_pubchem(id, comp):
    """
    Convert a PubChem compound to a Molecule.

    Parameters
    ----------
    id : str
        The id of the molecule
    comp : pcp.Compound
        The PubChem compound

    Returns
    -------
    Molecule
        The molecule
    """
    bonds = comp.bonds
    mol = Molecule.empty(id)
    residue = entity.base_classes.Residue("UNK", " ", 1)
    mol.add_residues(residue)
    # structure = bio.Structure.Structure(id)
    # model = bio.Model.Model(0)
    # structure.add(model)
    # chain = bio.Chain.Chain("A")
    # model.add(chain)
    # residue = bio.Residue.Residue((" ", 1, " "), "UNK", 1)
    # chain.add(residue)

    element_counts = {}
    adx = 1
    for atom in comp.atoms:
        element = atom.element
        element_counts[element] = element_counts.get(element, 0) + 1

        id = f"{element}{element_counts[element]}"
        _atom = entity.base_classes.Atom(
            id,
            coord=np.array((atom.x, atom.y, atom.z)),
            serial_number=adx,
            bfactor=0.0,
            occupancy=0.0,
            element=element.upper(),
            fullname=id,
            altloc=" ",
            pqr_charge=0.0,
        )
        residue.add(_atom)
        adx += 1

    for bond in bonds:
        mol.add_bond(bond.aid1, bond.aid2, bond.order)

    return mol

    # def _molecule_from_ctfile(id, ctfile):
    #     """
    #     Convert a CTFile object to a Molecule.

    #     Parameters
    #     ----------
    #     id : str
    #         The id of the molecule
    #     ctfile : CTFile
    #         The CTFile object

    #     Returns
    #     -------
    #     Molecule
    #         The molecule
    #     """
    #     new = Molecule.empty(id)
    #     residue = entity.base_classes.Residue("UNK", " ", 1)
    #     new.add_residues(residue)

    # element_counts = {}
    # for atom in ctfile.atoms:
    #     element_counts.setdefault(atom.atom_symbol, 1)
    #     count = element_counts[atom.atom_symbol]
    #     element_counts[atom.atom_symbol] += 1

    #     x = float(atom._ctab_data["x"])
    #     y = float(atom._ctab_data["y"])
    #     z = float(atom._ctab_data["z"])

    #     _atom = entity.base_classes.Atom(
    #         id=f"{atom.atom_symbol}{count}",
    #         coord=np.array((x, y, z)),
    #         serial_number=int(atom.atom_number),
    #         pqr_charge=float(atom.charge),
    #     )
    #     residue.add(_atom)

    # for bond in ctfile.bonds:
    #     a, b = bond.first_atom, bond.second_atom
    #     a = int(a.atom_number)
    #     b = int(b.atom_number)
    #     order = int(bond._ctab_data["bond_type"])
    #     for i in range(order):
    #         new.add_bond(a, b)

    # return new


if __name__ == "__main__":
    # ser = Molecule.from_json("ser.json")
    # link = Linkage.from_json("peptide_linkage.json")

    # pep = ser.attach(ser, link, use_patch=False)
    # pep.show()

    # # ser = Molecule.from_pubchem("SER")
    # ser.to_cif("ser.cif")
    # ser.to_pdb("ser.pdb")
    # recipe = Linkage()
    # recipe.add_delete("O1", "target")
    # recipe.add_delete("HO1", "target")
    # recipe.add_delete("HO4", "source")
    # recipe.add_bond(("C1", "O4"))

    # glc = Molecule.from_compound("GLC")
    # glc.set_linkage(recipe)

    # glc2 = glc.copy()

    # _current_residues = len(glc.residues)
    # glc3 = glc + glc2
    # glc3.show()
    # glc3 = glc % "14bb" + glc
    pass
    # import pickle

    # tmp = pickle.load(open("/Users/noahhk/GIT/biobuild/tmp.pickle", "rb"))
    # tmp.get_residue_connections()
    # exit()
    # from timeit import timeit

    # # man = Molecule.from_compound("MAN")
    # glc = Molecule.from_compound("GLC")

    # # man.adjust_indexing(glc)
    # # assert glc.atoms[0].serial_number == len(man.atoms) + 1

    # t1 = timeit()

    # glc.repeat(5, "14bb")
    # # glc % "14bb"
    # # glc *= 10

    # t2 = timeit()

    # print(t2 - t1)

    from biobuild.utils import visual

    # v = visual.MoleculeViewer3D(glc)
    # v.show()

    # man = Molecule.from_pdb("support/examples/membrane.pdb", model=4)
    # man.infer_bonds()
    # man.

    man = Molecule.from_pubchem("benzene")
    man.infer_bonds(restrict_residues=False)

    g = man.make_residue_graph(True)
    v = visual.MoleculeViewer3D(g)

    for c in man.get_residue_connections():
        v.draw_vector(
            None, c[0].coord, 1.3 * (c[1].coord - c[0].coord), color="magenta"
        )
    v.show()