Source code for buildamol.resources.pdbe_compounds

"""
This module defines a Parser to extract information from the PDBE compound library.

Reading the PDBE Compound Library
=================================

The PDBE Compound Library is a database of small molecules that are found in PDB structures. It can be downloaded as an mmCIF file. 
BuildAMol provides a parser to extract information from the mmCIF file.

To parse a PDBE Compound Library file, we can use the toplevel function `read_compounds` or use the `PDBECompounds` class directly

.. code-block:: python

    import BuildAMol
    from buildamol.resources import PDBECompounds

    compounds = bam.read_compounds("path/to/pdbe-compounds.cif")
    # or 
    compounds = PDBECompounds.from_file("path/to/pdbe-compounds.cif")

Saving and loading PDBECompounds
================================

Because parsing may be an intensive operation, the `PDBECompounds` class implements a `save` method to save the parsed data to a pickle file.
For future sessions the pre-parsed object can be directly loaded using the `load` method. Naturally, both operations also have a functional equivalent.

.. code-block:: python

    # save an existing PDBECompounds object
    compounds.save("path/to/pdbe-compounds.pkl")
    # or 
    bam.save_compounds("path/to/pdbe-compounds.pkl", compounds)
    
    # load a pre-parsed PDBECompounds object
    compounds = PDBECompounds.load("path/to/pdbe-compounds.pkl")
    # or (read_compounds also supports loading)
    compounds = bam.read_compounds("path/to/pdbe-compounds.pkl")

In order to export data from the PDBE compounds library, the `PDBECompounds` class also implements a `to_json` and a `to_xml` method to export the data to an agnostic data format.
Exported data can be loaded by `from_json` or `from_xml` respectively (or using `read_compounds` function which supports both formats). This is useful for sharing compound data with others
who may use different versions of BuildAMol and thus may have issues with pickeled objects, or for sharing data with other programs.

.. code-block:: python

    # save an existing PDBECompounds object
    compounds.to_json("path/to/pdbe-compounds.json")
    
    # or to XML with the toplevel function 
    bam.export_compounds("path/to/pdbe-compounds.xml", compounds)

    
Working with PDBECompounds
==========================

The `PDBECompounds` class provides a dictionary-like interface to the compounds in the library. It supports a number of query methods unified within the `get`. 
Compounds can be obtained from the library from their 3-letter _PDB ID_, their _name_, _chemical formula_, or _SMILES_ or _InChI_ string.


.. code-block:: python

    # get a compound by its PDB ID
    glc = compounds.get("GLC")

    # get a compound by its name
    glc = compounds.get("alpha d-glucose", by="name")

    # get a compound by its chemical formula
    glc = compounds.get("C6 H12 O6", by="formula")

    # get a compound by its SMILES string
    glc = compounds.get("C([C@@H]1[C@H]([C@@H]([C@H](O[C@@H]1O)O)O)O)O", by="smiles")

    # get a compound by its InChI string (note the "InChI=" prefix)
    # THIS ALSO USES `by="smiles"`!
    glc = compounds.get("InChI=1S/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6-/m1/s1", by="smiles")

    
By default the `get` method will create a `Molecule` object from the compound. Other output formats such as `biopython.Structure` or simple `dict` are also supported and can be specified using the `return_type` argument.
If multiple compounds match a query, they are returned as a list (unless `return_type` is a dictionary, in which they are kept as a dictionary).    


.. code-block:: python

    # get a compound as a `buildamol.Structure` object
    glc = compounds.get("GLC", return_type="structure")

    # get a compound as a `dict`
    glc = compounds.get("GLC", return_type="dict")

    

Setting default PDBECompounds
=============================

_BuildAMol_ loads a default PDBECompounds object for convenience. The default instance can be accessed using the `get_default_compounds` function. A custom instance can be set as the default using the `set_default_compounds` function.

.. code-block:: python

    import buildamol as bam

    # get the default PDBECompounds instance
    compounds = bam.get_default_compounds()

    # ... do something with the compounds

    # set a custom PDBECompounds instance as the default
    bam.set_default_compounds(compounds)

.. warning::

    The `set_default_compounds` has an additional keyword argument `overwrite` which is set to `False` by default. If set to `True` the default instance
    is permanently overwritten by the new one and will be used automatically by all future sessions. This is not recommended as it may lead to unexpected behaviour.
    

If the user is only interested in working with the default instance rather than a specific `PDBECompounds` instance, there are a number of toplevel functions available to short-cut the process.

For instance, in order to get the compound glucose from the default instance, we can do

.. code-block:: python

    # instead of doing
    compounds = bam.get_default_compounds()
    glc = compounds.get("GLC", return_type="dict")

    # we can do
    glc = bam.get_compound("GLC", return_type="dict")

Other toplevel functions indlude `has_compound` and `add_compound`.

"""

import os
import warnings
import numpy as np
import pdbecif.mmcif_tools as mmcif_tools
import pickle

import Bio.PDB as bio
from Bio import SVDSuperimposer

import buildamol.utils.json as json
import buildamol.utils.defaults as defaults
import buildamol.utils.auxiliary as aux
import buildamol.utils as utils
import buildamol.base_classes as base_classes
import buildamol.core.Molecule as Molecule

__loaded_compounds__ = {
    "sugars": False,
    "lipids": False,
    "small_molecules": False,
    "amino_acids": False,
    "nucleotides": False,
}

__to_float__ = {
    "_chem_comp": set(
        (
            "formula_weight",
            "pdbx_formal_charge",
        )
    ),
}

__to_delete__ = {
    "_chem_comp": set(
        (
            "pdbx_type",
            "mon_nstd_parent_comp_id",
            "pdbx_synonyms",
            "pdbx_initial_date",
            "pdbx_modified_date",
            "pdbx_release_status",
            "pdbx_ambiguous_flag",
            "pdbx_replaced_by",
            "pdbx_replaces",
            "pdbx_processing_site",
            "pdbx_model_coordinates_details",
            "pdbx_model_coordinates_db_code",
            "pdbx_ideal_coordinates_details",
            "pdbx_subcomponent_list",
            "pdbx_model_coordinates_missing_flag",
            "pdbx_ideal_coordinates_missing_flag",
        )
    ),
}

# a list of categories to ignore while parsing
__to_ignore__ = [
    "_pdbx_chem_comp_audit",
    "_pdbx_chem_comp_feature",
]

__needs_to_have__ = set(
    (
        "_chem_comp",
        "_pdbx_chem_comp_descriptor",
        "_pdbx_chem_comp_identifier",
        "_chem_comp_atom",
        "_chem_comp_bond",
    )
)

__search_by__ = ("id", "name", "formula", "smiles")


_bond_order_rev_map = {
    "SING": 1,
    "DOUB": 2,
    "TRIP": 3,
}
"""
Reverse map of bond order names to numbers.
"""

_bond_order_map = {
    1: "SING",
    2: "DOUB",
    3: "TRIP",
}
"""
Map of bond order numbers to names.
"""


[docs] def read_compounds(filename: str, set_default: bool = True) -> "PDBECompounds": """ Reads a PDBECompounds object from a CIF, JSON, XML, or pickle file. Parameters ---------- filename : str The path to the file to read. set_default : bool, optional Whether to set the read PDBECompounds object as the default. Defaults to True. Returns ------- PDBECompounds The PDBECompounds object parsed from the file. """ if filename.endswith(".cif"): compounds = PDBECompounds.from_file(filename) elif filename.endswith(".json"): compounds = PDBECompounds.from_json(filename) elif filename.endswith(".pkl") or filename.endswith(".pickle"): compounds = PDBECompounds.load(filename) elif filename.endswith(".xml"): compounds = PDBECompounds.from_xml(filename) else: raise ValueError( "Unsupported file format. Only CIF, JSON and pickle (.pkl or .pickle) files are supported." ) if set_default: set_default_compounds(compounds) return compounds
[docs] def save_compounds(filename: str, compounds: "PDBECompounds" = None): """ Saves a PDBECompounds object to a file. Parameters ---------- filename : str The path to the pickle file to save to. compounds : PDBECompounds, optional The PDBECompounds object to save. If not provided, the default PDBECompounds object is used. """ if compounds is None: compounds = get_default_compounds() compounds.save(filename)
[docs] def export_compounds(filename: str, compounds: "PDBECompounds" = None): """ Export the PDBECompounds object to a JSON or XML file. Parameters ---------- filename : str The path to the file to save to. compounds : PDBECompounds, optional The PDBECompounds object to save. If not provided, the default PDBECompounds object is used. """ if compounds is None: compounds = get_default_compounds() if filename.endswith(".json"): compounds.to_json(filename) elif filename.endswith(".xml"): compounds.to_xml(filename) else: raise ValueError( "Unsupported file format. Only JSON and XML files are supported." )
[docs] def has_compound(compound: str, search_by: str = None) -> bool: """ Checks if a given compound is available in the currently loaded default PDBECompounds instance. Parameters ---------- compound : str The compound to check. search_by : str, optional The type of search to perform. One of `id`, `name`, `formula`, `smiles` (includes `inchi`). By default, all search types are used. """ if search_by is not None: return get_default_compounds().has_residue(compound, by=search_by) for by in __search_by__: if get_default_compounds().has_residue(compound, by=by): return True return False
[docs] def get_compound(compound: str, search_by: str = None, return_type: str = "molecule"): """ Get a compound from the currently loaded default PDBECompounds instance. Parameters ---------- compound : str The compound to get. search_by : str, optional The type of search to perform. One of `id`, `name`, `formula`, `smiles` (includes `inchi`). By default, all search types are used. return_type : str, optional The type of object to return. One of `molecule` (a BuildAMol Molecule), `structure` (a BuildAMol structure only, does not include connectivity), `dict` (the compound data and coordinate dictionaries). Defaults to `molecule`. Returns ------- object or list The molecule object, structure, or tuple of dictionaries. If multiple compounds are found matching, they are returned as a list. """ if not has_compound(compound, search_by=search_by): raise ValueError( f"Compound {compound} not found in the default PDBECompounds instance." ) comps = get_default_compounds() for by in __search_by__: if comps.has_residue(compound, by=by): comp = comps.get(compound, by=by, return_type=return_type) if return_type == "dict": if isinstance(comp, list): comp = [(c, comps._pdb[c["id"]]) for c in comp] else: comp = comp, comps._pdb[comp["id"]] return comp
[docs] def add_compound( mol: "Molecule", type: str = None, names: list = None, identifiers: list = None, formula: str = None, overwrite: bool = False, ): """ Add a compound to the currently loaded default PDBECompounds instance. Parameters ---------- mol : Molecule The molecule to add. type : str, optional The type of compound. For instance `protein`, `dna`, `rna`, `ligand`, `water`, `ion`, `other`. names : list, optional A list of names for the compound. identifiers : list, optional A list of identifiers for the compound. formula : str, optional The chemical formula of the compound. overwrite : bool, optional Whether to hardcode overwrite the current default PDBECompounds instance. This will make the compound available for all future sessions. """ get_default_compounds().add( mol, type=type, names=names, identifiers=identifiers, formula=formula ) if overwrite: set_default_compounds(get_default_compounds(), overwrite=True)
[docs] def get_default_compounds() -> "PDBECompounds": """ Get the currently loaded default PDBECompounds instance. Returns ------- PDBECompounds The currently loaded default PDBECompounds instance. """ return defaults.__default_instances__["PDBECompounds"]
[docs] def set_default_compounds(obj, overwrite: bool = False): """ Set a PDBECompounds object as the new default. Parameters ---------- obj : PDBECompounds The PDBECompounds object to set as the new default overwrite : bool If set to `True`, the new object will be permanently saved as the default. Otherwise, the new object will only be used for the current session. """ if not obj.__class__.__name__ == "PDBECompounds": raise TypeError("The object must be a PDBECompounds instance.") if overwrite: current = defaults.get_default_instance("PDBECompounds") if current: if not os.path.exists( defaults.DEFAULT_PDBE_COMPONENT_FILES["base"] + ".bak" ): current.save(defaults.DEFAULT_PDBE_COMPONENT_FILES["base"] + ".bak") defaults.__default_instances__["PDBECompounds"] = obj if overwrite: obj.save(defaults.DEFAULT_PDBE_COMPONENT_FILES["base"])
[docs] def save_as_default_compounds(obj): """ Save a PDBECompounds object as the new default. This will leave the currently active default object unchanged, but will overwrite the default file for any future sessions. Parameters ---------- obj : PDBECompounds The PDBECompounds object to set as the new default """ if not obj.__class__.__name__ == "PDBECompounds": raise TypeError("The object must be a PDBECompounds instance.") current = defaults.get_default_instance("PDBECompounds") if current: current.save(defaults.DEFAULT_PDBE_COMPONENT_FILES["base"] + ".bak") obj.save(defaults.DEFAULT_PDBE_COMPONENT_FILES["base"])
[docs] def restore_default_compounds(overwrite: bool = True): """ Restore the default PDBECompounds object from the backup file Parameters ---------- overwrite : bool If set to `True`, the backup file will be permanently saved as the default again. """ if not os.path.isfile(defaults.DEFAULT_PDBE_COMPONENT_FILES["base"] + ".bak"): raise FileNotFoundError( "No backup file found. The default PDBECompounds object has not been modified." ) defaults.__default_instances__["PDBECompounds"] = PDBECompounds.from_json( defaults.DEFAULT_PDBE_COMPONENT_FILES["base"] + ".bak" ) if overwrite: os.rename( defaults.DEFAULT_PDBE_COMPONENT_FILES["base"] + ".bak", defaults.DEFAULT_PDBE_COMPONENT_FILES["base"], )
[docs] def unload_by_types(*t: str): """ Unload all loaded compounds from the default PDBECompounds instance of a given types. Parameters ---------- t: str The types to unload. """ comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] in t] for i in ids: comps.remove(i)
[docs] def subset_compounds_by_types(*t: str) -> "PDBECompounds": """ Get a new PDBECompounds object containing only compounds from the currently loaded default compounds of the given types. Parameters ---------- t: str The types to include. Returns ------- PDBECompounds A new PDBECompounds object containing only compounds of the given types. """ new = PDBECompounds() comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] in t] for i in ids: new._compounds[i] = comps._compounds[i].copy() new._pdb[i] = comps._pdb[i].copy() return new
[docs] def load_amino_acids(): """ Load amino acid components into the default PDBECompounds instance. """ if __loaded_compounds__["amino_acids"]: return comps = get_default_compounds() amino_acids = read_compounds( defaults.DEFAULT_PDBE_COMPONENT_FILES["amino_acids"], set_default=False ) warnings.filterwarnings("ignore") comps.merge(amino_acids) warnings.filterwarnings("default") __loaded_compounds__["amino_acids"] = True
[docs] def unload_amino_acids(): """ Unload amino acid components from the default PDBECompounds instance. """ if not __loaded_compounds__["amino_acids"]: return comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] == "AMINO-ACID"] for i in ids: comps.remove(i) __loaded_compounds__["amino_acids"] = False
[docs] def load_lipids(): """ Load lipid components into the default PDBECompounds instance. """ if __loaded_compounds__["lipids"]: return comps = get_default_compounds() lipids = read_compounds( defaults.DEFAULT_PDBE_COMPONENT_FILES["lipids"], set_default=False ) warnings.filterwarnings("ignore") comps.merge(lipids) warnings.filterwarnings("default") __loaded_compounds__["lipids"] = True
[docs] def unload_lipids(): """ Unload lipid components from the default PDBECompounds instance. """ if not __loaded_compounds__["lipids"]: return comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] == "LIPID"] for i in ids: comps.remove(i) __loaded_compounds__["lipids"] = False
[docs] def load_sugars(): """ Load sugar components into the default PDBECompounds instance. """ if __loaded_compounds__["sugars"]: return comps = get_default_compounds() sugars = read_compounds( defaults.DEFAULT_PDBE_COMPONENT_FILES["sugars"], set_default=False ) warnings.filterwarnings("ignore") comps.merge(sugars) warnings.filterwarnings("default") __loaded_compounds__["sugars"] = True
[docs] def unload_sugars(): """ Unload sugar components from the default PDBECompounds instance. """ if not __loaded_compounds__["sugars"]: return comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] == "SACCHARIDE"] for i in ids: comps.remove(i) __loaded_compounds__["sugars"] = False
[docs] def load_nucleotides(): """ Load nucleotide components into the default PDBECompounds instance. """ if __loaded_compounds__["nucleotides"]: return comps = get_default_compounds() nucleotides = read_compounds( defaults.DEFAULT_PDBE_COMPONENT_FILES["nucleotides"], set_default=False ) warnings.filterwarnings("ignore") comps.merge(nucleotides) warnings.filterwarnings("default") __loaded_compounds__["nucleotides"] = True
[docs] def unload_nucleotides(): """ Unload nucleotide components from the default PDBECompounds instance. """ if not __loaded_compounds__["nucleotides"]: return comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] == "NUCLEIC-ACID"] for i in ids: comps.remove(i) __loaded_compounds__["nucleotides"] = False
[docs] def load_small_molecules(): """ Load small molecule components into the default PDBECompounds instance. """ if __loaded_compounds__["small_molecules"]: return comps = get_default_compounds() small_molecules = read_compounds( defaults.DEFAULT_PDBE_COMPONENT_FILES["small_molecules"], set_default=False ) warnings.filterwarnings("ignore") comps.merge(small_molecules) warnings.filterwarnings("default") __loaded_compounds__["small_molecules"] = True
[docs] def unload_small_molecules(): """ Unload small molecule components from the default PDBECompounds instance. """ if not __loaded_compounds__["small_molecules"]: return comps = get_default_compounds() ids = [i for i in comps.ids if comps._compounds[i]["type"] == "SMALL-MOLECULE"] for i in ids: comps.remove(i) __loaded_compounds__["small_molecules"] = False
[docs] def load_all_compounds(): """ Load all available components into the default PDBECompounds instance. """ load_amino_acids() load_lipids() load_sugars() load_nucleotides() load_small_molecules()
[docs] def unload_all_compounds(): """ Unload all available components from the default PDBECompounds instance. """ unload_amino_acids() unload_lipids() unload_sugars() unload_nucleotides() unload_small_molecules()
[docs] class PDBECompounds: """ This class wraps a dictionary of PDBE compounds. It facilitates easy data access and searchability for compounds. Parameters ---------- compounds : dict A dictionary of PDBE compounds. id : str, optional The ID of the PDBECompounds object. Defaults to None. """ def __init__(self, compounds: dict = None, id=None) -> None: self.id = id if not compounds: compounds = {} self._compounds = {k: None for k in compounds.keys()} self._pdb = dict(self._compounds) self._setup_dictionaries(compounds) self._filename = None self._orig = compounds
[docs] @classmethod def from_file(cls, filename: str) -> "PDBECompounds": """ Create a PDBECompounds object from a `cif` file. Parameters ---------- filename : str The path to the file. Returns ------- PDBECompounds The PDBECompounds object. """ _dict = mmcif_tools.MMCIF2Dict() _dict = _dict.parse(filename, ignoreCategories=__to_ignore__) # Make sure all required fields are present _keys = set(_dict.keys()) for key in _keys: v = _dict[key] if not all(i in v.keys() for i in __needs_to_have__): del _dict[key] warnings.warn( f"Compound '{key}' does not have all required fields. It will be ignored." ) new = cls(_dict) new._filename = filename return new
[docs] @classmethod def load(self, filename: str) -> "PDBECompounds": """ Load a PDBECompounds object from a `pickle` file. Parameters ---------- filename : str The path to the file. Returns ------- PDBECompounds The PDBECompounds object. """ with open(filename, "rb") as f: return pickle.load(f)
[docs] @classmethod def from_json(cls, filename: str) -> "PDBeCompounds": """ Make a PDBECompounds object from a previously exported `json` file. Parameters ---------- filename : str The path to the file. Returns ------- PDBECompounds The PDBECompounds object. """ _dict = json.read(filename) new = cls._from_dict(_dict) new._filename = filename return new
[docs] @classmethod def from_xml(cls, filename: str) -> "PDBECompounds": """ Make a PDBECompounds object from an `xml` file. Parameters ---------- filename : str The path to the file. Returns ------- PDBECompounds The PDBECompounds object. """ xml = utils.xml.read_xml(filename) new = cls({}, id=xml["id"]) for compound in xml["metadata"]: comp_dict = { "id": compound["id"], "name": compound["name"], "type": compound["type"], "formula": compound["formula"], "one_letter_code": ( None if compound["oneletter"] == "None" else compound["oneletter"] ), "three_letter_code": ( None if compound["threeletter"] == "None" else compound["threeletter"] ), "names": [i["value"] for i in compound["names"]], "descriptors": [i["value"] for i in compound["descriptors"]], } new._compounds[compound["id"]] = comp_dict for compound in xml["pdbdata"]: pdb_dict = { "id": compound["id"], "residues": {"serials": [], "names": []}, "atoms": { "serials": [], "ids": [], "coords": [], "elements": [], "charges": [], "residue": [], }, "bonds": {"bonds": [], "parents": [], "orders": []}, } for residue in compound["residues"]: pdb_dict["residues"]["serials"].append(int(residue["serial"])) pdb_dict["residues"]["names"].append(residue["name"]) for atom in compound["atoms"]: pdb_dict["atoms"]["serials"].append(int(atom["serial"])) pdb_dict["atoms"]["ids"].append(atom["id"]) pdb_dict["atoms"]["elements"].append(atom["element"]) pdb_dict["atoms"]["charges"].append(eval(atom["charge"])) pdb_dict["atoms"]["residue"].append(int(atom["residue"])) coord = [ float(atom["x"]), float(atom["y"]), float(atom["z"]), ] pdb_dict["atoms"]["coords"].append(coord) for bond in compound["bonds"]: pdb_dict["bonds"]["bonds"].append((bond["atom1"], bond["atom2"])) pdb_dict["bonds"]["parents"].append( (int(bond["residue1"]), int(bond["residue1"])) ) pdb_dict["bonds"]["orders"].append(bond["order"]) pdb_dict["atoms"]["coords"] = np.array(pdb_dict["atoms"]["coords"]) pdb_dict["atoms"]["full_ids"] = pdb_dict["atoms"]["ids"] new._pdb[compound["id"]] = pdb_dict return new
@classmethod def _from_dict(cls, _dict: dict) -> "PDBECompounds": """ Make a PDBECompounds object from a dictionary. Parameters ---------- _dict : dict The dictionary to use. Returns ------- PDBECompounds The PDBECompounds object. """ new = cls({}, id=_dict["id"]) for compound in _dict["compounds"]: comp_dict = { "id": compound["id"], "name": compound["names"][0], "type": compound["type"], "formula": compound["formula"], "one_letter_code": compound["one_letter_code"], "three_letter_code": compound["three_letter_code"], "names": compound["names"], "descriptors": compound["identifiers"], } mol = Molecule._from_dict(compound) pdb_dict = _molecule_to_pdbx_dict(mol) new._compounds[compound["id"]] = comp_dict new._pdb[compound["id"]] = pdb_dict return new @property def ids(self) -> list: """ Get a list of all compound ids. Returns ------- list A list of all compound ids. """ return list(self._compounds.keys()) @property def formulas(self) -> list: """ Get a list of all compound formulas. Returns ------- list A list of all compound formulas. """ return [i["formula"] for i in self._compounds.values()] @property def molecules(self) -> list: """ Get a list of all compound molecules. Returns ------- list A list of all compound molecules. """ return [i for i in self.iter_molecules()]
[docs] def iter_molecules(self): """ Iterate over all compound molecules. Returns ------- iterator An iterator over all compound molecules. """ for i in self.ids: _dict = self._get(i, "id") _dict = list(_dict.values())[0] yield self._molecule(_dict)
[docs] def save(self, filename: str = None) -> None: """ Save the PDBECompounds object to a `pickle` file. Parameters ---------- filename : str The path to the file. By default the same file used to load the object is used. If no file is available, a `ValueError` is raised. """ if filename is None: if self._filename is None: raise ValueError("No filename specified.") filename = self._filename if not filename.endswith(".pkl"): filename = aux.change_suffix(filename, ".pkl") self._filename = filename with open(filename, "wb") as f: pickle.dump(self, f)
[docs] def to_json(self, filename: str = None) -> None: """ Export the PDBECompounds object to a `json` file. Parameters ---------- filename : str The path to the file. By default the same file used to load the object is used. If no file is available, a `ValueError` is raised. """ if filename is None: if self._filename is None: raise ValueError("No filename specified.") filename = self._filename if not filename.endswith(".json"): filename = aux.change_suffix(filename, ".json") json.write_pdbe_compounds(self, filename)
[docs] def to_xml(self, filename: str = None) -> None: """ Export the PDBECompounds object to an `xml` file. Parameters ---------- filename : str The path to the file. By default the same file used to load the object is used. If no file is available, a `ValueError` is raised. """ if filename is None: if self._filename is None: raise ValueError("No filename specified.") filename = self._filename if not filename.endswith(".xml"): filename = aux.change_suffix(filename, ".xml") xml = utils.xml.encode_pdbe_compounds(self) utils.xml.write_xml(filename, xml)
[docs] def merge(self, other: "PDBECompounds") -> None: """ Merge another compounds dictionary into this one. Parameters ---------- other : PDBECompounds The other object. """ for key in other._compounds.keys(): if key in self._compounds: warnings.warn( f"Compound '{key}' already present. It will be overwritten." ) self._compounds[key] = other._compounds[key] self._pdb[key] = other._pdb[key]
[docs] def get( self, query: str, by: str = "id", return_type: str = "molecule", ): """ Get a compound that matches the given criteria. Parameters ---------- query : str The query to search for. by : str, optional The type of query, by default "id". Possible values are: - "id": search by compound id - "name": search by compound name (must match any available synonym exactly) - "formula": search by compound formula - "smiles": search by compound smiles (this also works for inchi) return_type : str, optional The type of object to return, by default "molecule". Possible values are: - "molecule": return a BuildAMol `Molecule` object - "dict": return a dictionary of the compound data - "structure": return a biopython `Structure` object - "residue": return a biopython `Residue` object Returns ------- object The object that matches the given criteria, or None if no match was found. If multiple matches are found, a list of objects is returned. """ _dict = self._get(query, by) if len(_dict.keys()) == 0: return None elif len(_dict.keys()) > 1: return [self.get(i, "id", return_type) for i in _dict.keys()] _dict = list(_dict.values())[0] if return_type == "molecule": return self._molecule(_dict) elif return_type == "dict": return _dict elif return_type == "structure": return self._make_structure(_dict) elif return_type == "residue": chain = base_classes.Chain("A") res = next(self._make_residues(_dict)) chain.add(res) self._fill_residues(chain, _dict) return res else: raise ValueError(f"Invalid return_type '{return_type}'.")
[docs] def add( self, mol: Molecule, type: str = None, names: list = None, identifiers: list = None, formula: str = None, one_letter_code: str = None, three_letter_code: str = None, ): """ Add a compound to the dictionary. Parameters ---------- mol : Molecule The compound to add. type : str, optional The type of compound, e.g. "ligand", "cofactor", etc. names : list, optional A list of names for the compound. identifiers : list, optional A list of identifiers such as SMILES or InChI for the compound. formula : str, optional The formula of the compound. If not given, it will be calculated from the molecule. """ if names is None: names = [] if identifiers is None: identifiers = [] if formula is None: formula = aux.make_formula(mol.get_atoms()) # get the residue comp = { "id": mol.id, "names": set(i.lower() for i in names), "formula": formula, "descriptors": identifiers, "type": type, "one_letter_code": one_letter_code, "three_letter_code": three_letter_code, } comp["names"].add(mol.id.lower()) self._compounds[mol.id] = comp pdb = _molecule_to_pdbx_dict(mol) self._pdb[mol.id] = pdb
[docs] def remove(self, id: str) -> None: """ Remove a compound from the dictionary. Parameters ---------- id : str The id of the compound to remove. """ self._compounds.pop(id, None) self._pdb.pop(id, None)
[docs] def has_residue(self, query: str, by: str = "id") -> bool: """ Check if a compound has a residue definition. Parameters ---------- query : str The query to search for. by : str, optional The type of query, by default "id". Possible values are: - "id": search by compound id - "name": search by compound name (must match any available synonym exactly) - "formula": search by compound formula - "smiles": search by compound smiles (this also works for inchi) Returns ------- bool True if the compound has a residue definition, False otherwise. """ _dict = self._get(query, by) return len(_dict.keys()) > 0
[docs] def translate_ids_3_to_1(self, ids: list, ignore_unknown: bool = False) -> list: """ Translate a list of 3-letter compound ids to 1-letter ids. Any ids that cannot be translated are returned as "X". Parameters ---------- ids : list A list of 3-letter compound ids. ignore_unknown: bool If set to True, unknown ids are ignored and not replaced with "X". Returns ------- list A list of 1-letter compound ids. """ new = [] for i in ids: i = self.get(i, "id", "dict") if not i: if ignore_unknown: continue new.append("X") continue i = i.get("one_letter_code", None) if i is None: if ignore_unknown: continue i = "X" new.append(i) return new
[docs] def translate_ids_1_to_3(self, ids: list, ignore_unknown: bool = False) -> list: """ Translate a list of 1-letter compound ids to 3-letter ids. Any unknown ids are replaced with "XXX". Parameters ---------- ids : list A list of 1-letter compound ids. ignore_unknown: bool If set to True, unknown ids are ignored and not replaced with "XXX". Returns ------- list A list of 3-letter compound ids. """ new = [] for i in ids: i = self.get(i, "id", "dict") if i is None: if ignore_unknown: continue new.append("XXX") continue i = i.get("three_letter_code", None) if i is None: if ignore_unknown: continue i = "XXX" new.append(i) return new
[docs] def relabel_atoms(self, structure): """ Relabel the atoms of a `Molecule` object to match the atom names of the given compound. Parameters ---------- structure A `Molecule` object or biopython object holding at least one residue. Returns ------- object The object with relabeled atoms. """ imposer = SVDSuperimposer.SVDSuperimposer() for residue in structure.get_residues(): # get a reference name = residue.resname ref = self.get(name, "id", "residue") if ref is None: warnings.warn(f"Could not find residue '{name}' in PDBECompounds.") continue residue_coords = np.asarray([atom.coord for atom in residue.get_atoms()]) ref_coords = np.asarray([atom.coord for atom in ref.get_atoms()]) imposer.set(ref_coords, residue_coords) imposer.run() # get the residue coordinates new_coords = imposer.get_transformed() ref_atom_ids = [atom.id for atom in ref.get_atoms()] for idx, atom in enumerate(list(residue.get_atoms())): _new = new_coords[idx] min_idx = np.argmin(np.linalg.norm(ref_coords - _new, axis=1)) atom.id = ref_atom_ids[min_idx] return structure
def _get(self, q: str, by: str = "id"): """ Get a dictionary of compounds that match the given criteria. Parameters ---------- q : str The query to search for. by : str, optional The type of query, by default "id". Possible values are: - "id": search by compound id - "name": search by compound name (must match any available synonym exactly) - "formula": search by compound formula - "smiles": search by compound smiles (this also works for inchi) Returns ------- dict A dictionary of compounds that match the given criteria. """ if by == "id": _q = self._compounds.get(q) if _q is None: return {} return {q: _q} elif by == "name": return {k: v for k, v in self._compounds.items() if q.lower() in v["names"]} elif by == "formula": return { k: v for k, v in self._compounds.items() if q.upper().replace(" ", "") == v["formula"] } elif by == "smiles": return {k: v for k, v in self._compounds.items() if q in v["descriptors"]} else: raise ValueError(f"Invalid search type: {by}") def _setup_dictionaries(self, data_dict): """ Fill in the dictionaries with the appropriate data Parameters ---------- data_dict : dict A dictionary of data from the cif file. """ for key, value in data_dict.items(): comp = value["_chem_comp"] for k in __to_delete__["_chem_comp"]: if k in comp: comp.pop(k) for k in __to_float__["_chem_comp"]: if k in comp: comp[k] = float(comp[k]) for k in comp: if comp[k] == "?": comp[k] = None comp["descriptors"] = value["_pdbx_chem_comp_descriptor"]["descriptor"] comp["names"] = [comp["name"]] synonyms = value.get("_pdbx_chem_comp_synonyms", None) if synonyms: if not isinstance(synonyms["name"], list): synonyms["name"] = [synonyms["name"]] comp["names"].extend(synonyms["name"]) identifiers = value.get("_pdbx_chem_comp_identifier", None) if identifiers: if not isinstance(identifiers["identifier"], list): identifiers["identifier"] = [identifiers["identifier"]] comp["names"].extend(identifiers["identifier"]) comp["names"] = set(i.lower() for i in comp["names"]) if comp["formula"] is not None: comp["formula"] = comp["formula"].replace(" ", "") self._compounds[key] = comp atoms = value["_chem_comp_atom"] bonds = value["_chem_comp_bond"] atoms["pdbx_model_Cartn_x_ideal"] = [ i.replace("?", "NaN") for i in atoms["pdbx_model_Cartn_x_ideal"] ] atoms["pdbx_model_Cartn_y_ideal"] = [ i.replace("?", "NaN") for i in atoms["pdbx_model_Cartn_y_ideal"] ] atoms["pdbx_model_Cartn_z_ideal"] = [ i.replace("?", "NaN") for i in atoms["pdbx_model_Cartn_z_ideal"] ] atoms["charge"] = [i.replace("?", "NaN") for i in atoms["charge"]] pdb = { "atoms": { "full_ids": [i.replace(",", "'") for i in atoms["atom_id"]], "ids": [ i.replace(",", "'") for i in atoms["pdbx_component_atom_id"] ], "serials": np.array(atoms["pdbx_ordinal"], dtype=int), "coords": np.array( [ (float(i), float(j), float(k)) for i, j, k in zip( atoms["pdbx_model_Cartn_x_ideal"], atoms["pdbx_model_Cartn_y_ideal"], atoms["pdbx_model_Cartn_z_ideal"], ) ] ), "elements": atoms["type_symbol"], "charges": np.array(atoms["charge"], dtype=float), # ---------------------- FUTURE UPDATE ---------------------- # support multi-residue molecules # we need a proper parsing way to extract the residue information # from the pdbx mmcif files... # ---------------------- FUTURE UPDATE ---------------------- "residue": [ 1 for i in atoms["type_symbol"] ], # atoms["pdbx_component_comp_id"], }, "bonds": { "bonds": [ (a.replace(",", "'"), b.replace(",", "'")) for a, b in zip(bonds["atom_id_1"], bonds["atom_id_2"]) ], "parents": [(1, 1) for i in bonds["atom_id_1"]], "orders": np.array(bonds["value_order"]), }, "residues": { "serials": [1], "names": [atoms["pdbx_component_comp_id"][0]], }, } self._pdb[key] = pdb def _molecule(self, compound: dict) -> Molecule: """ Make a BuildAMol Molecule from a compound. Parameters ---------- compound : dict A dictionary of a compound. Returns ------- Molecule A BuildAMol Molecule. """ pdb = self._pdb[compound["id"]] struct = self._make_structure(compound) mol = Molecule(struct) for bdx in range(len(pdb["bonds"]["bonds"])): a, b = pdb["bonds"]["bonds"][bdx] res_a, res_b = pdb["bonds"]["parents"][bdx] a = mol.get_atom(a, residue=res_a) b = mol.get_atom(b, residue=res_b) order = pdb["bonds"]["orders"][bdx] order = _bond_order_rev_map.get(order) mol.add_bond(a, b, order) # for _ in range(_bond_order_rev_map.get(pdb["bonds"]["orders"][bdx])): # mol.add_bond(a, b) return mol def _make_structure(self, compound: dict) -> "bio.PDB.Structure": """ Make a structure from a compound. Parameters ---------- compound : dict A dictionary of a compound. Returns ------- buildamol.Structure A structure. """ if len(compound) == 0: return None struct = base_classes.Structure(compound["id"]) model = base_classes.Model(0) struct.add(model) chain = base_classes.Chain("A") model.add(chain) for res in self._make_residues(compound): chain.add(res) # add atoms to residue (at the end to ensure that the residue's full id is propagated) self._fill_residues(chain, compound) return struct def _fill_residues(self, chain, compound: dict): """ Fill a residue with atoms from a compound. Parameters ---------- chain : buildamol.Chain A BuildAMol Chain with the residues to fill. compound : dict A dictionary of a compound. """ pdb = self._pdb.get(compound["id"], None) if pdb is None: raise ValueError("No pdb data for compound.") atoms = pdb["atoms"] for i in range(len(atoms["ids"])): res = atoms["residue"][i] res = _residue_from_chain(res, chain) if res is None: res = chain.child_list[0] atom = base_classes.Atom( atoms["full_ids"][i], np.array(atoms["coords"][i]), atoms["serials"][i], fullname=atoms["ids"][i], element=atoms["elements"][i], pqr_charge=atoms["charges"][i], ) res.add(atom) def _make_residues(self, compound: dict) -> "Residue": """ Make a residue from a compound. Parameters ---------- compound : dict A dictionary of a compound. Yields ------- Residue A residue. """ if len(compound) == 0: return None pdb = self._pdb[compound["id"]] residues = pdb["residues"] for serial, name in zip(residues["serials"], residues["names"]): res = base_classes.Residue(name, " ", serial) yield res def __getitem__(self, key): return self._compounds[key], self._pdb[key] def __len__(self): return len(self._compounds) def __iter__(self): for key in self._compounds: yield key, self._compounds[key], self._pdb[key]
defaults.set_default_instance( "PDBECompounds", read_compounds(defaults.DEFAULT_PDBE_COMPONENT_FILES["base"], set_default=False), ) def _residue_from_chain(idx, chain): """ Get a residue from a chain """ return next((i for i in chain.child_list if i.id[1] == idx), None) def _molecule_to_pdbx_dict(mol): """ Make a pdbx dictionary from a molecule. """ _bond_dict = {bond: bond.order for bond in mol.get_bonds()} pdb = { "atoms": { "full_ids": [i.id.replace(",", "'") for i in mol.get_atoms()], "ids": [i.id.replace(",", "'") for i in mol.get_atoms()], "serials": [i.serial_number for i in mol.get_atoms()], "coords": np.array( [ (float(i), float(j), float(k)) for i, j, k in zip( (i.coord[0] for i in mol.get_atoms()), (i.coord[1] for i in mol.get_atoms()), (i.coord[2] for i in mol.get_atoms()), ) ] ), "elements": [i.element.title() for i in mol.get_atoms()], "charges": [i.pqr_charge for i in mol.get_atoms()], "residue": [i.parent.id[1] for i in mol.get_atoms()], }, "bonds": { "bonds": [ (a.id.replace(",", "'"), b.id.replace(",", "'")) for a, b in _bond_dict.keys() ], "parents": [ (a.get_parent().id[1], b.get_parent().id[1]) for a, b in _bond_dict.keys() ], "orders": [_bond_order_map.get(i) for i in _bond_dict.values()], }, "residues": { "serials": [i.id[1] for i in mol.get_residues()], "names": [i.resname for i in mol.get_residues()], }, } return pdb __all__ = [ "PDBECompounds", "set_default_compounds", "get_default_compounds", "restore_default_compounds", "get_compound", "has_compound", "add_compound", "read_compounds", "export_compounds", "save_compounds", "load_amino_acids", "load_nucleotides", "load_sugars", "load_lipids", "load_small_molecules", "load_all_compounds", "unload_amino_acids", "unload_nucleotides", "unload_sugars", "unload_lipids", "unload_small_molecules", "unload_all_compounds", "save_as_default_compounds", "unload_by_types", "subset_compounds_by_types", "_molecule_to_pdbx_dict", ] if __name__ == "__main__": f = "support/pdbe_compounds/test.cif" compounds = PDBECompounds.from_file(f) compounds.to_json("comps.json") c2 = PDBECompounds.from_json("comps.json") pass # from buildamol.core import Molecule # real = Molecule.from_compound("GLC") # scrambled = Molecule.from_compound("GLC") # counts = {"C": 0, "H": 0, "O": 0, "N": 0, "S": 0, "P": 0} # for atom in scrambled.atoms: # counts[atom.element] += 1 # atom.id = atom.element + str(counts[atom.element]) # print(scrambled.atoms) # compounds.relabel_atoms(scrambled) # print(scrambled.atoms) # from buildamol.utils.visual import MoleculeViewer3D # v = MoleculeViewer3D(scrambled) # v.show()