In this tutorial we will cover:

  • which built-in resources are available

  • how to set your own default settings

Built-in resources#

biubuild has three built-in data resources: the CHARMM force field, the PDBE compound library, and PubChem (remotely queried).

flowchart TB
  node_1(("CHARMM"))
  node_2["pre-defined linkages"]
  node_3(("PDBE Compounds"))
  node_4["small molecules"]
  node_5(("PubChem"))
  node_7["amino acids"]
  node_8["sugars"]
  node_9["lipids"]
  node_6["any other available molecule"]
  node_10["nucleotides"]
  node_1 --> node_2
  node_3 --> node_4
  node_3 --> node_7
  node_3 --> node_8
  node_3 --> node_9
  node_5 --> node_6
  node_3 --> node_10

CHARMM Force Field#

In order to connect molecules together, the user may define their own Linkage by specifying which atoms to connect and which atoms to remove in the process. However, to make life easier, biobuild references the CHARMM force field which already specifies a number of linkage types - so-called patches. Each patch specifies the atoms to connect and remove as well as the internal coordinates around the newly formed bond. This allows biobuild to generate structures by pure matrix transformation as the resulting geometry is already specified.

We can check what linkages are available by default using:

[1]:
import plotly
plotly.offline.init_notebook_mode()
[2]:
import biobuild as bb

bb.available_linkages()
[2]:
[Linkage(SCK0),
 Linkage(SCK1),
 Linkage(LLLO),
 Linkage(CERA),
 Linkage(CERB),
 Linkage(DAGA),
 Linkage(DAGB),
 Linkage(INS2A),
 Linkage(INS2B),
 Linkage(INS6A),
 Linkage(INS6B),
 Linkage(SGPA),
 Linkage(TGPA),
 Linkage(SGPB),
 Linkage(TGPB),
 Linkage(NGLA),
 Linkage(11aa),
 Linkage(11ab),
 Linkage(11bb),
 Linkage(12aa),
 Linkage(12ab),
 Linkage(12ba),
 Linkage(12bb),
 Linkage(13aa),
 Linkage(13ab),
 Linkage(13ba),
 Linkage(13bb),
 Linkage(14aa),
 Linkage(14ab),
 Linkage(14ba),
 Linkage(14bb),
 Linkage(16aa),
 Linkage(16ab),
 Linkage(16ba),
 Linkage(SUCR),
 Linkage(LCTL),
 Linkage(AB15),
 Linkage(SA23AB),
 Linkage(LINK)]

Each linkage is identified by an ID within the CHARMM force field - e.g. 12aa stands for the 1->2 alpha glycosydic linkage. Each of the pre-defined available linkages can be referenced by their (string) id when connecting molecules together.

For example, we can connect two mannoses using a 12aa linkage by:

[6]:
man = bb.molecule("files/man.pdb")

# use pre-defined 12aa linkage
man2 = bb.connect(man, man, "12aa")
man2.show()
[7]:
# define a custom 1->2 glycosydic linkage
my_12aa = bb.linkage("O2", "C1", ["HO2"], ["O1", "HO1"], id="my_12aa")

# add the linkage to the library
bb.add_linkage(my_12aa)

bb.available_linkages()
[7]:
[Linkage(SCK0),
 Linkage(SCK1),
 Linkage(LLLO),
 Linkage(CERA),
 Linkage(CERB),
 Linkage(DAGA),
 Linkage(DAGB),
 Linkage(INS2A),
 Linkage(INS2B),
 Linkage(INS6A),
 Linkage(INS6B),
 Linkage(SGPA),
 Linkage(TGPA),
 Linkage(SGPB),
 Linkage(TGPB),
 Linkage(NGLA),
 Linkage(11aa),
 Linkage(11ab),
 Linkage(11bb),
 Linkage(12aa),
 Linkage(12ab),
 Linkage(12ba),
 Linkage(12bb),
 Linkage(13aa),
 Linkage(13ab),
 Linkage(13ba),
 Linkage(13bb),
 Linkage(14aa),
 Linkage(14ab),
 Linkage(14ba),
 Linkage(14bb),
 Linkage(16aa),
 Linkage(16ab),
 Linkage(16ba),
 Linkage(SUCR),
 Linkage(LCTL),
 Linkage(AB15),
 Linkage(SA23AB),
 Linkage(LINK),
 Linkage(my_12aa)]

If the set of available linkages is quite large and we want to check if a particular one is available, we can also use the has_linkage function to check if a linkage with a given id is pre-loaded in the current session.

[8]:
bb.has_linkage("my_12aa")
[8]:
True

The CHARMMTopology class#

The data from the CHARMM force field is handled by the CHARMMTopology class, which parses (you guessed it) a CHARMM topology file (not parameter file) and stores its data in a dictionary structure. Its purpose is to store linkages.

The default instance of this class can be accessed using the get_default_topology function. Why is this useful? Well, if there is a “get”-default topology function there may be a “set”-version as well (which is totally the case). If you have your own CHARMM topology file with defined linkages and molecules, you can read_topology to parse your own file, use set_default=True to make your topology the default, and thus tailor biobuild to your specific needs.

[9]:
# read a custom topology file to make a CHARMMTopology
# (but don't set it as the default topology)
my_top = bb.read_topology("files/my_top.top", set_default=False)

# check out the patches (linkages) in the topology
my_top.patches
[9]:
[Linkage(my_14bb), Linkage(my_16ab)]

If we want to use a non-default topology we can either specify the topology we want to use as an argument to functions and methods which accept a _topology argument, or we directly provide the linkage objects we obtain from the topology.

[10]:
# connect the two clucoses using the `my_16ab` linkage from my_top

my_man2 = bb.connect(man, man, "my_16ab", _topology=my_top)
# or
my_16ab = my_top.get_patch("my_16ab")
my_man2 = bb.connect(man, man, my_16ab)

my_man2.show()

PDBE Compounds#

biobuild maintains a part of the PDBE component library of small molecules, common sugar, lipid, nucleotides, and amino acid compounds and derivatives to directly obtain molecular structures while coding, without the need to download any pdb files externally. Molecules can be obtained from the library using their PDB ID, their names or some other available identifier such as SMILES.

To reduce memory load, biobuild loads by default an empty library which can then be populated according to the user’s needs using functions:

  • load_small_molecules()

  • load_sugars()

  • load_lipids()

  • load_amino_acids()

  • load_nucleotides()

  • or load_all_compounds() (to perform all of the above)

All of the above also have an unload_ equivalent to again remove the compounds from the currently loaded default compounds.

Of course, we can set a custom default library to load automatically using the set_default_compounds function (see below).

[11]:
# we are interested in working with sugars so we load the sugars
bb.load_sugars()

# now we are able to refer to sugar compounds we want to use directly by their pdb ID or name
# for example, we can connect two glucose molecules using the 1->6 linkage
glc = bb.molecule("GLC")
glc2 = bb.connect(glc, glc, "16ab")
glc2.show()

If we would like to make the molecule makery a bit quicker by removing the overhead of first figuring out what kind of input we provided, we can use the get_compound function instead of the molecule function - this also works with all sorts of inputs automatically. One step further would be the Molecule.from_compound classmethod which requires the user to specify which type of input they use:

[12]:
# get a galactose (slowest)
gal = bb.molecule("GAL")
# or (faster)
gal = bb.get_compound("GAL")
# or (fastest)
gal = bb.Molecule.from_compound("GAL", by="id")

Similar to the CHARMM topology, we can get the default instance of the PDBECompounds class that handles the databse using

[ ]:
comp = bb.get_default_compounds()
# print how many (currently sugar-only) compounds are available in the library
len(comp)
1069

Adding custom compounds#

If a user has built a specific molecule that they are going to use a lot and do not wish to save and load all the time from their own file, they can add custom molecules to the default compounds using the add_compound function.

Let’s say our current project revolves around our two-glucose molecule, we can add it to the default compounds to ensure that it will always be available in the future.

[13]:
glc2.id = "glc2" # we need to give the molecule a unique ID
bb.add_compound(glc2,
                type="my_project_basic_molecules", # we can add some descriptive tag here (optional)
                names=["di-glucose"] # provide some name aliases (optional)
                )

# now we can refer to the compound by our self-given name 'di-glucose'
bb.molecule("di-glucose").show()

Setting default compounds#

If we have a custom set of compounds that we are always using, we can set them to be the defaults that are loaded by biobuild in every session. We can do so using the set_default_compounds function, just like with the CHARMM topology. Using the option overwrite=True we can ensure that biobuild will load the currently set default PDBE compounds library automatically in every future session.

# get the currently loaded compounds (including our newly added one)
current_compounds = bb.get_default_compounds()

# save the current compounds as defaults for every future session
bb.set_default_compounds(current_compounds, overwrite=True)

# if we realize later that we would like to reset our defaults again
# we can use:
bb.restore_default_compounds()

PubChem maintains an enormous database of readily available 3D structures for molecules. They offer a handy API in the form of the pubchempy package, which is integrated into biobuild. In fact, any call to molecule will eventually end up calling on PubChem if no compound or file could be identified for the given inputs. Hence, to use PubChem the user needs not learn anything new. Alternatively, there is a classmethod Molecule.from_pubchem that can be used as well.

Let’s get an aspirin molecule

[ ]:
bb.has_compound("aspirin")
False

So, apparently, aspirin is not available from the loaded sugar compounds (no surprise there), but it is actually also not going to be part of any of the other pre-defined sets. So, we need to refer to pubchem to get it…

[14]:
aspirin = bb.Molecule.from_pubchem("aspirin")
aspirin.show()

And there it is! Again, using the molecule function would be all we need. The call to PubChem is automatically handled by biobuild if a reference to PDBECompounds does not yield any results.