Advanced Selections and Analysis ================================ There are lots of tools in ``ampal`` for analysing protein structure, most of them are baked into the AMPAL objects themselves. Let’s load-up a PDB file: .. code:: ipython3 import ampal my_protein_3qy1 = ampal.load_pdb('3qy1.pdb') 1. Selecting All Residues or Atoms ---------------------------------- Sometimes it’s convinient to select all of the ``Residues`` or ``Atoms`` in an ``Assembly`` or ``Polypeptide`` object: .. code:: ipython3 my_protein_3qy1.get_monomers() .. parsed-literal:: As you can see an itertools object is returned. This might be slightly confusing as you might expect a list, but this is what’s known as an ``iterator``, you can loop over it like a list or string, but if you want to use it repeatedly or examine its contents you’ll need to convert it to a list. The advantage of returning an ``iterator`` is that it’s much more memory efficient, and these lists could potentially be very large. If you’d like to know more about ``iterables`` and ``iterators``, as well as related objects called ``generators``, see the following link (this is quite advanced Python and is not essential for using any of the AMPAL framework: `Iterators and Generators `__. .. code:: ipython3 list(my_protein_3qy1.get_atoms())[:20] # Showing the first 20 elements for clarity .. parsed-literal:: [, , , , , , , , , , , , , , , , , , , ] .. code:: ipython3 list(my_protein_3qy1.get_monomers())[:20] # Showing the first 20 elements for clarity .. parsed-literal:: [, , , , , , , , , , , , , , , , , , , ] 2. Analysing Composition ------------------------ We can easily look at the composition of sequences and structures using a ``Counter`` object. ``Counter`` can be fed any iterable (``lists`` and ``strings`` are the most commonly used) and will count the occurence of each element inside. We can start by looking at the composition of amino acids in a sequence: .. code:: ipython3 from collections import Counter .. code:: ipython3 my_protein_3qy1['A'].sequence .. parsed-literal:: 'DIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGELFVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHSGCGGIKAAVENPELGLINNWLLHIRDIWLKHSSLLGKMPEEQRLDALYELNVMEQVYNLGHSTIMQSAWKRGQNVTIHGWAYSINDGLLRDLDVTATNRETLENGYHKGISALSLKYI' .. code:: ipython3 Counter(my_protein_3qy1['A'].sequence) .. parsed-literal:: Counter({'D': 11, 'I': 16, 'T': 8, 'L': 29, 'S': 13, 'N': 14, 'A': 13, 'W': 6, 'K': 9, 'M': 4, 'V': 15, 'E': 16, 'P': 6, 'G': 16, 'F': 4, 'Q': 7, 'R': 9, 'C': 4, 'H': 9, 'Y': 6}) But as stated before, you can use Counters on any iterable, not just strings. Let’s make a list of all the pdb molecule codes of the ligands: .. code:: ipython3 my_ligands = my_protein_3qy1.get_ligands() .. code:: ipython3 my_ligands[0] .. parsed-literal:: .. code:: ipython3 my_ligands[0].mol_code # This can be used to find the pdb molecule code of any residue .. parsed-literal:: 'ZN' There are two ways to generate a list of the codes, a ``for`` loop or a ``list`` comprehension, use whichever you are comfortable with. If you’d like to know more about ``list`` comprehensions, please see the following link (this is relatively advanced Python but while not essential for using any of the AMPAL framework, it is very useful: `List Comprehensions `__ (scroll down to the relevant section). .. code:: ipython3 # With a for loop mol_codes_1 = [] for lig in my_ligands: mol_codes_1.append(lig.mol_code) .. code:: ipython3 mol_codes_1[:5] # The first 5 elements .. parsed-literal:: ['ZN', 'HOH', 'HOH', 'HOH', 'HOH'] .. code:: ipython3 # A list comprehension mol_codes_2 = [lig.mol_code for lig in my_ligands] .. code:: ipython3 mol_codes_2[:5] # Showing the first 5 elements for clarity .. parsed-literal:: ['ZN', 'HOH', 'HOH', 'HOH', 'HOH'] You can use either of these methods, use whichever one you’re more comfortable with. .. code:: ipython3 mol_codes_1 == mol_codes_2 # The lists that re produced are exactly the same .. parsed-literal:: True Now the ``list`` of mol codes can be used to make a ``Counter`` object: .. code:: ipython3 Counter(mol_codes_1) .. parsed-literal:: Counter({'ZN': 2, 'HOH': 447}) As you can see, there are 447 water molecules and 2 zinc ions. 3. Distance Analysis -------------------- Now we can select all atoms in the protein and understand the structures composition, we can perform some simple analysis. Let’s try and find all the residues that are close to the zinc ions. .. code:: ipython3 zinc_1 = my_ligands[0] .. code:: ipython3 zinc_1 .. parsed-literal:: All ``Ligand`` objects are ``Monomers``, even if they only contain a single atom. So we use the zinc ``Atom`` itself when measuring distances: .. code:: ipython3 zinc_1['ZN'] .. parsed-literal:: Measuring distances is simple, you can use the ``ampal.geometry.distance`` function. It takes two 3D vectors as an input, these can be in list form, tuples, or even ``Atom`` objects: .. code:: ipython3 ampal.geometry.distance(zinc_1['ZN'], (0, 0, 0)) # Distance from the origin .. parsed-literal:: 28.179990720367528 .. code:: ipython3 first_ca = my_protein_3qy1['A'][0]['CA'] # CA of the first residue in chain A .. code:: ipython3 first_ca .. parsed-literal:: .. code:: ipython3 ampal.geometry.distance(zinc_1['ZN'], first_ca) # Distance in angstroms .. parsed-literal:: 24.41060972200408 Now we need to loop over all the atoms and find which are close (<= 3 Å) to the zinc. We can use the distance function in geometry to do this: .. code:: ipython3 atoms_close_to_zinc = [] for at in my_protein_3qy1.get_atoms(): if ampal.geometry.distance(zinc_1['ZN'], at) <= 3.0: atoms_close_to_zinc.append(at) .. code:: ipython3 atoms_close_to_zinc .. parsed-literal:: [, , , , ] There are 5 atoms within 3 Å of the zinc, *including* the zinc atom itself. One way to get ignore this atom in the above block of code is to use the ``ligands=False`` flag in ``get_atoms()``: .. code:: ipython3 atoms_close_to_zinc = [] for at in my_protein_3qy1.get_atoms(ligands=False): if ampal.geometry.distance(zinc_1['ZN'], at) <= 3.0: atoms_close_to_zinc.append(at) .. code:: ipython3 atoms_close_to_zinc .. parsed-literal:: [, , , ] Now there are 4 atoms within 3 Å of the zinc, 2 sulphur atoms, an oxygen and a nitrogen. Let’s find the residues that are coordinating the zinc: .. code:: ipython3 atoms_close_to_zinc[0] .. parsed-literal:: .. code:: ipython3 atoms_close_to_zinc[0].parent .. parsed-literal:: We can get all the residues using a for loop or a list comprehension: .. code:: ipython3 residues_close_to_zinc = [] for at in atoms_close_to_zinc: residues_close_to_zinc.append(at.parent) .. code:: ipython3 residues_close_to_zinc .. parsed-literal:: [, , , ] .. code:: ipython3 # The list comprehension is much more concise residues_close_to_zinc_2 = [at.parent for at in atoms_close_to_zinc] .. code:: ipython3 residues_close_to_zinc_2 .. parsed-literal:: [, , , ] It looks like the zinc is coordinated by two cysteines, an aspartate and a histidine residue. 4. Is Within ------------ This kind of operation is very common when analysing proteins. So we have some built-in methods for handling this on ``Assembly`` and ``Polymer`` objects: .. code:: ipython3 my_protein_3qy1.is_within(3, zinc_1['ZN']) # It takes a distance and a point .. parsed-literal:: [, , , , ] .. code:: ipython3 my_protein_3qy1.is_within(3, (-10, -10, -10)) .. parsed-literal:: [, , , , , , , ] .. code:: ipython3 my_protein_3qy1['A'].is_within(3, zinc_1['ZN']) .. parsed-literal:: [, , , , ] .. code:: ipython3 my_protein_3qy1['B'].is_within(3, zinc_1['ZN']) # This list is empty as nothing on chain B is close to the zinc .. parsed-literal:: [] There is a partner method to is ``is_within``, every ``Monomer`` (this includes ``Residues`` and ``Ligands``) has an ``close_monomers`` method. This returns all ``Monomers`` within a given cutoff value. .. code:: ipython3 zinc_1.close_monomers(my_protein_3qy1) # default cutoff is 4.0 Å .. parsed-literal:: [, , , , ] .. code:: ipython3 zinc_1.close_monomers(my_protein_3qy1, cutoff=6) .. parsed-literal:: [, , , , , , , , , , , , , , , , , , ] 5. Geometry in AMPAL -------------------- There are a range of tools in AMPAL for performing geometric operations. We’ve already covered distance, but other commonly used functions include ``angle_between_vectors``, ``dihedral``, ``unit_vector``, ``find_foot``, ``radius_of_circumcircle``. Be sure to check out the source code if you need a specific geometric function or have a look through the documentation. The ``dihedral`` function is probably the most useful of these for analysing proteins, so let’s use it to measure some torsion angles. It requires 4 3D vectors to calculate the dihedral, again these can be ``lists``, ``tuples``, ``numpy.arrays`` or ``Atoms``. **Note:** This method of calculating torsion angles is only as an example, see the Tagging tutorial for the proper, low-effort method! .. code:: ipython3 r1 = my_protein_3qy1['B'][4] r2 = my_protein_3qy1['B'][5] r3 = my_protein_3qy1['B'][6] .. code:: ipython3 omega = ampal.geometry.dihedral(r1['CA'], r1['C'], r2['N'], r2['CA']) .. code:: ipython3 phi = ampal.geometry.dihedral(r1['C'], r2['N'], r2['CA'], r2['C']) .. code:: ipython3 psi = ampal.geometry.dihedral(r2['N'], r2['CA'], r2['C'], r3['N']) .. code:: ipython3 print(omega, phi, psi) .. parsed-literal:: -179.8002001034783 -64.10861163046212 -45.84373968479124 We can use it to calculate the :math:`\chi` torsion angles too. R2 is leucine, so we can calculate the :math:`\chi_1` and :math:`\chi_2` angles: .. code:: ipython3 r2.atoms .. parsed-literal:: OrderedDict([('N', ), ('CA', ), ('C', ), ('O', ), ('CB', ), ('CG', ), ('CD1', ), ('CD2', )]) .. code:: ipython3 chi1 = ampal.geometry.dihedral(r2['N'], r2['CA'], r2['CB'], r2['CG']) .. code:: ipython3 chi2 = ampal.geometry.dihedral(r2['CA'], r2['CB'], r2['CG'], r2['CD1']) .. code:: ipython3 print(chi1, chi2) .. parsed-literal:: -51.38040366349739 -168.70006206564494 Our simple analysis shows that the leucine residue is in the gauche-/trans conformation. 6. Summary and activities ------------------------- There are lots of tools for making complex selections in ampal. Combined with the tools for geometry, detailed analysis can be performed on these selections. 1. Find all the residues that are: 1. within 5 Å of crystal water. 2. *not* within 5 Å of crystal water. 2. Find how many cis-peptide bonds there are in this structure. 3. Perform these activities on another PDB file.