Advanced Selections and Analysis¶
There are lots of tools in ampal
for analysing protein structure,
most of them are baked into the AMPAL objects themselves.
Let’s load-up a PDB file:
import ampal
my_protein_3qy1 = ampal.load_pdb('3qy1.pdb')
1. Selecting All Residues or Atoms¶
Sometimes it’s convinient to select all of the Residues
or Atoms
in an Assembly
or Polypeptide
object:
my_protein_3qy1.get_monomers()
<itertools.chain at 0x7fbf649db588>
As you can see an itertools object is returned. This might be slightly
confusing as you might expect a list, but this is what’s known as an
iterator
, you can loop over it like a list or string, but if you
want to use it repeatedly or examine its contents you’ll need to convert
it to a list. The advantage of returning an iterator
is that it’s
much more memory efficient, and these lists could potentially be very
large. If you’d like to know more about iterables
and iterators
,
as well as related objects called generators
, see the following link
(this is quite advanced Python and is not essential for using any of the
AMPAL framework: Iterators and
Generators.
list(my_protein_3qy1.get_atoms())[:20] # Showing the first 20 elements for clarity
[<Nitrogen Atom (N). Coordinates: (14.714, -30.168, -26.423)>,
<Carbon Atom (CA). Coordinates: (15.518, -30.153, -25.207)>,
<Carbon Atom (C). Coordinates: (16.111, -28.769, -24.931)>,
<Oxygen Atom (O). Coordinates: (15.960, -27.855, -25.734)>,
<Carbon Atom (CB). Coordinates: (16.613, -31.220, -25.270)>,
<Carbon Atom (CG). Coordinates: (16.067, -32.624, -25.153)>,
<Oxygen Atom (OD1). Coordinates: (14.899, -32.777, -24.743)>,
<Oxygen Atom (OD2). Coordinates: (16.807, -33.576, -25.474)>,
<Nitrogen Atom (N). Coordinates: (16.782, -28.637, -23.789)>,
<Carbon Atom (CA). Coordinates: (17.360, -27.364, -23.339)>,
<Carbon Atom (C). Coordinates: (18.466, -26.876, -24.299)>,
<Oxygen Atom (O). Coordinates: (18.513, -25.674, -24.586)>,
<Carbon Atom (CB). Coordinates: (17.842, -27.509, -21.862)>,
<Carbon Atom (CG1). Coordinates: (16.643, -27.442, -20.889)>,
<Carbon Atom (CG2). Coordinates: (18.956, -26.516, -21.456)>,
<Carbon Atom (CD1). Coordinates: (15.873, -26.033, -20.777)>,
<Nitrogen Atom (N). Coordinates: (19.310, -27.788, -24.836)>,
<Carbon Atom (CA). Coordinates: (20.370, -27.411, -25.786)>,
<Carbon Atom (C). Coordinates: (19.801, -26.707, -27.030)>,
<Oxygen Atom (O). Coordinates: (20.438, -25.781, -27.539)>]
list(my_protein_3qy1.get_monomers())[:20] # Showing the first 20 elements for clarity
[<Residue containing 8 Atoms. Residue code: ASP>,
<Residue containing 8 Atoms. Residue code: ILE>,
<Residue containing 8 Atoms. Residue code: ASP>,
<Residue containing 7 Atoms. Residue code: THR>,
<Residue containing 8 Atoms. Residue code: LEU>,
<Residue containing 8 Atoms. Residue code: ILE>,
<Residue containing 6 Atoms. Residue code: SER>,
<Residue containing 8 Atoms. Residue code: ASN>,
<Residue containing 8 Atoms. Residue code: ASN>,
<Residue containing 5 Atoms. Residue code: ALA>,
<Residue containing 8 Atoms. Residue code: LEU>,
<Residue containing 14 Atoms. Residue code: TRP>,
<Residue containing 6 Atoms. Residue code: SER>,
<Residue containing 9 Atoms. Residue code: LYS>,
<Residue containing 8 Atoms. Residue code: MET>,
<Residue containing 8 Atoms. Residue code: LEU>,
<Residue containing 7 Atoms. Residue code: VAL>,
<Residue containing 9 Atoms. Residue code: GLU>,
<Residue containing 9 Atoms. Residue code: GLU>,
<Residue containing 8 Atoms. Residue code: ASP>]
2. Analysing Composition¶
We can easily look at the composition of sequences and structures using
a Counter
object. Counter
can be fed any iterable (lists
and
strings
are the most commonly used) and will count the occurence of
each element inside. We can start by looking at the composition of amino
acids in a sequence:
from collections import Counter
my_protein_3qy1['A'].sequence
'DIDTLISNNALWSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERLTGLEPGELFVHRNVANLVIHTDLNCLSVVQYAVDVLEVEHIIICGHSGCGGIKAAVENPELGLINNWLLHIRDIWLKHSSLLGKMPEEQRLDALYELNVMEQVYNLGHSTIMQSAWKRGQNVTIHGWAYSINDGLLRDLDVTATNRETLENGYHKGISALSLKYI'
Counter(my_protein_3qy1['A'].sequence)
Counter({'D': 11,
'I': 16,
'T': 8,
'L': 29,
'S': 13,
'N': 14,
'A': 13,
'W': 6,
'K': 9,
'M': 4,
'V': 15,
'E': 16,
'P': 6,
'G': 16,
'F': 4,
'Q': 7,
'R': 9,
'C': 4,
'H': 9,
'Y': 6})
But as stated before, you can use Counters on any iterable, not just strings. Let’s make a list of all the pdb molecule codes of the ligands:
my_ligands = my_protein_3qy1.get_ligands()
my_ligands[0]
<Ligand containing 1 Atom. Ligand code: ZN>
my_ligands[0].mol_code # This can be used to find the pdb molecule code of any residue
'ZN'
There are two ways to generate a list of the codes, a for
loop or a
list
comprehension, use whichever you are comfortable with. If you’d
like to know more about list
comprehensions, please see the
following link (this is relatively advanced Python but while not
essential for using any of the AMPAL framework, it is very useful: List
Comprehensions
(scroll down to the relevant section).
# With a for loop
mol_codes_1 = []
for lig in my_ligands:
mol_codes_1.append(lig.mol_code)
mol_codes_1[:5] # The first 5 elements
['ZN', 'HOH', 'HOH', 'HOH', 'HOH']
# A list comprehension
mol_codes_2 = [lig.mol_code for lig in my_ligands]
mol_codes_2[:5] # Showing the first 5 elements for clarity
['ZN', 'HOH', 'HOH', 'HOH', 'HOH']
You can use either of these methods, use whichever one you’re more comfortable with.
mol_codes_1 == mol_codes_2 # The lists that re produced are exactly the same
True
Now the list
of mol codes can be used to make a Counter
object:
Counter(mol_codes_1)
Counter({'ZN': 2, 'HOH': 447})
As you can see, there are 447 water molecules and 2 zinc ions.
3. Distance Analysis¶
Now we can select all atoms in the protein and understand the structures composition, we can perform some simple analysis. Let’s try and find all the residues that are close to the zinc ions.
zinc_1 = my_ligands[0]
zinc_1
<Ligand containing 1 Atom. Ligand code: ZN>
All Ligand
objects are Monomers
, even if they only contain a
single atom. So we use the zinc Atom
itself when measuring
distances:
zinc_1['ZN']
<Zinc Atom (ZN). Coordinates: (-5.817, -20.172, -18.798)>
Measuring distances is simple, you can use the
ampal.geometry.distance
function. It takes two 3D vectors as an
input, these can be in list form, tuples, or even Atom
objects:
ampal.geometry.distance(zinc_1['ZN'], (0, 0, 0)) # Distance from the origin
28.179990720367528
first_ca = my_protein_3qy1['A'][0]['CA'] # CA of the first residue in chain A
first_ca
<Carbon Atom (CA). Coordinates: (15.518, -30.153, -25.207)>
ampal.geometry.distance(zinc_1['ZN'], first_ca) # Distance in angstroms
24.41060972200408
Now we need to loop over all the atoms and find which are close (<= 3 Å) to the zinc. We can use the distance function in geometry to do this:
atoms_close_to_zinc = []
for at in my_protein_3qy1.get_atoms():
if ampal.geometry.distance(zinc_1['ZN'], at) <= 3.0:
atoms_close_to_zinc.append(at)
atoms_close_to_zinc
[<Sulfur Atom (SG). Coordinates: (-4.322, -18.933, -17.640)>,
<Oxygen Atom (OD2). Coordinates: (-4.771, -22.057, -19.213)>,
<Nitrogen Atom (NE2). Coordinates: (-6.209, -19.569, -20.787)>,
<Sulfur Atom (SG). Coordinates: (-7.753, -20.619, -17.709)>,
<Zinc Atom (ZN). Coordinates: (-5.817, -20.172, -18.798)>]
There are 5 atoms within 3 Å of the zinc, including the zinc atom
itself. One way to get ignore this atom in the above block of code is to
use the ligands=False
flag in get_atoms()
:
atoms_close_to_zinc = []
for at in my_protein_3qy1.get_atoms(ligands=False):
if ampal.geometry.distance(zinc_1['ZN'], at) <= 3.0:
atoms_close_to_zinc.append(at)
atoms_close_to_zinc
[<Sulfur Atom (SG). Coordinates: (-4.322, -18.933, -17.640)>,
<Oxygen Atom (OD2). Coordinates: (-4.771, -22.057, -19.213)>,
<Nitrogen Atom (NE2). Coordinates: (-6.209, -19.569, -20.787)>,
<Sulfur Atom (SG). Coordinates: (-7.753, -20.619, -17.709)>]
Now there are 4 atoms within 3 Å of the zinc, 2 sulphur atoms, an oxygen and a nitrogen. Let’s find the residues that are coordinating the zinc:
atoms_close_to_zinc[0]
<Sulfur Atom (SG). Coordinates: (-4.322, -18.933, -17.640)>
atoms_close_to_zinc[0].parent
<Residue containing 6 Atoms. Residue code: CYS>
We can get all the residues using a for loop or a list comprehension:
residues_close_to_zinc = []
for at in atoms_close_to_zinc:
residues_close_to_zinc.append(at.parent)
residues_close_to_zinc
[<Residue containing 6 Atoms. Residue code: CYS>,
<Residue containing 8 Atoms. Residue code: ASP>,
<Residue containing 10 Atoms. Residue code: HIS>,
<Residue containing 6 Atoms. Residue code: CYS>]
# The list comprehension is much more concise
residues_close_to_zinc_2 = [at.parent for at in atoms_close_to_zinc]
residues_close_to_zinc_2
[<Residue containing 6 Atoms. Residue code: CYS>,
<Residue containing 8 Atoms. Residue code: ASP>,
<Residue containing 10 Atoms. Residue code: HIS>,
<Residue containing 6 Atoms. Residue code: CYS>]
It looks like the zinc is coordinated by two cysteines, an aspartate and a histidine residue.
4. Is Within¶
This kind of operation is very common when analysing proteins. So we
have some built-in methods for handling this on Assembly
and
Polymer
objects:
my_protein_3qy1.is_within(3, zinc_1['ZN']) # It takes a distance and a point
[<Sulfur Atom (SG). Coordinates: (-4.322, -18.933, -17.640)>,
<Oxygen Atom (OD2). Coordinates: (-4.771, -22.057, -19.213)>,
<Nitrogen Atom (NE2). Coordinates: (-6.209, -19.569, -20.787)>,
<Sulfur Atom (SG). Coordinates: (-7.753, -20.619, -17.709)>,
<Zinc Atom (ZN). Coordinates: (-5.817, -20.172, -18.798)>]
my_protein_3qy1.is_within(3, (-10, -10, -10))
[<Nitrogen Atom (N). Coordinates: (-7.278, -11.112, -10.445)>,
<Carbon Atom (C). Coordinates: (-7.525, -10.089, -8.339)>,
<Oxygen Atom (O). Coordinates: (-8.676, -9.655, -8.177)>,
<Carbon Atom (CA). Coordinates: (-11.364, -9.364, -12.255)>,
<Carbon Atom (CB). Coordinates: (-10.337, -10.223, -12.972)>,
<Carbon Atom (CD). Coordinates: (-10.752, -12.047, -11.188)>,
<Oxygen Atom (OE1). Coordinates: (-10.046, -11.618, -10.265)>,
<Nitrogen Atom (ND2). Coordinates: (-11.667, -9.798, -7.931)>]
my_protein_3qy1['A'].is_within(3, zinc_1['ZN'])
[<Sulfur Atom (SG). Coordinates: (-4.322, -18.933, -17.640)>,
<Oxygen Atom (OD2). Coordinates: (-4.771, -22.057, -19.213)>,
<Nitrogen Atom (NE2). Coordinates: (-6.209, -19.569, -20.787)>,
<Sulfur Atom (SG). Coordinates: (-7.753, -20.619, -17.709)>,
<Zinc Atom (ZN). Coordinates: (-5.817, -20.172, -18.798)>]
my_protein_3qy1['B'].is_within(3, zinc_1['ZN'])
# This list is empty as nothing on chain B is close to the zinc
[]
There is a partner method to is is_within
, every Monomer
(this
includes Residues
and Ligands
) has an close_monomers
method.
This returns all Monomers
within a given cutoff value.
zinc_1.close_monomers(my_protein_3qy1) # default cutoff is 4.0 Å
[<Residue containing 6 Atoms. Residue code: CYS>,
<Residue containing 8 Atoms. Residue code: ASP>,
<Residue containing 10 Atoms. Residue code: HIS>,
<Residue containing 6 Atoms. Residue code: CYS>,
<Ligand containing 1 Atom. Ligand code: ZN>]
zinc_1.close_monomers(my_protein_3qy1, cutoff=6)
[<Residue containing 6 Atoms. Residue code: CYS>,
<Residue containing 6 Atoms. Residue code: SER>,
<Residue containing 8 Atoms. Residue code: ASP>,
<Residue containing 6 Atoms. Residue code: SER>,
<Residue containing 11 Atoms. Residue code: ARG>,
<Residue containing 7 Atoms. Residue code: VAL>,
<Residue containing 5 Atoms. Residue code: ALA>,
<Residue containing 8 Atoms. Residue code: ASN>,
<Residue containing 10 Atoms. Residue code: HIS>,
<Residue containing 4 Atoms. Residue code: GLY>,
<Residue containing 6 Atoms. Residue code: CYS>,
<Residue containing 4 Atoms. Residue code: GLY>,
<Residue containing 4 Atoms. Residue code: GLY>,
<Residue containing 8 Atoms. Residue code: ILE>,
<Ligand containing 1 Atom. Ligand code: ZN>,
<Ligand containing 1 Atom. Ligand code: HOH>,
<Ligand containing 1 Atom. Ligand code: HOH>,
<Ligand containing 1 Atom. Ligand code: HOH>,
<Ligand containing 1 Atom. Ligand code: HOH>]
5. Geometry in AMPAL¶
There are a range of tools in AMPAL for performing geometric operations.
We’ve already covered distance, but other commonly used functions
include angle_between_vectors
, dihedral
, unit_vector
,
find_foot
, radius_of_circumcircle
. Be sure to check out the
source code if you need a specific geometric function or have a look
through the documentation.
The dihedral
function is probably the most useful of these for
analysing proteins, so let’s use it to measure some torsion angles. It
requires 4 3D vectors to calculate the dihedral, again these can be
lists
, tuples
, numpy.arrays
or Atoms
. Note: This
method of calculating torsion angles is only as an example, see the
Tagging tutorial for the proper, low-effort method!
r1 = my_protein_3qy1['B'][4]
r2 = my_protein_3qy1['B'][5]
r3 = my_protein_3qy1['B'][6]
omega = ampal.geometry.dihedral(r1['CA'], r1['C'], r2['N'], r2['CA'])
phi = ampal.geometry.dihedral(r1['C'], r2['N'], r2['CA'], r2['C'])
psi = ampal.geometry.dihedral(r2['N'], r2['CA'], r2['C'], r3['N'])
print(omega, phi, psi)
-179.8002001034783 -64.10861163046212 -45.84373968479124
We can use it to calculate the \(\chi\) torsion angles too. R2 is leucine, so we can calculate the \(\chi_1\) and \(\chi_2\) angles:
r2.atoms
OrderedDict([('N',
<Nitrogen Atom (N). Coordinates: (-5.186, -2.004, -31.807)>),
('CA',
<Carbon Atom (CA). Coordinates: (-4.911, -3.362, -31.310)>),
('C', <Carbon Atom (C). Coordinates: (-5.985, -4.346, -31.786)>),
('O', <Oxygen Atom (O). Coordinates: (-5.650, -5.434, -32.255)>),
('CB',
<Carbon Atom (CB). Coordinates: (-4.788, -3.418, -29.770)>),
('CG',
<Carbon Atom (CG). Coordinates: (-3.838, -2.437, -29.061)>),
('CD1',
<Carbon Atom (CD1). Coordinates: (-3.653, -2.831, -27.613)>),
('CD2',
<Carbon Atom (CD2). Coordinates: (-2.478, -2.359, -29.736)>)])
chi1 = ampal.geometry.dihedral(r2['N'], r2['CA'], r2['CB'], r2['CG'])
chi2 = ampal.geometry.dihedral(r2['CA'], r2['CB'], r2['CG'], r2['CD1'])
print(chi1, chi2)
-51.38040366349739 -168.70006206564494
Our simple analysis shows that the leucine residue is in the gauche-/trans conformation.
6. Summary and activities¶
There are lots of tools for making complex selections in ampal. Combined with the tools for geometry, detailed analysis can be performed on these selections.
Find all the residues that are:
within 5 Å of crystal water.
not within 5 Å of crystal water.
Find how many cis-peptide bonds there are in this structure.
Perform these activities on another PDB file.