- Service of hetero dimer modeling
- Service of Identifying Interacting Proteins
Residue-residue statistical contact energy for protein-protein interaction
Residue-residue statistical contact energies were originally developed for coarse-grained
models of protein folding and threading(1-3). Recently, similar approaches were applied for
evaluating protein-protein interaction(4,5). In this study, we employed a typical log-odds
formula to derive the contact energies. The statistical contact energy econ(a,b) between
contacting residues a and b in different polypeptide chains is estimated in the form

$$e_{con}(a,b) = -\ln \frac{Q(a,b)}{P(a)\,P(b)}$$

where P(a) and P(b) are the probabilities that amino acids
a and b appear on the surface, and
Q(a,b) is the probability that amino acids a and b
contact each other in the protein-protein
interface. Surface residues of a protein are defined as those residues whose relative
accessible surface areas (accessibilities) are 15% or larger. Contacting residue pairs are defined as pairs of
surface residues in different chains whose Cβ atoms are located within 7 Å of one another. Both
probabilities are estimated from the dataset used for calculating the contact energy.
If contacts between residues a and b are found in interfaces more often than
expected from their surface frequencies, the value of econ(a,b) is large and negative.
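As an illustration of the log-odds estimate described above, the following Python sketch derives contact energies from toy counts. The function name `contact_energies`, the input layout, the use of the natural logarithm, and the symmetric counting of (a, b) and (b, a) are assumptions made for this example; they are not confirmed details of the server's implementation.

```python
import math
from collections import Counter


def contact_energies(surface_residues, interface_contacts):
    """Estimate e_con(a, b) = -ln(Q(a, b) / (P(a) * P(b))).

    surface_residues: iterable of one-letter codes of surface residues.
    interface_contacts: iterable of (a, b) pairs observed in interfaces.
    Pairs that were never observed are omitted (no pseudocounts here),
    and every contacting residue type is assumed to occur on the surface.
    """
    surface_counts = Counter(surface_residues)
    n_surface = sum(surface_counts.values())
    p = {aa: n / n_surface for aa, n in surface_counts.items()}

    # Assumption: a contact is unordered, so both orders are counted.
    pair_counts = Counter()
    for a, b in interface_contacts:
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    n_pairs = sum(pair_counts.values())

    energies = {}
    for (a, b), n in pair_counts.items():
        q = n / n_pairs
        energies[(a, b)] = -math.log(q / (p[a] * p[b]))
    return energies


# Toy data (hypothetical): leucine contacts are over-represented.
toy_surface = "LLLLKKKK"
toy_contacts = [("L", "L")] * 6 + [("L", "K")] * 2 + [("K", "K")] * 2
toy_energies = contact_energies(toy_surface, toy_contacts)
```

In this toy set the L-L pair is seen more often than its surface frequency predicts, so its energy is negative, while the under-represented L-K pair receives a positive energy.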
The estimated energy values are summarized in Figure 1. Hydrophobic residues are
attractive to each other, especially in the case of the cysteine-cysteine pair. Hydrophilic
residues, however, are generally repulsive, even for oppositely charged residue pairs such as
the arginine-glutamic acid pair. These features are similar to those employed in previous studies.
Fig. 1 Residue-residue statistical contact energy in protein-protein interfaces. In the
horizontal and vertical axes, 20 amino acids are arranged in descending order of
hydrophobicity. Energy values are represented from red (low energy) to blue (high energy).
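The total contact energy Econ is the sum of econ over all contacting residue pairs:

$$E_{con} = \sum_{i=1}^{N} \sum_{j=1}^{M} c_{ij}\, e_{con}(a_i, a_j)$$

where N and M are the total numbers of residues in the two proteins,
ai and aj are the amino
acids of residues i and j, and cij is 1 if residues i and j are in contact and 0 otherwise.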
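The summation over contacting pairs can be sketched as follows, using the 7 Å Cβ-Cβ contact definition given earlier. The function name `total_contact_energy`, the `(amino_acid, coordinates)` input layout, and the toy energy table are hypothetical and serve only as an illustration.

```python
import math

CONTACT_CUTOFF = 7.0  # angstrom, C-beta to C-beta distance, as in the text


def total_contact_energy(chain_a, chain_b, e_con, cutoff=CONTACT_CUTOFF):
    """Sum e_con over all inter-chain residue pairs within the cutoff.

    chain_a, chain_b: lists of (amino_acid, (x, y, z)) for surface residues,
    where the coordinates are those of the C-beta atom.
    e_con: dict mapping (a, b) to a contact energy.
    Returns (E_con, number_of_contacting_pairs).
    """
    energy = 0.0
    n_contacts = 0
    for aa_i, xyz_i in chain_a:
        for aa_j, xyz_j in chain_b:
            if math.dist(xyz_i, xyz_j) <= cutoff:
                energy += e_con[(aa_i, aa_j)]
                n_contacts += 1
    return energy, n_contacts


# Toy input (hypothetical coordinates and energies).
toy_chain_a = [("L", (0.0, 0.0, 0.0)), ("K", (20.0, 0.0, 0.0))]
toy_chain_b = [("L", (5.0, 0.0, 0.0)), ("E", (22.0, 0.0, 0.0))]
toy_e_con = {("L", "L"): -1.0, ("L", "E"): 0.2, ("K", "L"): 0.3, ("K", "E"): 0.5}
toy_total, toy_n = total_contact_energy(toy_chain_a, toy_chain_b, toy_e_con)
```

Only the L-L pair (5 Å apart) and the K-E pair (2 Å apart) fall within the cutoff here, so the sum runs over two contacts.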
- Normalization of the energies
To normalize the contact and electrostatic energies, a Z-score is introduced, following a
previous study. The Z-score for an energy E is defined as follows:

$$Z = \frac{E - \mathrm{Mean}[E]}{\sqrt{\mathrm{Var}[E]}}$$

where Mean[E] and Var[E] are the average and variance of E, respectively, for randomly
shuffled amino acid sequences of the same composition. The Z-score shows how many
standard deviations a case lies above or below the average, which allows us to compare the
results of different distributions. The calculation of the averages and variances of the contact
energy and electrostatic energy is described in the following sections. In contrast to studies
by other groups, we estimated the average and variance of the energies analytically, without
explicitly generating randomly shuffled sequences.
- Mean and variance of contact energy for randomly shuffled sequences
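We assume that random contacting amino acid pairs are generated by picking two amino
acids at random from the surfaces of different proteins. For this random set of contacting
amino acids, the average μcon and variance σcon²
of the contact energy are calculated as

$$\mu_{con} = \sum_{a \in A} \sum_{b \in A} P(a)\,P(b)\,e_{con}(a,b)$$

$$\sigma_{con}^2 = \sum_{a \in A} \sum_{b \in A} P(a)\,P(b)\,\left(e_{con}(a,b) - \mu_{con}\right)^2$$

where P(a) and P(b) are the proportions of amino acids a and b among the surface residues of each
protein, and A is the set of 20 genetically encoded amino acids. If we assume that all the
contacting residue pairs are independent in the shuffling process, the average and variance of
the total contact energy Econ are calculated as follows:

$$\mathrm{Mean}[E_{con}] = N_{con}\,\mu_{con}, \qquad \mathrm{Var}[E_{con}] = N_{con}\,\sigma_{con}^2$$

where Ncon is the number of contacting residue pairs.
Finally, the Z-score of the contact energy Econ is obtained as follows:

$$Z_{con} = \frac{E_{con} - N_{con}\,\mu_{con}}{\sqrt{N_{con}\,\sigma_{con}^2}}$$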
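A minimal sketch of this analytic normalization is given below. It assumes that the mean and variance of the total energy scale linearly with the number of contacting residue pairs (the independence assumption stated above). The function names and the two-letter toy alphabet are hypothetical.

```python
import math


def random_pair_moments(p_first, p_second, e_con):
    """Mean and variance of e_con for one randomly drawn surface pair.

    p_first, p_second: dicts mapping amino acid -> surface proportion
    for the first and second protein, respectively.
    """
    mu = sum(pa * pb * e_con[(a, b)]
             for a, pa in p_first.items()
             for b, pb in p_second.items())
    var = sum(pa * pb * (e_con[(a, b)] - mu) ** 2
              for a, pa in p_first.items()
              for b, pb in p_second.items())
    return mu, var


def contact_z_score(total_energy, n_contacts, p_first, p_second, e_con):
    """Z-score of E_con, assuming n_contacts independent random pairs."""
    mu, var = random_pair_moments(p_first, p_second, e_con)
    mean_total = n_contacts * mu
    var_total = n_contacts * var
    return (total_energy - mean_total) / math.sqrt(var_total)


# Toy example with a two-letter alphabet (hypothetical values).
toy_p = {"L": 0.5, "K": 0.5}
toy_e = {("L", "L"): -1.0, ("L", "K"): 0.0, ("K", "L"): 0.0, ("K", "K"): 1.0}
toy_mu, toy_var = random_pair_moments(toy_p, toy_p, toy_e)
toy_z = contact_z_score(-4.0, 4, toy_p, toy_p, toy_e)
```

No shuffled sequences are generated: both moments come directly from the surface compositions, which is the point of the analytic estimate.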
- Sequence similarity between target and template
We employed sequence similarity between target protein and template protein as another
feature for finding interacting proteins. We expected that two proteins will interact with
each other if they have close homologues whose dimer structures have been experimentally
determined. Here a Z-score is also introduced to measure sequence similarity. In this
case, the number of identical residues Niden
in the alignment is normalized by the average and variance expected for randomly shuffled sequences:

$$Z_{seq} = -\frac{N_{iden} - p\,N_{comp}}{\sqrt{N_{comp}\,p\,(1-p)}}$$

where Niden is the number of identical residues, and
Ncomp is the number of compared
residues in the alignment with gaps removed. We assume that random shuffling draws from
the uniform distribution of amino acids (p is set to 1/20), and that the number of
identical residues Niden obeys the binomial distribution. Because the other two Z-scores of
energies have negative values for probable interfaces, the Z-score for sequence similarity is
multiplied by minus one to facilitate comparison, hence the leading minus sign. Because we are modeling dimer structures,
two different sequence similarities are obtained for one protein complex. We employed the
higher score (in other words, the lower sequence similarity) for the purposes of discrimination.
The random shuffling process for sequence similarity is subtly different from that of
contact and electrostatic energy. For contact and electrostatic energy, two amino acids on the
surface are randomly chosen. In the case of the sequence similarity, the sequence of the
template protein is fixed, and the sequence of the target protein is randomly generated using a
uniform distribution of amino acids.
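The sequence-similarity score can be sketched in a few lines, using the binomial mean p·Ncomp and variance Ncomp·p·(1−p) with p = 1/20 and the sign flip described above. The function names and the `(n_iden, n_comp)` tuple layout are assumptions for illustration.

```python
import math


def sequence_z_score(n_iden, n_comp, p=1.0 / 20.0):
    """Sign-flipped Z-score of the number of identical residues.

    Under the binomial model the expected number of identities is
    n_comp * p and the variance is n_comp * p * (1 - p).
    """
    mean = n_comp * p
    var = n_comp * p * (1.0 - p)
    return -(n_iden - mean) / math.sqrt(var)


def dimer_sequence_z_score(alignment_a, alignment_b):
    """Take the higher (less similar) of the two subunit scores.

    Each argument is a (n_iden, n_comp) tuple for one subunit's
    target-template alignment.
    """
    return max(sequence_z_score(*alignment_a), sequence_z_score(*alignment_b))


# Toy alignments (hypothetical): 30 and 60 identities over 100 positions.
toy_z_seq = dimer_sequence_z_score((30, 100), (60, 100))
```

Taking the maximum means that a dimer is scored by its less similar subunit, so both subunits must resemble the template for the score to be strongly negative.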
- Combination of contact energy and sequence similarity
Finally, we employed the combined score Zseqcon, which is generated by adding
Zseq and Zcon without any weights:

$$Z_{seqcon} = Z_{seq} + Z_{con}$$
- Evaluation of Prediction Performance
We evaluated our prediction performance using 6,556 yeast sequences registered in
UniProt ver 54.5 and hetero dimer datasets (Jan 23, 2008), taking the DIP (ver 20080114)
database as the reference standard. We selected protein pairs for which at least
one interacting residue pair is aligned. The total number of protein pairs is 13,710.
Among them, 479 pairs (3.5%) are registered in the DIP database as "interacting" protein pairs.
A recall-precision plot is shown below.
The performance of Zseqcon is slightly better than that of Zseq.
The F-measure of Zseq is 0.356 and that of Zseqcon is 0.373. The difference between
these two F-measures was statistically significant (p<0.01): the F-measure of Zseqcon
was larger than that of Zseq in 997 of 1000 bootstrap samples.
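A paired bootstrap comparison of this kind can be sketched as follows. The source does not state how the score thresholds were chosen or how the resampling was implemented, so this example assumes that each predictor has already been thresholded into boolean predictions and that protein pairs are resampled with replacement. All names are hypothetical.

```python
import random


def f_measure(labels, predictions):
    """Harmonic mean of precision and recall for boolean predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bootstrap_wins(labels, pred_a, pred_b, n_boot=1000, seed=0):
    """Count bootstrap samples in which predictor A has the larger F-measure.

    The same resampled indices are used for both predictors, so the
    comparison is paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        f_a = f_measure(ys, [pred_a[i] for i in idx])
        f_b = f_measure(ys, [pred_b[i] for i in idx])
        if f_a > f_b:
            wins += 1
    return wins
```

With 1000 bootstrap samples, a count such as 997 wins corresponds to an empirical one-sided p-value of roughly 0.003, in line with the p<0.01 statement.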
The relationship between the Z-score and the corresponding precision
is useful for interpreting the results of the HOMCOS server.
Last modified : Apr 5, 2008
Lab for Structural and Functional Bioinformatics