 Service of hetero dimer modeling
 Service of Identifying Interacting Proteins

Residue-residue statistical contact energy for protein-protein interaction
Residue-residue statistical contact energies were originally developed for coarse-grained
models of protein folding and threading (1-3). Recently, similar approaches have been applied to
evaluating protein-protein interactions (4,5). In this study, we employed a typical log-odds
formula to derive the contact energies. The statistical contact energy e_{con}(a,b) for
contacting residues a and b in different polypeptide chains is estimated as a log-odds score:

  e_{con}(a,b) = -\log \frac{Q(a,b)}{P(a) P(b)}

where P(a) and P(b) are the probabilities that amino acids a and b appear on the surface, and
Q(a,b) is the probability that amino acids a and b contact each other in the protein-protein
interface. Surface residues of a protein are defined as those residues whose relative
accessible surface areas (accessibilities) are equal to or larger than 15%. Contacting residue
pairs are defined as surface residues in different chains whose Cβ atoms are located within
7 Å of one another. Both probabilities are estimated using the dataset for calculation of
contact energy.
If contacts between residues a and b are frequently observed in
interfaces, the value of e_{con}(a,b) is large and negative.
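The log-odds estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the natural logarithm, ordered (a, b) pair counts, and the absence of pseudocounts are choices made here, and the function name and the toy counts are hypothetical rather than taken from the server.

```python
import math
from collections import Counter

def contact_energies(surface_counts, contact_counts):
    """Return e_con(a, b) = -ln(Q(a, b) / (P(a) * P(b))) for every observed pair.

    surface_counts: Counter, amino acid -> number of surface residues
    contact_counts: Counter, ordered pair (a, b) -> number of interface contacts
    Assumption: natural log and no pseudocounts (unobserved pairs are skipped).
    """
    n_surface = sum(surface_counts.values())
    n_contact = sum(contact_counts.values())
    # P(a): probability that amino acid a appears on the surface
    p = {a: c / n_surface for a, c in surface_counts.items()}
    energies = {}
    for (a, b), c in contact_counts.items():
        q = c / n_contact  # Q(a, b): probability of an a-b interface contact
        energies[(a, b)] = -math.log(q / (p[a] * p[b]))
    return energies

# Toy counts, invented purely for illustration (two residue types only)
surface = Counter({"C": 10, "K": 90})
contacts = Counter({("C", "C"): 20, ("C", "K"): 10, ("K", "C"): 10, ("K", "K"): 60})
e = contact_energies(surface, contacts)
print(e[("C", "C")], e[("K", "K")])
```

In this toy dataset the C-C pair is over-represented in the interface relative to its surface frequency and receives a negative energy, whereas the K-K pair is under-represented and receives a positive one.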
The estimated energy values are summarized in Figure 1. Hydrophobic residues are
attractive to each other, especially in the case of the cysteine-cysteine pair. Hydrophilic
residues, however, are generally repulsive, even for oppositely charged residue pairs such as
the arginine-glutamic acid pair. These features are similar to those reported in previous
studies.
Fig. 1 Residue-residue statistical contact energy in protein-protein interfaces. On the
horizontal and vertical axes, the 20 amino acids are arranged in descending order of
hydrophobicity. Energy values are colored from red (low energy) to blue (high energy).
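The total contact energy E_{con} is the sum of e_{con} over all the contacting residue pairs as
follows:

  E_{con} = \sum_{i=1}^{N} \sum_{j=1}^{M} e_{con}(a_{i}, a_{j})

where the sum runs only over residue pairs (i, j) that are in contact, N and M are the total
numbers of residues of the two proteins, and a_{i} and a_{j} are the amino acids of
residues i and j.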
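As a minimal sketch of this summation, the function below loops over surface residues of two chains and adds e_{con} for every pair whose C-beta atoms lie within the 7 angstrom cutoff given in the text. The input layout (lists of amino acid and coordinate tuples), the function name, and the toy energies are assumptions made for illustration.

```python
import math

CONTACT_CUTOFF = 7.0  # angstrom; C-beta distance threshold stated in the text

def total_contact_energy(chain_a, chain_b, e_con, cutoff=CONTACT_CUTOFF):
    """Sum e_con over inter-chain residue pairs whose C-beta atoms are in contact.

    chain_a, chain_b: lists of (amino_acid, (x, y, z)) for surface residues only
                      (hypothetical input layout; coordinates are C-beta atoms)
    e_con:            dict, (a, b) -> contact energy
    Returns (E_con, number_of_contacting_pairs).
    """
    total, n_contacts = 0.0, 0
    for aa_i, xyz_i in chain_a:
        for aa_j, xyz_j in chain_b:
            if math.dist(xyz_i, xyz_j) <= cutoff:
                total += e_con[(aa_i, aa_j)]
                n_contacts += 1
    return total, n_contacts

# Toy coordinates and energies, invented for illustration
chain_a = [("C", (0.0, 0.0, 0.0)), ("K", (20.0, 0.0, 0.0))]
chain_b = [("C", (5.0, 0.0, 0.0)), ("K", (0.0, 6.0, 0.0))]
toy_e_con = {("C", "C"): -3.0, ("C", "K"): -0.1, ("K", "C"): -0.1, ("K", "K"): 0.3}
energy, n_pairs = total_contact_energy(chain_a, chain_b, toy_e_con)
print(energy, n_pairs)
```

The double loop is quadratic in the number of surface residues, which is usually acceptable for a single dimer; a spatial grid or k-d tree would be the natural replacement for large-scale screening.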
 Normalization of the energies
To normalize the contact and electrostatic energies, a Z-score is introduced based on the
previous study. The Z-score for energy E is defined as follows:

  Z = \frac{E - Mean[E]}{\sqrt{Var[E]}}

where Mean[E] and Var[E] are the average and variance of E, respectively, for randomly
shuffled amino acid sequences of the same composition. The Z-score shows how many standard
deviations a case lies above or below the average, which allows us to compare results drawn
from different distributions. The calculation of the averages and variances of the contact
energy and electrostatic energy is described in the following sections. In contrast to studies
by other groups, we estimated the average and variance of the energies analytically, without
explicitly generating randomly shuffled sequences.
 Mean and variance of contact energy for randomly shuffled sequences
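We assume that random contacting amino acid pairs are generated by picking two amino
acids randomly from the surfaces of different proteins. For this random set of contacting
amino acids, the average \mu_{con} and variance \sigma_{con}^{2} of the contact energy are
calculated as

  \mu_{con} = \sum_{a \in A} \sum_{b \in A} P(a) P(b) e_{con}(a,b)
  \sigma_{con}^{2} = \sum_{a \in A} \sum_{b \in A} P(a) P(b) [e_{con}(a,b) - \mu_{con}]^{2}

where P(a) and P(b) are the proportions of amino acids a and b among the surface residues of
each protein, and A is the set of 20 genetically encoded amino acids. If we assume that all the
contacting residue pairs are independent in the shuffling process, the average and variance of
the total contact energy E_{con} are calculated as follows:

  Mean[E_{con}] = N_{con} \mu_{con}
  Var[E_{con}] = N_{con} \sigma_{con}^{2}

where N_{con} is the number of contacting residue pairs. Finally, the Z-score of the contact
energy E_{con} is obtained as follows:

  Z_{con} = \frac{E_{con} - N_{con} \mu_{con}}{\sqrt{N_{con} \sigma_{con}^{2}}}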
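The analytic normalization can be illustrated with a short Python sketch. It assumes that, for independent contacts, the total energy over n contacting pairs has mean n times mu_con and variance n times sigma_con^2; the function name, the two-letter alphabet, and the toy energies are hypothetical and serve only to keep the example small.

```python
import math

def contact_z_score(e_total, n_contacts, p_a, p_b, e_con):
    """Analytic Z-score of a total contact energy, without explicit shuffling.

    e_total:    observed total contact energy E_con
    n_contacts: number of contacting residue pairs (assumed independent)
    p_a, p_b:   dicts, amino acid -> surface proportion in each protein
    e_con:      dict, (a, b) -> contact energy
    """
    # Mean and variance of the energy of a single random surface-surface contact
    mu = sum(p_a[a] * p_b[b] * e_con[(a, b)] for a in p_a for b in p_b)
    var = sum(p_a[a] * p_b[b] * (e_con[(a, b)] - mu) ** 2 for a in p_a for b in p_b)
    # Independence assumption: mean and variance both scale with n_contacts
    return (e_total - n_contacts * mu) / math.sqrt(n_contacts * var)

# Toy two-letter alphabet, invented for illustration
p_a = {"C": 0.5, "K": 0.5}
p_b = {"C": 0.5, "K": 0.5}
toy_e = {("C", "C"): -2.0, ("C", "K"): 0.0, ("K", "C"): 0.0, ("K", "K"): 2.0}
z_con = contact_z_score(e_total=-8.0, n_contacts=4, p_a=p_a, p_b=p_b, e_con=toy_e)
print(z_con)
```

The sketch does not guard against a zero variance or zero contacts; a production version would need to handle both cases explicitly.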
 Sequence similarity between target and template
We employed the sequence similarity between the target protein and the template protein as
another feature for finding interacting proteins. We expect two proteins to interact with
each other if they have close homologues whose dimer structures have been experimentally
determined. Here a Z-score is also introduced to measure sequence similarity. In this
case, the number of identical residues N_{iden} in the alignment is normalized by the average
and variance for randomly shuffled sequences:

  Z_{seq} = -\frac{N_{iden} - N_{comp} p}{\sqrt{N_{comp} p (1 - p)}}

where N_{iden} is the number of identical residues, and N_{comp} is the number of compared
residues in the alignment with gaps removed. We assume that random shuffling uses the uniform
distribution of amino acids (p is set to 1/20), and that the number of identical residues
N_{iden} obeys the binomial distribution. Because the other two Z-scores of
energies have negative values for probable interfaces, the Z-score for sequence similarity is
multiplied by minus one to facilitate comparison. Because we are modeling dimer structures,
two different sequence similarities are obtained for one protein complex. We employed the
higher score (in other words, the lower sequence similarity) for the purposes of discrimination.
The random shuffling process for sequence similarity is subtly different from that of
contact and electrostatic energy. For contact and electrostatic energy, two amino acids on the
surface are randomly chosen. In the case of the sequence similarity, the sequence of the
template protein is fixed, and the sequence of the target protein is randomly generated using a
uniform distribution of amino acids.
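A minimal sketch of the sequence-similarity Z-score follows, assuming the binomial mean N_comp * p and variance N_comp * p * (1 - p) with p = 1/20, and with the sign flipped so that strong similarity gives a negative value. The function names and the example counts are hypothetical.

```python
import math

def sequence_z_score(n_iden, n_comp, p=1 / 20):
    """Sign-flipped binomial Z-score of the number of identical residues.

    n_iden: identical residues in the target-template alignment
    n_comp: compared residues in the alignment, gaps removed
    p:      probability of a chance match under uniform amino acid usage
    """
    mean = n_comp * p
    var = n_comp * p * (1 - p)
    # Multiplied by minus one so that high similarity yields a negative score
    return -(n_iden - mean) / math.sqrt(var)

def complex_z_seq(chain_1, chain_2):
    """Take the higher (less similar) of the two per-chain scores for a dimer.

    chain_1, chain_2: (n_iden, n_comp) tuples, one per chain of the complex
    """
    return max(sequence_z_score(*chain_1), sequence_z_score(*chain_2))

# Invented example: 50 and 30 identities over 100 compared positions
z_chain_1 = sequence_z_score(50, 100)
z_dimer = complex_z_seq((50, 100), (30, 100))
print(z_chain_1, z_dimer)
```

Taking the maximum means a complex scores well only when both chains resemble their templates, which matches the choice of the lower sequence similarity described above.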
 Combination of contact energy and sequence similarity
Finally, we employed the combined score Z_{seqcon}, which is generated by adding
Z_{seq} and Z_{con} without any weights:

  Z_{seqcon} = Z_{seq} + Z_{con}
 Evaluation of Prediction Performance
We evaluated our prediction performance using 6,556 yeast sequences registered in
UniProt ver. 54.5 and the hetero dimer datasets (Jan 23, 2008), taking the DIP database
(ver. 20080114) as the gold standard. We selected protein pairs for which at least
one interacting residue pair is aligned. The total number of protein pairs is 13,710.
Among them, 479 pairs (3.4%) are registered in the DIP database as "interacting" protein pairs.
A recall-precision plot is shown below.
The performance of Z_{seqcon} is slightly better than that of Z_{seq}.
The F-measure of Z_{seq} is 0.373 and that of Z_{seqcon} is 0.356. The difference between
these two F-measures was statistically significant (p<0.01), because the F-measure of Z_{seqcon}
was larger than that of Z_{seq} in 997 of 1,000 bootstrap samples.
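A bootstrap comparison of this kind can be sketched as follows. The details are assumptions, since the text does not specify them: the F-measure is taken here as the maximum over all score thresholds, lower scores are treated as stronger predictions, and protein pairs are resampled with replacement. All names and the toy data are hypothetical.

```python
import random

def best_f_measure(scores, labels):
    """Maximum F-measure over all rank cutoffs; lower score = predicted interacting.

    labels are 1 (interacting) or 0. Tied scores are not treated specially.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    best, tp = 0.0, 0
    ranked = sorted(zip(scores, labels))
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        if tp:
            precision, recall = tp / k, tp / n_pos
            best = max(best, 2 * precision * recall / (precision + recall))
    return best

def bootstrap_wins(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Count bootstrap samples in which scores_b attains a larger F-measure."""
    rng = random.Random(seed)
    n = len(labels)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        sample_labels = [labels[i] for i in idx]
        f_a = best_f_measure([scores_a[i] for i in idx], sample_labels)
        f_b = best_f_measure([scores_b[i] for i in idx], sample_labels)
        wins += f_b > f_a
    return wins

# Toy data, invented for illustration: method B ranks perfectly, method A does not
toy_labels = [1, 1, 0, 0]
toy_scores_a = [0.0, -1.0, -2.0, -3.0]
toy_scores_b = [-3.0, -2.0, -1.0, 0.0]
wins = bootstrap_wins(toy_scores_a, toy_scores_b, toy_labels, n_boot=200)
print(wins)
```

Under this scheme, the fraction of samples in which the second method does not win gives a one-sided empirical estimate of the significance level.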
The relationship between the Z-score and the corresponding precision
is useful for interpreting the results of the HOMCOS server.
Precision    Z_{seq}    Z_{seqcon}
0.1          18.8       20.5
0.2          21.4       22.9
0.3          23.7       24.6
0.4          29.4       29.6
0.5          39.7       45.9