Predicting protein-protein interactions, interaction sites and residue-residue contact matrices with machine learning techniques
Date
2016
Publisher
University of Delaware
Abstract
Protein-protein interactions (PPIs) play crucial roles in many biological
processes in living organisms, such as immune response, enzyme catalysis, and signal
transduction. Acquiring knowledge of interfacial regions between interacting proteins
is not only helpful in understanding protein functions and elucidating signal
transduction networks but also critical for structure-based drug design and disease
treatment. The cost, time, and other limitations associated with the current
experimental methods to obtain PPI information have motivated the development of
computational methods for predicting PPIs and their interfaces.
In this dissertation, I propose using deep learning algorithms, mainly Stacked
Autoencoders and Deep Neural Networks, along with other machine learning
techniques, to predict protein-protein interactions, interaction sites, and amino acid
residue-residue contacts. These other machine learning techniques include Hidden Markov
Models, Fisher Scores, Support Vector Machines, logistic regression, and clustering.
Specifically, I developed computational methods based on these machine learning
techniques to tackle the following three questions about protein-protein interaction: 1)
whether two given protein sequences can interact (protein-protein interaction
prediction), 2) if they interact, where the interacting residues are in the individual
proteins (interaction site prediction), and 3) how these interacting residues are paired up
across the interacting proteins (contact matrix prediction).
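The three questions are nested: a residue-residue contact matrix determines the interaction sites in each protein, and those in turn determine whether the two proteins interact at all. A minimal sketch of that relationship, assuming a binary contact matrix stored as a list of rows (the function name and the toy matrix are hypothetical and are not taken from the dissertation):

```python
from typing import List, Tuple


def sites_from_contacts(contacts: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Derive interaction sites from a binary contact matrix.

    contacts[i][j] == 1 means residue i of protein A touches residue j of
    protein B. A residue is an interaction site if it has at least one contact.
    """
    # Rows with any contact are interaction sites in protein A.
    sites_a = [i for i, row in enumerate(contacts) if any(row)]
    # Columns with any contact are interaction sites in protein B.
    n_cols = len(contacts[0]) if contacts else 0
    sites_b = [j for j in range(n_cols) if any(row[j] for row in contacts)]
    return sites_a, sites_b


# Toy example: protein A has 4 residues, protein B has 3 (made-up values).
toy_contacts = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]

# Question 3 (contact matrix) answers question 2 (sites) ...
sites_a, sites_b = sites_from_contacts(toy_contacts)
# ... and question 1 (do the proteins interact at all?).
interacts = any(any(row) for row in toy_contacts)
```

In practice the predictions run in the opposite direction, from the coarsest question to the finest, which is why each level needs its own model.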
The first question, whether two given protein sequences can interact (PPI
prediction), has been studied extensively, and much progress has been reported in the
literature. I explored deep neural networks as a new tool for PPI prediction and
compared this tool with one of the state-of-the-art methods based on Support Vector
Machine (SVM) models. The results showed that deep neural networks with stacked
autoencoders are more effective at extracting non-linear features and thus improve
prediction compared with the SVM-based method.
The second question is to further identify where the interacting residues are
(interaction site predictions). The interaction profile hidden Markov model (ipHMM)
was applied to predict protein-protein interaction sites by taking into account the
interacting partner and topology information. The performance of ipHMMs built at the
domain-domain interaction (DDI) family level was found to be significantly lower for
DDI families with multiple interface topologies. To address this problem, I proposed
building ipHMMs at the DDI interface topology level to predict protein interaction sites.
The results showed that this method significantly improved the Matthews correlation
coefficient from 46.4% to 77.3%.
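For reference, the Matthews correlation coefficient (MCC) is computed from the four confusion-matrix counts and ranges from -1 to 1, so the percentages above presumably correspond to 0.464 and 0.773. A minimal sketch of the standard formula follows; the function name is illustrative, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the abstract:

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from true/false positive and negative counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Made-up counts for an imbalanced task such as interaction site prediction,
# where most residues are not at the interface.
example_mcc = matthews_corrcoef(tp=40, tn=900, fp=20, fn=40)
```

MCC is a natural choice here because interface residues are a small minority, and plain accuracy would reward a predictor that labels every residue as non-interacting.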
The third question is to discover how these interacting residues are paired up,
namely the contact matrix for two interacting proteins. Access to the residue-residue
contact information of two interacting proteins can provide further insight into
protein-protein interactions and suggest specific target candidates for mutagenesis. It
could also serve to validate protein docking algorithms. I introduced deep learning techniques
(specifically, Stacked Autoencoders) to extract new representations of the Fisher score
features and to build a deep neural network classifier for prediction. The deep learning
model showed significant improvement over the previous machine learning model
used in the literature. Furthermore, I proposed using the predicted contact matrix as
“feedback” to enhance the interaction site prediction. I developed an integrated
machine learning method based on logistic regression that combines the predicted
contact matrix with the ipHMM interaction site prediction. The integrated model
significantly improved interaction site prediction performance compared with ipHMMs alone.
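The feedback idea can be sketched as a two-feature logistic regression over residues. Everything specific below is an assumption made for illustration: the abstract does not say how the contact matrix is summarized per residue, so this sketch uses the row maximum of predicted contact probabilities, and the scores, labels, and function names are invented toy values rather than the dissertation's data or implementation.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability that a residue is an interaction site; weights[-1] is the bias."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])


def fit_logistic(features: List[List[float]], labels: List[int],
                 lr: float = 0.5, epochs: int = 5000) -> List[float]:
    """Fit logistic regression by plain batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * (n_features + 1)  # feature weights followed by the bias
    for _ in range(epochs):
        grads = [0.0] * (n_features + 1)
        for x, y in zip(features, labels):
            err = predict_proba(weights, x) - y
            for k, xi in enumerate(x):
                grads[k] += err * xi
            grads[-1] += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grads)]
    return weights


# Hypothetical per-residue ipHMM site scores for one protein (6 residues).
iphmm_scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8]
# Hypothetical predicted contact probabilities against a 3-residue partner.
contact_probs = [
    [0.8, 0.1, 0.2],
    [0.1, 0.0, 0.2],
    [0.1, 0.7, 0.3],
    [0.2, 0.1, 0.1],
    [0.0, 0.1, 0.1],
    [0.3, 0.9, 0.2],
]
labels = [1, 0, 1, 0, 0, 1]  # toy ground truth: 1 = interaction site

# Assumed summary feature: the strongest predicted contact for each residue.
contact_scores = [max(row) for row in contact_probs]
features = [[s, c] for s, c in zip(iphmm_scores, contact_scores)]

weights = fit_logistic(features, labels)
predicted = [int(predict_proba(weights, x) >= 0.5) for x in features]
```

With a model of this shape, a residue whose ipHMM score is ambiguous can be pushed above or below the decision threshold by the contact evidence, which is the sense in which the contact matrix acts as feedback.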
Lastly, a web server, DDI2PPI, has been developed to make the PPI and
residue-residue contact matrix predictions publicly available. DDI2PPI provides a
large-scale implementation of the machine learning algorithms developed in this
dissertation. DDI2PPI is freely
available at http://annotation.dbi.udel.edu/ppi_prediction/.