Predicting protein-protein interactions, interaction sites and residue-residue contact matrices with machine learning techniques

Date
2016
Publisher
University of Delaware
Abstract
Protein-protein interactions (PPIs) play crucial roles in many biological processes in living organisms, such as immune response, enzyme catalysis, and signal transduction. Knowledge of the interfacial regions between interacting proteins not only helps in understanding protein functions and elucidating signal transduction networks but is also critical for structure-based drug design and disease treatment. The cost, time, and other limitations of current experimental methods for obtaining PPI information have motivated the development of computational methods for predicting PPIs and their interfaces. In this dissertation, I propose to use deep learning algorithms, mainly Stacked Autoencoders and Deep Neural Networks, along with other machine learning techniques to predict protein-protein interactions, interaction sites, and amino acid residue-residue contacts. These other machine learning techniques include Hidden Markov Models, Fisher Scores, Support Vector Machines, logistic regression, and clustering. Specifically, I developed computational methods based on these techniques to tackle the following three questions about protein-protein interaction: 1) whether two given protein sequences can interact (protein-protein interaction prediction), 2) if they interact, where the interacting residues are in the individual proteins (interaction site prediction), and 3) how these interacting residues are paired up across the interacting proteins (contact matrix prediction). The first question, whether two given protein sequences can interact (PPI prediction), has been studied extensively, and much progress has been reported in the literature. I explored using a deep neural network model as a new tool for PPI prediction and compared it with a state-of-the-art method based on Support Vector Machine (SVM) models. 
The results showed that deep neural networks with stacked autoencoders are more effective at extracting non-linear features and thus improved prediction compared with the SVM-based method. The second question is to further identify where the interacting residues are (interaction site prediction). The interaction profile hidden Markov model (ipHMM) was applied to predict protein-protein interaction sites by taking into account the interacting partner and topology information. It was found that the performance of ipHMM at the domain-domain interaction (DDI) family level was significantly lower for DDI families with multiple topology interfaces. To address this problem, I proposed developing ipHMMs at the DDI interface topology level to predict protein interaction sites. The results showed that the method significantly improved the Matthews correlation coefficient from 46.4% to 77.3%. The third question is to discover how these interacting residues are paired up, namely the contact matrix for two interacting proteins. Access to the residue-residue contact information of two interacting proteins can provide further insight into protein-protein interactions and specific target candidates for mutagenesis. It could also serve as a validation of protein docking algorithms. I introduced deep learning techniques (specifically, Stacked Autoencoders) to extract new representations of the Fisher score features and to build a deep neural network classifier for prediction. The deep learning model showed significant improvement over the previous machine learning model used in the literature. Furthermore, I proposed to leverage what is learned from the contact matrix prediction and use the predicted contact matrix as “feedback” to enhance the interaction site prediction. I developed an integrated machine learning method based on logistic regression that combines the predicted contact matrix with the ipHMM interaction site prediction. 
The performance of the interaction site prediction was significantly improved using the integrated model, as compared with ipHMMs alone. Lastly, a web server, DDI2PPI, has been developed to make the PPI and residue-residue contact matrix predictions publicly available. DDI2PPI provides a large-scale implementation of the machine learning algorithms developed in this dissertation. DDI2PPI is freely available at http://annotation.dbi.udel.edu/ppi_prediction/.
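
The interaction site results in the abstract are reported as a Matthews correlation coefficient (MCC). For readers unfamiliar with the metric, the following minimal Python sketch shows the standard binary MCC formula applied to per-residue labels. The function name and the toy labels are illustrative assumptions and are not taken from the dissertation.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC; labels are 1 (interface residue) or 0 (non-interface)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is taken as 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy example: 8 residues, 3 of them true interface residues.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC = {mcc:.3f}")
```

MCC ranges from -1 to 1, so a value stated as a percentage, such as 77.3%, corresponds to 0.773 on that scale. Because it uses all four confusion-matrix counts, it remains informative when interface residues are a small minority of all residues.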