Novel machine learning algorithms for detecting biological signals in protein sequences: investigation of plasmodesmata

Date
2021
Publisher
University of Delaware
Abstract
Continuous developments in machine learning and in wet-lab technologies for molecular biology have yielded advanced machine learning methods and huge amounts of biological data. These developments make it possible to apply computational methods to biological problems. Many successful applications have been demonstrated over the past decades, giving rise to an important interdisciplinary field known as bioinformatics.
These computational methods assist biologists in dealing with big data and, more importantly, in discovering useful knowledge from biological data. The success of these applications often relies on both the quality and the quantity of the data. However, there are a growing number of situations in which neither the quality nor the quantity of the data meets the requirements for traditional machine learning techniques to perform well.
The ongoing research on plasmodesmata-located proteins (PDLPs) is one such case: only a small amount of data is available, and little is known about it. Plasmodesmata (PD) are membrane-lined intercellular communication channels through which essential nutrients and signaling molecules move between neighboring cells in the plant. This cell-to-cell exchange of molecules through PD is fundamental to the physiology, development and immunity of the plant and is a dynamically regulated cellular process. Biologists believe that, once located in PD, PDLPs play their regulatory role via interaction at the transmembrane domain, although only very limited experimental data on the interaction details are currently available. Only a limited number of PD-associated proteins have been identified so far, which prevents traditional machine learning tools from working effectively. Moreover, no universal or consensus PD-targeting signal has ever been identified, and the quality of the data is also low, which makes the computational tasks more challenging.
In this dissertation, I developed machine learning techniques to (i) identify new plasmodesmata-targeting proteins, (ii) detect de novo plasmodesmata-targeting signals, and (iii) predict PDLP-interacting proteins using the transmembrane domain (TMD) as a potential interaction interface. The techniques are designed specifically for the PDLP problems and focus on incorporating biological domain knowledge to mitigate the issues of data quantity and quality.
Chapter 1 introduces the biological problem and the computational concepts related to this dissertation. For the biological problem, the challenges of research on PDLPs and the current biological knowledge are described in detail. For the computational concepts, the main focus is an introduction to the hidden Markov model (HMM), which is the primary machine learning technique used in my research.
In chapters 2, 3 and 4, I first develop a 3-state HMM, named PDHMM, for the task of decoding PD-targeting signals, and then combine PDHMM with a support vector machine to form a decision-tree-like hybrid classifier for the task of predicting PD-targeting proteins. Applying PDHMM and the hybrid classifier to the ongoing research has produced wet-lab experimental results: the majority of predictions were confirmed as true positives by wet-lab experiments, while some were false positives. Given the cost of wet-lab experiments and the limitations in both the quantity and quality of the data, it is highly desirable to enhance predictive power by fully utilizing the wet-lab experimental results in an active learning fashion. For the task of detecting de novo PD-targeting signals, we develop a novel training algorithm for HMMs, based on the standard Baum-Welch algorithm, for biological applications in which only partially labelled data is available. For the task of predicting PD-targeting proteins, an algorithm for training HMMs with both positive and negative examples is developed. In comparisons with other similar methods, both newly developed methods achieve significant improvement across various datasets.
In chapter 5, the research focus moves from identifying new PD-targeting proteins and detecting de novo PD-targeting signals to a more challenging problem: predicting proteins that interact with PDLPs. The research concentrates on predicting inter-helix contacts, which are believed to be involved in PD regulation. The developed method leverages the predictions of the existing state-of-the-art method and exploits features not fully captured by that method, together with a novel refinement selection scheme. Specifically, using an independent dataset, 2D contact structure models are constructed to extract features reflecting 2D inter-helix contact patterns, and a mechanism is devised to deal with the pitfall of inter-helix prediction refinement, which is an intrinsic challenge for any refinement method. The cross-validation results show that the predictions of the proposed method outperform those of the state-of-the-art method by a notable degree even without the refinement selection scheme. With the refinement selection scheme, which selects a subset of sequences to refine, the method significantly outperforms the base method on the selected sequences.
Lastly, chapter 6 discusses future work on detecting PD-targeting signals in newly identified PD-targeting proteins and on further investigation of interactions in the transmembrane domains of PDLPs.
Keywords
HMM, Inter-helix interaction, Plasmodesmata, Machine learning algorithms