Exploring long-range features in biosequences for structure and interaction prediction

Kern, Colin
University of Delaware
As whole-genome sequencing and other high-throughput technologies become cheaper and more common, biologists are gathering large amounts of biological sequence and other -omics data that have to be analyzed and interpreted by computational methods and by further laboratory experiments, which are expensive and time-consuming. Despite the great progress in sequence analysis tools over the past decade or so, many challenges remain, among them how to effectively capture the long-range features and correlations present in protein sequences. In this dissertation, I developed novel computational methods to explore such long-range features and correlations, specifically in a) transmembrane topology, b) protein interaction sites, and c) protein folding, to achieve more accurate prediction. For transmembrane protein topology prediction, I developed a ternary classifier based on a support vector machine that learns from the sequence segments spanning the domain boundaries predicted by an existing method, TMMOD, which is based on Hidden Markov Models (HMMs). On a benchmark dataset, the results showed that regional information at the domain boundaries helps improve prediction accuracy; the error rate of domain boundary prediction was reduced by about 50%. For interacting residue prediction, I developed a novel decoding algorithm for HMMs, ETB-Viterbi, that incorporates the long-distance correlation that exists between interacting residues, which led to a significant improvement in prediction performance, with up to a 12.8% increase in AUC score. However, ETB-Viterbi is not guaranteed to find optimal paths. I therefore reformulated the optimization problem and developed a post-decoding re-ranking method based on genetic algorithms with simulated annealing. This method was shown to further improve prediction, raising the F-score by over 14% on average, in cases where ETB-Viterbi underperforms.
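For orientation, the textbook Viterbi decoding that ETB-Viterbi modifies can be sketched as follows. This is only the standard algorithm in log space, not ETB-Viterbi itself; the function name `viterbi` and its dictionary-based parameters are illustrative assumptions, not taken from the dissertation.

```python
def viterbi(obs, states, log_start, log_trans, log_emit):
    """Standard Viterbi decoding in log space (illustrative sketch only).

    obs       : sequence of observed symbols (e.g. residues)
    states    : list of hidden state names
    log_start : log_start[s] is the log-probability of starting in s
    log_trans : log_trans[p][s] is the log-probability of moving p -> s
    log_emit  : log_emit[s][o] is the log-probability of s emitting o
    Returns (best_path, best_log_probability).
    """
    if not obs:
        raise ValueError("obs must contain at least one symbol")

    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]

    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # First-order assumption: only the previous state matters.
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = best[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

Because each step conditions only on the immediately preceding state, this recursion cannot by itself represent a dependency between two residues that are far apart in the sequence, which is the limitation the abstract describes.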
Finally, I devised a novel scoring scheme for HP lattice models, which are used to predict tertiary protein structure. By introducing information about the location of residues with respect to the protein's global structure, the scheme improved prediction performance on a benchmark dataset with statistical significance and rarely worsened predictions.
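As background, the conventional HP lattice score that such a scheme would build on can be sketched as below. This is the standard contact energy on a 2D square lattice (each pair of hydrophobic residues that are lattice neighbours but not adjacent in the chain contributes -1), not the novel scoring scheme of the dissertation; the function name `hp_energy` and the coordinate representation are assumptions made for illustration.

```python
def hp_energy(sequence, coords):
    """Standard HP-model energy of a 2D square-lattice conformation.

    sequence : string over {'H', 'P'} (hydrophobic / polar residues)
    coords   : list of (x, y) integer lattice positions, one per residue
    Returns the energy: -1 for every H-H pair that occupies neighbouring
    lattice sites without being consecutive in the chain.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_at = {pos: i for i, pos in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


# A four-residue chain folded into a unit square brings its two ends together.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))
```

Note that this baseline score depends only on local contacts; it carries no information about where a residue sits relative to the overall fold, which is the kind of global information the abstract says the new scheme introduces.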