Machine learning application on genetic analysis for diseases using human interactome
Date
2018
Publisher
University of Delaware
Abstract
As huge amounts of omics and clinical data are being gathered, analyzing these data and identifying the genetic causes of diseases has become a central task in bioinformatics. In this dissertation, we worked on computational analysis of comorbid diseases at three different levels. Comorbidity is the phenomenon of two or more diseases co-occurring more often than expected by random chance. These diseases present enormous challenges to accurate diagnosis and treatment. The primary goal of this research is to incorporate as much information as possible and to develop and apply state-of-the-art algorithms for solving several questions related to comorbid disease prediction. The dissertation also aims to provide new information useful for further exploring the human genome and its behavior.
First, at the sequence level, we studied the effect of non-synonymous single nucleotide polymorphisms (SNPs) on diseases, specifically cancers. We investigated how connecting SNPs in the context of haplotypes and of the interacting sites of proteins encoded by the affected genes can improve prediction performance. We trained classifiers on both sequential and structural features extracted from the affected genes and assessed their predictions using cross-validation. We found that accuracy was consistently enhanced by combining sequential and structural features, with the increase ranging from a few percentage points up to more than 20 percentage points. The results of putting SNPs in the context of interacting sites were less consistent than those for individual SNP prediction, whereas SNPs that appear together in a haplotype showed a stronger correlation with one another and with the phenotype, and therefore led to a significant improvement in prediction performance, with the ROC score increasing from 0.81 to 0.95. We found a similar gain in prediction performance for residue prediction at interacting and non-interacting sites, where the ROC score increased from 0.66 to 0.86.
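To make the evaluation protocol above concrete, the following is a minimal sketch of how an ROC score and cross-validation folds can be computed. It is not the dissertation's code: the function names `roc_auc` and `k_fold_indices`, the interleaved fold assignment, and the toy scores are assumptions chosen for illustration. The ROC score is computed here as the fraction of positive/negative pairs in which the positive example receives the higher classifier score, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: iterable of 0/1 class labels (1 = disease-associated).
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k):
    """Return k (train, test) index splits; fold i holds indices i, i+k, ..."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    splits = []
    for fold in range(k):
        test = list(range(fold, n_samples, k))
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Hypothetical scores from some trained classifier on four SNPs.
    print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
    print(k_fold_indices(5, 2))
```

In a full pipeline, a classifier would be trained on the `train` indices of each split and `roc_auc` evaluated on the held-out `test` indices, once with sequential features only and once with sequential and structural features concatenated.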
Second, at the gene-cluster level, we worked on the identification of common genes associated with comorbid diseases. This task can be critical in understanding the pathobiological mechanisms of disease comorbidity. We developed a novel method to predict missing common genes related to a comorbid disease pair. Specifically, the search for missing common genes is formulated as an optimization problem that minimizes the network-based module separation between the two subgraphs produced by mapping the genes associated with each disease onto the interactome. In cross-validation on more than 600 disease pairs, our method achieved a significantly higher average receiver operating characteristic (ROC) score of 0.95, compared to a baseline ROC score of 0.60 obtained with randomized data. Missing common gene prediction aims to complete the gene set associated with comorbid diseases and thereby to provide a better understanding of biological interventions, such as gene-targeted therapeutics, related to comorbid diseases. We also provide a few case studies to showcase the pathobiology of the predicted genes and their correlation with metabolic pathways.
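The optimization described above can be illustrated with a small sketch. It assumes that module separation is the commonly used network-based measure s_AB = d_AB - (d_AA + d_BB) / 2, where d_AA and d_BB are mean nearest-neighbor shortest-path distances within each module and d_AB is the mean nearest distance between modules; the abstract does not spell out the exact definition. The greedy scoring in `rank_candidates`, the function names, and the toy six-gene path graph are all hypothetical and are not the dissertation's actual procedure.

```python
from collections import deque


def bfs_distances(graph, source):
    """Unweighted shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def nearest_distances(graph, sources, targets, exclude_self):
    """For each source, the distance to its nearest reachable target."""
    result = []
    for s in sources:
        dist = bfs_distances(graph, s)
        reachable = [dist[t] for t in targets
                     if t in dist and not (exclude_self and t == s)]
        if reachable:
            result.append(min(reachable))
    if not result:
        raise ValueError("no reachable gene pairs for this module")
    return result


def separation(graph, module_a, module_b):
    """Assumed definition: s_AB = d_AB - (d_AA + d_BB) / 2."""
    within_a = nearest_distances(graph, module_a, module_a, exclude_self=True)
    within_b = nearest_distances(graph, module_b, module_b, exclude_self=True)
    # Genes shared by both modules contribute a distance of 0 here.
    between = (nearest_distances(graph, module_a, module_b, exclude_self=False)
               + nearest_distances(graph, module_b, module_a, exclude_self=False))
    d_aa = sum(within_a) / len(within_a)
    d_bb = sum(within_b) / len(within_b)
    d_ab = sum(between) / len(between)
    return d_ab - (d_aa + d_bb) / 2


def rank_candidates(graph, module_a, module_b, candidates):
    """Score each candidate by the separation obtained when it is added to both modules."""
    scored = [(separation(graph, set(module_a) | {c}, set(module_b) | {c}), c)
              for c in candidates]
    return sorted(scored)


if __name__ == "__main__":
    # Toy interactome: a path g1 - g2 - g3 - g4 - g5 - g6.
    names = ["g1", "g2", "g3", "g4", "g5", "g6"]
    graph = {n: set() for n in names}
    for u, v in zip(names, names[1:]):
        graph[u].add(v)
        graph[v].add(u)
    disease_a, disease_b = {"g1", "g2"}, {"g5", "g6"}
    print(separation(graph, disease_a, disease_b))
    print(rank_candidates(graph, disease_a, disease_b, ["g3", "g4"]))
```

Candidates that lower the separation the most are the ones this sketch would propose as missing common genes. On a real interactome, the repeated breadth-first searches would need caching to be practical.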
Third, at the disease level, as an effort toward better understanding the genetic causes of comorbidity, we developed a method to predict how likely two given diseases are to be comorbid. Intuitively, two diseases that share more common genes should have an increased chance of being comorbid. Previous work shows that, after mapping the associated genes onto the human interactome, the distance between the two disease modules (subgraphs) is correlated with comorbidity and hence can be used for comorbidity prediction. To fully incorporate the structural characteristics of the interactome as features for more accurate prediction of comorbidity, we developed a new method that embeds the human interactome into a high-dimensional geometric space and uses the projections onto different dimensions to “fingerprint” disease modules. A supervised machine learning classifier is then trained to discriminate comorbid disease pairs from non-comorbid ones. In cross-validation on a dataset of more than 10,000 disease pairs, our model achieved an ROC score of 0.90 for comorbidity at relative risk RR=0 and an ROC score of 0.76 for comorbidity at relative risk RR=1, significantly outperforming the previous method. This validates our hypothesis that embedding the interactome into a high-dimensional space aids the extraction of informative features for effective learning, and it opens the possibility of further incorporating domain-specific information, such as weighting known disease-related pathways.
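The fingerprinting idea can be sketched as follows. The abstract does not specify the embedding method or the exact fingerprint, so everything here is an assumption for illustration: the embedding is a hypothetical precomputed mapping from gene to coordinates, a module's fingerprint is taken to be the per-dimension mean followed by the per-dimension standard deviation of its genes, and a disease pair is described by the absolute difference of the two fingerprints.

```python
from statistics import fmean, pstdev


def fingerprint(module_genes, embedding):
    """Summarize a disease module by per-dimension mean and spread.

    embedding: dict mapping gene name -> tuple of coordinates (hypothetical,
    assumed to come from some prior embedding of the interactome).
    Genes missing from the embedding are skipped.
    """
    points = [embedding[g] for g in module_genes if g in embedding]
    if not points:
        raise ValueError("no module gene has an embedding")
    columns = list(zip(*points))  # one tuple of values per dimension
    means = [fmean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return means + spreads


def pair_features(fp_a, fp_b):
    """Feature vector for a disease pair: absolute fingerprint differences."""
    return [abs(x - y) for x, y in zip(fp_a, fp_b)]


if __name__ == "__main__":
    # Toy 2-dimensional embedding of three genes (made-up coordinates).
    embedding = {"g1": (0.0, 1.0), "g2": (2.0, 1.0), "g3": (4.0, 5.0)}
    fp_a = fingerprint(["g1", "g2"], embedding)
    fp_b = fingerprint(["g3"], embedding)
    print(fp_a, fp_b, pair_features(fp_a, fp_b))
```

Feature vectors of this kind, one per disease pair, would then be passed to a supervised classifier together with comorbid or non-comorbid labels.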