Open Access Publications

Open access publications by faculty, staff, postdocs, and graduate students at the Center for Bioinformatics and Computational Biology.


Recent Submissions

  • Item
    A short-term, randomized, controlled, feasibility study of the effects of different vegetables on the gut microbiota and microRNA expression in infants
    (Frontiers in Microbiomes, 2024-03-01) Ferro, Lynn E.; Bittinger, Kyle; Trudo, Sabrina P.; Beane, Kaleigh E.; Polson, Shawn W.; Kim, Jae Kyeom; Trabulsi, Jillian C.
    The complementary diet influences the gastrointestinal (gut) microbiota composition and, in turn, host health and, potentially, microRNA (miRNA) expression. This study aimed to assess the feasibility of altering the gut microbial communities with short-term food introduction and to determine the effects of different vegetables on the gut microbiota and miRNA expression in infants. A total of 11 infants were randomized to one of the following intervention arms: control, broccoli, or carrot. The control group maintained the milk diet only, while the other groups consumed either a broccoli puree or a carrot puree on days 1–3 along with their milk diet (human milk or infant formula). Genomic DNA and total RNA were extracted from fecal samples to determine the microbiota composition and miRNA expression. Short-term feeding of both broccoli and carrots resulted in changes in the microbiota and miRNA expression. Compared to the control, a trend toward a decrease in Shannon index was observed in the carrot group on days 2 and 4. The carrot and broccoli groups differed by weighted UniFrac. Streptococcus was increased on day 4 in the carrot group compared to the control. The expression of two miRNAs (i.e., miR-217 and miR-590-5p) trended towards decrease in both the broccoli and carrot groups compared to the control, whereas increases in eight and two different miRNAs were observed in the carrot and broccoli groups, respectively. Vegetable interventions differentially impacted the gut microbiota and miRNA expression, which may be a mechanism by which total vegetable intake and variety are associated with reduced disease risk.
  • Item
    Transcriptional regulation of Sis1 promotes fitness but not feedback in the heat shock response
    (eLife, 2023-05-17) Grade, Rania; Singh, Abhyudai; Ali, Asif; Pincus, David
    The heat shock response (HSR) controls expression of molecular chaperones to maintain protein homeostasis. Previously, we proposed a feedback loop model of the HSR in which heat-denatured proteins sequester the chaperone Hsp70 to activate the HSR, and subsequent induction of Hsp70 deactivates the HSR (Krakowiak et al., 2018; Zheng et al., 2016). However, recent work has implicated newly synthesized proteins (NSPs) – rather than unfolded mature proteins – and the Hsp70 co-chaperone Sis1 in HSR regulation, yet their contributions to HSR dynamics have not been determined. Here, we generate a new mathematical model that incorporates NSPs and Sis1 into the HSR activation mechanism, and we perform genetic decoupling and pulse-labeling experiments to demonstrate that Sis1 induction is dispensable for HSR deactivation. Rather than providing negative feedback to the HSR, transcriptional regulation of Sis1 by Hsf1 promotes fitness by coordinating stress granules and carbon metabolism. These results support an overall model in which NSPs signal the HSR by sequestering Sis1 and Hsp70, while induction of Hsp70 – but not Sis1 – attenuates the response.
  • Item
    Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life
    (Scientific Reports, 2023-02-06) Hallee, Logan; Khomtchouk, Bohdan B.
    In this study, we investigate how an organism’s codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement.
  • Item
    Transcriptomic Signature of the Simulated Microgravity Response in Caenorhabditis elegans and Comparison to Spaceflight Experiments
    (Cells, 2023-01-10) Çelen, İrem; Jayasinghe, Aroshan; Doh, Jung H.; Sabanayagam, Chandran R.
    Given the growing interest in human exploration of space, it is crucial to identify the effects of space conditions on biological processes. Here, we analyze the transcriptomic response of Caenorhabditis elegans to simulated microgravity and observe the maintained transcriptomic response after returning to ground conditions for four, eight, and twelve days. We show that 75% of the simulated microgravity-induced changes on gene expression persist after returning to ground conditions for four days while most of these changes are reverted after twelve days. Our results from integrative RNA-seq and mass spectrometry analyses suggest that simulated microgravity affects longevity-regulating insulin/IGF-1 and sphingolipid signaling pathways. Finally, we identified 118 genes that are commonly differentially expressed in simulated microgravity- and space-exposed worms. Overall, this work provides insight into the effect of microgravity on biological systems during and after exposure.
  • Item
    DNA Methylation Analysis Reveals Distinct Patterns in Satellite Cell–Derived Myogenic Progenitor Cells of Subjects with Spastic Cerebral Palsy
    (Journal of Personalized Medicine, 2022-11-30) Robinson, Karyn G.; Marsh, Adam G.; Lee, Stephanie K.; Hicks, Jonathan; Romero, Brigette; Batish, Mona; Crowgey, Erin L.; Shrader, M. Wade; Akins, Robert E.
    Spastic type cerebral palsy (CP) is a complex neuromuscular disorder that involves altered skeletal muscle microanatomy and growth, but little is known about the mechanisms contributing to muscle pathophysiology and dysfunction. Traditional genomic approaches have provided limited insight regarding disease onset and severity, but recent epigenomic studies indicate that DNA methylation patterns can be altered in CP. Here, we examined whether a diagnosis of spastic CP is associated with intrinsic DNA methylation differences in myoblasts and myotubes derived from muscle resident stem cell populations (satellite cells; SCs). Twelve subjects were enrolled (6 CP; 6 control) with informed consent/assent. Skeletal muscle biopsies were obtained during orthopedic surgeries, and SCs were isolated and cultured to establish patient–specific myoblast cell lines capable of proliferation and differentiation in culture. DNA methylation analyses indicated significant differences at 525 individual CpG sites in proliferating SC–derived myoblasts (MB) and 1774 CpG sites in differentiating SC–derived myotubes (MT). Of these, 79 CpG sites were common in both culture types. The distribution of differentially methylated 1 Mbp chromosomal segments indicated distinct regional hypo– and hyper–methylation patterns, and significant enrichment of differentially methylated sites on chromosomes 12, 13, 14, 15, 18, and 20. Average methylation load across 2000 bp regions flanking transcriptional start sites was significantly different in 3 genes in MBs, and 10 genes in MTs. SC derived MBs isolated from study participants with spastic CP exhibited fundamental differences in DNA methylation compared to controls at multiple levels of organization that may reveal new targets for studies of mechanisms contributing to muscle dysregulation in spastic CP.
  • Item
    Whole-genome sequencing identifies I-SceI-mediated transgene integration sites in Xenopus tropicalis snai2: eGFP line
    (G3: Genes | Genomes | Genetics, 2022-02-16) Wang, Jian; Lu, Congyu; Wei, Shuo
    Transgenesis with the meganuclease I-SceI is a safe and efficient method, but the underlying mechanisms remain unclear due to the lack of information on transgene localization. Using I-SceI, we previously developed a transgenic Xenopus tropicalis line expressing enhanced green fluorescent protein driven by the neural crest-specific snai2 promoter/enhancer, which is a powerful tool for studying neural crest development and craniofacial morphogenesis. Here we carried out whole-genome shotgun sequencing for the snai2: eGFP embryos to identify the transgene integration sites. With a 19x sequencing coverage, we estimated that 6 copies of the transgene were inserted into the X. tropicalis genome in the hemizygous transgenic embryos. Two transgene integration loci adjacent to each other were identified in a non-coding region on Chromosome 1, possibly as a result of duplication after a single transgene insertion. Interestingly, genomic DNA at the boundaries of the transgene integration loci contains short sequences homologous to the I-SceI recognition site, suggesting that the integration was not random but probably mediated by sequence homology. To our knowledge, our work represents the first genome-wide sequencing study on a transgenic organism generated with I-SceI, which is useful for evaluating the potential genetic effects of I-SceI-mediated transgenesis and further understanding the mechanisms underlying this transgenic method.
  • Item
    Protein Ontology (PRO): enhancing and scaling up the representation of protein entities
    (Oxford University Press, 2016-11-28) Natale, Darren A.; Arighi, Cecilia N.; Blake, Judith A.; Bona, Jonathan; Chen, Chuming; Chen, Sheng-Chih; Christie, Karen R.; Cowart, Julie; D’Eustachio, Peter; Diehl, Alexander D.; Drabkin, Harold J.; Duncan, William D.; Huang, Hongzhan; Ren, Jia; Ross, Karen; Ruttenberg, Alan; Shamovsky, Veronica; Smith, Barry; Wang, Qinghua; Zhang, Jian; El-Sayed, Abdelrahman; Wu, Cathy H.
    The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.
  • Item
    ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing
    (Nature Publishing Group, 2017-03-16) Francis, Felix; Dumas, Michael D.; Wisser, Randall J.
    Isolating and sequencing specific regions in a genome is a cornerstone of molecular biology. This has been facilitated by computationally encoding the thermodynamics of DNA hybridization for automated design of hybridization and priming oligonucleotides. However, the repetitive composition of genomes challenges the identification of target-specific oligonucleotides, which limits genetics and genomics research on many species. Here, a tool called ThermoAlign was developed that ensures the design of target-specific primer pairs for DNA amplification. This is achieved by evaluating the thermodynamics of hybridization for full-length oligonucleotide-template alignments (thermoalignments) across the genome to identify primers predicted to bind specifically to the target site. For amplification-based resequencing of regions that cannot be amplified by a single primer pair, a directed graph analysis method is used to identify minimum amplicon tiling paths. Laboratory validation by standard and long-range polymerase chain reaction and amplicon resequencing with maize, one of the most repetitive genomes sequenced to date (≈85% repeat content), demonstrated the specificity-by-design functionality of ThermoAlign. ThermoAlign is released under an open source license and bundled in a dependency-free container for wide distribution. It is anticipated that this tool will facilitate multiple applications in genetics and genomics and be useful in the workflow of high-throughput targeted resequencing studies.
  • Item
    The UniProtKB guide to the human proteome
    (Oxford University Press, 2016-02-19) Breuza, Lionel; Poux, Sylvain; Estreicher, Anne; Famiglietti, Maria Livia; Magrane, Michele; Tognolli, Michael; Bridge, Alan; Baratin, Delphine; Redaschi, Nicole; UniProt Consortium; Wu, Cathy Huey-Hwa
    Advances in high-throughput and advanced technologies allow researchers to routinely perform whole genome and proteome analysis. For this purpose, they need high-quality resources providing comprehensive gene and protein sets for their organisms of interest. Using the example of the human proteome, we will describe the content of a complete proteome in the UniProt Knowledgebase (UniProtKB). We will show how manual expert curation of UniProtKB/Swiss-Prot is complemented by expert-driven automatic annotation to build a comprehensive, high-quality and traceable resource. We will also illustrate how the complexity of the human proteome is captured and structured in UniProtKB.
  • Item
    miRiaD: A Text Mining Tool for Detecting Associations of microRNAs with Diseases
    (Biomed Central Ltd, 2016-04-29) Gupta, Samir; Ross, Karen E.; Tudor, Catalina O.; Wu, Cathy H.; Schmidt, Carl J.; Vijay-Shanker, K.
    Background: MicroRNAs are increasingly being appreciated as critical players in human diseases, and questions concerning the role of microRNAs arise in many areas of biomedical research. There are several manually curated databases of microRNA-disease associations gathered from the biomedical literature; however, it is difficult for curators of these databases to keep up with the explosion of publications in the microRNA-disease field. Moreover, automated literature mining tools that assist manual curation of microRNA-disease associations currently capture only one microRNA property (expression) in the context of one disease (cancer). Thus, there is a clear need to develop more sophisticated automated literature mining tools that capture a variety of microRNA properties and relations in the context of multiple diseases to provide researchers with fast access to the most recent published information and to streamline and accelerate manual curation. Methods: We have developed miRiaD (microRNAs in association with Disease), a text-mining tool that automatically extracts associations between microRNAs and diseases from the literature. These associations are often not directly linked, and the intermediate relations are often highly informative for the biomedical researcher. Thus, miRiaD extracts the miR-disease pairs together with an explanation for their association. We also developed a procedure that assigns scores to sentences, marking their informativeness, based on the microRNA-disease relation observed within the sentence. Results: miRiaD was applied to the entire Medline corpus, identifying 8301 PMIDs with miR-disease associations. These abstracts and the miR-disease associations are available for browsing online. We evaluated the recall and precision of miRiaD with respect to information of high interest to public microRNA-disease database curators (expression and target gene associations), obtaining a recall of 88.46-90.78. When we expanded the evaluation to include sentences with a wide range of microRNA-disease information that may be of interest to biomedical researchers, miRiaD also performed very well with an F-score of 89.4. The informativeness ranking of sentences was evaluated in terms of nDCG (0.977) and correlation metrics (0.678-0.727) when compared to an annotator's ranked list. Conclusions: miRiaD, a high performance system that can capture a wide variety of microRNA-disease related information, extends beyond the scope of existing microRNA-disease resources. It can be incorporated into manual curation pipelines and serve as a resource for biomedical researchers interested in the role of microRNAs in disease. In our ongoing work we are developing an improved miRiaD web interface that will facilitate complex queries about microRNA-disease relationships, such as "In what diseases does microRNA regulation of apoptosis play a role?" or "Is there overlap in the sets of genes targeted by microRNAs in different types of dementia?"
  • Item
    Intercellular Variability in Protein Levels from Stochastic Expression and Noisy Cell Cycle Processes
    (Public Library of Science, 2016-08-18) Soltani, Mohammad; Vargas-Garcia, Cesar A.; Antunes, Duarte; Singh, Abhyudai
    Inside individual cells, expression of genes is inherently stochastic and manifests as cell-to-cell variability or noise in protein copy numbers. Since protein half-lives can be comparable to the cell-cycle length, randomness in cell-division times generates additional intercellular variability in protein levels. Moreover, as many mRNA/protein species are expressed at low-copy numbers, errors incurred in partitioning of molecules between two daughter cells are significant. We derive analytical formulas for the total noise in protein levels when the cell-cycle duration follows a general class of probability distributions. Using a novel hybrid approach the total noise is decomposed into components arising from i) stochastic expression; ii) partitioning errors at the time of cell division and iii) random cell-division events. These formulas reveal that random cell-division times not only generate additional extrinsic noise, but also critically affect the mean protein copy numbers and intrinsic noise components. Counterintuitively, in some parameter regimes, noise in protein levels can decrease as cell-division times become more stochastic. Computations are extended to consider genome duplication, where transcription rate is increased at a random point in the cell cycle. We systematically investigate how the timing of genome duplication influences different protein noise components. Intriguingly, results show that noise contribution from stochastic expression is minimized at an optimal genome-duplication time. Our theoretical results motivate new experimental methods for decomposing protein noise levels from synchronized and asynchronized single-cell expression data. Characterizing the contributions of individual noise mechanisms will lead to precise estimates of gene expression parameters and techniques for altering stochasticity to change phenotype of individual cells.
  • Item
    BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID
    (Oxford University Press, 2016-08-02) Kim, Sun; Dogan, Rezarta Islamaj; Chatr-Aryamontri, Andrew; Chang, Christie S.; Oughtred, Rose; Rust, Jennifer; Batista-Navarro, Riza; Carter, Jacob; Ananiadou, Sophia; Matos, Sergio; Santos, Andre; Campos, David; Oliveira, Jose Luis; Singh, Onkar; Jonnagaddala, Jitendra; Dai, Hong-Jie; Su, Emily Chia-Yu; Chang, Yung-Chun; Su, Yu-Chen; Chu, Chun-Han; Chen, Chien Chin; Hsu, Wen-Lian; Peng, Yifan; Arighi, Cecilia; Wu, Cathy H.; Vijay-Shanker, K.; Aydin, Ferhat; Husunbeyi, Zehra Melce; Ozgur, Arzucan; Shin, Soo-Yong; Kwon, Dongseop; Dolinski, Kara; Tyers, Mike; Wilbur, W. John; Comeau, Donald C.
    BioC is a simple XML format for text, annotations and relations, and was developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams, world-wide, participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were merged using a machine learning classifier to produce an optimized output. The biocurator assistant system was evaluated by four BioGRID curators in terms of practical usability. The curators' feedback was overall positive and highlighted the user-friendly design and the convenient gene/protein curation tool based on text mining.
  • Item
    BioC-compatible full-text passage detection for protein-protein interactions using extended dependency graph
    (Oxford University Press, 2016-04-12) Peng, Yifan; Arighi, Cecilia; Wu, Cathy H.; Vijay-Shanker, K.
    There has been a large growth in the number of biomedical publications that report experimental results. Many of these results concern detection of protein-protein interactions (PPI). In BioCreative V, we participated in the BioC task and developed a PPI system to detect text passages with PPIs in the full-text articles. By adopting the BioC format, the output of the system can be seamlessly added to the biocuration pipeline with little effort required for the system integration. A distinctive feature of our PPI system is that it utilizes an extended dependency graph, an intermediate level of representation that attempts to abstract away syntactic variations in text. As a result, we are able to use only a limited set of rules to extract PPI pairs in the sentences, and additional rules to detect additional passages for PPI pairs. For evaluation, we used the 95 articles that were provided for the BioC annotation task. We retrieved the unique PPIs from the BioGRID database for these articles and show that our system achieves a recall of 83.5%. In order to evaluate the detection of passages with PPIs, we further annotated Abstract and Results sections of 20 documents from the dataset and show that an f-value of 80.5% was obtained. To evaluate the generalizability of the system, we also conducted experiments on AIMed, a well-known PPI corpus. We achieved an f-value of 76.1% for sentence detection and an f-value of 64.7% for unique PPI detection.
  • Item
    InterPro in 2017––beyond protein family and domain annotations
    (Oxford University Press, 2016-11-28) Finn, Robert D.; Attwood, Teresa K.; Babbitt, Patricia C.; Bateman, Alex; Bork, Peer; Bridge, Alan J.; Chang, Hsin-Yu; Dosztányi, Zsuzsanna; El-Gebali, Sara; Fraser, Matthew; Gough, Julian; Haft, David; Holliday, Gemma L.; Huang, Hongzhan; Huang, Xiaosong; Letunic, Ivica; Lopez, Rodrigo; Lu, Shennan; Marchler-Bauer, Aron; Mi, Huaiyu; Mistry, Jaina; Natale, Darren A.; Necci, Marco; Nuka, Gift; Orengo, Christine A.; Park, Youngmi; Pesseat, Sebastien; Piovesan, Damiano; Potter, Simon C.; Rawlings, Neil D.; Redaschi, Nicole; Richardson, Lorna; Rivoire, Catherine; Sangrador-Vegas, Amaia; Sigrist, Christian; Sillitoe, Ian; Smithers, Ben; Squizzato, Silvano; Sutton, Granger; Thanki, Narmada; Thomas, Paul D.; Tosatto, Silvio C. E.; Wu, Cathy H.; Xenarios, Ioannis; Yeh, Lai-Su; Young, Siew-Yit; Mitchell, Alex L.
    InterPro is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro’s predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.
  • Item
    UniProt: the universal protein knowledgebase
    (Oxford University Press on behalf of Nucleic Acids Research, 2016-11-28) The UniProt Consortium; Wu, Cathy H.
    The UniProt knowledgebase is a large resource of protein sequences and associated detailed annotation. The database contains over 60 million sequences, of which over half a million sequences have been curated by experts who critically review experimental and predicted data for each protein. The remainder are automatically annotated based on rule systems that rely on the expert curated knowledge. Since our last update in 2014, we have more than doubled the number of reference proteomes to 5631, giving a greater coverage of taxonomic diversity. We implemented a pipeline to remove redundant highly similar proteomes that were causing excessive redundancy in UniProt. The initial run of this pipeline reduced the number of sequences in UniProt by 47 million. For our users interested in the accessory proteomes, we have made available sets of pan proteome sequences that cover the diversity of sequences for each species that is found in its strains and sub-strains. To help interpretation of genomic variants, we provide tracks of detailed protein information for the major genome browsers. We provide a SPARQL endpoint that allows complex queries of the more than 22 billion triples of data in UniProt. UniProt resources can be accessed via the UniProt website.
  • Item
    Predicting nsSNPs that disrupt protein-protein interactions using docking
    (IEEE Computational Intelligence Society; IEEE Computer Society; IEEE Control Systems Society; IEEE Engineering in Medicine and Biology Society; The Association for Computing Machinery, 2016-01-22) Goodacre, Norman; Edwards, Nathan; Danielsen, Mark; Uetz, Peter; Wu, Cathy H.
    The human genome contains a large number of protein polymorphisms due to individual genome variation. How many of these polymorphisms lead to altered protein-protein interaction is unknown. We have developed a method to address this question. The intersection of the SKEMPI database (of affinity constants among interacting proteins) and CAPRI 4.0 docking benchmark was docked using HADDOCK, leading to a training set of 166 mutant pairs. A random forest classifier that uses the differences in resulting docking scores between the 166 mutant pairs and their wild-types was used to distinguish between variants that have either completely or partially lost binding ability. 50% of non-binders were correctly predicted with a false discovery rate of only 2%. The model was tested on a set of 15 HIV-1–human, as well as 7 human–human glioblastoma-related, mutant protein pairs: 50% of combined non-binders were correctly predicted with a false discovery rate of 10%. The model was also used to identify 10 protein-protein interactions between human proteins and their HIV-1 partners that are likely to be abolished by rare non-synonymous single-nucleotide polymorphisms (nsSNPs). These nsSNPs may represent novel and potentially therapeutically-valuable targets for anti-viral therapy by disruption of viral binding.
  • Item
    Distribution of CpG Motifs in Upstream Gene Domains in a Reef Coral and Sea Anemone: Implications for Epigenetics in Cnidarians
    (Public Library of Science (PLOS), 2016-03-07) Marsh, Adam G.; Hoadley, Kenneth D.; Warner, Mark E.
    Coral reefs are under assault from stressors including global warming, ocean acidification, and urbanization. Knowing how these factors impact the future fate of reefs requires delineating stress responses across ecological, organismal and cellular scales. Recent advances in coral reef biology have integrated molecular processes with ecological fitness and have identified putative suites of temperature acclimation genes in a scleractinian coral, Acropora hyacinthus. We wondered what unique characteristics of these genes determined their coordinate expression in response to temperature acclimation, and whether or not other corals and cnidarians would likewise possess these features. Here, we focus on cytosine methylation as an epigenetic DNA modification that is responsive to environmental stressors. We identify common conserved patterns of cytosine-guanosine dinucleotide (CpG) motif frequencies in upstream promoter domains of different functional gene groups in two cnidarian genomes: a coral (Acropora digitifera) and an anemone (Nematostella vectensis). Our analyses show that CpG motif frequencies are prominent in the promoter domains of functional genes associated with environmental adaptation, particularly those identified in A. hyacinthus. Densities of CpG sites in upstream promoter domains near the transcriptional start site (TSS) are 1.38x higher than genomic background levels upstream of -2000 bp from the TSS. The increase in CpG usage suggests selection to allow for DNA methylation events to occur more frequently within 1 kb of the TSS. In addition, observed shifts in CpG densities among functional groups of genes suggest a potential role for epigenetic DNA methylation within promoter domains to impact functional gene expression responses in A. digitifera and N. vectensis.
Identifying promoter epigenetic sequence motifs among genes within specific functional groups establishes an approach to describe integrated cellular responses to environmental stress in reef corals and potential roles of epigenetics on survival and fitness in the face of global climate change.
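The CpG density comparison described in this abstract (promoter region versus genomic background) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `cpg_density` and `cpg_enrichment` are hypothetical, and density is assumed here to mean CpG dinucleotides per base of sequence.

```python
def cpg_density(seq: str) -> float:
    """Return CpG dinucleotides per base in seq (case-insensitive).

    Assumption: density is the count of "CG" divided by sequence length.
    """
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact number of sites.
    return seq.count("CG") / len(seq)


def cpg_enrichment(promoter: str, background: str) -> float:
    """Ratio of promoter CpG density to background CpG density (hypothetical helper)."""
    bg = cpg_density(background)
    if bg == 0:
        raise ValueError("background contains no CpG sites")
    return cpg_density(promoter) / bg
```

Under these assumptions, an enrichment of 1.38 would correspond to the 1.38x figure reported above, with `promoter` taken near the TSS and `background` taken from the region upstream of -2000 bp.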
  • Item
    RNA-Seq Analysis of Abdominal Fat in Genetically Fat and Lean Chickens Highlights a Divergence in Expression of Genes Controlling Adiposity, Hemostasis, and Lipid Metabolism
    (Public Library of Science (PLOS), 2015-10-07) Resnyk, Christopher W.; Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.; Simon, Jean; Le Bihan-Duval, Elisabeth; Duclos, Michel J.; Cogburn, Larry A.
    Genetic selection for enhanced growth rate in meat-type chickens (Gallus domesticus) is usually accompanied by excessive adiposity, which has negative impacts on both feed efficiency and carcass quality. Enhanced visceral fatness and several unique features of avian metabolism (i.e., fasting hyperglycemia and insulin insensitivity) mimic overt symptoms of obesity and related metabolic disorders in humans. Elucidation of the genetic and endocrine factors that contribute to excessive visceral fatness in chickens could also advance our understanding of human metabolic diseases. Here, RNA sequencing was used to examine differential gene expression in abdominal fat of genetically fat and lean chickens, which exhibit a 2.8-fold divergence in visceral fatness at 7 wk. Ingenuity Pathway Analysis revealed that many of 1687 differentially expressed genes are associated with hemostasis, endocrine function and metabolic syndrome in mammals. Among the highest expressed genes in abdominal fat, across both genotypes, were 25 differentially expressed genes associated with de novo synthesis and metabolism of lipids. Over-expression of numerous adipogenic and lipogenic genes in the FL chickens suggests that in situ lipogenesis in chickens could make a more substantial contribution to expansion of visceral fat mass than previously recognized. Distinguishing features of the abdominal fat transcriptome in lean chickens were high abundance of multiple hemostatic and vasoactive factors, transporters, and ectopic expression of several hormones/receptors, which could control local vasomotor tone and proteolytic processing of adipokines, hemostatic factors and novel endocrine factors. Over-expression of several thrombogenic genes in abdominal fat of lean chickens is quite opposite to the pro-thrombotic state found in obese humans. 
Clearly, divergent genetic selection for an extreme (2.5–2.8-fold) difference in visceral fatness provokes a number of novel regulatory responses that govern growth and metabolism of visceral fat in this unique avian model of juvenile-onset obesity and glucose-insulin imbalance.
  • Item
    pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature
    (PLOS (Public Library of Science), 2015-08-10) Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.
    BACKGROUND Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. METHODS In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these steps. RESULTS We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface ( The annotated literature corpus is also publicly
Copyright: Please look at individual material in order to see what the copyright and licensing terms are. Some material may be available for reuse under a Creative Commons license; other material may be the copyright of the individual author(s) or the publisher of the journal. Copyright lines may not be present in Accepted Manuscript versions so please refer to individual journal policies and/or look up the journal policies in Sherpa Romeo.