Open Access Publications
Open access publications by faculty, staff, postdocs, and graduate students at the Center for Bioinformatics and Computational Biology.
Browsing Open Access Publications by Author "Arighi, Cecilia N."
Bioinformatics Knowledge Map for Analysis of Beta-Catenin Function in Cancer (Public Library of Science, 2015-10-28)
Çelen, İrem; Ross, Karen E.; Arighi, Cecilia N.; Wu, Cathy H.
Given the wealth of bioinformatics resources and the growing complexity of biological information, it is valuable to integrate data from disparate sources to gain insight into the role of genes/proteins in health and disease. We have developed a bioinformatics framework that combines literature mining with information from biomedical ontologies and curated databases to create knowledge “maps” of genes/proteins of interest. We applied this approach to the study of beta-catenin, a cell adhesion molecule and transcriptional regulator implicated in cancer. The knowledge map includes post-translational modifications (PTMs), protein-protein interactions, disease-associated mutations, and transcription factors coactivated by beta-catenin and their targets, and captures the major processes in which beta-catenin is known to participate. Using the map, we generated testable hypotheses about beta-catenin biology in normal and cancer cells. By focusing on proteins participating in multiple relation types, we identified proteins that may participate in feedback loops regulating beta-catenin transcriptional activity. By combining multiple network relations with PTM proteoform-specific functional information, we proposed a mechanism to explain the observation that the cyclin-dependent kinase CDK5 positively regulates beta-catenin co-activator activity. Finally, by overlaying cancer-associated mutation data with sequence features, we observed mutation patterns in several beta-catenin PTM sites and PTM enzyme binding sites that varied by tissue type, suggesting multiple mechanisms by which beta-catenin mutations can contribute to cancer.
The approach described, which captures rich information for molecular species from genes and proteins to PTM proteoforms, is extensible to other proteins and their involvement in disease.

pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature (Public Library of Science, 2015-08-10)
Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.
BACKGROUND: Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines.
METHODS: In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases.
RESULTS: We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (precision 90.9% and recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).

Protein Ontology (PRO): enhancing and scaling up the representation of protein entities (Oxford University Press, 2016-11-28)
Natale, Darren A.; Arighi, Cecilia N.; Blake, Judith A.; Bona, Jonathan; Chen, Chuming; Chen, Sheng-Chih; Christie, Karen R.; Cowart, Julie; D’Eustachio, Peter; Diehl, Alexander D.; Drabkin, Harold J.; Duncan, William D.; Huang, Hongzhan; Ren, Jia; Ross, Karen; Ruttenberg, Alan; Shamovsky, Veronica; Smith, Barry; Wang, Qinghua; Zhang, Jian; El-Sayed, Abdelrahman; Wu, Cathy H.
The Protein Ontology (PRO; http://purl.obolibrary.org/obo/pr) formally defines and describes taxon-specific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and protein-containing complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translational modification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert-curated resources, and our ability to dynamically generate PRO terms.
Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.

Toll-Like Receptor Signaling in Vertebrates: Testing the Integration of Protein, Complex, and Pathway Data in the Protein Ontology Framework (Public Library of Science (PLOS), 2015-04-20)
Arighi, Cecilia N.; Shamovsky, Veronica; Masci, Anna Maria; Ruttenberg, Alan; Smith, Barry; Natale, Darren A.; Wu, Cathy H.; D’Eustachio, Peter
The Protein Ontology (PRO) provides terms for and supports annotation of species-specific protein complexes in an ontology framework that relates them both to their components and to species-independent families of complexes. Comprehensive curation of experimentally known forms and annotations thereof is expected to expose discrepancies, differences, and gaps in our knowledge. We have annotated the early events of innate immune signaling mediated by Toll-Like Receptor 3 and 4 complexes in human, mouse, and chicken. The resulting ontology and annotation data set has allowed us to identify species-specific gaps in experimental data and possible functional differences between species, and to employ inferred structural and functional relationships to suggest plausible resolutions of these discrepancies and gaps.