Open Access Publications
Permanent URI for this collection
Open access publications by faculty, staff, postdocs, and graduate students at the Center for Bioinformatics and Computational Biology.
Browse
Browsing Open Access Publications by Issue Date
Now showing 1 - 20 of 26
Results Per Page
Sort Options
Item The UniProtKB guide to the human proteome(Oxford University Press, 2/19/16) Breuza,Lionel; Poux,Sylvain; Estreicher,Anne; Famiglietti,Maria Livia; Magrane,Michele; Tognolli,Michael; Bridge,Alan; Baratin,Delphine; Redaschi,Nicole; UniProt Consortium; Lionel Breuza, Sylvain Poux, Anne Estreicher, Maria Livia Famiglietti, Michele Magrane, Michael Tognolli, Alan Bridge, Delphine Baratin, Nicole Redaschi and The UniProt Consortium; Wu, Cathy Huey-HwaAdvances in high-throughput and advanced technologies allow researchers to routinely perform whole genome and proteome analysis. For this purpose, they need high-quality resources providing comprehensive gene and protein sets for their organisms of interest. Using the example of the human proteome, we will describe the content of a complete proteome in the UniProt Knowledgebase (UniProtKB). We will show how manual expert curation of UniProtKB/Swiss-Prot is complemented by expert-driven automatic annotation to build a comprehensive, high-quality and traceable resource. We will also illustrate how the complexity of the human proteome is captured and structured in UniProtKB.Item Toll-Like Receptor Signaling in Vertebrates: Testing the Integration of Protein, Complex, and Pathway Data in the Protein Ontology Framework(Public Library of Science (PLOS), 2015-04-20) Arighi, Cecilia N.; Shamovsky, Veronica; Masci, Anna Maria; Ruttenberg, Alan; Smith, Barry; Natale, Darren A.; Wu, Cathy H.; D’Eustachio, Peter; Cecilia Arighi, Veronica Shamovsky, Anna Maria Masci, Alan Ruttenberg, Barry Smith, Darren A. Natale, Cathy Wu, Peter D’Eustachio; Arighi, Cecilia; Wu, CathyThe Protein Ontology (PRO) provides terms for and supports annotation of species-specific protein complexes in an ontology framework that relates them both to their components and to species-independent families of complexes. Comprehensive curation of experimentally known forms and annotations thereof is expected to expose discrepancies, differences, and gaps in our knowledge. We have annotated the early events of innate immune signaling mediated by Toll-Like Receptor 3 and 4 complexes in human, mouse, and chicken. The resulting ontology and annotation data set has allowed us to identify species-specific gaps in experimental data and possible functional differences between species, and to employ inferred structural and functional relationships to suggest plausible resolutions of these discrepancies and gaps.Item pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature(PLOS (Public Library of Science), 2015-08-10) Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.; Ruoyao Ding, Cecilia N. Arighi, Jung-Youn Lee, Cathy H. Wu, K. Vijay-Shanker; Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.BACKGROUND Automatically detecting gene/protein names in the literature and connecting them to databases records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. METHODS In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra species normalization. We have developed new heuristics to improve each of these phases. RESULTS We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9%(Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publiclyItem pGenN, a Gene Normalization Tool for Plant Genes and Proteins in Scientific Literature(Public Library of Science, 2015-08-10) Ding, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.; Ruoyao Ding, Cecilia N. Arighi, Jung-Youn Lee, Cathy H. Wu, K. Vijay-Shanker; Dina, Ruoyao; Arighi, Cecilia N.; Lee, Jung-Youn; Wu, Cathy H.; Vijay-Shanker, K.BACKGROUND Automatically detecting gene/protein names in the literature and connecting them to databases records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. METHODS In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra species normalization. We have developed new heuristics to improve each of these phases. RESULTS We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant relevant abstracts. Our system achieved an F-value of 88.9%(Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource. org/iprolink/).Item RNA-Seq Analysis of Abdominal Fat in Genetically Fat and Lean Chickens Highlights a Divergence in Expression of Genes Controlling Adiposity, Hemostasis, and Lipid Metabolism(PLOS (Public Library of Science), 2015-10-07) Resnyk, Christopher W.; Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.; Simon, Jean; Le Bihan-Duval, Elisabeth; Duclos, Michel J.; Cogburn, Larry A.; Christopher W. Resnyk, Chuming Chen, Hongzhan Huang, Cathy H. Wu, Jean Simon, Elisabeth Le Bihan-Duval, Michel J. Duclos, Larry A. Cogburn; Resnyk, Christopher W.; Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.; Cogburn, Larry A.Genetic selection for enhanced growth rate in meat-type chickens (Gallus domesticus) is usually accompanied by excessive adiposity, which has negative impacts on both feed efficiency and carcass quality. Enhanced visceral fatness and several unique features of avian metabolism (i.e., fasting hyperglycemia and insulin insensitivity) mimic overt symptoms of obesity and related metabolic disorders in humans. Elucidation of the genetic and endocrine factors that contribute to excessive visceral fatness in chickens could also advance our understanding of human metabolic diseases. Here, RNA sequencing was used to examine differential gene expression in abdominal fat of genetically fat and lean chickens, which exhibit a 2.8-fold divergence in visceral fatness at 7 wk. Ingenuity Pathway Analysis revealed that many of 1687 differentially expressed genes are associated with hemostasis, endocrine function and metabolic syndrome in mammals. Among the highest expressed genes in abdominal fat, across both genotypes, were 25 differentially expressed genes associated with de novo synthesis and metabolism of lipids. Over-expression of numerous adipogenic and lipogenic genes in the FL chickens suggests that in situ lipogenesis in chickens could make a more substantial contribution to expansion of visceral fat mass than previously recognized. Distinguishing features of the abdominal fat transcriptome in lean chickens were high abundance of multiple hemostatic and vasoactive factors, transporters, and ectopic expression of several hormones/receptors, which could control local vasomotor tone and proteolytic processing of adipokines, hemostatic factors and novel endocrine factors. Over-expression of several thrombogenic genes in abdominal fat of lean chickens is quite opposite to the pro-thrombotic state found in obese humans. Clearly, divergent genetic selection for an extreme (2.5–2.8-fold) difference in visceral fatness provokes a number of novel regulatory responses that govern growth and metabolism of visceral fat in this unique avian model of juvenile-onset obesity and glucose-insulin imbalance.Item RNA-Seq Analysis of Abdominal Fat in Genetically Fat and Lean Chickens Highlights a Divergence in Expression of Genes Controlling Adiposity, Hemostasis, and Lipid Metabolism(Public Library of Science (PLOS), 2015-10-07) Resnyk, Christopher W.; Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.; Simon, Jean; Le Bihan-Duval, Elisabeth; Duclos, Michel J.; Cogburn, Larry A.; Christopher W. Resnyk, Chuming Chen, Hongzhan Huang, Cathy H. Wu, Jean Simon, Elisabeth Le Bihan-Duval, Michel J. Duclos, Larry A. Cogburn; Resnyk, Christopher W.; Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.; Cogburn, Larry A.Genetic selection for enhanced growth rate in meat-type chickens (Gallus domesticus) is usually accompanied by excessive adiposity, which has negative impacts on both feed efficiency and carcass quality. Enhanced visceral fatness and several unique features of avian metabolism (i.e., fasting hyperglycemia and insulin insensitivity) mimic overt symptoms of obesity and related metabolic disorders in humans. Elucidation of the genetic and endocrine factors that contribute to excessive visceral fatness in chickens could also advance our understanding of human metabolic diseases. Here, RNA sequencing was used to examine differential gene expression in abdominal fat of genetically fat and lean chickens, which exhibit a 2.8-fold divergence in visceral fatness at 7 wk. Ingenuity Pathway Analysis revealed that many of 1687 differentially expressed genes are associated with hemostasis, endocrine function and metabolic syndrome in mammals. Among the highest expressed genes in abdominal fat, across both genotypes, were 25 differentially expressed genes associated with de novo synthesis and metabolism of lipids. Over-expression of numerous adipogenic and lipogenic genes in the FL chickens suggests that in situ lipogenesis in chickens could make a more substantial contribution to expansion of visceral fat mass than previously recognized. Distinguishing features of the abdominal fat transcriptome in lean chickens were high abundance of multiple hemostatic and vasoactive factors, transporters, and ectopic expression of several hormones/receptors, which could control local vasomotor tone and proteolytic processing of adipokines, hemostatic factors and novel endocrine factors. Over-expression of several thrombogenic genes in abdominal fat of lean chickens is quite opposite to the pro-thrombotic state found in obese humans. Clearly, divergent genetic selection for an extreme (2.5–2.8-fold) difference in visceral fatness provokes a number of novel regulatory responses that govern growth and metabolism of visceral fat in this unique avian model of juvenile-onset obesity and glucose-insulin imbalance.Item Bioinformatics Knowledge Map for Analysis of Beta-Catenin Function in Cancer(Public Library of Science, 2015-10-28) Çelen, İrem; Ross, Karen E.; Arighi, Cecilia N.; Wu, Cathy H.; İrem Çelen, Karen E. Ross, Cecilia N. Arighi, Cathy H. Wu; Çelen, Irem; Ross, Karen E.; Arighi, Cecilia N.; Wu, Cathy H.Given the wealth of bioinformatics resources and the growing complexity of biological information, it is valuable to integrate data from disparate sources to gain insight into the role of genes/proteins in health and disease. We have developed a bioinformatics framework that combines literature mining with information from biomedical ontologies and curated databases to create knowledge “maps” of genes/proteins of interest.We applied this approach to the study of beta-catenin, a cell adhesion molecule and transcriptional regulator implicated in cancer. The knowledge map includes post-translational modifications (PTMs), protein- protein interactions, disease-associated mutations, and transcription factors coactivated by beta-catenin and their targets and captures the major processes in which betacatenin is known to participate. Using the map, we generated testable hypotheses about beta-catenin biology in normal and cancer cells. By focusing on proteins participating in multiple relation types, we identified proteins that may participate in feedback loops regulating beta-catenin transcriptional activity. By combining multiple network relations with PTM proteoform- specific functional information, we proposed a mechanism to explain the observation that the cyclin dependent kinase CDK5 positively regulates beta-catenin co-activator activity. Finally, by overlaying cancer-associated mutation data with sequence features, we observed mutation patterns in several beta-catenin PTM sites and PTM enzyme binding sites that varied by tissue type, suggesting multiple mechanisms by which beta-catenin mutations can contribute to cancer. The approach described, which captures rich information for molecular species from genes and proteins to PTM proteoforms, is extensible to other proteins and their involvement in disease.Item Predicting nsSNPs that disrupt protein-protein interactions using docking(IEEE Computational Intelligence Society ; IEEE Computer Society ; IEEE Control Systems Society ; IEEE Engineering in Medicine and Biology Society ; The Association for Computing Machinery, 2016-01-22) Goodacre, Norman; Edwards, Nathan; Danielsen, Mark; Uetz, Peter; Wu, Cathy H.; Norman Goodacre, Nathan Edwards, Mark Danielsen, Peter Uetz, Cathy Wu; Wu, Cathy H.The human genome contains a large number of protein polymorphisms due to individual genome variation. How many of these polymorphisms lead to altered protein-protein interaction is unknown. We have developed a method to address this question. The intersection of the SKEMPI database (of affinity constants among interacting proteins) and CAPRI 4.0 docking benchmark was docked using HADDOCK, leading to a training set of 166 mutant pairs. A random forest classifier that uses the differences in resulting docking scores between the 166 mutant pairs and their wild-types was used, to distinguish between variants that have either completely or partially lost binding ability. 50% of non-binders were correctly predicted with a false discovery rate of only 2%. The model was tested on a set of 15 HIV-1 - human, as well as 7 human - human glioblastoma-related, mutant proteins pairs: 50% of combined non-binders were correctly predicted with a false discovery rate of 10%. The model was also used to identify 10 protein-protein interactions between human proteins and their HIV-1 partners that are likely to be abolished by rare non-synonymous single-nucleotide polymorphisms (nsSNPs). These nsSNPs may represent novel and potentially therapeutically-valuable targets for anti-viral therapy by disruption of viral binding.Item Distribution of CpG Motifs in Upstream Gene Domains in a Reef Coral and Sea Anemone: Implications for Epigenetics in Cnidarians(Public Library of Science (PLOS), 2016-03-07) Marsh, Adam G.; Hoadley, Kenneth D.; Warner, Mark E.; Adam G. Marsh, Kenneth D. Hoadley, Mark E. Warner; Marsh, Adam G.; Hoadley, Kenneth D.; Warner, Mark E.Coral reefs are under assault from stressors including global warming, ocean acidification, and urbanization. Knowing how these factors impact the future fate of reefs requires delineating stress responses across ecological, organismal and cellular scales. Recent advances in coral reef biology have integrated molecular processes with ecological fitness and have identified putative suites of temperature acclimation genes in a Scleractinian coral Acropora hyacinthus.We wondered what unique characteristics of these genes determined their coordinate expression in response to temperature acclimation, and whether or not other corals and cnidarians would likewise possess these features. Here, we focus on cytosine methylation as an epigenetic DNA modification that is responsive to environmental stressors. We identify common conserved patterns of cytosine-guanosine dinucleotide (CpG) motif frequencies in upstream promoter domains of different functional gene groups in two cnidarian genomes: a coral (Acropora digitifera) and an anemone (Nematostella vectensis). Our analyses show that CpG motif frequencies are prominent in the promoter domains of functional genes associated with environmental adaptation, particularly those identified in A. hyacinthus. Densities of CpG sites in upstream promoter domains near the transcriptional start site (TSS) are 1.38x higher than genomic background levels upstream of -2000 bp from the TSS. The increase in CpG usage suggests selection to allow for DNA methylation events to occur more frequently within 1 kb of the TSS. In addition, observed shifts in CpG densities among functional groups of genes suggests a potential role for epigenetic DNA methylation within promoter domains to impact functional gene expression responses in A. digitifera and N. vectensis. Identifying promoter epigenetic sequence motifs among genes within specific functional groups establishes an approach to describe integrated cellular responses to environmental stress in reef corals and potential roles of epigenetics on survival and fitness in the face of global climate change.Item Intercellular Variability in Protein Levels from Stochastic Expression and Noisy Cell Cycle Processes(Public Library Science, 2016-08-18) Soltani,Mohammad; Vargas-Garcia,Cesar A.; Antunes,Duarte; Singh,Abhyudai; Mohammad Soltani, Cesar A. Vargas-Garcia, Duarte Antunes, Abhyudai Singh; Singh, AbhyudaiInside individual cells, expression of genes is inherently stochastic and manifests as cell-to-cell variability or noise in protein copy numbers. Since proteins half-lives can be comparable to the cell-cycle length, randomness in cell-division times generates additional intercellular variability in protein levels. Moreover, as many mRNA/protein species are expressed at low-copy numbers, errors incurred in partitioning of molecules between two daughter cells are significant. We derive analytical formulas for the total noise in protein levels when the cell-cycle duration follows a general class of probability distributions. Using a novel hybrid approach the total noise is decomposed into components arising from i) stochastic expression; ii) partitioning errors at the time of cell division and iii) random cell-division events. These formulas reveal that random cell-division times not only generate additional extrinsic noise, but also critically affect the mean protein copy numbers and intrinsic noise components. Counter intuitively, in some parameter regimes, noise in protein levels can decrease as cell-division times become more stochastic. Computations are extended to consider genome duplication, where transcription rate is increased at a random point in the cell cycle. We systematically investigate how the timing of genome duplication influences different protein noise components. Intriguingly, results show that noise contribution from stochastic expression is minimized at an optimal genome-duplication time. Our theoretical results motivate new experimental methods for decomposing protein noise levels from synchronized and asynchronized single-cell expression data. Characterizing the contributions of individual noise mechanisms will lead to precise estimates of gene expression parameters and techniques for altering stochasticity to change phenotype of individual cells.Item InterPro in 2017––beyond protein family and domain annotations(Oxford University Press, 2016-11-28) Finn, Robert D.; Attwood, Teresa K.; Babbitt, Patricia C.; Bateman, Alex; Bork, Peer; Bridge, Alan J.; Chang, Hsin-Yu; Doszt´anyi, Zsuzsanna; El-Gebali, Sara; Fraser, Matthew; Gough, Julian; Haft, David; Holliday, Gemma L.; Huang, Hongzhan; Huang, Xiaosong; Letunic, Ivica; Lopez, Rodrigo; Lu, Shennan; Marchler-Bauer, Aron; Mi, Huaiyu; Mistry, Jaina; Natale, Darren A.; Necci, Marco; Nuka, Gift; Orengo, Christine A.; Park, Youngmi; Pesseat, Sebastien; Piovesan, Damiano; Potter, Simon C.; Rawlings, Neil D.; Redaschi, Nicole; Richardson, Lorna; Rivoire, Catherine; Sangrador-Vegas, Amaia; Sigrist, Christian; Sillitoe, Ian; Smithers, Ben; Squizzato, Silvano; Sutton, Granger; Thanki, Narmada; Thomas, Paul D.; Tosatto, Silvio C. E.; Wu, Cathy H.; Xenarios, Ioannis; Yeh, Lai-Su; Young, Siew-Yit; Mitchell, Alex L.; Robert D. Finn, Teresa K. Attwood, Patricia C. Babbitt, Alex Bateman, Peer Bork, Alan J. Bridge, Hsin-Yu Chang, Zsuzsanna Doszt´anyi, Sara El-Gebali, Matthew Fraser, Julian Gough, David Haft, Gemma L. Holliday, Hongzhan Huang, Xiaosong Huang, Ivica Letunic, Rodrigo Lopez, Shennan Lu, Aron Marchler-Bauer, Huaiyu Mi, Jaina Mistry, Darren A Natale, Marco Necci, Gift Nuka, Christine A. Orengo, Youngmi Park, Sebastien Pesseat, Damiano Piovesan, Simon C. Potter, Neil D. Rawlings, Nicole Redaschi, Lorna Richardson, Catherine Rivoire, Amaia Sangrador-Vegas, Christian Sigrist, Ian Sillitoe, Ben Smithers, Silvano Squizzato, Granger Sutton, Narmada Thanki, Paul D Thomas, Silvio C. E. Tosatto, Cathy H.Wu, Ioannis Xenarios, Lai-Su Yeh, Siew-Yit Young and Alex L. Mitchell; Wu, Cathy H.; Huang, HongzhanInterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against Inter- Pro’s predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.Item UniProt: the universal protein knowledgebase(Oxford University Press on behalf of Nucleic Acids Research, 2016-11-28) The UniProt Consortium; Wu, Cathy H.; The UniProt Consortium; Wu, Cathy H.The UniProt knowledgebase is a large resource of protein sequences and associated detailed annotation. The database contains over 60 million sequences, of which over half a million sequences have been curated by experts who critically review experimental and predicted data for each protein. The remainder are automatically annotated based on rule systems that rely on the expert curated knowledge. Since our last update in 2014, we have more than doubled the number of reference proteomes to 5631, giving a greater coverage of taxonomic diversity. We implemented a pipeline to remove redundant highly similar proteomes that were causing excessive redundancy in UniProt. The initial run of this pipeline reduced the number of sequences in UniProt by 47 million. For our users interested in the accessory proteomes, we have made available sets of pan proteome sequences that cover the diversity of sequences for each species that is found in its strains and sub-strains. To help interpretation of genomic variants, we provide tracks of detailed protein information for the major genome browsers. We provide a SPARQL endpoint that allows complex queries of the more than 22 billion triples of data in UniProt (http://sparql.uniprot.org/). UniProt resources can be accessed via the website at http://www.uniprot.org/.Item Protein Ontology (PRO): enhancing and scaling up the representation of protein entities(Oxford University Press, 2016-11-28) Natale, Darren A.; Arighi, Cecilia N.; Blake, Judith A.; Bona, Jonathan; Chen, Chuming; Chen, Sheng-Chih; Christie, Karen R.; Cowart, Julie; D’Eustachio, Peter; Diehl, Alexander D.; Drabkin, Harold J.; Duncan, William D.; Huang, Hongzhan; Ren, Jia; Ross, Karen; Ruttenberg, Alan; Shamovsky, Veronica; Smith, Barry; Wang, Qinghua; Zhang, Jian; El-Sayed, Abdelrahman; Wu, Cathy H.; Darren A. Natale, Cecilia N. Arighi, Judith A. Blake, Jonathan Bona, Chuming Chen, Sheng-Chih Chen, Karen R. Christie, Julie Cowart, Peter D’Eustachio, Alexander D. Diehl, Harold J. Drabkin, William D. Duncan, Hongzhan Huang, Jia Ren, Karen Ross, Alan Ruttenberg, Veronica Shamovsky, Barry Smith, Qinghua Wang, Jian Zhang, Abdelrahman El-Sayed and Cathy H. Wu; Arighi, Cecilia N.; Chen, Chuming; Chen, Sheng-Chih; Cowart, Julie; Huang, Hongzhan; Ren, Jia; Wang, Qinghua; Wu, Cathy H.The Protein Ontology (PRO; http://purl.obolibrary. org/obo/pr) formally defines and describes taxonspecific and taxon-neutral protein-related entities in three major areas: proteins related by evolution; proteins produced from a given gene; and proteincontaining complexes. PRO thus serves as a tool for referencing protein entities at any level of specificity. To enhance this ability, and to facilitate the comparison of such entities described in different resources, we developed a standardized representation of proteoforms using UniProtKB as a sequence reference and PSI-MOD as a post-translationalmodification reference. We illustrate its use in facilitating an alignment between PRO and Reactome protein entities. We also address issues of scalability, describing our first steps into the use of text mining to identify protein-related entities, the large-scale import of proteoform information from expert curated resources, and our ability to dynamically generate PRO terms. Web views for individual terms are now more informative about closely-related terms, including for example an interactive multiple sequence alignment. Finally, we describe recent improvement in semantic utility, with PRO now represented in OWL and as a SPARQL endpoint. These developments will further support the anticipated growth of PRO and facilitate discoverability of and allow aggregation of data relating to protein entities.Item ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing(Nature Publishing Group, 2017-03-16) Francis, Felix; Dumas, Michael D.; Wisser, Randall J.; Felix Francis, Michael D. Dumas & Randall J. Wisser; Francis, Felix; Dumas, Michael D.; Wisser, Randall J.Isolating and sequencing specific regions in a genome is a cornerstone of molecular biology. This has been facilitated by computationally encoding the thermodynamics of DNA hybridization for automated design of hybridization and priming oligonucleotides. However, the repetitive composition of genomes challenges the identification of target-specific oligonucleotides, which limits genetics and genomics research on many species. Here, a tool called ThermoAlign was developed that ensures the design of target-specific primer pairs for DNA amplification. This is achieved by evaluating the thermodynamics of hybridization for full-length oligonucleotide-template alignments — thermoalignments — across the genome to identify primers predicted to bind specifically to the target site. For amplificationbased resequencing of regions that cannot be amplified by a single primer pair, a directed graph analysis method is used to identify minimum amplicon tiling paths. Laboratory validation by standard and long-range polymerase chain reaction and amplicon resequencing with maize, one of the most repetitive genomes sequenced to date (≈85% repeat content), demonstrated the specificity-by-design functionality of ThermoAlign. ThermoAlign is released under an open source license and bundled in a dependency-free container for wide distribution. It is anticipated that this tool will facilitate multiple applications in genetics and genomics and be useful in the workflow of high-throughput targeted resequencing studies.Item ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing(Nature Publishing Group, 2017-03-16) Francis, Felix; Dumas, Michael D.; Wisser, RJ; Felix Francis, Michael D. Dumas and Randall J.Wisser; Wisser, Randall JeromeIsolating and sequencing specific regions in a genome is a cornerstone of molecular biology. This has been facilitated by computationally encoding the thermodynamics of DNA hybridization for automated design of hybridization and priming oligonucleotides. However, the repetitive composition of genomes challenges the identification of target-specific oligonucleotides, which limits genetics and genomics research on many species. Here, a tool called ThermoAlign was developed that ensures the design of target-specific primer pairs for DNA amplification. This is achieved by evaluating the thermodynamics of hybridization for full-length oligonucleotide-template alignments — thermoalignments — across the genome to identify primers predicted to bind specifically to the target site. For amplificationbased resequencing of regions that cannot be amplified by a single primer pair, a directed graph analysis method is used to identify minimum amplicon tiling paths. Laboratory validation by standard and long-range polymerase chain reaction and amplicon resequencing with maize, one of the most repetitive genomes sequenced to date (≈85% repeat content), demonstrated the specificity-by-design functionality of ThermoAlign. ThermoAlign is released under an open source license and bundled in a dependency-free container for wide distribution. It is anticipated that this tool will facilitate multiple applications in genetics and genomics and be useful in the workflow of high-throughput targeted resequencing studies.Item Whole-genome sequencing identifies I-SceI-mediated transgene integration sites in Xenopus tropicalis snai2: eGFP line(G3: Genes | Genomes | Genetics, 2022-02-16) Wang, Jian; Lu, Congyu; Wei, ShuoTransgenesis with the meganuclease I-SceI is a safe and efficient method, but the underlying mechanisms remain unclear due to the lack of information on transgene localization. Using I-SceI, we previously developed a transgenic Xenopus tropicalis line expressing enhanced green fluorescent protein driven by the neural crest-specific snai2 promoter/enhancer, which is a powerful tool for studying neural crest development and craniofacial morphogenesis. Here we carried out whole-genome shotgun sequencing for the snai2: eGFP embryos to identify the transgene integration sites. With a 19x sequencing coverage, we estimated that 6 copies of the transgene were inserted into the X. tropicalis genome in the hemizygous transgenic embryos. Two transgene integration loci adjacent to each other were identified in a non-coding region on Chromosome 1, possibly as a result of duplication after a single transgene insertion. Interestingly, genomic DNA at the boundaries of the transgene integration loci contains short sequences homologous to the I-SceI recognition site, suggesting that the integration was not random but probably mediated by sequence homology. To our knowledge, our work represents the first genome-wide sequencing study on a transgenic organism generated with I-SceI, which is useful for evaluating the potential genetic effects of I-SceI-mediated transgenesis and further understanding the mechanisms underlying this transgenic method.Item DNA Methylation Analysis Reveals Distinct Patterns in Satellite Cell–Derived Myogenic Progenitor Cells of Subjects with Spastic Cerebral Palsy(Journal of Personalized Medicine, 2022-11-30) Robinson, Karyn G.; Marsh, Adam G.; Lee, Stephanie K.; Hicks, Jonathan; Romero, Brigette; Batish, Mona; Crowgey, Erin L.; Shrader, M. Wade; Akins, Robert E.Spastic type cerebral palsy (CP) is a complex neuromuscular disorder that involves altered skeletal muscle microanatomy and growth, but little is known about the mechanisms contributing to muscle pathophysiology and dysfunction. Traditional genomic approaches have provided limited insight regarding disease onset and severity, but recent epigenomic studies indicate that DNA methylation patterns can be altered in CP. Here, we examined whether a diagnosis of spastic CP is associated with intrinsic DNA methylation differences in myoblasts and myotubes derived from muscle resident stem cell populations (satellite cells; SCs). Twelve subjects were enrolled (6 CP; 6 control) with informed consent/assent. Skeletal muscle biopsies were obtained during orthopedic surgeries, and SCs were isolated and cultured to establish patient–specific myoblast cell lines capable of proliferation and differentiation in culture. DNA methylation analyses indicated significant differences at 525 individual CpG sites in proliferating SC–derived myoblasts (MB) and 1774 CpG sites in differentiating SC–derived myotubes (MT). Of these, 79 CpG sites were common in both culture types. The distribution of differentially methylated 1 Mbp chromosomal segments indicated distinct regional hypo– and hyper–methylation patterns, and significant enrichment of differentially methylated sites on chromosomes 12, 13, 14, 15, 18, and 20. Average methylation load across 2000 bp regions flanking transcriptional start sites was significantly different in 3 genes in MBs, and 10 genes in MTs. SC derived MBs isolated from study participants with spastic CP exhibited fundamental differences in DNA methylation compared to controls at multiple levels of organization that may reveal new targets for studies of mechanisms contributing to muscle dysregulation in spastic CP.Item Transcriptomic Signature of the Simulated Microgravity Response in Caenorhabditis elegans and Comparison to Spaceflight Experiments(Cells, 2023-01-10) Çelen, İrem; Jayasinghe, Aroshan; Doh, Jung H.; Sabanayagam, Chandran R.Given the growing interest in human exploration of space, it is crucial to identify the effects of space conditions on biological processes. Here, we analyze the transcriptomic response of Caenorhabditis elegans to simulated microgravity and observe the maintained transcriptomic response after returning to ground conditions for four, eight, and twelve days. We show that 75% of the simulated microgravity-induced changes on gene expression persist after returning to ground conditions for four days while most of these changes are reverted after twelve days. Our results from integrative RNA-seq and mass spectrometry analyses suggest that simulated microgravity affects longevity-regulating insulin/IGF-1 and sphingolipid signaling pathways. Finally, we identified 118 genes that are commonly differentially expressed in simulated microgravity- and space-exposed worms. Overall, this work provides insight into the effect of microgravity on biological systems during and after exposure.Item Machine learning classifiers predict key genomic and evolutionary traits across the kingdoms of life(Scientific Reports, 2023-02-06) Hallee, Logan; Khomtchouk, Bohdan B.In this study, we investigate how an organism’s codon usage bias can serve as a predictor and classifier of various genomic and evolutionary traits across the domains of life. We perform secondary analysis of existing genetic datasets to build several AI/machine learning models. When trained on codon usage patterns of nearly 13,000 organisms, our models accurately predict the organelle of origin and taxonomic identity of nucleotide samples. We extend our analysis to identify the most influential codons for phylogenetic prediction with a custom feature ranking ensemble. Our results suggest that the genetic code can be utilized to train accurate classifiers of taxonomic and phylogenetic features. We then apply this classification framework to open reading frame (ORF) detection. Our statistical model assesses all possible ORFs in a nucleotide sample and rejects or deems them plausible based on the codon usage distribution. Our dataset and analyses are made publicly available on GitHub and the UCI ML Repository to facilitate open-source reproducibility and community engagement.Item Transcriptional regulation of Sis1 promotes fitness but not feedback in the heat shock response(eLife, 2023-05-17) Grade, Rania; Singh, Abhyudai; Ali, Asif; Pincus, DavidThe heat shock response (HSR) controls expression of molecular chaperones to maintain protein homeostasis. Previously, we proposed a feedback loop model of the HSR in which heat-denatured proteins sequester the chaperone Hsp70 to activate the HSR, and subsequent induction of Hsp70 deactivates the HSR (Krakowiak et al., 2018; Zheng et al., 2016). However, recent work has implicated newly synthesized proteins (NSPs) – rather than unfolded mature proteins – and the Hsp70 co-chaperone Sis1 in HSR regulation, yet their contributions to HSR dynamics have not been determined. Here, we generate a new mathematical model that incorporates NSPs and Sis1 into the HSR activation mechanism, and we perform genetic decoupling and pulse-labeling experiments to demonstrate that Sis1 induction is dispensable for HSR deactivation. Rather than providing negative feedback to the HSR, transcriptional regulation of Sis1 by Hsf1 promotes fitness by coordinating stress granules and carbon metabolism. These results support an overall model in which NSPs signal the HSR by sequestering Sis1 and Hsp70, while induction of Hsp70 – but not Sis1 – attenuates the response.