Development of a novel, reference-free tool for the comprehensive evaluation of genome assembly quality and its application to establish a reference assembly for Chinese hamster ovary (CHO) cells
Date
2019
Publisher
University of Delaware
Abstract
Whole genome assemblies are regularly becoming available for more organisms due to the reduced time and costs of DNA sequencing. Multiple assemblies may be created for the same species, with one being selected as the reference genome to guide wet-lab and bioinformatics studies. To select the most complete, continuous, and accurate assembly for an organism of interest, improved methods for quality assessment of assemblies are necessary. Currently, most methods to evaluate genome assembly quality focus on completeness or continuity only. If accuracy is assessed, a high-quality reference genome for the organism of interest is often required for a direct sequence comparison.
Here, we emphasize the need for assembly quality assessment by using as a case study the creation of multiple genome assemblies for the Chinese hamster (CH) and Chinese hamster ovary (CHO) cells, the preferred platform for therapeutic protein production. The highest quality assembly, CH PICR, was created by combining multiple assemblies, with the primary base assembly developed from long-read sequencing data. CH PICR was selected through manual quality assessment, annotated, and made available on the NCBI RefSeq database as the new reference genome.
We then describe the development of a novel tool, EvalDNA (Evaluation of De Novo Assemblies), to facilitate the evaluation of mammalian genome assembly quality and the selection of the reference genome. EvalDNA overcomes the requirement of an additional genome assembly by using a machine-learning model to integrate a variety of quality metrics into a single, comprehensive quality score. The provided model can explain approximately 86% of the variation in reference-based quality scores in the test data, which consist of different draft chromosome assemblies with real/simulated errors. EvalDNA also distinguishes itself from current assembly evaluation tools because EvalDNA quality scores generated by the same model are comparable across different organisms.
EvalDNA was used to evaluate the novel assemblies of the CH genome. The resulting scores showed that CH PICR was of the highest quality, agreeing with the manual quality evaluation. This observation confirms EvalDNA's ability to score assemblies from organisms not used in the training data. EvalDNA's ability to compare assemblies from different assemblers and organisms is also examined.
Finally, we demonstrate the benefits of having an improved CH reference genome assembly in CHO cell genetic engineering. Successful gene knock-downs and knock-outs in CHO cells can prevent the expression of difficult-to-remove host cell proteins (HCPs). HCPs, if not removed, can cause problems in the stability, safety, and efficacy of the biotherapeutic protein being produced. Here, the CH PICR reference genome was used to identify new knockout targets whose predicted functions and characteristics are similar to those of several difficult-to-remove HCPs.
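The scoring idea summarized in the abstract (several quality metrics folded into one score, with the model judged by how much of the variation in reference-based scores it explains) can be illustrated with a minimal sketch. This is not EvalDNA's implementation: the abstract does not specify the model type, so a plain weighted sum stands in for it, and the metric names, weights, and function names below are hypothetical.

```python
from typing import Dict, Sequence

# Hypothetical metric names and weights, for illustration only. They are
# NOT EvalDNA's actual features or fitted coefficients.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "completeness": 0.5,
    "contiguity": 0.3,
    "accuracy": 0.2,
}


def combined_score(metrics: Dict[str, float],
                   weights: Dict[str, float],
                   intercept: float = 0.0) -> float:
    """Fold several quality metrics into one score via a weighted sum.

    A weighted sum is an assumed stand-in for the trained model.
    """
    return intercept + sum(weights[name] * metrics[name] for name in weights)


def r_squared(reference: Sequence[float], predicted: Sequence[float]) -> float:
    """Fraction of variation in reference-based scores explained by predictions."""
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_ref = sum(reference) / len(reference)
    # Total variation of the reference scores around their mean.
    ss_total = sum((y - mean_ref) ** 2 for y in reference)
    # Variation left unexplained by the predictions.
    ss_residual = sum((y - p) ** 2 for y, p in zip(reference, predicted))
    if ss_total == 0:
        raise ValueError("reference scores have no variation")
    return 1.0 - ss_residual / ss_total
```

Under this definition, a value of about 0.86 on held-out data would correspond to the "approximately 86% of the variation" reported above.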