Supervised machine learning for small RNA informatics and big data analytics in plants

Abstract
My dissertation work focused on (1) the development of supervised machine learning approaches for plant small RNAs (i.e., microRNAs (miRNAs) and their secondary siRNA products, and heterochromatic siRNAs (hc-siRNAs)), and (2) the application of these data analytics tools to analyze next-generation sequencing (NGS) data of increasing size and complexity, bordering on ‘big data’. Male germline-associated 21- and 24-nt phased siRNAs (reproductive phasiRNAs) are highly enriched and numerous in the Poaceae (grasses) and crucial for reproductive tissue development and success. Little is known about the characteristics and functions of reproductive phasiRNAs in the grasses despite significant genomic resources, experimental data, and a growing set of computational tools. Given the important role that grasses such as maize and rice play as a primary food source in many countries and as influential factors in the global economy, a deeper understanding of the characteristics, possible targets and functions, and biogenesis of these phasiRNAs is required. I present a new machine learning-based approach for in-depth characterization of phasiRNAs, demonstrating highly informative sequence-based and positional features, strand specificity, and position-specific nucleotide biases potentially influencing AGO sorting. One major goal of my work was to use these tools, together with several other co-published tools, to assess the landscape of sRNAs, including phasiRNAs and miRNAs (especially miR2118 and miR2275, triggers of reproductive phasiRNAs), in 41 diverse angiosperm species (38 monocot species, including families from the Acorales and Arecales to the Zingiberales) dating back at least 200 million years. I demonstrate the origins and conservation of miR2118 and miR2275 across the monocot species and show that 21- and 24-PHAS loci (the sources of reproductive phasiRNAs) are particularly numerous and abundant in the inflorescence tissues of the grasses.
I show that in silico trigger identification indicates diverse mechanisms for the production of 21- and 24-nt phasiRNAs. I conclude that the prevalence of reproductive phasiRNAs in the flowering plants beyond the grasses shows their broad roles and importance in plant reproductive tissue development.