Enabling scalable data analysis for large computational structural biology datasets on large distributed memory systems supported by the MapReduce paradigm

Zhang, Boyu

Author(s)	Zhang, Boyu
Date Accessioned	2015-10-27T12:49:17Z
Date Available	2015-10-27T12:49:17Z
Publication Date	2015
Abstract	Today, petascale distributed memory systems perform large-scale simulations and generate massive amounts of data in a distributed fashion at unprecedented rates. This massive amount of data presents new challenges for the scientists analyzing the data. In order to classify and cluster this data, traditional analysis methods require the comparison of single records with each other in an iterative process and therefore involve moving data across nodes of the system. When both the data and the number of nodes increase, classification and clustering methods can put increasing pressure on the system's storage and bandwidth. Thus, the methods become inefficient and do not scale. New methodologies are needed to analyze data when it is distributed across nodes of large distributed memory systems. In general, when analyzing such scientific data, we focus on specific properties of the data records. For example, in structural biology datasets, properties include the molecular geometry or the location of a molecule in a docking pocket. Based on this observation, we propose a methodology that enables the scalable analysis for large datasets, composed of millions of individual data records, in a distributed manner on large distributed memory systems. The methodology comprises two general steps. The first step extracts concise properties or features of each data record in isolation and represents them as metadata in parallel. The second step performs the analysis (i.e., classification or clustering) on the extracted properties (i.e., metadata) using machine learning techniques. 
We apply the methodology to three different computational structural biology datasets to (1) identify class memberships for large RNA sequences from their secondary structures, (2) identify geometrical features that can be used to predict class memberships for structural biology datasets containing ligand conformations from protein-ligand docking simulations, and (3) find recurrent folding patterns within and across trajectories (i.e., intra- and inter-trajectory, respectively) in multiple trajectories sampled from folding simulations. Since our method naturally fits in the MapReduce paradigm, we adapt it for different MapReduce frameworks (i.e., Hadoop and MapReduce-MPI) and use the frameworks on high-end clusters for the three scientific challenges listed above. Our results show that our approach enables scalable classification and clustering analyses for large-scale computational structural biology datasets on large distributed memory systems. In addition, compared with traditional analysis approaches, our method achieves similar or better accuracy.
Advisor	Taufer, Michela
Degree	Ph.D.
Department	University of Delaware, Department of Computer and Information Sciences
Unique Identifier	926934838
URL	http://udspace.udel.edu/handle/19716/17204
Publisher	University of Delaware
URI	http://search.proquest.com/docview/1708646796?accountid=10457
dc.subject.lcsh	Data mining.
dc.subject.lcsh	MapReduce (Computer file)
dc.subject.lcsh	RNA -- Data processing.
dc.subject.lcsh	Ligands -- Data processing.
dc.subject.lcsh	Protein folding -- Data processing.
Title	Enabling scalable data analysis for large computational structural biology datasets on large distributed memory systems supported by the MapReduce paradigm
Type	Thesis
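
The two-step methodology summarized in the abstract (extract a concise property from each record in isolation, then analyze only the extracted metadata) can be illustrated with a minimal single-process sketch. It is not the dissertation's implementation. The centroid feature, the grid-cell grouping that stands in for the machine learning step, and all names (`extract_metadata`, `cell_of`, `analyze`, `cell_size`) are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def extract_metadata(record: List[Point]) -> Point:
    """Step 1 (map-like): reduce one record to a concise property.

    Hypothetical feature: the centroid of the record's 3D coordinates.
    Each record is processed in isolation, so this step needs no
    communication between records.
    """
    n = len(record)
    return (
        sum(p[0] for p in record) / n,
        sum(p[1] for p in record) / n,
        sum(p[2] for p in record) / n,
    )


def cell_of(centroid: Point, cell_size: float) -> Cell:
    """Map a centroid to the integer grid cell that contains it."""
    x, y, z = centroid
    return (int(x // cell_size), int(y // cell_size), int(z // cell_size))


def analyze(records: Dict[str, List[Point]], cell_size: float = 1.0) -> Dict[Cell, List[str]]:
    """Step 2 (reduce-like): group records using only their metadata.

    Assumption: grid-cell grouping is a simple stand-in for the
    classification or clustering technique applied to the metadata.
    """
    metadata = {rid: extract_metadata(points) for rid, points in records.items()}
    groups: Dict[Cell, List[str]] = defaultdict(list)
    for rid, centroid in metadata.items():
        groups[cell_of(centroid, cell_size)].append(rid)
    return dict(groups)


if __name__ == "__main__":
    # Three toy records; "a" and "b" lie close together, "c" lies far away.
    toy = {
        "a": [(0.0, 0.0, 0.0), (0.4, 0.4, 0.4)],
        "b": [(0.2, 0.2, 0.2), (0.6, 0.6, 0.6)],
        "c": [(5.0, 5.0, 5.0), (5.2, 5.2, 5.2)],
    }
    for cell, members in analyze(toy).items():
        print(cell, members)
```

Because the first step touches each record independently, it could in principle be distributed as a map phase, with only the small metadata values moved between nodes for the grouping phase.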

Files

Original bundle

Name: 2015_ZhangBoyu_PhD.pdf
Size: 4.4 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.22 KB
Format: Item-specific license agreed upon to submission

Collections

Doctoral Dissertations (Winter 2014 to Present)