Memory optimization in codelet execution model on many-core architectures

Author(s): Wu, Yao
Date Accessioned: 2015-03-16T12:51:30Z
Date Available: 2015-03-16T12:51:30Z
Publication Date: 2014
Abstract: The upcoming exa-scale era requires a parallel program execution model capable of achieving scalability, productivity, energy efficiency, and resiliency. The codelet model is a fine-grained, dataflow-inspired execution model that is the focus of several tera-scale and exa-scale studies, such as DARPA's UHPC, DOE's X-Stack, and the European TERAFLUX projects. Current codelet implementations aim to make full use of computation resources by balancing workload across the cores of multi-core and many-core systems, which improves performance. However, the features of the codelet model can also be used to optimize memory behavior, improving both performance and energy efficiency. In this thesis, we focus on two memory optimizations in the codelet model: memory workload balancing and locality exploitation. As a case study, various versions of FFT algorithms are implemented on IBM Cyclops-64, a many-core system, to demonstrate that the fine-grain codelet execution model can execute codelets that place different loads on memory bandwidth in an appropriate order to reduce memory contention and thus improve performance. The experimental results show that our fine-grain guided algorithm achieves up to 46% performance improvement compared to a coarse-grain implementation on Cyclops-64. To automatically exploit locality in codelet execution, we provide three optimal or nearly optimal scheduling algorithms based on static information about the codelet graph and locality. They offer different trade-offs in algorithmic complexity, locality exploitation, program execution time, and energy efficiency. We test and analyze the three algorithms on various applications on an emulation platform of Cyclops-64. The experimental results show that our algorithms reduce global memory accesses by up to 59.7% by using local memory to buffer intermediate data between two adjacent codelets on the same core, and thus achieve up to 68.1% performance improvement and 40.7% energy saving compared to the dynamic codelet scheduling approach.
Advisor: Gao, Guang R.
Degree: M.E.E.
Department: University of Delaware, Department of Electrical and Computer Engineering
Unique Identifier: 904958900
URL: http://udspace.udel.edu/handle/19716/16691
Publisher: University of Delaware
URI: http://search.proquest.com/docview/1564756419?accountid=10457
dc.subject.lcsh: Computer algorithms.
dc.subject.lcsh: IBM computers.
dc.subject.lcsh: Computer programs -- Execution.
Title: Memory optimization in codelet execution model on many-core architectures
Type: Thesis
Files
Original bundle
Name:
2014_Wu_Yao_MEE.pdf
Size:
1.6 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
2.22 KB
Format:
Item-specific license agreed upon to submission