Memory optimization in codelet execution model on many-core architectures

Date
2014
Publisher
University of Delaware
Abstract
The upcoming exa-scale era requires a parallel program execution model capable of achieving scalability, productivity, energy efficiency, and resiliency. The codelet model is a fine-grained, dataflow-inspired execution model that is the focus of several tera-scale and exa-scale studies, including DARPA's UHPC program, DOE's X-Stack program, and the European TERAFLUX project. Current codelet implementations aim to make full use of computational resources by balancing workload across multi-core and many-core systems, which improves performance. However, the features of the codelet model can also be exploited for memory optimization, improving both performance and energy efficiency. In this thesis, we focus on two memory optimizations in the codelet model: memory workload balancing and locality exploitation. As a case study, several versions of the FFT algorithm are implemented on IBM Cyclops-64, a many-core system, to demonstrate that the fine-grained codelet execution model can execute codelets that place different demands on memory bandwidth in an order that reduces memory contention and thus improves performance. The experimental results show that our fine-grain guided algorithm achieves up to a 46% performance improvement over a coarse-grain implementation on Cyclops-64. To automatically exploit locality in codelet execution, we provide three optimal or nearly optimal scheduling algorithms based on static information about the codelet graph and data locality. They offer different trade-offs among algorithmic complexity, locality exploitation, program execution time, and energy efficiency. We test and analyze the three algorithms on several applications on an emulation platform of Cyclops-64. The experimental results show that our algorithms reduce global memory accesses by up to 59.7% by using local memory to buffer intermediate data between two adjacent codelets on the same core, yielding up to a 68.1% performance improvement and 40.7% energy savings compared to the dynamic codelet scheduling approach.
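
To make the locality-exploitation idea concrete, the following is a minimal C sketch of a greedy static assignment over a small codelet graph: a codelet is placed on the core of its producer so the intermediate buffer can stay in that core's local memory, while codelets with no producer go to the least-loaded core. The graph, the single-producer assumption, and all identifiers (Codelet, core_of, load) are illustrative only and are not the scheduling algorithms developed in the thesis.

/* Hypothetical sketch: greedy static assignment of a codelet graph to cores,
 * co-locating each codelet with its producer so intermediate data can stay in
 * the core's local (scratchpad) memory instead of going to global memory.
 * Names and the example graph are illustrative, not the thesis implementation. */
#include <stdio.h>

#define NUM_CORES    4
#define NUM_CODELETS 8

/* Each codelet records at most one producer whose output it consumes;
 * -1 means the codelet reads its input from global memory. */
typedef struct {
    int id;
    int producer;   /* index of the codelet that feeds it, or -1 */
} Codelet;

int main(void) {
    /* A small chain-and-fan codelet graph (producers listed before consumers). */
    Codelet graph[NUM_CODELETS] = {
        {0, -1}, {1, 0}, {2, 0}, {3, 1},
        {4, 2},  {5, 3}, {6, 4}, {7, 5}
    };

    int core_of[NUM_CODELETS];
    int load[NUM_CORES] = {0};

    for (int i = 0; i < NUM_CODELETS; i++) {
        int target;
        if (graph[i].producer >= 0) {
            /* Locality rule: run on the same core as the producer so the
             * intermediate buffer never leaves local memory. */
            target = core_of[graph[i].producer];
        } else {
            /* No producer: pick the least-loaded core to balance work. */
            target = 0;
            for (int c = 1; c < NUM_CORES; c++)
                if (load[c] < load[target]) target = c;
        }
        core_of[i] = target;
        load[target]++;
        printf("codelet %d -> core %d\n", graph[i].id, target);
    }
    return 0;
}

In this toy example every consumer inherits its producer's core, so each intermediate result is buffered locally; the thesis algorithms additionally weigh workload balance and energy, which this sketch deliberately omits.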