Performance, power, and energy tuning using hardware and software techniques for modern parallel architectures
Date
2016
Publisher
University of Delaware
Abstract
As the high-performance computing (HPC) community continues the push towards
exascale computing, power consumption is becoming a major concern for designing,
building, maintaining, and getting the most out of supercomputers. Energy
efficiency has become one of the top ten exascale system research challenges. Meeting
the goal of exascale performance within a 20-megawatt power limit requires performance,
power, and energy optimization techniques at all levels, from the hardware to
the application. Meanwhile, although advances in parallel architectures promise
improved peak computational performance, the software tools that drive the parallelism
of the hardware still require expertise that is not widely available. Domain
scientists face the challenge of efficiently porting applications to new parallel architectures
such as Nvidia GPUs and Intel Many Integrated Core (MIC) accelerators. The
fact that an increasing number of supercomputers will contain accelerators
poses a further challenge to application developers, who must get their applications
ready for these machines.
In this dissertation, we begin with a study of how to achieve both automatic
parallelization using OpenACC and enhanced portability using OpenCL. We applied
our parallelization techniques on GPUs as well as an Intel MIC-architecture accelerator
to reduce the running time of 2D wave propagation simulations, and we compared the
performance and programmability of the CUDA, OpenCL, OpenACC, and OpenMP
implementations of the simulation. Compared to CUDA and OpenCL, we believe
OpenACC is preferable for domain scientists because it lets programmers parallelize
their code with simple directives, which speeds up the process of porting applications.
OpenACC achieves performance comparable to CUDA and OpenCL on GPUs with much
less coding effort, while our OpenMP implementation outperforms OpenCL and OpenACC
on the Intel MIC accelerator. Emerging programming models such as OpenACC make it
easier to exploit the parallelism offered by evolving hardware architectures. Our approach
of using OpenACC, OpenCL, and OpenMP to parallelize code efficiently and effectively
on different accelerators can be applied broadly to benefit other domains.
For the energy tuning problem, we first tackle the problem using software techniques.
We integrated an energy measurement framework into an existing polyhedral
transformation framework called PoCC. Loop transformations supported by
PoCC have been shown to be effective in optimizing the performance of small kernels,
but there have been few studies on how these transformations affect power
and energy consumption. The energy measurement framework allows us to explore the
relationship between tuning for power/energy and tuning for performance. We observe
a high correlation between energy and performance in PoCC, but tuning for power
differs from tuning for execution time. We constructed predictive models that achieve
high prediction accuracy, and we also demonstrate the potential of polyhedral
transformations in optimizing the 2D cardiac wave propagation application for both
performance and energy.
We then propose to minimize the energy usage of HPC applications without impacting
their performance by using hardware techniques. We developed energy optimization
techniques that reduce not only power but also the Energy-Delay Product (EDP)
and, in some cases, even the Energy-Delay-Squared Product (ED2P). We took advantage
of the low transition overhead of CPU clock modulation and applied it to fine-grained
OpenMP parallel loops. We first characterize the energy behavior of OpenMP parallel
regions by their memory access density, which determines the best clock modulation
setting for each region. Finally, different CPU clock settings are applied to
different loops within the same application. The resulting multi-frequency execution
of OpenMP applications achieves better energy efficiency than any single frequency
setting.
In the last chapter of this dissertation, we combine software and hardware
techniques to obtain better energy efficiency for HPC applications. In particular, on
the Intel Sandy Bridge architecture we applied concurrency throttling (i.e., reducing the
number of threads used by an OpenMP application) together with CPU clock modulation,
and on the IBM POWER8 architecture we applied concurrency throttling together with
DVFS; in both cases we observed improved energy efficiency. Lastly, we combined
polyhedral compilation techniques with CPU clock modulation and evaluated their
interactions under a power-capped environment.