Performance, power, and energy tuning using hardware and software techniques for modern parallel architectures
Date
2016
Publisher
University of Delaware
Abstract
As the high-performance computing (HPC) community continues the push towards
exascale computing, power consumption is becoming a major concern for designing,
building, maintaining, and getting the most out of supercomputers. Energy
efficiency has become one of the top ten exascale system research challenges. Meeting
the goal of exascale performance within a 20-megawatt power limit requires performance,
power, and energy optimization techniques at all levels, from the hardware to
the application. Meanwhile, although advances in parallel architectures promise
improved peak computational performance, the software tools that drive the parallelism
of the hardware still require expertise that is not widely available. Domain
scientists face the challenge of efficiently porting applications to new parallel architectures
such as Nvidia GPUs and Intel Many Integrated Core (MIC) accelerators. The
fact that an increasing number of supercomputers will contain accelerators
poses a further challenge to application developers, who must get their applications
ready for these machines.
In this dissertation, we begin with a study of how to achieve both automatic
parallelization using OpenACC and enhanced portability using OpenCL. We applied
our parallelization techniques on GPUs as well as an Intel MIC-architecture accelerator
to reduce the running time of 2D wave propagation simulations, and we compared the
performance and programmability of the CUDA, OpenCL, OpenACC, and OpenMP
implementations of the simulation. Compared to CUDA and OpenCL, we believe
OpenACC is preferable for domain scientists because it lets programmers parallelize
their code with simple directives, which speeds up the process of porting applications.
OpenACC achieves performance comparable to CUDA and OpenCL on GPUs with much
less coding effort, while our OpenMP implementation outperforms OpenCL and OpenACC
on the Intel MIC accelerator. Emerging programming models such as OpenACC make it
easier to exploit the parallelism offered by evolving hardware architectures. Our approach
of using OpenACC, OpenCL, and OpenMP to parallelize code efficiently and effectively
on different accelerators can be applied broadly to benefit other domains.
For the energy tuning problem, we first tackle the problem using software techniques.
We integrated an energy measurement framework into an existing polyhedral
transformation framework called PoCC. Loop transformations supported by
PoCC have been shown to be effective in optimizing the performance of small kernels,
but there have been few studies on how these transformations affect power
and energy consumption. The energy measurement framework allows us to explore the
relationship between tuning for power/energy and tuning for performance. We observe
a high correlation between energy and performance in PoCC, but tuning for power
differs from tuning for execution time. We constructed predictive models that achieve
high prediction accuracy, and we also demonstrate the potential of polyhedral
transformations in optimizing the 2D cardiac wave propagation application for both
performance and energy.
We then propose to minimize the energy usage of HPC applications without impacting
their performance by using hardware techniques. We developed energy optimization
techniques that reduce not only power but also the Energy-Delay Product (EDP)
and, in some cases, even the Energy-Delay-Squared Product (ED2P). We took advantage
of the low transition overhead of CPU clock modulation and applied it to fine-grained
OpenMP parallel loops. We first characterize the energy behavior of OpenMP parallel
regions by their memory access density, which determines the best clock modulation
setting for each region. Finally, different CPU clock settings are applied to
different loops within the same application. The resulting multi-frequency execution
of OpenMP applications achieves better energy efficiency than any single frequency
setting.
In the last chapter of this dissertation, we combine software and hardware
techniques to obtain better energy efficiency for HPC applications. In particular, on
the Intel Sandy Bridge architecture we applied concurrency throttling (i.e., reducing the
number of threads used by an OpenMP application) together with CPU clock modulation,
and on the IBM POWER8 architecture we applied concurrency throttling together with
DVFS; in both cases we observed improved energy efficiency. Lastly, we combined
polyhedral compilation techniques with CPU clock modulation and evaluated their
interactions under a power-capped environment.