A framework for group locality aware multithreading
Date
2015
Publisher
University of Delaware
Abstract
When powerful computational cores are matched against limited resources such as bandwidth and memory, competition across computational units can result in serious performance degradation due to contention at different levels of the memory hierarchy. In such cases it becomes very important to maximize reuse in low-latency memory. The high-performance computing community has been actively developing techniques that improve data locality and reuse, as such efforts have a direct and positive impact on both performance and power. To better utilize low-latency memory such as caches and scratch-pad SRAMs, software techniques such as hierarchical tiling have proven very effective. However, these techniques still operate under the paradigm that memory-agnostic coarse-grain parallelism is the norm: they function in a coarse-grain fashion where inner tiles run serially. Such behavior can result in under-utilization of processing and network resources. Even when the inner tiles are assigned to fine-grain tasks, their memory-agnostic placement still incurs heavy penalties when resources are contended. Finding a balance between parallelism and locality in such cases is an arduous task, and it is essential to seek a common ground where processing and memory resources collaborate to yield better overall utilization.
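As a point of reference, the sketch below shows the conventional pattern this paragraph alludes to: a two-level tiled, 1D Jacobi-style sweep in C, where the outer tile is the unit of coarse-grain parallelism and the inner tiles run serially so their working set fits in low-latency memory. The kernel, function name, and tile sizes are illustrative only and are not taken from the thesis.

    #include <stddef.h>

    /* Illustrative tile sizes; real values would be chosen to match the
     * target's last-level and L1/scratch-pad capacities. */
    #define OUTER_TILE 4096
    #define INNER_TILE 256

    /* Conventional two-level (hierarchical) tiling of a 1D Jacobi-style
     * sweep: outer tiles are candidates for coarse-grain parallelism,
     * inner tiles execute serially within each outer tile. */
    void jacobi_tiled(const double *in, double *out, size_t n)
    {
        for (size_t to = 1; to + 1 < n; to += OUTER_TILE) {           /* outer tile */
            size_t outer_end = (to + OUTER_TILE < n - 1) ? to + OUTER_TILE : n - 1;
            for (size_t ti = to; ti < outer_end; ti += INNER_TILE) {  /* inner tile */
                size_t inner_end = (ti + INNER_TILE < outer_end) ? ti + INNER_TILE : outer_end;
                for (size_t i = ti; i < inner_end; ++i)
                    out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
            }
        }
    }

In this conventional form only the loop over outer tiles would typically be parallelized (for example with OpenMP), which is exactly the serial-inner-tile behavior the work described here seeks to move beyond.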
This thesis explores a concept of locality-aware multithreading called "group locality". Multiple groups of threads work together in close proximity in time and space as a unit, taking advantage of each other's data movement and collaborating at a very fine-grain level. When the execution pattern is known a priori through static analysis and data is shared among different thread groups, group locality extends further to rearranging data so as to provide a better access pattern for other thread groups. This thesis presents an end-to-end framework that takes the information gathered by the compiler and uses it for careful orchestration of memory accesses, data movement, and data restructuring. The contributions of this thesis include:
• An efficient tiling strategy that exploits intra-tile parallelism. Using an existing polyhedral framework and compiler toolchain, we created a multi-level hierarchical tiling scheme whose highly parallel inner tiles map to the lowest-level cache and communicate with minimal overhead. Sets of these inner tiles form outer-level tiles that run in parallel.
• A highly optimized fine-grain runtime in which threads execute in a data-flow fashion and synchronize mainly through atomic operations (see the sketch after this list).
• A data movement and restructuring strategy, driven by access patterns, that keeps interference between accesses to a minimum. Based on reuse and dependence-vector information, threads move data using a low-overhead scheme in which multiple threads perform collaborative data restructuring and reuse.
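To illustrate the kind of fine-grain, data-flow style synchronization the second contribution describes, the sketch below maintains a per-tile dependence counter with C11 atomics. The task structure, field names, and callback are hypothetical and do not reflect the framework's actual API.

    #include <stdatomic.h>

    /* Hypothetical task record for a data-flow style runtime: each inner
     * tile carries a counter of unsatisfied dependences. */
    typedef struct tile_task {
        atomic_int pending;        /* predecessors not yet completed       */
        struct tile_task **succ;   /* tiles that consume this tile's data  */
        int nsucc;
    } tile_task;

    /* Called by a worker thread when it finishes executing a tile.  Each
     * successor's dependence counter is decremented with one atomic
     * operation; a successor whose counter reaches zero becomes runnable. */
    static void tile_complete(tile_task *t, void (*enqueue_ready)(tile_task *))
    {
        for (int i = 0; i < t->nsucc; ++i) {
            if (atomic_fetch_sub_explicit(&t->succ[i]->pending, 1,
                                          memory_order_acq_rel) == 1)
                enqueue_ready(t->succ[i]);  /* last dependence satisfied */
        }
    }

In a scheme of this kind, a worker that finishes a tile pays one atomic operation per successor, and no barriers or locks are needed on the critical path, which is the spirit of the fine-grain synchronization described above.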
Our current implementation of the framework, which exploits intra-tile parallelism and fine-grain execution, shows significant performance improvements over code produced by a state-of-the-art compiler framework for selected stencil applications on an Intel Xeon Phi board. Similarly, on a many-core architecture, the Tilera TileGx, our restructuring strategy beats optimized OpenMP code by a significant margin. Hardware counters collected during our experiments show improved reuse and reduced conflicts at different levels of the memory hierarchy.