A framework for group locality aware multithreading

Date
2015
Publisher
University of Delaware
Abstract
When powerful computational cores are matched against limited resources such as bandwidth and memory, competition across computational units can result in serious performance degradation due to contention at different levels of the memory hierarchy. In such cases, it becomes very important to maximize reuse in low-latency memory. The high-performance computing community has been actively working on techniques that improve data locality and reuse, as such endeavors have a direct and positive impact on both performance and power. To better utilize low-latency memory such as caches and scratch-pad SRAMs, software techniques such as hierarchical tiling have proven very effective. However, these techniques still operate under the paradigm that memory-agnostic coarse-grain parallelism is the norm. Therefore, they function in a coarse-grain fashion where inner tiles run serially. Such behavior can result in under-utilization of processing and network resources. Even when the inner tiles are assigned to fine-grain tasks, their memory-agnostic placement still incurs heavy penalties when resources are contended. Finding a balance between parallelism and locality in such cases is an arduous task, and it is essential to seek a common ground where both processing and memory resources can collaborate to yield better overall utilization. This thesis explores a concept of locality-aware multithreading called “Group Locality”. Multiple groups of threads work together in close proximity in time and space as a unit, taking advantage of each other’s data movement and collaborating at a very fine-grain level. When an execution pattern is known a priori through static analysis and data is shared among different thread groups, the concept of group locality extends further to rearrange data so as to provide a better access pattern for other thread groups.
This thesis presents an end-to-end framework that takes advantage of information gained by the compiler and uses it for careful orchestration of memory accesses, data movement, and data restructuring. The contributions of this thesis include:

• An efficient tiling strategy to exploit intra-tile parallelism. Using an existing polyhedral framework and compiler toolchain, we created a multi-level hierarchical tiling with highly parallel inner tiles that map to the lowest-level cache and communicate with minimal overhead. Sets of these inner tiles form outer-level tiles that run in parallel.
• A highly optimized fine-grain runtime in which threads execute in a data-flow fashion and synchronize mainly through atomic operations.
• A data movement and restructuring strategy, based on access patterns, that keeps interference among accesses minimal. Guided by reuse and dependence-vector information, multiple threads perform collaborative, low-overhead data restructuring and reuse.

Our current implementation of the framework, which exploits intra-tile parallelism and fine-grain execution, shows significant performance improvements over instances produced by a state-of-the-art compiler framework for selected stencil applications on an Intel Xeon Phi board. Similarly, our restructuring strategy on a many-core architecture, the Tilera TileGx, beats optimized OpenMP code by a significant margin. Hardware counters collected during our experiments show improved reuse and reduced conflicts at different levels of the memory hierarchy.