Parallel low-overhead data collection framework for a resource-centric performance analysis tool
Date
2012
Publisher
University of Delaware
Abstract
With the advent of multicore technology, computer systems have reached a new level of parallelism and computational power. Ramping up the clock frequency to increase performance on a single processor has become a thing of the past. Nowadays, everyday computers are powered by multiple cores that share resources such as memory, the network, and I/O components, and they can run a larger gamut of applications at much higher speed. This increase in computational power, however, is not reflected in the usability of such systems. Using these components and resources effectively places an extremely high burden on the programmer and the system software, which increases the complexity of programming models and runtime systems. To alleviate this burden, there is a need for tools that can identify parallel sections in the code, identify bottlenecks, and provide hints to the programmer to improve the performance of the overall computing system. Moreover, such a tool has to be able to pinpoint resource contention so that uneven distributions of resources can be located. To address this issue, we introduced a tool called Memory Observant Data Analysis (MODA) [44]. MODA is a performance analysis tool that helps users analyze resource usage and alleviate resource conflicts by pinpointing performance issues at both the algorithmic and the architectural level.

The main challenge of any performance analysis tool is the performance of the tool itself: it must introduce minimal to no perturbation of application behavior. This requires the analysis tool to gather information at runtime with very little overhead, which is not easy to achieve because the tool has to not only monitor the behavior of an application but also record traces, which can later be analyzed. Tool developers therefore need to make smart decisions about the trace format, storage location, data movement and, most importantly, correctness.
Under the MODA framework there are four phases: the Instrumentation Phase, the Monitoring Phase, the Analysis Phase, and the Visualization Phase. All of these phases are equally important to the development of our tool; however, it is the Monitoring Phase that runs online, which brings a need for this phase to be highly optimized. The Monitoring Phase is itself a complex structure that acts like a mini-runtime, collecting all necessary traces and reacting to certain events. This thesis concentrates on the intricacies of creating a monitoring kernel for massively parallel architectures and goes through the challenges of creating an optimized kernel when several threads are working together. Its contributions include:

Contribution (1) Parallel trace collection: the monitoring kernel is designed to collect traces from multiple streams running in parallel. A multi-layer collection framework lets application streams write to their own local buffers with minimal overhead, while the monitoring threads collect the monitoring data in parallel.

Contribution (2) Differential compressed traces: the information collected from each memory operation includes its control information (i.e., the program counter), the operation's target, the operation's starter, and timing information. In order to reduce communication and space overhead, the framework takes advantage of the operations' locality in both time and space and encodes the messages, reducing the stored messages to a 1:3 ratio.

Contribution (3) Low-overhead kernel: the framework has a highly optimized monitoring kernel, which intercepts the marked memory operations, takes care of any outstanding events, and ensures the consistency of the program.

Contribution (4) Statistical sampling: this thesis provides an analytical formula that can be used to calculate the predicted overhead of our framework when running on all memory operations, or when applying sampling to the data.
Our results show a 45x overhead without statistical sampling, which is reduced significantly as statistical sampling is applied. In this thesis we discuss our solution and show a high-level implementation on a massively parallel architecture: the Cray XMT.