A study of architecture and performance of IBM Cyclops64 interconnection network

Date
2005
Publisher
University of Delaware
Abstract
With the increasing need to support multiple operating systems simultaneously, high-performance processor designs are moving toward integrating a large number of processing cores on a single chip [1]. IBM Cyclops64 (C64) is a petaflop supercomputer built on multi-core System-On-Chip (SOC) technology; it is attached to a host system through a number of Gigabit Ethernet links and provides a familiar computing environment to application developers and end users [2, 3]. The system is based on the Cyclops cellular architecture and is designed to achieve over 1 petaflop peak performance. A maximum configuration of a C64 system consists of 13,824 C64 processing nodes (around one million processors) arranged in a 3D mesh network [4, 2]. Each node is composed of a C64 chip, external DRAM, and a small amount of external interface logic. Each C64 chip employs a multistage pipelined crossbar switch as its interconnection network to provide high-bandwidth, low-latency communication.

Today, the performance of most digital systems is constrained by their underlying communication or interconnection network. A performance analysis of these networks, in our case the C64 crossbar, therefore plays an important role in verifying the architecture and performance of new systems.

This thesis presents a brief overview of the C64 crossbar switch and describes its performance simulation and analysis in detail. The analysis focuses on two metrics: latency and throughput. A C64 crossbar simulator, csim_crossbar, and a C64 chip simulator, LAST, are used to simulate the architecture and gather statistical data. The analysis is carried out under fixed constraints (channel width, node size, and topology), while the type of workload, traffic pattern, injection rate, packet size, and arbitration algorithm are varied across the simulations.

My experimental results provide the following observations on the network behavior: (1) the C64 crossbar can achieve the full hardware bandwidth and exhibits non-blocking behavior; (2) it is a stable network; (3) the network logic design appears to provide a reasonable opportunity for sharing channel bandwidth between traffic in either direction; (4) the segmented LRU matrix arbitration scheme shows no notable performance gain, although it uses considerably less memory; and (5) application-driven benchmarks provide results comparable to those of synthetic workloads.