Item type |
Journal Article(1) |
Publication date |
2011-12-01 |
Title |
|
|
Title |
Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster |
|
Language |
en |
Language |
|
|
Language |
eng |
Keywords |
|
|
Language |
en |
|
Subject |
parallel processing, multi-core processor, GPU, computation-communication overlap |
Resource type |
|
|
Resource type identifier |
http://purl.org/coar/resource_type/c_6501 |
|
Resource type |
journal article |
Authors |
Junichi, Ohmura
Takefumi, Miyoshi
Hidetsugu, Irie
Tsutomu, Yoshinaga
|
Author ID |
|
|
Description type |
Other |
|
Description |
1000050422407 |
Author ID |
|
|
Description type |
Other |
|
Description |
1000060210738 |
Description |
|
|
Description type |
Other |
|
Description |
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU-GPU parallel double-precision general matrix-matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our overlap GPU acceleration solution uses overlaps in which the main inter-node communication and data transfer to the GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps use multi-core processors, which almost all of today’s high-performance computers use. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation tasks, we experimentally find the optimal computation ratio for CPUs; this ratio differs from the case of parallel dgemm operation of one node. |
Bibliographic information |
IEICE Transactions on Information and Systems
Vol. E94-D,
No. 12,
p. 2319-2327,
Date of issue 2011-12-01
|
Publisher |
|
|
Publisher |
The Institute of Electronics, Information and Communication Engineers |
ISSN |
|
|
Source identifier type |
ISSN |
|
Source identifier |
09168532 |
Related site |
|
|
|
Identifier type |
URI |
|
|
Related identifier |
http://www.ieice.org/jpn/index.html |
|
|
Related name |
http://www.ieice.org/jpn/index.html |
Author version flag |
|
|
Publication type |
VoR |
|
Publication type resource |
http://purl.org/coar/version/c_970fb48d4fbd8a85 |
License (free text) |
|
|
|
Copyright © 2011 The Institute of Electronics, Information and Communication Engineers |