@phdthesis{oai:uec.repo.nii.ac.jp:00008651,
 author = {工藤, 周平},
 month = {2018-04-03},
 note = {2017, The  massive-parallelism of recent high-performance computers (HPC) requires all algorithms running on them, including the singular value decomposition (SVD) algorithm, to have high-level of scalability. To achieve this goal, we focus on the one-sided Jacobi SVD (OSJ) algorithm. The OSJ with extension techniques like blocking and parallelization has a potential for high parallel efficiency due to its large-grained parallelism, but some of its theoretical properties are unknown. 
Furthermore, implementation techniques of the OSJ for HPC have not been studied well. This thesis aims to analyze the blocking and parallelization techniques of OSJ both theoretically and  experimentally  and  provides  new  parallelization  techniques for HPC. It consists of seven chapters. 
In the first chapter, the motivation and contribution of the thesis are described. In the second and third chapters, as the background, the current trend of HPC and applications of SVD are described. The idea and detailed algorithm of the OSJ and the existing extension techniques are also described.
In the fourth chapter, a new bound on the orthogonality error of Hari’s V2 method is provided. 
The V2 method is a blocking method for the OSJ suitable for HPC. The bound is tighter than the existing one due to Drmač, thanks to the exploitation of the diagonally scaled structure of the matrix. 
In the fifth chapter, a new implementation method for HPC based on 2D blocked data distribution and all-reduce type communication is described. Theoretical analysis of the method shows the number of communication is reduced compared with the existing data distributions. Experimental results  on  highly  parallel machines support the theoretical prediction and show good strong-scalability of the method. Bečka’s dynamic ordering method, which can reduce the number of iteration of OSJ, is also analyzed and a new bound on the global convergence rate of the method is provided.
In the sixth chapter, implementation techniques of DSYRK for a many-core CPU, a new architecture of CPU which is used in HPC, is considered. DSYRK is a variation of matrix-matrix multiplication used in the V2 method. The new parallelization technique of DSYRK which utilizes all the three dimensions for parallelism in the matrix-matrix multiplication accelerates the performance of DSYRK on Xeon Phi, an Intel’s many-core CPU, up to 75% of the theoretical peak performance.
The last chapter concludes the thesis and provides the future work of this study.

　現代の高性能計算機は高い並列性をもっており，特異値分解(SVD)のような科学技術計算に用いられる基本的な行列計算についてもこのような高い並列性に対応することが求められている．片側ヤコビ法は広く用いられている行列計算ライブラリLAPACKに実装されたSVD計算手法の一つであり，高い計算精度と実用的な速度を併せ持つ．また，ブロック化や並列化などの拡張手法と組み合わせることで，高い計算効率と粗粒度な並列性を持たせることができる．そのため片側ヤコビ法は高性能計算機に適すると考えられているが，ブロック化や並列化については理論的・実験的検証が不十分であり，また現代の高性能計算機に向けた実装の研究も進んでいなかった．
　本論文は，片側ヤコビ法の拡張手法について理論的・実験的解析を行うとともに高い並列性能を持つ実装手法の開発とその解析を行うものであり，7つの章で構成される．
　第１章では現状の片側ヤコビ法が持つ精度面での利点と性能面での問題点について実例を用いて示すことで本研究の目的を説明し，また本論文の成果の概略と構成を述べる．
　第２章では本研究の背景となる事柄についてまとめる．第一に現代の高性能計算機が持つ特性についてまとめ，高度な並列性に対応することがSVD計算アルゴリズムに不可欠であることを示す．第二に，SVDとその科学技術計算における応用を示し，また近年明らかにされたSVDの誤差の性質についてまとめる．そして最後にSVD計算アルゴリズムについて概要を示し，ヤコビ法やそのSVDへの応用手法である片側ヤコビ法の基礎となるアイデアについて説明する．
　第３章では片側ヤコビ法とそれに対する既存の拡張手法について詳細に解説する．ここでははじめに片側ヤコビ法の詳細な手順を示し，その問題点を指摘することで拡張手法の必要性を説明する．そして，第一に並列化と密接な関係があるヤコビ法の巡回順序について，第二に片側ヤコビ法の高性能化に必要なブロック化手法について，第三に近年開発されヤコビ法の劇的な高性能化につながった前処理手法について説明する．
　第４章では本論文の成果の1つである，片側ヤコビ法のブロック化したときにおける誤差解析の結果を示す．ブロック化手法はいくつか考えられるが本論文ではとくに高性能計算に適したHariのV2手法を対象にし，片側ヤコビ法の収束性に影響を与える直交性に対する誤差について調べている．本研究における解析はDrmačによる先行研究から発展させたものであり，行列の対角スケーリングされた構造を利用することでより良い上限が得られている．また多数の行列を用いた実験による検証により，上限に出現する行列に依存した係数が実用上小さいことを示している．
　第５章では第一に，本論文の２つ目の成果である二次元ブロック分割によるデータ配置とAllReduce型の計算を用いた分散メモリ向け並列化手法について論じている．現代の高性能計算機のほぼすべてが分散メモリ型並列計算機であるが，このような計算機ではデータ配置が計算や通信の特性を決定づける．本論文ではまずV2手法の計算パターンを利用して，二次元ブロック分割とAllReduce型計算の組み合わせた新規の並列化手法を示している．そして計算量や通信量を理論的に解析し，この手法が従来のデータ分散手法と比較してオーダーの意味で通信回数を削減することを示している．また，高性能計算機による実験的検証によってこの手法の性能が理論から得られた予測に従うことを確かめている．さらに京コンピュータを用いた大規模並列計算機上での性能検証を行い，1万次元の行列のSVDを計算するときに約2万5千コアまで性能が向上するという高い強スケーラビリティを確認している．第二に，本論文の３つ目の成果である，Bečkaらの動的順序と呼ばれる並列化手法について理論的検証を行い，一次収束性に対する新しい上界を求めている．この上界は従来のものとは異なり並列数と反比例する形となっており，動的順序の並列計算に適した特性を示している．
　第６章では本論文の４つ目の成果である，DSYRKの高性能実装手法について論じている．DSYRKはBLASにおいて定義された対称な形を持つ行列積であり，V2手法における主要な計算の1つである．本研究ではメニーコアCPUという将来使用されると目されるアーキテクチャを持つ，Xeon Phiを用いたときにおける問題点を検証する．DSYRKの既存実装ではDSYRKの持つ対称な構造のためXeon Phiが持つ高い性能を活かせていなかったが，本研究ではメニーコアCPUの持つ高い並列性に対処するため，行列積の持つ3次元的な並列化軸をすべて利用する新規並列化アルゴリズムを示している．またXeon Phi上で性能検証結果では，理論ピーク性能の約76%に達する高い性能を得ている．
　第７章では本論文をまとめ，今後の展望について述べている．},
 school = {電気通信大学},
 title = {高性能計算機に適したヤコビSVD手法の実装技術と性能解析},
 year = {},
 yomi = {クドウ, シュウヘイ}
}