CS267 Spring 2009 HW0
Mehrzad Tartibi, UC Berkeley

About Me: I am a second-year PhD student in the ME department. My area of interest is solving fluid-solid interaction problems in the human arteriovascular system. Simulating a typical part of the arteriovascular system means dealing with the complex nonlinear structure of the arterial wall (which leads to nonlinear PDEs) and with the nonlinear unsteady blood flow problem (also nonlinear PDEs). Since almost all PDE solution schemes are linear, the accuracy of such simulations depends heavily on the resolution of the grid used to represent this complicated structure. Consequently, large systems of equations must be solved using advanced linear algebra algorithms. As computational power has increased, the models have grown more complex, which in turn demands still more computational power; parallelism therefore plays a critical role in solving such large systems. I would like to learn the latest parallelization techniques for solving large finite element and finite volume simulations on many-core technology.

Disclaimer: because of my limited hands-on exposure to parallel linear algebra problems, I have extracted the following results from a paper and an online presentation in this field.

Title: Evaluation of Sparse SuperLU Factorization and Triangular Solution on Multicore Platforms [1]

Solving linear systems is one of the numerical techniques most heavily used by engineers; both finite element and finite volume methods reduce to solving large sparse systems. Driven by the ever-increasing computational speed and the high demand for solving large sparse matrices arising in engineering and scientific applications, sparse matrix solver techniques have been continuously improved and new methods created. Among the three direct solvers commonly used to solve very large sparse linear systems, MUMPS (the MUltifrontal Massively Parallel Solver), SuperLU, and UMFPACK, my main focus is on SuperLU [1]. A brief comparison of MUMPS [2] and SuperLU is also presented toward the end of this homework report.

The latest published effort on parallelizing SuperLU was made by Xiaoye Sherry Li [1] in 2008. In that publication, Li adapted the SuperLU technique to computers equipped with chip multiprocessor (CMP) systems. She mainly compared the performance of the Pthreads and MPI implementations in all major steps of the sparse LU factorization and triangular solution algorithms on several representative multicore machines. The key architectural specifications of the experimental machines used in this study are summarized in the following table and figure; more detailed information on these machines is provided in reference [1].

The factorization in SuperLU_MT, using Pthreads or OpenMP, consists of a left-looking factorization scheme with dynamic scheduling based on the elimination tree; these schemes are illustrated in Figure-2. Figure-3 illustrates the factorization steps in SuperLU_DIST using MPI, namely the 2D block-cyclic partition and distribution of the sparse matrix.
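Since the 2D block-cyclic distribution recurs throughout the rest of this report, a minimal sketch of the standard mapping rule may help. This is an illustration of the general scheme, not SuperLU_DIST's actual code; under it, block (bi, bj) of the matrix belongs to process (bi mod Pr, bj mod Pc) of a Pr x Pc process grid:

    #include <stdio.h>

    /* Illustrative sketch of the standard 2D block-cyclic mapping (not
     * SuperLU_DIST code): blocks are dealt out cyclically in both
     * dimensions. Returns the row-major rank of the owning process of
     * matrix block (bi, bj) in a Pr x Pc process grid. */
    static int owner(int bi, int bj, int Pr, int Pc)
    {
        return (bi % Pr) * Pc + (bj % Pc);
    }

    int main(void)
    {
        int Pr = 2, Pc = 3;   /* 2 x 3 process grid (6 ranks)  */
        int nb = 6;           /* a 6 x 6 grid of matrix blocks */
        for (int bi = 0; bi < nb; ++bi) {
            for (int bj = 0; bj < nb; ++bj)
                printf(" %d", owner(bi, bj, Pr, Pc));
            printf("\n");
        }
        return 0;
    }

Dealing blocks out cyclically in both dimensions keeps every process row and column involved as the elimination sweeps down the matrix, which balances the load during both factorization and triangular solution.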
The triangular solution phase in SuperLU_MT is not yet parallel, so Li evaluated the parallel algorithm only in SuperLU_DIST. The same 2D block-cyclic distribution used in the factorization step (Figure-3) is reused for this phase, as illustrated in Figure-4 (distribution and solution process). Using the triangular-solution equations (solving Ly = b by forward substitution and then Ux = y by back substitution; a small substitution sketch appears at the end of this report), the four steps shown in Figure-4 are explained in detail in reference [1].

The overall structures of the two solvers differ. SuperLU_MT uses partial pivoting, so fill-ins are generated dynamically and the symbolic factorization step cannot be separated from the numerical factorization. SuperLU_DIST, in contrast, uses static pivoting, so the symbolic and numerical factorization steps are separated. Figure-5 shows the major steps of the two solvers. The two algorithms were tested on benchmark matrices from the University of Florida Sparse Matrix Collection [3]; see Table-2.

Li used PAPI [4] to access the machines' hardware counters, and also used the CrayPat performance tool provided on the Cray XT system [5]; a small PAPI usage sketch appears at the end of this report. The load and store instruction counts, which are independent of the cache memory organization, were examined with PAPI; the factorization load and store instruction counts of SuperLU_MT and SuperLU_DIST were then compared, and the result of this comparison is provided in Table-3.

The main bottleneck in the factorization code of both solvers is the BLAS routines, which account for 30-40% of the time or more. The kernel in SuperLU_MT is "BLAS 2.5" (mostly DGEMV), while in SuperLU_DIST it is "BLAS 3" (mostly DGEMM); see the calling sketch at the end of this report. The performance of DGEMV and DGEMM on Clovertown and VictoriaFalls is shown in Figure-6; in each case, Li used the vendor's high-performance mathematical libraries.

The factorization times on the Intel Clovertown machine and on Sun VictoriaFalls are summarized in the following two tables, which compare the two algorithms on the four benchmark matrices of Table-2. The speedup is the ratio of the SuperLU_MT runtime to the SuperLU_DIST runtime; the SuperLU_MT runtime is consistently less than that of SuperLU_DIST. The triangular solution runtimes on the Intel Clovertown and the IBM Power5 are listed in the following table.

More detailed discussion can be found in [1]; however, it is worth mentioning here that Li concluded that "the Pthreads code is usually more robust and delivers consistently better performance than the MPI code, particularly on a multicore + multithreading ..."
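As promised above, here is a small self-contained forward-substitution sketch for the triangular solution phase, written for a sparse lower-triangular matrix in compressed-column storage, the column-wise format the SuperLU family builds on. It is purely illustrative, not SuperLU's actual kernel; the helper name lsolve and the 3 x 3 example matrix are my own:

    #include <stdio.h>

    /* Illustrative sketch (not SuperLU's kernel): solve L*y = b by
     * forward substitution, where L is sparse lower triangular in
     * compressed-column storage and the diagonal entry l_jj is stored
     * first in each column j. On entry y holds b; on exit, the solution. */
    static void lsolve(int n, const int *colptr, const int *rowind,
                       const double *val, double *y)
    {
        for (int j = 0; j < n; ++j) {
            y[j] /= val[colptr[j]];            /* y_j = (b_j - updates)/l_jj */
            for (int p = colptr[j] + 1; p < colptr[j+1]; ++p)
                y[rowind[p]] -= val[p] * y[j]; /* update rows below j */
        }
    }

    int main(void)
    {
        /* L = [2 0 0; 1 3 0; 0 4 5], stored column by column. */
        int    colptr[] = {0, 2, 4, 5};
        int    rowind[] = {0, 1, 1, 2, 2};
        double val[]    = {2, 1, 3, 4, 5};
        double y[]      = {2, 4, 9};   /* b = L*(1,1,1) */

        lsolve(3, colptr, rowind, val, y);
        printf("y = %g %g %g\n", y[0], y[1], y[2]);   /* prints 1 1 1 */
        return 0;
    }

The back substitution Ux = y is symmetric, proceeding over the columns of U in reverse order.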


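Also as noted above, Li gathered load and store instruction counts through PAPI [4]. Below is a minimal sketch of counting them with PAPI's low-level event-set API. The preset events PAPI_LD_INS and PAPI_SR_INS are standard PAPI presets for load and store instructions, though their availability depends on the platform (papi_avail reports what a machine supports), and the measured loop here is just a stand-in kernel:

    #include <stdio.h>
    #include <papi.h>   /* link with -lpapi */

    #define N 1000000

    static double x[N], y[N];

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[2];

        /* Initialize PAPI and build an event set with the two preset
         * events for load and store instruction counts. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI_library_init failed\n");
            return 1;
        }
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_LD_INS);   /* load instructions  */
        PAPI_add_event(evset, PAPI_SR_INS);   /* store instructions */

        PAPI_start(evset);
        for (int i = 0; i < N; ++i)           /* stand-in kernel being measured */
            y[i] += 2.0 * x[i];
        PAPI_stop(evset, counts);

        printf("loads = %lld, stores = %lld\n", counts[0], counts[1]);
        return 0;
    }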

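Finally, to make the "BLAS 2.5" versus "BLAS 3" distinction concrete, here is a minimal sketch of the two dense kernels through the standard CBLAS interface. The operands are tiny illustrative values of my own; inside the solvers these calls occur in the panel and block updates:

    #include <stdio.h>
    #include <cblas.h>   /* standard CBLAS interface; link with a BLAS library */

    int main(void)
    {
        double A[4] = {1, 2, 3, 4};          /* 2 x 2 matrix, row-major */
        double x[2] = {1, 1}, yv[2] = {0, 0};
        double B[4] = {1, 0, 0, 1}, C[4] = {0, 0, 0, 0};

        /* BLAS 2 matrix-vector product (DGEMV): yv = 1.0*A*x + 0.0*yv */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 2,
                    1.0, A, 2, x, 1, 0.0, yv, 1);

        /* BLAS 3 matrix-matrix product (DGEMM): C = 1.0*A*B + 0.0*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 2, 2, 2,
                    1.0, A, 2, B, 2, 0.0, C, 2);

        printf("yv = %g %g\n", yv[0], yv[1]);
        printf("C  = %g %g %g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }

DGEMM reuses each matrix entry across many multiply-adds, while DGEMV touches each entry of A only once; this data reuse is why BLAS 3 kernels typically reach a much higher fraction of machine peak.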
