

#### **Energy-Aware Linear Algebra**

#### Enrique S. Quintana-Ortí



UNIVERSITAT High Performance Computing and Architectures



#### Green500 vs Top500 (June 2013)

| Rank      | Site                                                       | Technology                        | MFLOPS/W |
|-----------|------------------------------------------------------------|-----------------------------------|----------|
| Top/Green |                                                            |                                   |          |
| 1/32      | Tianhe-2 - National<br>University of Defense<br>Technology | Intel Xeon E5 + Intel<br>Xeon Phi | 1.901    |
| 467/1     | Eurora - CINECA                                            | Intel Xeon E5 +<br>NVIDIA K20     | 3.208    |



#### Green500 vs Top500 (June 2013)

| Rank      | Site                                                       | Technology                        | MFLOPS/W | MW to<br>EXAFLOPS? |
|-----------|------------------------------------------------------------|-----------------------------------|----------|--------------------|
| Top/Green |                                                            |                                   |          | LAATLOF 5          |
| 1/32      | Tianhe-2 - National<br>University of Defense<br>Technology | Intel Xeon E5 + Intel<br>Xeon Phi | 1.901    | 408                |
| 467/1     | Eurora - CINECA                                            | Intel Xeon E5 +<br>NVIDIA K20     | 3.208    | 312                |



Most powerful reactor under construction in France Flamanville (EDF, 2017 for US \$9 billion): 1,630 MWe



#### Green500 vs Top500 (November 2012)

| Rank<br>Top/Green | Site                                        | Technology                    | MFLOPS/W | MW to<br>EXAFLOPS? |
|-------------------|---------------------------------------------|-------------------------------|----------|--------------------|
|                   | Tianhe-<br>Univers 1 MW ≈ \$1<br>Technology | Million/year!                 |          | 408                |
| 467/1             | Eurora - CINECA                             | Intel Xeon E5 +<br>NVIDIA K20 | 3.208    | 312                |



Most powerful reactor under construction in France Flamanville (EDF, 2017 for US \$9 billion): 1,630 MWe

The University of Texas at Austin



System ranked #1 in Green500











- Reduce energy consumption!
  - Costs over lifetime of an HPC facility often exceed acquisition costs
  - Carbon dioxide is a hazard for health and environment
  - Heat reduces hardware reliability
- Personal view
  - Hardware features some power-saving mechanisms
  - Scientific apps. are in general energy-oblivious

### **Experimental setup**



#### AMD

- 2 AMD Opteron 6128, 48GB
- DVFS per core

| P-state | $V CC_i$ | $f_i$ |
|---------|----------|-------|
| $P_0$   | 1.23     | 2.00  |
| $P_1$   | 1.17     | 1.50  |
| $P_2$   | 1.12     | 1.20  |
| $P_3$   | 1.09     | 1.00  |
| $P_4$   | 1.06     | 0.80  |

#### Intel

- 2 Intel Xeon E5504, 32GB
- DVFS per socket

| P-state | $VCC_i$ | $f_i$ |
|---------|---------|-------|
| $P_0$   | 1.04    | 2.00  |
| $P_1$   | 0.98    | 1.73  |
| $P_2$   | 0.95    | 1.60  |
| $P_3$   | 1.01    | 1.87  |

#### C-states:

C0: normal operation mode

C1, C1E: disable core components (L1/L2 caches), clock signal, mem. controller,... increases energy savings at the expense of recovery time

## **Experimental setup**



- National Instruments NI9205+NIcDAQ-9178
- 1,000 Samples/s per channel



#### Outline



- Modeling power
- Saving power in task-parallel applications
  - ILUPACK for multicore processors
  - CG for hybrid CPU-GPU platforms
- Conclusions

#### Outline



#### Modeling power

- Saving power in task-parallel applications
  - ILUPACK for multicore processors
  - CG for hybrid CPU-GPU platforms
- Conclusions



 $P = P^{(S)Y(stem)} + P^{C(PU)} = P^{Y} + P^{S(tatic)} + P^{D(ynamic)}$ 

- $P^{C}$  is the power dissipated by CPU (socket):  $P^{S} + P^{D}$
- *P<sup>S</sup>* is the static power
- *P<sup>D</sup>* is dynamic power
- *P<sup>Y</sup>* is the power of remaining components (e.g., RAM)

#### Considerations:

- $P^{Y}$  and  $P^{S}$  are constants (though  $P^{S}$  grows with temperature)
- Hot system



Estimated as *idle* power Due to off-chip components: e.g., RAM (only mainboard)

#### $P = P^Y + P^S + P^D$



 $P^Y \approx P^I = 80.15 \text{ W}$ 





#### Static power:

#### $P = P^Y + P^S + P^D$





Dynamic power:

 $P = P^{Y} + P^{S} + P^{D}$ 





 $P = P^Y + P^S + P^D$ 

Dynamic power:





#### Task-parallel DLA on multicore and CPU-GPU







#### Task-parallel DLA on multicore and CPU-GPU



$$P_{\text{Chol}}(t) = P^{U} + P^{S} + P_{\text{Chol}}^{D}(t)$$
  
=  $P^{U} + P^{S} + \sum_{i=1}^{r} \sum_{j=1}^{c} P_{i}^{D} N_{i,j}(t)$ 

|                                 | Block size, $b$ |       |       |       |  |  |  |
|---------------------------------|-----------------|-------|-------|-------|--|--|--|
| Task                            | 128 192         |       | 256   | 512   |  |  |  |
| $P_{\rm P}^D \; ({\tt dpotrf})$ | 10.26           | 10.35 | 10.45 | 11.28 |  |  |  |
| $P_{T}^{D} \; (\mathtt{dtrsm})$ | 10.12           | 10.31 | 10.32 | 10.80 |  |  |  |
| $P_{S}^{D} \; (\texttt{dsyrk})$ | 11.22           | 11.47 | 11.67 | 12.60 |  |  |  |
| $P_{G}^{D} \; (\texttt{dgemm})$ | 11.98           | 12.54 | 12.72 | 13.30 |  |  |  |
| $P^D_{\sf B}$ (busy)            | 7.62            | 7.62  | 7.62  | 7.62  |  |  |  |

- Use average
   Power
- Depends also on #active sockets!

The University of Texas at Austin



1- >

- Task-parallel DLA on multicore and CPU-GPU
  - Accommodate to memory contention

$$P_i^D \longrightarrow P_{\{i:j\}}^D = \delta_{\{i:j\}} \cdot P_j^M + (1 - \delta_{\{i:j\}}) \cdot P_j^F \qquad \delta_{\{i:j\}} = \frac{R_{\{i:j\}} - T_j(b)}{R_{\{i:j\}}}$$

$$P_{Op}(t) = P^{Y} + P^{C}(t)$$
  

$$= P^{Y} + P^{S} + P_{Op}^{D}(t)$$
  

$$= P^{Y} + P^{S} + \sum_{k=1}^{c} \sum_{j=1}^{r} \sum_{i=1}^{n_{j}} P_{\{i:j\}}^{D} \cdot M_{k,\{i:j\}}(t)$$
  

$$= P^{Y} + P^{S} + \sum_{k=1}^{c} \sum_{j=1}^{r} \sum_{i=1}^{n_{j}} \left(\delta_{\{i:j\}} \cdot P_{j}^{M} + (1 - \delta_{\{i:j\}}) \cdot P_{j}^{F}\right) \cdot M_{k,\{i:j\}}(t)$$



#### Task-parallel DLA on multicore and CPU-GPU

Accommodate memory contention



| Task                    | $P_j^M$ | $P_j^F$ | $\min_b \bar{\delta}_{j,b}$ -max <sub>b</sub> $\bar{\delta}_{j,b}$ |
|-------------------------|---------|---------|--------------------------------------------------------------------|
| CHOLESKY FACTORIZATION  | 13.32   | 18.72   | 45-86                                                              |
| TRIANGULAR SOLVE        | 7.47    | 15.66   | 14–28                                                              |
| SYMMETRIC RANK-b UPDATE | 12.83   | 16.00   | 15–38                                                              |
| MATRIX-MATRIX PRODUCT   | 14.67   | 15.70   | 7–15                                                               |
| LU FACTORIZATION        | 12.83   | 17.75   | 75–95                                                              |
| TRIANGULAR SOLVE        | 12.12   | 19.40   | 55-80                                                              |
| 2x1 LU FACTORIZATION    | 12.54   | 16.54   | 33–76                                                              |
| 2x1 triangular solve    | 12.53   | 19.55   | 81–86                                                              |
| QR FACTORIZATION        | 15.30   | 16.88   | 62-85                                                              |
| APPLY ORTH. TRANSF.     | 12.10   | 26.98   | 76–86                                                              |
| 2x1 QR FACTORIZATION    | 13.91   | 19.18   | 65-82                                                              |
| 2x1 Apply orth. transf. | 6.84    | 16.72   | 16–32                                                              |
| BUSY WAIT               | 0       | 9.21    | _                                                                  |



Task-parallel DLA on multicore and CPU-GPU





#### Task-parallel DLA on multicore and CPU-GPU





#### • Simple, yet accurate:

- Dense factorizations (Cholesky, LU, QR)

#### Multicore processors

"Modeling power and energy consumption of dense matrix factorizations on multicore processors" P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana. CCPE 2013

#### CPU-GPU platforms

"Enhancing performance and energy consumption of runtime schedulers for dense linear algebra" P. Alonso, M. F. Dolz, F. D. Igual, R. Mayo, E. S. Quintana. CCPE 2013 (submitted)

#### ILUPACK on multicore processors

"Assessing the impact of the CPU power-saving modes on the task-parallel solution of sparse linear systems" J. Aliaga, M. Barreda, M. F. Dolz, A. Martín, R. Mayo, E. S. Quintana. Cluster Computing 2013 (submitted)

#### Outline



#### Modeling power

- Saving power in task-parallel appl.
  - ILUPACK for multicore processors
  - CG for hybrid CPU-GPU platforms

#### Conclusions

# **ILUPACK on multicore**



#### Incomplete LU Package (<u>http://ilupack.tu-bs.de</u>)

- Iterative Krylov subspace methods
- Multilevel ILU preconditioners for general/symmetric/Hermitian positive definite systems
- Based on inverse ILUs with control over growth of inverse triangular factors
- Specially competitive for linear systems from 3D PDEs

## ILUPACK on multicore Task parallelism



- Multi-threaded parallelism (real s.p.d. systems)
  - Leverage task parallelism
  - Dynamic scheduling via runtime (OpenMP)



PA

## ILUPACK on multicore Task parallelism



Run-time in charge of scheduling



"Exploiting thread-level parallelism in the iterative solution of sparse linear systems" J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana. Parallel Computing, 2011

## ILUPACK on multicore Experimental setup



- Sparse linear system benchmark
  - Laplacian equation  $-\Delta u = f$  in a 3D unit cube  $\Omega = [0,1]^3$
  - Linear system Au = b with  $A \rightarrow n \times n$ ,  $n = 252^3 \approx 16$  million unknowns and 111 millions of nonzero entries





| Platform | P-state, $P_i$ | $V_i$ | $f_i$ |
|----------|----------------|-------|-------|
|          | P0             | 1.23  | 2.00  |
|          | P1             | 1.17  | 1.50  |
| WT_AMD   | P2             | 1.12  | 1.20  |
|          | $\mathbf{P3}$  | 1.09  | 1.00  |
|          | P4             | 1.06  | 0.80  |

- DVFS = P-states (see ACPI standard)
- Moving to a higher P-state results in ↓power
- $\downarrow$ Power =  $\downarrow$ Energy?
- For a compute-bounded operation, f<sub>i</sub> is linear to time<sup>-1</sup>
- In principle, for a memory-bounded operation (ILUPACK), reducing f<sub>i</sub> should have a minor impact on performance



1st attempt: Dynamic Static voltage-frequency scaling

| P-state P <sub>i</sub> | $T_i$ | $\bar{P}_i^T$ | Ei        | $\Delta T_i$ | $\Delta \bar{P}_i^T$ | $\Delta E_i$ |
|------------------------|-------|---------------|-----------|--------------|----------------------|--------------|
| P <sub>0</sub>         | 34.06 | 282.87        | 9,634.78  | _            | _                    | —            |
| $P_1$                  | 43.57 | 235.64        | 10,267.72 | 21.88        | -16.69               | 6.53         |
| $P_2$                  | 54.48 | 210.86        | 11.478.79 | 59.91        | -25.45               | 19.20        |
| P <sub>3</sub>         | 61.58 | 197.01        | 12.132.79 | 80.73        | -30.35               | 25.87        |
| P <sub>4</sub>         | 76.50 | 186.86        | 14,295.18 | 124.47       | -33.94               | 48.28        |

Why?



1st attempt: Dynamic Static voltage-frequency scaling

| P-state P <sub>i</sub> | Vcc <sub>i</sub> | f <sub>i</sub> | $T_i$ | $\Delta T_i$ | BWi   | $\Delta BW_i$ |
|------------------------|------------------|----------------|-------|--------------|-------|---------------|
| P <sub>0</sub>         | 1.23             | 2.00           | 34.06 | _            | 30.29 | _             |
| $P_1$                  | 1.17             | 1.50           | 43.57 | 21.88        | 24.63 | -18.67        |
| $P_2$                  | 1.12             | 1.20           | 54.48 | 59.91        | 20.46 | -32.44        |
| P <sub>3</sub>         | 1.09             | 1.00           | 61.58 | 80.73        | 17.48 | -42.30        |
| P <sub>4</sub>         | 1.06             | 0.80           | 76.50 | 124.47       | 14.00 | -53.77        |
|                        |                  |                |       |              |       |               |

- Combined effect of linear decrease of CPU performance and memory bandwidth!
- Decrease of  $P_i^s(P_0 \rightarrow P_2: -21.47\%)$ , decrease of  $P_i^D(P_0 \rightarrow P_3: -60.73\%)$  but  $P_i^y$  does not change!



- 2nd attempt: DVFS during idle periods





2nd attempt: DVFS during idle periods





2nd attempt: DVFS during idle periods



The University of Texas at Austin



Active polling for work...





3rd attempt: DVFS and idle-wait





- 3rd attempt: DVFS and idle-wait:
  - Savings of 6.92% of total energy
  - Negligible impact on execution time
- ...but take into account that
  - Idle time: 23.70%
  - Dynamic power: 32.32%
  - Upper bound of savings: 39.32 · 0.2370 = 9.32%



# ILUPACK on multicore Leveraging P-states (Intel)



#### DVFS

|       | $P_i$ | $T_i$  | $\bar{P}_i^T$ | $E_i$         | $\Delta T_i$ | $\Delta \bar{P}_i^T$ | $\Delta E_i$ |
|-------|-------|--------|---------------|---------------|--------------|----------------------|--------------|
| ILU   | $P_0$ | 56.43  | 135.17        | $7,\!627.97$  | —            | _                    | —            |
|       | $P_1$ | 59.06  | 127.96        | 7,557.87      | 4.67         | -5.33                | -0.92        |
|       | $P_2$ | 62.93  | 121.99        | $7,\!676.98$  | 11.52        | -9.75                | 0.64         |
|       | $P_3$ | 67.05  | 116.22        | 7,792.77      | 18.82        | -18.82               | 2.16         |
| Solve | $P_0$ | 148.94 | 155.27        | $23,\!123.99$ | —            | _                    | —            |
|       | $P_1$ | 148.52 | 151.07        | $22,\!434.73$ | -0.28        | -2.70                | -2.98        |
|       | $P_2$ | 154.86 | 145.11        | $22,\!469.38$ | 3.97         | -6.55                | -2.83        |
|       | $P_3$ | 159.08 | 138.50        | $22,\!033.14$ | 6.81         | -10.80               | -4.72        |



# ILUPACK on multicore Leveraging P- and C-states (Intel)





#### Outline



- Modeling power
  - ILUPACK for multicore processors
- Saving power in task-parallel appl.
  - ILUPACK for multicore processors
  - CG for hybrid CPU-GPU platforms

#### Conclusions

# The CG method on CPU-GPU



- Leveraging P-states on CPU-GPU platforms?
  - Apply DVFS to the CPU while computation proceeds on the GPU?
- Leveraging C-states on CPU-GPU platforms?
  - What is the CPU doing while computation proceeds on the GPU?

### The CG method on CPU-GPU Experimental setup

#### Sandy:

- Intel i7-3770K, 16GB
- NVIDIA GeForce GTX480

#### Cases from two matrix collections

| Source  | Matrix     | #nonzeros $(n_z)$ | Size $(n)$ | $n_z/n$ |
|---------|------------|-------------------|------------|---------|
| UFMC    | audikw_1   | 77,651,847        | 943,645    | 82.28   |
|         | BMWCRA1    | 10,641,602        | 148,770    | 71.53   |
|         | CRANKSEG_2 | 14,148,858        | 63,838     | 221.63  |
|         | F1         | 26,837,113        | 343,791    | 78.06   |
|         | INLINE_1   | 38,816,170        | 503,712    | 77.06   |
|         | LDOOR      | 42,493,817        | 952,203    | 44.62   |
|         | A100       | 6,940,000         | 1,000,000  | 6.94    |
|         | A126       | 13,907,370        | 2,000,376  | 6.94    |
| Laplace | A159       | 27,986,067        | 4,019,679  | 6.94    |
|         | A200       | 55,760,000        | 8,000,000  | 6.94    |
|         | A252       | 111,640,032       | 16,003,001 | 6.94    |















#### CG: Sparse matrix-vector (SpMV) + CUBLAS

```
while((k < maxiter) && (res > epsilon)){
   SSpMV <<<Gs,Bs>>> (n, rowA, colA, valA, d, z);
   tmp = cublasSdot (n, d, 1, z, 1);
   rho = beta / tmp;
   gamma = beta;
   cublasSaxpy (n, rho, d, 1, x, 1);
   cublasSaxpy (n, -rho, z, 1, r, 1);
   beta = cublasSdot(n, r, 1, r, 1);
   alpha = beta / gamma;
   cublasSscal (n, alpha, d, 1);
   res = sqrt(beta);
   k++;
} // end-while
```



#### CG: Sparse matrix-vector (SpMV) + CUBLAS

```
while( ( k < maxiter ) && ( res > epsilon ) ){
   SSpMV <<<Gs,Bs>>> ( n, rowA, colA, valA, d, z );
   tmp = cublasSdot ( n, d, 1, z, 1 );
   rbo = beta / tmp:
```

#### Leveraging P-states:

- Basically all computation performed on the GPU
- Apply static VFS to reduced power in CPU!

```
aipna = beta / gamma;
cublasSscal (n, alpha, d, 1 );
cublasSaxpy (n, one, r, 1, d, 1 );
res = sqrt( beta );
k++;
} // end-while
```



#### CG: Sparse matrix-vector (SpMV) + CUBLAS

```
while( ( k < maxiter ) && ( res > epsilon ) ){
   SSpMV <<<Gs,Bs>>> ( n, rowA, colA, valA, d, z );
   tmp = cublasSdot ( n, d, 1, z, 1 );
   rbo = beta / tmp:
```

#### Leveraging C-states:

- What is the CPU doing while computation proceeds on the GPU?
- CUDA offers polling (active-wait) vs blocking (idle-wait) operation modes

```
res = sqrt( beta );
k++;
} // end-while
```



 Trading off energy for time: variations of CUDA blocking mode w.r.t. CUDA polling mode





 Trading off energy for time: variations of CUDA blocking mode w.r.t. CUDA polling mode





 Trading off energy for time: variations of CUDA blocking mode w.r.t. CUDA polling mode



# The CG method on CPU-GPU Merged implementation



- Can we attain polling performance and blocking energy advantage?
- Requires a reformulation of CG (merge kernels)



# The CG method on CPU-GPU Merged implementation



Time vs. CPU energy

Maintain performance of polling...



...while leveraging energy-efficiency of C-states+idle-wait

### Performance and energy consumption Summary



### "Do nothing, efficiently ... " (V. Pallipadi, A. Belay)

or

#### "Doing nothing well" (D. E. Culler)

#### Acknowledgments





A. F. Martín







J. I. Aliaga, M. Barreda, M. F. Dolz, R. Mayo, J. Pérez, E. S. Quintana-Ortí





#### Acknowledgments



 EU FP7 318793 Project "EXA2GREEN. Energy-Aware Sustainable Computing on Future Technology - Paving the Road to Exascale Computing"



Project CICYT TIN2011-23283
 "PA-HPC. Power-Aware High Performance Computing"

