

# Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners

José I. Aliaga





# Leveraging task-parallelism in energy-efficient ILU preconditioners



- Universidad Jaime I (Castellón, Spain)
  - José I. Aliaga
  - Manuel F. Dolz
  - Rafael Mayo
  - Enrique S. Quintana-Ortí
- CIMNE (Barcelona, Spain)
  - Alberto F. Martín









### **2010 PFLOPS** (10<sup>15</sup> flops/sec.)

#### **2010 JUGENE**

- 10<sup>9</sup> core level (PowerPC 450, 850MHz → 3.4 GFLOPS)
- 10¹ node level

(Quad-Core)

10<sup>5</sup> cluster level

(73.728 nodes)







### **2010 PFLOPS** (10<sup>15</sup> flops/sec.)

#### **2010 JUGENE**

- 10<sup>9</sup> core level (PowerPC 450, 850MHz → 3.4 GFLOPS)
- 10¹ node level

(Quad-Core)

10<sup>5</sup> cluster level

(73.728 nodes)

### **2020 EFLOPS** (10<sup>18</sup> flops/sec.)

- 10<sup>9.5</sup> core level
- 10³ node level!
- 10<sup>5.5</sup> cluster level





### Green500 (November 2011\*)

| Rank      | Site, Computer                                        | #Cores  | MFLOPS/W | LINPACK<br>(TFLOPS) | MW to EXAFLOPS? |
|-----------|-------------------------------------------------------|---------|----------|---------------------|-----------------|
| Green/Top |                                                       |         |          | ( 20. 3)            |                 |
| 1/29      | IBM Rochester – BlueGene/Q,<br>Power BQC 16C 1.60 GHz | 32.768  | 2.026.48 | 339,83              | 493,47          |
| 32/1      | RIKEN AICS K Computer–<br>Sparc64 VIIIfx (8-core)     | 705.024 | 830,18   | 10.510,00           | 1.204,60        |



Most powerful reactor under construction in France Flamanville (EDF, 2017 for US \$9 billion): 1,630 MWe

\*Green500 June 2012 to be released today





### Green500/Top500 (June 2012)

| Rank    | Site , Computer                                   | #Cores | MFLOPS/W | LINPACK<br>(TFLOPS) | MW to EXAFLOPS? |
|---------|---------------------------------------------------|--------|----------|---------------------|-----------------|
| Green/1 | ор                                                |        |          | (11 231 3)          | 277 ti 201 0 .  |
| 1/252   | DOE/NNSA/LLNL BlueGene/Q,<br>Power BQC 16C 1.6GHz |        | 2,100'88 | 86,35               | 475,99          |
| 20/1    | DOE/NNSA/LLNL BlueGene/Q,<br>Power BQC 16C 1.6GHz |        | 2,069'04 | 16,324,75           | 483,31          |



Most powerful reactor under construction in France Flamanville (EDF, 2017 for US \$9 billion): 1,630 MWe





### Reduce energy consumption!

- Costs over lifetime of an HPC facility often exceed acquisition costs
- Carbon dioxide is a hazard for health and environment
- Heat reduces hw reliability

#### Personal view

- Hardware features energy saving mechanism
- Scientific apps are in general energy oblivious



### **Outline**



- Introduction
- ILUPACK
- Experimental setup
- Power model
  - Leveraging P-states
  - Leveraging C-states
- Conclusions



### **ILUPACK**



- Incomplete LU Package (<a href="http://ilupack.tu-bs.de">http://ilupack.tu-bs.de</a>)
  - Numerical solution of large sparse linear systems (Ax=b)
  - Iterative Krylov subspace methods (CG, GMRES)
  - Multilevel ILU preconditioners for general/symmetric/Hermitian positive definite systems
  - Incorporate the inverse-based approach to the factorization, to control the growth of inverse triangular factors
  - Specially competitive for linear systems from 3D PDEs



### **ILUPACK**



Factorization of a five-point matrix arising from Laplace PDE discretization.



## ILUPACK Multi-threaded version (task parallelism)



- Real s.p.d. systems
- Construction of preconditioner and PCG solver
- Algebraic parallelization based on a task tree
- Leverage task parallelism of the tree
- Dynamic scheduling via runtime (OpenMP)
  - "Exploiting thread-Level parallelism in the iterative solution of sparse linear systems". J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí. Parallel Computing, 2011



# **ILUPACK Multi-threaded version (task parallelism)**







## ILUPACK Multi-threaded version (task parallelism)



Run-time in charge of scheduling





### **Experimental setup**



- 2 AMD Opteron 6128 processors (16 cores)
- 48 GB of RAM
- DVFS enabled per core (P-states)

| P-state $P_i$ | $VCC_i$ | $  f_i  $ |
|---------------|---------|-----------|
| $P_0$         | 1.23    | 2.00      |
| $P_1$         | 1.17    | 1.50      |
| $P_2$         | 1.12    | 1.20      |
| $P_3$         | 1.09    | 1.00      |
| $P_4$         | 1.06    | 0.80      |

#### C-states:

- C0: normal operation mode
- C1, C1E: disable core components (L1/L2 caches), clock signal, mem.
   controller,... increases energy savings at the expense of recovery time



### **Experimental setup**



### Sparse linear system benchmark

Laplacian PDE equation

$$-\Delta u = f$$

in a 3D unit cube  $\Omega = [0,1]^3$  with Dirichlet boundary conditions u = g on  $\partial\Omega$ .

- For the discretization,
  - $_{\circ}$   $\Omega$  replaced by NxNxN uniform grid
  - $_{\circ}$   $\Delta u$  approximated by centered finite differences
- Linear system Au = b with  $A \rightarrow n \times n$ ,
  - ∘ N = 252,  $n = 252^3 \approx 16$  million unknowns
  - 111 millions of nonzero entries



# Cost of energy Setup



- DC powermeter with sampling freq. = 25 Hz
  - LEM HXS 20-NP transductors with PIC microcontroller
  - RS232 serial port











$$P^{T(otal)} = P^{(S)Y(stem)} + P^{C(PU)} = P^{Y} + P^{S(tatic)} + P^{D(ynamic)}$$

- $P^{C}$  is the power dissipated by CPU (socket):  $P^{S} + P^{D}$
- P<sup>S</sup> is the static power
- P<sup>D</sup> is dynamic power
- $P^Y$  is the power of remaining components (e.g., RAM)

### Considerations:

- P<sup>D</sup> changes with the number of active cores
- $P^{Y}$  and  $P^{S}$  are constants (though  $P^{S}$  grows with temperature)
- Hot system





System power:

Estimated as *idle* power

Due to off-chip components:
e.g., RAM (only mainboard)

$$P = P^Y + P^S + P^D$$

Power dissipated as function of number of active cores



$$P^{Y} \approx P^{I} = 80.15 \text{ W}$$





### CPU power:

- Busy-wait loops
- For each P-state and c
- Linear regression

$$P = \alpha + \beta \cdot c$$

where

$$\alpha = P^Y + P^S$$
$$\beta \cdot c = P^D$$

$$P = P^Y + P^S + P^D$$

Power dissipated as function of number of active cores







Static power:

$$P = P^Y + P^S + P^D$$

Power dissipated as function of number of active cores



$$P_0^T(c) = \alpha_0 + \beta_0 \cdot c = 168.59 + 9.12 \cdot c \text{ W}$$
  
 $P_0^S(c) \approx \alpha_0 - P_0^T(c) = 168.59 - 80.15 = 88.44 \text{ W}$ 





Dynamic power:

$$P = P^Y + P^S + P^D$$

Power dissipated as function of number of active cores



$$P_0^T(c) = \alpha_0 + \beta_0 c = 168.59 + 9.12 \cdot c \text{ W}$$

Busy-wait:  $P_0^D \approx \beta_0 c = 9.12 \cdot c \text{ W}$ 





| P-state P <sub>i</sub> | $V$ cc $_i$ | fi   | $\alpha_i$ | $\beta_i$ | $\Delta P_i^S$ | $\Delta P_i^D$ |
|------------------------|-------------|------|------------|-----------|----------------|----------------|
| $P_0$                  | 1.23        | 2.00 | 168.59     | 9.12      | _              | _              |
| $P_1$                  | 1.17        | 1.50 | 161.10     | 5.77      | -9.52          | -32.14         |
| $P_2$                  | 1.12        | 1.20 | 155.90     | 4.23      | -17.09         | -50.25         |
| $P_3$                  | 1.09        | 1.00 | 152.94     | 3.15      | -21.47         | -60.73         |
| $P_4$                  | 1.06        | 0.80 | 150.61     | 2.44      | -25.73         | -70.30         |

- To analyze the goodness of the  $\alpha$  and  $\beta$  values we made an additional analysis.
- The static and dynamic power satisfied that  $P^{S} \approx Vcc^{2}$ ,  $P^{D} \approx Vcc^{2} \cdot f \cdot c$
- We have defined the variation operator as

$$\Delta x_i = (x_i - x_0)/x_0$$





| P-state P <sub>i</sub> | $V$ cc $_i$ | fi   | $\alpha_i$ | $\beta_i$ | $\Delta P_i^S$ | $\Delta P_i^D$ |
|------------------------|-------------|------|------------|-----------|----------------|----------------|
| $P_0$                  | 1.23        | 2.00 | 168.59     | 9.12      | _              | -              |
| $P_1$                  | 1.17        | 1.50 | 161.10     | 5.77      | -9.52          | -32.14         |
| $P_2$                  | 1.12        | 1.20 | 155.90     | 4.23      | -17.09         | -50.25         |
| $P_3$                  | 1.09        | 1.00 | 152.94     | 3.15      | -21.47         | -60.73         |
| $P_4$                  | 1.06        | 0.80 | 150.61     | 2.44      | -25.73         | -70.30         |

- Remember,  $P^Y \approx P^I$  is constant
- Thus, e.g., moving all cores from  $P_0$  to  $P_1$  $P_1^T(16) = P_1^Y + P_0^S(1-0.0952) + P_0^D(16)(1-0.3214)$  = 259.19 W
- These values agree within 2.5% with the linear regression models

## Power model Leveraging P-states



| P-state P <sub>i</sub> | $V$ cc $_i$ | $f_i$ | $\alpha_i$ | $\beta_i$ | $\Delta P_i^S$ | $\Delta P_i^D$ |
|------------------------|-------------|-------|------------|-----------|----------------|----------------|
| $P_0$                  | 1.23        | 2.00  | 168.59     | 9.12      | _              | -              |
| $P_1$                  | 1.17        | 1.50  | 161.10     | 5.77      | -9.52          | -32.14         |
| $P_2$                  | 1.12        | 1.20  | 155.90     | 4.23      | -17.09         | -50.25         |
| $P_3$                  | 1.09        | 1.00  | 152.94     | 3.15      | -21.47         | -60.73         |
| $P_4$                  | 1.06        | 0.80  | 150.61     | 2.44      | -25.73         | -70.30         |

- DVFS = P-states (see ACPI standard)
- Moving to a more power-friendly state results in ↓power
- ↓power = \puergy?
- For a compute-bounded operation,  $f_i$  is linear to performance and time<sup>-1</sup>
- In principle, for a memory-bounded operation (ILUPACK), decreasing f<sub>i</sub> should not affect time!



### Power model Leveraging P-states



1st attempt: <del>Dynamic</del> Static voltage-frequency scaling

| P-state $P_i$ | $T_i$ | $\bar{P}_i^T$ | Ei        | $\Delta T_i$ | $\Delta \bar{P}_{i}^{T}$ | $\Delta E_i$ |
|---------------|-------|---------------|-----------|--------------|--------------------------|--------------|
| $P_0$         | 34.06 | 282.87        | 9,634.78  | _            | _                        | _            |
| $P_1$         | 43.57 | 235.64        | 10,267.72 | 21.88        | -16.69                   | 6.53         |
| $P_2$         | 54.48 | 210.86        | 11.478.79 | 59.91        | -25.45                   | 19.20        |
| $P_3$         | 61.58 | 197.01        | 12.132.79 | 80.73        | -30.35                   | 25.87        |
| $P_4$         | 76.50 | 186.86        | 14,295.18 | 124.47       | -33.94                   | 48.28        |

Why?



## Power model Leveraging P-states



1st attempt: <del>Dynamic</del> Static voltage-frequency scaling

| P-state $P_i$  | Vcc <sub>i</sub> | $f_i$ | $T_i$ | $\Delta T_i$ | $BW_i$ | $\Delta BW_i$ |
|----------------|------------------|-------|-------|--------------|--------|---------------|
| $P_0$          | 1.23             | 2.00  | 34.06 | _            | 30.29  | -             |
| $P_1$          | 1.17             | 1.50  | 43.57 | 21.88        | 24.63  | -18.67        |
| $P_2$          | 1.12             | 1.20  | 54.48 | 59.91        | 20.46  | -32.44        |
| $P_3$          | 1.09             | 1.00  | 61.58 | 80.73        | 17.48  | -42.30        |
| P <sub>4</sub> | 1.06             | 0.80  | 76.50 | 124.47       | 14.00  | -53.77        |
|                |                  |       |       |              |        |               |

Combined effect of linear decrease of CPU performance and memory bandwidth!



## Power model Leveraging P-states



2nd attempt: DVFS during idle periods





## Power model Leveraging P-states



2nd attempt: DVFS during idle periods





## Power model Levaraging P-states



2nd attempt DVFS during idle periods





## Power model Leveraging P-states



Active polling for work...





### Power model Leveraging P- and C-states



3rd attempt: DVFS and idle-wait





### Power model Leveraging P- and C-states



- 3rd attempt: DVFS and idle-wait:
  - Savings of 6.92% of total energy
  - Negligible impact on execution time
- ...but take into account that
  - Idle time: 23.70%
  - Dynamic power: 39.32%
  - Upper bound of savings: 39.32 · 0.2370 = 9.32%



## Performance and energy consumption Summary



- A battle to be won in the core arena
  - More concurrency
  - Heterogeneous designs
- A related battle to be won in the power arena
  - "Do nothing, efficiently..." (V. Pallipadi, A. Belay) or "Doing nothing well" (D. E. Culler)
  - Don't forget the cost of system+static power



### More information



- "Energy-aware dense and sparse linear algebra", P. Alonso, M.F.
   Dolz, R. Mayo, E.S. Quintana. PMAA 2012. London (UK)
- "Modeling power and energy of the task-parallel Cholesky factorization on multicore processors", P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí. EnaHPC 2012. Hamburg (Germany)
- "Energy-efficient execution of dense linear algebra algorithms on multicore processors". P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí. Cluster Computing, 2012



## Leveraging Task-Parallelism in Energy-Efficient ILU Preconditioners



## Thanks for your attention!

Any question?

