# Energy-Aware Matrix Computations on Multi-Core and Many-core Platforms Enrique S. Quintana-Ortí ### Performance and energy consumption #### Top500 (November 2011) | Rank | Site | #Cores | LINPACK<br>(TFLOPS) | |------|-------------------------------------------------------------------|---------|---------------------| | 1 | RIKEN AICS K Computer- Spar64<br>VIIIfx (8-core) | 705,024 | 10,510.00* | | 2 | NSC Tianjin – NUDT YH MPP, Xeon<br>X5670 6C 2.93 GHz, NVIDIA 2050 | 186,368 | 2,566.00 | | 3 | DOE ORNL – Cray XT5-HE Opteron<br>6-core 2.6 GHz | 224,162 | 1,759.00 | | 9 | CEA (France) – Bull bullx super-node S6010/S6030 | 138,368 | 1,050.00 | | 114 | BSC (Spain) – Bull B505, Xeon E5649<br>6C 2.53 GHz, NVIDIA 2090 | 5,544 | 103.20 | <sup>1</sup> day K Computer = 394 years of the world population (7.000 million people) with a hand calculator NIVERSITAT ### Performance and energy consumption #### Green500 (November 2011) | Rank | Site | #Cores | MFLOPS/W | LINPACK<br>(TFLOPS) | |-----------|----------------------------------------------------------------------|---------|----------|---------------------| | Green/Top | | | | ( 25. 3) | | 1/29 | IBM Rochester – BlueGene/Q,<br>Power BQC 16C 1.60 GHz | 32,768 | 2,026.48 | 339.83 | | 7/114 | BSC (Spain) – Bull B505, Xeon<br>E5649 6C 2.53 GHz, NVIDIA<br>2090 | 5,544 | 1,266.26 | 103.20 | | 32/1 | RIKEN AICS K Computer–<br>Spar64 VIIIfx (8-core) | 705,024 | 830.18 | 10,510.00 | | 47/2 | NSC Tianjin – NUDT YH MPP,<br>Xeon X5670 6C 2.93 GHz,<br>NVIDIA 2050 | 186,368 | 635.15 | 2,566.00 | | 53/3 | DOE ORNL – Cray XT5-HE<br>Opteron 6-core 2.6 GHz | 582,00 | Cray | 1,759.00 | ### Multi-core and many-core platforms "Conventional" architectures New challengers... ### **Matrix computations** - Linear algebra? Please, don't run away! - Determinants, linear systems, least squares fitting, FFT, etc. - Importance: - Intel MKL, AMD ACML, IBM ESSL, NVIDIA CUBLAS, ongoing for TI #### Index - 1. Scientific applications - 2. Leveraging concurrency - 3. Cost of energy #### Index - 1. Scientific applications - 2. Leveraging concurrency - 3. Cost of energy ### Scientific applications Biological systems - Simulations of molecular dynamics - Solve $$AX = BX\Lambda$$ , dense $A,B \rightarrow n \times n$ $n = 134,484$ # Scientific applications Industrial processes - Optimal cooling of steel profiles - Solve $$A^T X + XA - XSX + Q = 0$$ , dense $A \rightarrow n \times n$ $n = 5,177$ for a mesh width of $6.91 \cdot 10^{-3}$ # Scientific applications Summary - Dense linear algebra is at the bottom of the "food chain" for many scientific and engineering apps. - Fast acoustic scattering problems - Dielectric polarization of nanostructures - Magneto-hydrodynamics - Macro-economics #### Index - 1. Scientific applications - 2. Leveraging hardware concurrency - 3. Cost of energy ### Leveraging hw. concurrency Threads #### Linear system $$2x + 3y = 3$$ $$4x - 5y = 6$$ $$A X = B$$ , with dense $A$ , $B$ → $n \times n$ : ≈ $2n^3/3 + 2n^3$ flops Intel Xeon: | n | Time 1 core | Time<br>8 cores | Time<br>16-node<br>cluster,<br>8-core<br>per node,<br>i.e., 192<br>cores | |-----------------|-------------|-----------------|--------------------------------------------------------------------------| | 100 | 33.33 ms | | - | | 1.000 | 0.33 s | | | | 10 <sup>4</sup> | 333.33 s | 41.62 s | | | 10 <sup>5</sup> | > 92 h | > 11 h | > 28 m | ### Leveraging hw. concurrency Threads #### **2010 PFLOPS** (10<sup>15</sup> flops/sec.) #### **2010 JUGENE** - 10<sup>9</sup> core level (PowerPC 450, 850MHz → 3.4 GFLOPS) - 10¹ node level (Quad-Core) 10<sup>5</sup> cluster level (73.728 nodes) #### **2020 EFLOPS** (10<sup>18</sup> flops/sec.) - 10<sup>9.5</sup> core level - 10<sup>3</sup> node level! - 10<sup>5.5</sup> cluster level # Leveraging hw. concurrency Cholesky factorization Key in the solution of s.p.d. linear systems ``` for (k=1; k<=n/b; k++) { Chol(A[k,k]); // A<sub>kk</sub> = L<sub>kk</sub> * L<sub>kk</sub> ``` Factor de Cholesky en 2 Intel Xeon QuadCore (8 cores) # Leveraging hw. concurrency Algorithmic parallelism Why? Excessive thread synchronization # Leveraging hw. concurrency Algorithmic parallelism ...but there is much more parallelism!!! # Leveraging hw. concurrency Algorithmic parallelism ...but there is much more parallelism!!! How can we leverage it? #### Scalar code #### (Super)scalar processor Something similar for (dense) linear algebra? 1st iter. ``` for (k=1; k<=n/b; k++) { F: Chol(A[k,k]); for (i=k+1; i<=n/b; i++) T: Trsm(A[k,k], A[i,k]); for (i=k+1; i<nb; i++) { U1: Syrk(A[i,k],A[i,i]); for (j=k+1; j<=i; j++) U2: Gemm(A[i,k], A[j,k], A[i,j]); }</pre> ``` Something similar for (dense) linear algebra? - Apply "scalar" techniques at the block level - Software implementation - Thread/Task-level parallelism - Target the cores/GPUs of the platform Read/written blocks determine dependencies, as in scalar case Dependencies form a dependency DAG (task tree) #### Runtime: - Decode (ID): Generate the task tree with a "symbolic analysis" of the code at execution time - Issue (ISS): Architectureaware execution of the tasks in the tree - Decode stage: - "Symbolic analysis" of the code Blocked code: Task tree: #### Issue stage: - Temporal scheduling of tasks, attending to dependencies - Mapping (spatial scheduling) of tasks to resources, aware of locality ### Leveraging hw. concurrency Implementations - SuperMatrix (UT@Austin and UJI) - Read/written blocks defined implicitly by the operations - Only valid for dense linear algebra operations encoded in libflame - SMPSs (BSC) and GPUSs (BSC and UJI) - OpenMP-like languages #pragma css task inout(A[b\*b]) void Chol(double \*A); - Applicable to task-parallel codes on different platforms: multi-core, multi-GPU, multi-accelerators, Grid,... #### Index - 1. Scientific applications - 2. Leveraging hardware concurrency - 3. Cost of energy "The free lunch is over" (H. Sutter, 2005) - Frequency wall - Power energy consumption proportional to f<sup>3</sup> f<sup>2</sup> - Electricity = money - 1<sup>st</sup> Law of Thermodynamics: Energy cannot be created or destroyed, only converted - Cost of extracting heat - Heat reduces lifetime | Rank | Site | #Cores | MFLOPS/W | LINPACK<br>(TFLOPS) | MW to EXAFLOPS? | |-----------|--------------------------------------------------------------------|---------|----------|---------------------|-----------------| | Green/Top | | | | (11 LOF 3) | LAAI LOFS! | | 1/29 | IBM Rochester – BlueGene/Q,<br>Power BQC 16C 1.60 GHz | 32.768 | 2.026.48 | 339,83 | 493,47 | | 7/114 | BSC (Spain) – Bull B505,<br>Xeon E5649 6C 2.53 GHz,<br>NVIDIA 2090 | 5.544 | 1.266,26 | 103,20 | 789,73 | | 32/1 | RIKEN AICS K Computer–<br>Spar64 VIIIfx (8-core) | 705.024 | 830,18 | 10.510,00 | 1.204,60 | NVIDIA GTX 480 (250 W) (=1/4 low power hair dryer) 2 million GTXs $\approx$ 493,47 MW! or 500.000 hair dryers | Rank | Site | #Cores | MFLOPS/W | LINPACK<br>(TFLOPS) | MW to EXAFLOPS? | |-----------|--------------------------------------------------------------------|---------|----------|---------------------|-----------------| | Green/Top | | | | (TFLOPS) | EXAFLOPS! | | 1/29 | IBM Rochester – BlueGene/Q,<br>Power BQC 16C 1.60 GHz | 32.768 | 2.026.48 | 339,83 | 493,47 | | 7/114 | BSC (Spain) – Bull B505,<br>Xeon E5649 6C 2.53 GHz,<br>NVIDIA 2090 | 5.544 | 1.266,26 | 103,20 | 789,73 | | 32/1 | RIKEN AICS K Computer–<br>Spar64 VIIIfx (8-core) | 705.024 | 830,18 | 10.510,00 | 1.204,60 | Most powerful reactor under construction in France Flamanville (EDF, 2017 for US \$9billion): 1,630 MWe # Cost of energy Setup - Modeling power of task-parallel apps. - Two Intel Xeon E5504 @ 2.0 GHz (8 cores) - Experience: more stable - Saving opportunities for task-parallel apps. - Two AMD Opteron 6128 cores @ 2.0 GHz (16 cores) - Experience: more flexible (DVFS at core level) #### Cost of energy Setup - DC powermeter with sampling freq. = 25 Hz - LEM HXS 20-NP transductors with PIC microcontroller - RS232 serial port #### Cost of energy Setup #### Cost of energy Setup #### Cost of energy Power vs. energy $$E = \int_0^T P \, dt$$ # Cost of energy Power vs. energy Which one is better, A or B? $$E_A = P_A t_A$$ $$E_B = P_B t_B$$ $$P = P^{(S)Y(stem)} + P^{C(PU)} = P^{Y} + P^{S(tatic)} + P^{D(ynamic)}$$ $P^{C}$ is power dissipated by CPU (socket): $P^{S} + P^{D}$ $P^{Y}$ is power of remaining components (e.g., RAM) #### Considerations: - $P^{Y}$ and $P^{S}$ are constants (though $P^{S}$ grows with temperature) - Hot system - Task-parallel routines - Intel platform System power: $$P = P^Y + P^S + P^D$$ Estimated as *idle* power Due to off-chip components: e.g, RAM (only mainboard) $$P^{Y} \approx P^{I} = 46.37 \text{ W}$$ Static power: $$P = P^Y + P^S + P^D$$ Also known as *Uncore* power (Intel): - LLC - Mem. controller - Interconnect controller - Power control logic - etc. Intel Xeon 5500 (4 cores) The Uncore: A Modular Approach to Feeding the High-performance Cores. D. L. Hill et al. Intel Technology Journal, Vol. 14(3), 2010 Static power: $$P = P^Y + P^S + P^D$$ $$P_{dgemm}(c) = 67.97 + 12.75 c$$ $P^{S} = 67.97 - 46.37 = 21.6 W$ Dynamic power: $$P = P^Y + P^S + P^D$$ Also known as *Core* power (Intel): - Execution units - L1 and L2 cache - Branch prediction logic - etc. UnCorePLL and Voltage CorePLL's and Voltages Pill Core 0 Core 2 Core 2 Core 2 Core 1 Core 3 1 Core 3 Core 1 Core 3 Core 1 Core 3 1 Core 1 Core 1 Core 3 4 Core 4 Core 4 Core 4 Core 5 Core 5 Core 5 Core 6 Core 6 Core 7 Core 7 Core 7 Core 8 Core 9 Co Intel Xeon 5500 (4 cores) Dynamic power: $$P = P^Y + P^S + P^D$$ $$P_{dgemm}(c) = 67.97 + 12.75 c$$ Dynamic power of task-parallelCholesky ``` P = P^Y + P^S + P^D ``` ``` for (k=1; k<=n/b; k++) { F: Chol(A[k,k]); for (i=k+1; i<=n/b; i++) T: Trsm(A[k,k], A[i,k]); for (i=k+1; i<nb; i++) { Syrk(A[i,k],A[i,i]); for (j=k+1; j<=i; j++) Gemm(A[i,k], A[j,k], A[i,j]); } }</pre> ``` #### Dynamic power of task-parallelCholesky For a given kernel, execute repeatedly till power stabilizes: $$P^{D}_{dgemm} = Pd_{gemm} - (P^{Y} + P^{S})$$ | | 1 kernel mapped to 1 core Block size, b | | | | 2 kernels mapped to 2 cores of different sockets | | | | |------------------|------------------------------------------|-------|-------|-------|--------------------------------------------------|-------|-------|-------| | | | | | | Block size, b | | | | | Task | 128 | 192 | 256 | 512 | 128 | 192 | 256 | 512 | | $P_P^D$ (dpotrf) | 10.26 | 10.35 | 10.45 | 11.28 | 9.05 | 9.09 | 9.28 | 10.44 | | $P_T^D$ (dtrsm) | 10.12 | 10.31 | 10.32 | 10.80 | 9.45 | 9.57 | 9.60 | 11.08 | | $P_S^D$ (dsyrk) | 11.22 | 11.47 | 11.67 | 12.60 | 10.42 | 10.63 | 10.82 | 11.80 | | $P_G^D$ (dgemm) | 11.98 | 12.54 | 12.72 | 13.30 | 10.90 | 12.16 | 11.28 | 11.96 | | $P_B^D$ (busy) | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | Power increases linearly with #cores, from 1 to 4 mapped to a single socket When two sockets are used, linear function changes Power of task-parallel Cholesky $$P_{chol} = P^{Y} + P^{S} + \sum_{i=1}^{r} \sum_{j=1}^{c} P^{D}_{i} N_{i,j}(t)$$ #### where - r is #different types of tasks - cis #cores - $P^{D}_{i}$ is the average dynamic power for task of type - $N_{i,j}(t) = 1$ if thread j is executing a task of type i at time t; $N_{i,j}(t) = 0$ otherwise Energy of task-parallel Cholesky $$E_{chol} = (P^{Y} + P^{S}) T + \sum_{i=1}^{r} \sum_{j=1}^{c} P^{D}_{i} \int_{t=0}^{T} N_{i,j}(t) dt$$ $$= (P^{Y} + P^{S}) T + \sum_{i=1}^{r} \sum_{j=1}^{c} P^{D}_{i} T_{i,j}$$ #### where - T is the total execution time - $T_{i,j}(t) = 1$ is the time that thread j has executed tasks of type i Error in dynamic energy consumption, b=128 Error in total energy consumption, b=128 Error in dynamic energy consumption, b=256 Error in total energy consumption, b=256 ACPI (Advanced Configuration and Power Interface): industry-standard interfaces enabling OS-directed configuration, power/thermal management of mobile/desktop/server platforms - Revision 5.0 (december 2011) - In the processor: Power states (C-states) and performance states (P-states) - Power states (C-states): - C0: normal execution (also a P-state) - Cx, x>0: no instructions being executed. As x grows, more savings but longer latency to reach C0 - Stop clock signal - Flush and shutdown cache (L1 and L2 flushed to LLC) - Turn off cores - Package power states (PC-states): - PC0, PC1, PC2,... Uncore subsystem remains active and consumes power as long as there is any active core on the CPU UncorePLL and Voltage UncorePLL's and Voltages | Core Intel Xeon 5500 (4 cores) #### Intel Core i7 processor: - Core C0 State - The normal operating state of a core where code is being executed - Core C1/C1E State - The core halts; it processes cache coherence snoops - Core C3 State - The core flushes the contents of its L1 instruction cache, L1 data cache, and L2 cache to the shared L3 cache, while maintaining its architectural state. All core clocks are stopped at this point. No snoops - Core C6 State - Before entering core C6, the core will save its architectural state to a dedicated SRAM on chip. Once complete, a core will have its voltage reduced to zero volts - Performance states (P-states): - P0: Highest performance and power - Pi, i > 0: As igrows, more savings but lower performance | P-state $P_i$ | $VCC_i$ | $f_i$ | | |---------------|---------|-------|--------------| | $P_0$ | 1.23 | 2.00 | | | $P_1$ | 1.17 | 1.50 | AMD platform | | $P_2$ | 1.12 | 1.20 | | | $P_3$ | 1.09 | 1.00 | | | $P_4$ | 1.06 | 0.80 | | • $P = a V^2 f$ , where a depends on the technology (but $E = \int_0^T P dt = a V^2$ ) Leveraging DVFS: cpufreq - Leveraging DVFS (transparent): Linux governors - Performance: Highest frequency/performance - Powersave: Lowest frequency/performance - Userspace: User's decision - Ondemand: If CPU utilization rises above the threshold value set in the up\_threshold parameter, the ondemand governor increases the CPU frequency to scaling\_max\_freq. When CPU utilization falls below this threshold, the governor decreases the frequency in steps. Lowest performance, growing with workload - **Conservative**: If CPU utilization is above up\_threshold, this governor will step up the frequency to the next highest frequency below or equal to scaling\_max\_freq. If CPU utilization is below down\_threshold, this governor will step down the frequency to the next lowest frequency until it reaches scaling\_min\_freq Which one is better, A or B? Which one is better, A or B? But consideralso $P^Y + P^S \simeq 50\%$ of power - To DVFS or not? General consensus: - No for compute-intensive apps.: reducing frequency increases execution time linearly Yes for memory-bounded apps. as cores are idle most of the time ...but, in many platforms, reducing frequency via DVFS also reduces memory bandwidth proportionally! | P-state $P_i$ | $VCC_i$ | $f_i$ | $\alpha_i$ | $\beta_i$ | $\Delta P_i^S$ | $\Delta P_i^D$ | $\Delta P_i^T(16)$ | $BW_i$ | $\Delta BW_i$ | |---------------|---------|-------|------------|-----------|----------------|----------------|--------------------|--------|---------------| | $P_0$ | 1.23 | 2.00 | 168.59 | 9.12 | _ | _ | _ | 4.43 | _ | | $P_1$ | 1.17 | 1.50 | 161.10 | 5.77 | -9.52 | -32.14 | -17.58 | 3.89 | -12.19 | | $P_2$ | 1.12 | 1.20 | 155.90 | 4.23 | -17.09 | -50.25 | -28.34 | 3.49 | -21.21 | | $P_3$ | 1.09 | 1.00 | 152.94 | 3.15 | -21.47 | -60.73 | -33.26 | 3.19 | -27.99 | | $P_4$ | 1.06 | 0.80 | 150.61 | 2.44 | -25.73 | -70.30 | -39.85 | 2.80 | -36.79 | - Alternative strategies for compute-intensive apps.: - Idle-wait in multithreaded apps. - Idle-wait in hybrid CPU-GPU apps. - Idle-wait during communications in MPI apps. #### Power for different thread activities Idle-wait in multithreaded apps. (ILU preconditioner) Idle-wait in hybrid CPU-GPU apps. (multi-GPU Cholesky factorization via SuperMatrix runtime) Intel Xeon E5540 @ 2.83 GHz (4 cores) and NVIDIA Tesla S2050 (4 "Fermis") EA1: no polling when there is no work EA2: no polling when work is in GPU #### Performance and energy consumption Summary - A battle to be won in the core arena - More concurrency - Heterogeneous designs - A related battle to be won in the power arena - Do nothing, efficiently... (V. Pallipadi, A. Belay) - Don't forget the cost of uncore power ...but don't always believe the salesman!