Loop Id: 383 | Module: exec | Source: ideal_gas_kernel.f90:49-55 | Coverage: 5.39% |
---|
0x43e6c0 MOVUPD (%R15,%R10,8),%XMM6 [4] |
0x43e6c6 MOVAPD %XMM3,%XMM7 |
0x43e6ca DIVPD %XMM6,%XMM7 |
0x43e6ce MOVAPD %XMM6,%XMM8 |
0x43e6d3 MULPD %XMM4,%XMM8 |
0x43e6d8 MOVUPD (%R11,%R10,8),%XMM9 [3] |
0x43e6de MULPD %XMM8,%XMM9 |
0x43e6e3 MOVUPD %XMM9,(%R12,%R10,8) [1] |
0x43e6e9 MULPD %XMM7,%XMM7 |
0x43e6ed MULPD %XMM5,%XMM6 |
0x43e6f1 MULPD %XMM9,%XMM6 |
0x43e6f6 MULPD %XMM7,%XMM6 |
0x43e6fa SQRTPD %XMM6,%XMM6 |
0x43e6fe MOVUPD %XMM6,(%R9,%R10,8) [2] |
0x43e704 ADD $0x2,%R10 |
0x43e708 CMP %R13,%R10 |
0x43e70b JB 43e6c0 |
/beegfs/hackathon/users/eoseret/qaas_runs/170-861-0321/intel/CloverLeafFC/build/CloverLeafFC/CloverLeaf_ref/kernels/ideal_gas_kernel.f90: 49 - 55 |
-------------------------------------------------------------------------------- |
49: DO j=x_min,x_max |
50: v=1.0_8/density(j,k) |
51: pressure(j,k)=(1.4_8-1.0_8)*density(j,k)*energy(j,k) |
52: pressurebyenergy=(1.4_8-1.0_8)*density(j,k) |
53: pressurebyvolume=-density(j,k)*pressure(j,k) |
54: sound_speed_squared=v*v*(pressure(j,k)*pressurebyenergy-pressurebyvolume) |
55: soundspeed(j,k)=SQRT(sound_speed_squared) |
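The assembly listing contains six MULPD, one DIVPD and one SQRTPD but no add or subtract, even though source line 54 has a subtraction. One reading consistent with this is that the compiler folded `pressure*pressurebyenergy - pressurebyvolume` into a single product, `1.4*density*pressure`. The Python sketch below checks that identity numerically. The folded form, the constant 1.4 (presumably what XMM5 holds) and the sample inputs are assumptions for illustration, not values stated in the report.

```python
import math

def sound_speed_source(density, energy):
    """Scalar transcription of source lines 50-55."""
    v = 1.0 / density
    pressure = (1.4 - 1.0) * density * energy
    pressurebyenergy = (1.4 - 1.0) * density
    pressurebyvolume = -density * pressure
    sound_speed_squared = v * v * (pressure * pressurebyenergy - pressurebyvolume)
    return pressure, math.sqrt(sound_speed_squared)

def sound_speed_folded(density, energy):
    """Assumed folded form: multiplies only, mirroring the MULPD chain."""
    v = 1.0 / density                              # DIVPD
    pressure = (1.4 - 1.0) * density * energy      # two MULPD
    # p*(0.4*d) - (-d*p) = 1.4*d*p, so the subtraction disappears.
    sound_speed_squared = (1.4 * density) * pressure * (v * v)
    return pressure, math.sqrt(sound_speed_squared)

# Hypothetical sample cells (density, energy); not taken from the report.
samples = [(1.0, 2.5), (0.2, 1.0), (3.7, 0.05)]
for d, e in samples:
    p_src, c_src = sound_speed_source(d, e)
    p_fold, c_fold = sound_speed_folded(d, e)
    print(d, e, math.isclose(c_src, c_fold, rel_tol=1e-12))
```

The two forms agree only up to floating-point rounding, so a rewrite like this would normally require a relaxed floating-point model at compile time.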
Path / |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 2.00 |
CQA speedup if fully vectorized | 2.00 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 3.38 |
Bottlenecks | P8, P9 |
Function | ideal_gas_kernel_.DIR.OMP.PARALLEL.2 |
Source | ideal_gas_kernel.f90:49-55 |
Source loop unroll info | unrolled by 2 |
Source loop unroll confidence level | max |
Unroll/vectorization loop type | main |
Unroll factor | 2 |
CQA cycles | 13.50 |
CQA cycles if no scalar integer | 13.50 |
CQA cycles if FP arith vectorized | 6.75 |
CQA cycles if fully vectorized | 6.75 |
Front-end cycles | 2.67 |
DIV/SQRT cycles | 13.50 |
P0 cycles | 0.50 |
P1 cycles | 0.50 |
P2 cycles | 0.25 |
P3 cycles | 0.25 |
P4 cycles | 0.50 |
P5 cycles | 1.33 |
P6 cycles | 1.33 |
P7 cycles | 1.33 |
P8 cycles | 4.00 |
P9 cycles | 4.00 |
P10 cycles | 0.00 |
P11 cycles | 0.00 |
P12 cycles | 1.00 |
P13 cycles | 1.00 |
Inter-iter dependencies cycles | 1 |
FE+BE cycles (UFS) | NA |
Stall cycles (UFS) | NA |
Nb insns | 17.00 |
Nb uops | 16.00 |
Nb loads | 2.00 |
Nb stores | 2.00 |
Nb stack references | 0.00 |
FLOP/cycle | 1.19 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 12.00 |
Nb FLOP fma | 0.00 |
Nb FLOP div | 2.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 2.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 4.74 |
Bytes prefetched | 0.00 |
Bytes loaded | 32.00 |
Bytes stored | 32.00 |
Stride 0 | 0.00 |
Stride 1 | 4.00 |
Stride n | 0.00 |
Stride unknown | 0.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 100.00 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | 100.00 |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | NA |
Vectorization ratio fma | NA |
Vectorization ratio div_sqrt | 100.00 |
Vectorization ratio other | 100.00 |
Vector-efficiency ratio all | 25.00 |
Vector-efficiency ratio load | 25.00 |
Vector-efficiency ratio store | 25.00 |
Vector-efficiency ratio mul | 25.00 |
Vector-efficiency ratio add_sub | NA |
Vector-efficiency ratio fma | NA |
Vector-efficiency ratio div_sqrt | 25.00 |
Vector-efficiency ratio other | 25.00 |
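Several rows of the metric table can be reproduced from other rows, which is a useful sanity check when reading the report. The sketch below recomputes FLOP/cycle, Bytes/cycle and the "fully vectorized" speedup from the table values. The formulas are inferred from the numbers, not taken from CQA documentation, and the variable names are illustrative.

```python
# Values copied from the metric table above.
cqa_cycles = 13.50
cycles_if_fully_vectorized = 6.75
flop = {"mul": 12.0, "div": 2.0, "sqrt": 2.0}  # add-sub, fma, rcp, rsqrt are 0
bytes_loaded = 32.0
bytes_stored = 32.0

def derived_metrics():
    """Recompute the ratio rows from the raw counts (inferred formulas)."""
    total_flop = sum(flop.values())
    return {
        "FLOP/cycle": total_flop / cqa_cycles,
        "Bytes/cycle": (bytes_loaded + bytes_stored) / cqa_cycles,
        "speedup if fully vectorized": cqa_cycles / cycles_if_fully_vectorized,
    }

for name, value in derived_metrics().items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the three results match the reported 1.19, 4.74 and 2.00.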
Path / |
Function | ideal_gas_kernel_.DIR.OMP.PARALLEL.2 |
Source file and lines | ideal_gas_kernel.f90:49-55 |
Module | exec |
nb instructions | 17 |
nb uops | 16 |
loop length | 77 |
used x86 registers | 6 |
used mmx registers | 0 |
used xmm registers | 7 |
used ymm registers | 0 |
used zmm registers | 0 |
nb stack references | 0 |
micro-operation queue | 2.67 cycles |
front end | 2.67 cycles |
Metric | ALU0/BRU0 | ALU1 | ALU2 | ALU3 | BRU1 | AGU0 | AGU1 | AGU2 | FP0 | FP1 | FP2 | FP3 | FP4 | FP5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
uops | 0.50 | 0.50 | 0.25 | 0.25 | 0.50 | 1.33 | 1.33 | 1.33 | 4.00 | 4.00 | 0.00 | 0.00 | 1.00 | 1.00 |
cycles | 0.50 | 0.50 | 0.25 | 0.25 | 0.50 | 1.33 | 1.33 | 1.33 | 4.00 | 4.00 | 0.00 | 0.00 | 1.00 | 1.00 |
Cycles executing div or sqrt instructions | 13.50 |
Longest recurrence chain latency (RecMII) | 1.00 |
Front-end | 2.67 |
Dispatch | 4.00 |
DIV/SQRT | 13.50 |
Data deps. | 1.00 |
Overall L1 | 13.50 |
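In this summary the "Overall L1" figure equals the largest of the four individual bounds, and the reported "speedup if next bottleneck killed" (3.38) equals the largest bound divided by the second largest. The sketch below reproduces both numbers under that reading. It is an interpretation of the figures, not a documented CQA formula.

```python
# Per-resource cycle bounds copied from the summary above.
bounds = {
    "Front-end": 2.67,
    "Dispatch": 4.00,
    "DIV/SQRT": 13.50,
    "Data deps.": 1.00,
}

def rank_bounds(cycle_bounds):
    """Return (name, cycles) pairs sorted from most to least limiting."""
    return sorted(cycle_bounds.items(), key=lambda item: item[1], reverse=True)

ranked = rank_bounds(bounds)
bottleneck, overall_cycles = ranked[0]
next_bottleneck, next_cycles = ranked[1]
speedup_if_killed = overall_cycles / next_cycles

print(f"bottleneck: {bottleneck} ({overall_cycles:.2f} cycles)")
print(f"next: {next_bottleneck} ({next_cycles:.2f} cycles)")
print(f"speedup if bottleneck removed: {speedup_if_killed:.3f}")
```

Under this reading, removing the divide and square-root cost would leave the loop limited by dispatch on the FP pipes.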
Vectorization ratio | Value |
---|---|
all | 100% |
load | 100% |
store | 100% |
mul | 100% |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | 100% |
other | 100% |
Vector-efficiency ratio | Value |
---|---|
all | 25% |
load | 25% |
store | 25% |
mul | 25% |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | NA (no fma vectorizable/vectorized instructions) |
div/sqrt | 25% |
other | 25% |
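A vectorization ratio of 100% together with a 25% efficiency ratio is consistent with every instruction being packed but using only 128-bit XMM registers. The sketch below shows the arithmetic, assuming the efficiency is measured against a 512-bit reference width. The report does not state that width: it is inferred from 128/512 = 25%, and the constant names are hypothetical.

```python
XMM_BITS = 128                  # register width used by the loop (XMM only)
ASSUMED_MAX_VECTOR_BITS = 512   # assumption: reference width behind the 25% figure
DOUBLE_BITS = 64                # REAL(8) element size

def vector_efficiency(used_bits, max_bits=ASSUMED_MAX_VECTOR_BITS):
    """Fraction of the assumed widest vector actually used, as a percentage."""
    return 100.0 * used_bits / max_bits

def elements_per_vector(vector_bits, element_bits=DOUBLE_BITS):
    """Number of elements processed by one packed instruction."""
    return vector_bits // element_bits

print(f"efficiency: {vector_efficiency(XMM_BITS):.0f}%")
print(f"doubles per iteration: {elements_per_vector(XMM_BITS)}")
```

Two doubles per vector agrees with the `ADD $0x2,%R10` index increment in the assembly listing.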
Instruction | Nb FU | ALU0/BRU0 | ALU1 | ALU2 | ALU3 | BRU1 | AGU0 | AGU1 | AGU2 | FP0 | FP1 | FP2 | FP3 | FP4 | FP5 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MOVUPD (%R15,%R10,8),%XMM6 | 1 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MOVAPD %XMM3,%XMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
DIVPD %XMM6,%XMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 13 | 5 |
MOVAPD %XMM6,%XMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.17 |
MULPD %XMM4,%XMM8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MOVUPD (%R11,%R10,8),%XMM9 | 1 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MULPD %XMM8,%XMM9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MOVUPD %XMM9,(%R12,%R10,8) | 1 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 4 | 1 |
MULPD %XMM7,%XMM7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MULPD %XMM5,%XMM6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MULPD %XMM9,%XMM6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 |
MULPD %XMM7,%XMM6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 3 | 0.50 |
SQRTPD %XMM6,%XMM6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 21 | 8.50 |
MOVUPD %XMM6,(%R9,%R10,8) | 1 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0.33 | 0.33 | 0 | 0 | 0 | 0 | 0.50 | 0.50 | 4 | 1 |
ADD $0x2,%R10 | 1 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
CMP %R13,%R10 | 1 | 0.25 | 0.25 | 0.25 | 0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.25 |
JB 43e6c0 <ideal_gas_kernel_module_mp_ideal_gas_kernel_.DIR.OMP.PARALLEL.2+0x290> | 1 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.50-1 |
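The per-unit totals reported for this loop can be rebuilt from the per-instruction rows. The sketch below sums the FP0 share of each floating-point instruction, spreads the memory operations over the three AGUs, and adds the reciprocal throughputs of DIVPD and SQRTPD. The summation rules are inferred from the table rather than from CQA documentation, and the dictionary names are illustrative.

```python
from fractions import Fraction

# Instruction counts read from the table above.
fp_instruction_counts = {"DIVPD": 1, "MULPD": 6, "SQRTPD": 1}
fp0_share_per_instruction = Fraction(1, 2)   # each FP op is split 0.50/0.50 over FP0/FP1
memory_instruction_count = 4                 # 2 loads + 2 stores (MOVUPD)
agu_count = 3                                # AGU0, AGU1, AGU2
recip_throughput = {"DIVPD": Fraction(5), "SQRTPD": Fraction(17, 2)}

def fp0_cycles():
    """Cycles on FP0 (FP1 is symmetric)."""
    return sum(fp_instruction_counts.values()) * fp0_share_per_instruction

def agu_cycles():
    """Cycles on each AGU when memory ops are spread evenly."""
    return Fraction(memory_instruction_count, agu_count)

def div_sqrt_cycles():
    """Cycles spent in the divide/square-root unit."""
    return sum(recip_throughput.values())

print(f"FP0: {float(fp0_cycles()):.2f}")
print(f"AGU: {float(agu_cycles()):.2f}")
print(f"DIV/SQRT: {float(div_sqrt_cycles()):.2f}")
```

These totals match the 4.00, 1.33 and 13.50 cycles in the report, which suggests the divider, not FP pipe dispatch, sets the loop's cycle count.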