Loop Id: 939 | Module: exec | Source: forall.hpp:59-59 [...] | Coverage: 6.85% |
---|
0x458860 LEA (%RSI,%RDI,1),%RCX |
0x458864 ADD 0x70(%RSP),%RCX [9] |
0x458869 VMOVUPD (%RDX,%RDI,8),%YMM24 [6] |
0x458870 VFMADD213PD (%R15,%RDI,8),%YMM28,%YMM24 [5] |
0x458877 MOV %RDX,%R12 |
0x45887a MOV 0x78(%RSP),%RDX [9] |
0x45887f LEA (%RCX,%RDX,1),%R9 |
0x458883 VFMADD231PD (%R14,%R9,8),%YMM29,%YMM24 [8] |
0x45888a MOV 0x78(%RSP),%R9 [9] |
0x45888f LEA (%RCX,%R11,1),%RDX |
0x458893 VFMADD231PD (%R14,%RDX,8),%YMM30,%YMM24 [7] |
0x45889a MOV 0x68(%RSP),%RDX [9] |
0x45889f ADD %RCX,%RDX |
0x4588a2 VFMADD231PD (%R14,%RDX,8),%YMM31,%YMM24 [4] |
0x4588a9 LEA (%RCX,%R13,1),%RDX |
0x4588ad VFMADD231PD (%R14,%RDX,8),%YMM20,%YMM24 [2] |
0x4588b4 LEA (%RCX,%R8,1),%RDX |
0x4588b8 VFMADD231PD (%R14,%RDX,8),%YMM21,%YMM24 [10] |
0x4588bf LEA (%RCX,%RBX,1),%RDX |
0x4588c3 VFMADD231PD (%R14,%RDX,8),%YMM22,%YMM24 [1] |
0x4588ca MOV %R12,%RDX |
0x4588cd ADD %RAX,%RCX |
0x4588d0 VFMADD231PD (%R14,%RCX,8),%YMM23,%YMM24 [3] |
0x4588d7 VMOVUPD %YMM24,(%R15,%RDI,8) [5] |
0x4588de ADD $0x4,%RDI |
0x4588e2 CMP %R10,%RDI |
0x4588e5 JLE 458860 |
/home/kcamus/qaas_runs/169-391-8990/intel/Kripke/build/Kripke/tpl/raja/include/RAJA/policy/loop/forall.hpp: 59 - 59 |
-------------------------------------------------------------------------------- |
59: for (decltype(distance_it) i = 0; i < distance_it; ++i) { |
/home/kcamus/qaas_runs/169-391-8990/intel/Kripke/build/Kripke/src/Kripke/Kernel/LPlusTimes.cpp: 57 - 57 |
-------------------------------------------------------------------------------- |
57: rhs(d,g,z) += ell_plus(d, nm) * phi_out(nm, g, z); |
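The assembly listing shows one load of the destination slot, eight fused multiply-adds that each use a different coefficient register (YMM20-YMM23, YMM28-YMM31), and one store, with the index register advancing by 4 per iteration. That is consistent with the moment loop being fully unrolled and the innermost zone loop being vectorized four doubles wide. A minimal scalar sketch of that computation for one fixed (d, g) pair follows; the function name, the flat-array layout, and the parameter names are assumptions made for illustration and are not Kripke's actual data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat-array model of
//   rhs(d,g,z) += ell_plus(d,nm) * phi_out(nm,g,z)
// for one fixed direction d and group g. Layout is assumed: zone index fastest.
void lplus_times_row(std::vector<double>& rhs,            // rhs(d, g, 0..num_zones-1)
                     const std::vector<double>& ell_row,  // ell_plus(d, 0..num_moments-1)
                     const std::vector<double>& phi,      // phi_out(nm, g, z), z contiguous
                     std::size_t num_zones) {
    const std::size_t num_moments = ell_row.size();
    assert(rhs.size() == num_zones);
    assert(phi.size() == num_moments * num_zones);

    for (std::size_t z = 0; z < num_zones; ++z) {        // vectorized by the compiler
        double acc = rhs[z];                              // one load of the destination
        for (std::size_t nm = 0; nm < num_moments; ++nm) {
            acc += ell_row[nm] * phi[nm * num_zones + z]; // one FMA per moment
        }
        rhs[z] = acc;                                     // one store per zone
    }
}
```

In this sketch the coefficients `ell_row[nm]` do not depend on `z`, which is why a compiler can keep them in registers across the whole zone loop, as the listing appears to do.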
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
100.00 | __kmp_invoke_microtask | | libiomp5.so |
 | __kmp_fork_call | | libiomp5.so |
 | __kmpc_fork_call | | libiomp5.so |
 | void LPlusTimesSdom::operator([...] | internal.hpp:345 | exec |
 | Kripke::Kernel::LPlusTimes(Kri[...] | ArchLayout.h:179 | exec |
 | Kripke::SteadyStateSolver(Krip[...] | SteadyStateSolver.cpp:71 | exec |
 | main | kripke.cpp:482 | exec |
 | __libc_init_first | | libc.so.6 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.70 |
CQA speedup if FP arith vectorized | 1.55 |
CQA speedup if fully vectorized | 3.40 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.31 |
Bottlenecks | micro-operation queue |
Function | void LPlusTimesSdom::operator() |
Source | forall.hpp:59-59,LPlusTimes.cpp:57-57 |
Source loop unroll info | unrolled by 4 |
Source loop unroll confidence level | high |
Unroll/vectorization loop type | main |
Unroll factor | 4 |
CQA cycles | 8.50 |
CQA cycles if no scalar integer | 5.00 |
CQA cycles if FP arith vectorized | 5.50 |
CQA cycles if fully vectorized | 2.50 |
Front-end cycles | 8.50 |
P0 cycles | 5.00 |
P1 cycles | 5.00 |
P2 cycles | 6.50 |
P3 cycles | 6.50 |
P4 cycles | 1.00 |
P5 cycles | 4.50 |
P6 cycles | 4.50 |
P7 cycles | 1.00 |
DIV/SQRT cycles | 0.00 |
Inter-iter dependencies cycles | 1 |
FE+BE cycles (UFS) | 8.95 |
Stall cycles (UFS) | 0.00 |
Nb insns | 27.00 |
Nb uops | 26.00 |
Nb loads | 13.00 |
Nb stores | 1.00 |
Nb stack references | 3.00 |
FLOP/cycle | 7.53 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 32.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 41.41 |
Bytes prefetched | 0.00 |
Bytes loaded | 320.00 |
Bytes stored | 32.00 |
Stride 0 | 1.00 |
Stride 1 | 3.00 |
Stride n | 1.00 |
Stride unknown | 1.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 90.91 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | 100.00 |
Vectorization ratio mul | NA |
Vectorization ratio add_sub | NA |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 0.00 |
Vector-efficiency ratio all | 46.59 |
Vector-efficiency ratio load | 50.00 |
Vector-efficiency ratio store | 50.00 |
Vector-efficiency ratio mul | NA |
Vector-efficiency ratio add_sub | NA |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 12.50 |
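Several rows of this table can be recomputed from others, which is a quick way to check how the numbers relate. The sketch below assumes that each projected speedup is the baseline CQA cycle count divided by the cycle count of the what-if scenario, that each FMA lane counts as two floating-point operations, and that Bytes/cycle is loaded plus stored bytes over CQA cycles. These relations are inferred from the values in the table, not taken from the tool's documentation, and the function names are illustrative.

```cpp
#include <cassert>

// Projected speedup: baseline cycles divided by the cycles of a what-if scenario.
double speedup(double baseline_cycles, double scenario_cycles) {
    assert(scenario_cycles > 0.0);
    return baseline_cycles / scenario_cycles;
}

// Assumption: "Nb FLOP fma" counts FMA lanes, each worth a multiply and an add.
double flop_per_cycle(double fma_lanes, double cycles) {
    assert(cycles > 0.0);
    return 2.0 * fma_lanes / cycles;
}

// Assumption: memory throughput is all bytes moved divided by the cycle estimate.
double bytes_per_cycle(double bytes_loaded, double bytes_stored, double cycles) {
    assert(cycles > 0.0);
    return (bytes_loaded + bytes_stored) / cycles;
}
```

With the table's values, `speedup(8.50, 5.00)`, `speedup(8.50, 5.50)` and `speedup(8.50, 2.50)` round to the reported 1.70, 1.55 and 3.40; `flop_per_cycle(32, 8.50)` rounds to 7.53 and `bytes_per_cycle(320, 32, 8.50)` to 41.41.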
nb instructions | 27 |
nb uops | 26 |
loop length | 139 |
used x86 registers | 15 |
used mmx registers | 0 |
used xmm registers | 0 |
used ymm registers | 9 |
used zmm registers | 0 |
nb stack references | 3 |
micro-operation queue | 8.50 cycles |
front end | 8.50 cycles |
 | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
---|---|---|---|---|---|---|---|---|
uops | 5.00 | 5.00 | 6.50 | 6.50 | 1.00 | 4.50 | 4.50 | 1.00 |
cycles | 5.00 | 5.00 | 6.50 | 6.50 | 1.00 | 4.50 | 4.50 | 1.00 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 1.00 |
FE+BE cycles | 8.95 |
Stall cycles | 0.00 |
Front-end | 8.50 |
Dispatch | 6.50 |
Data deps. | 1.00 |
Overall L1 | 8.50 |
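The cycle summary above reads as a chain of lower bounds: the dispatch figure equals the busiest execution port, and the overall L1 estimate equals the largest of the front-end, dispatch and data-dependency figures. The sketch below encodes that reading; it is an interpretation of the reported numbers under that assumption, not a description of the analyzer's internal model, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Dispatch bound: the most heavily loaded port limits the back end.
double dispatch_cycles(const std::array<double, 8>& port_cycles) {
    return *std::max_element(port_cycles.begin(), port_cycles.end());
}

// Overall bound: the slowest of the three stages sets the iteration time.
double overall_l1_cycles(double front_end, double dispatch, double data_deps) {
    assert(front_end >= 0.0 && dispatch >= 0.0 && data_deps >= 0.0);
    return std::max({front_end, dispatch, data_deps});
}
```

Fed with the port cycles from the table (5.00, 5.00, 6.50, 6.50, 1.00, 4.50, 4.50, 1.00), `dispatch_cycles` gives 6.50, and `overall_l1_cycles(8.50, 6.50, 1.00)` gives 8.50, so under this reading the front end rather than any execution port limits the loop.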
Vectorization ratios | INT | FP | INT+FP |
---|---|---|---|
all | 0% | 100% | 90% |
load | NA | 100% | 100% |
store | NA | 100% | 100% |
mul | NA | NA | NA |
add-sub | NA | NA | NA |
fma | NA | 100% | 100% |
div/sqrt | | NA | NA |
other | 0% | NA | 0% |
Vector-efficiency ratios | INT | FP | INT+FP |
---|---|---|---|
all | 12% | 50% | 46% |
load | NA | 50% | 50% |
store | NA | 50% | 50% |
mul | NA | NA | NA |
add-sub | NA | NA | NA |
fma | NA | 50% | 50% |
div/sqrt | | NA | NA |
other | 12% | NA | 12% |
NA: no vectorizable/vectorized instructions of that kind. The INT group has no div/sqrt row.
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
LEA (%RSI,%RDI,1),%RCX | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 1 | 0.50 |
ADD 0x70(%RSP),%RCX | 1 | 0.25 | 0.25 | 0.50 | 0.50 | 0 | 0.25 | 0.25 | 0 | 1 | 0.50 |
VMOVUPD (%RDX,%RDI,8),%YMM24 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD (%R15,%RDI,8),%YMM28,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
MOV %RDX,%R12 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
MOV 0x78(%RSP),%RDX | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4-5 | 0.50 |
LEA (%RCX,%RDX,1),%R9 | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 1 | 0.50 |
VFMADD231PD (%R14,%R9,8),%YMM29,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
MOV 0x78(%RSP),%R9 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4-5 | 0.50 |
LEA (%RCX,%R11,1),%RDX | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 1 | 0.50 |
VFMADD231PD (%R14,%RDX,8),%YMM30,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
MOV 0x68(%RSP),%RDX | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4-5 | 0.50 |
ADD %RCX,%RDX | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
VFMADD231PD (%R14,%RDX,8),%YMM31,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
LEA (%RCX,%R13,1),%RDX | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 1 | 0.50 |
VFMADD231PD (%R14,%RDX,8),%YMM20,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
LEA (%RCX,%R8,1),%RDX | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 1 | 0.50 |
VFMADD231PD (%R14,%RDX,8),%YMM21,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
LEA (%RCX,%RBX,1),%RDX | 1 | 0 | 0.50 | 0 | 0 | 0 | 0.50 | 0 | 0 | 1 | 0.50 |
VFMADD231PD (%R14,%RDX,8),%YMM22,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
MOV %R12,%RDX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
ADD %RAX,%RCX | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
VFMADD231PD (%R14,%RCX,8),%YMM23,%YMM24 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM24,(%R15,%RDI,8) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
ADD $0x4,%RDI | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
CMP %R10,%RDI | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
JLE 458860 <_ZNK14LPlusTimesSdomclIN6Kripke11ArchLayoutTINS1_12ArchT_OpenMPENS1_11LayoutT_DGZEEEEEvT_NS1_6SdomIdERKNS1_4Core3SetESB_SB_SB_RNS8_5FieldIdJNS1_6MomentENS1_5GroupENS1_4ZoneEEEERNSC_IdJNS1_9DirectionESE_SF_EEERNSC_IdJSI_SD_EEE.extracted+0x760> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50-1 |
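The load and store counts reported for this loop can be cross-checked against the instruction list by tallying its memory operands. The sketch below assumes a YMM memory operand moves 32 bytes and a 64-bit scalar operand moves 8 bytes; the struct and function names are illustrative only.

```cpp
#include <cassert>
#include <vector>

struct MemAccess {
    bool is_store;
    int bytes;  // 32 for a YMM operand, 8 for a 64-bit scalar operand
};

struct Traffic {
    int loads = 0;
    int stores = 0;
    int bytes_loaded = 0;
    int bytes_stored = 0;
};

// Sum the accesses by direction.
Traffic tally(const std::vector<MemAccess>& accesses) {
    Traffic t;
    for (const MemAccess& a : accesses) {
        if (a.is_store) { ++t.stores; t.bytes_stored += a.bytes; }
        else            { ++t.loads;  t.bytes_loaded += a.bytes; }
    }
    return t;
}

// Memory operands of the loop body, read off the instruction table.
std::vector<MemAccess> loop_939_accesses() {
    std::vector<MemAccess> a;
    a.push_back({false, 32});                              // VMOVUPD load
    for (int i = 0; i < 8; ++i) a.push_back({false, 32});  // memory operand of each FMA
    for (int i = 0; i < 4; ++i) a.push_back({false, 8});   // ADD 0x70(%RSP), MOV 0x78(%RSP) x2, MOV 0x68(%RSP)
    a.push_back({true, 32});                               // VMOVUPD store
    return a;
}
```

`tally(loop_939_accesses())` yields 13 loads, 320 bytes loaded, 1 store and 32 bytes stored, matching the reported counts. Four of the thirteen loads are reloads of three stack slots (0x68, 0x70 and 0x78 off %RSP), which is consistent with the three stack references the report lists.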
Metric | run_0 |
---|---|
Coverage (% app. time) | 6.85 |
Time (s) | 2.84 |
Instance Count | 184320 |
Iteration Count - min | 1024 |
Iteration Count - avg | 1024 |
Iteration Count - max | 1024 |
Cycles per Iteration - min | 28.15 |
Cycles per Iteration - avg | 33.34 |
Cycles per Iteration - max | 765.3 |
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 55.65 |
Instance Count | 184320 |
ORIG CPI:min | 49.96 |
ORIG CPI:med | 54.49 |
ORIG CPI:max | 69.72 |
DL1 CPI:min | 8.81 |
DL1 CPI:med | 8.91 |
DL1 CPI:max | 29.59 |
ORIG (min) / DL1 (min) | 5.67 |
ORIG (med) / DL1 (med) | 6.11 |
ORIG (max) / DL1 (max) | 2.36 |
Nb Iteration:min | 1024 |
Nb Iteration:med | 1024.00 |
Nb Iteration:max | 1024 |
ORIG: min (cycles) | 51158 |
ORIG: med (cycles) | 55796.00 |
ORIG: max (cycles) | 71394 |
DL1:min (cycles) | 9018 |
DL1:med (cycles) | 9126.00 |
DL1:max (cycles) | 30300 |
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 44.21 |
Instance Count | 184320 |
ORIG CPI:min | 30.93 |
ORIG CPI:med | 31.83 |
ORIG CPI:max | 41.76 |
DL1 CPI:min | 8.81 |
DL1 CPI:med | 8.91 |
DL1 CPI:max | 8.94 |
ORIG (min) / DL1 (min) | 3.51 |
ORIG (med) / DL1 (med) | 3.57 |
ORIG (max) / DL1 (max) | 4.67 |
Nb Iteration:min | 1024 |
Nb Iteration:med | 1024.00 |
Nb Iteration:max | 1024 |
ORIG: min (cycles) | 31668 |
ORIG: med (cycles) | 32590.00 |
ORIG: max (cycles) | 42760 |
DL1:min (cycles) | 9018 |
DL1:med (cycles) | 9122.00 |
DL1:max (cycles) | 9152 |
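The CPI and ratio rows in the two bucket tables above follow from the raw cycle counts: CPI is the cycle count of one loop instance divided by its 1024 iterations, and each ORIG / DL1 ratio divides the matching cycle counts. DL1 appears to be a variant of the loop whose memory operands are redirected so that every access hits the L1 data cache; that reading is an assumption based on the variant's name and its rewritten operands. The helper names below are illustrative.

```cpp
#include <cassert>

// Cycles per iteration for one loop instance.
double cycles_per_iteration(double cycles, double iterations) {
    assert(iterations > 0.0);
    return cycles / iterations;
}

// How many times slower the original loop runs than the L1-resident variant.
double orig_over_dl1(double orig_cycles, double dl1_cycles) {
    assert(dl1_cycles > 0.0);
    return orig_cycles / dl1_cycles;
}
```

For the first bucket, `cycles_per_iteration(55796, 1024)` rounds to 54.49 and `orig_over_dl1(55796, 9126)` to 6.11; for the second, `orig_over_dl1(32590, 9122)` rounds to 3.57. The DL1 median of about 8.91 cycles per iteration sits close to the static estimate of 8.95, which suggests the gap between ORIG and DL1 comes from data access rather than from the instruction stream.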
Metric (average per iteration except for Time and Iteration Count) | ORIG Min (Thread) | ORIG Med (Thread) | ORIG Avg (Thread) | ORIG Max (Thread) | ORIG Min (Instances) | ORIG Med (Instances) | ORIG Max (Instances) | DL1 Min (Thread) | DL1 Med (Thread) | DL1 Avg (Thread) | DL1 Max (Thread) | DL1 Min (Instances) | DL1 Med (Instances) | DL1 Max (Instances) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (cycles) | 55796.00 | 55796.00 | 55796.00 | 55796.00 | 51158.00 | 55796.00 | 71394.00 | 9126.00 | 9126.00 | 9126.00 | 9126.00 | 9018.00 | 9126.00 | 30300.00 |
CPI MED | 54.49 | 54.49 | 54.49 | 54.49 | 49.96 | 54.49 | 69.72 | 8.91 | 8.91 | 8.91 | 8.91 | 8.81 | 8.91 | 29.59 |
Iteration Count | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 |
Metric | ORIG | DL1 |
---|---|---|
CPI MIN | 49.96 | 8.81 |
CPI AVG | 54.72 | 9.60 |
CPI MAX | 69.72 | 29.59 |
Metric (average per iteration except for Time and Iteration Count) | ORIG Min (Thread) | ORIG Med (Thread) | ORIG Avg (Thread) | ORIG Max (Thread) | ORIG Min (Instances) | ORIG Med (Instances) | ORIG Max (Instances) | DL1 Min (Thread) | DL1 Med (Thread) | DL1 Avg (Thread) | DL1 Max (Thread) | DL1 Min (Instances) | DL1 Med (Instances) | DL1 Max (Instances) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (cycles) | 32590.00 | 32590.00 | 32590.00 | 32590.00 | 31668.00 | 32590.00 | 42760.00 | 9122.00 | 9122.00 | 9122.00 | 9122.00 | 9018.00 | 9122.00 | 9152.00 |
CPI MED | 31.83 | 31.83 | 31.83 | 31.83 | 30.93 | 31.83 | 41.76 | 8.91 | 8.91 | 8.91 | 8.91 | 8.81 | 8.91 | 8.94 |
Iteration Count | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 | 1024.00 |
Metric | ORIG | DL1 |
---|---|---|
CPI MIN | 30.93 | 8.81 |
CPI AVG | 32.17 | 8.88 |
CPI MAX | 41.76 | 8.94 |
ORIG | DL1 | Original Code |
---|---|---|
0x532a78 ADDQ $0x1,-0x1c80(%RIP) 0x532a80 LEA (%RSI,%RDI,1),%RCX | 0x532f4d LEA (%RSI,%RDI,1),%RCX | 0x458860 LEA (%RSI,%RDI,1),%RCX |
0x532a84 ADD 0x70(%RSP),%RCX | 0x532f51 ADD -0x2c58(%RIP),%RCX | 0x458864 ADD 0x70(%RSP),%RCX |
0x532a89 VMOVUPD (%RDX,%RDI,8),%YMM24 | 0x532f58 VMOVUPD -0x2b62(%RIP),%YMM24 | 0x458869 VMOVUPD (%RDX,%RDI,8),%YMM24 |
0x532a90 VFMADD213PD (%R15,%RDI,8),%YMM28,%YMM24 | 0x532f62 VFMADD213PD -0x2b6c(%RIP),%YMM28,%YMM24 0x532f6c NOP | 0x458870 VFMADD213PD (%R15,%RDI,8),%YMM28,%YMM24 |
0x532a97 MOV %RDX,%R12 | 0x532f6d MOV %RDX,%R12 | 0x458877 MOV %RDX,%R12 |
0x532a9a MOV 0x78(%RSP),%RDX | 0x532f70 MOV -0x2c37(%RIP),%RDX | 0x45887a MOV 0x78(%RSP),%RDX |
0x532a9f LEA (%RCX,%RDX,1),%R9 | 0x532f77 LEA (%RCX,%RDX,1),%R9 | 0x45887f LEA (%RCX,%RDX,1),%R9 |
0x532aa3 VFMADD231PD (%R14,%R9,8),%YMM29,%YMM24 | 0x532f7b VFMADD231PD -0x2b85(%RIP),%YMM29,%YMM24 0x532f85 NOP | 0x458883 VFMADD231PD (%R14,%R9,8),%YMM29,%YMM24 |
0x532aaa MOV 0x78(%RSP),%R9 | 0x532f86 MOV -0x2c4d(%RIP),%R9 | 0x45888a MOV 0x78(%RSP),%R9 |
0x532aaf LEA (%RCX,%R11,1),%RDX | 0x532f8d LEA (%RCX,%R11,1),%RDX | 0x45888f LEA (%RCX,%R11,1),%RDX |
0x532ab3 VFMADD231PD (%R14,%RDX,8),%YMM30,%YMM24 | 0x532f91 VFMADD231PD -0x2b9b(%RIP),%YMM30,%YMM24 0x532f9b NOP | 0x458893 VFMADD231PD (%R14,%RDX,8),%YMM30,%YMM24 |
0x532aba MOV 0x68(%RSP),%RDX | 0x532f9c MOV -0x2c23(%RIP),%RDX | 0x45889a MOV 0x68(%RSP),%RDX |
0x532abf ADD %RCX,%RDX | 0x532fa3 ADD %RCX,%RDX | 0x45889f ADD %RCX,%RDX |
0x532ac2 VFMADD231PD (%R14,%RDX,8),%YMM31,%YMM24 | 0x532fa6 VFMADD231PD -0x2bb0(%RIP),%YMM31,%YMM24 0x532fb0 NOP | 0x4588a2 VFMADD231PD (%R14,%RDX,8),%YMM31,%YMM24 |
0x532ac9 LEA (%RCX,%R13,1),%RDX | 0x532fb1 LEA (%RCX,%R13,1),%RDX | 0x4588a9 LEA (%RCX,%R13,1),%RDX |
0x532acd VFMADD231PD (%R14,%RDX,8),%YMM20,%YMM24 | 0x532fb5 VFMADD231PD -0x2bbf(%RIP),%YMM20,%YMM24 0x532fbf NOP | 0x4588ad VFMADD231PD (%R14,%RDX,8),%YMM20,%YMM24 |
0x532ad4 LEA (%RCX,%R8,1),%RDX | 0x532fc0 LEA (%RCX,%R8,1),%RDX | 0x4588b4 LEA (%RCX,%R8,1),%RDX |
0x532ad8 VFMADD231PD (%R14,%RDX,8),%YMM21,%YMM24 | 0x532fc4 VFMADD231PD -0x2bce(%RIP),%YMM21,%YMM24 0x532fce NOP | 0x4588b8 VFMADD231PD (%R14,%RDX,8),%YMM21,%YMM24 |
0x532adf LEA (%RCX,%RBX,1),%RDX | 0x532fcf LEA (%RCX,%RBX,1),%RDX | 0x4588bf LEA (%RCX,%RBX,1),%RDX |
0x532ae3 VFMADD231PD (%R14,%RDX,8),%YMM22,%YMM24 | 0x532fd3 VFMADD231PD -0x2bdd(%RIP),%YMM22,%YMM24 0x532fdd NOP | 0x4588c3 VFMADD231PD (%R14,%RDX,8),%YMM22,%YMM24 |
0x532aea MOV %R12,%RDX | 0x532fde MOV %R12,%RDX | 0x4588ca MOV %R12,%RDX |
0x532aed ADD %RAX,%RCX | 0x532fe1 ADD %RAX,%RCX | 0x4588cd ADD %RAX,%RCX |
0x532af0 VFMADD231PD (%R14,%RCX,8),%YMM23,%YMM24 | 0x532fe4 VFMADD231PD -0x2bee(%RIP),%YMM23,%YMM24 0x532fee NOP | 0x4588d0 VFMADD231PD (%R14,%RCX,8),%YMM23,%YMM24 |
0x532af7 VMOVUPD %YMM24,(%R15,%RDI,8) | 0x532fef VMOVUPD %YMM24,-0x2a79(%RIP) 0x532ff9 NOP | 0x4588d7 VMOVUPD %YMM24,(%R15,%RDI,8) |
0x532afe ADD $0x4,%RDI | 0x532ffa ADD $0x4,%RDI | 0x4588de ADD $0x4,%RDI |
0x532b02 CMP %R10,%RDI | 0x532ffe CMP %R10,%RDI | 0x4588e2 CMP %R10,%RDI |
0x532b05 JLE 532a78 <_ZNK14LPlusTimesSdomclIN6Kripke11ArchLayoutTINS1_12ArchT_OpenMPENS1_11LayoutT_DGZEEEEEvT_NS1_6SdomIdERKNS1_4Core3SetESB_SB_SB_RNS8_5FieldIdJNS1_6MomentENS1_5GroupENS1_4ZoneEEEERNSC_IdJNS1_9DirectionESE_SF_EEERNSC_IdJSI_SD_EEE.extracted+0xda978> | 0x533001 JLE 532f4d <_ZNK14LPlusTimesSdomclIN6Kripke11ArchLayoutTINS1_12ArchT_OpenMPENS1_11LayoutT_DGZEEEEEvT_NS1_6SdomIdERKNS1_4Core3SetESB_SB_SB_RNS8_5FieldIdJNS1_6MomentENS1_5GroupENS1_4ZoneEEEERNSC_IdJNS1_9DirectionESE_SF_EEERNSC_IdJSI_SD_EEE.extracted+0xdae4d> | 0x4588e5 JLE 458860 <_ZNK14LPlusTimesSdomclIN6Kripke11ArchLayoutTINS1_12ArchT_OpenMPENS1_11LayoutT_DGZEEEEEvT_NS1_6SdomIdERKNS1_4Core3SetESB_SB_SB_RNS8_5FieldIdJNS1_6MomentENS1_5GroupENS1_4ZoneEEEERNSC_IdJNS1_9DirectionESE_SF_EEERNSC_IdJSI_SD_EEE.extracted+0x760> |
Metric | ORIG | DL1 | Original |
---|---|---|---|
FP operations per cycle L1 | 7.11 | 7.31 | 7.53 |
cycles L1 CQA | 9.00 | 8.75 | 8.50 |
cycles UFS | 9.44 | 9.16 | 8.95 |
bytes loaded | 328.00 | 320.00 | 320.00 |
bytes stored | 40.00 | 32.00 | 32.00 |
nb loads | 14.00 | 13.00 | 13.00 |
nb stores | 2.00 | 1.00 | 1.00 |
cycles dispatch | 7.00 | 6.50 | 6.50 |
cycles front end | 9.00 | 8.75 | 8.50 |
cycles P0 | 5.00 | 5.00 | 5.00 |
cycles P1 | 5.00 | 5.00 | 5.00 |
cycles P2 | 7.00 | 6.50 | 6.50 |
cycles P3 | 7.00 | 6.50 | 6.50 |
cycles P4 | 2.00 | 1.00 | 1.00 |
cycles P5 | 5.00 | 4.50 | 4.50 |
cycles P6 | 5.00 | 4.50 | 4.50 |
cycles P7 | 2.00 | 1.00 | 1.00 |
stall cycles | 0.00 | 0.00 | 0.00 |
LB full | 0.00 | 0.00 | 0.00 |
LM full | 0.00 | 0.00 | 0.00 |
PRF full | 0.00 | 0.00 | 0.00 |
PRF_FLOAT full | 0.00 | 0.00 | 0.00 |
PRF_INT full | 0.00 | 0.00 | 0.00 |
ROB full | 0.00 | 0.00 | 0.00 |
RS full | 0.00 | 0.00 | 0.00 |
SB full | 0.00 | 0.00 | 0.00 |
nb uops | 28.00 | 35.00 | 26.00 |
uops P0 | 5.00 | 5.00 | 5.00 |
uops P1 | 5.00 | 5.00 | 5.00 |
uops P2 | 7.00 | 6.50 | 6.50 |
uops P3 | 7.00 | 6.50 | 6.50 |
uops P4 | 2.00 | 1.00 | 1.00 |
uops P5 | 5.00 | 4.50 | 4.50 |
uops P6 | 5.00 | 4.50 | 4.50 |
uops P7 | 2.00 | 1.00 | 1.00 |
ID | 939 | 941 | 939 |