Loop Id: 431 | Module: libkripke.so | Source: forall.hpp:59-59 [...] | Coverage: 12.44% |
---|
0x4b05c VMOVUPD (%RSI,%RDX,1),%YMM10 [3] |
0x4b061 VFMADD213PD (%RAX,%RDX,1),%YMM2,%YMM10 [2] |
0x4b067 VMOVUPD %YMM10,(%RAX,%RDX,1) [2] |
0x4b06c VMOVUPD 0x20(%RDX,%RSI,1),%YMM11 [1] |
0x4b072 VFMADD213PD 0x20(%RAX,%RDX,1),%YMM2,%YMM11 [2] |
0x4b079 VMOVUPD %YMM11,0x20(%RAX,%RDX,1) [2] |
0x4b07f VMOVUPD 0x40(%RDX,%RSI,1),%YMM12 [1] |
0x4b085 VFMADD213PD 0x40(%RAX,%RDX,1),%YMM2,%YMM12 [2] |
0x4b08c VMOVUPD %YMM12,0x40(%RAX,%RDX,1) [2] |
0x4b092 VMOVUPD 0x60(%RDX,%RSI,1),%YMM13 [1] |
0x4b098 VFMADD213PD 0x60(%RAX,%RDX,1),%YMM2,%YMM13 [2] |
0x4b09f VMOVUPD %YMM13,0x60(%RAX,%RDX,1) [2] |
0x4b0a5 VMOVUPD 0x80(%RDX,%RSI,1),%YMM14 [1] |
0x4b0ae VFMADD213PD 0x80(%RAX,%RDX,1),%YMM2,%YMM14 [2] |
0x4b0b8 VMOVUPD %YMM14,0x80(%RAX,%RDX,1) [2] |
0x4b0c1 VMOVUPD 0xa0(%RDX,%RSI,1),%YMM15 [1] |
0x4b0ca VFMADD213PD 0xa0(%RAX,%RDX,1),%YMM2,%YMM15 [2] |
0x4b0d4 VMOVUPD %YMM15,0xa0(%RAX,%RDX,1) [2] |
0x4b0dd VMOVUPD 0xc0(%RDX,%RSI,1),%YMM0 [1] |
0x4b0e6 VFMADD213PD 0xc0(%RAX,%RDX,1),%YMM2,%YMM0 [2] |
0x4b0f0 VMOVUPD %YMM0,0xc0(%RAX,%RDX,1) [2] |
0x4b0f9 VMOVUPD 0xe0(%RDX,%RSI,1),%YMM4 [1] |
0x4b102 VFMADD213PD 0xe0(%RAX,%RDX,1),%YMM2,%YMM4 [2] |
0x4b10c VMOVUPD %YMM4,0xe0(%RAX,%RDX,1) [2] |
0x4b115 ADD $0x100,%RDX |
0x4b11c CMP %RDX,0xc8(%RSP) [4] |
0x4b124 JNE 4b05c |
/home/kcamus/qaas_runs/169-391-8990/intel/Kripke/build/Kripke/tpl/raja/include/RAJA/policy/loop/forall.hpp: 59 - 59 |
-------------------------------------------------------------------------------- |
59: for (decltype(distance_it) i = 0; i < distance_it; ++i) { |
/home/kcamus/qaas_runs/169-391-8990/intel/Kripke/build/Kripke/src/Kripke/Kernel/LPlusTimes.cpp: 57 - 57 |
-------------------------------------------------------------------------------- |
57: rhs(d,g,z) += ell_plus(d, nm) * phi_out(nm, g, z); |
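For reference, the loop above appears to be the innermost Zone loop of this RAJA kernel, applying line 57 for one fixed (`d`, `nm`, `g`). In the binary, `ell_plus(d, nm)` stays in `%YMM2`, `phi_out` is read through `%RSI`, and `rhs` is read and written through `%RAX`. The scalar C++ sketch below is an illustration, not the Kripke code. The function and parameter names are hypothetical, and it assumes (from `LayoutT_DGZ` in the symbol name and the 256-byte `ADD $0x100,%RDX` step) that `z` is the stride-1 index. Under that assumption, each of the 8 `VFMADD213PD` instructions handles 4 doubles, so one iteration covers 32 zones.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar form of the innermost Zone loop behind
// LPlusTimes.cpp:57 for one fixed (d, nm, g):
//   rhs(d,g,z) += ell_plus(d,nm) * phi_out(nm,g,z)
// Assumes z is the contiguous (stride-1) index of both fields.
void lplus_times_zones(double ell_dnm,         // ell_plus(d, nm): loop-invariant, held in %YMM2
                       const double* phi_row,  // phi_out(nm, g, 0..num_zones-1): loaded via %RSI
                       double* rhs_row,        // rhs(d, g, 0..num_zones-1): loaded and stored via %RAX
                       std::size_t num_zones) {
  assert(num_zones == 0 || (phi_row != nullptr && rhs_row != nullptr));
  for (std::size_t z = 0; z < num_zones; ++z) {
    // When the compiler contracts this into an FMA and vectorizes it,
    // one VFMADD213PD performs 4 of these updates.
    rhs_row[z] += ell_dnm * phi_row[z];
  }
}
```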
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
100.00 | GOMP_parallel | libgomp.h:985 | libgomp.so.1.0.0 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.03 |
CQA speedup if fully vectorized | 2.00 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.06 |
Bottlenecks | micro-operation queue, P2, P3 |
Function | void RAJA::internal::StatementExecutor |
Source | forall.hpp:59-59,LPlusTimes.cpp:57-57 |
Source loop unroll info | unrolled by 4 |
Source loop unroll confidence level | high |
Unroll/vectorization loop type | main |
Unroll factor | 4 |
CQA cycles | 8.50 |
CQA cycles if no scalar integer | 8.50 |
CQA cycles if FP arith vectorized | 8.25 |
CQA cycles if fully vectorized | 4.25 |
Front-end cycles | 8.50 |
DIV/SQRT cycles | NA |
P0 cycles | 4.00 |
P1 cycles | 4.00 |
P2 cycles | 8.50 |
P3 cycles | 8.50 |
P4 cycles | 8.00 |
P5 cycles | 1.00 |
P6 cycles | 1.00 |
P7 cycles | 8.00 |
Inter-iter dependencies cycles | 1 |
FE+BE cycles (UFS) | 8.67 |
Stall cycles (UFS) | 0.00 |
Nb insns | 27.00 |
Nb uops | 26.00 |
Nb loads | 17.00 |
Nb stores | 8.00 |
Nb stack references | 1.00 |
FLOP/cycle | 7.53 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 0.00 |
Nb FLOP fma | 32.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 91.29 |
Bytes prefetched | 0.00 |
Bytes loaded | 520.00 |
Bytes stored | 256.00 |
Stride 0 | 1.00 |
Stride 1 | 1.00 |
Stride n | 2.00 |
Stride unknown | 0.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 100.00 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | 100.00 |
Vectorization ratio mul | NA |
Vectorization ratio add_sub | NA |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | NA |
Vector-efficiency ratio all | 50.00 |
Vector-efficiency ratio load | 50.00 |
Vector-efficiency ratio store | 50.00 |
Vector-efficiency ratio mul | NA |
Vector-efficiency ratio add_sub | NA |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | NA |
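The derived rows of the table above can be reproduced from the raw counts. The sketch assumes, consistently with the numbers, that FLOP/cycle counts each FMA as two operations (64 / 8.50 = 7.53), that Bytes/cycle divides loads plus stores by CQA cycles (776 / 8.50 = 91.29), and that each "CQA speedup if X" row is current cycles over hypothetical cycles (8.50 / 8.25 = 1.03, 8.50 / 4.25 = 2.00). The "next bottleneck killed" value of 1.06 is consistent with 8.50 / 8.00, the next-highest port load. Struct and function names are hypothetical.

```cpp
#include <cassert>

// Static CQA counts for one iteration of loop 431, copied from the table.
struct LoopStatics {
  double fma_ops;       // "Nb FLOP fma": 8 VFMADD213PD x 4 doubles = 32 FMAs
  double bytes_loaded;  // "Bytes loaded"
  double bytes_stored;  // "Bytes stored"
  double cqa_cycles;    // "CQA cycles"
};

// Each FMA is counted as two floating-point operations (multiply + add).
double flop_per_cycle(const LoopStatics& s) {
  assert(s.cqa_cycles > 0.0);
  return 2.0 * s.fma_ops / s.cqa_cycles;
}

// Total memory traffic (loads + stores) per CQA cycle.
double bytes_per_cycle(const LoopStatics& s) {
  assert(s.cqa_cycles > 0.0);
  return (s.bytes_loaded + s.bytes_stored) / s.cqa_cycles;
}

// "CQA speedup if X": current cycles divided by cycles under hypothesis X.
double cqa_speedup(double cycles_now, double cycles_if) {
  assert(cycles_if > 0.0);
  return cycles_now / cycles_if;
}
```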
nb instructions | 27 |
nb uops | 26 |
loop length | 206 |
used x86 registers | 4 |
used mmx registers | 0 |
used xmm registers | 0 |
used ymm registers | 9 |
used zmm registers | 0 |
nb stack references | 1 |
micro-operation queue | 8.50 cycles |
front end | 8.50 cycles |
Metric | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
---|---|---|---|---|---|---|---|---|
uops | 4.00 | 4.00 | 8.50 | 8.50 | 8.00 | 1.00 | 1.00 | 8.00 |
cycles | 4.00 | 4.00 | 8.50 | 8.50 | 8.00 | 1.00 | 1.00 | 8.00 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 1.00 |
FE+BE cycles | 8.67 |
Stall cycles | 0.00 |
Front-end | 8.50 |
Dispatch | 8.50 |
Data deps. | 1.00 |
Overall L1 | 8.50 |
Vectorization ratio | Value |
---|---|
all | 100% |
load | 100% |
store | 100% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | 100% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
Vector-efficiency ratio | Value |
---|---|
all | 50% |
load | 50% |
store | 50% |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | 50% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
VMOVUPD (%RSI,%RDX,1),%YMM10 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD (%RAX,%RDX,1),%YMM2,%YMM10 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM10,(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0x20(%RDX,%RSI,1),%YMM11 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0x20(%RAX,%RDX,1),%YMM2,%YMM11 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM11,0x20(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0x40(%RDX,%RSI,1),%YMM12 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0x40(%RAX,%RDX,1),%YMM2,%YMM12 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM12,0x40(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0x60(%RDX,%RSI,1),%YMM13 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0x60(%RAX,%RDX,1),%YMM2,%YMM13 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM13,0x60(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0x80(%RDX,%RSI,1),%YMM14 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0x80(%RAX,%RDX,1),%YMM2,%YMM14 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM14,0x80(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0xa0(%RDX,%RSI,1),%YMM15 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0xa0(%RAX,%RDX,1),%YMM2,%YMM15 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM15,0xa0(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0xc0(%RDX,%RSI,1),%YMM0 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0xc0(%RAX,%RDX,1),%YMM2,%YMM0 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM0,0xc0(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
VMOVUPD 0xe0(%RDX,%RSI,1),%YMM4 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VFMADD213PD 0xe0(%RAX,%RDX,1),%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD %YMM4,0xe0(%RAX,%RDX,1) | 1 | 0 | 0 | 0.33 | 0.33 | 1 | 0 | 0 | 0.33 | 3 | 1 |
ADD $0x100,%RDX | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
CMP %RDX,0xc8(%RSP) | 1 | 0.25 | 0.25 | 0.50 | 0.50 | 0 | 0.25 | 0.25 | 0 | 1 | 0.50 |
JNE 4b05c <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl2EEEEJNS2_3ForILl1ENS_6policy4loop9loop_execEJNS8_ILl3ESB_JNS2_6LambdaILl0EJEEEEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSG_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSS_EESS_EENSO_INSQ_INSR_6MomentElPSW_EESW_EENSO_INSQ_INSR_5GroupElPS10_EES10_EENSO_INSQ_INSR_4ZoneElPS14_EES14_EEEEENSM_IJEEEJZNK14LPlusTimesSdomclINSR_11ArchLayoutTINSR_12ArchT_OpenMPENSR_11LayoutT_DGZEEEEEvT_NSR_6SdomIdERKNSR_4Core3SetES1L_S1L_S1L_RNS1I_5FieldIdJSW_S10_S14_EEERNS1M_IdJSS_S10_S14_EEERNS1M_IdJSS_SW_EEEEUlSS_SW_S10_S14_E_EEEEEvOS1G_._omp_fn.0+0x38c> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50-1 |
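The port columns can be cross-checked with a simple throughput bound. Per iteration there are 17 loads (8 `VMOVUPD`, 8 folded into `VFMADD213PD`, 1 in `CMP`) shared by P2/P3, 8 stores whose data goes to P4, and 8 FMAs shared by P0/P1. The sketch assumes that Skylake-style port layout and an even spread of uops. It ignores the front end and store-address uops, so it is a lower bound and not a reimplementation of CQA's model. Here the load ports dominate at 17 / 2 = 8.5 cycles, which matches the P2/P3 figures and the 8.50 CQA cycles.

```cpp
#include <algorithm>
#include <cassert>

// Per-iteration instruction mix (hypothetical helper type).
struct IterationMix {
  int loads;   // memory reads, including the loads folded into VFMADD213PD and CMP
  int stores;  // VMOVUPD stores
  int fmas;    // VFMADD213PD instructions
};

// Lower bound on cycles per iteration from execution-port pressure alone,
// assuming two load ports (P2, P3), one store-data port (P4) and two
// FMA ports (P0, P1), with uops spread evenly across equivalent ports.
double port_bound_cycles(const IterationMix& m) {
  assert(m.loads >= 0 && m.stores >= 0 && m.fmas >= 0);
  const double load_ports = m.loads / 2.0;   // P2 + P3
  const double store_port = m.stores / 1.0;  // P4
  const double fma_ports = m.fmas / 2.0;     // P0 + P1
  return std::max({load_ports, store_port, fma_ports});
}
```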
Metric | run_0 |
---|---|
Coverage (% app. time) | 12.44 |
Time (s) | 3.85 |
Instance Count | 1536000 |
Iteration Count - min | 128 |
Iteration Count - avg | 128 |
Iteration Count - max | 128 |
Cycles per Iteration - min | 33.5 |
Cycles per Iteration - avg | 39.03 |
Cycles per Iteration - max | 6101.67 |
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 90.66 |
Instance Count | 1536000 |
ORIG CPI:min | 55.08 |
ORIG CPI:med | 65.67 |
ORIG CPI:max | 73.72 |
DL1 CPI:min | 11.06 |
DL1 CPI:med | 11.98 |
DL1 CPI:max | 17.22 |
ORIG (min) / DL1 (min) | 4.98 |
ORIG (med) / DL1 (med) | 5.48 |
ORIG (max) / DL1 (max) | 4.28 |
Nb Iteration:min | 128 |
Nb Iteration:med | 128.00 |
Nb Iteration:max | 128 |
ORIG: min (cycles) | 7050 |
ORIG: med (cycles) | 8406.00 |
ORIG: max (cycles) | 9436 |
DL1:min (cycles) | 1416 |
DL1:med (cycles) | 1534.00 |
DL1:max (cycles) | 2204 |
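The CPI rows follow from the cycle rows: 8406 cycles over 128 iterations gives 65.67 for ORIG, and 1534 over 128 gives 11.98 for DL1. In the DL1 variant, memory operands are replaced by fixed RIP-relative addresses (see the DL1 column of the ORIG/DL1/Original listing), so the ORIG/DL1 ratio of about 5.5 can be read as an estimate of the cost of data accesses that do not hit L1. A short sketch of this arithmetic, with hypothetical function names:

```cpp
#include <cassert>

// DECAN tables report whole-instance cycle counts; the "CPI" rows are
// cycles per loop iteration, i.e. instance cycles / iteration count.
double cycles_per_iteration(double instance_cycles, double iterations) {
  assert(iterations > 0.0);
  return instance_cycles / iterations;
}

// How much slower the original loop runs than its DL1 variant, whose
// memory operands point at fixed RIP-relative (assumed L1-resident) addresses.
double orig_over_dl1(double orig_cpi, double dl1_cpi) {
  assert(dl1_cpi > 0.0);
  return orig_cpi / dl1_cpi;
}
```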
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 9.13 |
Instance Count | 1536000 |
ORIG CPI:min | 54.61 |
ORIG CPI:med | 68.20 |
ORIG CPI:max | 115.38 |
DL1 CPI:min | 11.05 |
DL1 CPI:med | 12.03 |
DL1 CPI:max | 12.27 |
ORIG (min) / DL1 (min) | 4.94 |
ORIG (med) / DL1 (med) | 5.67 |
ORIG (max) / DL1 (max) | 9.41 |
Nb Iteration:min | 128 |
Nb Iteration:med | 128.00 |
Nb Iteration:max | 128 |
ORIG: min (cycles) | 6990 |
ORIG: med (cycles) | 8730.00 |
ORIG: max (cycles) | 14768 |
DL1:min (cycles) | 1414 |
DL1:med (cycles) | 1540.00 |
DL1:max (cycles) | 1570 |
Metric (average per iteration except for Time and Iteration Count) | ORIG | DL1 | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min (Thread) | Med (Thread) | Avg (Thread) | Max (Thread) | Min (Instances) | Med (Instances) | Max (Instances) | Min (Thread) | Med (Thread) | Avg (Thread) | Max (Thread) | Min (Instances) | Med (Instances) | Max (Instances) | |
Time | 8406.00 | 8406.00 | 8406.00 | 8406.00 | 7050.00 | 8406.00 | 9436.00 | 1534.00 | 1534.00 | 1534.00 | 1534.00 | 1416.00 | 1534.00 | 2204.00 |
CPI MIN | 55.08 | 11.06 | ||||||||||||
CPI MED | 65.67 | 65.67 | 65.67 | 65.67 | 55.08 | 65.67 | 73.72 | 11.98 | 11.98 | 11.98 | 11.98 | 11.06 | 11.98 | 17.22 |
CPI AVG | 65.17 | 12.27 | ||||||||||||
CPI MAX | 73.72 | 17.22 | ||||||||||||
Iteration Count | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 |
Metric (average per iteration except for Time and Iteration Count) | ORIG | DL1 | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min (Thread) | Med (Thread) | Avg (Thread) | Max (Thread) | Min (Instances) | Med (Instances) | Max (Instances) | Min (Thread) | Med (Thread) | Avg (Thread) | Max (Thread) | Min (Instances) | Med (Instances) | Max (Instances) | |
Time | 8730.00 | 8730.00 | 8730.00 | 8730.00 | 6990.00 | 8730.00 | 14768.00 | 1540.00 | 1540.00 | 1540.00 | 1540.00 | 1414.00 | 1540.00 | 1570.00 |
CPI MIN | 54.61 | 11.05 | ||||||||||||
CPI MED | 68.20 | 68.20 | 68.20 | 68.20 | 54.61 | 68.20 | 115.38 | 12.03 | 12.03 | 12.03 | 12.03 | 11.05 | 12.03 | 12.27 |
CPI AVG | 76.81 | 11.91 | ||||||||||||
CPI MAX | 115.38 | 12.27 | ||||||||||||
Iteration Count | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 |
ORIG | DL1 | Original Code |
---|---|---|
0x11022d ADDQ $0x1,-0x1575(%RIP) 0x110235 VMOVUPD (%RSI,%RDX,1),%YMM10 | 0x1106e1 VMOVUPD -0x22a9(%RIP),%YMM10 | 0x4b05c VMOVUPD (%RSI,%RDX,1),%YMM10 |
0x11023a VFMADD213PD (%RAX,%RDX,1),%YMM2,%YMM10 | 0x1106e9 VFMADD213PD -0x22b2(%RIP),%YMM2,%YMM10 0x1106f2 NOP | 0x4b061 VFMADD213PD (%RAX,%RDX,1),%YMM2,%YMM10 |
0x110240 VMOVUPD %YMM10,(%RAX,%RDX,1) | 0x1106f3 VMOVUPD %YMM10,-0x227b(%RIP) 0x1106fb NOP | 0x4b067 VMOVUPD %YMM10,(%RAX,%RDX,1) |
0x110245 VMOVUPD 0x20(%RDX,%RSI,1),%YMM11 | 0x1106fc VMOVUPD -0x22c4(%RIP),%YMM11 | 0x4b06c VMOVUPD 0x20(%RDX,%RSI,1),%YMM11 |
0x11024b VFMADD213PD 0x20(%RAX,%RDX,1),%YMM2,%YMM11 | 0x110704 VFMADD213PD -0x22cd(%RIP),%YMM2,%YMM11 0x11070d NOP | 0x4b072 VFMADD213PD 0x20(%RAX,%RDX,1),%YMM2,%YMM11 |
0x110252 VMOVUPD %YMM11,0x20(%RAX,%RDX,1) | 0x11070e VMOVUPD %YMM11,-0x2256(%RIP) 0x110716 NOP | 0x4b079 VMOVUPD %YMM11,0x20(%RAX,%RDX,1) |
0x110258 VMOVUPD 0x40(%RDX,%RSI,1),%YMM12 | 0x110717 VMOVUPD -0x22df(%RIP),%YMM12 | 0x4b07f VMOVUPD 0x40(%RDX,%RSI,1),%YMM12 |
0x11025e VFMADD213PD 0x40(%RAX,%RDX,1),%YMM2,%YMM12 | 0x11071f VFMADD213PD -0x22e8(%RIP),%YMM2,%YMM12 0x110728 NOP | 0x4b085 VFMADD213PD 0x40(%RAX,%RDX,1),%YMM2,%YMM12 |
0x110265 VMOVUPD %YMM12,0x40(%RAX,%RDX,1) | 0x110729 VMOVUPD %YMM12,-0x2231(%RIP) 0x110731 NOP | 0x4b08c VMOVUPD %YMM12,0x40(%RAX,%RDX,1) |
0x11026b VMOVUPD 0x60(%RDX,%RSI,1),%YMM13 | 0x110732 VMOVUPD -0x22fa(%RIP),%YMM13 | 0x4b092 VMOVUPD 0x60(%RDX,%RSI,1),%YMM13 |
0x110271 VFMADD213PD 0x60(%RAX,%RDX,1),%YMM2,%YMM13 | 0x11073a VFMADD213PD -0x2303(%RIP),%YMM2,%YMM13 0x110743 NOP | 0x4b098 VFMADD213PD 0x60(%RAX,%RDX,1),%YMM2,%YMM13 |
0x110278 VMOVUPD %YMM13,0x60(%RAX,%RDX,1) | 0x110744 VMOVUPD %YMM13,-0x220c(%RIP) 0x11074c NOP | 0x4b09f VMOVUPD %YMM13,0x60(%RAX,%RDX,1) |
0x11027e VMOVUPD 0x80(%RDX,%RSI,1),%YMM14 | 0x11074d VMOVUPD -0x2315(%RIP),%YMM14 | 0x4b0a5 VMOVUPD 0x80(%RDX,%RSI,1),%YMM14 |
0x110287 VFMADD213PD 0x80(%RAX,%RDX,1),%YMM2,%YMM14 | 0x110755 VFMADD213PD -0x231e(%RIP),%YMM2,%YMM14 0x11075e NOP | 0x4b0ae VFMADD213PD 0x80(%RAX,%RDX,1),%YMM2,%YMM14 |
0x110291 VMOVUPD %YMM14,0x80(%RAX,%RDX,1) | 0x11075f VMOVUPD %YMM14,-0x21e7(%RIP) 0x110767 NOP | 0x4b0b8 VMOVUPD %YMM14,0x80(%RAX,%RDX,1) |
0x11029a VMOVUPD 0xa0(%RDX,%RSI,1),%YMM15 | 0x110768 VMOVUPD -0x2330(%RIP),%YMM15 | 0x4b0c1 VMOVUPD 0xa0(%RDX,%RSI,1),%YMM15 |
0x1102a3 VFMADD213PD 0xa0(%RAX,%RDX,1),%YMM2,%YMM15 | 0x110770 VFMADD213PD -0x2339(%RIP),%YMM2,%YMM15 0x110779 NOP | 0x4b0ca VFMADD213PD 0xa0(%RAX,%RDX,1),%YMM2,%YMM15 |
0x1102ad VMOVUPD %YMM15,0xa0(%RAX,%RDX,1) | 0x11077a VMOVUPD %YMM15,-0x21c2(%RIP) 0x110782 NOP | 0x4b0d4 VMOVUPD %YMM15,0xa0(%RAX,%RDX,1) |
0x1102b6 VMOVUPD 0xc0(%RDX,%RSI,1),%YMM0 | 0x110783 VMOVUPD -0x234b(%RIP),%YMM0 | 0x4b0dd VMOVUPD 0xc0(%RDX,%RSI,1),%YMM0 |
0x1102bf VFMADD213PD 0xc0(%RAX,%RDX,1),%YMM2,%YMM0 | 0x11078b VFMADD213PD -0x2354(%RIP),%YMM2,%YMM0 0x110794 NOP | 0x4b0e6 VFMADD213PD 0xc0(%RAX,%RDX,1),%YMM2,%YMM0 |
0x1102c9 VMOVUPD %YMM0,0xc0(%RAX,%RDX,1) | 0x110795 VMOVUPD %YMM0,-0x219d(%RIP) 0x11079d NOP | 0x4b0f0 VMOVUPD %YMM0,0xc0(%RAX,%RDX,1) |
0x1102d2 VMOVUPD 0xe0(%RDX,%RSI,1),%YMM4 | 0x11079e VMOVUPD -0x2366(%RIP),%YMM4 | 0x4b0f9 VMOVUPD 0xe0(%RDX,%RSI,1),%YMM4 |
0x1102db VFMADD213PD 0xe0(%RAX,%RDX,1),%YMM2,%YMM4 | 0x1107a6 VFMADD213PD -0x236f(%RIP),%YMM2,%YMM4 0x1107af NOP | 0x4b102 VFMADD213PD 0xe0(%RAX,%RDX,1),%YMM2,%YMM4 |
0x1102e5 VMOVUPD %YMM4,0xe0(%RAX,%RDX,1) | 0x1107b0 VMOVUPD %YMM4,-0x2178(%RIP) 0x1107b8 NOP | 0x4b10c VMOVUPD %YMM4,0xe0(%RAX,%RDX,1) |
0x1102ee ADD $0x100,%RDX | 0x1107b9 ADD $0x100,%RDX | 0x4b115 ADD $0x100,%RDX |
0x1102f5 CMP %RDX,0xc8(%RSP) | 0x1107c0 CMP %RDX,-0x23c7(%RIP) | 0x4b11c CMP %RDX,0xc8(%RSP) |
0x1102fd JNE 11022d <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl2EEEEJNS2_3ForILl1ENS_6policy4loop9loop_execEJNS8_ILl3ESB_JNS2_6LambdaILl0EJEEEEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSG_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSS_EESS_EENSO_INSQ_INSR_6MomentElPSW_EESW_EENSO_INSQ_INSR_5GroupElPS10_EES10_EENSO_INSQ_INSR_4ZoneElPS14_EES14_EEEEENSM_IJEEEJZNK14LPlusTimesSdomclINSR_11ArchLayoutTINSR_12ArchT_OpenMPENSR_11LayoutT_DGZEEEEEvT_NSR_6SdomIdERKNSR_4Core3SetES1L_S1L_S1L_RNS1I_5FieldIdJSW_S10_S14_EEERNS1M_IdJSS_S10_S14_EEERNS1M_IdJSS_SW_EEEEUlSS_SW_S10_S14_E_EEEEEvOS1G_._omp_fn.0+0xc555d> | 0x1107c7 JNE 1106e1 <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl2EEEEJNS2_3ForILl1ENS_6policy4loop9loop_execEJNS8_ILl3ESB_JNS2_6LambdaILl0EJEEEEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSG_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSS_EESS_EENSO_INSQ_INSR_6MomentElPSW_EESW_EENSO_INSQ_INSR_5GroupElPS10_EES10_EENSO_INSQ_INSR_4ZoneElPS14_EES14_EEEEENSM_IJEEEJZNK14LPlusTimesSdomclINSR_11ArchLayoutTINSR_12ArchT_OpenMPENSR_11LayoutT_DGZEEEEEvT_NSR_6SdomIdERKNSR_4Core3SetES1L_S1L_S1L_RNS1I_5FieldIdJSW_S10_S14_EEERNS1M_IdJSS_S10_S14_EEERNS1M_IdJSS_SW_EEEEUlSS_SW_S10_S14_E_EEEEEvOS1G_._omp_fn.0+0xc5a11> | 0x4b124 JNE 4b05c <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl2EEEEJNS2_3ForILl1ENS_6policy4loop9loop_execEJNS8_ILl3ESB_JNS2_6LambdaILl0EJEEEEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSG_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSS_EESS_EENSO_INSQ_INSR_6MomentElPSW_EESW_EENSO_INSQ_INSR_5GroupElPS10_EES10_EENSO_INSQ_INSR_4ZoneElPS14_EES14_EEEEENSM_IJEEEJZNK14LPlusTimesSdomclINSR_11ArchLayoutTINSR_12ArchT_OpenMPENSR_11LayoutT_DGZEEEEEvT_NSR_6SdomIdERKNSR_4Core3SetES1L_S1L_S1L_RNS1I_5FieldIdJSW_S10_S14_EEERNS1M_IdJSS_S10_S14_EEERNS1M_IdJSS_SW_EEEEUlSS_SW_S10_S14_E_EEEEEvOS1G_._omp_fn.0+0x38c> |
Metric | ORIG | DL1 | Original |
---|---|---|---|
FP operations per cycle L1 | 7.11, 7.11, | 6.10, 6.10, | 7.53, 7.53, |
cycles L1 CQA | 9.00 | 10.50 | 8.50 |
cycles UFS | 9.16 | 10.66 | 8.67 |
bytes loaded | 528.00 | 520.00 | 520.00 |
bytes stored | 264.00 | 256.00 | 256.00 |
nb loads | 18.00 | 17.00 | 17.00 |
nb stores | 9.00 | 8.00 | 8.00 |
cycles dispatch | 9.00 | 8.50 | 8.50 |
cycles front end | 9.00 | 10.50 | 8.50 |
cycles P0 | 4.00 | 4.00 | 4.00 |
cycles P1 | 4.00 | 4.00 | 4.00 |
cycles P2 | 9.00 | 8.50 | 8.50 |
cycles P3 | 9.00 | 8.50 | 8.50 |
cycles P4 | 9.00 | 8.00 | 8.00 |
cycles P5 | 1.50 | 1.00 | 1.00 |
cycles P6 | 1.50 | 1.00 | 1.00 |
cycles P7 | 9.00 | 8.00 | 8.00 |
stall cycles | 0.00 | 0.00 | 0.00 |
LB full | 0.00 | 0.00 | 0.00 |
LM full | 0.00 | 0.00 | 0.00 |
PRF full | 0.00 | 0.00 | 0.00 |
PRF_FLOAT full | 0.00 | 0.00 | 0.00 |
PRF_INT full | 0.00 | 0.00 | 0.00 |
ROB full | 0.00 | 0.00 | 0.00 |
RS full | 0.00 | 0.00 | 0.00 |
SB full | 0.00 | 0.00 | 0.00 |
nb uops | 28.00 | 42.00 | 26.00 |
uops P0 | 4.00 | 4.00 | 4.00 |
uops P1 | 4.00 | 4.00 | 4.00 |
uops P2 | 9.00 | 8.50 | 8.50 |
uops P3 | 9.00 | 8.50 | 8.50 |
uops P4 | 9.00 | 8.00 | 8.00 |
uops P5 | 1.50 | 1.00 | 1.00 |
uops P6 | 1.50 | 1.00 | 1.00 |
uops P7 | 9.00 | 8.00 | 8.00 |
ID | 432 | 434 | 431 |
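The ORIG and Original columns differ by exactly one load of 8 bytes and one store of 8 bytes (18 vs 17 loads, 528 vs 520 bytes loaded, 9 vs 8 stores, 264 vs 256 bytes stored). This matches the `ADDQ $0x1,-0x1575(%RIP)` that appears only in the ORIG listing, which reads and writes one 64-bit memory location per iteration. The sketch below models that read-modify-write as an adjustment of the Original traffic counts. The type and function names are hypothetical.

```cpp
#include <cassert>

// Per-iteration memory traffic as reported in the comparison table.
struct MemTraffic {
  int loads;
  int stores;
  int bytes_loaded;
  int bytes_stored;
};

// Adds the effect of one 64-bit read-modify-write (such as ADDQ $0x1,mem):
// one 8-byte load plus one 8-byte store.
MemTraffic with_counter_rmw(MemTraffic t) {
  assert(t.loads >= 0 && t.stores >= 0);
  t.loads += 1;
  t.stores += 1;
  t.bytes_loaded += 8;
  t.bytes_stored += 8;
  return t;
}
```

Applied to the Original column (17 loads, 8 stores, 520 and 256 bytes), it yields the ORIG column values.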