Loop Id: 525 | Module: libkripke.so | Source: forall.hpp:59-59 [...] | Coverage: 0.71% |
---|
0x5733c VMOVUPD (%RSI,%RAX,1),%YMM7 [1] |
0x57341 VMULPD (%R9,%RAX,1),%YMM7,%YMM9 [2] |
0x57347 VMOVUPD 0x40(%RSI,%RAX,1),%YMM11 [1] |
0x5734d VMULPD 0x40(%R9,%RAX,1),%YMM11,%YMM12 [2] |
0x57354 VMOVUPD 0x80(%RSI,%RAX,1),%YMM15 [1] |
0x5735d VMOVUPD 0x60(%RSI,%RAX,1),%YMM13 [1] |
0x57363 VMOVUPD 0xa0(%RSI,%RAX,1),%YMM4 [1] |
0x5736c VMULPD 0x60(%R9,%RAX,1),%YMM13,%YMM14 [2] |
0x57373 VMULPD 0x80(%R9,%RAX,1),%YMM15,%YMM3 [2] |
0x5737d VMOVUPD 0xc0(%RSI,%RAX,1),%YMM5 [1] |
0x57386 VMULPD 0xa0(%R9,%RAX,1),%YMM4,%YMM6 [2] |
0x57390 VFMADD132PD %YMM8,%YMM0,%YMM9 |
0x57395 VMOVUPD 0x20(%RSI,%RAX,1),%YMM0 [1] |
0x5739b VMULPD 0x20(%R9,%RAX,1),%YMM0,%YMM10 [2] |
0x573a2 VMULPD 0xc0(%R9,%RAX,1),%YMM5,%YMM7 [2] |
0x573ac VFMADD132PD %YMM8,%YMM9,%YMM10 |
0x573b1 VMOVUPD 0xe0(%RSI,%RAX,1),%YMM9 [1] |
0x573ba VMULPD 0xe0(%R9,%RAX,1),%YMM9,%YMM0 [2] |
0x573c4 ADD $0x100,%RAX |
0x573ca VFMADD132PD %YMM8,%YMM10,%YMM12 |
0x573cf VFMADD132PD %YMM8,%YMM12,%YMM14 |
0x573d4 VFMADD132PD %YMM8,%YMM14,%YMM3 |
0x573d9 VFMADD132PD %YMM8,%YMM3,%YMM6 |
0x573de VFMADD132PD %YMM8,%YMM6,%YMM7 |
0x573e3 VFMADD132PD %YMM8,%YMM7,%YMM0 |
0x573e8 CMP %RAX,%R14 |
0x573eb JNE 5733c |
/home/kcamus/qaas_runs/169-391-8990/intel/Kripke/build/Kripke/tpl/raja/include/RAJA/policy/loop/forall.hpp: 59 - 59 |
-------------------------------------------------------------------------------- |
59: for (decltype(distance_it) i = 0; i < distance_it; ++i) { |
/home/kcamus/qaas_runs/169-391-8990/intel/Kripke/build/Kripke/tpl/raja/include/RAJA/util/Operators.hpp: 307 - 307 |
-------------------------------------------------------------------------------- |
307: return Ret{lhs} + rhs; |
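The assembly above, read together with the two source lines, suggests a sum reduction: each iteration multiplies elements of two unit-stride streams (the loads through `%RSI` and `%R9`), scales the product by the loop-invariant `%YMM8`, and adds it to a running sum carried in `%YMM0`. A minimal scalar sketch of that reading follows. The names `weighted_dot`, `a`, `b`, and `w` are hypothetical, and treating `%YMM8` as a broadcast scalar is an assumption, not something the report states.

```cpp
#include <cstddef>

// Hypothetical scalar equivalent of the hot loop: acc = (a[i] * b[i]) * w + acc.
// `a` and `b` stand for the two unit-stride input streams (%RSI and %R9),
// `w` for the loop-invariant factor held in %YMM8 (assumed to be a scalar).
double weighted_dot(const double* a, const double* b, double w, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // In the listing this is one VMULPD feeding one VFMADD132PD.
        acc += (a[i] * b[i]) * w;
    }
    return acc;
}
```

In the compiled loop each 256-bit register holds four such partial sums and the body is unrolled eight times, so one pass through the assembly advances the index by 0x100 bytes, that is, 32 doubles per stream.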
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
100.00 | GOMP_parallel | libgomp.h:985 | libgomp.so.1.0.0 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.38 |
CQA speedup if fully vectorized | 2.00 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 3.76 |
Bottlenecks | |
Function | void RAJA::internal::StatementExecutor |
Source | forall.hpp:59-59,Operators.hpp:307-307 |
Source loop unroll info | not unrolled or unrolled with no peel/tail loop |
Source loop unroll confidence level | max |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 32.00 |
CQA cycles if no scalar integer | 32.00 |
CQA cycles if FP arith vectorized | 23.11 |
CQA cycles if fully vectorized | 16.00 |
Front-end cycles | 8.50 |
DIV/SQRT cycles | NA |
P0 cycles | 8.00 |
P1 cycles | 8.00 |
P2 cycles | 8.00 |
P3 cycles | 8.00 |
P4 cycles | 0.00 |
P5 cycles | 1.00 |
P6 cycles | 1.00 |
P7 cycles | 0.00 |
Inter-iter dependencies cycles | 32 |
FE+BE cycles (UFS) | 32.16 |
Stall cycles (UFS) | 23.29 |
Nb insns | 27.00 |
Nb uops | 26.00 |
Nb loads | 16.00 |
Nb stores | 0.00 |
Nb stack references | 0.00 |
FLOP/cycle | 3.00 |
Nb FLOP add-sub | 0.00 |
Nb FLOP mul | 32.00 |
Nb FLOP fma | 32.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 16.00 |
Bytes prefetched | 0.00 |
Bytes loaded | 512.00 |
Bytes stored | 0.00 |
Stride 0 | 0.00 |
Stride 1 | 2.00 |
Stride n | 0.00 |
Stride unknown | 0.00 |
Stride indirect | 0.00 |
Vectorization ratio all | 100.00 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | NA |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | NA |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | NA |
Vector-efficiency ratio all | 50.00 |
Vector-efficiency ratio load | 50.00 |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | 50.00 |
Vector-efficiency ratio add_sub | NA |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | NA |
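The derived figures in the table above can be reproduced from its raw counts. The sketch below assumes that a multiply counts as one FLOP and an FMA as two, which is consistent with the reported 3.00 FLOP/cycle but is not stated in the report. The function names are illustrative only.

```cpp
// FLOP executed per loop iteration, assuming 1 FLOP per multiply element
// and 2 FLOP per FMA element (an assumption that matches the reported rate).
inline double flop_per_iteration(double n_mul, double n_fma) {
    return n_mul + 2.0 * n_fma;
}

// Generic "amount per cycle" rate, used for both FLOP/cycle and Bytes/cycle.
inline double per_cycle(double amount, double cycles) {
    return amount / cycles;
}

// Speedup of a projected cycle count relative to the baseline cycle count.
inline double speedup(double base_cycles, double projected_cycles) {
    return base_cycles / projected_cycles;
}
```

With the values reported here, `per_cycle(flop_per_iteration(32, 32), 32)` gives 3.00 FLOP/cycle, `per_cycle(512, 32)` gives 16.00 bytes/cycle, and `speedup(32, 16)` and `speedup(32, 23.11)` give the 2.00 and 1.38 speedups listed for the fully vectorized and FP-vectorized projections.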
Metric | Value |
---|---|
nb instructions | 27 |
nb uops | 26 |
loop length | 181 |
used x86 registers | 4 |
used mmx registers | 0 |
used xmm registers | 0 |
used ymm registers | 14 |
used zmm registers | 0 |
nb stack references | 0 |
micro-operation queue | 8.50 cycles |
front end | 8.50 cycles |
Port | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
---|---|---|---|---|---|---|---|---|
uops | 8.00 | 8.00 | 8.00 | 8.00 | 0.00 | 1.00 | 1.00 | 0.00 |
cycles | 8.00 | 8.00 | 8.00 | 8.00 | 0.00 | 1.00 | 1.00 | 0.00 |
Metric | Value |
---|---|
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 32.00 |
FE+BE cycles | 32.16 |
Stall cycles | 23.29 |
LB full (events) | 23.79 |
Front-end | 8.50 |
Dispatch | 8.00 |
Data deps. | 32.00 |
Overall L1 | 32.00 |
Vectorization ratio | Value |
---|---|
all | 100% |
load | 100% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | 100% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
Vector-efficiency ratio | Value |
---|---|
all | 50% |
load | 50% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 50% |
add-sub | NA (no add-sub vectorizable/vectorized instructions) |
fma | 50% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
VMOVUPD (%RSI,%RAX,1),%YMM7 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMULPD (%R9,%RAX,1),%YMM7,%YMM9 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD 0x40(%RSI,%RAX,1),%YMM11 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMULPD 0x40(%R9,%RAX,1),%YMM11,%YMM12 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD 0x80(%RSI,%RAX,1),%YMM15 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMOVUPD 0x60(%RSI,%RAX,1),%YMM13 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMOVUPD 0xa0(%RSI,%RAX,1),%YMM4 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMULPD 0x60(%R9,%RAX,1),%YMM13,%YMM14 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMULPD 0x80(%R9,%RAX,1),%YMM15,%YMM3 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD 0xc0(%RSI,%RAX,1),%YMM5 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMULPD 0xa0(%R9,%RAX,1),%YMM4,%YMM6 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM0,%YMM9 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD 0x20(%RSI,%RAX,1),%YMM0 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMULPD 0x20(%R9,%RAX,1),%YMM0,%YMM10 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMULPD 0xc0(%R9,%RAX,1),%YMM5,%YMM7 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM9,%YMM10 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVUPD 0xe0(%RSI,%RAX,1),%YMM9 | 1 | 0 | 0 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 5-6 | 0.50 |
VMULPD 0xe0(%R9,%RAX,1),%YMM9,%YMM0 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
ADD $0x100,%RAX | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
VFMADD132PD %YMM8,%YMM10,%YMM12 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM12,%YMM14 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM14,%YMM3 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM3,%YMM6 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM6,%YMM7 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD132PD %YMM8,%YMM7,%YMM0 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
CMP %RAX,%R14 | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
JNE 5733c <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl1EEEEJNS2_3ForILl2ENS_6policy4loop9loop_execEJNS2_6LambdaILl0EJEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSF_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSR_EESR_EENSN_INSP_INSQ_5GroupElPSV_EESV_EENSN_INSP_INSQ_4ZoneElPSZ_EESZ_EEEEENSL_IJEEEJZNK14PopulationSdomclINSQ_11ArchLayoutTINSQ_12ArchT_OpenMPENSQ_11LayoutT_DGZEEEEEvT_NSQ_6SdomIdERKNSQ_4Core3SetES1G_S1G_RNS1D_5FieldIdJSR_SV_SZ_EEERNS1H_IdJSR_EEERNS1H_IdJSZ_EEEPdEUlSR_SV_SZ_E_EEEEEvOS1B_._omp_fn.0+0x25c> | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50-1 |
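The latency column helps explain the 32-cycle figure reported for this path: the eight `VFMADD132PD` instructions each consume the result of the previous one, so they form a serial chain of 8 × 4 cycles that also carries across iterations through `%YMM0`. The sketch below models that bound and shows one common way to shorten such a chain, namely splitting the sum across several independent accumulators. It is an illustration under assumptions, not the Kripke or RAJA code: `chain_cycles`, `loop_bound`, and `weighted_dot_multi` are hypothetical names, and taking the maximum of the front-end, dispatch, and data-dependency figures is an inference from the numbers reported above.

```cpp
#include <algorithm>
#include <cstddef>

// Latency of a serial chain of dependent instructions, in cycles.
inline double chain_cycles(int chained_ops, double latency) {
    return chained_ops * latency;
}

// Assumed static bound: the slowest of front end, dispatch, and data dependencies.
inline double loop_bound(double front_end, double dispatch, double data_deps) {
    return std::max(front_end, std::max(dispatch, data_deps));
}

// Reduction with K independent accumulators: consecutive updates no longer
// depend on each other, so the serial chain per accumulator is K times shorter.
template <std::size_t K>
double weighted_dot_multi(const double* a, const double* b, double w, std::size_t n) {
    double acc[K] = {};
    std::size_t i = 0;
    for (; i + K <= n; i += K) {
        for (std::size_t k = 0; k < K; ++k) {
            acc[k] += (a[i + k] * b[i + k]) * w;
        }
    }
    double total = 0.0;
    for (std::size_t k = 0; k < K; ++k) {
        total += acc[k];  // combine the partial sums once, after the main loop
    }
    for (; i < n; ++i) {
        total += (a[i] * b[i]) * w;  // remainder elements
    }
    return total;
}
```

With the reported numbers, `loop_bound(8.5, 8.0, chain_cycles(8, 4.0))` gives 32, and shortening the chain to two FMAs would leave the 8.50-cycle front end as the limit, in line with the reported 3.76 speedup if the next bottleneck is removed. Splitting a floating-point sum changes the order of the additions, so results can differ in the last bits from the single-accumulator version.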
Metric | run_0 |
---|---|
Coverage (% app. time) | 0.71 |
Time (s) | 0.22 |
Instance Count | 61440 |
Iteration Count - min | 128 |
Iteration Count - avg | 128 |
Iteration Count - max | 128 |
Cycles per Iteration - min | 42.59 |
Cycles per Iteration - avg | 58.87 |
Cycles per Iteration - max | 312.86 |
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 83.99 |
Instance Count | 61440 |
ORIG CPI:min | 48.11 |
ORIG CPI:med | 59.44 |
ORIG CPI:max | 81.47 |
DL1 CPI:min | 32.61 |
DL1 CPI:med | 33.28 |
DL1 CPI:max | 36.73 |
ORIG (min) / DL1 (min) | 1.48 |
ORIG (med) / DL1 (med) | 1.79 |
ORIG (max) / DL1 (max) | 2.22 |
Nb Iteration:min | 128 |
Nb Iteration:med | 128.00 |
Nb Iteration:max | 128 |
ORIG: min (cycles) | 6158 |
ORIG: med (cycles) | 7608.00 |
ORIG: max (cycles) | 10428 |
DL1:min (cycles) | 4174 |
DL1:med (cycles) | 4260.00 |
DL1:max (cycles) | 4702 |
Metric | Value |
---|---|
Bucket Coverage (% loop time) | 15.67 |
Instance Count | 61440 |
ORIG CPI:min | 46.19 |
ORIG CPI:med | 60.92 |
ORIG CPI:max | 236.36 |
DL1 CPI:min | 32.63 |
DL1 CPI:med | 33.31 |
DL1 CPI:max | 120.08 |
ORIG (min) / DL1 (min) | 1.42 |
ORIG (med) / DL1 (med) | 1.83 |
ORIG (max) / DL1 (max) | 1.97 |
Nb Iteration:min | 128 |
Nb Iteration:med | 128.00 |
Nb Iteration:max | 128 |
ORIG: min (cycles) | 5912 |
ORIG: med (cycles) | 7798.00 |
ORIG: max (cycles) | 30254 |
DL1:min (cycles) | 4176 |
DL1:med (cycles) | 4264.00 |
DL1:max (cycles) | 15370 |
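The per-iteration figures in the two tables above follow from the raw cycle counts and the fixed trip count of 128, and the ORIG/DL1 ratio compares the loop as measured with the DL1 variant, in which every load targets a fixed address. A small sketch of that arithmetic, with hypothetical helper names:

```cpp
// Cycles per iteration (CPI) from a measured cycle count and a trip count.
inline double cycles_per_iteration(double cycles, double iterations) {
    return cycles / iterations;
}

// Ratio between the original loop and its DL1 variant; values well above 1
// mean the original spends extra time that the DL1 variant does not.
inline double orig_over_dl1(double orig_cpi, double dl1_cpi) {
    return orig_cpi / dl1_cpi;
}
```

For the first table, `cycles_per_iteration(7608, 128)` is about 59.44 and `cycles_per_iteration(4260, 128)` about 33.28, giving a median ratio of about 1.79. A DL1 CPI of about 33, close to the 32-cycle static estimate, suggests that data access beyond L1 accounts for much of the measured time, though this reading is an interpretation rather than a statement from the report.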
Metric (average per iteration except for Time and Iteration Count) | ORIG Min (Thread) | ORIG Med (Thread) | ORIG Avg (Thread) | ORIG Max (Thread) | ORIG Min (Instances) | ORIG Med (Instances) | ORIG Max (Instances) | DL1 Min (Thread) | DL1 Med (Thread) | DL1 Avg (Thread) | DL1 Max (Thread) | DL1 Min (Instances) | DL1 Med (Instances) | DL1 Max (Instances) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (cycles) | 7608.00 | 7608.00 | 7608.00 | 7608.00 | 6158.00 | 7608.00 | 10428.00 | 4260.00 | 4260.00 | 4260.00 | 4260.00 | 4174.00 | 4260.00 | 4702.00 |
CPI MED | 59.44 | 59.44 | 59.44 | 59.44 | 48.11 | 59.44 | 81.47 | 33.28 | 33.28 | 33.28 | 33.28 | 32.61 | 33.28 | 36.73 |
Iteration Count | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 |
Metric | ORIG | DL1 |
---|---|---|
CPI MIN | 48.11 | 32.61 |
CPI AVG | 60.82 | 33.37 |
CPI MAX | 81.47 | 36.73 |
Metric (average per iteration except for Time and Iteration Count) | ORIG Min (Thread) | ORIG Med (Thread) | ORIG Avg (Thread) | ORIG Max (Thread) | ORIG Min (Instances) | ORIG Med (Instances) | ORIG Max (Instances) | DL1 Min (Thread) | DL1 Med (Thread) | DL1 Avg (Thread) | DL1 Max (Thread) | DL1 Min (Instances) | DL1 Med (Instances) | DL1 Max (Instances) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Time (cycles) | 7798.00 | 7798.00 | 7798.00 | 7798.00 | 5912.00 | 7798.00 | 30254.00 | 4264.00 | 4264.00 | 4264.00 | 4264.00 | 4176.00 | 4264.00 | 15370.00 |
CPI MED | 60.92 | 60.92 | 60.92 | 60.92 | 46.19 | 60.92 | 236.36 | 33.31 | 33.31 | 33.31 | 33.31 | 32.63 | 33.31 | 120.08 |
Iteration Count | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 | 128.00 |
Metric | ORIG | DL1 |
---|---|---|
CPI MIN | 46.19 | 32.63 |
CPI AVG | 67.56 | 35.98 |
CPI MAX | 236.36 | 120.08 |
ORIG | DL1 | Original Code |
---|---|---|
0x112a52 ADDQ $0x1,-0x365a(%RIP) 0x112a5a VMOVUPD (%RSI,%RAX,1),%YMM7 | 0x112e62 VMOVUPD -0x456a(%RIP),%YMM7 | 0x5733c VMOVUPD (%RSI,%RAX,1),%YMM7 |
0x112a5f VMULPD (%R9,%RAX,1),%YMM7,%YMM9 | 0x112e6a VMULPD -0x4572(%RIP),%YMM7,%YMM9 0x112e72 NOP | 0x57341 VMULPD (%R9,%RAX,1),%YMM7,%YMM9 |
0x112a65 VMOVUPD 0x40(%RSI,%RAX,1),%YMM11 | 0x112e73 VMOVUPD -0x457b(%RIP),%YMM11 | 0x57347 VMOVUPD 0x40(%RSI,%RAX,1),%YMM11 |
0x112a6b VMULPD 0x40(%R9,%RAX,1),%YMM11,%YMM12 | 0x112e7b VMULPD -0x4583(%RIP),%YMM11,%YMM12 0x112e83 NOP | 0x5734d VMULPD 0x40(%R9,%RAX,1),%YMM11,%YMM12 |
0x112a72 VMOVUPD 0x80(%RSI,%RAX,1),%YMM15 | 0x112e84 VMOVUPD -0x458c(%RIP),%YMM15 | 0x57354 VMOVUPD 0x80(%RSI,%RAX,1),%YMM15 |
0x112a7b VMOVUPD 0x60(%RSI,%RAX,1),%YMM13 | 0x112e8c VMOVUPD -0x4594(%RIP),%YMM13 | 0x5735d VMOVUPD 0x60(%RSI,%RAX,1),%YMM13 |
0x112a81 VMOVUPD 0xa0(%RSI,%RAX,1),%YMM4 | 0x112e94 VMOVUPD -0x459c(%RIP),%YMM4 | 0x57363 VMOVUPD 0xa0(%RSI,%RAX,1),%YMM4 |
0x112a8a VMULPD 0x60(%R9,%RAX,1),%YMM13,%YMM14 | 0x112e9c VMULPD -0x45a4(%RIP),%YMM13,%YMM14 0x112ea4 NOP | 0x5736c VMULPD 0x60(%R9,%RAX,1),%YMM13,%YMM14 |
0x112a91 VMULPD 0x80(%R9,%RAX,1),%YMM15,%YMM3 | 0x112ea5 VMULPD -0x45ad(%RIP),%YMM15,%YMM3 0x112ead NOP | 0x57373 VMULPD 0x80(%R9,%RAX,1),%YMM15,%YMM3 |
0x112a9b VMOVUPD 0xc0(%RSI,%RAX,1),%YMM5 | 0x112eae VMOVUPD -0x45b6(%RIP),%YMM5 | 0x5737d VMOVUPD 0xc0(%RSI,%RAX,1),%YMM5 |
0x112aa4 VMULPD 0xa0(%R9,%RAX,1),%YMM4,%YMM6 | 0x112eb6 VMULPD -0x45be(%RIP),%YMM4,%YMM6 0x112ebe NOP | 0x57386 VMULPD 0xa0(%R9,%RAX,1),%YMM4,%YMM6 |
0x112aae VFMADD132PD %YMM8,%YMM0,%YMM9 | 0x112ebf VFMADD132PD %YMM8,%YMM0,%YMM9 | 0x57390 VFMADD132PD %YMM8,%YMM0,%YMM9 |
0x112ab3 VMOVUPD 0x20(%RSI,%RAX,1),%YMM0 | 0x112ec4 VMOVUPD -0x45cc(%RIP),%YMM0 | 0x57395 VMOVUPD 0x20(%RSI,%RAX,1),%YMM0 |
0x112ab9 VMULPD 0x20(%R9,%RAX,1),%YMM0,%YMM10 | 0x112ecc VMULPD -0x45d4(%RIP),%YMM0,%YMM10 0x112ed4 NOP | 0x5739b VMULPD 0x20(%R9,%RAX,1),%YMM0,%YMM10 |
0x112ac0 VMULPD 0xc0(%R9,%RAX,1),%YMM5,%YMM7 | 0x112ed5 VMULPD -0x45dd(%RIP),%YMM5,%YMM7 0x112edd NOP | 0x573a2 VMULPD 0xc0(%R9,%RAX,1),%YMM5,%YMM7 |
0x112aca VFMADD132PD %YMM8,%YMM9,%YMM10 | 0x112ede VFMADD132PD %YMM8,%YMM9,%YMM10 | 0x573ac VFMADD132PD %YMM8,%YMM9,%YMM10 |
0x112acf VMOVUPD 0xe0(%RSI,%RAX,1),%YMM9 | 0x112ee3 VMOVUPD -0x45eb(%RIP),%YMM9 | 0x573b1 VMOVUPD 0xe0(%RSI,%RAX,1),%YMM9 |
0x112ad8 VMULPD 0xe0(%R9,%RAX,1),%YMM9,%YMM0 | 0x112eeb VMULPD -0x45f3(%RIP),%YMM9,%YMM0 0x112ef3 NOP | 0x573ba VMULPD 0xe0(%R9,%RAX,1),%YMM9,%YMM0 |
0x112ae2 ADD $0x100,%RAX | 0x112ef4 ADD $0x100,%RAX | 0x573c4 ADD $0x100,%RAX |
0x112ae8 VFMADD132PD %YMM8,%YMM10,%YMM12 | 0x112efa VFMADD132PD %YMM8,%YMM10,%YMM12 | 0x573ca VFMADD132PD %YMM8,%YMM10,%YMM12 |
0x112aed VFMADD132PD %YMM8,%YMM12,%YMM14 | 0x112eff VFMADD132PD %YMM8,%YMM12,%YMM14 | 0x573cf VFMADD132PD %YMM8,%YMM12,%YMM14 |
0x112af2 VFMADD132PD %YMM8,%YMM14,%YMM3 | 0x112f04 VFMADD132PD %YMM8,%YMM14,%YMM3 | 0x573d4 VFMADD132PD %YMM8,%YMM14,%YMM3 |
0x112af7 VFMADD132PD %YMM8,%YMM3,%YMM6 | 0x112f09 VFMADD132PD %YMM8,%YMM3,%YMM6 | 0x573d9 VFMADD132PD %YMM8,%YMM3,%YMM6 |
0x112afc VFMADD132PD %YMM8,%YMM6,%YMM7 | 0x112f0e VFMADD132PD %YMM8,%YMM6,%YMM7 | 0x573de VFMADD132PD %YMM8,%YMM6,%YMM7 |
0x112b01 VFMADD132PD %YMM8,%YMM7,%YMM0 | 0x112f13 VFMADD132PD %YMM8,%YMM7,%YMM0 | 0x573e3 VFMADD132PD %YMM8,%YMM7,%YMM0 |
0x112b06 CMP %RAX,%R14 | 0x112f18 CMP %RAX,%R14 | 0x573e8 CMP %RAX,%R14 |
0x112b09 JNE 112a52 <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl1EEEEJNS2_3ForILl2ENS_6policy4loop9loop_execEJNS2_6LambdaILl0EJEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSF_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSR_EESR_EENSN_INSP_INSQ_5GroupElPSV_EESV_EENSN_INSP_INSQ_4ZoneElPSZ_EESZ_EEEEENSL_IJEEEJZNK14PopulationSdomclINSQ_11ArchLayoutTINSQ_12ArchT_OpenMPENSQ_11LayoutT_DGZEEEEEvT_NSQ_6SdomIdERKNSQ_4Core3SetES1G_S1G_RNS1D_5FieldIdJSR_SV_SZ_EEERNS1H_IdJSR_EEERNS1H_IdJSZ_EEEPdEUlSR_SV_SZ_E_EEEEEvOS1B_._omp_fn.0+0xbb972> | 0x112f1b JNE 112e62 <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl1EEEEJNS2_3ForILl2ENS_6policy4loop9loop_execEJNS2_6LambdaILl0EJEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSF_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSR_EESR_EENSN_INSP_INSQ_5GroupElPSV_EESV_EENSN_INSP_INSQ_4ZoneElPSZ_EESZ_EEEEENSL_IJEEEJZNK14PopulationSdomclINSQ_11ArchLayoutTINSQ_12ArchT_OpenMPENSQ_11LayoutT_DGZEEEEEvT_NSQ_6SdomIdERKNSQ_4Core3SetES1G_S1G_RNS1D_5FieldIdJSR_SV_SZ_EEERNS1H_IdJSR_EEERNS1H_IdJSZ_EEEPdEUlSR_SV_SZ_E_EEEEEvOS1B_._omp_fn.0+0xbbd82> | 0x573eb JNE 5733c <_ZN4RAJA8internal17StatementExecutorINS_9statement8CollapseINS_26omp_parallel_collapse_execEN4camp7int_seqIlJLl0ELl1EEEEJNS2_3ForILl2ENS_6policy4loop9loop_execEJNS2_6LambdaILl0EJEEEEEEEEEE4execIRNS0_8LoopDataINS5_4listIJSF_EEENS5_5tupleIJNS_4impl4SpanINS_9Iterators16numeric_iteratorIN6Kripke9DirectionElPSR_EESR_EENSN_INSP_INSQ_5GroupElPSV_EESV_EENSN_INSP_INSQ_4ZoneElPSZ_EESZ_EEEEENSL_IJEEEJZNK14PopulationSdomclINSQ_11ArchLayoutTINSQ_12ArchT_OpenMPENSQ_11LayoutT_DGZEEEEEvT_NSQ_6SdomIdERKNSQ_4Core3SetES1G_S1G_RNS1D_5FieldIdJSR_SV_SZ_EEERNS1H_IdJSR_EEERNS1H_IdJSZ_EEEPdEUlSR_SV_SZ_E_EEEEEvOS1B_._omp_fn.0+0x25c> |
Metric | ORIG | DL1 | Original |
---|---|---|---|
FP operations per cycle L1 | 3.00, 3.00, | 3.00, 3.00, | 3.00, 3.00, |
cycles L1 CQA | 32.00 | 32.00 | 32.00 |
cycles UFS | 32.17 | 32.16 | 32.16 |
bytes loaded | 520.00 | 512.00 | 512.00 |
bytes stored | 8.00 | 0.00 | 0.00 |
nb loads | 17.00 | 16.00 | 16.00 |
nb stores | 1.00 | 0.00 | 0.00 |
cycles dispatch | 8.50 | 8.00 | 8.00 |
cycles front end | 9.00 | 8.50 | 8.50 |
cycles P0 | 8.00 | 8.00 | 8.00 |
cycles P1 | 8.00 | 8.00 | 8.00 |
cycles P2 | 8.50 | 8.00 | 8.00 |
cycles P3 | 8.50 | 8.00 | 8.00 |
cycles P4 | 1.00 | 0.00 | 0.00 |
cycles P5 | 1.50 | 1.00 | 1.00 |
cycles P6 | 1.50 | 1.00 | 1.00 |
cycles P7 | 1.00 | 0.00 | 0.00 |
stall cycles | 22.80 | 23.29 | 23.29 |
LB full | 24.78 | 24.78 | 23.79 |
LM full | 0.00 | 0.00 | 0.00 |
PRF full | 0.00 | 0.00 | 0.00 |
PRF_FLOAT full | 0.00 | 0.00 | 0.00 |
PRF_INT full | 0.00 | 0.00 | 0.00 |
ROB full | 0.00 | 0.00 | 0.00 |
RS full | 0.00 | 0.00 | 0.00 |
SB full | 0.00 | 0.00 | 0.00 |
nb uops | 28.00 | 34.00 | 26.00 |
uops P0 | 8.00 | 8.00 | 8.00 |
uops P1 | 8.00 | 8.00 | 8.00 |
uops P2 | 8.50 | 8.00 | 8.00 |
uops P3 | 8.50 | 8.00 | 8.00 |
uops P4 | 1.00 | 0.00 | 0.00 |
uops P5 | 1.50 | 1.00 | 1.00 |
uops P6 | 1.50 | 1.00 | 1.00 |
uops P7 | 1.00 | 0.00 | 0.00 |
ID | 535 | 537 | 525 |