Loop Id: 300 | Module: exec | Source: BsplineFunctor.h:246-260 | Coverage: 0.1% |
---|
Loop Id: 300 | Module: exec | Source: BsplineFunctor.h:246-260 | Coverage: 0.1% |
---|
0x41ec80 VMULPD (%R9,%RSI,8),%YMM18,%YMM2 [3] |
0x41ec87 VCVTTPD2DQ %YMM2,%XMM4 |
0x41ec8b KXNORW %K0,%K0,%K1 |
0x41ec8f VXORPD %XMM20,%XMM20,%XMM20 |
0x41ec95 VGATHERDPD (%RDX,%XMM4,8),%YMM20{%K1} [1] |
0x41ec9c KXNORW %K0,%K0,%K1 |
0x41eca0 VXORPD %XMM21,%XMM21,%XMM21 |
0x41eca6 VGATHERDPD 0x8(%RDX,%XMM4,8),%YMM21{%K1} [1] |
0x41ecae KXNORW %K0,%K0,%K1 |
0x41ecb2 VXORPD %XMM22,%XMM22,%XMM22 |
0x41ecb8 VGATHERDPD 0x10(%RDX,%XMM4,8),%YMM22{%K1} [1] |
0x41ecc0 VRNDSCALEPD $0xb,%YMM2,%YMM23 |
0x41ecc7 VSUBPD %YMM23,%YMM2,%YMM2 |
0x41eccd VPADDD %XMM0,%XMM4,%XMM4 |
0x41ecd1 KXNORW %K0,%K0,%K1 |
0x41ecd5 VXORPD %XMM23,%XMM23,%XMM23 |
0x41ecdb VGATHERDPD (%RDX,%XMM4,8),%YMM23{%K1} [2] |
0x41ece2 VMOVAPD %YMM2,%YMM4 |
0x41ece6 VFMADD213PD %YMM1,%YMM24,%YMM4 |
0x41ecec VFMADD213PD %YMM5,%YMM2,%YMM4 |
0x41ecf1 VFMADD213PD %YMM25,%YMM2,%YMM4 |
0x41ecf7 VFMADD213PD %YMM19,%YMM20,%YMM4 |
0x41ecfd VMOVAPD %YMM2,%YMM19 |
0x41ed03 VFMADD213PD %YMM11,%YMM17,%YMM19 |
0x41ed09 VFMADD213PD %YMM13,%YMM2,%YMM19 |
0x41ed0f VFMADD213PD %YMM14,%YMM2,%YMM19 |
0x41ed15 VFMADD213PD %YMM4,%YMM21,%YMM19 |
0x41ed1b VMOVAPD %YMM2,%YMM4 |
0x41ed1f VFMADD213PD %YMM3,%YMM6,%YMM4 |
0x41ed24 VFMADD213PD %YMM9,%YMM2,%YMM4 |
0x41ed29 VFMADD213PD %YMM8,%YMM2,%YMM4 |
0x41ed2e VFMADD213PD %YMM19,%YMM22,%YMM4 |
0x41ed34 VMOVAPD %YMM2,%YMM19 |
0x41ed3a VFMADD213PD %YMM7,%YMM12,%YMM19 |
0x41ed40 VFMADD213PD %YMM15,%YMM2,%YMM19 |
0x41ed46 VFMADD213PD %YMM16,%YMM2,%YMM19 |
0x41ed4c VFMADD213PD %YMM4,%YMM23,%YMM19 |
0x41ed52 ADD $0x4,%RSI |
0x41ed56 CMP %RAX,%RSI |
0x41ed59 JB 41ec80 |
/home/kcamus/qaas_runs/169-451-1869/intel/miniqmc/build/miniqmc/src/QMCWaveFunctions/Jastrow/BsplineFunctor.h: 246 - 260 |
-------------------------------------------------------------------------------- |
246: for (int jat = 0; jat < iCount; jat++) |
247: { |
248: real_type r = distArrayCompressed[jat]; |
249: r *= DeltaRInv; |
250: int i = (int)r; |
251: real_type t = r - real_type(i); |
252: real_type tp0 = t * t * t; |
253: real_type tp1 = t * t; |
254: real_type tp2 = t; |
255: |
256: real_type d1 = SplineCoefs[i + 0] * (A[0] * tp0 + A[1] * tp1 + A[2] * tp2 + A[3]); |
257: real_type d2 = SplineCoefs[i + 1] * (A[4] * tp0 + A[5] * tp1 + A[6] * tp2 + A[7]); |
258: real_type d3 = SplineCoefs[i + 2] * (A[8] * tp0 + A[9] * tp1 + A[10] * tp2 + A[11]); |
259: real_type d4 = SplineCoefs[i + 3] * (A[12] * tp0 + A[13] * tp1 + A[14] * tp2 + A[15]); |
260: d += (d1 + d2 + d3 + d4); |
Coverage (%) | Name | Source Location | Module |
---|---|---|---|
►85.71+ | miniqmcreference::TwoBodyJastr[...] | TwoBodyJastrowRef.h:130 | exec |
○ | qmcplusplus::WaveFunction::rat[...] | WaveFunction.cpp:219 | exec |
○ | main.extracted.104 | refwrap.h:347 | exec |
○ | __kmp_invoke_microtask | libiomp5.so | |
○ | __kmp_fork_call | libiomp5.so | |
○ | __kmpc_fork_call | libiomp5.so | |
○ | main | miniqmc.cpp:404 | exec |
○ | __libc_init_first | libc.so.6 | |
►14.29+ | miniqmcreference::OneBodyJastr[...] | OneBodyJastrowRef.h:150 | exec |
○ | qmcplusplus::WaveFunction::rat[...] | WaveFunction.cpp:219 | exec |
○ | main.extracted.104 | refwrap.h:347 | exec |
○ | __kmp_invoke_microtask | libiomp5.so | |
○ | __kmp_fork_call | libiomp5.so | |
○ | __kmpc_fork_call | libiomp5.so | |
○ | main | miniqmc.cpp:404 | exec |
○ | __libc_init_first | libc.so.6 |
Path / |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.68 |
CQA speedup if fully vectorized | 2.37 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.10 |
Bottlenecks | |
Function | qmcplusplus::BsplineFunctor |
Source | BsplineFunctor.h:246-260 |
Source loop unroll info | not unrolled or unrolled with no peel/tail loop |
Source loop unroll confidence level | max |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 16.00 |
CQA cycles if no scalar integer | 16.00 |
CQA cycles if FP arith vectorized | 9.50 |
CQA cycles if fully vectorized | 6.75 |
Front-end cycles | 13.50 |
DIV/SQRT cycles | 14.50 |
P0 cycles | 14.50 |
P1 cycles | 8.50 |
P2 cycles | 8.50 |
P3 cycles | 0.00 |
P4 cycles | 6.00 |
P5 cycles | 2.00 |
P6 cycles | 0.00 |
P7 cycles | 0.00 |
Inter-iter dependencies cycles | 16 |
FE+BE cycles (UFS) | 40.51 |
Stall cycles (UFS) | 28.46 |
Nb insns | 40.00 |
Nb uops | 53.00 |
Nb loads | 5.00 |
Nb stores | 0.00 |
Nb stack references | 0.00 |
FLOP/cycle | 8.50 |
Nb FLOP add-sub | 4.00 |
Nb FLOP mul | 4.00 |
Nb FLOP fma | 64.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 10.00 |
Bytes prefetched | 0.00 |
Bytes loaded | 160.00 |
Bytes stored | 0.00 |
Stride 0 | 0.00 |
Stride 1 | 1.00 |
Stride n | 0.00 |
Stride unknown | 0.00 |
Stride indirect | 1.00 |
Vectorization ratio all | 100.00 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | NA |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | 100.00 |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 100.00 |
Vector-efficiency ratio all | 46.21 |
Vector-efficiency ratio load | 50.00 |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | 50.00 |
Vector-efficiency ratio add_sub | 37.50 |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 42.86 |
Metric | Value |
---|---|
CQA speedup if no scalar integer | 1.00 |
CQA speedup if FP arith vectorized | 1.68 |
CQA speedup if fully vectorized | 2.37 |
CQA speedup if no inter-iteration dependency | NA |
CQA speedup if next bottleneck killed | 1.10 |
Bottlenecks | |
Function | qmcplusplus::BsplineFunctor |
Source | BsplineFunctor.h:246-260 |
Source loop unroll info | not unrolled or unrolled with no peel/tail loop |
Source loop unroll confidence level | max |
Unroll/vectorization loop type | NA |
Unroll factor | NA |
CQA cycles | 16.00 |
CQA cycles if no scalar integer | 16.00 |
CQA cycles if FP arith vectorized | 9.50 |
CQA cycles if fully vectorized | 6.75 |
Front-end cycles | 13.50 |
DIV/SQRT cycles | 14.50 |
P0 cycles | 14.50 |
P1 cycles | 8.50 |
P2 cycles | 8.50 |
P3 cycles | 0.00 |
P4 cycles | 6.00 |
P5 cycles | 2.00 |
P6 cycles | 0.00 |
P7 cycles | 0.00 |
Inter-iter dependencies cycles | 16 |
FE+BE cycles (UFS) | 40.51 |
Stall cycles (UFS) | 28.46 |
Nb insns | 40.00 |
Nb uops | 53.00 |
Nb loads | 5.00 |
Nb stores | 0.00 |
Nb stack references | 0.00 |
FLOP/cycle | 8.50 |
Nb FLOP add-sub | 4.00 |
Nb FLOP mul | 4.00 |
Nb FLOP fma | 64.00 |
Nb FLOP div | 0.00 |
Nb FLOP rcp | 0.00 |
Nb FLOP sqrt | 0.00 |
Nb FLOP rsqrt | 0.00 |
Bytes/cycle | 10.00 |
Bytes prefetched | 0.00 |
Bytes loaded | 160.00 |
Bytes stored | 0.00 |
Stride 0 | 0.00 |
Stride 1 | 1.00 |
Stride n | 0.00 |
Stride unknown | 0.00 |
Stride indirect | 1.00 |
Vectorization ratio all | 100.00 |
Vectorization ratio load | 100.00 |
Vectorization ratio store | NA |
Vectorization ratio mul | 100.00 |
Vectorization ratio add_sub | 100.00 |
Vectorization ratio fma | 100.00 |
Vectorization ratio div_sqrt | NA |
Vectorization ratio other | 100.00 |
Vector-efficiency ratio all | 46.21 |
Vector-efficiency ratio load | 50.00 |
Vector-efficiency ratio store | NA |
Vector-efficiency ratio mul | 50.00 |
Vector-efficiency ratio add_sub | 37.50 |
Vector-efficiency ratio fma | 50.00 |
Vector-efficiency ratio div_sqrt | NA |
Vector-efficiency ratio other | 42.86 |
Path / |
Function | qmcplusplus::BsplineFunctor |
Source file and lines | BsplineFunctor.h:246-260 |
Module | exec |
nb instructions | 40 |
nb uops | 53 |
loop length | 223 |
used x86 registers | 4 |
used mmx registers | 0 |
used xmm registers | 6 |
used ymm registers | 24 |
used zmm registers | 0 |
nb stack references | 0 |
ADD-SUB / MUL ratio | 1.00 |
micro-operation queue | 13.50 cycles |
front end | 13.50 cycles |
P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | |
---|---|---|---|---|---|---|---|---|
uops | 14.50 | 14.50 | 8.50 | 8.50 | 0.00 | 6.00 | 2.00 | 0.00 |
cycles | 14.50 | 14.50 | 8.50 | 8.50 | 0.00 | 6.00 | 2.00 | 0.00 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 16.00 |
FE+BE cycles | 40.51 |
Stall cycles | 28.46 |
ROB full (events) | 29.88 |
RS full (events) | 0.17 |
Front-end | 13.50 |
Dispatch | 14.50 |
Data deps. | 16.00 |
Overall L1 | 16.00 |
all | 100% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 100% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
all | 100% |
load | 100% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 100% |
fma | 100% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
all | 100% |
load | 100% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 100% |
fma | 100% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
all | 25% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 25% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
all | 46% |
load | 50% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 50% |
add-sub | 50% |
fma | 50% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 42% |
all | 46% |
load | 50% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 50% |
add-sub | 37% |
fma | 50% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 42% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
VMULPD (%R9,%RSI,8),%YMM18,%YMM2 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VCVTTPD2DQ %YMM2,%XMM4 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM20,%XMM20,%XMM20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD (%RDX,%XMM4,8),%YMM20{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM21,%XMM21,%XMM21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD 0x8(%RDX,%XMM4,8),%YMM21{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM22,%XMM22,%XMM22 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD 0x10(%RDX,%XMM4,8),%YMM22{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
VRNDSCALEPD $0xb,%YMM2,%YMM23 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 1 |
VSUBPD %YMM23,%YMM2,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VPADDD %XMM0,%XMM4,%XMM4 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 1 | 0.33 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM23,%XMM23,%XMM23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD (%RDX,%XMM4,8),%YMM23{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
VMOVAPD %YMM2,%YMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM1,%YMM24,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM5,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM25,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM19,%YMM20,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPD %YMM2,%YMM19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM11,%YMM17,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM13,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM14,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM4,%YMM21,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPD %YMM2,%YMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM3,%YMM6,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM9,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM8,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM19,%YMM22,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPD %YMM2,%YMM19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM7,%YMM12,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM15,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM16,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM4,%YMM23,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
ADD $0x4,%RSI | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
CMP %RAX,%RSI | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
JB 41ec80 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50-1 |
Function | qmcplusplus::BsplineFunctor |
Source file and lines | BsplineFunctor.h:246-260 |
Module | exec |
nb instructions | 40 |
nb uops | 53 |
loop length | 223 |
used x86 registers | 4 |
used mmx registers | 0 |
used xmm registers | 6 |
used ymm registers | 24 |
used zmm registers | 0 |
nb stack references | 0 |
ADD-SUB / MUL ratio | 1.00 |
micro-operation queue | 13.50 cycles |
front end | 13.50 cycles |
P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | |
---|---|---|---|---|---|---|---|---|
uops | 14.50 | 14.50 | 8.50 | 8.50 | 0.00 | 6.00 | 2.00 | 0.00 |
cycles | 14.50 | 14.50 | 8.50 | 8.50 | 0.00 | 6.00 | 2.00 | 0.00 |
Cycles executing div or sqrt instructions | NA |
Longest recurrence chain latency (RecMII) | 16.00 |
FE+BE cycles | 40.51 |
Stall cycles | 28.46 |
ROB full (events) | 29.88 |
RS full (events) | 0.17 |
Front-end | 13.50 |
Dispatch | 14.50 |
Data deps. | 16.00 |
Overall L1 | 16.00 |
all | 100% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 100% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
all | 100% |
load | 100% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 100% |
fma | 100% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
all | 100% |
load | 100% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 100% |
add-sub | 100% |
fma | 100% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 100% |
all | 25% |
load | NA (no load vectorizable/vectorized instructions) |
store | NA (no store vectorizable/vectorized instructions) |
mul | NA (no mul vectorizable/vectorized instructions) |
add-sub | 25% |
fma | NA (no fma vectorizable/vectorized instructions) |
other | NA (no other vectorizable/vectorized instructions) |
all | 46% |
load | 50% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 50% |
add-sub | 50% |
fma | 50% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 42% |
all | 46% |
load | 50% |
store | NA (no store vectorizable/vectorized instructions) |
mul | 50% |
add-sub | 37% |
fma | 50% |
div/sqrt | NA (no div/sqrt vectorizable/vectorized instructions) |
other | 42% |
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 | Latency | Recip. throughput |
---|---|---|---|---|---|---|---|---|---|---|---|
VMULPD (%R9,%RSI,8),%YMM18,%YMM2 | 1 | 0.50 | 0.50 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VCVTTPD2DQ %YMM2,%XMM4 | 2 | 0.50 | 0.50 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM20,%XMM20,%XMM20 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD (%RDX,%XMM4,8),%YMM20{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM21,%XMM21,%XMM21 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD 0x8(%RDX,%XMM4,8),%YMM21{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM22,%XMM22,%XMM22 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD 0x10(%RDX,%XMM4,8),%YMM22{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
VRNDSCALEPD $0xb,%YMM2,%YMM23 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 1 |
VSUBPD %YMM23,%YMM2,%YMM2 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VPADDD %XMM0,%XMM4,%XMM4 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 0 | 0 | 1 | 0.33 |
KXNORW %K0,%K0,%K1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
VXORPD %XMM23,%XMM23,%XMM23 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VGATHERDPD (%RDX,%XMM4,8),%YMM23{%K1} | 4 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 20 | 4 |
VMOVAPD %YMM2,%YMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM1,%YMM24,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM5,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM25,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM19,%YMM20,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPD %YMM2,%YMM19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM11,%YMM17,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM13,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM14,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM4,%YMM21,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPD %YMM2,%YMM4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM3,%YMM6,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM9,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM8,%YMM2,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM19,%YMM22,%YMM4 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VMOVAPD %YMM2,%YMM19 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.25 |
VFMADD213PD %YMM7,%YMM12,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM15,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM16,%YMM2,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
VFMADD213PD %YMM4,%YMM23,%YMM19 | 1 | 0.50 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0.50 |
ADD $0x4,%RSI | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
CMP %RAX,%RSI | 1 | 0.25 | 0.25 | 0 | 0 | 0 | 0.25 | 0.25 | 0 | 1 | 0.25 |
JB 41ec80 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0 | 0.50 | 0 | 0 | 0.50-1 |