The Significance of the x86 LFENCE Instruction | Micromysteries

Note: SFENCE is discussed in another blog post. This post is about LFENCE.

The x86 ISA currently offers three “fence” instructions: MFENCE, SFENCE, and LFENCE. Sometimes they are described as “memory fence” instructions. In some other architectures and in the literature on memory ordering models, terms such as memory fences, store fences, and load fences are used. The terms “memory fence” and “load fence” have not been used in the Intel Manual Volume 3, but they have been used a couple of times in the Intel Manual Volume 2 and in the AMD manuals. In this article, I’ll focus on “load fences”. Throughout, I’ll be referring to the latest Intel and AMD manuals at the time of writing.

The fact that the term “load fence” has been used in different ISAs, textbooks, and research papers has resulted in a critical misunderstanding of the x86 LFENCE instruction and confusion regarding what it does and how to use it. Calling it a “load fence” gives the impression that it serializes load operations. And since the x86 memory ordering model already guarantees that loads (those explicitly issued by the instructions being executed as specified in the ISA; see the comments for discussion) will not be reordered according to the observable behavior, it appears that LFENCE is rather useless. However, LFENCE is actually not a load fence in the traditional meaning of the term, even though it’s sometimes called that.

LFENCE was first introduced in the Pentium 4 processor in 2001 as part of the SSE2 instruction set extension and is supported by all later Intel x86 processors. It is also supported by the AMD Opteron and Athlon 64 processors and all later AMD x86 processors.

If you read the Intel manuals for processors that precede the Pentium 4 (the P5 and P6, but not the 486 and earlier, whose manuals did not explicitly use the term “serialization”), you’ll see that there might be a need to use serializing instructions or I/O instructions in certain situations. However, Intel decided to introduce the fence instructions (SFENCE in the Pentium III, and LFENCE and MFENCE in the Pentium 4) to provide some ordering guarantees without being fully serializing instructions, so they potentially have less impact on performance.

Intel Manual Volume 3 Section 8.2.5:

The SFENCE, LFENCE, and MFENCE instructions provide a performance-efficient way of ensuring load and store memory ordering between routines that produce weakly-ordered results and routines that consume that data.

Intel Manual Volume 3 Section 8.2.5:

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory ordering than the CPUID instruction.

One important phrase here is “weakly-ordered results”. But what does it mean? Is the x86 memory model a binary thing (weak vs. strong) or is it more complicated than that? Unfortunately, Intel did not define precisely what that phrase means. Anyway, although it’s very important and relevant to the LFENCE instruction, I’ll not directly address this issue in this article.

The fence instructions including LFENCE can be executed at any privilege level, in any operating mode, and in any architectural state. Their behavior is the same in all Intel processors that support them, except for LFENCE in AMD processors as I’ll explain later.

Intel Manual Volume 3 Section 8.2.5:

LFENCE — Serializes all load (read) operations that occurred prior to the LFENCE instruction in the program instruction stream, but does not affect store operations.

This sounds like the definition of a load fence, but there is more in the footnote in the same section and also repeated in Volume 2 (the sentences are numbered by me).

(1) Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. (2) As a result, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (3) An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible. (4) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

This part indicates that LFENCE is more than just a load fence, but it’s a little vague. What does “completed locally” mean? Well, the second sentence says that LFENCE does not complete until all previous loads in program order have received their data. This means that other agents in the system might be able to determine that the logical processor has read a particular memory location, in case it was globally updated (the update has reached the coherence domain) at least once by an agent other than the logical processor. This is possible because the logical processor may produce different results based on the fetched data. But can we say that the other agents can definitely observe these loads? Yes, we can. That’s because Intel uses the terms “retire” and “complete” interchangeably throughout the manual. When an instruction retires, all of its side effects are either globally visible or will become globally visible at some later time. In other words, the logical processor cannot retract a retired instruction, but it might be able to retire other subsequent instructions that override its effects before they become globally visible. The third sentence clarifies this situation. Writes to memory from retired instructions may not yet be globally visible even though they are visible to the logical processor that retired the instructions. Such writes can be held in buffers known as write buffers and are only guaranteed to become globally visible when they leave these buffers. Note that this behavior is only what the ISA specifies; the implementation can actually be such that retirement means global visibility.

(Note how Intel uses the terms “load from memory” and “load” interchangeably. It’s important to be aware of that when reading the Intel manuals to avoid confusion.)

The first and fourth sentences and the following quote show that LFENCE has some serialization properties (in contrast to SFENCE and MFENCE, which are purely memory fences).

Intel Manual Volume 3 Section 8.3:

LFENCE does provide some guarantees on instruction ordering. It does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.

In particular, LFENCE does not prevent the processor from fetching and decoding instructions, but it does prevent the processor from executing (dispatching) almost any later instructions (I’ll discuss the exceptions later) until LFENCE itself retires, which happens only when all prior instructions retire. This means that later memory loads and stores will not get issued until all earlier instructions retire. This applies to memory accesses from memory regions of all types.

But doesn’t that make it a serializing instruction? Not really. The two most important differences are the following:

  • LFENCE does not prevent the processor from fetching and decoding later instructions.

  • LFENCE does not ensure that earlier memory writes become globally visible (in contrast to SFENCE, MFENCE, and serializing instructions).

Again, this is important: LFENCE is not guaranteed to be a serializing instruction.

The Intel manual shows, but does not explain, three use cases of the LFENCE instruction. I’ll discuss them here.

Intel Manual Volume 2:

The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. The following items may guide software seeking to order executions of RDTSC:
· If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads are globally visible, it can execute LFENCE immediately before RDTSC.
· If software requires RDTSC to be executed only after all previous instructions have executed and all previous loads and stores are globally visible, it can execute the sequence MFENCE;LFENCE immediately before RDTSC.
· If software requires RDTSC to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute the sequence LFENCE immediately after RDTSC. This instruction was introduced by the Pentium processor.

The RDTSC instruction can be used to measure execution time. However, it’s not a serializing instruction, so it might be executed speculatively and out-of-order with respect to other instructions, which may jeopardize the accuracy of the measurement. Let’s discuss each of the three points from the manual.

Consider the following instruction sequence (in program order):

    inst1
    load1
    inst2
    store1
    LFENCE
    RDTSC
    inst3
    load2
    store2

What’s the impact of LFENCE at that location? Well, LFENCE will allow the processor to fetch and decode all later instructions, including RDTSC, but not execute any of them until LFENCE and all earlier instructions retire (though this does not necessarily mean that all earlier stores have become globally visible). This means the value captured by RDTSC will account for the execution of all earlier instructions. However, it may also include the execution time of later instructions, because there are no guarantees regarding the order in which instructions after RDTSC will execute (including other RDTSCs, which may lead to non-monotonic measurements on the same logical processor).

    load1
    inst2
    store1
    MFENCE
    LFENCE
    RDTSC
    ...

MFENCE here enables RDTSC to capture the time it takes to make all stores globally visible. This is the only difference that it makes in this code. Now consider this.

    load1
    inst2
    store1
    LFENCE
    MFENCE
    RDTSC
    ...

Interesting, right? I changed the order of LFENCE and MFENCE. What do you think? If you already know what MFENCE does, then you should be able to figure it out by yourself. Otherwise, you can just skip it.

The sequence LFENCE;RDTSC enabled us to order RDTSC with respect to all previous instructions (with a few exceptions discussed later). We can also do something similar so that RDTSC is ordered with respect to all later instructions (with a few exceptions discussed later). It’s not hard to see that this can be achieved using the RDTSC;LFENCE sequence. A standard-compliant, accurate (low-variance) measurement requires sandwiching RDTSC between two LFENCE instructions. It’s worth noting that multiple executions of RDTSC in the same software thread may still result in non-monotonic samples when the thread gets rescheduled to run on different logical processors.
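
To make this concrete, here is a minimal sketch of the LFENCE;RDTSC;LFENCE sandwich as a GCC-style inline-assembly helper (the helper name and the exact asm formulation are mine, not taken from the manuals):

static inline unsigned long long rdtsc_ordered(void)
{
    unsigned int lo, hi;
    asm volatile ("lfence\n\t"  /* RDTSC won't execute until all earlier instructions retire */
                  "rdtsc\n\t"   /* read the time-stamp counter into EDX:EAX */
                  "lfence"      /* later instructions won't start executing until RDTSC completes */
                  : "=a" (lo), "=d" (hi)
                  :
                  : "memory");
    return ((unsigned long long)hi << 32) | lo;
}

The "memory" clobber also keeps the compiler from moving memory accesses across the measurement point. As discussed above, the first LFENCE still does not force earlier stores to become globally visible; prepend MFENCE if that is required.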

The Intel Manual Volume 2 also shows how LFENCE can be useful when using RDTSCP. It’s basically the same thing, so I’ll just skip it.

The third example is shown in the following quote.

Intel Manual Volume 3 Section 10.12.3:

To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers. Thus, system software should not use “WRMSR to APIC registers in x2APIC mode” as a serializing instruction. Read and write accesses to the APIC registers will occur in program order. A WRMSR to an APIC register may complete before all preceding stores are globally visible; software can prevent this by inserting a serializing instruction or the sequence MFENCE;LFENCE before the WRMSR.

The sequence MFENCE;LFENCE has the same effect as before and is used for a similar purpose.
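
As an illustration only (a kernel-mode sketch of my own, not taken from the manual or any real kernel; 0x830 is the x2APIC ICR MSR index), the fencing would look something like this:

static inline void x2apic_icr_write_fenced(unsigned int lo, unsigned int hi)
{
    /* MFENCE does not complete until all earlier stores are globally visible,
     * and LFENCE keeps the WRMSR below from being dispatched until the MFENCE
     * (and everything before it) has completed. */
    asm volatile ("mfence; lfence" ::: "memory");
    /* WRMSR to an x2APIC register in x2APIC mode (requires CPL 0). */
    asm volatile ("wrmsr" : : "c" (0x830), "a" (lo), "d" (hi));
}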

Now something very important from the AMD manual.

AMD Manual Volume 3:

(1) LFENCE acts as a barrier to force strong memory ordering (serialization) between load instructions preceding the LFENCE and load instructions that follow the LFENCE. (2) Loads from differing memory types may be performed out of order, in particular between WC/WC+ and other memory types. (3) The LFENCE instruction assures that the system completes all previous loads before executing subsequent loads.

AMD has always described its implementation of LFENCE in its manuals as a load-serializing instruction. The original use case for LFENCE was ordering loads from the WC memory type: WC loads may be performed out of order with respect to all other loads (see the sketch after the quotes below). This is also mentioned (although difficult to find) in the Intel manual.

Intel Manual Volume 3 Section 8.1.2.2:

Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

Intel Manual Volume 3 Section 11.3.1:

The WC memory type is weakly ordered by definition.

Even though it says “weakly ordered memory types”, there is currently only one weakly ordered memory type, namely WC.
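
As a sketch of that original use case (the buffer, its WC mapping, and the helper name are all hypothetical; nothing here comes from the manuals), LFENCE can be used to keep two WC loads from being performed out of order:

static inline unsigned int read_two_wc_words(volatile unsigned int *wc_buf)
{
    unsigned int a = wc_buf[0];            /* load from WC memory */
    asm volatile ("lfence" ::: "memory");  /* the first load is performed before the second */
    unsigned int b = wc_buf[1];            /* this WC load cannot be performed early */
    return a ^ b;
}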

However, after the speculative execution vulnerabilities were discovered, AMD released a document in January 2018 entitled “Software techniques for managing speculation on AMD processors” that discusses when and how to make LFENCE behave similar to Intel’s LFENCE (giving it stronger serializing properties). This is what they say in that document.

Description: Set an MSR in the processor so that LFENCE is a dispatch serializing instruction and then use LFENCE in code streams to serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch serializing). This mode of LFENCE may be enabled by setting MSR C001_1029[1]=1.
Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch will stop until the LFENCE instruction becomes the oldest instruction in the machine.
Applicability: All AMD family 10h/12h/14h/15h/16h/17h processors support this MSR. LFENCE support is indicated by CPUID function1 EDX bit 26, SSE2. AMD family 0Fh/11h processors support LFENCE as serializing always but do not support this MSR. AMD plans support for this MSR and access to this bit for all future processors.

This is the first and only document in which MSR C001_1029[1] is mentioned (other bits of C001_1029 are discussed in some AMD documents, but not bit 1). When C001_1029[1] is set to 1, LFENCE behaves as a dispatch serializing instruction (which is more expensive than merely load serializing). Since this MSR is available on most older AMD processors, it seems that it has almost always been supported, maybe because AMD thought they might need it in the future to maintain compatibility with Intel processors regarding the behavior of LFENCE.

One thing not clear to me is the part regarding AMD families 0Fh and 11h processors. That statement is vague because it doesn’t clearly say whether LFENCE on AMD families 0Fh and 11h is fully serializing (in AMD terminology) or dispatch serializing (in AMD terminology). To be safe, it should be interpreted as dispatch serializing only. The AMD family-specific manuals don’t mention LFENCE or MSR C001_1029.

There are exceptions to the ordering rules of fence instructions and serializing instructions and instructions that have serializing properties. These exceptions are subtly different between Intel and AMD processors. One example is the CLFLUSH instruction. So AMD and Intel mean slightly different things when they talk about instructions with serializing properties.

The last sentence of the quote indicates that MSR C001_1029[1] is not part of the AMD x86 architecture.

LFENCE with this serializing behavior has been used on both Intel and AMD processors to control speculative execution as a mitigation for the Spectre vulnerabilities.
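
For example, here is a minimal sketch (my own, in the general style of published Spectre v1 mitigations; the array and index names are hypothetical) of LFENCE used as a speculation barrier after a bounds check:

unsigned char load_checked(const unsigned char *array, unsigned long size,
                           unsigned long untrusted_index)
{
    if (untrusted_index < size) {
        /* On CPUs where LFENCE is dispatch serializing, no later instruction
         * begins execution until the branch above has been resolved and
         * retired, so the load below cannot execute speculatively with an
         * out-of-bounds index. */
        asm volatile ("lfence" ::: "memory");
        return array[untrusted_index];
    }
    return 0;
}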

Now I’ll discuss the weak-ordering characteristics of LFENCE on both Intel and AMD processors.

First, as already discussed, LFENCE does not prevent the processor from fetching and decoding instructions, but only from dispatching instructions. This means that LFENCE is concurrent with instruction fetches.

LFENCE is not ordered with SFENCE, the global visibility of earlier writes, software prefetch instructions, hardware prefetching, or page table walks, as specified in the following quotes and in other locations in the manuals. This means that LFENCE is concurrent with these operations.

Intel Manual Volume 2:

Processors are free to fetch and cache data speculatively from regions of system memory that use the WB, WC, and WT memory types. This speculative fetching can occur at any time and is not tied to instruction execution. Thus, it is not ordered with respect to executions of the LFENCE instruction; data can be brought into the caches speculatively just before, during, or after the execution of an LFENCE instruction.

AMD Manual Volume 3:

The LFENCE instruction is weakly-ordered with respect to store instructions, data and instruction prefetches, and the SFENCE instruction. Speculative loads initiated by the processor, or specified explicitly using cache-prefetch instructions, can be reordered around an LFENCE.

In the following quote from the Intel manual, it says that writes, CLFLUSH, and CLFLUSHOPT cannot pass earlier LFENCE. We already know this about the writes.

Intel Manual Volume 3 Section 8.2.2:

Writes and executions of CLFLUSH and CLFLUSHOPT cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.

This quote alone might give the impression that CLFLUSH and CLFLUSHOPT might pass a later LFENCE. However, other places in the manual specify that they are fully ordered with LFENCE.

Intel Manual Volume 2:

Executions of the CLFLUSH instruction are ordered with respect to each other and with respect to writes, locked read-modify-write instructions, fence instructions, and executions of CLFLUSHOPT to the same cache line. They are not ordered with respect to executions of CLFLUSHOPT to different cache lines.

Intel Manual Volume 2:

Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; they are also ordered with respect to the following accesses to the cache line being invalidated: writes, executions of CLFLUSH, and executions of CLFLUSHOPT. They are not ordered with respect to writes, executions of CLFLUSH, or executions of CLFLUSHOPT that access other cache lines; to enforce ordering with such an operation, software can insert an SFENCE instruction between CFLUSHOPT and that operation.

(Note the typo “CFLUSHOPT” at the end of the second quote. LOL. I demand an explanation.)

AMD is slightly different on the ordering between LFENCE and CLFLUSH and CLFLUSHOPT. In particular, on AMD processors, LFENCE is not ordered with respect to CLFLUSH.

AMD Manual Volume 3:

The LFENCE, SFENCE, and serializing instructions are not ordered with respect to CLFLUSH.

AMD Manual Volume 3:

The CLFLUSHOPT instruction is ordered with respect to fence instructions and locked operations.

The MONITOR instruction is treated as a demand WB load operation.

Intel Manual Volume 2:

The MONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction is subject to the permission checking and faults associated with a byte load. Like a load, MONITOR sets the A-bit but not the D-bit in page tables.

Note that Intel uses the terms “memory transactions” and “memory operations” interchangeably. I didn’t find any such statement in the AMD manual, but I expect that it’s treated the same way there too.

By the way, all of these ordering rules regarding LFENCE are important. Don’t get the false impression that they are arbitrary or irrelevant. In the future, I might write more about this. Also I might write similar articles for the other fence instructions and other related instructions.

LFENCE in the Linux kernel

Linux defines a list of all x86 CPU features in /arch/x86/include/asm/cpufeatures.h. The X86_FEATURE_LFENCE_RDTSC flag represents support for a dispatch serializing LFENCE. The X86_FEATURE_XMM2 flag represents support for SSE2. On Intel processors, X86_FEATURE_XMM2 implies X86_FEATURE_LFENCE_RDTSC. On AMD processors, X86_FEATURE_LFENCE_RDTSC requires X86_FEATURE_XMM2 and an extra check needs to be performed. On AMD processors that support SSE2, there are three cases to be considered:

  1. MSR C001_1029[1] must be set to enable the dispatch serializing behavior of LFENCE. This can be done only if the MSR is supported.

  2. The MSR is not supported (AMD 0Fh/11h). LFENCE is by default at least dispatch serializing. Nothing needs to be done.

  3. The MSR is supported, but we are running under a hypervisor that does not support writing that MSR (because perhaps the hypervisor has not been updated yet). In this case, resort to the slower MFENCE (which is fully serializing on AMD processors and represented by the X86_FEATURE_MFENCE_RDTSC macro) for serializing RDTSC and use a Spectre mitigation that does not require LFENCE (i.e., generic retpoline).

static void init_amd(struct cpuinfo_x86 *c)
{
	...
	if (cpu_has(c, X86_FEATURE_XMM2)) {
		unsigned long long val;
		int ret;

		/*
		 * A serializing LFENCE has less overhead than MFENCE, so
		 * use it for execution serialization.  On families which
		 * don't have that MSR, LFENCE is already serializing.
		 * msr_set_bit() uses the safe accessors, too, even if the MSR
		 * is not present.
		 */
		msr_set_bit(MSR_F10H_DECFG, MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT);

		/*
		 * Verify that the MSR write was successful (could be running
		 * under a hypervisor) and only then assume that LFENCE is
		 * serializing.
		 */
		ret = rdmsrl_safe(MSR_F10H_DECFG, &val);
		if (!ret && (val & MSR_F10H_DECFG_LFENCE_SERIALIZE)) {
			/* A serializing LFENCE stops RDTSC speculation */
			set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
		} else {
			/* MFENCE stops RDTSC speculation */
			set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
		}
	}
	...
}

The following relevant macros are defined in msr-index.h and they are only used on AMD processors:

#define MSR_F10H_DECFG				0xc0011029
#define MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT	1

X86_FEATURE_LFENCE_RDTSC is used elsewhere in the code to choose a Spectre mitigation.

Linux only uses LFENCE when it has the dispatch serializing properties. If Linux is running on an Intel or AMD processor that doesn’t support SSE2 (typically, 32-bit processors),  it resorts to the lock prefix.

The Linux kernel defines the x86 memory fences that it uses in /arch/x86/include/asm/barrier.h as follows:

/*
 * Force strict CPU ordering.
 * And yes, this might be required on UP too when we're talking
 * to devices.
 */
#ifdef CONFIG_X86_32
#define mb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "mfence", \
				      X86_FEATURE_XMM2) ::: "memory", "cc")
#define rmb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "lfence", \
				       X86_FEATURE_XMM2) ::: "memory", "cc")
#define wmb() asm volatile(ALTERNATIVE("lock; addl $0,-4(%%esp)", "sfence", \
				       X86_FEATURE_XMM2) ::: "memory", "cc")
#else
#define mb()	asm volatile("mfence":::"memory")
#define rmb()	asm volatile("lfence":::"memory")
#define wmb()	asm volatile("sfence" ::: "memory")
#endif

On 32-bit x86 CPUs that don’t support LFENCE, Linux resorts to the lock prefix (which is supported by all Intel and AMD x86 processors since the Intel 8086). On both Intel and AMD processors, X86_FEATURE_XMM2 represents support for SSE2, but not necessarily a dispatch serializing LFENCE. The rmb barrier is used many times in the kernel. Note how volatile and the “memory” clobber are used in all the barrier definitions so that they act as compiler barriers as well.

Potential Uses of LFENCE

There are a few ways to use LFENCE to improve performance and potentially reduce the energy consumption and heat emission of the CPU.

Consider the following C code.

#include <unistd.h>  // sleep
#include <pthread.h> // pthread

unsigned var = 0;

void *writer(void *unused) {
    sleep(2); // let the reader loop for a while.
    var = 1;
    asm volatile ("sfence" ::: "memory");
    return NULL;
}

void *reader(void *unused) {
    while (var == 0) { } // spinwait
    return NULL;
}

int main(void) {
    pthread_t thread1, thread2;
    void *status;

    pthread_create(&thread1, NULL, reader, NULL);
    pthread_create(&thread2, NULL, writer, NULL);

    // wait for the threads to terminate.
    pthread_join(thread2, &status);
    pthread_join(thread1, &status);
    return 0;
}

I know that this is not 100% language-compliant code, but it serves the purpose of demonstrating a use case of LFENCE.

In this code, there are three threads: the main thread, a reader thread, and a writer thread. The writer thread sleeps for 2 seconds and then writes to a shared variable. I’ve used SFENCE at the end of the writer thread to force the write to become globally visible so that the reader thread can see it. The reader thread simply iterates in an empty loop until the writer thread updates the shared variable. Compile the code using the command `gcc main.c -pthread` and run the generated executable. Sure enough, after about 2 seconds, all threads terminate.

The assembly code of the reader loop looks like this:

.L4:
    movl var(%rip), %eax
    testl %eax, %eax
    je .L4

Let’s measure a couple of important hardware performance counters using a command similar to this one: `perf stat -r 5 -e r1D1,r10E,r1C2,r0C0 ./a.out`. You should specify the raw events supported by your CPU. On my Intel Haswell processor, r1D1 represents the event of hitting the L1 data cache (MEM_LOAD_UOPS_RETIRED.L1_HIT), r10E represents the event of issuing (in Intel terminology) any uop (UOPS_ISSUED.ANY), r1C2 represents the event of retiring any uop (UOPS_RETIRED.ALL), and r0C0 represents the event of retiring any instruction (INST_RETIRED.ANY_P). If your CPU supports hyperthreading, disable it. On my system, I got the following results:

 Performance counter stats for './a.out' (5 runs):

     7,37,90,18,318      r1D1                    ( +- 0.25% )
    14,76,48,32,136      r10E                    ( +- 0.25% )
    14,76,35,29,766      r1C2                    ( +- 0.25% )
    22,13,74,78,876      r0C0                    ( +- 0.25% )

       2.000678814 seconds time elapsed          ( +- 0.00% )

Most of the executed instructions would come from the reader loop. There are three instructions in the reader loop. The first one is translated to a single uop and the other two get translated to a single fused uop. Therefore, the number of retired instructions should be around 50% larger than the number of retired uops (3 instructions versus 2 uops per iteration). The number of issued uops should be close to the number of retired uops because there is very little branch misprediction and because the number of uops is the same in the fused (UOPS_ISSUED.ANY) and unfused domain (UOPS_RETIRED.ALL). The number of L1 data cache hits is close to the number of iterations of the reader loop. It’s nice to see that the standard deviation is very low.
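
As a quick sanity check (the counts are printed with Indian-style digit grouping, so 7,37,90,18,318 is about 7.38 billion): 22,13,74,78,876 / 7,37,90,18,318 ≈ 3.0 retired instructions per L1 hit, and 14,76,35,29,766 / 7,37,90,18,318 ≈ 2.0 retired uops per L1 hit, which matches the 3-instruction, 2-uop loop body.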

Now let’s use LFENCE inside the loop.

void *reader(void *unused) {
    while (var == 0) // efficient spinwait
    {
        asm volatile ("lfence" ::: "memory");
    }
    return NULL;
}

The assembly code of the reader loop looks like this:

.L5:
    lfence
    movl var(%rip), %eax
    testl %eax, %eax
    je .L5

LFENCE prevents the logical processor from issuing instances of instructions that belong to later iterations of the loop until the value of the memory load of the current iteration has been determined. This basically has the effect of slowing down the loop, but in an intelligent manner: there is no point in rapidly issuing load requests. Compile the code using the command `gcc main.c -pthread` and use perf on the generated executable.

 Performance counter stats for './a.out' (5 runs):

       46,53,77,643      r1D1                    ( +- 0.25% )
     3,72,30,83,120      r10E                    ( +- 0.25% )
     3,72,28,10,960      r1C2                    ( +- 0.25% )
     1,86,11,02,392      r0C0                    ( +- 0.25% )

       2.000734895 seconds time elapsed          ( +- 0.00% )

Nice! The number of load requests (which is very close to the number of iterations of the reader loop) has been reduced by more than 10x. The number of retired/issued uops has been reduced by about 4x. The number of retired instructions is now much smaller than the number of retired uops. The execution time, however, has increased only marginally, by roughly 56 microseconds across these runs.

Just like before, and as expected, the number of retired instructions is about 4 times the number of L1 data cache hits (number of iterations). It seems that LFENCE is made up of 5 uops on Haswell.

This technique is particularly useful when hyperthreading is enabled. LFENCE prevents the reader from unnecessarily consuming execution resources, making them available more often to the other threads. However, the PAUSE instruction might be more suitable for that purpose.
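
For comparison, here is the same spin-wait written with PAUSE instead (a sketch reusing the `var` flag from the program above): PAUSE hints to the processor that this is a spin-wait loop, inserting a short delay and reducing the penalty when the loop finally exits.

void *reader_pause(void *unused)
{
    while (var == 0) {
        asm volatile ("pause" ::: "memory"); /* spin-wait hint; same as _mm_pause() */
    }
    return NULL;
}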

Finally, I’ll discuss how LFENCE can be used to control speculative execution using the following program:

#include <stdlib.h>

int main(void) {
    register unsigned count = 100000000;
    while (count > 0) {
        if (rand() % 2 == 0) { --count; }
        //asm volatile ("lfence" ::: "memory");
    }
    return 0;
}

I’m using the `rand` function to generate random values used to evaluate the condition of a branch. The goal here is to basically defeat the branch predictor no matter how sophisticated it is or how it works. The random number generator has not been seeded to make sure that all runs exhibit the same branching decisions.

Let’s use perf to measure the number of issued uops, retired uops, and retired instructions.

 Performance counter stats for './a.out' (5 runs):

    18,17,68,22,733      r10E                    ( +- 0.06% )
    15,98,76,66,390      r1C2                    ( +- 0.01% )
    10,89,74,98,843      r0C0                    ( +- 0.01% )

       2.221222619 seconds time elapsed          ( +- 0.26% )

The fact that the number of issued uops (r10E, counted in the fused domain) is larger than the number of retired uops (r1C2, counted in the unfused domain) indicates that the CPU experienced a lot of branch mispredictions: uops issued along mispredicted paths are cancelled and never retire. By using LFENCE in the loop (uncommenting the asm statement in the code above), speculative execution of later iterations can be eliminated, although LFENCE does not prevent the CPU from speculatively fetching instructions.

 Performance counter stats for './a.out' (5 runs):

    15,43,65,94,751      r10E                    ( +- 0.00% )
    17,29,00,37,902      r1C2                    ( +- 0.00% )
    11,09,63,98,795      r0C0                    ( +- 0.00% )

       2.735851398 seconds time elapsed          ( +- 0.32% )

For more information on the impact of LFENCE on performance and on how it is implemented in Intel processors, refer to the following Stack Overflow post: Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.
