Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <stern@rowland.harvard.edu>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
  10. THE READS-FROM RELATION: rf, rfi, and rfe
  11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
  12. THE FROM-READS RELATION: fr, fri, and fre
  13. AN OPERATIONAL MODEL
  14. PROPAGATION ORDER RELATION: cumul-fence
  15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
  16. SEQUENTIAL CONSISTENCY PER VARIABLE
  17. ATOMIC UPDATES: rmw
  18. THE PRESERVED PROGRAM ORDER RELATION: ppo
  19. AND THEN THERE WAS ALPHA
  20. THE HAPPENS-BEFORE RELATION: hb
  21. THE PROPAGATES-BEFORE RELATION: pb
  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-fence, and rb
  23. LOCKING
  24. ODDS AND ENDS



INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure. This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.

This document describes the ideas underlying the LKMM. It is meant
for people who want to understand how the model was designed. It does
not go into the details of the code in the .bell and .cat files;
rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
the basic notions shared by all such models. People already familiar
with these concepts can skim or skip over them. Sections 6 (EVENTS)
through 12 (THE FROM-READS RELATION) describe the fundamental
relations used in many models. Starting in Section 13 (AN OPERATIONAL
MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the
proper format for litmus tests. They don't include a header line, the
initializations are not enclosed in braces, the global variables are
not passed by pointers, and they don't have an "exists" clause at the
end. Converting them to the right format is left as an exercise for
the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is
something which predicts, given a piece of computer code running on a
particular kind of system, what values may be obtained by the code's
load instructions. The LKMM makes these predictions for code running
as part of the Linux kernel.

In practice, people tend to use memory models the other way around.
That is, given a piece of code and a collection of values specified
for the loads, the model will predict whether it is possible for the
code to run in such a way that the loads will indeed obtain the
specified values. Of course, this is just another way of expressing
the same idea.

For code running on a uniprocessor system, the predictions are easy:
Each load instruction must obtain the value written by the most recent
store instruction accessing the same location (we ignore complicating
factors such as DMA and mixed-size accesses). But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux
kernel supports a variety of architectures. The LKMM has to be fairly
permissive, in the sense that any behavior allowed by one of these
architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts. Consider
some code running as part of a device driver for an input device. The
driver might contain an interrupt handler which collects data from the
device, stores it in a buffer, and sets a flag to indicate the buffer
is full. Running concurrently on a different CPU might be a part of
the driver code being executed by a process in the midst of a read(2)
system call. This code tests the flag to see whether the buffer is
ready, and if it is, copies the data back to userspace. The buffer
and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
assignment statements is discussed later):

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}

Here the P0() function represents the interrupt handler running on one
CPU and P1() represents the read() routine running on another. The
value 1 stored in buf represents input data collected from the device.
Thus, P0 stores the data in buf and then sets flag. Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
data from buf into a second private variable r2 for copying to
userspace. (Presumably if flag is not set then the driver will wait a
while and try again.)

This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" or MP
pattern. It is typical of memory access patterns in the kernel.

Please note that this example code is a simplified abstraction. Real
buffers are usually larger than a single integer, real device drivers
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
private variables before using them. All that is beside the point;
the idea here is simply to illustrate the overall pattern of memory
accesses by the CPUs.

A memory model will predict what values P1 might obtain for its loads
from flag and buf, or equivalently, what values r1 and r2 might end up
with after the code has finished running.

Some predictions are trivial. For instance, no sane memory model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple. For
instance, P1 might run entirely before P0 begins, in which case r1 and
r2 will both be 0 at the end. Or P0 might run entirely before P1
begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two
routines run concurrently. One possibility is that P1 runs after P0's
store to buf but before the store to flag. In this case, r1 and r2
will again both be 0. (If P1 had been designed to read buf
unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0.
If this were to occur it would mean the driver contains a bug, because
incorrect data would get sent to the user: 0 instead of 1. As it
happens, the LKMM does predict this outcome can occur, and the example
driver code shown above is indeed buggy.


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand,
is Sequential Consistency. According to this model, systems behave as
if each CPU executed its instructions in order but with unspecified
timing. In other words, the instructions from the various CPUs get
interleaved in a nondeterministic way, always according to some single
global order that agrees with the order of the instructions in the
program source for each CPU. The model says that the value obtained
by each load is simply the value written by the most recently executed
store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts
that the undesired result r1 = 1, r2 = 0 cannot occur. The reasoning
goes like this:

	Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from
	it, as loads can obtain values only from earlier stores.

	P1 loads from flag before loading from buf, since CPUs execute
	their instructions in order.

	P1 must load 0 from buf before P0 stores 1 to it; otherwise r2
	would be 1 since a load obtains its value from the most recent
	store to the same address.

	P0 stores 1 to buf before storing 1 to flag, since it executes
	its instructions in order.

	Since an instruction (in this case, P0's store to flag) cannot
	execute before itself, the specified outcome is impossible.
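
The same conclusion can be reached by brute force. The following is a
minimal sketch, not part of the LKMM tooling: it enumerates every
interleaving of the MP example that Sequential Consistency permits and
reports whether a given pair of values for r1 and r2 can result. The
function name mp_outcome_reachable_sc() and the bit-mask encoding of
schedules are illustrative assumptions.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: try every Sequentially Consistent interleaving
 * of the MP example and report whether r1 == want_r1 && r2 == want_r2
 * can be the final result.  Each CPU runs two steps; bit i of "mask"
 * says which CPU runs step i of the global order (1 means P1).
 */
bool mp_outcome_reachable_sc(int want_r1, int want_r2)
{
	for (unsigned int mask = 0; mask < 16; mask++) {
		int buf = 0, flag = 0;	/* shared locations */
		int r1 = 0, r2 = 0;	/* P1's private variables */
		int pc0 = 0, pc1 = 0;	/* per-CPU step counters */
		int ones = 0;

		/* Keep only schedules giving each CPU exactly two steps. */
		for (int i = 0; i < 4; i++)
			ones += (mask >> i) & 1;
		if (ones != 2)
			continue;

		for (int i = 0; i < 4; i++) {
			if ((mask >> i) & 1) {		/* P1 steps */
				if (pc1++ == 0)
					r1 = flag;
				else if (r1)
					r2 = buf;
			} else {			/* P0 steps */
				if (pc0++ == 0)
					buf = 1;
				else
					flag = 1;
			}
		}
		if (r1 == want_r1 && r2 == want_r2)
			return true;
	}
	return false;
}
```

Under these assumptions, mp_outcome_reachable_sc(1, 0) returns false
while the outcomes (0, 0) and (1, 1) are both reachable, matching the
reasoning above.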

However, real computer hardware almost never follows the Sequential
Consistency memory model; doing so would rule out too many valuable
performance optimizations. On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
r2 = 0.

x86 and SPARC follow yet a different memory model: TSO (Total Store
Ordering). This model predicts that the undesired outcome for the MP
pattern cannot occur, but in other respects it differs from Sequential
Consistency. One example is the Store Buffer (SB) pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		r1 = READ_ONCE(x);
	}

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is
impossible. (Exercise: Figure out the reasoning.) But TSO allows
this outcome to occur, and in fact it does sometimes occur on x86 and
SPARC systems.
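
To see where the difference comes from, here is a toy sketch of the SB
pattern in which each CPU's store first sits in a private store buffer
and reaches shared memory only when the buffer is flushed. It is a
deliberately simplified stand-in for TSO, not a formalization of it;
the event names (S0, L0, F0, ...) and the function
sb_both_zero_reachable() are assumptions made for illustration. With
buffered == false, each store must be flushed before the same CPU's
load, which corresponds to Sequential Consistency.

```c
#include <stdbool.h>

/* Store, load and buffer-flush events for P0 and P1. */
enum { S0, L0, F0, S1, L1, F1, NEVENTS };

struct sb_state {
	bool done[NEVENTS];
	int x, y;	/* shared memory */
	int r0, r1;	/* private variables */
};

/* May event e run next?  Loads and flushes follow their CPU's store. */
static bool enabled(const struct sb_state *s, int e, bool buffered)
{
	if (s->done[e])
		return false;
	switch (e) {
	case L0: return s->done[S0] && (buffered || s->done[F0]);
	case F0: return s->done[S0];
	case L1: return s->done[S1] && (buffered || s->done[F1]);
	case F1: return s->done[S1];
	default: return true;	/* S0 and S1 */
	}
}

/* Depth-first search over all schedules for r0 == 0 && r1 == 0. */
static bool search(struct sb_state s, bool buffered)
{
	bool any = false;

	for (int e = 0; e < NEVENTS; e++) {
		struct sb_state n = s;

		if (!enabled(&s, e, buffered))
			continue;
		any = true;
		n.done[e] = true;
		switch (e) {
		case L0: n.r0 = n.y; break;
		case L1: n.r1 = n.x; break;
		case F0: n.x = 1; break;	/* store becomes visible */
		case F1: n.y = 1; break;
		default: break;			/* store waits in buffer */
		}
		if (search(n, buffered))
			return true;
	}
	/* With nothing left to run, check the final values. */
	return !any && s.r0 == 0 && s.r1 == 0;
}

bool sb_both_zero_reachable(bool buffered)
{
	struct sb_state init = { .x = 0 };

	return search(init, buffered);
}
```

In this sketch the outcome r0 = 0, r1 = 0 is unreachable without store
buffers and reachable with them (for instance, when both loads run
before either flush).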

The LKMM was inspired by the memory models followed by PowerPC, ARM,
x86, Alpha, and other architectures. However, it is different in
detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering. Often this is temporal ordering
(i.e., the order in which certain events occur) but it doesn't have to
be; consider for example the order of instructions in a program's
source code. We saw above that Sequential Consistency makes an
important assumption that CPUs execute instructions in the same order
as those instructions occur in the code, and there are many other
instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle. Ordering rules out cycles:
It's not possible to have X ordered before Y, Y ordered before Z, and
Z ordered before X, because this would mean that X is ordered before
itself. The analysis of the MP example under Sequential Consistency
involved just such an impossible cycle:

	W: P0 stores 1 to flag   executes before
	X: P1 loads 1 from flag  executes before
	Y: P1 loads 0 from buf   executes before
	Z: P0 stores 1 to buf    executes before
	W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered,
and a certain outcome for the loads in a piece of code can happen only
if those accesses would form a cycle, then the memory model predicts
that outcome cannot occur.
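
As a rough illustration of this idea, the sketch below records the
"executes before" requirements of the MP analysis as edges in a small
graph and checks for a cycle by computing the transitive closure. The
function names has_cycle() and mp_ordering_has_cycle() are assumptions
chosen for this example; the formal model is expressed quite
differently.

```c
#include <stdbool.h>

#define MAX_EVENTS 4

/* Return true if the "ordered before" edges contain a cycle. */
bool has_cycle(int n, bool edge[MAX_EVENTS][MAX_EVENTS])
{
	bool reach[MAX_EVENTS][MAX_EVENTS];
	int i, j, k;

	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++)
			reach[i][j] = edge[i][j];

	/* Transitive closure: i reaches j if i reaches k and k reaches j. */
	for (k = 0; k < n; k++)
		for (i = 0; i < n; i++)
			for (j = 0; j < n; j++)
				if (reach[i][k] && reach[k][j])
					reach[i][j] = true;

	/* An event that reaches itself would be ordered before itself. */
	for (i = 0; i < n; i++)
		if (reach[i][i])
			return true;
	return false;
}

enum { W, X, Y, Z };	/* the four MP events, named as in the text */

/* Ordering graph for MP under SC with r1 = 1 and the given r2. */
bool mp_ordering_has_cycle(int r2)
{
	bool edge[MAX_EVENTS][MAX_EVENTS] = { { false } };

	edge[W][X] = true;	/* P1 loads 1 from flag after P0 stores it */
	edge[X][Y] = true;	/* P1 executes in order */
	edge[Z][W] = true;	/* P0 executes in order */
	if (r2 == 0)
		edge[Y][Z] = true;	/* load of 0 precedes the store of 1 */
	else
		edge[Z][Y] = true;	/* load of 1 follows the store of 1 */
	return has_cycle(MAX_EVENTS, edge);
}
```

Here mp_ordering_has_cycle(0) reports a cycle, so the outcome r1 = 1,
r2 = 0 is forbidden under Sequential Consistency, whereas
mp_ordering_has_cycle(1) does not.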

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up
kernel source code. Instead it considers the effects of those
statements in a more abstract form, namely, events. The model
includes three types of events:

	Read events correspond to loads from shared memory, such as
	calls to READ_ONCE(), smp_load_acquire(), or
	rcu_dereference().

	Write events correspond to stores to shared memory, such as
	calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

	Fence events correspond to memory barriers (also known as
	fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be
a fence. This happens with functions like smp_load_acquire() or
spin_lock(). However, no single event can be both a read and a write.
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
correspond to a pair of events: a read followed by a write. (The
write event is omitted for executions where it doesn't occur, such as
a cmpxchg() where the comparison fails.)

Other parts of the code, those which do not involve interaction with
shared memory, do not give rise to events. Thus, arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers are not of central interest to the
memory model. They only affect the model's predictions indirectly.
For example, an arithmetic computation might determine the value that
gets stored to a shared memory location (or in the case of an array
index, the address where the value gets stored), but the memory model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will
describe in the following sections. The memory model requires certain
of these relations to be orderings, that is, it requires them not to
have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po). You
can think of it as the order in which statements occur in the source
code after branches are taken into account and loops have been
unrolled. A better description might be the order in which
instructions are presented to a CPU's execution unit. Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po. Also, it is by definition
an ordering so it cannot have any cycles.

po-loc is a sub-relation of po. It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).
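
The two definitions can be written down almost literally. The sketch
below is only an illustration: the struct event layout, the enum loc
names and the helper make_event() are assumptions invented for this
example and do not correspond to anything in the .bell or .cat files.

```c
#include <stdbool.h>

/* Hypothetical location tags; LOC_NONE marks a non-memory event. */
enum loc { LOC_NONE, LOC_BUF, LOC_FLAG };

struct event {
	int cpu;	/* CPU that executes the event */
	int idx;	/* position in that CPU's instruction stream */
	enum loc loc;	/* location accessed, if any */
};

struct event make_event(int cpu, int idx, enum loc loc)
{
	struct event e = { .cpu = cpu, .idx = idx, .loc = loc };

	return e;
}

/* a ->po b: same CPU, and a comes earlier in the instruction stream. */
bool po(struct event a, struct event b)
{
	return a.cpu == b.cpu && a.idx < b.idx;
}

/* a ->po-loc b: po-ordered memory accesses to the same location. */
bool po_loc(struct event a, struct event b)
{
	return po(a, b) && a.loc != LOC_NONE && a.loc == b.loc;
}
```

For P0's two stores in the MP example, po() holds but po_loc() does
not, because buf and flag are different locations; events on different
CPUs are never related by either function.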

Although this may seem straightforward, there is one subtle aspect to
program order we need to explain. The LKMM was inspired by low-level
architectural memory models which describe the behavior of machine
code, and it retains their outlook to a considerable extent. The
read, write, and fence events used by the model are close in spirit to
individual machine instructions. Nevertheless, the LKMM describes
kernel code written in C, and the mapping from C to machine code can
be extremely complex.

Optimizing compilers have great freedom in the way they translate
source code to object code. They are allowed to apply transformations
that add memory accesses, eliminate accesses, combine them, split them
into pieces, or move them around. Faced with all these possibilities,
the LKMM basically gives up. It insists that the code it analyzes
must contain no ordinary accesses to shared memory; all accesses must
be performed using READ_ONCE(), WRITE_ONCE(), or one of the other
atomic or synchronization primitives. These primitives prevent a
large number of compiler optimizations. In particular, it is
guaranteed that the compiler will not remove such accesses from the
generated code (unless it can prove the accesses will never be
executed), it will not change the order in which they occur in the
code (within limits imposed by the C standard), and it will not
introduce extraneous accesses.

This explains why the MP and SB examples above used READ_ONCE() and
WRITE_ONCE() rather than ordinary memory accesses. Thanks to this
usage, we can be certain that in the MP example, P0's write event to
buf really is po-before its write event to flag, and similarly for the
other shared memory accesses in the examples.

Private variables are not subject to this restriction. Since they are
not shared between CPUs, they can be accessed normally without
READ_ONCE() or WRITE_ONCE(), and there will be no ill effects. In
fact, they need not even be stored in normal memory at all -- in
principle a private variable could be stored in a CPU register (hence
the convention that these variables have names starting with the
letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect, and under some circumstances it is possible for the
compiler to undermine the memory model. Here is an example. Suppose
both branches of an "if" statement store the same value to the same
location:

	r1 = READ_ONCE(x);
	if (r1) {
		WRITE_ONCE(y, 2);
		...  /* do something */
	} else {
		WRITE_ONCE(y, 2);
		...  /* do something else */
	}

For this code, the LKMM predicts that the load from x will always be
executed before either of the stores to y. However, a compiler could
lift the stores out of the conditional, transforming the code into
something resembling:

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, 2);
	if (r1) {
		...  /* do something */
	} else {
		...  /* do something else */
	}

Given this version of the code, the LKMM would predict that the load
from x could be executed after the store to y. Thus, the memory
model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many
operators and function calls can be evaluated in any order. For
example:

	r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the
memory model cannot assume there is a fixed program order relation
between them. (In fact, if the functions are inlined then the
compiler might even interleave their object code.)
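
The evaluation-order point can be observed directly. In the following
small sketch, f() and g() are hypothetical stand-ins that record the
order in which they are called; the helper sum_with_order() and the
call_log array are likewise invented for illustration. The sum is
always 11, but which function appears first in the log is left to the
compiler.

```c
static int call_log[2];	/* records 'f' or 'g' in call order */
static int ncalls;

static int f(int v)
{
	call_log[ncalls++] = 'f';
	return v;
}

static int g(int v)
{
	call_log[ncalls++] = 'g';
	return v;
}

/*
 * Evaluate f(5) + g(6) and report through *first which function the
 * compiler chose to call first.  C leaves that choice unspecified.
 */
int sum_with_order(int *first)
{
	int r1;

	ncalls = 0;
	r1 = f(5) + g(6);
	*first = call_log[0];
	return r1;
}
```

No expected value for *first is given here, since either answer is a
legitimate result.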


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the
execution of the second event depends in some way on a value obtained
from memory by the first. The first event must be a read, and the
value it obtains must somehow affect what the second event does.
There are three kinds of dependencies: data, address (addr), and
control (ctrl).

A read and a write event are linked by a data dependency if the value
obtained by the read affects the value stored by the write. As a very
simple example:

	int x, y;

	r1 = READ_ONCE(x);
	WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value
loaded by the READ_ONCE. Such dependencies can wind through
arbitrarily complicated computations, and a write can depend on the
values of multiple reads.

A read event and another memory access event are linked by an address
dependency if the value obtained by the read affects the location
accessed by the other event. The second event can be either a read or
a write. Here's another simple example:

	int a[20];
	int i;

	r1 = READ_ONCE(i);
	r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the
index value loaded by the first. Pointer indirection also gives rise
to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.

Finally, a read event and another memory access event are linked by a
control dependency if the value obtained by the read affects whether
the second event is executed at all. Simple example:

	int x, y;

	r1 = READ_ONCE(x);
	if (r1)
		WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression
which depends on the value obtained by the READ_ONCE(); hence there is
a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that
come earlier in program order. Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X. It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value loaded by the read is the value that was stored by the
write. In colloquial terms, the load "reads from" the store. We
write W ->rf R to indicate that the load R reads from the store W. We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as
though it had been written there by an imaginary initial store that
executes on a separate CPU before the program runs.

Usage of the rf relation implicitly assumes that loads will always
read from a single store. It doesn't apply properly in the presence
of load-tearing, where a load obtains some of its bits from one store
and some of them from another store. Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 0x1234);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
	}

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size
accesses are used. Consider this example:

	union {
		u32	w;
		u16	h[2];
	} x;

	P0()
	{
		WRITE_ONCE(x.h[0], 0x1234);
		WRITE_ONCE(x.h[1], 0x5678);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x.w);
	}

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores. It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt to do so. It requires all accesses to be properly aligned and
of the location's actual size.
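
The byte-order remark can be checked with a small single-threaded
sketch. It uses the standard uint32_t and uint16_t types in place of
the kernel's u32 and u16, and the names union word,
store_halves_load_word() and host_is_little_endian() are assumptions
made for this illustration. It shows only how the full word is
assembled from the two halfword stores; it says nothing about
concurrency.

```c
#include <stdint.h>

union word {
	uint32_t w;
	uint16_t h[2];
};

/* Store two halfwords separately, then load the containing word. */
uint32_t store_halves_load_word(uint16_t h0, uint16_t h1)
{
	union word x = { .w = 0 };

	x.h[0] = h0;
	x.h[1] = h1;
	return x.w;	/* value is built from both stores */
}

/* Returns nonzero when h[0] holds the low-order half of w. */
int host_is_little_endian(void)
{
	union word probe = { .w = 1 };

	return probe.h[0] == 1;
}
```

On a little-endian host, store_halves_load_word(0x1234, 0x5678)
yields 0x56781234; on a big-endian host the halves appear in the
opposite order. Either way the loaded word combines values from two
distinct stores, which is why a single rf link cannot describe it.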


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a
multi-processor system, the CPUs must share a consistent view of the
memory contents. Specifically, it requires that for each location in
shared memory, the stores to that location must form a single global
ordering which all the CPUs agree on (the coherence order), and this
ordering must be consistent with the program order for accesses to
that location.

To put it another way, for any variable x, the coherence order (co) of
the stores to x is simply the order in which the stores overwrite one
another. The imaginary store which establishes x's initial value
comes first in the coherence order; the store which directly
overwrites the initial value comes second; the store which overwrites
that value comes third, and so on.

You can think of the coherence order as being the order in which the
stores reach x's location in memory (or if you prefer a more
hardware-centric view, the order in which the stores get written to
x's cache line). We write W ->co W' if W comes before W' in the
coherence order, that is, if the value stored by W gets overwritten,
directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order. This
requirement takes the form of four coherency rules:

	Write-write coherence: If W ->po-loc W' (i.e., W comes before
	W' in program order and they access the same location), where W
	and W' are two stores, then W ->co W'.

	Write-read coherence: If W ->po-loc R, where W is a store and R
	is a load, then R must read from W or from some other store
	which comes after W in the coherence order.

	Read-write coherence: If R ->po-loc W, where R is a load and W
	is a store, then the store which R reads from must come before
	W in the coherence order.

	Read-read coherence: If R ->po-loc R', where R and R' are two
	loads, then either they read from the same store or else the
	store read by R comes before the store read by R' in the
	coherence order.

This is sometimes referred to as sequential consistency per variable,
because it means that the accesses to any single memory location obey
the rules of the Sequential Consistency memory model. (According to
Wikipedia, sequential consistency per variable and cache coherence
mean the same thing except that cache coherence includes an extra
requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence. Indeed, our
expectation of cache coherence is so deeply ingrained that violations
of its requirements look more like hardware bugs than programming
errors:

	int x;

	P0()
	{
		WRITE_ONCE(x, 17);
		WRITE_ONCE(x, 23);
	}

If the final value stored in x after this code ran was 17, you would
think your computer was broken. It would be a violation of the
write-write coherence rule: Since the store of 23 comes later in
program order, it must also come later in x's coherence order and
thus must overwrite the store of 17.

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 666);
	}

If r1 = 666 at the end, this would violate the read-write coherence
rule: The READ_ONCE() load comes before the WRITE_ONCE() store in
program order, so it must not read from that store but rather from one
coming earlier in the coherence order (in this case, x's initial
value).

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 5);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
imaginary store which establishes x's initial value) at the end, this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so it must not read from a store that
comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of
READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5
and r2 = 0! This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another motivation for using READ_ONCE() when accessing shared memory
locations.)
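
The four coherency rules can be folded into one small check. In the
sketch below, each access by a single CPU to a single location carries
a position in the coherence order: for a store, its own position; for
a load, the position of the store it reads from. The struct access
layout and the function names coherent() and
read_read_outcome_allowed() are assumptions made for illustration
only.

```c
#include <stdbool.h>

struct access {
	bool is_write;
	int co_pos;	/* store: its coherence position;
			   load: position of the store it reads from */
};

/*
 * acc[] lists one CPU's accesses to one location in program order.
 * For each po-loc pair (a, b):
 *   b is a store: a's position must be strictly earlier
 *                 (write-write and read-write coherence);
 *   b is a load:  a's position must not be later
 *                 (write-read and read-read coherence).
 */
bool coherent(const struct access *acc, int n)
{
	for (int i = 0; i < n; i++) {
		for (int j = i + 1; j < n; j++) {
			if (acc[j].is_write) {
				if (acc[i].co_pos >= acc[j].co_pos)
					return false;
			} else {
				if (acc[i].co_pos > acc[j].co_pos)
					return false;
			}
		}
	}
	return true;
}

/*
 * The read-read example above: position 0 is x's initial value and
 * position 1 is P0's store of 5.  r1 and r2 must each be 0 or 5.
 */
bool read_read_outcome_allowed(int r1, int r2)
{
	struct access p1[2] = {
		{ .is_write = false, .co_pos = (r1 == 5) },
		{ .is_write = false, .co_pos = (r2 == 5) },
	};

	return coherent(p1, 2);
}
```

With this encoding, read_read_outcome_allowed(5, 0) is rejected while
the other three combinations of 0 and 5 are accepted.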

Just like the po relation, co is inherently an ordering -- it is not
possible for a store to directly or indirectly overwrite itself! And
just like with the rf relation, we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never
related by co, just as instructions on different CPUs are never
related by po. Coherence order is strictly per-location, or if you
prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.  It describes the situation where a load reads a value that gets
overwritten by a store.  In other words, we have R ->fr W when the
value that R reads is overwritten (directly or indirectly) by W, or
equivalently, when R reads from a store which comes earlier than W in
the coherence order.

For example:

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);
	}

The value loaded from x will be 0 (assuming cache coherence!), and it
gets overwritten by the value 2.  Thus there is an fr link from the
READ_ONCE() to the WRITE_ONCE().  If the code contained any later
stores to x, there would also be fr links from the READ_ONCE() to
them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that the fr relation is determined entirely by the rf and co
relations; it is not independent.  Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W.  In symbols,

	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).
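
As a rough illustration of this definition, the following userspace C
sketch derives fr from rf and co for the single-CPU example above.  The
event names (W_INIT, R_X, W_2), the boolean-matrix representation, and
the function names are assumptions made for this sketch only; they are
not part of the LKMM or of its tooling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event numbering for the P0 example above. */
enum { W_INIT, R_X, W_2, NEVENTS };

/* rel[a][b] is true when event a is linked to event b. */
typedef bool relation[NEVENTS][NEVENTS];

/* (R ->fr W) := (there exists W' with W' ->rf R and W' ->co W). */
static void derive_fr(relation rf, relation co, relation fr)
{
	assert(fr != rf && fr != co);	/* output must not alias an input */

	for (int r = 0; r < NEVENTS; r++) {
		for (int w = 0; w < NEVENTS; w++) {
			fr[r][w] = false;
			for (int wp = 0; wp < NEVENTS; wp++)
				if (rf[wp][r] && co[wp][w])
					fr[r][w] = true;
		}
	}
}

/* Counts the links present in a relation. */
static int count_links(relation rel)
{
	int n = 0;

	for (int a = 0; a < NEVENTS; a++)
		for (int b = 0; b < NEVENTS; b++)
			if (rel[a][b])
				n++;
	return n;
}

/* Returns true iff the only derived fr link is R_X ->fr W_2. */
bool fr_example(void)
{
	relation rf = {{false}}, co = {{false}}, fr;

	rf[W_INIT][R_X] = true;	/* r1 = READ_ONCE(x) reads the initial 0 */
	co[W_INIT][W_2] = true;	/* WRITE_ONCE(x, 2) overwrites it */
	derive_fr(rf, co, fr);
	return fr[R_X][W_2] && count_links(fr) == 1;
}
```

Because fr is computed purely from the other two relations, adding a
later store to x in this sketch only requires adding its co links; the
extra fr links then follow automatically.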


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that
the models arise from an abstract view of how a computer system
operates.  Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem.
The CPUs are responsible for executing instructions (not necessarily
in program order), and they communicate with the memory subsystem.
For the most part, executing an instruction requires a CPU to perform
only internal operations.  However, loads, stores, and fences involve
more.

When CPU C executes a store instruction, it tells the memory subsystem
to store a certain value at a certain location.  The memory subsystem
propagates the store to all the other CPUs as well as to RAM.  (As a
special case, we say that the store propagates to its own CPU at the
time it is executed.)  The memory subsystem also determines where the
store falls in the location's coherence order.  In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order.  If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R.  Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory.  The memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.

(In fact, the picture needs to be a little more complicated than this.
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache.  A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed.  On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model.  But on Alpha, the local caches
have a partitioned design that results in non-FIFO behavior.  We will
discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be
restarted under certain circumstances.  The memory model ignores these
premature executions; we simply say that the load executes at the
final time it is forwarded or satisfied.
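
The load rule described above (forward from a pending store if there
is one, otherwise be satisfied from memory) can be sketched in a few
lines of userspace C.  This is only a toy under stated assumptions:
the struct layout, the fixed sizes, and the name execute_load() are
invented for the sketch, local-cache processing delays are ignored,
and nothing here corresponds to real kernel or hardware interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NLOCS = 2, MAX_PENDING = 8 };

/* A store that precedes the load in program order but has not yet
 * been executed (hypothetical representation). */
struct pending_store {
	int loc;
	int val;
};

struct cpu {
	struct pending_store pending[MAX_PENDING];	/* in program order */
	size_t npending;
	int visible[NLOCS];	/* co-latest store propagated here, per location */
};

/*
 * Executes a load from location loc.  Sets *forwarded to say whether
 * the value was forwarded from a pending store or satisfied from memory.
 */
int execute_load(const struct cpu *c, int loc, bool *forwarded)
{
	assert(loc >= 0 && loc < NLOCS);

	/* Scan backwards so that the po-latest matching store wins. */
	for (size_t i = c->npending; i-- > 0; ) {
		if (c->pending[i].loc == loc) {
			*forwarded = true;
			return c->pending[i].val;
		}
	}

	/* No pending store to this location: satisfied from memory. */
	*forwarded = false;
	return c->visible[loc];
}
```

For instance, with two pending stores to location 0 and none to
location 1, a load from location 0 obtains the value of the po-later
pending store, while a load from location 1 obtains whatever store has
most recently propagated to that CPU.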

Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence.  However, fences do constrain the way CPUs and the
memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in
program order.  Exactly which instructions are ordered depends on the
type of fence:

	Strong fences, including smp_mb() and synchronize_rcu(), force
	the CPU to execute all po-earlier instructions before any
	po-later instructions;

	smp_rmb() forces the CPU to execute all po-earlier loads
	before any po-later loads;

	smp_wmb() forces the CPU to execute all po-earlier stores
	before any po-later stores;

	Acquire fences, such as smp_load_acquire(), force the CPU to
	execute the load associated with the fence (e.g., the load
	part of an smp_load_acquire()) before any po-later
	instructions;

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).

Second, some types of fence affect the way the memory subsystem
propagates stores.  When a fence instruction is executed on CPU C:

	For each other CPU C', smp_wmb() forces all po-earlier stores
	on C to propagate to C' before any po-later stores do.

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

	Any store which propagates to C before a strong fence is
	executed (including all po-earlier stores on C) is forced to
	propagate to all other CPUs before any instructions po-after
	the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence.  We describe this property by saying that release fences and
strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
defined to link memory access events E and F whenever:

	E and F are both stores on the same CPU and an smp_wmb() fence
	event occurs between them in program order; or

	F is a release fence and some X comes before F in program order,
	where either X = E or else E ->rf X; or

	A strong fence event occurs between some X and F in program
	order, where either X = E or else E ->rf X.

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does.  However, for different CPUs C and C', it does not
require W to propagate to C before W' propagates to C'.


DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design
outlined above.  These restrictions involve the necessity of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is, among other things.

The formal version of the LKMM is defined by five requirements, or
axioms:

	Sequential consistency per variable: This requires that the
	system obey the four coherency rules.

	Atomicity: This requires that atomic read-modify-write
	operations really are atomic, that is, no other stores can
	sneak into the middle of such an update.

	Happens-before: This requires that certain instructions are
	executed in a specific order.

	Propagation: This requires that certain stores propagate to
	CPUs and to RAM in a specific order.

	Rcu: This requires that RCU read-side critical sections and
	grace periods obey the rules of RCU, in particular, the
	Grace-Period Guarantee.

The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11).  The "happens-before" and
"propagation" axioms have analogs in other memory models as well.  The
"rcu" axiom is specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed
shared location in memory form a global ordering.  We can imagine
inserting the loads from that location into this ordering, by placing
each load between the store that it reads from and the following
store.  This leaves the relative positions of loads that read from the
same store unspecified; let's say they are inserted in program order,
first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering.  The LKMM's
"coherence" axiom expresses this by requiring the union of these
relations not to have any cycles.  This means it must not be possible
to find events

	X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc.  This has to
hold if the accesses to the fixed memory location can be ordered as
cache coherence demands.

Although it is not obvious, it can be shown that the converse is also
true: This LKMM axiom implies that the four coherency rules are
obeyed.
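
To make the acyclicity requirement concrete, here is a small userspace
C sketch that applies it to the earlier two-load example (P0 stores 5
to x, P1 loads x twice).  The event names, the coherent() function,
and the use of Warshall's transitive-closure algorithm are all choices
made for this sketch; the formal model is checked with different
tooling.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical event names: initial store, P0's store, P1's loads. */
enum { INIT, W5, R1, R2, N };

/* True iff the relation contains a cycle, i.e., its transitive
 * closure links some event to itself. */
static bool has_cycle(bool rel[N][N])
{
	bool tc[N][N];

	memcpy(tc, rel, sizeof(tc));
	for (int k = 0; k < N; k++)
		for (int i = 0; i < N; i++)
			for (int j = 0; j < N; j++)
				if (tc[i][k] && tc[k][j])
					tc[i][j] = true;
	for (int i = 0; i < N; i++)
		if (tc[i][i])
			return true;
	return false;
}

/*
 * Builds the union of rf, co, fr, and po-loc for the example, given
 * the store (INIT or W5) that each load reads from, and reports
 * whether that union is acyclic.
 */
bool coherent(int r1_src, int r2_src)
{
	bool u[N][N] = {{false}};

	assert(r1_src == INIT || r1_src == W5);
	assert(r2_src == INIT || r2_src == W5);

	u[INIT][W5] = true;		/* co: W5 overwrites the initial value */
	u[R1][R2] = true;		/* po-loc: r1's load precedes r2's */
	u[r1_src][R1] = true;		/* rf */
	u[r2_src][R2] = true;		/* rf */
	if (r1_src == INIT)
		u[R1][W5] = true;	/* fr: R1 reads a store co-before W5 */
	if (r2_src == INIT)
		u[R2][W5] = true;	/* fr: R2 reads a store co-before W5 */
	return !has_cycle(u);
}
```

In this sketch coherent(W5, INIT), which models r1 = 5 and r2 = 0,
reports a cycle (W5 ->rf R1 ->po-loc R2 ->fr W5), whereas the other
three combinations of sources are acyclic.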


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?  It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two.  We should never have
the following sequence of events:

	CPU 0 loads x obtaining 13;
					CPU 1 loads x obtaining 13;
	CPU 0 stores 14 to x;
					CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.  To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.

The same analysis applies to all atomic update operations.  Therefore,
to enforce atomicity the LKMM requires that atomic updates follow this
rule: Whenever R and W are the read and write events composing an
atomic read-modify-write and W' is the write event which R reads from,
there must not be any stores coming between W' and W in the coherence
order.  Equivalently,

	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),

where the rmw relation links the read and write events making up each
atomic update.  This is what the LKMM's "atomic" axiom says.
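
The rule can be checked mechanically for the lost-update sequence
shown above.  The following userspace C sketch assumes a simplified
encoding in which every store to x is named by its index in x's
coherence order (index 0 being the initial value); struct rmw and the
two function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One atomic read-modify-write on a single location. */
struct rmw {
	int read_from;	/* co index of W', the store that R reads from */
	int write_pos;	/* co index of W, the store the update performs */
};

/*
 * No store X may satisfy R ->fr X and X ->co W.  With stores named by
 * co index, that means W must come immediately after W'.  (This also
 * rejects W being co-before W', which the coherence axiom forbids
 * anyway.)
 */
bool rmw_is_atomic(struct rmw u)
{
	return u.write_pos == u.read_from + 1;
}

/* True iff every update in the array obeys the rule. */
bool all_atomic(const struct rmw *updates, size_t n)
{
	assert(updates != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (!rmw_is_atomic(updates[i]))
			return false;
	return true;
}
```

In the bad sequence both CPUs read from the initial store (index 0)
while their stores land at indices 1 and 2, so the second update fails
the check: CPU 0's store sits between the store CPU 1 read from and
the store CPU 1 performed.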


THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------

There are many situations where a CPU is obligated to execute two
instructions in program order.  We amalgamate them into the ppo (for
"preserved program order") relation, which links the po-earlier
instruction to the po-later instruction and is thus a sub-relation of
po.

The operational model already includes a description of one such
situation: Fences are a source of ppo links.  Suppose X and Y are
memory accesses with X ->po Y; then the CPU must execute X before Y if
any of the following hold:

	A strong (smp_mb() or synchronize_rcu()) fence occurs between
	X and Y;

	X and Y are both stores and an smp_wmb() fence occurs between
	them;

	X and Y are both loads and an smp_rmb() fence occurs between
	them;

	X is also an acquire fence, such as smp_load_acquire();

	Y is also a release fence, such as smp_store_release().

Another possibility, not mentioned earlier but discussed in the next
section, is:

	X and Y are both loads, X ->addr Y (i.e., there is an address
	dependency from X to Y), and X is a READ_ONCE() or an atomic
	access.

Dependencies can also cause instructions to be executed in program
order.  This is uncontroversial when the second instruction is a
store; either a data, address, or control dependency from a load R to
a store W will force the CPU to execute R before W.  This is very
simply because the CPU cannot tell the memory subsystem about W's
store before it knows what value should be stored (in the case of a
data dependency), what location it should be stored into (in the case
of an address dependency), or whether the store should actually take
place (in the case of a control dependency).

Dependencies to load instructions are more problematic.  To begin with,
there is no such thing as a data dependency to a load.  Next, a CPU
has no reason to respect a control dependency to a load, because it
can always satisfy the second load speculatively before the first, and
then ignore the result if it turns out that the second load shouldn't
be executed after all.  And lastly, the real difficulties begin when
we consider address dependencies to loads.

To be fair about it, all Linux-supported architectures do execute
loads in program order if there is an address dependency between them.
After all, a CPU cannot ask the memory subsystem to load a value from
a particular location before it knows what that location is.  However,
the split-cache design used by Alpha can cause it to behave in a way
that looks as if the loads were executed out of order (see the next
section for more details).  The kernel includes a workaround for this
problem when the loads come from READ_ONCE(), and therefore the LKMM
includes address dependencies to loads in the ppo relation.

On the other hand, dependencies can indirectly affect the ordering of
two loads.  This happens when there is a dependency from a load to a
store and a second, po-later load reads from that store:

	R ->dep W ->rfi R',

where the dep link can be either an address or a data dependency.  In
this situation we know it is possible for the CPU to execute R' before
W, because it can forward the value that W will store to R'.  But it
cannot execute R' before R, because it cannot forward the value before
it knows what that value is, or that W and R' do access the same
location.  However, if there is merely a control dependency between R
and W then the CPU can speculatively forward W to R' before executing
R; if the speculation turns out to be wrong then the CPU merely has to
restart or abandon R'.

(In theory, a CPU might forward a store to a load when it runs across
an address dependency like this:

	r1 = READ_ONCE(ptr);
	WRITE_ONCE(*r1, 17);
	r2 = READ_ONCE(*r1);

because it could tell that the store and the second load access the
same location even before it knows what the location's address is.
However, none of the architectures supported by the Linux kernel do
this.)

Two memory accesses of the same location must always be executed in
program order if the second access is a store.  Thus, if we have

	R ->po-loc W

(the po-loc link says that R comes before W in program order and they
access the same location), the CPU is obliged to execute W after R.
If it executed W first then the memory subsystem would respond to R's
read request with the value stored by W (or an even later store), in
violation of the read-write coherence rule.  Similarly, if we had

	W ->po-loc W'

and the CPU executed W' before W, then the memory subsystem would put
W' before W in the coherence order.  It would effectively cause W to
overwrite W', in violation of the write-write coherence rule.
(Interestingly, an early ARMv8 memory model, now obsolete, proposed
allowing out-of-order writes like this to occur.  The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)


AND THEN THERE WAS ALPHA
------------------------

As mentioned above, the Alpha architecture is unique in that it does
not appear to respect address dependencies to loads.  This means that
code such as the following:

	int x = 0;
	int y = -1;
	int *ptr = &y;

	P0()
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(ptr, &x);
	}

	P1()
	{
		int *r1;
		int r2;

		r1 = ptr;
		r2 = READ_ONCE(*r1);
	}

can malfunction on Alpha systems (notice that P1 uses an ordinary load
to read ptr instead of READ_ONCE()).  It is quite possible that r1 = &x
and r2 = 0 at the end, in spite of the address dependency.

At first glance this doesn't seem to make sense.  We know that the
smp_wmb() forces P0's store to x to propagate to P1 before the store
to ptr does.  And since P1 can't execute its second load
until it knows what location to load from, i.e., after executing its
first load, the value x = 1 must have propagated to P1 before the
second load executed.  So why doesn't r2 end up equal to 1?

The answer lies in the Alpha's split local caches.  Although the two
stores do reach P1's local cache in the proper order, it can happen
that the first store is processed by a busy part of the cache while
the second store is processed by an idle part.  As a result, the x = 1
value may not become available for P1's CPU to read until after the
ptr = &x value does, leading to the undesirable result above.  The
final effect is that even though the two loads really are executed in
program order, it appears that they aren't.

This could not have happened if the local cache had processed the
incoming stores in FIFO order.  By contrast, other architectures
maintain at least the appearance of FIFO order.

In practice, this difficulty is solved by inserting a special fence
between P1's two loads when the kernel is compiled for the Alpha
architecture.  In fact, as of version 4.15, the kernel automatically
adds this fence (called smp_read_barrier_depends() and defined as
nothing at all on non-Alpha builds) after every READ_ONCE() and atomic
load.  The effect of the fence is to cause the CPU not to execute any
po-later instructions until after the local cache has finished
processing all the stores it has already received.  Thus, if the code
were changed to:

	P1()
	{
		int *r1;
		int r2;

		r1 = READ_ONCE(ptr);
		r2 = READ_ONCE(*r1);
	}

then we would never get r1 = &x and r2 = 0.  By the time P1 executed
its second load, the x = 1 store would already be fully processed by
the local cache and available for satisfying the read request.  Thus
we have yet another reason why shared data should always be read with
READ_ONCE() or another synchronization primitive rather than accessed
directly.

The LKMM requires that smp_rmb(), acquire fences, and strong fences
share this property with smp_read_barrier_depends(): They do not allow
the CPU to execute any po-later instructions (or po-later loads in the
case of smp_rmb()) until all outstanding stores have been processed by
the local cache.  In the case of a strong fence, the CPU first has to
wait for all of its po-earlier stores to propagate to every other CPU
in the system; then it has to wait for the local cache to process all
the stores received as of that time -- not just the stores received
when the strong fence began.

And of course, none of this matters for any architecture other than
Alpha.


THE HAPPENS-BEFORE RELATION: hb
-------------------------------

The happens-before relation (hb) links memory accesses that have to
execute in a certain order.  hb includes the ppo relation and two
others, one of which is rfe.

W ->rfe R implies that W and R are on different CPUs.  It also means
that W's store must have propagated to R's CPU before R executed;
otherwise R could not have read the value stored by W.  Therefore W
must have executed before R, and so we have W ->hb R.

The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
the same CPU).  As we have already seen, the operational model allows
W's value to be forwarded to R in such cases, meaning that R may well
execute before W does.

It's important to understand that neither coe nor fre is included in
hb, despite their similarities to rfe.  For example, suppose we have
W ->coe W'.  This means that W and W' are stores to the same location,
they execute on different CPUs, and W comes before W' in the coherence
order (i.e., W' overwrites W).  Nevertheless, it is possible for W' to
execute before W, because the decision as to which store overwrites
the other is made later by the memory subsystem.  When the stores are
nearly simultaneous, either one can come out on top.  Similarly,
R ->fre W means that W overwrites the value which R reads, but it
doesn't mean that W has to execute after R.  All that's necessary is
for the memory subsystem not to propagate W to R's CPU until after R
has executed, which is possible if W executes shortly before R.

The third relation included in hb is like ppo, in that it only links
events that are on the same CPU.  However it is more difficult to
explain, because it arises only indirectly from the requirement of
cache coherence.  The relation is called prop, and it links two events
on CPU C in situations where a store from some other CPU comes after
the first event in the coherence order and propagates to C before the
second event executes.

This is best explained with some examples.  The simplest case looks
like this:

	int x;

	P0()
	{
		int r1;

		WRITE_ONCE(x, 1);
		r1 = READ_ONCE(x);
	}

	P1()
	{
		WRITE_ONCE(x, 8);
	}

If r1 = 8 at the end then P0's accesses must have executed in program
order.  We can deduce this from the operational model; if P0's load
had executed before its store then the value of the store would have
been forwarded to the load, so r1 would have ended up equal to 1, not
8.  In this case there is a prop link from P0's write event to its read
event, because P1's store came after P0's store in x's coherence
order, and P1's store propagated to P0 before P0's load executed.

An equally simple case involves two loads of the same location that
read from different stores:

	int x = 0;

	P0()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}

	P1()
	{
		WRITE_ONCE(x, 9);
	}

If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
in program order.  If the second load had executed before the first
then the x = 9 store must have been propagated to P0 before the first
load executed, and so r1 would have been 9 rather than 0.  In this
case there is a prop link from P0's first read event to its second,
because P1's store overwrote the value read by P0's first load, and
P1's store propagated to P0 before P0's second load executed.

Less trivial examples of prop all involve fences.  Unlike the simple
examples above, they can require that some instructions are executed
out of program order.  This next one should look familiar:

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_wmb();
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2;

		r1 = READ_ONCE(flag);
		r2 = READ_ONCE(buf);
	}

This is the MP pattern again, with an smp_wmb() fence between the two
stores.  If r1 = 1 and r2 = 0 at the end then there is a prop link
from P1's second load to its first (backwards!).  The reason is
similar to the previous examples: The value P1 loads from buf gets
overwritten by P0's store to buf, the fence guarantees that the store
to buf will propagate to P1 before the store to flag does, and the
store to flag propagates to P1 before P1 reads flag.

The prop link says that in order to obtain the r1 = 1, r2 = 0 result,
P1 must execute its second load before the first.  Indeed, if the load
from flag were executed first, then the buf = 1 store would already
have propagated to P1 by the time P1's load from buf executed, so r2
would have been 1 at the end, not 0.  (The reasoning holds even for
Alpha, although the details are more complicated and we will not go
into them.)

But what if we put an smp_rmb() fence between P1's loads?  The fence
would force the two loads to be executed in program order, and it
would generate a cycle in the hb relation: The fence would create a ppo
link (hence an hb link) from the first load to the second, and the
prop relation would give an hb link from the second load to the first.
Since an instruction can't execute before itself, we are forced to
conclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0
outcome is impossible -- as it should be.
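
The cycle argument for the MP pattern can be replayed with a small
userspace C sketch.  It hard-codes the hb links that the text above
derives for the r1 = 1, r2 = 0 outcome and searches them for a cycle
with a depth-first traversal.  The event names and hb_acyclic() are
made up for this sketch, and the links are entered by hand rather
than computed from the program.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event names for the MP example above. */
enum { W_BUF, W_FLAG, R_FLAG, R_BUF, N };

/* Depth-first search; state is 0 = unvisited, 1 = on the current
 * path, 2 = finished.  Returns true if a cycle is reachable from v. */
static bool dfs(bool g[N][N], int v, int state[N])
{
	assert(v >= 0 && v < N);

	state[v] = 1;
	for (int w = 0; w < N; w++) {
		if (!g[v][w])
			continue;
		if (state[w] == 1)
			return true;	/* back edge closes a cycle */
		if (state[w] == 0 && dfs(g, w, state))
			return true;
	}
	state[v] = 2;
	return false;
}

static bool has_cycle(bool g[N][N])
{
	int state[N] = {0};

	for (int v = 0; v < N; v++)
		if (state[v] == 0 && dfs(g, v, state))
			return true;
	return false;
}

/*
 * Reports whether hb is acyclic for the outcome r1 = 1, r2 = 0,
 * with or without an smp_rmb() between P1's two loads.
 */
bool hb_acyclic(bool with_rmb)
{
	bool hb[N][N] = {{false}};

	hb[W_BUF][W_FLAG] = true;	/* ppo: smp_wmb() between P0's stores */
	hb[W_FLAG][R_FLAG] = true;	/* rfe: r1 = 1 */
	hb[R_BUF][R_FLAG] = true;	/* prop: fre, cumul-fence, rfe */
	if (with_rmb)
		hb[R_FLAG][R_BUF] = true;	/* ppo: smp_rmb() on P1 */
	return !has_cycle(hb);
}
```

Here hb_acyclic(false) holds, so the happens-before axiom alone does
not rule the outcome out, while hb_acyclic(true) fails because the
ppo link from the smp_rmb() and the backwards prop link form a cycle.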
				1299
				1300	The formal definition of the prop relation involves a coe or fre link,
				1301	followed by an arbitrary number of cumul-fence links, ending with an
				1302	rfe link. You can concoct more exotic examples, containing more than
				1303	one fence, although this quickly leads to diminishing returns in terms
				1304	of complexity. For instance, here's an example containing a coe link
				1305	followed by two fences and an rfe link, utilizing the fact that
				1306	release fences are A-cumulative:
				1307
				1308	int x, y, z;
				1309
				1310	P0()
				1311	{
				1312		int r0;
				1313	
				1314		WRITE_ONCE(x, 1);
				1315		r0 = READ_ONCE(z);
				1316	}
				1317	
				1318	P1()
				1319	{
				1320		WRITE_ONCE(x, 2);
				1321		smp_wmb();
				1322		WRITE_ONCE(y, 1);
				1323	}
				1324	
				1325	P2()
				1326	{
				1327		int r2;
				1328	
				1329		r2 = READ_ONCE(y);
				1330		smp_store_release(&z, 1);
				1331	}
				1332
				1333	If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
				1334	link from P0's store to its load. This is because P0's store gets
				1335	overwritten by P1's store since x = 2 at the end (a coe link), the
				1336	smp_wmb() ensures that P1's store to x propagates to P2 before the
				1337	store to y does (the first fence), the store to y propagates to P2
				1338	before P2's load and store execute, P2's smp_store_release()
				1339	guarantees that the stores to x and y both propagate to P0 before the
				1340	store to z does (the second fence), and P0's load executes after the
				1341	store to z has propagated to P0 (an rfe link).
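
The shape of a prop link as described in this section (a coe or fre
link, then any number of cumul-fence links, then an rfe link) is easy
to state as a small recognizer. The sketch below follows this
document's description rather than the formal .cat definition; the
enum link type and the is_prop_chain() function are invented for the
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Link types; the enum and its names are illustrative only. */
enum link { COE, FRE, CUMUL_FENCE, RFE };

/*
 * Return true if links[0..n-1] has the shape
 *     (coe | fre) ; cumul-fence* ; rfe
 * used in this section's description of prop.
 */
static bool is_prop_chain(const enum link *links, size_t n)
{
	size_t i;

	if (n < 2)
		return false;
	if (links[0] != COE && links[0] != FRE)
		return false;
	for (i = 1; i < n - 1; i++)
		if (links[i] != CUMUL_FENCE)
			return false;
	return links[n - 1] == RFE;
}
```

The three-CPU example above corresponds to the sequence COE,
CUMUL_FENCE, CUMUL_FENCE, RFE, which the function accepts. A sequence
that lacks the leading coe or fre link is rejected.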
				1342
				1343	In summary, the fact that the hb relation links memory access events
				1344	in the order they execute means that it must not have cycles. This
				1345	requirement is the content of the LKMM's "happens-before" axiom.
				1346
				1347	The LKMM defines yet another relation connected to times of
				1348	instruction execution, but it is not included in hb. It relies on the
				1349	particular properties of strong fences, which we cover in the next
				1350	section.
				1351
				1352
				1353	THE PROPAGATES-BEFORE RELATION: pb
				1354	----------------------------------
				1355
				1356	The propagates-before (pb) relation capitalizes on the special
				1357	features of strong fences. It links two events E and F whenever some
				1358	store is coherence-later than E and propagates to every CPU and to RAM
				1359	before F executes. The formal definition requires that E be linked to
				1360	F via a coe or fre link, an arbitrary number of cumul-fences, an
				1361	optional rfe link, a strong fence, and an arbitrary number of hb
				1362	links. Let's see how this definition works out.
				1363
				1364	Consider first the case where E is a store (implying that the sequence
				1365	of links begins with coe). Then there are events W, X, Y, and Z such
				1366	that:
				1367
				1368	E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,
				1369
				1370	where the * suffix indicates an arbitrary number of links of the
				1371	specified type, and the ? suffix indicates the link is optional (Y may
				1372	be equal to X). Because of the cumul-fence links, we know that W will
				1373	propagate to Y's CPU before X does, hence before Y executes and hence
				1374	before the strong fence executes. Because this fence is strong, we
				1375	know that W will propagate to every CPU and to RAM before Z executes.
				1376	And because of the hb links, we know that Z will execute before F.
				1377	Thus W, which comes later than E in the coherence order, will
				1378	propagate to every CPU and to RAM before F executes.
				1379
				1380	The case where E is a load is exactly the same, except that the first
				1381	link in the sequence is fre instead of coe.
				1382
				1383	The existence of a pb link from E to F implies that E must execute
				1384	before F. To see why, suppose that F executed first. Then W would
				1385	have propagated to E's CPU before E executed. If E was a store, the
				1386	memory subsystem would then be forced to make E come after W in the
				1387	coherence order, contradicting the fact that E ->coe W. If E was a
				1388	load, the memory subsystem would then be forced to satisfy E's read
				1389	request with the value stored by W or an even later store,
				1390	contradicting the fact that E ->fre W.
				1391
				1392	A good example illustrating how pb works is the SB pattern with strong
				1393	fences:
				1394
				1395	int x = 0, y = 0;
				1396
				1397	P0()
				1398	{
				1399		int r0;
				1400	
				1401		WRITE_ONCE(x, 1);
				1402		smp_mb();
				1403		r0 = READ_ONCE(y);
				1404	}
				1405	
				1406	P1()
				1407	{
				1408		int r1;
				1409	
				1410		WRITE_ONCE(y, 1);
				1411		smp_mb();
				1412		r1 = READ_ONCE(x);
				1413	}
				1414
				1415	If r0 = 0 at the end then there is a pb link from P0's load to P1's
				1416	load: an fre link from P0's load to P1's store (which overwrites the
				1417	value read by P0), and a strong fence between P1's store and its load.
				1418	In this example, the sequences of cumul-fence and hb links are empty.
				1419	Note that this pb link is not included in hb as an instance of prop,
				1420	because it does not start and end on the same CPU.
				1421
				1422	Similarly, if r1 = 0 at the end then there is a pb link from P1's load
				1423	to P0's. This means that if both r0 and r1 were 0 there would be a
				1424	cycle in pb, which is not possible since an instruction cannot execute
				1425	before itself. Thus, adding smp_mb() fences to the SB pattern
				1426	prevents the r0 = 0, r1 = 0 outcome.
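
As a sanity check on this conclusion, one can enumerate executions of
the SB pattern by brute force. The sketch below makes a strong
simplifying assumption that is not how the LKMM itself works: it
treats the smp_mb() fences as forcing each CPU's two accesses to
execute in program order with every store visible to the other CPU
immediately, so that each execution is just an interleaving of the
four accesses. The functions sb_run() and sb_count() are invented for
this illustration.

```c
#include <stdbool.h>

/*
 * Each CPU runs two steps in program order: first its store, then its
 * load.  Bit k of sched is set when the k-th step to execute belongs
 * to P1 and clear when it belongs to P0.  Returns false if sched does
 * not give each CPU exactly two steps.
 */
static bool sb_run(unsigned int sched, int *r0, int *r1)
{
	int x = 0, y = 0;
	int pc0 = 0, pc1 = 0;

	for (int k = 0; k < 4; k++) {
		if (sched & (1u << k)) {
			if (pc1 == 2)
				return false;	/* P1 has no steps left */
			if (pc1++ == 0)
				y = 1;		/* WRITE_ONCE(y, 1) */
			else
				*r1 = x;	/* r1 = READ_ONCE(x) */
		} else {
			if (pc0 == 2)
				return false;	/* P0 has no steps left */
			if (pc0++ == 0)
				x = 1;		/* WRITE_ONCE(x, 1) */
			else
				*r0 = y;	/* r0 = READ_ONCE(y) */
		}
	}
	return true;
}

/* Count the valid interleavings that end with r0 = want0, r1 = want1. */
static int sb_count(int want0, int want1)
{
	int count = 0;

	for (unsigned int sched = 0; sched < 16; sched++) {
		int r0 = -1, r1 = -1;

		if (sb_run(sched, &r0, &r1) && r0 == want0 && r1 == want1)
			count++;
	}
	return count;
}
```

Of the six valid interleavings, none ends with r0 = 0 and r1 = 0, so
sb_count(0, 0) returns 0. This agrees with the pb-cycle argument, but
only under the simplification stated above.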
				1427
				1428	In summary, the fact that the pb relation links events in the order
				1429	they execute means that it cannot have cycles. This requirement is
				1430	the content of the LKMM's "propagation" axiom.
				1431
				1432
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1433	RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-fence, and rb
				1434	-------------------------------------------------------------
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1435
				1436	RCU (Read-Copy-Update) is a powerful synchronization mechanism. It
				1437	rests on two concepts: grace periods and read-side critical sections.
				1438
				1439	A grace period is the span of time occupied by a call to
				1440	synchronize_rcu(). A read-side critical section (or just critical
				1441	section, for short) is a region of code delimited by rcu_read_lock()
				1442	at the start and rcu_read_unlock() at the end. Critical sections can
				1443	be nested, although we won't make use of this fact.
				1444
				1445	As far as memory models are concerned, RCU's main feature is its
				1446	Grace Period Guarantee, which states that a critical section can never
				1447	span a full grace period. In more detail, the Guarantee says:
				1448
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1449	For any critical section C and any grace period G, at least
				1450	one of the following statements must hold:
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1451
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1452	(1) C ends before G does, and in addition, every store that
				1453	propagates to C's CPU before the end of C must propagate to
				1454	every CPU before G ends.
				1455
				1456	(2) G starts before C does, and in addition, every store that
				1457	propagates to G's CPU before the start of G must propagate
				1458	to every CPU before C starts.
				1459
				1460	In particular, it is not possible for a critical section to both start
				1461	before and end after a grace period.
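
The timing half of the Guarantee can be written as a one-line
predicate. The sketch below assumes, purely for illustration, that the
start and end of a critical section or grace period can be given
numeric timestamps; struct span and gp_timing_ok() are hypothetical
names, and the store-propagation clauses of parts (1) and (2) are not
modelled at all.

```c
#include <stdbool.h>

/* Hypothetical representation: a span of time with start < end. */
struct span {
	long start;
	long end;
};

/*
 * Timing part of the Grace Period Guarantee: for critical section c
 * and grace period g, either c ends before g ends (part 1) or g
 * starts before c starts (part 2).
 */
static bool gp_timing_ok(struct span c, struct span g)
{
	return c.end < g.end || g.start < c.start;
}
```

With a grace period spanning times 10 to 20, a critical section
spanning 5 to 25 fails the predicate, since it both starts before and
ends after the grace period. Critical sections spanning 5 to 15, 15 to
25, or 12 to 18 all pass.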
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1462
				1463	Here is a simple example of RCU in action:
				1464
				1465	int x, y;
				1466
				1467	P0()
				1468	{
				1469		rcu_read_lock();
				1470		WRITE_ONCE(x, 1);
				1471		WRITE_ONCE(y, 1);
				1472		rcu_read_unlock();
				1473	}
				1474	
				1475	P1()
				1476	{
				1477		int r1, r2;
				1478	
				1479		r1 = READ_ONCE(x);
				1480		synchronize_rcu();
				1481		r2 = READ_ONCE(y);
				1482	}
				1483
				1484	The Grace Period Guarantee tells us that when this code runs, it will
				1485	never end with r1 = 1 and r2 = 0. The reasoning is as follows. r1 = 1
				1486	means that P0's store to x propagated to P1 before P1 called
				1487	synchronize_rcu(), so P0's critical section must have started before
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1488	P1's grace period, contrary to part (2) of the Guarantee. On the
				1489	other hand, r2 = 0 means that P0's store to y, which occurs before the
				1490	end of the critical section, did not propagate to P1 before the end of
				1491	the grace period, contrary to part (1). Together the results violate
				1492	the Guarantee.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1493
Alan Stern	1ee2da5	2018-05-14 16:33:39 -0700	[diff] [blame]	1494	In the kernel's implementations of RCU, the requirements for stores
				1495	to propagate to every CPU are fulfilled by placing strong fences at
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1496	suitable places in the RCU-related code. Thus, if a critical section
				1497	starts before a grace period does then the critical section's CPU will
				1498	execute an smp_mb() fence after the end of the critical section and
				1499	some time before the grace period's synchronize_rcu() call returns.
				1500	And if a critical section ends after a grace period does then the
				1501	synchronize_rcu() routine will execute an smp_mb() fence at its start
				1502	and some time before the critical section's opening rcu_read_lock()
				1503	executes.
				1504
				1505	What exactly do we mean by saying that a critical section "starts
				1506	before" or "ends after" a grace period? Some aspects of the meaning
				1507	are pretty obvious, as in the example above, but the details aren't
Alan Stern	1ee2da5	2018-05-14 16:33:39 -0700	[diff] [blame]	1508	entirely clear. The LKMM formalizes this notion by means of the
				1509	rcu-link relation. rcu-link encompasses a very general notion of
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1510	"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
				1511	rcu_read_unlock(), or synchronize_rcu()) then among other things,
				1512	E ->rcu-link F includes cases where E is po-before some memory-access
				1513	event X, F is po-after some memory-access event Y, and we have any of
				1514	X ->rfe Y, X ->co Y, or X ->fr Y.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1515
Alan Stern	1ee2da5	2018-05-14 16:33:39 -0700	[diff] [blame]	1516	The formal definition of the rcu-link relation is more than a little
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1517	obscure, and we won't give it here. It is closely related to the pb
				1518	relation, and the details don't matter unless you want to comb through
				1519	a somewhat lengthy formal proof. Pretty much all you need to know
Alan Stern	1ee2da5	2018-05-14 16:33:39 -0700	[diff] [blame]	1520	about rcu-link is the information in the preceding paragraph.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1521
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1522	The LKMM also defines the rcu-gp and rcu-rscsi relations. They bring
				1523	grace periods and read-side critical sections into the picture, in the
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1524	following way:
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1525
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1526	E ->rcu-gp F means that E and F are in fact the same event,
				1527	and that event is a synchronize_rcu() fence (i.e., a grace
				1528	period).
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1529
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1530	E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
				1531	and rcu_read_lock() fence events delimiting some read-side
				1532	critical section. (The 'i' at the end of the name emphasizes
				1533	that this relation is "inverted": It links the end of the
				1534	critical section to the start.)
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1535
Alan Stern	1ee2da5	2018-05-14 16:33:39 -0700	[diff] [blame]	1536	If we think of the rcu-link relation as standing for an extended
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1537	"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
				1538	grace period which ends before Z begins. (In fact it covers more than
				1539	this, because it also includes cases where some store propagates to
				1540	Z's CPU before Z begins but doesn't propagate to some other CPU until
				1541	after X ends.) Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
				1542	the end of a critical section which starts before Z begins.
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1543
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1544	The LKMM goes on to define the rcu-fence relation as a sequence of
				1545	rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
				1546	number of rcu-gp links is >= the number of rcu-rscsi links. For
				1547	example:
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1548
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1549	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1550
				1551	would imply that X ->rcu-fence V, because this sequence contains two
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1552	rcu-gp links and one rcu-rscsi link. (It also implies that
				1553	X ->rcu-fence T and Z ->rcu-fence V.) On the other hand:
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1554
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1555	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1556
				1557	does not imply X ->rcu-fence V, because the sequence contains only
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1558	one rcu-gp link but two rcu-rscsi links.
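
The counting rule in the two examples above can be sketched in a few
lines of C. This follows the prose description given here (rcu-gp and
rcu-rscsi links separated by rcu-link links, with at least as many
rcu-gp as rcu-rscsi) and is not the recursive definition used in the
formal model; the enum rcu_link_type and is_rcu_fence_seq() are names
invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Link types; the enum is illustrative, not taken from the .cat file. */
enum rcu_link_type { RCU_GP, RCU_RSCSI, RCU_LINK };

/*
 * Return true if seq[0..n-1] alternates (rcu-gp | rcu-rscsi) with
 * rcu-link, starts and ends on rcu-gp or rcu-rscsi, and contains at
 * least as many rcu-gp links as rcu-rscsi links.
 */
static bool is_rcu_fence_seq(const enum rcu_link_type *seq, size_t n)
{
	size_t gp = 0, rscsi = 0;

	if (n % 2 == 0)
		return false;	/* empty, or ends on an rcu-link */
	for (size_t i = 0; i < n; i++) {
		if (i % 2 == 1) {
			if (seq[i] != RCU_LINK)
				return false;
		} else if (seq[i] == RCU_GP) {
			gp++;
		} else if (seq[i] == RCU_RSCSI) {
			rscsi++;
		} else {
			return false;
		}
	}
	return gp >= rscsi;
}
```

The first example sequence (rcu-gp, rcu-link, rcu-rscsi, rcu-link,
rcu-gp) is accepted, and the second (two rcu-rscsi links and one
rcu-gp link) is rejected.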
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1559
				1560	The rcu-fence relation is important because the Grace Period Guarantee
				1561	means that rcu-fence acts kind of like a strong fence. In particular,
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1562	E ->rcu-fence F implies not only that E begins before F ends, but also
				1563	that any write po-before E will propagate to every CPU before any
				1564	instruction po-after F can execute. (However, it does not imply that
				1565	E must execute before F; in fact, each synchronize_rcu() fence event
				1566	is linked to itself by rcu-fence as a degenerate case.)
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1567
				1568	To prove this in full generality requires some intellectual effort.
				1569	We'll consider just a very simple case:
				1570
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1571	G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1572
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1573	This formula means that G and W are the same event (a grace period),
				1574	and there are events X, Y and a read-side critical section C such that:
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1575
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1576	1. G = W is po-before or equal to X;
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1577
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1578	2. X comes "before" Y in some sense (including rfe, co and fr);
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1579
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1580	3. Y is po-before Z;
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1581
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1582	4. Z is the rcu_read_unlock() event marking the end of C;
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1583
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1584	5. F is the rcu_read_lock() event marking the start of C.
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1585
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1586	From 1 - 4 we deduce that the grace period G ends before the critical
				1587	section C does, so part (1) of the Grace Period Guarantee cannot hold
				1588	and part (2) must. Part (2) says not only that G starts before C does,
				1589	but also that any write which executes on G's CPU before G starts must
				1590	propagate to every CPU before C starts. In particular, the write
				1591	propagates to every CPU before F finishes executing and hence before
				1592	any instruction po-after F can execute. This sort of reasoning can be
				1593	extended to handle all the situations covered by rcu-fence.
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1594
				1595	Finally, the LKMM defines the RCU-before (rb) relation in terms of
				1596	rcu-fence. This is done in essentially the same way as the pb
				1597	relation was defined in terms of strong-fence. We will omit the
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1598	details; the end result is that E ->rb F implies E must execute
				1599	before F, just as E ->pb F does (and for much the same reasons).
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1600
				1601	Putting this all together, the LKMM expresses the Grace Period
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1602	Guarantee by requiring that the rb relation does not contain a cycle.
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1603	Equivalently, this "rcu" axiom requires that there are no events E
				1604	and F with E ->rcu-link F ->rcu-fence E. Or to put it a third way,
				1605	the axiom requires that there are no cycles consisting of rcu-gp and
				1606	rcu-rscsi alternating with rcu-link, where the number of rcu-gp links
				1607	is >= the number of rcu-rscsi links.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1608
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1609	Justifying the axiom isn't easy, but it is in fact a valid
				1610	formalization of the Grace Period Guarantee. We won't attempt to go
				1611	through the detailed argument, but the following analysis gives a
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1612	taste of what is involved. Suppose both parts of the Guarantee are
				1613	violated: A critical section starts before a grace period, and some
				1614	store propagates to the critical section's CPU before the end of the
				1615	critical section but doesn't propagate to some other CPU until after
				1616	the end of the grace period.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1617
				1618	Putting symbols to these ideas, let L and U be the rcu_read_lock() and
				1619	rcu_read_unlock() fence events delimiting the critical section in
				1620	question, and let S be the synchronize_rcu() fence event for the grace
				1621	period. Saying that the critical section starts before S means there
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1622	are events Q and R where Q is po-after L (which marks the start of the
				1623	critical section), Q is "before" R in the sense used by the rcu-link
				1624	relation, and R is po-before the grace period S. Thus we have:
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1625
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1626	L ->rcu-link S.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1627
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1628	Let W be the store mentioned above, let Y come before the end of the
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1629	critical section and witness that W propagates to the critical
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1630	section's CPU by reading from W, and let Z on some arbitrary CPU be a
				1631	witness that W has not propagated to that CPU, where Z happens after
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1632	some event X which is po-after S. Symbolically, this amounts to:
				1633
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1634	S ->po X ->hb* Z ->fr W ->rf Y ->po U.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1635
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1636	The fr link from Z to W indicates that W has not propagated to Z's CPU
				1637	at the time that Z executes. From this, it can be shown (see the
				1638	discussion of the rcu-link relation earlier) that S and U are related
				1639	by rcu-link:
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1640
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1641	S ->rcu-link U.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1642
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1643	Since S is a grace period we have S ->rcu-gp S, and since L and U are
				1644	the start and end of the critical section C we have U ->rcu-rscsi L.
				1645	From this we obtain:
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1646
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1647	S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,
Alan Stern	9d03688	2018-05-14 16:33:40 -0700	[diff] [blame]	1648
				1649	a forbidden cycle. Thus the "rcu" axiom rules out this violation of
				1650	the Grace Period Guarantee.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1651
				1652	For something a little more down-to-earth, let's see how the axiom
				1653	works out in practice. Consider the RCU code example from above, this
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1654	time with statement labels added:
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1655
				1656	int x, y;
				1657
				1658	P0()
				1659	{
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1660		L: rcu_read_lock();
				1661		X: WRITE_ONCE(x, 1);
				1662		Y: WRITE_ONCE(y, 1);
				1663		U: rcu_read_unlock();
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1664	}
				1665	
				1666	P1()
				1667	{
				1668		int r1, r2;
				1669	
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1670		Z: r1 = READ_ONCE(x);
				1671		S: synchronize_rcu();
				1672		W: r2 = READ_ONCE(y);
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1673	}
				1674
				1675
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1676	If r2 = 0 at the end then P0's store at Y overwrites the value that
				1677	P1's load at W reads from, so we have W ->fre Y. Since S ->po W and
				1678	also Y ->po U, we get S ->rcu-link U. In addition, S ->rcu-gp S
				1679	because S is a grace period.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1680
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1681	If r1 = 1 at the end then P1's load at Z reads from P0's store at X,
				1682	so we have X ->rfe Z. Together with L ->po X and Z ->po S, this
				1683	yields L ->rcu-link S. And since L and U are the start and end of a
				1684	critical section, we have U ->rcu-rscsi L.
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1685
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1686	Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
				1687	forbidden cycle, violating the "rcu" axiom. Hence the outcome is not
				1688	allowed by the LKMM, as we would expect.
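
The derivation just given can be traced in code. The sketch below uses
only the simplified reading of rcu-link from earlier in this section
(a fence po-before X, a fence po-after Y, and an rfe, co, or fr link
from X to Y); the event table, the EV_ names, and the functions po(),
rcu_link(), and rcu_outcome_forbidden() are all invented for the
illustration.

```c
#include <stdbool.h>

/* Events of the labelled example, named after their statement labels. */
enum ev { EV_L, EV_X, EV_Y, EV_U, EV_Z, EV_S, EV_W, EV_COUNT };

/* Each event's CPU and its position in that CPU's program order. */
static const int ev_cpu[EV_COUNT] = { 0, 0, 0, 0, 1, 1, 1 };
static const int ev_idx[EV_COUNT] = { 0, 1, 2, 3, 0, 1, 2 };

static bool po(enum ev a, enum ev b)
{
	return ev_cpu[a] == ev_cpu[b] && ev_idx[a] < ev_idx[b];
}

/*
 * Simplified rcu-link: fence e is po-before x, fence f is po-after y,
 * and x ->(rfe | co | fr) y holds (passed in as has_comm).
 */
static bool rcu_link(enum ev e, enum ev x, enum ev y, enum ev f,
		     bool has_comm)
{
	return has_comm && po(e, x) && po(y, f);
}

/* True if the outcome (r1, r2) produces the forbidden cycle. */
static bool rcu_outcome_forbidden(int r1, int r2)
{
	/* X ->rfe Z exists when r1 = 1. */
	bool l_to_s = rcu_link(EV_L, EV_X, EV_Z, EV_S, r1 == 1);
	/* W ->fre Y exists when r2 = 0. */
	bool s_to_u = rcu_link(EV_S, EV_W, EV_Y, EV_U, r2 == 0);

	/*
	 * U ->rcu-rscsi L and S ->rcu-gp S always hold, so the cycle
	 * has one rcu-gp link and one rcu-rscsi link.
	 */
	return l_to_s && s_to_u;
}
```

Only the outcome r1 = 1, r2 = 0 yields both rcu-link links and hence
the cycle; the other three combinations leave at least one link
missing.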
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1689
				1690	For contrast, let's see what can happen in a more complicated example:
				1691
				1692	int x, y, z;
				1693
				1694	P0()
				1695	{
				1696		int r0;
				1697	
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1698		L0: rcu_read_lock();
				1699		    r0 = READ_ONCE(x);
				1700		    WRITE_ONCE(y, 1);
				1701		U0: rcu_read_unlock();
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1702	}
				1703	
				1704	P1()
				1705	{
				1706		int r1;
				1707	
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1708		    r1 = READ_ONCE(y);
				1709		S1: synchronize_rcu();
				1710		    WRITE_ONCE(z, 1);
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1711	}
				1712	
				1713	P2()
				1714	{
				1715		int r2;
				1716	
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1717		L2: rcu_read_lock();
				1718		    r2 = READ_ONCE(z);
				1719		    WRITE_ONCE(x, 1);
				1720		U2: rcu_read_unlock();
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1721	}
				1722
				1723	If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1724	that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi
				1725	L2 ->rcu-link U0. However this cycle is not forbidden, because the
				1726	sequence of relations contains fewer instances of rcu-gp (one) than of
				1727	rcu-rscsi (two). Consequently the outcome is allowed by the LKMM.
				1728	The following instruction timing diagram shows how it might actually
				1729	occur:
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1730
				1731	P0                      P1                      P2
				1732	--------------------    --------------------    --------------------
				1733	rcu_read_lock()
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1734	WRITE_ONCE(y, 1)
				1735	                        r1 = READ_ONCE(y)
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1736	                        synchronize_rcu() starts
				1737	                        .                       rcu_read_lock()
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1738	                        .                       WRITE_ONCE(x, 1)
				1739	r0 = READ_ONCE(x)       .
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1740	rcu_read_unlock()       .
				1741	                        synchronize_rcu() ends
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1742	                        WRITE_ONCE(z, 1)
				1743	                                                r2 = READ_ONCE(z)
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1744	                                                rcu_read_unlock()
				1745
				1746	This requires P0 and P2 to execute their loads and stores out of
				1747	program order, but of course they are allowed to do so. And as you
				1748	can see, the Grace Period Guarantee is not violated: The critical
				1749	section in P0 both starts before P1's grace period does and ends
				1750	before it does, and the critical section in P2 both starts after P1's
				1751	grace period does and ends after it does.
				1752
Alan Stern	648e717	2018-12-11 11:38:53 -0500	[diff] [blame]	1753	Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in
				1754	addition to normal RCU. The ideas involved are much the same as
				1755	above, with new relations srcu-gp and srcu-rscsi added to represent
				1756	SRCU grace periods and read-side critical sections. There is a
				1757	restriction on the srcu-gp and srcu-rscsi links that can appear in an
				1758	rcu-fence sequence (the srcu-rscsi links must be paired with srcu-gp
				1759	links having the same SRCU domain with proper nesting); the details
				1760	are relatively unimportant.
				1761
Paul E. McKenney	1c27b64	2018-01-18 19:58:55 -0800	[diff] [blame]	1762
Alan Stern	6e89e83	2018-09-26 11:29:17 -0700	[diff] [blame]	1763	LOCKING
				1764	-------
				1765
				1766	The LKMM includes locking. In fact, there is special code for locking
				1767	in the formal model, added in order to make tools run faster.
				1768	However, this special code is intended to be more or less equivalent
				1769	to concepts we have already covered. A spinlock_t variable is treated
				1770	the same as an int, and spin_lock(&s) is treated almost the same as:
				1771
				1772	while (cmpxchg_acquire(&s, 0, 1) != 0)
				1773		cpu_relax();
				1774
				1775	This waits until s is equal to 0 and then atomically sets it to 1,
				1776	and the read part of the cmpxchg operation acts as an acquire fence.
				1777	An alternate way to express the same thing would be:
				1778
				1779	r = xchg_acquire(&s, 1);
				1780
				1781	along with a requirement that at the end, r = 0. Similarly,
				1782	spin_trylock(&s) is treated almost the same as:
				1783
				1784	return !cmpxchg_acquire(&s, 0, 1);
				1785
				1786	which atomically sets s to 1 if it is currently equal to 0 and returns
				1787	true if it succeeds (the read part of the cmpxchg operation acts as an
				1788	acquire fence only if the operation is successful). spin_unlock(&s)
				1789	is treated almost the same as:
				1790
				1791	smp_store_release(&s, 0);
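
For readers who want to experiment outside the kernel, the three
operations can be mimicked with C11 atomics. This is a userspace
analogy only, not the kernel's implementation: C11 memory orderings do
not coincide exactly with the LKMM's, and toy_spinlock_t,
toy_spin_lock(), toy_spin_trylock(), and toy_spin_unlock() are names
invented for the sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for spinlock_t: 0 = unlocked, 1 = locked. */
typedef atomic_int toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *s)
{
	int expected = 0;

	/* Spin until the 0 -> 1 update succeeds, with acquire ordering. */
	while (!atomic_compare_exchange_weak_explicit(s, &expected, 1,
						      memory_order_acquire,
						      memory_order_relaxed))
		expected = 0;	/* a failed exchange overwrites expected */
}

static bool toy_spin_trylock(toy_spinlock_t *s)
{
	int expected = 0;

	/* Acquire ordering applies only if the exchange succeeds. */
	return atomic_compare_exchange_strong_explicit(s, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

static void toy_spin_unlock(toy_spinlock_t *s)
{
	atomic_store_explicit(s, 0, memory_order_release);
}
```

In a single thread, a toy_spin_trylock() made while the lock is held
returns false, and one made after toy_spin_unlock() returns true. Like
the code fragments above, this sketch provides only ordinary acquire
and release ordering.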
				1792
				1793	The "almost" qualifiers above need some explanation. In the LKMM, the
				1794	store-release in a spin_unlock() and the load-acquire which forms the
				1795	first half of the atomic rmw update in a spin_lock() or a successful
				1796	spin_trylock() -- we can call these things lock-releases and
				1797	lock-acquires -- have two properties beyond those of ordinary releases
				1798	and acquires.
				1799
				1800	First, when a lock-acquire reads from a lock-release, the LKMM
				1801	requires that every instruction po-before the lock-release must
				1802	execute before any instruction po-after the lock-acquire. This would
				1803	naturally hold if the release and acquire operations were on different
				1804	CPUs, but the LKMM says it holds even when they are on the same CPU.
				1805	For example:
				1806
				1807	int x, y;
				1808	spinlock_t s;
				1809
				1810	P0()
				1811	{
				1812		int r1, r2;
				1813	
				1814		spin_lock(&s);
				1815		r1 = READ_ONCE(x);
				1816		spin_unlock(&s);
				1817		spin_lock(&s);
				1818		r2 = READ_ONCE(y);
				1819		spin_unlock(&s);
				1820	}
				1821	
				1822	P1()
				1823	{
				1824		WRITE_ONCE(y, 1);
				1825		smp_wmb();
				1826		WRITE_ONCE(x, 1);
				1827	}
				1828
				1829	Here the second spin_lock() reads from the first spin_unlock(), and
				1830	therefore the load of x must execute before the load of y. Thus we
				1831	cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
				1832	MP pattern).
				1833
				1834	This requirement does not apply to ordinary release and acquire
				1835	fences, only to lock-related operations. For instance, suppose P0()
				1836	in the example had been written as:
				1837
				1838	P0()
				1839	{
				1840		int r1, r2, r3;
				1841	
				1842		r1 = READ_ONCE(x);
				1843		smp_store_release(&s, 1);
				1844		r3 = smp_load_acquire(&s);
				1845		r2 = READ_ONCE(y);
				1846	}
				1847
				1848	Then the CPU would be allowed to forward the s = 1 value from the
				1849	smp_store_release() to the smp_load_acquire(), executing the
				1850	instructions in the following order:
				1851
				1852	r3 = smp_load_acquire(&s); // Obtains r3 = 1
				1853	r2 = READ_ONCE(y);
				1854	r1 = READ_ONCE(x);
				1855	smp_store_release(&s, 1); // Value is forwarded
				1856
				1857	and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
				1858
				1859	Second, when a lock-acquire reads from a lock-release, and some other
				1860	stores W and W' occur po-before the lock-release and po-after the
				1861	lock-acquire respectively, the LKMM requires that W must propagate to
				1862	each CPU before W' does. For example, consider:
				1863
				1864	int x, y;
				1865	spinlock_t s;
				1866
				1867	P0()
				1868	{
				1869		spin_lock(&s);
				1870		WRITE_ONCE(x, 1);
				1871		spin_unlock(&s);
				1872	}
				1873	
				1874	P1()
				1875	{
				1876		int r1;
				1877	
				1878		spin_lock(&s);
				1879		r1 = READ_ONCE(x);
				1880		WRITE_ONCE(y, 1);
				1881		spin_unlock(&s);
				1882	}
				1883	
				1884	P2()
				1885	{
				1886		int r2, r3;
				1887	
				1888		r2 = READ_ONCE(y);
				1889		smp_rmb();
				1890		r3 = READ_ONCE(x);
				1891	}
				1892
				1893	If r1 = 1 at the end then the spin_lock() in P1 must have read from
				1894	the spin_unlock() in P0. Hence the store to x must propagate to P2
				1895	before the store to y does, so we cannot have r2 = 1 and r3 = 0.

These two special requirements for lock-release and lock-acquire do
not arise from the operational model. Nevertheless, kernel developers
have come to expect and rely on them because they do hold on all
architectures supported by the Linux kernel, albeit for differing
reasons.


ODDS AND ENDS
-------------

This section covers material that didn't quite fit anywhere in the
earlier sections.

The descriptions in this document don't always match the formal
version of the LKMM exactly. For example, the actual formal
definition of the prop relation makes the initial coe or fre part
optional, and it doesn't require the events linked by the relation to
be on the same CPU. These differences are unimportant; indeed,
instances where the coe/fre part of prop is missing are of no interest
because all the other parts (fences and rfe) are already included in
hb anyway, and where the formal model adds prop into hb, it includes
an explicit requirement that the events being linked are on the same
CPU.

Another minor difference has to do with events that are both memory
accesses and fences, such as those corresponding to smp_load_acquire()
calls. In the formal model, these events aren't actually both reads
and fences; rather, they are read events with an annotation marking
them as acquires. (Or write events annotated as releases, in the case
of smp_store_release().) The final effect is the same.

Although we didn't mention it above, the instruction execution
ordering provided by the smp_rmb() fence doesn't apply to read events
that are part of a non-value-returning atomic update. For instance,
given:

	atomic_inc(&x);
	smp_rmb();
	r1 = READ_ONCE(y);

it is not guaranteed that the load from y will execute after the
update to x. This is because the ARMv8 architecture allows
non-value-returning atomic operations effectively to be executed off
the CPU. Basically, the CPU tells the memory subsystem to increment
x, and then the increment is carried out by the memory hardware with
no further involvement from the CPU. Since the CPU doesn't ever read
the value of x, there is nothing for the smp_rmb() fence to act on.
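
The rule just described can be sketched as a toy predicate over
annotated events. The type names, the AN_NORETURN annotation and the
rmb_orders() helper are all hypothetical, chosen for this sketch; the
code models only the sentence above, not the formal definition:

```c
#include <stdbool.h>

/* Hypothetical event kinds and annotations for this sketch. */
enum ev_kind  { EV_READ, EV_WRITE };
enum ev_annot { AN_ONCE, AN_ACQUIRE, AN_RELEASE, AN_NORETURN };

struct event {
	enum ev_kind kind;
	enum ev_annot annot;	/* AN_NORETURN: part of a non-value-returning update */
};

/*
 * Would an smp_rmb() sitting between `before` and `after` in program
 * order impose execution ordering on them?  Only reads qualify, and a
 * read that belongs to a non-value-returning atomic update does not.
 */
bool rmb_orders(struct event before, struct event after)
{
	bool r_before = before.kind == EV_READ && before.annot != AN_NORETURN;
	bool r_after = after.kind == EV_READ && after.annot != AN_NORETURN;

	return r_before && r_after;
}
```

In this sketch the read half of atomic_inc(&x) would be an EV_READ
event marked AN_NORETURN, so rmb_orders() reports no ordering against
the later READ_ONCE(y), whereas two plain READ_ONCE() events are
ordered.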

The LKMM defines a few extra synchronization operations in terms of
things we have already covered. In particular, rcu_dereference() is
treated as READ_ONCE() and rcu_assign_pointer() is treated as
smp_store_release() -- which is basically how the Linux kernel treats
them.
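
A userspace analogue of this pairing, written with C11 atomics, might
look as follows. This is only a sketch: struct foo, gp, publish() and
read_a() are hypothetical names, the release store stands in for
rcu_assign_pointer(), and the consume load stands in for
rcu_dereference() plus the address dependency that follows it. RCU
read-side critical sections and grace periods are left out entirely:

```c
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
};

static _Atomic(struct foo *) gp;	/* shared pointer, initially NULL */

/* Analogue of rcu_assign_pointer(): initialize, then store-release. */
void publish(struct foo *p, int a)
{
	p->a = a;
	atomic_store_explicit(&gp, p, memory_order_release);
}

/*
 * Analogue of rcu_dereference() followed by a dependent load.
 * memory_order_consume is the closest C11 counterpart; compilers
 * typically implement it as an acquire load.
 */
int read_a(int fallback)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

	return p ? p->a : fallback;
}
```

A reader that observes the published pointer is guaranteed to see the
initialized value of the a field; a reader that runs before
publish() gets the fallback value instead.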

There are a few oddball fences which need special treatment:
smp_mb__before_atomic(), smp_mb__after_atomic(), and
smp_mb__after_spinlock(). The LKMM uses fence events with special
annotations for them; they act as strong fences just like smp_mb()
except for the sets of events that they order. Instead of ordering
all po-earlier events against all po-later events, as smp_mb() does,
they behave as follows:

	smp_mb__before_atomic() orders all po-earlier events against
	po-later atomic updates and the events following them;

	smp_mb__after_atomic() orders po-earlier atomic updates and
	the events preceding them against all po-later events;

	smp_mb__after_spinlock() orders po-earlier lock acquisition
	events and the events preceding them against all po-later
	events.
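
The three rules above can be sketched as a toy predicate over a
single CPU's events listed in program order. The enum values and the
fence_orders() helper are hypothetical and model only the wording of
the rules, not the formal definitions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event labels for one CPU's program-order sequence. */
enum ev {
	EV_ACCESS,		/* ordinary READ_ONCE() or WRITE_ONCE() */
	EV_ATOMIC,		/* atomic update */
	EV_LOCK,		/* lock acquisition */
	EV_MB,			/* smp_mb() */
	EV_BEFORE_ATOMIC,	/* smp_mb__before_atomic() */
	EV_AFTER_ATOMIC,	/* smp_mb__after_atomic() */
	EV_AFTER_SPINLOCK	/* smp_mb__after_spinlock() */
};

static bool is_fence(enum ev e)
{
	return e == EV_MB || e == EV_BEFORE_ATOMIC ||
	       e == EV_AFTER_ATOMIC || e == EV_AFTER_SPINLOCK;
}

/* Does po[lo..hi-1] contain an event labelled `want`? */
static bool has(const enum ev *po, size_t lo, size_t hi, enum ev want)
{
	for (size_t k = lo; k < hi; k++)
		if (po[k] == want)
			return true;
	return false;
}

/*
 * Does the fence at index f order the event at index a (po-earlier)
 * against the event at index b (po-later)?  n is the length of po.
 */
bool fence_orders(const enum ev *po, size_t n, size_t a, size_t f, size_t b)
{
	if (!(a < f && f < b && b < n) || is_fence(po[a]) || is_fence(po[b]))
		return false;

	switch (po[f]) {
	case EV_MB:
		return true;
	case EV_BEFORE_ATOMIC:	/* b is a later atomic update or follows one */
		return has(po, f + 1, b + 1, EV_ATOMIC);
	case EV_AFTER_ATOMIC:	/* a is an earlier atomic update or precedes one */
		return has(po, a, f, EV_ATOMIC);
	case EV_AFTER_SPINLOCK:	/* a is an earlier lock acquisition or precedes one */
		return has(po, a, f, EV_LOCK);
	default:
		return false;	/* po[f] is not a fence */
	}
}
```

For the sequence { EV_ACCESS, EV_BEFORE_ATOMIC, EV_ACCESS, EV_ATOMIC },
for example, the sketch orders the first access against the atomic
update but not against the plain access sitting between the fence and
the update.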

Interestingly, RCU and locking each introduce the possibility of
deadlock. When faced with code sequences such as:

	spin_lock(&s);
	spin_lock(&s);
	spin_unlock(&s);
	spin_unlock(&s);

or:

	rcu_read_lock();
	synchronize_rcu();
	rcu_read_unlock();

what does the LKMM have to say? Answer: It says there are no allowed
executions at all, which makes sense. But this can also lead to
misleading results, because if a piece of code has multiple possible
executions, some of which deadlock, the model will report only on the
non-deadlocking executions. For example:

	int x, y;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		r0 = READ_ONCE(y);
	}

	P1()
	{
		rcu_read_lock();
		if (READ_ONCE(x) > 0) {
			WRITE_ONCE(y, 36);
			synchronize_rcu();
		}
		rcu_read_unlock();
	}

Is it possible to end up with r0 = 36 at the end? The LKMM will tell
you it is not, but the model won't mention that this is because P1
will self-deadlock in the executions where it stores 36 in y.
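
The filtering effect can be imitated with a toy enumeration of this
example. The sketch below is an assumption-laden illustration, not
how the LKMM tooling works: it ignores every ordering rule, simply
branches on the values the two READ_ONCE() calls could return, and
treats a synchronize_rcu() call inside a read-side critical section
as a self-deadlock. The struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

struct outcome {
	int completed;		/* executions that run to completion */
	int deadlocked;		/* executions in which P1 self-deadlocks */
	int completed_r0_36;	/* completed executions with r0 == 36 */
};

/* Abstractly run P1, given the value its READ_ONCE(x) returns. */
static bool p1_deadlocks(int x_seen, bool *stored_y)
{
	int nesting = 1;		/* rcu_read_lock() */

	*stored_y = false;
	if (x_seen > 0) {
		*stored_y = true;	/* WRITE_ONCE(y, 36) */
		if (nesting > 0)	/* synchronize_rcu() inside a reader */
			return true;
	}
	return false;			/* reaches rcu_read_unlock() */
}

struct outcome enumerate_executions(void)
{
	const int x_vals[] = { 0, 1 };	/* values P1 may read from x */
	const int y_vals[] = { 0, 36 };	/* values P0 may read from y */
	struct outcome o = { 0, 0, 0 };

	for (size_t i = 0; i < 2; i++) {
		for (size_t j = 0; j < 2; j++) {
			bool stored_y;
			bool dead = p1_deadlocks(x_vals[i], &stored_y);
			int r0 = y_vals[j];

			if (r0 == 36 && !stored_y)
				continue;	/* no store of 36 to read from */
			if (dead) {
				o.deadlocked++;
			} else {
				o.completed++;
				if (r0 == 36)
					o.completed_r0_36++;
			}
		}
	}
	return o;
}
```

Every candidate in which P1 stores 36 lands in the deadlocked count,
so a report restricted to completed executions never shows r0 = 36,
which mirrors the behavior described above.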