
[PATCH 2/4] atomics: update documentation for C11


From: Paolo Bonzini
Subject: [PATCH 2/4] atomics: update documentation for C11
Date: Mon, 6 Apr 2020 15:13:18 -0400

Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to
use them correctly because they do not interoperate with sequentially-consistent
RMW operations.

Signed-off-by: Paolo Bonzini <address@hidden>
---
 docs/devel/atomics.rst | 290 ++++++++++++++++-------------------------
 1 file changed, 114 insertions(+), 176 deletions(-)

diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
index 00f3f7d3ed..e92752a64d 100644
--- a/docs/devel/atomics.rst
+++ b/docs/devel/atomics.rst
@@ -11,10 +11,15 @@ that is consistent with the expectations of the programmer.
 The most basic tool is locking.  Mutexes, condition variables and
 semaphores are used in QEMU, and should be the default approach to
 synchronization.  Anything else is considerably harder, but it's
-also justified more often than one would like.  The two tools that
-are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
+also justified more often than one would like;
+the most performance-critical parts of QEMU in particular require
+a very low-level approach to concurrency, involving memory barriers
+and atomic operations.  The semantics of concurrent memory accesses are
+governed by the C11 memory model.
 
-Macros defined by ``qemu/atomic.h`` fall in three camps:
+QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
+provide better portability and a less verbose syntax.  ``qemu/atomic.h``
+provides macros that fall in three camps:
 
 - compiler barriers: ``barrier()``;
 
@@ -24,6 +29,14 @@ Macros defined by ``qemu/atomic.h`` fall in three camps:
 
 - sequentially consistent atomic access: everything else.
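+
+For instance, one representative of each camp (a rough sketch, not a
+meaningful program)::
+
+    barrier();                 /* compiler barrier */
+    smp_wmb();                 /* manual write memory barrier */
+    y = atomic_read(&x);       /* weak atomic access */
+    atomic_inc(&counter);      /* sequentially consistent RMW */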
 
+In general, use of ``qemu/atomic.h`` should be wrapped with more easily
+used data structures (e.g. the lock-free singly-linked list operations
+``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
+primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
+atomic operations and memory barriers should be limited to inter-thread
+checking of flags and documented thoroughly.
+
+
 
 Compiler memory barrier
 =======================
@@ -85,36 +98,14 @@ Similar operations return the new value of ``*ptr``::
     typeof(*ptr) atomic_or_fetch(ptr, val)
     typeof(*ptr) atomic_xor_fetch(ptr, val)
 
-Sequentially consistent loads and stores can be done using::
-
-    atomic_fetch_add(ptr, 0) for loads
-    atomic_xchg(ptr, val) for stores
+These operations work on any type that is as wide as an int or smaller.
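+
+For example, the old value returned by the ``atomic_fetch_*`` family
+tells a thread whether it dropped the last reference to an object.
+A sketch (``obj`` and ``object_free()`` are placeholders)::
+
+    if (atomic_fetch_dec(&obj->refs) == 1) {
+        /* the old value was 1, so the count is now 0 */
+        object_free(obj);
+    }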
 
-However, they are quite expensive on some platforms, notably POWER and
-Arm.  Therefore, qemu/atomic.h provides two primitives with slightly
-weaker constraints::
+``qemu/atomic.h`` also provides sequentially consistent loads and stores::
 
     typeof(*ptr) atomic_mb_read(ptr)
     void         atomic_mb_set(ptr, val)
 
-The semantics of these primitives map to Java volatile variables,
-and are strongly related to memory barriers as used in the Linux
-kernel (see below).
-
-As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
-be reordered with each other, and it is also not possible to reorder
-"normal" accesses around them.
-
-However, and this is the important difference between
-atomic_mb_read/atomic_mb_set and sequential consistency, it is important
-for both threads to access the same volatile variable.  It is not the
-case that everything visible to thread A when it writes volatile field f
-becomes visible to thread B after it reads volatile field g. The store
-and load have to "match" (i.e., be performed on the same volatile
-field) to achieve the right semantics.
-
-
-These operations operate on any type that is as wide as an int or smaller.
+which, however, are deprecated: they do not interoperate with
+sequentially consistent read-modify-write operations.
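+
+When converting code away from the deprecated pair, the replacement is
+usually the load-acquire/store-release operations described in the next
+section (a sketch)::
+
+    /* deprecated */               /* preferred */
+    x = atomic_mb_read(&a);        x = atomic_load_acquire(&a);
+    atomic_mb_set(&b, y);          atomic_store_release(&b, y);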
 
 
 Weak atomic access and manual memory barriers
@@ -208,135 +199,62 @@ They come in six kinds:
   dependency and a full read barrier or better is required.
 
 
-This is the set of barriers that is required *between* two ``atomic_read()``
-and ``atomic_set()`` operations to achieve sequential consistency:
-
-   +----------------+-------------------------------------------------------+
-   |                |                 2nd operation                         |
-   |                +------------------+-----------------+------------------+
-   | 1st operation  | (after last)     | atomic_read     | atomic_set       |
-   +----------------+------------------+-----------------+------------------+
-   | (before first) | ..               | none            | smp_mb_release() |
-   +----------------+------------------+-----------------+------------------+
-   | atomic_read    | smp_mb_acquire() | smp_rmb() [1]_  | [2]_             |
-   +----------------+------------------+-----------------+------------------+
-   | atomic_set     | none             | smp_mb() [3]_   | smp_wmb()        |
-   +----------------+------------------+-----------------+------------------+
-
-   .. [1] Or smp_read_barrier_depends().
-
-   .. [2] This requires a load-store barrier.  This is achieved by
-          either smp_mb_acquire() or smp_mb_release().
-
-   .. [3] This requires a store-load barrier.  On most machines, the only
-          way to achieve this is a full barrier.
-
-
-You can see that the two possible definitions of ``atomic_mb_read()``
-and ``atomic_mb_set()`` are the following:
-
-  1) | atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
-
-  2) | atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
-
-Usually the former is used, because ``smp_mb()`` is expensive and a program
-normally has more reads than writes.  Therefore it makes more sense to
-make ``atomic_mb_set()`` the more expensive operation.
-
-There are two common cases in which atomic_mb_read and atomic_mb_set
-generate too many memory barriers, and thus it can be useful to manually
-place barriers, or use atomic_load_acquire/atomic_store_release instead:
-
-- when a data structure has one thread that is always a writer
-  and one thread that is always a reader, manual placement of
-  memory barriers makes the write side faster.  Furthermore,
-  correctness is easy to check for in this case using the "pairing"
-  trick that is explained below:
-
-    +----------------------------------------------------------------------+
-    | thread 1                                                             |
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   (other writes)                  |                                  |
-    |   atomic_mb_set(&a, x)            |   atomic_store_release(&a, x)    |
-    |   atomic_mb_set(&b, y)            |   atomic_store_release(&b, y)    |
-    +-----------------------------------+----------------------------------+
-
-    +----------------------------------------------------------------------+
-    | thread 2                                                             |
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   y = atomic_mb_read(&b)          |   y = atomic_load_acquire(&b)    |
-    |   x = atomic_mb_read(&a)          |   x = atomic_load_acquire(&a)    |
-    |   (other reads)                   |                                  |
-    +-----------------------------------+----------------------------------+
-
-  Note that the barrier between the stores in thread 1, and between
-  the loads in thread 2, has been optimized here to a write or a
-  read memory barrier respectively.  On some architectures, notably
-  ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
-  smp_mb, but smp_rmb and/or smp_wmb are more efficient.
-
-- sometimes, a thread is accessing many variables that are otherwise
-  unrelated to each other (for example because, apart from the current
-  thread, exactly one other thread will read or write each of these
-  variables).  In this case, it is possible to "hoist" the implicit
-  barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
-  a loop.  For example, the above definition ``atomic_mb_read()`` gives
-  the following transformation:
-
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   n = 0;                          |   n = 0;                         |
-    |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-    |     n += atomic_mb_read(&a[i]);   |     n += atomic_read(&a[i]);     |
-    |                                   |   smp_mb_acquire();              |
-    +-----------------------------------+----------------------------------+
-
-  Similarly, atomic_mb_set() can be transformed as follows:
-
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |                                   |   smp_mb_release();              |
-    |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-    |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
-    |                                   |   smp_mb();                      |
-    +-----------------------------------+----------------------------------+
-
-
-  The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
+Memory barriers and ``atomic_load_acquire``/``atomic_store_release`` are
+mostly used when a data structure has one thread that is always a writer
+and one thread that is always a reader:
+
+    +----------------------------------+----------------------------------+
+    | thread 1                         | thread 2                         |
+    +==================================+==================================+
+    | ::                               | ::                               |
+    |                                  |                                  |
+    |   atomic_store_release(&a, x)    |   y = atomic_load_acquire(&b)    |
+    |   atomic_store_release(&b, y)    |   x = atomic_load_acquire(&a)    |
+    +----------------------------------+----------------------------------+
+
+In this case, correctness is easy to check using the "pairing"
+trick that is explained below.
+
+Sometimes, a thread is accessing many variables that are otherwise
+unrelated to each other (for example because, apart from the current
+thread, exactly one other thread will read or write each of these
+variables).  In this case, it is possible to "hoist" the barriers
+outside a loop.  For example:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |   n = 0;                                 |   n = 0;                         |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     n += atomic_load_acquire(&a[i]);     |     n += atomic_read(&a[i]);     |
+    |                                          |   smp_mb_acquire();              |
+    +------------------------------------------+----------------------------------+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |   smp_mb_release();              |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     atomic_store_release(&a[i], false);  |     atomic_set(&a[i], false);    |
+    +------------------------------------------+----------------------------------+
 
 The two tricks can be combined.  In this case, splitting a loop in
-two lets you hoist the barriers out of the loops _and_ eliminate the
-expensive ``smp_mb()``:
-
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |                                   |     smp_mb_release();            |
-    |   for (i = 0; i < 10; i++) {      |     for (i = 0; i < 10; i++)     |
-    |     atomic_mb_set(&a[i], false);  |       atomic_set(&a[i], false);  |
-    |     atomic_mb_set(&b[i], false);  |     smb_wmb();                   |
-    |   }                               |     for (i = 0; i < 10; i++)     |
-    |                                   |       atomic_set(&a[i], false);  |
-    |                                   |     smp_mb();                    |
-    +-----------------------------------+----------------------------------+
+two lets you hoist the barriers out of the loops, and replace an
+``smp_mb_release()`` with an ``smp_wmb()``, which is possibly cheaper
+and clearer as well:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |     smp_mb_release();            |
+    |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     |
+    |     atomic_store_release(&a[i], false);  |       atomic_set(&a[i], false);  |
+    |     atomic_store_release(&b[i], false);  |     smp_wmb();                   |
+    |   }                                      |     for (i = 0; i < 10; i++)     |
+    |                                          |       atomic_set(&b[i], false);  |
+    +------------------------------------------+----------------------------------+
 
 
 Memory barrier pairing
@@ -347,11 +265,13 @@ always, be paired with another barrier.  In the case of QEMU, however,
 note that the other barrier may actually be in a driver that runs in
 the guest!
 
-For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
-both count as read barriers.  A read barrier shall pair with a write
-barrier or a full barrier; a write barrier shall pair with a read
-barrier or a full barrier.  A full barrier can pair with anything.
-For example:
+For the purposes of pairing, ``smp_read_barrier_depends()``, ``smp_rmb()``
+and ``smp_mb_acquire()`` all count as read barriers; ``smp_wmb()`` and
+``smp_mb_release()`` both count as write barriers.
+
+A read barrier shall pair with a write barrier or a full barrier;
+a write barrier shall pair with a read barrier or a full barrier.
+A full barrier can pair with anything.  For example:
 
       +--------------------+------------------------------+
       | thread 1           | thread 2                     |
@@ -387,9 +307,6 @@ access and for data dependency barriers:
       |                    |  z = b[y];                   |
       +--------------------+------------------------------+
 
-``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
-and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
-
 
 Comparison with Linux kernel memory barriers
 ============================================
@@ -424,24 +341,45 @@ and memory barriers, and the equivalents in QEMU:
          ``atomic_cmpxchg``    returns the old value of the variable
          ===================== =========================================
 
-  In QEMU, the second kind does not exist.  Currently Linux has
-  atomic_fetch_or only.  QEMU provides and, or, inc, dec, add, sub.
+  In QEMU, the second kind does not exist.
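+
+  For example, because operations such as ``atomic_fetch_or`` return the
+  old value, a flag bit can be tested and set in a single step (a sketch;
+  ``flags`` and ``MY_BIT`` are placeholders)::
+
+      if (atomic_fetch_or(&flags, MY_BIT) & MY_BIT) {
+          /* the bit was already set */
+      }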
 
 - different atomic read-modify-write operations in Linux imply
   a different set of memory barriers; in QEMU, all of them enforce
-  sequential consistency, which means they imply full memory barriers
-  before and after the operation.
-
-- Linux does not have an equivalent of ``atomic_mb_set()``.  In particular,
-  note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
-  ``atomic_mb_read()`` compiles to the same instructions as Linux's
-  ``smp_load_acquire()``, but this should be treated as an implementation
-  detail.
+  sequential consistency.
+
+- in QEMU, ``atomic_read()`` and ``atomic_set()`` do not participate in
+  the total ordering enforced by sequentially-consistent operations.
+  This is because QEMU uses the C11 memory model.  The following example
+  is correct in Linux but not in QEMU:
+
+      +----------------------------------+--------------------------------+
+      | Linux (correct)                  | QEMU (incorrect)               |
+      +==================================+================================+
+      | ::                               | ::                             |
+      |                                  |                                |
+      |   a = atomic_fetch_add(&x, 2);   |   a = atomic_fetch_add(&x, 2); |
+      |   b = READ_ONCE(y);              |   b = atomic_read(&y);         |
+      +----------------------------------+--------------------------------+
+
+  because the read of ``y`` can be moved (by either the processor or the
+  compiler) before the write of ``x``.
+
+  Fixing this requires an ``smp_mb()`` memory barrier between the write
+  of ``x`` and the read of ``y``.  In the common case where only one thread
+  writes ``x``, it is also possible to write it like this:
+
+      +--------------------------------+
+      | QEMU (correct)                 |
+      +================================+
+      | ::                             |
+      |                                |
+      |   a = atomic_read(&x);         |
+      |   atomic_set(&x, a + 2);       |
+      |   smp_mb();                    |
+      |   b = atomic_read(&y);         |
+      +--------------------------------+
 
 Sources
 =======
 
 * Documentation/memory-barriers.txt from the Linux kernel
-
-* "The JSR-133 Cookbook for Compiler Writers", available at
-  http://g.oswego.edu/dl/jmm/cookbook.html
-- 
2.18.2




