Notes on synchronization in the Linux Kernel
23 Jul 2015

Just a small collection of notes on synchronization mechanisms in the kernel. The underlying reference is the Linux Kernel Development book by Robert Love.
Atomic Operations
- Indivisible operations - executed without interruption
- Keep state consistent across threads of execution
- Some architectural support
- Atomic Integer Operations
- Use the special type atomic_t instead of a plain int
- prevents the compiler from optimizing accesses to the value
- hides architecture-specific implementation differences
- defined in linux/types.h
typedef struct {
        volatile int counter;
} atomic_t;
* Developers historically needed to limit `atomic_t` usage to `24 bits` to avoid breaking on SPARC
* on SPARC a lock was embedded in the lower 8 bits (this may no longer be relevant)
* atomic operations are declared in `asm/atomic.h`
* defining atomic variables:
/* define v */
atomic_t v;
/* define u and initialize it to zero */
atomic_t u = ATOMIC_INIT(0);
- Some atomic operations :
/* v = 4 (atomically) */
atomic_set(&v, 4);
/* v = v + 2 = 6 (atomically) */
atomic_add(2, &v);
/* v = v + 1 = 7 (atomically) */
atomic_inc(&v);
/* will print "7"; atomic_read() converts atomic_t to int */
printk("%d\n", atomic_read(&v));
- Using atomic_add(), atomic_inc() and atomic_dec() is lighter weight than complex locking
- int atomic_dec_and_test(atomic_t *v)
- decrements v
- returns true if v is now zero, else returns false (see the reference-count sketch below)
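- As a hedged illustration (the object type, field, and function names here are hypothetical, not from the book), atomic_dec_and_test() is the usual building block for reference counting:
struct my_object {
        atomic_t refcount;
        /* ... payload ... */
};

static void my_object_put(struct my_object *obj)
{
        /* drop one reference; free the object only when the count hits zero */
        if (atomic_dec_and_test(&obj->refcount))
                kfree(obj);     /* kfree() from linux/slab.h */
}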
- List of Atomic Operations
// At declaration, initialize to i.
ATOMIC_INIT(int i)
// Atomically read the integer value of v.
int atomic_read(atomic_t *v)
// Atomically set v equal to i.
void atomic_set(atomic_t *v, int i)
// Atomically add i to v.
void atomic_add(int i, atomic_t *v)
// Atomically subtract i from v.
void atomic_sub(int i, atomic_t *v)
// Atomically add one to v.
void atomic_inc(atomic_t *v)
// Atomically subtract one from v.
void atomic_dec(atomic_t *v)
// Atomically subtract i from v and
// return true if the result is zero;
// otherwise false.
int atomic_sub_and_test(int i, atomic_t *v)
//Atomically add i to v and return
//true if the result is negative;
//otherwise false.
int atomic_add_negative(int i, atomic_t *v)
//Atomically add i to v and return
//the result.
int atomic_add_return(int i, atomic_t *v)
//Atomically subtract i from v and
//return the result.
int atomic_sub_return(int i, atomic_t *v)
//Atomically increment v by one and
//return the result.
int atomic_inc_return(atomic_t *v)
// Atomically decrement v by one and
// return the result.
int atomic_dec_return(atomic_t *v)
//Atomically decrement v by one and
//return true if zero; false otherwise.
int atomic_dec_and_test(atomic_t *v)
//Atomically increment v by one and
//return true if the result is zero;
//false otherwise.
int atomic_inc_and_test(atomic_t *v)
- Implemented as inline functions with inline assembly
- word-sized reads are always atomic
- on most architectures, a read or write of a single byte is atomic (no two simultaneous operations)
- a read is consistent (ordered either entirely before or entirely after a concurrent write)
- Thus on most architectures atomic_read() reduces to:
/**
* atomic_read - read atomic variable
* @v: pointer of type atomic_t
*
* Atomically reads the value of @v.
*/
static inline int atomic_read(const atomic_t *v)
{
return v->counter;
}
- Atomicity does not guarantee consistent ordering (use memory barriers to enforce ordering)
- 64-bit architectures
- atomic_t cannot have a different size on different architectures
- atomic_t is consistently 32-bit across architectures
- atomic64_t is used for 64-bit atomic operations on 64-bit machines
- nearly all of the atomic operations listed above have a 64-bit version
- all corresponding functions are shown below (with a usage sketch after the listing)
typedef struct {
volatile long counter;
} atomic64_t;
ATOMIC64_INIT(long i)
long atomic64_read(atomic64_t *v)
void atomic64_set(atomic64_t *v, long i)
void atomic64_add(long i, atomic64_t *v)
void atomic64_sub(long i, atomic64_t *v)
void atomic64_inc(atomic64_t *v)
void atomic64_dec(atomic64_t *v)
int atomic64_sub_and_test(long i, atomic64_t *v)
int atomic64_add_negative(long i, atomic64_t *v)
long atomic64_add_return(long i, atomic64_t *v)
long atomic64_sub_return(long i, atomic64_t *v)
long atomic64_inc_return(atomic64_t *v)
long atomic64_dec_return(atomic64_t *v)
int atomic64_dec_and_test(atomic64_t *v)
int atomic64_inc_and_test(atomic64_t *v)
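- A small sketch of the 64-bit variants in use; the counter name is hypothetical and mirrors the atomic_t example above:
/* define and zero-initialize a 64-bit atomic counter */
atomic64_t bytes_transferred = ATOMIC64_INIT(0);

/* bytes_transferred += 512 (atomically) */
atomic64_add(512, &bytes_transferred);

/* read the current value back as a long */
printk("%ld\n", atomic64_read(&bytes_transferred));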
- Atomic Bit Operations
- architecture-specific code; see asm/bitops.h for details
- operate on generic pointers
- on 32-bit machines
- bit 31 is the most significant bit
- bit 0 is the least significant bit
- Non-atomic versions of the bit operations are also provided, with their names prefixed by a double underscore
- test_bit() is the atomic version, __test_bit() the non-atomic version
- when you don't need locking, use the non-atomic versions for performance
/**
 * Atomically set the nr-th bit starting from addr.
 */
void set_bit(int nr, void *addr)
/**
 * Atomically clear the nr-th bit starting from addr.
 */
void clear_bit(int nr, void *addr)
/**
 * Atomically flip the value of the nr-th bit starting from addr.
 */
void change_bit(int nr, void *addr)
/**
 * Atomically set the nr-th bit starting from addr and return the
 * previous value.
 */
int test_and_set_bit(int nr, void *addr)
/**
 * Atomically clear the nr-th bit starting from addr and return the
 * previous value.
 */
int test_and_clear_bit(int nr, void *addr)
/**
 * Atomically flip the nr-th bit starting from addr and return the
 * previous value.
 */
int test_and_change_bit(int nr, void *addr)
/**
 * Atomically return the value of the nr-th bit starting from addr.
 */
int test_bit(int nr, void *addr)
- Example usage: setting and clearing bits atomically
unsigned long word = 0;
/* bit zero is now set (atomically) */
set_bit(0, &word);
/* bit one is now set (atomically) */
set_bit(1, &word);
/* will print "3" */
printk("%lu\n", word);
/* bit one is now unset (atomically) */
clear_bit(1, &word);
/*bit zero is flipped; now it is unset (atomically) */
change_bit(0, &word);
/* atomically sets bit zero and returns the previous value (zero) */
if (test_and_set_bit(0, &word)) {
/* never true ... */
}
/* the following is legal; you can mix atomic bit instructions with normal C */
word = 7;
Spin Locks
- Allow for creation of critical regions in code
- spinlock can be held by at most one thread at a time
- contending threads busy loop waiting for the lock to unlock
- if lock is uncontended acquire lock and proceed instantly
- the busy loop will "waste" processor time, thus the length of time a spinlock is held must be minimized
- advantage: does not result in context switches
- can be used in contexts which do not support blocking, e.g. interrupt context (or with preemption disabled?)
- Spin Lock Methods
- Defined in linux/spinlock.h
DEFINE_SPINLOCK(my_lock);
// acquire the spin lock
spin_lock(&my_lock);
/* ... place critical region here ... */
//release the spin lock
spin_unlock(&my_lock);
- On uniprocessor systems spinlocks compile away
- "some people" say they still have side effects and that spinlocks are not no-ops
- need to verify
- they act as markers to disable and enable kernel preemption
- Spinlocks are not recursive
- trying to acquire the same lock again will hang the thread of execution
- Using a spinlock in interrupt context requires disabling local interrupts on this processor
- otherwise an interrupt handler could interrupt kernel code that already holds the lock and spin forever on the lock this processor already holds, deadlocking
- Thus need to use the methods which also disable interrupts
- spin_lock_irqsave
- saves the current state of interrupts
- disables interrupts locally
- obtains the spinlock
- spin_unlock_irqrestore
- unlocks the given lock
- returns interrupts to their previous state
- this is key: if interrupts were disabled on entry, then they should stay disabled on exit
- allows nesting of different spinlocks and interrupt-disabling code
- Locks should be associated with data structures which they lock
- make association of lock with data clearer by naming conventions
DEFINE_SPINLOCK(my_lock);
unsigned long flags;
spin_lock_irqsave(&my_lock, flags);
/* critical region ... */
spin_unlock_irqrestore(&my_lock, flags);
- Kernel Configs to help Spin Lock Debugging
CONFIG_DEBUG_SPINLOCK
- detects use of an uninitialized spinlock
- detects unlocking a lock that was never locked
CONFIG_DEBUG_LOCK_ALLOC
- debugging lock lifecycles?
- Dynamic Allocation of Spinlock
- spin_lock_init() initializes a dynamically created spinlock (i.e. one you only have a pointer to)
- spin_trylock() tries to obtain the lock: returns 0 on failure, non-zero on success (see the sketch after the method listing)
/* Acquires given lock */
spin_lock()
/* Disables local interrupts and acquires given lock */
spin_lock_irq()
/* Saves current state of local interrupts, disables local
   interrupts, and acquires given lock */
spin_lock_irqsave()
/* Releases given lock */
spin_unlock()
/* Releases given lock and enables local interrupts */
spin_unlock_irq()
/* Releases given lock and restores local interrupts to given
   previous state */
spin_unlock_irqrestore()
/* Dynamically initializes given spinlock_t */
spin_lock_init()
/* Tries to acquire given lock; returns nonzero on success,
   zero if the lock is unavailable */
spin_trylock()
/* Returns nonzero if the given lock is currently acquired,
   otherwise returns zero */
spin_is_locked()
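- A minimal sketch of the non-blocking path with spin_trylock(), assuming a lock defined with DEFINE_SPINLOCK(my_lock) as above:
if (spin_trylock(&my_lock)) {
        /* lock acquired: do the (short) critical work */
        spin_unlock(&my_lock);
} else {
        /* lock was contended: skip or defer the work instead of spinning */
}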
- Spin Locks and Bottom Halves
- Bottom halves can preempt process context
- two tasklets of the same type never run concurrently
- a tasklet never preempts another tasklet on the same processor
- interrupt context can preempt both bottom halves and process context
- process-context code sharing data with a bottom half must disable bottom-half processing while holding the lock (see the sketch below)
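- A sketch of that rule using the bottom-half variants of the spin lock API; the lock name is hypothetical:
DEFINE_SPINLOCK(my_bh_lock);

/* process context: disable bottom halves while holding the lock, so a
 * tasklet on this CPU cannot preempt us and spin on the same lock */
spin_lock_bh(&my_bh_lock);
/* ... touch the data shared with the bottom half ... */
spin_unlock_bh(&my_bh_lock);

/* inside the tasklet itself a plain spin_lock()/spin_unlock() is enough,
 * since a tasklet never preempts another tasklet on the same processor */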
- Reader-Writer Spin Locks
- Dealing with the asymmetrical nature of reading and writing.
- Writing demands mutual exclusion
- no reading permitted
- no other writing permitted
- Reading requires write exclusion
- no writers permitted
- multiple readers are ok
- Reader-writer spinlocks increase throughput by allowing multiple readers
- Provide separate read and write variants of the lock
- one or more readers can hold the read lock
- only one writer, with no concurrent readers, can hold the write lock
- alternative terminology: (reader/writer), (shared/exclusive), (concurrent/exclusive)
DEFINE_RWLOCK(mr_rwlock);
// Attain read lock
read_lock(&mr_rwlock);
/* critical section (read only) ... */
read_unlock(&mr_rwlock);
// Attain write lock
write_lock(&mr_rwlock);
/* critical section (read and write) ... */
write_unlock(&mr_rwlock);
- read locks cannot later be upgraded to write locks without first releasing the read lock and then acquiring the write lock
read_lock(&mr_rwlock);
// Deadlock! The write lock will never be acquired, since the read
// lock is never released
write_lock(&mr_rwlock);
- reader locks are recursive: the same thread can re-acquire the same read lock without deadlocking
- thus if interrupt handlers only ever take the read lock, there is no need to disable interrupts for readers
- use read_lock() instead of read_lock_irqsave()
- must still use write_lock_irqsave() for writers when the lock is used from interrupt-related code
// Acquires given lock for reading
read_lock()
// Disables local interrupts and acquires given lock for reading
read_lock_irq()
// Saves the current state of local interrupts, disables local
// interrupts, and acquires the given lock for reading
read_lock_irqsave()
// Releases given lock for reading
read_unlock()
// Releases given lock and enables local interrupts
read_unlock_irq()
// Releases given lock and restores local interrupts to the
// given previous state
read_unlock_irqrestore()
// Acquires given lock for writing
write_lock()
// Disables local interrupts and acquires the given lock for writing
write_lock_irq()
// Saves current state of local interrupts, disables local
// interrupts, and acquires the given lock for writing
write_lock_irqsave()
// Releases given lock
write_unlock()
// Releases given lock and enables local interrupts
write_unlock_irq()
// Releases given lock and restores local interrupts to given
// previous state
write_unlock_irqrestore()
// Tries to acquire given lock for writing; returns nonzero on
// success, zero if the lock is unavailable
write_trylock()
// Initializes given rwlock_t
rwlock_init()
- Need to think about an appropriate priority for writers, else many readers may starve a writer.
- Spinlocks are for small timescales (nanoseconds?). For longer hold times, blocking semaphores are advisable instead.
Semaphores
- For longer timescales (millisecond scale), where putting the process context to sleep is advisable
- Sets the processor free to execute other code.
- Should only be obtained in process context; cannot be used in interrupt context, since interrupt context is not schedulable.
- Cannot hold a spinlock while acquiring a semaphore, since you may sleep and cause a deadlock on the spinlock (how is this enforced?)
- Semaphores don't disable kernel preemption, so they don't hurt scheduling latency as much as spinlocks
- Counting vs Binary Semaphores
- Unlike spinlocks, semaphores allow an arbitrary number of simultaneous lock holders
- the usage count or count is the number of permissible simultaneous lock holders
- Binary semaphore/mutex - enforces mutual exclusion - usage count is one
- Counting semaphore - usage count greater than one (a sketch follows the mutex example below)
- While the count is greater than zero, acquiring the semaphore succeeds.
- asm/semaphore.h
- struct semaphore is the key structure
struct semaphore name;
sema_init(&name, count);
// or alternatively, for declaring and initializing a static mutex (binary semaphore)
static DECLARE_MUTEX(name);
* `init_MUTEX(sem)` - initialize a dynamically created semaphore
* To try to acquire, use the `down_*()` methods.
* To release, use `up()`.
* `down_interruptible()`
  * on failure to acquire the semaphore, puts the calling process into `TASK_INTERRUPTIBLE` and sleeps
  * if the process receives a `signal`, it wakes up and fails by returning `-EINTR`
* `down()`
  * places the task in `TASK_UNINTERRUPTIBLE`, then sleeps
  * the process will not respond to signals
  * prefer `down_interruptible()` over `down()`
* `down_trylock()`
  * tries to acquire the semaphore without blocking
  * if the semaphore is already held, returns non-zero immediately
/* define and declare a semaphore, named mr_sem, with a count of one */
static DECLARE_MUTEX(mr_sem);
/* attempt to acquire the semaphore ... */
if (down_interruptible(&mr_sem)) {
/* signal received, semaphore not acquired ... */
}
/* critical region ... */
/* release the given semaphore */
up(&mr_sem);
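- A sketch of a counting semaphore; the name and the count of four are hypothetical:
/* allow at most four tasks into the region at once */
struct semaphore worker_sem;
sema_init(&worker_sem, 4);

if (down_interruptible(&worker_sem)) {
        /* interrupted by a signal before a slot became free ... */
}
/* up to four tasks may execute here concurrently ... */
up(&worker_sem);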
- Semaphore Methods
/**
* Initializes the dynamically created semaphore
* to the given count
*/
sema_init(struct semaphore *, int)
/**
* Initializes the dynamically created semaphore
* with a count of one
*/
init_MUTEX(struct semaphore *)
/**
* Initializes the dynamically created semaphore
* with a count of zero (so it is initially locked)
*/
init_MUTEX_LOCKED(struct semaphore *)
/**
* Tries to acquire the given semaphore and
* enter interruptible sleep if it is contended
*/
down_interruptible(struct semaphore *)
/**
* Tries to acquire the given semaphore and
* enter uninterruptible sleep if it is contended
*/
down(struct semaphore *)
/**
* Tries to acquire the given semaphore and
* immediately return nonzero if it is contended
*/
down_trylock(struct semaphore *)
/**
* Releases the given semaphore and wakes a
* waiting task, if any
*/
up(struct semaphore *)
Reader-Writer Semaphores
- Similar to reader-writer spinlocks; declared in linux/rwsem.h
- Static Creation
static DECLARE_RWSEM(name);
- Dynamic Creation
init_rwsem(struct rw_semaphore *sem)
- Enforce mutual exclusion among writers with a usage count of 1
- Allow simultaneous readers
- All readers and writers use uninterruptible sleep
// statically declare a reader-writer semaphore
static DECLARE_RWSEM(mr_rwsem);
/* attempt to acquire the semaphore for reading ... */
down_read(&mr_rwsem);
/* critical region (read only) ... */
/* release the semaphore */
up_read(&mr_rwsem);
/* attempt to acquire the semaphore for writing ... */
down_write(&mr_rwsem);
/* critical region (read and write) ... */
/* release the semaphore */
up_write(&mr_rwsem);
- For non-blocking lock acquisition use down_read_trylock() and down_write_trylock()
- return non-zero if the lock is acquired
- return zero if the lock cannot be acquired
- downgrade_write() converts a held write lock to a read lock (see the sketch below)
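- A sketch of downgrade_write(), reusing the mr_rwsem declared above:
down_write(&mr_rwsem);
/* ... modify the shared data exclusively ... */
downgrade_write(&mr_rwsem);     /* now held for reading only */
/* ... keep reading alongside any other readers ... */
up_read(&mr_rwsem);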
Mutexes
- struct mutex is used to simplify the common (mutual exclusion) use case of semaphores
- Static Definition
DEFINE_MUTEX(name)
- Dynamic Definition
mutex_init(&mutex)
- Locking and Unlocking Mutex
mutex_lock(&mutex);
/* critical region ... */
mutex_unlock(&mutex);
- Does not require usage counts
/**
* Locks the given mutex; sleeps if the lock is
* unavailable
*/
mutex_lock(struct mutex *)
/**
* Unlocks the given mutex
*/
mutex_unlock(struct mutex *)
/**
 * Tries to acquire the given mutex; returns one if
 * successful and the lock is acquired, zero otherwise
 */
mutex_trylock(struct mutex *)
/**
 * Returns one if the lock is locked and zero otherwise
 */
mutex_is_locked(struct mutex *)
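- A sketch combining dynamic initialization with the non-blocking path; the device structure and function are hypothetical:
struct my_device {
        struct mutex lock;      /* initialized at setup with mutex_init(&dev->lock) */
};

static int my_device_poke(struct my_device *dev)
{
        if (!mutex_trylock(&dev->lock))
                return -EBUSY;  /* already held; caller can retry later */
        /* ... critical region ... */
        mutex_unlock(&dev->lock);
        return 0;
}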
- Constraints imposed on mutex usage
- usage count is always one: only a single task can hold the mutex at a time
- only the original locker is allowed to unlock the mutex
- recursive locks and unlocks are not allowed
- a process cannot exit while holding a mutex
- a mutex cannot be acquired by an interrupt handler or bottom half
- constraints are checked via CONFIG_DEBUG_MUTEXES
- Usage guidelines
- start with a mutex; use a semaphore only if the constraints cannot be satisfied
- When to use Spin Locks vs Semaphores/Mutexes?

| Requirement                         | Recommended lock       |
|-------------------------------------|------------------------|
| Low overhead locking                | Spin lock              |
| Short lock hold time                | Spin lock              |
| Long lock hold time                 | Mutex is preferred     |
| Need to lock from interrupt context | Spin lock is required  |
| Need to sleep while holding lock    | Mutex is required      |
Completion Variables
- Event signalling between tasks
- One task waits for signal, other task signals on completion
- On completion wake up all sleeping tasks
- Ex. a completion variable is used to wake up the parent when the child exits
- See kernel/sched.c and kernel/fork.c
- Static creation:
DECLARE_COMPLETION(mr_comp);
- Dynamic creation: init_completion()
- To wait for completion, call wait_for_completion()
- To signal completion, call complete() (see the sketch after the method listing below)
/**
* Initializes the given dynamically created
* completion variable
*/
init_completion(struct completion *)
/**
* Waits for the given completion variable
* to be signaled
*/
wait_for_completion(struct completion *)
/**
* Signals any waiting tasks to wake up
*/
complete(struct completion *)
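- A sketch of the signalling pattern; the completion name and the two code paths are hypothetical:
static DECLARE_COMPLETION(work_done);

/* waiting side (process context): block until the event is signalled */
wait_for_completion(&work_done);
/* ... the work has finished, continue ... */

/* signalling side, e.g. at the end of the worker path: */
complete(&work_done);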
BKL: The big kernel lock
- Spinlock used to ease transition to SMP
- global spinlock
- Lock dropped on sleep
- Is a recursive lock: the same process may acquire it multiple times
- New users are forbidden
- The kernel is transitioning away from it
- lock_kernel() acquires the lock
- unlock_kernel() releases the lock (once per acquisition, since the lock is recursive)
- kernel_locked() returns nonzero if the lock is held, zero if it is not
- declared in linux/smp_lock.h
- problem with a single global lock: it is difficult to determine which two parties actually need to synchronize with each other
Sequential Locking
- Simple mechanism for reading/writing shared data
- maintains a sequence counter
- when the sequence counter is odd, a write is taking place
- readers read the sequence counter before and after the read; if it is unchanged and even, no write was underway
// define a seq lock
seqlock_t mr_seq_lock = DEFINE_SEQLOCK(mr_seq_lock);
// Create a write lock region
write_seqlock(&mr_seq_lock);
/* write lock is obtained... */
write_sequnlock(&mr_seq_lock);
// Read path
unsigned long seq;
do {
seq = read_seqbegin(&mr_seq_lock);
/* read data here ... */
} while (read_seqretry(&mr_seq_lock, seq));
- The write lock always succeeds as long as there are no other writers
- Seq locks favor writers over readers
- Ideal use cases
- many readers
- few writers
- readers never starve writers
- simple data
- Example: the seqlock protecting jiffies
u64 get_jiffies_64(void)
{
unsigned long seq;
u64 ret;
do {
seq = read_seqbegin(&xtime_lock);
ret = jiffies_64;
} while (read_seqretry(&xtime_lock, seq)); // if the read was inconsistent, try again
return ret;
}
// update jiffies in timer interrupt
write_seqlock(&xtime_lock);
jiffies_64 += 1;
write_sequnlock(&xtime_lock);
Preemption Disabling
- kernel is preemptive
- spin locks disable preemption for the duration they are held
- when modifying per-processor data, disable preemption to prevent corrupting the in-flight data
- preempt_disable() and preempt_enable() can be nested
- the methods maintain a count of the number of times preemption has been disabled
- if the count is 0 the kernel is preemptive
/**
 * Disables kernel preemption by incrementing the
 * preemption counter
 */
preempt_disable()
/**
 * Decrements the preemption counter and checks and
 * services any pending reschedules if the count is now zero
 */
preempt_enable()
/**
 * Enables kernel preemption but does not check for any
 * pending reschedules
 */
preempt_enable_no_resched()
/* Returns the preemption count */
preempt_count()
- Alternatively, calling get_cpu() to obtain an index into per-processor data also disables kernel preemption (see the sketch below).
- put_cpu() re-enables kernel preemption.
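- A sketch of the get_cpu()/put_cpu() pattern; the per-processor array is hypothetical:
int cpu;

/* disable kernel preemption and get the current processor number */
cpu = get_cpu();

/* safe: we cannot be migrated off this CPU while preemption is disabled */
my_percpu_counter[cpu]++;

/* reenable kernel preemption */
put_cpu();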
Ordering and Barriers
- Ensuring ordering of memory loads and stores
- Compiler and Processors like to reorder loads and stores
- Barriers prevent compiler and processor from reordering instructions
- x86 does not do out-of-order stores
- other processors might do out-of-order stores
- reorderings are constrained by data dependencies:
a = 1;
b = a;
- there is a data dependency between b and a, so these cannot be reordered
- static reordering: done by the compiler, reflected in the object code
- dynamic reordering: done by the processor at run time
- rmb() - a read memory barrier
- loads before the rmb() will never be reordered with loads after the call
- read_barrier_depends()
- a read barrier only for loads where the subsequent load depends on the previous load
- much quicker on some architectures
- on some architectures it becomes a noop
- wmb() - a write memory barrier
- stores before the wmb() will never be reordered with stores after the call
- x86 does not reorder stores
- mb() - a read/write memory barrier
- loads and stores before the mb() will never be reordered with loads and stores after the call
- smp_rmb(), smp_wmb() and smp_read_barrier_depends()
- on SMP kernels these are the usual memory barriers; on uniprocessor kernels they reduce to compiler barriers
- compiler barriers are nearly free, since they only prevent static (compile-time) rearrangement
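- A sketch of paired barriers between two threads of execution, along the lines of the book's discussion; a, b, c and d are shared variables:
/* Thread 1 */
a = 3;
wmb();  /* ensure the store to a is visible before the store to b */
b = 4;

/* Thread 2 (possibly on another processor) */
c = b;
rmb();  /* ensure b is loaded before a */
d = a;

/* with the barriers paired this way, if thread 2 sees the new value of
 * b (c == 4) it is guaranteed to also see the new value of a (d == 3) */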