Notes on memory management in the Linux Kernel
23 Jul 2015

Just a small collection of notes on memory management in the kernel. The underlying reference is the Linux Kernel Development book by Robert Love.
Why is memory allocation in the kernel hard?
- Not easy to deal with memory allocation errors.
- The kernel often cannot sleep.
- It requires special primitives different from userspace.
Pages
- Physical pages act as the basic unit of memory management
- Different from the processor's smallest addressable unit (a byte or word)
- Hardware provides assistance via the MMU (memory management unit); virtual memory pages are the smallest unit it manages
- Page sizes are architecture specific: most 32-bit architectures have a 4KB page size, while many 64-bit architectures have an 8KB page size
- The kernel keeps track of pages with the `struct page` structure
- The pages tracked are actual physical pages, not virtual pages
- `struct page` is defined in `<linux/mm_types.h>`
struct page {
unsigned long flags;
atomic_t _count;
atomic_t _mapcount;
unsigned long private;
struct address_space *mapping;
pgoff_t index;
struct list_head lru;
void *virtual;
};
- `flags` stores the status of the page; 32 different flags are available, see `<linux/page-flags.h>`
- See the `pageflags` enum for the list of states
enum pageflags {
PG_locked, /* Page is locked. Don't touch. */
PG_error,
PG_referenced,
PG_uptodate,
PG_dirty,
PG_lru,
PG_active,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
PG_private, /* If pagecache, has fs-private data */
PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
#ifdef CONFIG_PAGEFLAGS_EXTENDED
PG_head, /* A head page */
PG_tail, /* A tail page */
#else
PG_compound, /* A compound page */
#endif
PG_swapcache, /* Swap page: swp_entry_t in private */
PG_mappedtodisk, /* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked, /* Page is backed by RAM/swap */
PG_unevictable, /* Page is "unevictable" */
#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
PG_uncached, /* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
__NR_PAGEFLAGS,
/* Filesystems */
PG_checked = PG_owner_priv_1,
/* Two page bits are conscripted by FS-Cache to maintain local caching
* state. These bits are set on pages belonging to the netfs's inodes
* when those inodes are being locally cached.
*/
PG_fscache = PG_private_2, /* page backed by cache */
/* XEN */
PG_pinned = PG_owner_priv_1,
PG_savepinned = PG_dirty,
/* SLOB */
PG_slob_free = PG_private,
};
- `_count` stores the usage count of the page; a negative count indicates the page is free for allocation
  - access it via the `page_count()` macro, which performs an atomic read (see the sketch after this list)
- `virtual` stores the virtual address of the page
  - for pages in high memory (`ZONE_HIGHMEM`), which are mapped only as needed, this field is `NULL`
- This structure keeps track of data about the physical pages themselves, not the data they contain
- `struct page` consumes about 40 bytes
- assuming a 4GB system with an 8KB page size, that is 524,288 pages
- which means about 20MB of `struct page` structures in memory (524,288 × 40 bytes), a small price for tracking all of physical memory
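The `page_count()` sketch promised above (here `page` is assumed to be a `struct page` obtained elsewhere):

/* page_count() returns zero for a free page and a
 * positive value when the page is in use */
if (!page_count(page)) {
    /* the page is free for allocation */
}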
Zones
- Allows for non-uniform treatment of pages
- Zones group pages of similar properties
- Examples of hardware limitations that motivate zones:
  - being able to perform DMA (direct memory access) only within certain memory addresses
  - large physical addressability vs small virtual addressability, which requires some pages to not be permanently mapped into the kernel address space
- Four primary zones in Linux:
  - `ZONE_DMA` - pages that can undergo DMA
  - `ZONE_DMA32` - pages that can undergo DMA and are accessible only to 32-bit devices
  - `ZONE_NORMAL` - regularly mapped pages
  - `ZONE_HIGHMEM` - pages not permanently mapped into the kernel address space
- Zones are defined in `<linux/mmzone.h>`
- Actual usage of zones is architecture dependent:
  - ISA devices on x86-32 can perform DMA only in the first 16MB of memory
  - on x86-32, `ZONE_DMA` covers 0-16MB
  - on x86-32, `ZONE_NORMAL` covers 16MB-896MB
  - on 32-bit, `ZONE_HIGHMEM` covers all memory above 896MB
  - on x86-64 there is no high memory; all memory is mappable and sits in `ZONE_NORMAL`
- Thus a zone provides some logical grouping of pages
- Some key fields in `struct zone`:
  - `lock` - spinlock to prevent concurrent modification
  - `watermark` - minimum and low watermark values for this zone
  - `name` - null-terminated string, initialized during boot in `mm/page_alloc.c`
struct zone {
/* Fields commonly accessed by the page allocator */
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];
/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
* drift allowing watermarks to be breached
*/
unsigned long percpu_drift_mark;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
/*
* This is a per-zone reserve of pages that should not be
* considered dirtyable memory.
*/
unsigned long dirty_balance_reserve;
#ifdef CONFIG_NUMA
int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif
struct per_cpu_pageset __percpu *pageset;
/*
* free areas of different sizes
*/
spinlock_t lock;
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
/* pfn where compaction free scanner should start */
unsigned long compact_cached_free_pfn;
/* pfn where async and sync compaction migration scanner should start */
unsigned long compact_cached_migrate_pfn[2];
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif
struct free_area free_area[MAX_ORDER];
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
#endif
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct lruvec lruvec;
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
unsigned int inactive_ratio;
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
/*
* wait_table -- the array holding the hash table
* wait_table_hash_nr_entries -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
/*
* spanned_pages is the total pages spanned by the zone, including
* holes, which is calculated as:
* spanned_pages = zone_end_pfn - zone_start_pfn;
*
* present_pages is physical pages existing within the zone, which
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
* managed_pages = present_pages - reserved_pages;
*
* So present_pages may be used by memory hotplug or memory power
* management logic to figure out unmanaged pages by checking
* (present_pages - managed_pages). And managed_pages should be used
* by page allocator and vm scanner to calculate all kinds of watermarks
* and thresholds.
*
* Locking rules:
*
* zone_start_pfn and spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path. But, it is written
* quite infrequently.
*
* The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/end(). Any reader who can't tolerant drift of
* present_pages should get_online_mems() to get a stable value.
*
* Read access to managed_pages should be safe because it's unsigned
* long. Write access to zone->managed_pages and totalram_pages are
* protected by managed_page_count_lock at runtime. Idealy only
* adjust_managed_page_count() should be used instead of directly
* touching zone->managed_pages and totalram_pages.
*/
unsigned long spanned_pages;
unsigned long present_pages;
unsigned long managed_pages;
/*
* Number of MIGRATE_RESEVE page block. To maintain for just
* optimization. Protected by zone->lock.
*/
int nr_migrate_reserve_block;
/*
* rarely used fields:
*/
const char *name;
} ____cacheline_internodealigned_in_smp;
Getting Pages
- methods for requesting memory at page-size granularity are declared in `<linux/gfp.h>`

struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)

- the core function for fetching pages
- allocates 2^order, i.e. `(1 << order)`, contiguous pages
- returns a pointer to the first page
- convert to the logical address where the page resides using `void * page_address(struct page *page)`
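A minimal sketch of this pair in use (the order of 2, i.e. four pages, is arbitrary):

struct page *page;
void *addr;

/* allocate 2^2 = four contiguous physical pages */
page = alloc_pages(GFP_KERNEL, 2);
if (!page)
    return -ENOMEM;

/* logical address of the first of the four pages */
addr = page_address(page);

/* ... use the memory ... */

__free_pages(page, 2);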
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

- returns the logical address where the first page resides
- used when `struct page` is not required
- the returned pages are physically contiguous, following the first
unsigned long get_zeroed_page(unsigned int gfp_mask)

- useful for pages handed to userspace, since zero-filling prevents leaking sensitive data
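A sketch of its use, mirroring the `__get_free_pages()` pattern shown below:

unsigned long page;

/* one page, zero-filled so no stale data can leak to userspace */
page = get_zeroed_page(GFP_KERNEL);
if (!page)
    return -ENOMEM;

/* ... hand the page's contents to userspace ... */

free_page(page);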
- Freeing pages:

void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
void free_page(unsigned long addr)

- a double free is a serious problem in the kernel
- all allocations can fail and return `NULL` (or a zero logical address), so always check
unsigned long page;

/* allocate 2^3 = eight pages; page is the logical address */
page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
    /* insufficient memory: you must handle this error! */
    return -ENOMEM;
}

/* 'page' is now the address of the first of eight contiguous pages ... */

/* free the eight pages */
free_pages(page, 3);
kmalloc()

- useful for byte-sized allocations
- declared in `<linux/slab.h>`
- the preferred choice when allocations are not multiples of the page size

void * kmalloc(size_t size, gfp_t flags)

- the returned memory is at least `size` bytes in length
struct dog *p;
p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
/* handle error ... */
gfp_mask Flags

- flags are defined in `<linux/types.h>` as an `unsigned int`
- Flag types:
  - action modifiers - control how the memory is allocated (e.g. during interrupt handling, allocations should fail rather than sleep)
  - zone modifiers - specify where the memory is allocated from
  - types - specify a combination of action and zone modifiers, e.g. `GFP_KERNEL` for code in process context
- Action Modifiers
+------------------------------------------------------------------------+
| Flag | Description |
+------------------------------------------------------------------------+
|__GFP_WAIT | The allocator can sleep. |
|__GFP_HIGH | The allocator can access emergency pools. |
|__GFP_IO | The allocator can start disk I/O. |
|__GFP_FS | The allocator can start filesystem I/O. |
|__GFP_COLD | The allocator should use cache cold pages. |
|__GFP_NOWARN | The allocator does not print failure warnings. |
|__GFP_REPEAT | The allocator repeats the allocation if it |
| | fails, but the allocation can potentially fail. |
|__GFP_NOFAIL | The allocator indefinitely repeats the allocation. |
| | The allocation cannot fail. |
| | |
|__GFP_NORETRY | The allocator never retries if the allocation |
| | fails. |
|__GFP_NOMEMALLOC | The allocator does not fall back on reserves. |
|__GFP_HARDWALL | The allocator enforces “hardwall” cpuset boundaries.|
|__GFP_RECLAIMABLE | The allocator marks the pages reclaimable. |
|__GFP_COMP | The allocator adds compound page metadata |
+------------------------------------------------------------------------+
- Sample usage
// can sleep, can start disk I/O, can perform filesystem operations
ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);
- `kmalloc` calls ultimately use `alloc_pages`, so the flags above grant great flexibility in page allocation
- Zone Modifiers
+------------------------------------------------------------+
|Flag | Description |
+------------------------------------------------------------+
|__GFP_DMA | Allocates only from ZONE_DMA |
|__GFP_DMA32 | Allocates only from ZONE_DMA32 |
|__GFP_HIGHMEM | Allocates from ZONE_HIGHMEM or ZONE_NORMAL |
+------------------------------------------------------------+
- Use `GFP_DMA` if you must have DMA-able memory
- Use `__GFP_HIGHMEM` if high memory is acceptable
  - only `alloc_pages` can return high memory, since a logical address cannot be returned for memory not mapped into the kernel's virtual address space
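For instance, a driver that needs DMA-able memory might combine the zone modifier with a type flag (a sketch; `BUF_SIZE` is a made-up constant):

char *buf;

/* GFP_DMA (== __GFP_DMA) restricts the allocation to ZONE_DMA */
buf = kmalloc(BUF_SIZE, GFP_DMA | GFP_KERNEL);
if (!buf)
    return -ENOMEM;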
- Type Flags
- specify combinations of action and zone modifiers
- simpler and less error prone
+---------------------------------------------------------------------+
|Flag         | Description                                           |
+---------------------------------------------------------------------+
|GFP_ATOMIC   | The allocation is high priority and must not sleep.   |
|             | This is the flag to use in interrupt handlers, in     |
|             | bottom halves, while holding a spinlock, and in other |
|             | situations where you cannot sleep.                    |
|             |                                                       |
|GFP_NOWAIT   | Like GFP_ATOMIC, except that the call will not fall   |
|             | back on emergency memory pools. This increases the    |
|             | likelihood of the memory allocation failing.          |
|             |                                                       |
|GFP_NOIO     | This allocation can block, but must not initiate      |
|             | disk I/O. This is the flag to use in block I/O code   |
|             | when you cannot cause more disk I/O, which might lead |
|             | to some unpleasant recursion.                         |
|             |                                                       |
|GFP_NOFS     | This allocation can block and can initiate disk I/O,  |
|             | if it must, but it will not initiate a filesystem     |
|             | operation. This is the flag to use in filesystem code |
|             | when you cannot start another filesystem operation.   |
|             |                                                       |
|GFP_KERNEL   | This is a normal allocation and might block. This is  |
|             | the flag to use in process context code when it is    |
|             | safe to sleep. The kernel will do whatever it has to  |
|             | do to obtain the memory requested by the caller. This |
|             | flag should be your default choice.                   |
|             |                                                       |
|GFP_USER     | This is a normal allocation and might block. This     |
|             | flag is used to allocate memory for user-space        |
|             | processes.                                            |
|             |                                                       |
|GFP_HIGHUSER | This is an allocation from ZONE_HIGHMEM and might     |
|             | block. This flag is used to allocate memory for       |
|             | user-space processes.                                 |
|             |                                                       |
|GFP_DMA      | This is an allocation from ZONE_DMA. Device drivers   |
|             | that need DMA-able memory use this flag, usually in   |
|             | combination with one of the preceding flags.          |
+---------------------------------------------------------------------+
GFP_NOFS (__GFP_WAIT | __GFP_IO)
GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
GFP_DMA (__GFP_DMA)
- `GFP_KERNEL` is the most frequently used flag
  - can block, so it cannot be used in interrupt context
  - has a high probability of succeeding
  - can put the caller to sleep, swap inactive pages to disk, flush dirty pages, etc.
- `GFP_ATOMIC`
  - the most restrictive flag, for memory allocations that cannot sleep
  - if no sufficiently large contiguous chunk is available, the kernel will not try to free memory; the allocation simply fails
  - hence a lower chance of succeeding
kfree()

- frees a block of memory previously allocated with `kmalloc()`

void kfree(const void *ptr)

- a double free is a serious bug
- `kfree(NULL)` is explicitly checked for and is safe
char *buf;
buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (!buf)
/* error allocating memory! */
kfree(buf);
vmalloc()

- allocates memory that is only virtually contiguous, unlike `kmalloc`, whose pages are physically contiguous
- achieved by fixing up the page tables to map the memory into contiguous chunks of the logical address space
- cannot be used when hardware requires physically contiguous pages, since devices that access memory directly are not behind the MMU and see physical addresses
- all such memory appears to the kernel as logically contiguous
- most kernel code still uses `kmalloc` for performance reasons
- `vmalloc` uses a greater number of TLB entries
- `vmalloc` is used rarely, for allocating large regions that might fail with `kmalloc`
- declared in `<linux/vmalloc.h>`, defined in `mm/vmalloc.c`

void * vmalloc(unsigned long size)

- the returned pointer points to at least `size` bytes
- cannot be used in interrupt context, since it may sleep
char *buf;
buf = vmalloc(16 * PAGE_SIZE); /* get 16 pages */
if (!buf)
/* error! failed to allocate memory */
/*
 * buf now points to at least 16*PAGE_SIZE bytes
 * of virtually contiguous memory
 */

/* free the allocation */
vfree(buf);
Slab Layer
- a generalization of the idea of free lists for data of a certain granularity
- tries to avoid the cost of frequent allocation/deallocation
- consolidates the idea of free lists in the kernel
- allows the kernel global management of free lists
- Why do we use the slab allocator?
  - frequently used data structures are allocated and freed often
  - arranging free lists contiguously means less memory fragmentation from frequent alloc/free
  - freed objects are immediately available for subsequent use
  - the allocator is aware of object size, page size, and total cache size
  - using some processor-specific memory in slabs means fewer locks
  - NUMA-aware allocators can be location sensitive in alloc/free
- Design of the slab layer:
  - objects are divided into caches
  - caches are divided into slabs
  - slabs are composed of one or more physically contiguous pages (typically a single page)
  - a slab is in one of three states: full, partial, empty
  - requests are satisfied from partial slabs first
  - if no partial slab exists, the request is satisfied from an empty slab
  - e.g. a cache stores `struct inode` objects (from `inode_cachep`); there is a cache for `task_struct` as well
  - a cache is represented by `struct kmem_cache`, with three lists: `slabs_full`, `slabs_partial`, `slabs_empty`
  - the above lists are stored in `struct kmem_list3`
  - a slab is described by a structure in `mm/slab.c`
- Slab Allocator Interface

int kmem_cache_destroy(struct kmem_cache *cachep)

- invoked on module shutdown to free a cache
- may sleep; don't call it from interrupt context
- the caller must be sure the cache is empty (no active slabs)
- the caller must ensure synchronization

struct kmem_cache * kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *))

- creates a cache and returns a pointer to it
- see `/proc/slabinfo` to inspect caches; `name` shows up there
- `size` - the size of each cache element
- `align` - the offset of the first element
- `ctor` - slab constructor; not really used, but called when new pages are added to the cache
- `flags`:
  - `SLAB_HWCACHE_ALIGN` - instructs the slab layer to align each object within a slab to a cache line
  - `SLAB_POISON` - fill the slab with a known value (a5a5a5a5), used to catch access to uninitialized memory
  - `SLAB_RED_ZONE` - insert red zones around the allocation to detect buffer overruns
  - `SLAB_PANIC` - panic if the allocation fails; indicates that allocations must not fail
  - `SLAB_CACHE_DMA` - each slab must be allocated in DMA-able memory

void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)

- returns a pointer to an object from the cache
- allocates new pages if there are no free slabs
- Example Usage
// cache for task_structs
struct kmem_cache *task_struct_cachep;

// create the cache
task_struct_cachep = kmem_cache_create("task_struct",
                                       sizeof(struct task_struct),
                                       ARCH_MIN_TASKALIGN,
                                       SLAB_PANIC | SLAB_NOTRACK,
                                       NULL);

// allocate a task_struct from the cache as needed
struct task_struct *tsk;
tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
    return NULL;

// free a task_struct and return it to the cache
kmem_cache_free(task_struct_cachep, tsk);
Statically Allocating on the Stack

- kernel stacks are small and fixed in size
- the kernel stack is generally 2 pages per process: 8KB on 32-bit, 16KB on 64-bit
- it is sometimes beneficial to use single-page kernel stacks:
  - less memory fragmentation to deal with
  - otherwise allocation of a new process becomes harder once contiguous page pairs cannot be found
- interrupts historically used the kernel stack of the process they interrupted
  - with single-page interrupt stacks enabled, interrupts get their own stacks instead: one page per processor
- if overrun, kernel stacks overflow into the process's thread_info structure
- thus keep stack allocations to a minimum and use dynamic allocation for anything large
High Memory Pages

- pages obtained from `alloc_pages()` with `__GFP_HIGHMEM`
- since the mapping is not permanent, high memory pages might not have a logical address
- on x86-32, all memory beyond 896MB is high memory (not permanently mapped into the kernel address space)
- x86-32 can theoretically address 2^32 bytes (4GB), and up to 64GB with PAE
- on x86-32, high memory pages get mapped in and out between 3GB and 4GB
- Permanent Mappings - declared in `<linux/highmem.h>`

void * kmap(struct page *page)

- works on both high and low memory
- if the page is in low memory, simply returns its virtual address
- if it is a high memory page, creates a permanent mapping and returns the address
- may sleep, so it only works in process context
- the mapping is permanent, so the caller is responsible for unmapping when finished

void kunmap(struct page *page)

- unmaps the mapping created for a high memory page
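Putting it together, a sketch of mapping a possibly-high-memory page in process context:

struct page *page;
char *addr;

/* a page that may come from high memory */
page = alloc_pages(GFP_HIGHUSER, 0);
if (!page)
    return -ENOMEM;

/* create a (permanent) mapping; may sleep */
addr = kmap(page);

/* ... use addr ... */

kunmap(page);
__free_pages(page, 0);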
- Temporary/Atomic Mappings
  - for creating mappings in interrupt context and other non-blocking contexts

void * kmap_atomic(struct page *page, enum km_type type)

- does not block, so it can be used in non-schedulable contexts
- the `km_type` slots are defined in `<asm-generic/kmap_types.h>`
enum km_type {
KM_BOUNCE_READ,
KM_SKB_SUNRPC_DATA,
KM_SKB_DATA_SOFTIRQ,
KM_USER0,
KM_USER1,
KM_BIO_SRC_IRQ,
KM_BIO_DST_IRQ,
KM_PTE0,
KM_PTE1,
KM_PTE2,
KM_IRQ0,
KM_IRQ1,
KM_SOFTIRQ0,
KM_SOFTIRQ1,
KM_SYNC_ICACHE,
KM_SYNC_DCACHE,
KM_UML_USERCOPY,
KM_IRQ_PTE,
KM_NMI,
KM_NMI_PTE,
KM_TYPE_NR
};
- disables kernel preemption, since the mappings are processor unique

void kunmap_atomic(void *kvaddr, enum km_type type)

- undoes the mapping at `kvaddr`
- on most architectures it does nothing but enable kernel preemption, since a temporary mapping is valid only until the next kernel mapping
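A sketch of a temporary mapping (`page` is assumed to be a high memory `struct page` obtained earlier; `KM_USER0` is one of the slots from the enum above):

char *vaddr;

/* map without sleeping; kernel preemption stays disabled until unmap */
vaddr = kmap_atomic(page, KM_USER0);

/* ... touch vaddr, but do not sleep here ... */

kunmap_atomic(vaddr, KM_USER0);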
Per CPU Allocation
- on SMP, use data unique to a given CPU
- per-CPU data is stored in an array, where each item corresponds to one processor
- `get_cpu()` - gets the current CPU number and disables kernel preemption
- `put_cpu()` - re-enables kernel preemption
unsigned long my_percpu[NR_CPUS];
int cpu;
/* get current processor and disable kernel preemption */
cpu = get_cpu();
// use the cpu specific data
my_percpu[cpu]++;
printk("my_percpu on cpu=%d is %lu\n", cpu, my_percpu[cpu]);
/* enable kernel preemption */
put_cpu();
- locking is not required, since the data is unique to this CPU
- problems with kernel preemption:
  - the `cpu` variable becomes invalid if the kernel is preempted and rescheduled on another processor
  - another thread could then access a half-modified data structure on the same processor
- using `get_cpu()` ensures kernel preemption is disabled while the per-processor data is in use
The percpu Interface

- declared in `<linux/percpu.h>`
- definitions in `mm/slab.c` and `asm/percpu.h`

DEFINE_PER_CPU(type, name);

- defines an instance of a per-CPU variable with the given type and name

get_cpu_var(name) and put_cpu_var(name)

- disable kernel preemption and get the CPU-specific value; re-enable preemption when done

per_cpu(name, cpu)++;

- fetches another processor's per-CPU variable
- dangerous: it neither disables kernel preemption nor provides locking
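A small sketch of the compile-time interface (the counter name is made up):

/* one copy of this counter exists per processor */
DEFINE_PER_CPU(unsigned long, hit_count);

/* disable preemption, bump this CPU's copy, then re-enable preemption */
get_cpu_var(hit_count)++;
put_cpu_var(hit_count);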
- Per-CPU Data at Runtime - see `<linux/percpu.h>`

void *alloc_percpu(type); /* a macro */

- allocates one instance of the given type per processor
- a macro around `__alloc_percpu`
- aligns the allocation on the type's natural boundary using `__alignof__`, a gcc feature that returns the recommended alignment
- the returned pointer indirectly references the dynamically created data
- `get_cpu_var(ptr)` fetches a CPU-specific pointer to the dynamically created data

void *__alloc_percpu(size_t size, size_t align);

- takes the number of bytes to allocate and the alignment

void free_percpu(const void *);

- frees the data on all processors
void *percpu_ptr;
unsigned long *foo;
percpu_ptr = alloc_percpu(unsigned long);
if (!percpu_ptr)
/* error allocating memory .. */
foo = get_cpu_var(percpu_ptr);
/* manipulate foo .. */
put_cpu_var(percpu_ptr);
Reasons for using Per-CPU Data
- reduction in locking
- reduces cache invalidation due to data being modified on other processors
- Cannot sleep in the middle of accessing per-CPU data
Summary/Guidelines for picking an allocation method

- mostly pick between `GFP_ATOMIC` and `GFP_KERNEL` for allocation flags
- to allocate from high memory use `alloc_pages()`, since it returns a `struct page` rather than a logical address
- use `kmap()` to map high memory pages into the kernel
- use `vmalloc()` for large allocations where physically contiguous memory is not a requirement
- if creating and destroying lots of objects of the same type, use a slab cache: it prevents fragmentation and gives faster allocations