Notes on memory management in the Linux Kernel
23 Jul 2015

Just a small collection of notes on memory management in the kernel. The underlying reference is the Linux Kernel Development book by Robert Love.
Why is memory allocation in the kernel hard?
- Not easy to deal with memory allocation errors.
- The kernel often cannot sleep.
- It requires special primitives different from userspace.
Pages
- Physical pages act as the basic unit of memory management
- Different from the processor's smallest addressable unit (a byte or word)
- Hardware provides assistance via the MMU (memory management unit); virtual memory pages are the smallest unit it manages
- Page sizes are architecture specific: most 32-bit architectures have a 4KB page size, while many 64-bit architectures have an 8KB page size
- The kernel keeps track of pages with the `struct page` structure
- The pages tracked are actual physical pages, not virtual pages
- `struct page` is defined in `<linux/mm_types.h>`
struct page {
unsigned long flags;
atomic_t _count;
atomic_t _mapcount;
unsigned long private;
struct address_space *mapping;
pgoff_t index;
struct list_head lru;
void *virtual;
};
- `flags` stores the status of the page; 32 different flags are available, see `<linux/page-flags.h>`
- See the `pageflags` enum for the list of states
enum pageflags {
PG_locked, /* Page is locked. Don't touch. */
PG_error,
PG_referenced,
PG_uptodate,
PG_dirty,
PG_lru,
PG_active,
PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
PG_private, /* If pagecache, has fs-private data */
PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
#ifdef CONFIG_PAGEFLAGS_EXTENDED
PG_head, /* A head page */
PG_tail, /* A tail page */
#else
PG_compound, /* A compound page */
#endif
PG_swapcache, /* Swap page: swp_entry_t in private */
PG_mappedtodisk, /* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked, /* Page is backed by RAM/swap */
PG_unevictable, /* Page is "unevictable" */
#ifdef CONFIG_MMU
PG_mlocked, /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
PG_uncached, /* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
PG_compound_lock,
#endif
__NR_PAGEFLAGS,
/* Filesystems */
PG_checked = PG_owner_priv_1,
/* Two page bits are conscripted by FS-Cache to maintain local caching
* state. These bits are set on pages belonging to the netfs's inodes
* when those inodes are being locally cached.
*/
PG_fscache = PG_private_2, /* page backed by cache */
/* XEN */
PG_pinned = PG_owner_priv_1,
PG_savepinned = PG_dirty,
/* SLOB */
PG_slob_free = PG_private,
};
- `_count` stores the usage count of the page; a negative count indicates the page is free for allocation
  - access it via the `page_count()` macro, which performs an atomic read (see the sketch after this list)
- `virtual` stores the virtual address of the page
  - for pages in high memory (`ZONE_HIGHMEM`), which are mapped only as needed, this field is `NULL`
- This structure keeps track of data about the physical pages themselves, not the data they contain
- `struct page` consumes about 40 bytes
- assuming a 4GB system with an 8KB page size, that is 524,288 pages
- which means about 20MB of `struct page` structures in memory (524,288 × 40 bytes), a small price for tracking all of physical memory
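The `page_count()` sketch promised above (here `page` is assumed to be a `struct page` obtained elsewhere):

/* page_count() returns zero for a free page and a
 * positive value when the page is in use */
if (!page_count(page)) {
    /* the page is free for allocation */
}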
Zones
- Allows for non-uniform treatment of pages
- Zones group pages of similar properties
- Examples of hardware limitations that motivate zones:
  - being able to perform DMA (direct memory access) only within certain memory addresses
  - large physical addressability vs small virtual addressability, which requires some pages to not be permanently mapped into the kernel address space
- Four primary zones in Linux:
  - `ZONE_DMA` - pages that can undergo DMA
  - `ZONE_DMA32` - pages that can undergo DMA and are accessible only to 32-bit devices
  - `ZONE_NORMAL` - regularly mapped pages
  - `ZONE_HIGHMEM` - pages not permanently mapped into the kernel address space
- Zones are defined in `<linux/mmzone.h>`
- Actual usage of zones is architecture dependent:
  - ISA devices on x86-32 can perform DMA only in the first 16MB of memory
  - on x86-32, `ZONE_DMA` covers 0-16MB
  - on x86-32, `ZONE_NORMAL` covers 16MB-896MB
  - on 32-bit, `ZONE_HIGHMEM` covers all memory above 896MB
  - on x86-64 there is no high memory; all memory is mappable and sits in `ZONE_NORMAL`
- Thus a zone provides some logical grouping of pages
- Some key fields in `struct zone`:
  - `lock` - spinlock to prevent concurrent modification
  - `watermark` - minimum and low watermark values for this zone
  - `name` - null-terminated string, initialized during boot in `mm/page_alloc.c`
struct zone {
/* Fields commonly accessed by the page allocator */
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];
/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
* drift allowing watermarks to be breached
*/
unsigned long percpu_drift_mark;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
/*
* This is a per-zone reserve of pages that should not be
* considered dirtyable memory.
*/
unsigned long dirty_balance_reserve;
#ifdef CONFIG_NUMA
int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif
struct per_cpu_pageset __percpu *pageset;
/*
* free areas of different sizes
*/
spinlock_t lock;
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
/* pfn where compaction free scanner should start */
unsigned long compact_cached_free_pfn;
/* pfn where async and sync compaction migration scanner should start */
unsigned long compact_cached_migrate_pfn[2];
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif
struct free_area free_area[MAX_ORDER];
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
#endif
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct lruvec lruvec;
/* Evictions & activations on the inactive file list */
atomic_long_t inactive_age;
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
unsigned int inactive_ratio;
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
/*
* wait_table -- the array holding the hash table
* wait_table_hash_nr_entries -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
/*
* spanned_pages is the total pages spanned by the zone, including
* holes, which is calculated as:
* spanned_pages = zone_end_pfn - zone_start_pfn;
*
* present_pages is physical pages existing within the zone, which
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
* managed_pages = present_pages - reserved_pages;
*
* So present_pages may be used by memory hotplug or memory power
* management logic to figure out unmanaged pages by checking
* (present_pages - managed_pages). And managed_pages should be used
* by page allocator and vm scanner to calculate all kinds of watermarks
* and thresholds.
*
* Locking rules:
*
* zone_start_pfn and spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path. But, it is written
* quite infrequently.
*
* The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/end(). Any reader who can't tolerant drift of
* present_pages should get_online_mems() to get a stable value.
*
* Read access to managed_pages should be safe because it's unsigned
* long. Write access to zone->managed_pages and totalram_pages are
* protected by managed_page_count_lock at runtime. Idealy only
* adjust_managed_page_count() should be used instead of directly
* touching zone->managed_pages and totalram_pages.
*/
unsigned long spanned_pages;
unsigned long present_pages;
unsigned long managed_pages;
/*
* Number of MIGRATE_RESEVE page block. To maintain for just
* optimization. Protected by zone->lock.
*/
int nr_migrate_reserve_block;
/*
* rarely used fields:
*/
const char *name;
} ____cacheline_internodealigned_in_smp;
Getting Pages
- methods for requesting memory at page-size granularity are declared in `<linux/gfp.h>`

struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)

- the core function for fetching pages
- allocates 2^order, i.e. `(1 << order)`, contiguous pages
- returns a pointer to the first page
- convert to the logical address where the page resides using `void * page_address(struct page *page)`
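A minimal sketch of this pair in use (the order of 2, i.e. four pages, is arbitrary):

struct page *page;
void *addr;

/* allocate 2^2 = four contiguous physical pages */
page = alloc_pages(GFP_KERNEL, 2);
if (!page)
    return -ENOMEM;

/* logical address of the first of the four pages */
addr = page_address(page);

/* ... use the memory ... */

__free_pages(page, 2);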
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

- returns the logical address where the first page resides
- used when `struct page` is not required
- the returned pages are physically contiguous, following the first
unsigned long get_zeroed_page(unsigned int gfp_mask)

- useful for pages handed to userspace, since zero-filling prevents leaking sensitive data
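A sketch of its use, mirroring the `__get_free_pages()` pattern shown below:

unsigned long page;

/* one page, zero-filled so no stale data can leak to userspace */
page = get_zeroed_page(GFP_KERNEL);
if (!page)
    return -ENOMEM;

/* ... hand the page's contents to userspace ... */

free_page(page);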
- Freeing pages:

void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
void free_page(unsigned long addr)

- a double free is a serious problem in the kernel
- all allocations can fail and return `NULL` (or a zero logical address), so always check
unsigned long page;

/* allocate 2^3 = eight pages; page is the logical address */
page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
    /* insufficient memory: you must handle this error! */
    return -ENOMEM;
}

/* 'page' is now the address of the first of eight contiguous pages ... */

/* free the eight pages */
free_pages(page, 3);
kmalloc()

- useful for byte-sized allocations
- declared in `<linux/slab.h>`
- the preferred choice when allocations are not multiples of the page size

void * kmalloc(size_t size, gfp_t flags)

- the returned memory is at least `size` bytes in length
struct dog *p;
p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
/* handle error ... */
gfp_mask Flags

- flags are defined in `<linux/types.h>` as an `unsigned int`
- Flag types:
  - action modifiers - control how the memory is allocated (e.g. during interrupt handling, allocations should fail rather than sleep)
  - zone modifiers - specify where the memory is allocated from
  - types - specify a combination of action and zone modifiers, e.g. `GFP_KERNEL` for code in process context
- Action Modifiers
+------------------------------------------------------------------------+
| Flag | Description |
+------------------------------------------------------------------------+
|__GFP_WAIT | The allocator can sleep. |
|__GFP_HIGH | The allocator can access emergency pools. |
|__GFP_IO | The allocator can start disk I/O. |
|__GFP_FS | The allocator can start filesystem I/O. |
|__GFP_COLD | The allocator should use cache cold pages. |
|__GFP_NOWARN | The allocator does not print failure warnings. |
|__GFP_REPEAT | The allocator repeats the allocation if it |
| | fails, but the allocation can potentially fail. |
|__GFP_NOFAIL | The allocator indefinitely repeats the allocation. |
| | The allocation cannot fail. |
| | |
|__GFP_NORETRY | The allocator never retries if the allocation |
| | fails. |
|__GFP_NOMEMALLOC | The allocator does not fall back on reserves. |
|__GFP_HARDWALL | The allocator enforces “hardwall” cpuset boundaries.|
|__GFP_RECLAIMABLE | The allocator marks the pages reclaimable. |
|__GFP_COMP | The allocator adds compound page metadata |
+------------------------------------------------------------------------+
- Sample usage
// can sleep, can start disk I/O, can perform filesystem operations
ptr = kmalloc(size, __GFP_WAIT | __GFP_IO | __GFP_FS);
- `kmalloc` calls ultimately use `alloc_pages`, so the flags above grant great flexibility in page allocation
- Zone Modifiers
+------------------------------------------------------------+
|Flag | Description |
+------------------------------------------------------------+
|__GFP_DMA | Allocates only from ZONE_DMA |
|__GFP_DMA32 | Allocates only from ZONE_DMA32 |
|__GFP_HIGHMEM | Allocates from ZONE_HIGHMEM or ZONE_NORMAL |
+------------------------------------------------------------+
- Use `GFP_DMA` if you must have DMA-able memory
- Use `__GFP_HIGHMEM` if high memory is acceptable
  - only `alloc_pages` can return high memory, since a logical address cannot be returned for memory not mapped into the kernel's virtual address space
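For instance, a driver that needs DMA-able memory might combine the zone modifier with a type flag (a sketch; `BUF_SIZE` is a made-up constant):

char *buf;

/* GFP_DMA (== __GFP_DMA) restricts the allocation to ZONE_DMA */
buf = kmalloc(BUF_SIZE, GFP_DMA | GFP_KERNEL);
if (!buf)
    return -ENOMEM;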
- Type Flags
- specify combinations of action and zone modifiers
- simpler and less error prone
+---------------------------------------------------------------------+
|Flag         | Description                                           |
+---------------------------------------------------------------------+
|GFP_ATOMIC   | The allocation is high priority and must not sleep.   |
|             | This is the flag to use in interrupt handlers, in     |
|             | bottom halves, while holding a spinlock, and in other |
|             | situations where you cannot sleep.                    |
|             |                                                       |
|GFP_NOWAIT   | Like GFP_ATOMIC, except that the call will not fall   |
|             | back on emergency memory pools. This increases the    |
|             | likelihood of the memory allocation failing.          |
|             |                                                       |
|GFP_NOIO     | This allocation can block, but must not initiate      |
|             | disk I/O. This is the flag to use in block I/O code   |
|             | when you cannot cause more disk I/O, which might lead |
|             | to some unpleasant recursion.                         |
|             |                                                       |
|GFP_NOFS     | This allocation can block and can initiate disk I/O,  |
|             | if it must, but it will not initiate a filesystem     |
|             | operation. This is the flag to use in filesystem code |
|             | when you cannot start another filesystem operation.   |
|             |                                                       |
|GFP_KERNEL   | This is a normal allocation and might block. This is  |
|             | the flag to use in process context code when it is    |
|             | safe to sleep. The kernel will do whatever it has to  |
|             | do to obtain the memory requested by the caller. This |
|             | flag should be your default choice.                   |
|             |                                                       |
|GFP_USER     | This is a normal allocation and might block. This     |
|             | flag is used to allocate memory for user-space        |
|             | processes.                                            |
|             |                                                       |
|GFP_HIGHUSER | This is an allocation from ZONE_HIGHMEM and might     |
|             | block. This flag is used to allocate memory for       |
|             | user-space processes.                                 |
|             |                                                       |
|GFP_DMA      | This is an allocation from ZONE_DMA. Device drivers   |
|             | that need DMA-able memory use this flag, usually in   |
|             | combination with one of the preceding flags.          |
+---------------------------------------------------------------------+
GFP_NOFS (__GFP_WAIT | __GFP_IO)
GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS)
GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)
GFP_DMA (__GFP_DMA)
- `GFP_KERNEL` is the most frequently used flag
  - can block, so it cannot be used in interrupt context
  - has a high probability of succeeding
  - can put the caller to sleep, swap inactive pages to disk, flush dirty pages, etc.
- `GFP_ATOMIC`
  - the most restrictive flag, for memory allocations that cannot sleep
  - if no sufficiently large contiguous chunk is available, the kernel will not try to free memory; the allocation simply fails
  - hence a lower chance of succeeding
kfree()

- frees a block of memory previously allocated with `kmalloc()`

void kfree(const void *ptr)

- a double free is a serious bug
- `kfree(NULL)` is explicitly checked for and is safe
char *buf;
buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (!buf)
/* error allocating memory! */
kfree(buf);
vmalloc()

- allocates memory that is only virtually contiguous, unlike `kmalloc`, whose pages are physically contiguous
- achieved by fixing up the page tables to map the memory into contiguous chunks of the logical address space
- cannot be used when hardware requires physically contiguous pages, since devices that access memory directly are not behind the MMU and see physical addresses
- all such memory appears to the kernel as logically contiguous
- most kernel code still uses `kmalloc` for performance reasons
- `vmalloc` uses a greater number of TLB entries
- `vmalloc` is used rarely, for allocating large regions that might fail with `kmalloc`
- declared in `<linux/vmalloc.h>`, defined in `mm/vmalloc.c`

void * vmalloc(unsigned long size)

- the returned pointer points to at least `size` bytes
- cannot be used in interrupt context, since it may sleep
char *buf;
buf = vmalloc(16 * PAGE_SIZE); /* get 16 pages */
if (!buf)
/* error! failed to allocate memory */
/*
 * buf now points to at least 16*PAGE_SIZE bytes
 * of virtually contiguous memory
 */

/* free the allocation */
vfree(buf);
Slab Layer
- a generalization of the idea of free lists for data of a certain granularity
- tries to avoid the cost of frequent allocation/deallocation
- consolidates the idea of free lists in the kernel
- allows the kernel global management of free lists
- Why do we use the slab allocator?
  - frequently used data structures are allocated and freed often
  - arranging free lists contiguously means less memory fragmentation from frequent alloc/free
  - freed objects are immediately available for subsequent use
  - the allocator is aware of object size, page size, and total cache size
  - using some processor-specific memory in slabs means fewer locks
  - NUMA-aware allocators can be location sensitive in alloc/free
- Design of the slab layer:
  - objects are divided into caches
  - caches are divided into slabs
  - slabs are composed of one or more physically contiguous pages (typically a single page)
  - a slab is in one of three states: full, partial, empty
  - requests are satisfied from partial slabs first
  - if no partial slab exists, the request is satisfied from an empty slab
  - e.g. a cache stores `struct inode` objects (from `inode_cachep`); there is a cache for `task_struct` as well
  - a cache is represented by `struct kmem_cache`, with three lists: `slabs_full`, `slabs_partial`, `slabs_empty`
  - the above lists are stored in `struct kmem_list3`
  - a slab is described by a structure in `mm/slab.c`
- Slab Allocator Interface

int kmem_cache_destroy(struct kmem_cache *cachep)

- invoked on module shutdown to free a cache
- may sleep; don't call it from interrupt context
- the caller must be sure the cache is empty (no active slabs)
- the caller must ensure synchronization

struct kmem_cache * kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *))

- creates a cache and returns a pointer to it
- see `/proc/slabinfo` to inspect caches; `name` shows up there
- `size` - the size of each cache element
- `align` - the offset of the first element
- `ctor` - slab constructor; not really used, but called when new pages are added to the cache
- `flags`:
  - `SLAB_HWCACHE_ALIGN` - instructs the slab layer to align each object within a slab to a cache line
  - `SLAB_POISON` - fill the slab with a known value (a5a5a5a5), used to catch access to uninitialized memory
  - `SLAB_RED_ZONE` - insert red zones around the allocation to detect buffer overruns
  - `SLAB_PANIC` - panic if the allocation fails; indicates that allocations must not fail
  - `SLAB_CACHE_DMA` - each slab must be allocated in DMA-able memory

void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)

- returns a pointer to an object from the cache
- allocates new pages if there are no free slabs
- Example Usage
// cache for task_structs
struct kmem_cache *task_struct_cachep;

// create the cache
task_struct_cachep = kmem_cache_create("task_struct",
                                       sizeof(struct task_struct),
                                       ARCH_MIN_TASKALIGN,
                                       SLAB_PANIC | SLAB_NOTRACK,
                                       NULL);

// allocate a task_struct from the cache as needed
struct task_struct *tsk;
tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
    return NULL;

// free a task_struct and return it to the cache
kmem_cache_free(task_struct_cachep, tsk);
Statically Allocating on the Stack

- kernel stacks are small and fixed in size
- the kernel stack is generally 2 pages per process: 8KB on 32-bit, 16KB on 64-bit
- it is sometimes beneficial to use single-page kernel stacks:
  - less memory fragmentation to deal with
  - otherwise allocation of a new process becomes harder once contiguous page pairs cannot be found
- interrupts historically used the kernel stack of the process they interrupted
  - with single-page interrupt stacks enabled, interrupts get their own stacks instead: one page per processor
- if overrun, kernel stacks overflow into the process's thread_info structure
- thus keep stack allocations to a minimum and use dynamic allocation for anything large
High Memory Pages

- pages obtained from `alloc_pages()` with `__GFP_HIGHMEM`
- since the mapping is not permanent, high memory pages might not have a logical address
- on x86-32, all memory beyond 896MB is high memory (not permanently mapped into the kernel address space)
- x86-32 can theoretically address 2^32 bytes (4GB), and up to 64GB with PAE
- on x86-32, high memory pages get mapped in and out between 3GB and 4GB
- Permanent Mappings - declared in `<linux/highmem.h>`

void * kmap(struct page *page)

- works on both high and low memory
- if the page is in low memory, simply returns its virtual address
- if it is a high memory page, creates a permanent mapping and returns the address
- may sleep, so it only works in process context
- the mapping is permanent, so the caller is responsible for unmapping when finished

void kunmap(struct page *page)

- unmaps the mapping created for a high memory page
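Putting it together, a sketch of mapping a possibly-high-memory page in process context:

struct page *page;
char *addr;

/* a page that may come from high memory */
page = alloc_pages(GFP_HIGHUSER, 0);
if (!page)
    return -ENOMEM;

/* create a (permanent) mapping; may sleep */
addr = kmap(page);

/* ... use addr ... */

kunmap(page);
__free_pages(page, 0);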
- Temporary/Atomic Mappings
  - for creating mappings in interrupt context and other non-blocking contexts

void * kmap_atomic(struct page *page, enum km_type type)

- does not block, so it can be used in non-schedulable contexts
- the `km_type` slots are defined in `<asm-generic/kmap_types.h>`
enum km_type {
KM_BOUNCE_READ,
KM_SKB_SUNRPC_DATA,
KM_SKB_DATA_SOFTIRQ,
KM_USER0,
KM_USER1,
KM_BIO_SRC_IRQ,
KM_BIO_DST_IRQ,
KM_PTE0,
KM_PTE1,
KM_PTE2,
KM_IRQ0,
KM_IRQ1,
KM_SOFTIRQ0,
KM_SOFTIRQ1,
KM_SYNC_ICACHE,
KM_SYNC_DCACHE,
KM_UML_USERCOPY,
KM_IRQ_PTE,
KM_NMI,
KM_NMI_PTE,
KM_TYPE_NR
};
- disables kernel preemption, since the mappings are processor unique

void kunmap_atomic(void *kvaddr, enum km_type type)

- undoes the mapping at `kvaddr`
- on most architectures it does nothing but enable kernel preemption, since a temporary mapping is valid only until the next kernel mapping
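A sketch of a temporary mapping (`page` is assumed to be a high memory `struct page` obtained earlier; `KM_USER0` is one of the slots from the enum above):

char *vaddr;

/* map without sleeping; kernel preemption stays disabled until unmap */
vaddr = kmap_atomic(page, KM_USER0);

/* ... touch vaddr, but do not sleep here ... */

kunmap_atomic(vaddr, KM_USER0);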
Per CPU Allocation
- on SMP, use data unique to a given CPU
- per-CPU data is stored in an array, where each item corresponds to one processor
- `get_cpu()` - gets the current CPU number and disables kernel preemption
- `put_cpu()` - re-enables kernel preemption
unsigned long my_percpu[NR_CPUS];
int cpu;
/* get current processor and disable kernel preemption */
cpu = get_cpu();
// use the cpu specific data
my_percpu[cpu]++;
printk("my_percpu on cpu=%d is %lu\n", cpu, my_percpu[cpu]);
/* enable kernel preemption */
put_cpu();
- locking is not required, since the data is unique to this CPU
- problems with kernel preemption:
  - the `cpu` variable becomes invalid if the kernel is preempted and rescheduled on another processor
  - another thread could then access a half-modified data structure on the same processor
- using `get_cpu()` ensures kernel preemption is disabled while the per-processor data is in use
The percpu Interface

- declared in `<linux/percpu.h>`
- definitions in `mm/slab.c` and `asm/percpu.h`

DEFINE_PER_CPU(type, name);

- defines an instance of a per-CPU variable with the given type and name

get_cpu_var(name) and put_cpu_var(name)

- disable kernel preemption and get the CPU-specific value; re-enable preemption when done

per_cpu(name, cpu)++;

- fetches another processor's per-CPU variable
- dangerous: it neither disables kernel preemption nor provides locking
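A small sketch of the compile-time interface (the counter name is made up):

/* one copy of this counter exists per processor */
DEFINE_PER_CPU(unsigned long, hit_count);

/* disable preemption, bump this CPU's copy, then re-enable preemption */
get_cpu_var(hit_count)++;
put_cpu_var(hit_count);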
- Per-CPU Data at Runtime - see `<linux/percpu.h>`

void *alloc_percpu(type); /* a macro */

- allocates one instance of the given type per processor
- a macro around `__alloc_percpu`
- aligns the allocation on the type's natural boundary using `__alignof__`, a gcc feature that returns the recommended alignment
- the returned pointer indirectly references the dynamically created data
- `get_cpu_var(ptr)` fetches a CPU-specific pointer to the dynamically created data

void *__alloc_percpu(size_t size, size_t align);

- takes the number of bytes to allocate and the alignment

void free_percpu(const void *);

- frees the data on all processors
void *percpu_ptr;
unsigned long *foo;
percpu_ptr = alloc_percpu(unsigned long);
if (!percpu_ptr)
/* error allocating memory .. */
foo = get_cpu_var(percpu_ptr);
/* manipulate foo .. */
put_cpu_var(percpu_ptr);
Reasons for using Per-CPU Data
- reduction in locking
- reduces cache invalidation due to data being modified on other processors
- Cannot sleep in the middle of accessing per-CPU data
Summary/Guidelines for picking an allocation method

- mostly pick between `GFP_ATOMIC` and `GFP_KERNEL` for allocation flags
- to allocate from high memory use `alloc_pages()`, since it returns a `struct page` rather than a logical address
- use `kmap()` to map high memory pages into the kernel
- use `vmalloc()` for large allocations where physically contiguous memory is not a requirement
- if creating and destroying lots of objects of the same type, use a slab cache: it prevents fragmentation and gives faster allocations