Notes on memory management in the Linux Kernel
23 Jul 2015
Just a small collection of notes on memory management in the kernel. The underlying reference is the Linux Kernel Development book by Robert Love.
Why is memory allocation in the kernel hard?
- Not easy to deal with memory allocation errors.
- Kernel often cannot sleep.
- Require special primitives different from userspace.
Pages
- Physical pages act as the basic unit of memory management
  - Different from the processor’s smallest addressable unit (byte or word)
- Hardware provides assistance via the `MMU` (memory management unit)
  - virtual memory pages are the smallest unit.
- Page sizes are architecture specific
  - `32-bit`: `4Kb` page size
  - `64-bit`: `8Kb` page size
- Kernel keeps track of pages in the `struct page` structure
  - the pages tracked are the actual `physical pages`; these are not virtual pages.
- `struct page` is defined in `<linux/mm_types.h>`
  - `flags` stores the status of the page
    - 32 different flags available, see `<linux/page-flags.h>`
    - See the page flags enum for the list of states
  - `_count` field represents the usage count for a page
    - negative count field indicates page is free for allocation
    - access via `page_count()`, which reads the count atomically, rather than reading the field directly
  - `virtual` is the virtual address of the page.
    - `NULL` for pages in `HIGH_MEM`, which need to be mapped as needed
- This structure keeps track of data about the physical pages and has less to do with the data held in the actual pages.
- `struct page` consumes about `40 bytes`
  - Assuming a `4Gb` system with `8Kb` page size
    - Means `524,288` pages
    - That is `20 Mb` of `struct page`s in memory
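The overhead arithmetic above can be checked with a small userspace sketch. The function names `model_num_pages` and `model_page_struct_bytes` are made up for illustration and are not kernel APIs; the 40-byte descriptor size is the approximate figure from the notes.

```c
#include <assert.h>
#include <stdint.h>

/* Number of physical pages for a given RAM size and page size, both in bytes. */
uint64_t model_num_pages(uint64_t ram_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return ram_bytes / page_bytes;
}

/* Total bytes spent on page descriptors, assuming one descriptor per physical page. */
uint64_t model_page_struct_bytes(uint64_t ram_bytes, uint64_t page_bytes,
                                 uint64_t struct_bytes)
{
    return model_num_pages(ram_bytes, page_bytes) * struct_bytes;
}
```

Calling `model_page_struct_bytes(4ULL << 30, 8ULL << 10, 40)` reproduces the 20 Mb figure for the 4Gb / 8Kb case.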
Zones
- Allows for non uniform treatment of pages
- Zones group pages of similar properties
- Example of hardware limitations
  - being able to perform `DMA` (direct memory access) only within certain memory addresses
  - large physical addressability vs small virtual addressability
    - allows for pages not permanently mapped into kernel address space
- Four primary zones in Linux
  - `ZONE_DMA` - Pages that can do DMA
  - `ZONE_DMA32` - Pages that can do DMA and are accessible only by 32-bit devices
  - `ZONE_NORMAL` - Regularly mapped pages
  - `ZONE_HIGHMEM` - Pages not permanently mapped into kernel address space
- Zones defined in `linux/mmzone.h`
- Actual usage of zones is architecture dependent
  - ISA devices on x86-32 are limited to `DMA` on the first 16 Mb of memory
  - on x86-32 `ZONE_DMA` covers 0-16 Mb
  - on x86-32 `ZONE_NORMAL` covers 16 Mb-896 Mb
  - on x86-32 `ZONE_HIGHMEM` covers all memory above 896 Mb
  - on `x86-64` there is no high memory; all memory is mappable and lives in `ZONE_DMA` and `ZONE_NORMAL`
- Thus a `ZONE` provides a logical grouping of pages.
- Some key fields in `struct zone`
  - `lock` - spinlock to prevent concurrent modification
  - `watermark` - minimum, low, and high values for this zone.
  - `name` - null terminated string
    - Initialized during boot in `mm/page_alloc.c`
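As a rough illustration of the x86-32 boundaries listed above, the following userspace sketch classifies a physical address into a zone. The enum and function names are hypothetical stand-ins, not the kernel's `enum zone_type` or any kernel helper.

```c
#include <stdint.h>

/* Userspace stand-ins for the zone names; not the kernel's own definitions. */
enum model_zone { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_ZONE_HIGHMEM };

#define MODEL_MIB (1024ULL * 1024ULL)

/* Classify a physical address using the x86-32 layout from the notes:
 * below 16 Mb is DMA, 16 Mb up to 896 Mb is NORMAL, the rest is HIGHMEM. */
enum model_zone model_zone_for_phys_x86_32(uint64_t phys)
{
    if (phys < 16 * MODEL_MIB)
        return MODEL_ZONE_DMA;
    if (phys < 896 * MODEL_MIB)
        return MODEL_ZONE_NORMAL;
    return MODEL_ZONE_HIGHMEM;
}
```

The real kernel works on page frame numbers and per-node zone lists; this only captures the address ranges.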
Getting Pages
- methods for requesting memory at page size granularity are in `linux/gfp.h`
- `struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)`
  - core function to fetch pages
  - allocates `2^order` or `(1 << order)` contiguous pages
  - returns pointer to the first page
  - Convert to the `logical address` of where the page resides using `void * page_address(struct page *page)`
- `unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)`
  - Returns the `logical address` where the first page resides
  - used when `struct page` is not required
  - returned pages are contiguous and follow the first
- `unsigned long get_zeroed_page(unsigned int gfp_mask)`
  - useful for pages handed to userspace
  - prevents leaking sensitive data
- Freeing pages.
  - `void __free_pages(struct page *page, unsigned int order)`
  - `void free_pages(unsigned long addr, unsigned int order)`
  - `void free_page(unsigned long addr)`
  - double free is a serious problem in the kernel
- All allocations can fail, in which case we get `NULL` instead of a `logical address`
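Since the page allocators above take an `order` rather than a byte count, it can help to see how a size maps to an order. This is a minimal userspace sketch assuming 4Kb pages; `model_order_for_size` and `model_pages_for_order` are hypothetical names, and this is not the kernel's own size-to-order helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= size. */
unsigned int model_order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = MODEL_PAGE_SIZE;

    assert(size <= SIZE_MAX / 2); /* keeps the doubling below from overflowing */
    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Number of pages an allocation of the given order covers: 2^order. */
size_t model_pages_for_order(unsigned int order)
{
    return (size_t)1 << order;
}
```

Note how a request for three pages rounds up to order 2 (four pages), which is one reason byte-sized allocators are preferred for odd sizes.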
kmalloc()
- Useful for byte sized allocation
- declared in `<linux/slab.h>`
- preferred choice if allocations are not multiples of page size
- `void * kmalloc(size_t size, gfp_t flags)`
  - returned memory is at least `size` bytes in length
gfp_mask Flags
- flags are of type `gfp_t`, defined in `linux/types.h` as `unsigned int`
- Flag Types
  - action modifiers
    - how memory is allocated
    - eg. allocations made in an interrupt handler may fail but must never sleep
  - zone modifiers
    - specify where the memory is allocated
  - types
    - Specify combination of action and zone modifiers
    - `GFP_KERNEL` - for code in `process context`
- Action Modifiers
- Sample usage
- `kmalloc` calls ultimately use `alloc_pages`, and the modifiers above grant great flexibility in page allocation
- Zone Modifiers
  - Use `GFP_DMA` if you must have DMA-able memory
  - `__GFP_HIGHMEM` can be used if needed
    - only `alloc_pages` can return high memory
    - since a logical address cannot be returned for memory not mapped into the kernel's virtual address space
- Type Flags
  - specify combinations of action and zone modifiers
  - simpler and less error prone
  - `GFP_NOFS` is `(__GFP_WAIT | __GFP_IO)`
  - `GFP_KERNEL` is `(__GFP_WAIT | __GFP_IO | __GFP_FS)`
  - `GFP_USER` is `(__GFP_WAIT | __GFP_IO | __GFP_FS)`
  - `GFP_HIGHUSER` is `(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HIGHMEM)`
  - `GFP_DMA` is `__GFP_DMA`
- `GFP_KERNEL`
  - most frequently used flag
  - can block, so cannot be used in interrupt context
  - has high probability of succeeding
  - can put caller to sleep, swap inactive pages to disk, flush dirty pages
- `GFP_ATOMIC`
  - most restrictive
  - for memory allocations which cannot sleep
  - if no contiguous chunk is available it will not try to free memory; instead it just fails
  - less chance of succeeding
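To make the "type flags are just OR-ed modifiers" idea concrete, here is a userspace model of the composition. The bit values are invented for illustration and do not match the kernel's, every `MODEL_` name is hypothetical, and treating the atomic type as a single high-priority bit without the wait bit is an assumption about the kernels the book covers.

```c
/* Illustrative bit values only; the kernel's real values differ. */
#define MODEL_GFP_WAIT    0x01u /* allocator may sleep */
#define MODEL_GFP_IO      0x02u /* allocator may start disk I/O */
#define MODEL_GFP_FS      0x04u /* allocator may start filesystem I/O */
#define MODEL_GFP_HIGHMEM 0x08u /* zone modifier: high memory allowed */
#define MODEL_GFP_HIGH    0x10u /* assumed high-priority bit */

/* Type flags built from the modifiers, mirroring the list in the notes. */
#define MODEL_GFP_NOFS     (MODEL_GFP_WAIT | MODEL_GFP_IO)
#define MODEL_GFP_KERNEL   (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_HIGHUSER (MODEL_GFP_KERNEL | MODEL_GFP_HIGHMEM)
#define MODEL_GFP_ATOMIC   (MODEL_GFP_HIGH) /* assumption: no wait bit set */

/* Nonzero if an allocation with this mask is allowed to sleep. */
int model_may_sleep(unsigned int mask)
{
    return (mask & MODEL_GFP_WAIT) != 0;
}

/* Nonzero if an allocation with this mask may call into the filesystem. */
int model_may_use_fs(unsigned int mask)
{
    return (mask & MODEL_GFP_FS) != 0;
}
```

Under this model, code in interrupt context would only pass masks for which `model_may_sleep()` returns zero.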
kfree()
- frees a block of memory previously allocated with `kmalloc()`
- `void kfree(const void *ptr)`
- double free is a serious bug.
- `kfree(NULL)` is checked for and works
vmalloc()
- Allocates memory which is only `virtually contiguous`, unlike `kmalloc` whose pages are `physically contiguous`
- Achieved by fixing up page tables to map memory into contiguous chunks in `logical address` space
- Cannot be used when hardware requires physically contiguous pages, since such hardware does not sit behind the `MMU`
- All memory would appear to the kernel as logically contiguous
- most kernel code will still use `kmalloc` for performance reasons
  - `vmalloc` uses a greater number of entries in the `TLB`
- `vmalloc` is used rarely, when we need to allocate large regions which may fail with `kmalloc`
- declared in `linux/vmalloc.h`, defined in `mm/vmalloc.c`
- `void * vmalloc(unsigned long size)`
  - returned pointer is at least `size` bytes
  - cannot be used in interrupt context
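The difference between virtually and physically contiguous memory can be pictured with a toy page table. This userspace sketch assumes 4Kb pages and a flat array of frame numbers; `model_virt_to_phys` is a hypothetical name and this is not how the kernel's page tables are actually laid out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* frames[i] is the physical frame number backing virtual page i of a region.
 * The frames need not be adjacent, yet offsets into the region are contiguous. */
uint64_t model_virt_to_phys(const uint64_t *frames, size_t nframes, uint64_t offset)
{
    size_t vpage = (size_t)(offset / MODEL_PAGE_SIZE);

    assert(vpage < nframes);
    return frames[vpage] * MODEL_PAGE_SIZE + offset % MODEL_PAGE_SIZE;
}
```

With frames `{7, 2, 9}`, consecutive offsets on either side of a page boundary land in unrelated physical frames, which is why a device that bypasses the MMU cannot use such a region.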
Slab Layer
- Generalization of the idea of `free lists` of certain granularity of data
  - Tries to avoid cost of allocation/deallocation
  - consolidates the idea of `free lists` in the kernel
- Allows kernel global management of free lists
- Why do we use the slab allocator?
  - Frequently used data structures are allocated and freed often
  - arranging free lists contiguously means less memory fragmentation from frequent alloc/free
  - freed objects are immediately available for subsequent use
  - allocator is aware of object size, page size, and total cache size
  - using some processor specific memory in slabs means fewer locks
  - NUMA aware allocators can be location sensitive in alloc/free
- Design of Slab layer
  - `objects` divided into `caches`
  - `caches` divided into `slabs`
  - `slabs` composed of one or more contiguous pages (typically a single page)
  - `slab` states: `full`, `partial`, `empty`
    - requests satisfied from partial slabs
    - if no partial slabs then request satisfied from an empty slab
  - eg. cache to store `struct inode` from `inode_cachep`, cache for `task_struct`
  - `cache` represented by `kmem_cache` with three lists `slabs_full`, `slabs_partial`, `slabs_empty`
    - above lists stored in `kmem_list3`
  - `slab` described by a structure in `mm/slab.c`
- Slab Allocator Interface
  - `int kmem_cache_destroy(struct kmem_cache *cachep)`
    - invoked on module shutdown to free the cache
    - may sleep, don't call from interrupt context
    - caller must be sure that the cache is empty, with no active slabs
    - caller must ensure synchronization
  - `struct kmem_cache * kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *))`
    - Creates a cache
    - returns a pointer to the cache created
    - `name` - shows up in `/proc/slabinfo`, which lists the caches
    - `size` - size of each cache element
    - `align` - offset of first element
    - `ctor` - slab constructor
      - rarely used in practice; called when new pages are added to the cache
    - `flags`
      - `SLAB_HWCACHE_ALIGN`
        - This flag instructs the slab layer to align each object within a slab to a cache line
      - `SLAB_POISON`
        - fill slab with known value `a5a5a5a5`, used to catch uninitialized memory
      - `SLAB_RED_ZONE`
        - use red zones around objects to detect buffer overruns
      - `SLAB_PANIC`
        - panic if allocation fails
        - indicates that allocations must not fail
      - `SLAB_CACHE_DMA`
        - each slab must be in DMA-able memory
  - `void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)`
    - returns pointer to an object from the cache
    - allocates new pages if no free slabs
- Example Usage
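The free-list idea behind the slab layer can be sketched in userspace as a cache backed by a single fixed-size slab. This is not the kernel's `kmem_cache` interface: every `model_` name, the four-object slab, and the 64-byte object size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SLAB_OBJS 4 /* objects per slab, hypothetical */

/* While an object is free its storage holds the free-list link. */
union model_obj {
    union model_obj *next_free;
    char payload[64];
};

struct model_cache {
    union model_obj slab[MODEL_SLAB_OBJS]; /* one slab of contiguous objects */
    union model_obj *free_list;
    size_t in_use;
};

void model_cache_init(struct model_cache *c)
{
    c->free_list = NULL;
    c->in_use = 0;
    /* Push in reverse so the first allocation returns slab[0]. */
    for (size_t i = MODEL_SLAB_OBJS; i-- > 0; ) {
        c->slab[i].next_free = c->free_list;
        c->free_list = &c->slab[i];
    }
}

void *model_cache_alloc(struct model_cache *c)
{
    union model_obj *o = c->free_list;

    if (!o)
        return NULL; /* slab is full; a real allocator would add a new slab */
    c->free_list = o->next_free;
    c->in_use++;
    return o;
}

void model_cache_free(struct model_cache *c, void *p)
{
    union model_obj *o = p;

    assert(c->in_use > 0);
    o->next_free = c->free_list; /* freed object becomes the next one handed out */
    c->free_list = o;
    c->in_use--;
}

/* Slab state as in the notes: 0 = empty, 1 = partial, 2 = full. */
int model_slab_state(const struct model_cache *c)
{
    if (c->in_use == 0)
        return 0;
    return c->in_use == MODEL_SLAB_OBJS ? 2 : 1;
}
```

Because the free list is last-in first-out, an object that was just freed is the first to be reused, which matches the "freed objects immediately available" point above.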
Statically Allocating on the Stack
- Kernel stacks are small and fixed
- Kernel stack generally 2 pages per process
  - `8Kb` on `32-bit` or `16 Kb` on `64-bit`
- Sometimes beneficial to deal with single page stacks
  - deals with memory fragmentation
  - with two-page stacks, creating a new process becomes harder when contiguous pages cannot be found
- interrupts use the kernel stack of the process they interrupted
  - use interrupt stacks instead - one page per processor
  - this depends on single page stacks being enabled.
- kernel stacks will overflow into the process thread info structure
- Thus keep stack allocations minimum and use dynamic allocation.
High Memory Pages
- Pages from `alloc_pages()` and `__GFP_HIGHMEM`.
  - Since they are not permanently mapped, high mem pages might not have a logical address
- `x86-32`: all memory beyond `896 Mb` is high memory (not permanently mapped to kernel address space)
- `x86-32` can theoretically address `2^32` bytes (4 Gb), and 64 Gb with PAE
- `x86-32` high memory pages get mapped in and out between `3 Gb` and `4 Gb`
- Permanent Mappings
  - `<linux/highmem.h>`
  - `void * kmap(struct page* page)`
    - works on both high and low memory
    - returns the virtual address if the page is in low memory
    - for a high memory page, creates a mapping and returns its virtual address.
    - function may sleep - only works in process context
    - mappings are permanent; the user is responsible for unmapping
    - good to unmap when usage finishes
  - `void kunmap(struct page *page)`
    - unmap the created mapping of high memory
- Temporary/Atomic Mappings
  - To create mappings in interrupt context and other non-blocking contexts
  - `void * kmap_atomic(struct page * page , enum km_type type)`
    - does not block
    - can be used in non-schedulable contexts
    - `km_type` values are defined in `<asm-generic/kmap_types.h>`
    - Disables kernel preemption - mappings are processor unique
  - `void kunmap_atomic(void *kvaddr, enum km_type type)`
    - Ability to undo mapping at `kvaddr`
    - On most architectures does nothing but enable kernel preemption
    - a temporary mapping is only valid until the next temporary mapping
Per CPU Allocation
- On SMP use data unique to a CPU
- Per-CPU data stored in an array
  - Items in array correspond to processor specific data
  - `get_cpu()` - get the current cpu and disable kernel preemption
  - `put_cpu()` - re-enable kernel preemption
  - locking is not required since the data is unique to the cpu
- Problems with Kernel Preemption
  - `cpu` variable will become invalid if the kernel is preempted and rescheduled on another processor
  - another thread may now be able to access a dirty data structure on the same processor
  - Using `get_cpu` ensures that kernel preemption on the processor is disabled
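The array-per-processor layout described above can be modelled in userspace as a counter with one slot per CPU. The CPU count of four and all `model_` names are assumptions; real kernel code would obtain the index from `get_cpu()` with preemption disabled rather than taking it as a parameter.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count */

struct model_percpu_counter {
    long count[MODEL_NR_CPUS]; /* slot i is only ever written by CPU i */
};

/* Increment the slot owned by the given CPU; no lock is needed because
 * no other CPU writes to this slot. */
void model_counter_inc(struct model_percpu_counter *pc, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    pc->count[cpu]++;
}

/* Readers that want a global view add up every slot. */
long model_counter_sum(const struct model_percpu_counter *pc)
{
    long total = 0;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        total += pc->count[cpu];
    return total;
}
```

The trade-off is that writes are cheap and lock-free while a global read has to walk every slot.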
The percpu Interface
- `linux/percpu.h`
  - Definitions in `mm/slab.c` and `asm/percpu.h`
- `DEFINE_PER_CPU(type, name);`
  - An instance of a percpu variable with type and name
- `get_cpu_var(name)` and `put_cpu_var(name)`
  - disable kernel preemption and get cpu specific value
- `per_cpu(name, cpu)++;`
  - fetch another processor's per cpu variable
  - dangerous method since it doesn't disable preemption and doesn't provide locking
- Per-CPU data at Runtime
  - In `<linux/percpu.h>`
  - `void *alloc_percpu(type); /* a macro */`
    - Allocate one instance per processor
    - macro around `__alloc_percpu`
    - aligns at the byte boundary given by `__alignof__`
      - gcc feature to get the recommended alignment
    - returned pointer indirectly references dynamically created data
    - `get_cpu_var(ptr)` fetches cpu specific pointer to dynamically created data.
  - `void *__alloc_percpu(size_t size, size_t align);`
    - number of bytes to allocate and alignment
  - `void free_percpu(const void *);`
    - frees data on all the processors
Reasons for using Per-CPU Data
- reduction in locking
- reduces cache invalidation due to data being modified on other processors
- Cannot sleep in the middle of accessing per-CPU data
Summary/Guidelines for picking an allocation method
- Mostly pick between `GFP_ATOMIC` and `GFP_KERNEL` for allocations
- For high memory use `alloc_pages()` since it returns `struct page`
  - Use `kmap()` to map high memory pages
- Use `vmalloc()` when doing large allocations where physically contiguous memory is not a requirement
- If doing lots of creations/destructions of the same object type use a slab cache, which prevents fragmentation and gives faster allocations