One cache exists for each type of object that is to be cached. For a full list of caches available on a running system, run cat /proc/slabinfo. This file gives some basic information on the caches. An excerpt from the output of this file looks like:
slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    248    5    5    1 :  252  126
urb_priv               0      0     64    0    0    1 :  252  126
tcp_bind_bucket       15    226     32    2    2    1 :  252  126
inode_cache         5714   5992    512  856  856    1 :  124   62
dentry_cache        5160   5160    128  172  172    1 :  252  126
mm_struct            240    240    160   10   10    1 :  252  126
vm_area_struct      3911   4480     96  112  112    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64              432   1357     64   23   23    1 :  252  126
size-32(DMA)          17    113     32    1    1    1 :  252  126
size-32              850   2712     32   24   24    1 :  252  126
Each of the column fields corresponds to a field in the struct kmem_cache_s structure. The columns listed in the excerpt above are:

- cache-name — a human-readable name, such as "tcp_bind_bucket";
- num-active-objs — the number of objects that are in use;
- total-objs — how many objects are available in total, including unused objects;
- obj-size — the size of each object, typically quite small;
- num-active-slabs — the number of slabs containing objects that are active;
- total-slabs — how many slabs exist in total;
- num-pages-per-slab — the pages required to create one slab, typically 1.
If SMP is enabled, as in the example excerpt, two more columns will be displayed after a colon. They refer to the per-CPU cache described in Section 9.5. The columns are:

- limit — the number of free objects that can be stored in the per-CPU cache before half of them are given back to the global free list;
- batchcount — the number of objects allocated for the processor in a block when no objects are free.
To speed allocation and freeing of objects and slabs, they are arranged into three lists: slabs_full, slabs_partial and slabs_free. A slab on slabs_full has all its objects in use. A slab on slabs_partial has free objects in it, so it is a prime candidate for allocation of objects. A slab on slabs_free has no allocated objects, so it is a prime candidate for slab destruction.
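A condensed sketch, modelled on the fast path of the kmem_cache_alloc_one() macro in mm/slab.c with locking and debugging checks omitted, shows how these lists are consulted during allocation:

        struct list_head *entry;
        slab_t *slabp;

        /* Prefer a partially full slab */
        entry = cachep->slabs_partial.next;
        if (entry == &cachep->slabs_partial) {          /* list is empty */
                /* Fall back to a completely free slab... */
                entry = cachep->slabs_free.next;
                if (entry == &cachep->slabs_free)       /* none of those either */
                        goto must_grow;                 /* the cache has to grow */
                /* ...which now becomes partially used */
                list_del(entry);
                list_add(entry, &cachep->slabs_partial);
        }
        slabp = list_entry(entry, slab_t, list);

slabs_full is never searched at all; by definition it has nothing to offer an allocation.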
All information describing a cache is stored in a struct kmem_cache_s declared in mm/slab.c. This is an extremely large struct and so will be described in parts.
struct kmem_cache_s {
        struct list_head        slabs_full;     /* slabs with no free objects */
        struct list_head        slabs_partial;  /* slabs that are partially used */
        struct list_head        slabs_free;     /* slabs with no objects in use */
        unsigned int            objsize;        /* size of each object */
        unsigned int            flags;          /* static cache flags (SLAB_*) */
        unsigned int            num;            /* number of objects per slab */
        spinlock_t              spinlock;       /* protects the descriptor from concurrent access */
#ifdef CONFIG_SMP
        unsigned int            batchcount;     /* objects moved to or from the per-CPU cache in one block */
#endif
Most of these fields are of interest when allocating or freeing objects.
        unsigned int            gfporder;       /* pages per slab: 2^gfporder */
        unsigned int            gfpflags;       /* GFP flags for allocating pages, e.g. GFP_DMA */

        size_t                  colour;         /* number of different offsets that can be used */
        unsigned int            colour_off;     /* multiple, in bytes, to offset each object by */
        unsigned int            colour_next;    /* next colour to use; wraps to 0 on reaching colour */
        kmem_cache_t            *slabp_cache;   /* cache holding off-slab slab descriptors */
        unsigned int            growing;        /* set while the cache grows, protecting it from reaping */
        unsigned int            dflags;         /* dynamic flags (DFLGS_*) */

        /* constructor, called as each object is created */
        void (*ctor)(void *, kmem_cache_t *, unsigned long);

        /* destructor, called as each object is destroyed */
        void (*dtor)(void *, kmem_cache_t *, unsigned long);

        unsigned long           failures;       /* not used anywhere in the code */
This block deals with fields of interest when allocating or freeing slabs from the cache.
        char                    name[CACHE_NAMELEN];    /* human-readable name of the cache */
        struct list_head        next;                   /* linkage on the cache chain */
These fields are set during cache creation.
#ifdef CONFIG_SMP
        cpucache_t              *cpudata[NR_CPUS];      /* per-CPU caches of objects */
#endif
#if STATS
        unsigned long           num_active;     /* current number of active objects */
        unsigned long           num_allocations;/* total number of objects ever allocated */
        unsigned long           high_mark;      /* highest value num_active has reached */
        unsigned long           grown;          /* number of times the cache has grown */
        unsigned long           reaped;         /* number of times the cache has been reaped */
        unsigned long           errors;         /* not used anywhere in the code */
#ifdef CONFIG_SMP
        atomic_t                allochit;       /* allocations satisfied from the per-CPU cache */
        atomic_t                allocmiss;      /* allocations that missed the per-CPU cache */
        atomic_t                freehit;        /* frees placed in the per-CPU cache */
        atomic_t                freemiss;       /* frees that missed the per-CPU cache */
#endif
#endif
};
These figures are only available if the CONFIG_SLAB_DEBUG option is set at compile time. They are all beancounters and are not of general interest. The statistics for /proc/slabinfo are calculated when the proc entry is read by another process: every slab used by each cache is examined, rather than relying on these fields being available.
A number of flags are set at cache creation time that remain the same for the lifetime of the cache. They affect how the slab is structured and how objects are stored within it. All the flags are stored in a bitmask in the flags field of the cache descriptor. The full list of possible flags that may be used is declared in linux/slab.h.
There are three principal sets. The first set consists of internal flags, which are set only by the slab allocator and are listed in Table 9.1. The only relevant flag in the set is CFGS_OFF_SLAB, which determines where the slab descriptor is stored.
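For reference, the decision to set this flag in kmem_cache_create() is a simple size check, roughly as follows:

        /* If objects are large, keep the slab descriptor off-slab so
         * that objects pack better within the slab itself */
        if (size >= (PAGE_SIZE >> 3))
                flags |= CFGS_OFF_SLAB;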
The second set is set by the cache creator and determines how the allocator treats the slab and how objects are stored. These flags are listed in Table 9.2.
The last set of flags is only available if the compile option CONFIG_SLAB_DEBUG is set. These flags determine what additional checks are made on slabs and objects, and are primarily of interest only when new caches are being developed.
To prevent callers from using the wrong flags, a CREATE_MASK is defined in mm/slab.c consisting of all the allowable flags. When a cache is being created, the requested flags are compared against the CREATE_MASK and it is reported as a bug if invalid flags are used.
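The check itself, in kmem_cache_create(), is about as blunt as a validity check can be:

        /* any bit outside the set of allowable flags is a caller bug */
        if (flags & ~CREATE_MASK)
                BUG();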
The dflags field has only one flag, DFLGS_GROWN, but it is important. The flag is set during kmem_cache_grow() so that kmem_cache_reap() will be unlikely to choose the cache for reaping. When the function does find a cache with this flag set, it skips the cache and removes the flag.
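In kmem_cache_reap(), the treatment of a recently grown cache looks approximately like this; the flag buys the cache one reprieve:

        /* recently grown: skip this time, but clear the flag so the
         * cache is eligible for reaping on the next pass */
        if (searchp->dflags & DFLGS_GROWN) {
                searchp->dflags &= ~DFLGS_GROWN;
                goto next;
        }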
The flags in the gfpflags field correspond to the GFP page flag options for allocating pages for slabs. Callers sometimes use either SLAB_* or GFP_* flags, but they really should use only SLAB_* flags. These correspond directly to the flags described in Section 7.4, so they will not be discussed in detail here. It is presumed that these flags exist for clarity and in case the slab allocator ever needed to behave differently in response to a particular flag, but in reality there is no difference.
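Indeed, in linux/slab.h the SLAB_* versions are simply aliases for their GFP_* counterparts, for example:

        #define SLAB_ATOMIC     GFP_ATOMIC
        #define SLAB_KERNEL     GFP_KERNEL
        #define SLAB_DMA        GFP_DMA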
To utilise the hardware cache better, the slab allocator will offset objects in different slabs by different amounts, depending on the amount of space left over in the slab. The offset is in units of BYTES_PER_WORD unless SLAB_HWCACHE_ALIGN is set, in which case it is aligned to blocks of L1_CACHE_BYTES for alignment to the L1 hardware cache.
During cache creation, it is calculated how many objects can fit on a slab (see Section 9.2.7) and how many bytes would be wasted. Based on that wastage, two figures are calculated for the cache descriptor:

- colour — the number of different offsets that can be used;
- colour_off — the multiple to offset each object by in the slab.
With the objects offset, they will use different lines on the associative hardware cache. Therefore, objects from different slabs are less likely to flush each other's cache lines.
The result of this is best explained by an example. Let us say that s_mem (the address of the first object) on the slab is 0 for convenience, that 100 bytes are wasted on the slab and alignment is to be at 32 bytes to the L1 Hardware Cache on a Pentium II.
In this scenario, the first slab created will have its objects start at 0. The second will start at 32, the third at 64 and, because colour_next then wraps, the fourth will start back at 0. With this, objects from each of the slabs will not hit the same hardware cache line on the CPU. The value of colour is 3 (100/32, rounded down) and colour_off is 32.
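A compressed restatement of this arithmetic, using the example's figures (the values are illustrative, not kernel constants):

        unsigned int left_over   = 100;         /* bytes wasted on each slab */
        unsigned int colour_off  = 32;          /* L1_CACHE_BYTES on a Pentium II */
        unsigned int colour      = left_over / colour_off;      /* = 3 usable offsets */
        unsigned int colour_next = 0;           /* offset index for the next slab */

        /* Each new slab takes the next offset in turn (compare
         * kmem_cache_grow()), wrapping back to 0 on reaching colour, so
         * successive slabs place their first object at 0, 32, 64, 0, ... */
        unsigned int offset = colour_next * colour_off;
        if (++colour_next >= colour)
                colour_next = 0;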
The function kmem_cache_create() is responsible for creating new caches and adding them to the cache chain. The tasks that are taken to create a cache are:

- perform basic sanity checks for bad usage;
- perform debugging checks if CONFIG_SLAB_DEBUG is set;
- allocate a kmem_cache_t from the cache_cache slab cache;
- align the object size to the word size;
- calculate how many objects will fit on a slab;
- align the object size to the hardware cache;
- calculate colour offsets;
- initialise remaining fields in the cache descriptor;
- add the new cache to the cache chain.
Figure 9.2 shows the call graph relevant to the creation of a cache; each function is fully described in the Code Commentary.
When a slab is freed, it is placed on the slabs_free list for future use. Caches do not automatically shrink themselves, so when kswapd notices that memory is tight, it calls kmem_cache_reap() to free some memory. This function is responsible for selecting a cache that will be required to shrink its memory usage. It is worth noting that cache reaping does not take into account what memory node or zone is under pressure. This means that with a NUMA or high-memory machine, it is possible the kernel will spend a lot of time freeing memory from regions that are under no memory pressure, but this is not a problem for architectures such as the x86, which have only one bank of memory.
The call graph in Figure 9.3 is deceptively simple, as the task of selecting the proper cache to reap is quite long. In the event that there are numerous caches in the system, only REAP_SCANLEN (statically defined as 10) caches are examined in each call. The last cache to be scanned is stored in the variable clock_searchp so as not to examine the same caches repeatedly. For each scanned cache, the reaper does the following:

- check the flags for SLAB_NO_REAP and skip the cache if it is set;
- if the cache is growing, skip it;
- if the cache has grown recently (DFLGS_GROWN is set in dflags), skip it, but clear the flag so the cache will be a reap candidate the next time;
- count the number of free slabs in slabs_free and calculate how many pages that would free;
- if the cache has constructors or large slabs, adjust the page count to make it less likely for the cache to be selected;
- if the number of pages that would be freed exceeds REAP_PERFECT, free half of the slabs in slabs_free;
- otherwise scan the rest of the caches and select the one that would free the most pages for freeing half of its slabs in slabs_free.
When a cache is selected to shrink itself, the steps it takes are simple and brutal:

- delete all objects in the per-CPU caches;
- delete all slabs from slabs_free unless the growing flag gets set.
Linux is nothing, if not subtle.
Two varieties of shrink function are provided, with confusingly similar names. kmem_cache_shrink() removes all slabs from slabs_free and returns the number of pages freed as a result. This is the principal function exported for use by users of the slab allocator.
The second function __kmem_cache_shrink() frees all slabs from slabs_free and then verifies that slabs_partial and slabs_full are empty. This is for internal use only and is important during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.
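A hypothetical call to the exported variant, where my_cachep is an illustrative name rather than a kernel symbol, might look like:

        /* hand the cache's fully free slabs back to the page allocator;
         * the return value is the number of pages freed */
        int freed = kmem_cache_shrink(my_cachep);
        printk(KERN_DEBUG "shrink released %d pages\n", freed);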
When a module is unloaded, it is responsible for destroying any cache it created with the function kmem_cache_destroy(); a sketch of this lifecycle follows the list below. It is important that the cache is properly destroyed, as two caches with the same human-readable name are not allowed to exist. Core kernel code often does not bother to destroy its caches, as they persist for the life of the system. The steps taken to destroy a cache are:

- delete the cache from the cache chain;
- shrink the cache to delete all slabs;
- free any per-CPU caches (with kfree());
- delete the cache descriptor from the cache_cache.
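A minimal sketch of the create/destroy lifecycle under this API; struct foo, foo_cachep and the cache name "foo_cache" are illustrative, not kernel symbols:

        #include <linux/module.h>
        #include <linux/slab.h>

        struct foo {
                int bar;
        };

        static kmem_cache_t *foo_cachep;

        static int __init foo_init(void)
        {
                /* create the cache when the module loads */
                foo_cachep = kmem_cache_create("foo_cache",
                                               sizeof(struct foo),
                                               0,                  /* offset */
                                               SLAB_HWCACHE_ALIGN, /* flags */
                                               NULL, NULL);        /* ctor, dtor */
                if (!foo_cachep)
                        return -ENOMEM;
                return 0;
        }

        static void __exit foo_exit(void)
        {
                /* the cache must be destroyed on unload; otherwise a later
                 * load could not create a cache named "foo_cache" again */
                kmem_cache_destroy(foo_cachep);
        }

        module_init(foo_init);
        module_exit(foo_exit);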