One of the tasks the slab allocator is dedicated to is improved hardware cache utilization. An aim of high performance computing [#!severance98!#] in general is to use data on the same CPU for as long as possible. Linux achieves this by trying to keep objects in the same CPU cache with a Per-CPU object cache, simply called a cpucache for each CPU in the system.
When allocating or freeing objects, they are placed in the cpucache. When there are no objects free, a batch of objects is placed into the pool. When the pool gets too large, half of them are removed and placed in the global cache. This way the hardware cache will be used for as long as possible on the same CPU.
The second major benefit of this method is that spinlocks do not have to be held when accessing the CPU pool as we are guaranteed another CPU won't access the local data. This is important because without the caches, the spinlock would have to be acquired for every allocation and free which is unnecessarily expensive.
Each cache descriptor has a pointer to an array of cpucaches, described in the cache descriptor as
231 cpucache_t *cpudata[NR_CPUS];
This structure is very simple
173 typedef struct cpucache_s { 174 unsigned int avail; 175 unsigned int limit; 176 } cpucache_t;
The fields are as follows:
A helper macro cc_data() is provided to give the cpucache for a given cache and processor. It is defined as
180 #define cc_data(cachep) \ 181 ((cachep)->cpudata[smp_processor_id()])
This will take a given cache descriptor (cachep) and return a pointer from the cpucache array (cpudata). The index needed is the ID of the current processor, smp_processor_id().
Pointers to objects on the cpucache are placed immediately after the cpucache_t struct. This is very similar to how objects are stored after a slab descriptor.
To prevent fragmentation, objects are always added or removed from the end of the array. To add an object (obj) to the CPU cache (cc), the following block of code is used
cc_entry(cc)[cc->avail++] = obj;
To remove an object
obj = cc_entry(cc)[--cc->avail];
cc_entry() is a helper macro which gives a pointer to the first object in the cpucache. It is defined as
178 #define cc_entry(cpucache) \ 179 ((void **)(((cpucache_t*)(cpucache))+1))
This takes a pointer to a cpucache, increments the value by the size of the cpucache_t descriptor giving the first object in the cache.
When a cache is created, its CPU cache has to be enabled and memory allocated for it using kmalloc(). The function enable_cpucache() is responsible for deciding what size to make the cache and calling kmem_tune_cpucache() to allocate memory for it.
Obviously a CPU cache cannot exist until after the various sizes caches have been enabled so a global variable g_cpucache_up is used to prevent CPU caches being enabled prematurely. The function enable_all_cpucaches() cycles through all caches in the cache chain and enables their cpucache.
Once the CPU cache has been setup, it can be accessed without locking as a CPU will never access the wrong cpucache so it is guaranteed safe access to it.
When the per-cpu caches have been created or changed, each CPU is signalled via an IPI. It is not sufficient to change all the values in the cache descriptor as that would lead to cache coherency issues and spinlocks would have to used to protect the CPU caches. Instead a ccupdate_t struct is populated with all the information each CPU needs and each CPU swaps the new data with the old information in the cache descriptor. The struct for storing the new cpucache information is defined as follows
868 typedef struct ccupdate_struct_s 869 { 870 kmem_cache_t *cachep; 871 cpucache_t *new[NR_CPUS]; 872 } ccupdate_struct_t;
cachep is the cache being updated and new is the array of the cpucache descriptors for each CPU on the system. The function smp_function_all_cpus() is used to get each CPU to call the do_ccupdate_local() function which swaps the information from ccupdate_struct_t with the information in the cache descriptor.
Once the information has been swapped, the old data can be deleted.
When a cache is being shrunk, its first step is to drain the cpucaches of any objects they might have. This is so that the slab allocator will have a clearer view of what slabs can be freed or not. This is important because if just one object in a slab is placed in a per-cpu cache, that whole slab cannot be freed. If the system is tight on memory, saving a few milliseconds on allocations has a low priority.