After this documentation was released in July 2003, I was approached by Prentice Hall and asked to write a book on the Linux VM under the Bruce Peren's Open Book Series.

The book is available and called simply "Understanding The Linux Virtual Memory Manager". There is a lot of additional material in the book that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available although I obviously encourge you to buy the book from your favourite book store :-) . As the book is under the Bruce Perens Open Book Series, it will be available 90 days after appearing on the book shelves which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.

To be fully clear, this webpage is not the actual book.
next up previous contents index
Next: 7.3 Free Pages Up: 7. Physical Page Allocation Previous: 7.1 Managing Free Blocks   Contents   Index


7.2 Allocating Pages

Linux provides a quite sizable API for the allocation of page frames. All of them take a gfp_mask as a parameter which is a set of flags that determine how the allocator will behave. The flags are discussed in Section 7.4.

The allocation API functions all use the core function __alloc_pages() but the APIs exist so that the correct node and zone will be chosen. Different users will require different zones such as ZONE_ DMA for certain device drivers or ZONE_ NORMAL for disk buffers and callers should not have to be aware of what node is being used. A full list of page allocation APIs are listed in Table 7.1.


Table 7.1: Physical Pages Allocation API
\begin{table}\begin{center}
\begin{tabularx}{13.5cm}{\vert X\vert}
\hline
...
...n
a struct page \\ \\
\par
\hline
\end{tabularx}
\end{center} \end{table}


Allocations are always for a specified order, 0 in the case where a single page is required. If a free block cannot be found of the requested order, a higher order block is split into two buddies. One is allocated and the other is placed on the free list for the lower order. Figure 7.2 shows where a $2^4$ block is split and how the buddies are added to the free lists until a block for the process is available.

Figure 7.2: Allocating physical pages
\includegraphics[width=10cm]{graphs/buddy_allocation.ps}

When the block is later freed, the buddy will be checked. If both are free, they are merged to form a higher order block and placed on the higher free list where its buddy is checked and so on. If the buddy is not free, the freed block is added to the free list at the current order. During these list manipulations, interrupts have to be disabled to prevent an interrupt handler manipulating the lists while a process has them in an inconsistent state. This is achieved by using an interrupt safe spinlock.

The second decision to make is which memory node or pg_data_t to use. Linux uses a node-local allocation policy which aims to use the memory bank associated with the CPU running the page allocating process. Here, the function _alloc_pages() is what is important as this function is different depending on whether the kernel is built for a UMA (function in mm/page_alloc.c) or NUMA (function in mm/numa.c) machine.

Regardless of which API is used, __alloc_pages() in mm/page_alloc.c is the heart of the allocator. This function, which is never called directly, examines the selected zone and checks if it is suitable to allocate from based on the number of available pages. If the zone is not suitable, the allocator may fall back to other zones. The order of zones to fall back on are decided at boot time by the function build_zonelists() but generally ZONE_ HIGHMEM will fall back to ZONE_ NORMAL and that in turn will fall back to ZONE_ DMA. If number of free pages reaches the pages_low watermark, it will wake kswapd to begin freeing up pages from zones and if memory is extremely tight, the caller will do the work of kswapd itself.

Figure 7.3: Call Graph: alloc_pages()
\includegraphics[width=10cm]{graphs/alloc_pages.ps}

Once the zone has finally been decided on, the function rmqueue() is called to allocate the block of pages or split higher level blocks if one of the appropriate size is not available.


next up previous contents index
Next: 7.3 Free Pages Up: 7. Physical Page Allocation Previous: 7.1 Managing Free Blocks   Contents   Index
Mel 2004-02-15