After this documentation was released in July 2003, I was approached by Prentice Hall and asked to write a book on the Linux VM under the Bruce Perens' Open Book Series.

The book is available and is called simply "Understanding The Linux Virtual Memory Manager". There is a lot of additional material in the book that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications, and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available, although I obviously encourage you to buy the book from your favourite book store :-) . As the book is under the Bruce Perens' Open Book Series, it will be available for download 90 days after appearing on the book shelves, which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.

To be fully clear, this webpage is not the actual book.


3.2 Zones

Each zone is described by a zone_t, which is a typedef of struct zone_struct. It keeps track of information like page usage statistics, free area information and locks. It is declared as follows in <linux/mmzone.h>:

typedef struct zone_struct {
        spinlock_t         lock;
        unsigned long      free_pages;
        unsigned long      pages_min, pages_low, pages_high;
        int                need_balance;

        free_area_t        free_area[MAX_ORDER];

        wait_queue_head_t  *wait_table;
        unsigned long      wait_table_size;
        unsigned long      wait_table_shift;

        struct pglist_data *zone_pgdat;
        struct page        *zone_mem_map;
        unsigned long      zone_start_paddr;
        unsigned long      zone_start_mapnr;

        char               *name;
        unsigned long      size;
} zone_t;

This is a brief explanation of each field in the struct.

lock Spinlock to protect the zone;
free_pages Total number of free pages in the zone;
pages_min, pages_low, pages_high These are zone watermarks which are described in the next section;
need_balance This flag tells the pageout daemon kswapd to balance the zone;
free_area Free area bitmaps used by the buddy allocator;

wait_table A hash table of wait queues of processes waiting on a page to be freed. This is of importance to wait_on_page() and unlock_page(). While processes could all wait on one queue, this would cause a "thundering herd" of processes to race for pages that are still locked when they are woken up;

wait_table_size Size of the hash table which is a power of 2;

wait_table_shift Defined as the number of bits in a long minus the binary logarithm of the table size above;

zone_pgdat Points to the parent pg_data_t;

zone_mem_map The first page in the global mem_map this zone refers to;

zone_start_paddr Same principle as node_start_paddr;
zone_start_mapnr Same principle as node_start_mapnr;
name The string name of the zone, "DMA", "Normal" or "HighMem";
size The size of the zone in pages.
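
The relationship between wait_table_size and wait_table_shift can be illustrated with a small sketch. It assumes the table index is taken from the top bits of a multiplicative hash of some key such as a page address. The helper names and the multiplier below are hypothetical and chosen only for illustration; they are not taken from the kernel source.

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in an unsigned long on this platform. */
#define BITS_PER_ULONG ((unsigned long)(sizeof(unsigned long) * CHAR_BIT))

/* Binary logarithm of a table size that is assumed to be a power of 2. */
static unsigned long log2_pow2(unsigned long size)
{
    unsigned long bits = 0;

    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}

/* Shift as described above: bits in a long minus log2(table size). */
static unsigned long shift_for_table(unsigned long wait_table_size)
{
    return BITS_PER_ULONG - log2_pow2(wait_table_size);
}

/*
 * Hypothetical hash: multiply the key by an odd constant and keep only
 * the top log2(size) bits, so the result is always a valid table index.
 * The multiplier is illustrative, not the kernel's constant.
 */
static unsigned long table_index(unsigned long key, unsigned long shift)
{
    const unsigned long multiplier = (unsigned long)0x9E3779B97F4A7C15ULL;

    if (shift >= BITS_PER_ULONG)
        return 0;   /* table of size 1: avoid shifting by the full width */
    return (key * multiplier) >> shift;
}
```

For a table of 256 queues, shift_for_table(256) is the number of bits in a long minus 8, so table_index() always yields a value between 0 and 255. Keeping the size a power of 2 is what allows a single shift to replace a modulo operation.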


3.2.1 Zone Watermarks

When available memory in the system is low, the pageout daemon kswapd is woken up to start freeing pages (see Chapter 11). If the pressure is high, the allocating process will free up memory synchronously, which is sometimes referred to as the direct reclaim path. The parameters affecting pageout behavior are similar to those used by FreeBSD [mckusick96] and Solaris [mauro01].

Each zone has three watermarks called pages_low, pages_min and pages_high, which help track how much pressure a zone is under. The number of pages for pages_min is calculated in the function free_area_init_core() during memory init and is based on a ratio to the size of the zone in pages. It is calculated initially as ZoneSizeInPages / 128. The lowest value it will be is 20 pages (80KiB on x86) and the highest possible value is 255 pages (just under 1MiB on x86).

pages_min When pages_min is reached, the allocator will do the kswapd work in a synchronous fashion. There is no real equivalent in Solaris, but the closest is desfree or minfree, which determine how often the pageout scanner is woken up;

pages_low When the number of free pages falls to pages_low, kswapd is woken up by the buddy allocator to start freeing pages. This is equivalent to when lotsfree is reached in Solaris and freemin in FreeBSD. The value is twice the value of pages_min by default;

pages_high Once kswapd has been woken, it will not consider the zone to be "balanced" until pages_high pages are free. In Solaris, this is called lotsfree and in BSD, it is called free_target. The default for pages_high is three times the value of pages_min.

Whatever the pageout parameters are called in each operating system, the meaning is the same: they help determine how hard the pageout daemon or processes work to free up pages.
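
The default watermark arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not the kernel's free_area_init_core() code. The ratio, bounds and multipliers come from the text, while the struct, enum and function names are hypothetical, and the exact comparisons (less-than versus less-than-or-equal) are assumptions.

```c
/* Ratio and bounds for pages_min, as quoted in the text. */
#define WATERMARK_RATIO 128UL
#define PAGES_MIN_FLOOR 20UL
#define PAGES_MIN_CEIL  255UL

/* Hypothetical container for the three per-zone watermarks. */
struct watermarks {
    unsigned long pages_min;
    unsigned long pages_low;
    unsigned long pages_high;
};

/* Hypothetical labels for what the allocator would do at each level. */
enum zone_action {
    ZONE_NO_ACTION,       /* above pages_low: nothing to do */
    ZONE_WAKE_KSWAPD,     /* at or below pages_low: wake the daemon */
    ZONE_DIRECT_RECLAIM   /* at or below pages_min: reclaim synchronously */
};

/* pages_min = size / 128 clamped to [20, 255]; low = 2 * min; high = 3 * min. */
static struct watermarks compute_watermarks(unsigned long zone_size_in_pages)
{
    struct watermarks w;
    unsigned long min = zone_size_in_pages / WATERMARK_RATIO;

    if (min < PAGES_MIN_FLOOR)
        min = PAGES_MIN_FLOOR;
    else if (min > PAGES_MIN_CEIL)
        min = PAGES_MIN_CEIL;

    w.pages_min = min;
    w.pages_low = min * 2;
    w.pages_high = min * 3;
    return w;
}

/* Map the current free page count onto the behaviour described above. */
static enum zone_action classify_zone(unsigned long free_pages,
                                      const struct watermarks *w)
{
    if (free_pages <= w->pages_min)
        return ZONE_DIRECT_RECLAIM;
    if (free_pages <= w->pages_low)
        return ZONE_WAKE_KSWAPD;
    return ZONE_NO_ACTION;
}

/* Once woken, kswapd keeps working until pages_high pages are free. */
static int zone_is_balanced(unsigned long free_pages,
                            const struct watermarks *w)
{
    return free_pages >= w->pages_high;
}
```

For a zone of 16384 pages (64MiB with 4KiB pages), compute_watermarks() gives pages_min = 128, pages_low = 256 and pages_high = 384. A zone of 1000 pages is clamped up to the floor of 20, and a zone of 1000000 pages is clamped down to the ceiling of 255.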


Mel 2004-02-15