After this documentation was released in July 2003, I was approached
by Prentice Hall and asked to write a book on the Linux VM under the Bruce Perens' Open Book Series.
The book is available and called simply "Understanding The Linux Virtual
Memory Manager". There is a lot of additional material in the book that is
not available here, including details on later 2.4 kernels, introductions
to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB
management, a lot more code commentary, countless other additions and
clarifications and a CD with lots of cool stuff on it. This material (although
now dated and lacking in comparison to the book) will remain available
although I obviously encourage you to buy the book from your favourite book
store :-). As the book is under the Bruce Perens Open Book Series, it will
be available 90 days after appearing on the book shelves which means it
is not available right now. When it is available, it will be downloadable
from http://www.phptr.com/perens
so check there for more information.
To be fully clear, this webpage is not the actual book.
3.1 Nodes
As we have mentioned, each node in memory is described by a
pg_data_t struct. When allocating a page, Linux uses a
node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The struct is declared
as follows in linux/mmzone.h:
    typedef struct pglist_data {
        zone_t node_zones[MAX_NR_ZONES];
        zonelist_t node_zonelists[GFP_ZONEMASK+1];
        int nr_zones;
        struct page *node_mem_map;
        unsigned long *valid_addr_bitmap;
        struct bootmem_data *bdata;
        unsigned long node_start_paddr;
        unsigned long node_start_mapnr;
        unsigned long node_size;
        int node_id;
        struct pglist_data *node_next;
    } pg_data_t;
We now briefly describe each of these fields:
- node_zones The zones for this node, ZONE_HIGHMEM,
ZONE_NORMAL, ZONE_DMA;
- node_zonelists This is the order of zones that allocations are
preferred from. build_zonelists() in
page_alloc.c sets up the order when called
by free_area_init_core(). A failed allocation
in ZONE_HIGHMEM may fall back to ZONE_NORMAL
or back to ZONE_DMA;
- nr_zones Number of zones in this node, between 1 and 3. Not all
nodes will have three. A CPU bank may not have ZONE_DMA
for example;
- node_mem_map This is the first page of the struct page
array representing each physical frame in the node. It will
be placed somewhere within the global mem_map
array;
- valid_addr_bitmap A bitmap which describes "holes" in the memory
node that no memory exists for;
- bdata This is only of interest to the boot memory allocator
discussed in Chapter 6;
- node_start_paddr The starting physical address of the node. An
unsigned long is a poor fit here as it breaks on
ia32 [3.1] with Physical Address Extension (PAE) [3.2], where physical
addresses can exceed 32 bits. A more suitable solution would be to record
this as a Page Frame Number (PFN), which could
be trivially defined as (page_phys_addr >> PAGE_SHIFT);
- node_start_mapnr This gives the page offset within the global
mem_map. It is calculated in
free_area_init_core() by calculating the
number of pages between mem_map and the local
mem_map for this node called lmem_map;
- node_size The total number of pages in this node;
- node_id The ID of the node, starts at 0;
- node_next Pointer to next node in a NULL terminated list.
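The arithmetic behind node_start_paddr and node_start_mapnr can be illustrated with a small user-space sketch. It is not kernel code: struct page is a one-field stand-in, PAGE_SHIFT is assumed to be 12 (4 KiB pages as on ia32), and the helper names phys_to_pfn(), start_mapnr() and node_offsets_demo() are hypothetical names chosen for this example. A node starting at 5 GiB is used because that address needs 33 bits, so it does not fit in a 32-bit unsigned long, while its PFN does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages, as on ia32 */

/* Hypothetical stand-in for struct page; only pointer arithmetic matters. */
struct page {
    unsigned long flags;
};

/* Physical address to Page Frame Number: drop the in-page offset bits. */
uint32_t phys_to_pfn(uint64_t page_phys_addr)
{
    return (uint32_t)(page_phys_addr >> PAGE_SHIFT);
}

/* Number of struct pages between the global array and a node's local
 * array, which is the value the text describes for node_start_mapnr. */
unsigned long start_mapnr(const struct page *mem_map,
                          const struct page *lmem_map)
{
    return (unsigned long)(lmem_map - mem_map);
}

/* Walk through both calculations and print the results. */
void node_offsets_demo(void)
{
    static struct page mem_map[64];         /* pretend global mem_map */
    struct page *lmem_map = &mem_map[16];   /* pretend second node's slice */

    uint64_t paddr = 5ULL << 30;            /* node starts at 5 GiB */
    uint32_t truncated = (uint32_t)paddr;   /* what a 32-bit field keeps */

    printf("physical address : %#llx\n", (unsigned long long)paddr);
    printf("32-bit truncation: %#lx\n", (unsigned long)truncated);
    printf("page frame number: %#lx\n", (unsigned long)phys_to_pfn(paddr));
    printf("node_start_mapnr : %lu\n", start_mapnr(mem_map, lmem_map));
}
```

Calling node_offsets_demo() from a main() shows the 32-bit truncation losing the top bit of the address while the PFN still identifies the correct frame.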
All nodes in the system are maintained on a list called
pgdat_list. The nodes are placed on this list as they are
initialised by the init_bootmem_core() function, described later
in Section 6.2.2. Up until late 2.4 kernels
(> 2.4.18), blocks of code that traversed the list looked something like:
    pg_data_t *pgdat;
    pgdat = pgdat_list;
    do {
        /* do something with pg_data_t */
        ...
    } while ((pgdat = pgdat->node_next));
In more recent kernels, a macro for_each_pgdat(), which is
trivially defined as a for loop, is provided to improve code readability.
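As a rough illustration of what such a macro could look like, the sketch below defines for_each_pgdat() as a for loop over a cut-down pg_data_t and uses it to total the pages of every node. The reduced struct, the exact macro body and the helper total_node_pages() are assumptions made for this example and are not copied from any particular kernel version.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for pg_data_t: only the fields the walk needs. */
typedef struct pglist_data {
    unsigned long node_size;        /* total pages in this node */
    int node_id;                    /* node ID, starting at 0 */
    struct pglist_data *node_next;  /* next node, NULL terminated */
} pg_data_t;

/* Head of the node list, named after the kernel's global list. */
pg_data_t *pgdat_list = NULL;

/* Sketch of the macro: visit each node until the NULL terminator. */
#define for_each_pgdat(pgdat) \
    for ((pgdat) = pgdat_list; (pgdat) != NULL; (pgdat) = (pgdat)->node_next)

/* Hypothetical helper: sum node_size over every node on the list. */
unsigned long total_node_pages(void)
{
    pg_data_t *pgdat;
    unsigned long total = 0;

    for_each_pgdat(pgdat)
        total += pgdat->node_size;

    return total;
}
```

One side effect of the for loop form in this sketch is that an empty list is handled without dereferencing a NULL pointer, which the do-while form shown earlier would not do.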
Footnotes
- [3.1] (ia32) FYI from Jeff Haran: Some PowerPC variants appear to have this
same problem (e.g. PPC440GP).
- [3.2] (PAE) PAE is discussed further in Section 3.4.
Mel
2004-02-15