After this documentation was released in July 2003, I was approached by Prentice Hall and asked to write a book on the Linux VM under the Bruce Perens' Open Book Series.

The book is available and is called simply "Understanding The Linux Virtual Memory Manager". It contains a lot of additional material that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications, and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available, although I obviously encourage you to buy the book from your favourite book store :-). As the book is under the Bruce Perens Open Book Series, it will be available for download 90 days after appearing on the book shelves, which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.

To be fully clear, this webpage is not the actual book.


12.2 Mapping Page Table Entries to Swap Entries

When a page is swapped out, Linux uses the corresponding PTE to store enough information to locate the page on disk again. Obviously a PTE is not large enough in itself to store precisely where on disk the page is located, but it is more than enough to store an index into the swap_info array and an offset within the swap_map and this is precisely what Linux does.

Each PTE, regardless of architecture, is large enough to store a swp_entry_t, which is declared as follows in linux/shmem_fs.h:

    typedef struct {
            unsigned long val;
    } swp_entry_t;

Two macros are provided for the translation of PTEs to swap entries and vice versa. They are pte_to_swp_entry() and swp_entry_to_pte() respectively.

In the swp_entry_t, two bits are always kept free which are used by Linux to determine if a PTE is present or swapped out. Bit 0 is reserved for the _PAGE_PRESENT flag and Bit 7 is reserved for _PAGE_PROTNONE. The requirement for both bits is explained in Section 4.2.
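The role of the two reserved bits can be sketched in user space as follows. This is only an illustration of the bit layout described above, assuming a 32-bit x86 PTE value: the names `SKETCH_PAGE_PRESENT`, `SKETCH_PAGE_PROTNONE` and `sketch_classify_pte()` are hypothetical and are not kernel identifiers. The assumption is that an all-zero PTE means no mapping, a PTE with either reserved bit set refers to a page in memory, and anything else holds a swap entry.

```c
#include <stdint.h>

/* Bit positions taken from the text; the macro names are illustrative only. */
#define SKETCH_PAGE_PRESENT  0x001u  /* bit 0 */
#define SKETCH_PAGE_PROTNONE 0x080u  /* bit 7 */

enum sketch_pte_state {
    SKETCH_PTE_NONE,     /* PTE is empty, nothing mapped */
    SKETCH_PTE_PRESENT,  /* page is resident in memory */
    SKETCH_PTE_SWAPPED   /* PTE holds a swap entry */
};

/* Classify a raw 32-bit PTE value using only the two reserved bits. */
static enum sketch_pte_state sketch_classify_pte(uint32_t pte)
{
    if (pte == 0)
        return SKETCH_PTE_NONE;
    if (pte & (SKETCH_PAGE_PRESENT | SKETCH_PAGE_PROTNONE))
        return SKETCH_PTE_PRESENT;
    /* Non-zero with both reserved bits clear: treat as a swap entry. */
    return SKETCH_PTE_SWAPPED;
}
```

Because a swap entry never sets bit 0 or bit 7, a check of this kind cannot mistake a swapped-out page for a resident one.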

Bits 1-6 are for the type which is the index within the swap_info array and are returned by the SWP_TYPE() macro.

Bits 8-31 are used to store the offset within the swap_map from the swp_entry_t. On the x86, this means 24 bits are available, "limiting" the size of the swap area to 64GiB. The macro SWP_OFFSET() is used to extract the offset.
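The 64GiB figure follows from the 24 offset bits if each offset is assumed to address one 4KiB page, which is the usual x86 page size (an assumption not stated explicitly above):

$$2^{24}\ \mathrm{pages} \times 2^{12}\ \mathrm{bytes\ per\ page} = 2^{36}\ \mathrm{bytes} = 64\,\mathrm{GiB}$$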

To encode a type and offset into a swp_entry_t, the macro SWP_ENTRY() is available which simply performs the relevant bit shifting operations. The relationship between all these macros is illustrated in Figure 12.1.
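The bit shifting can be illustrated with a small user-space sketch of the x86 layout described above (type in bits 1-6, offset in bits 8-31). The `sketch_` names are hypothetical stand-ins for SWP_ENTRY(), SWP_TYPE() and SWP_OFFSET(); the real macros are architecture specific and may differ in detail, and `uint32_t` is used here only to model a 32-bit `unsigned long`.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for swp_entry_t on a 32-bit architecture. */
typedef struct {
    uint32_t val;
} sketch_swp_entry_t;

#define SKETCH_TYPE_SHIFT   1      /* type starts at bit 1 */
#define SKETCH_TYPE_MASK    0x3fu  /* six bits of type */
#define SKETCH_OFFSET_SHIFT 8      /* offset starts at bit 8 */
#define SKETCH_OFFSET_BITS  24     /* bits 8-31 */

/* Pack a swap area index and a swap_map offset into one entry. */
static sketch_swp_entry_t sketch_swp_entry(uint32_t type, uint32_t offset)
{
    sketch_swp_entry_t entry;

    assert(type <= SKETCH_TYPE_MASK);
    assert(offset < (UINT32_C(1) << SKETCH_OFFSET_BITS));
    entry.val = (type << SKETCH_TYPE_SHIFT) | (offset << SKETCH_OFFSET_SHIFT);
    return entry;
}

/* Extract the index into the swap_info array (bits 1-6). */
static uint32_t sketch_swp_type(sketch_swp_entry_t entry)
{
    return (entry.val >> SKETCH_TYPE_SHIFT) & SKETCH_TYPE_MASK;
}

/* Extract the offset within the swap_map (bits 8-31). */
static uint32_t sketch_swp_offset(sketch_swp_entry_t entry)
{
    return entry.val >> SKETCH_OFFSET_SHIFT;
}
```

For example, encoding type 3 and offset 0x1234 gives the value 0x123406, and both fields can be recovered with the two extraction helpers. Bits 0 and 7 stay clear for any valid input, so the result is never mistaken for a present page.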

**Figure 12.1:** Storing Swap Entry Information in `swp_entry_t`

It should be noted that the six bits for "type" should allow up to 64 swap areas to exist on a 32-bit architecture instead of the MAX_SWAPFILES restriction of 32. The restriction is probably due to the consumption of the vmalloc address space. If a swap area is the maximum possible size then 32MiB is required for the swap_map ( $2^{24} * \mathrm{sizeof}(\mathrm{short})$ ); remember that each page uses one short for the reference count. For MAX_SWAPFILES swap areas of maximum size to exist, 1GiB of vmalloc address space would be required, which is simply impossible because of the user/kernel linear address space split.
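The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical, and the calculation assumes a two-byte `unsigned short` (as on x86) and areas that each use the full 24-bit offset range.

```c
#include <stdint.h>

#define SKETCH_OFFSET_BITS   24  /* offset bits available on x86 */
#define SKETCH_MAX_SWAPFILES 32  /* limit quoted in the text */

/* Bytes needed for the swap_map of one maximum-sized swap area:
 * one short reference count per page slot. */
static uint64_t sketch_swap_map_bytes(void)
{
    return (UINT64_C(1) << SKETCH_OFFSET_BITS) * sizeof(unsigned short);
}

/* Bytes needed for the swap_map arrays of `areas` maximum-sized areas. */
static uint64_t sketch_total_swap_map_bytes(unsigned int areas)
{
    return (uint64_t)areas * sketch_swap_map_bytes();
}
```

With a two-byte short, one area needs 32MiB and `sketch_total_swap_map_bytes(SKETCH_MAX_SWAPFILES)` comes to 1GiB. Allowing all 64 possible areas would double that to 2GiB.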

This would imply that supporting 64 swap areas is not worth the additional complexity, but there are cases where a large number of swap areas would be desirable even if the overall swap available does not increase. Some modern machines^12.2 have many separate disks which between them can create a large number of separate block devices. In this case, it is desirable to create a large number of small swap areas which are evenly distributed across all disks. This would allow a high degree of parallelism in the page swapping behavior, which is important for swap intensive applications.

Footnotes

^12.2: A Sun E450 could have in the region of 20 disks in it, for example.


Mel 2004-02-15