After this documentation was released in July 2003, I was approached by Prentice Hall and asked to write a book on the Linux VM under the Bruce Perens' Open Book Series.

The book is available and called simply "Understanding The Linux Virtual Memory Manager". There is a lot of additional material in the book that is not available here, including details on later 2.4 kernels, introductions to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB management, a lot more code commentary, countless other additions and clarifications and a CD with lots of cool stuff on it. This material (although now dated and lacking in comparison to the book) will remain available, although I obviously encourage you to buy the book from your favourite book store :-) . As the book is under the Bruce Perens' Open Book Series, it will be available 90 days after appearing on the book shelves, which means it is not available right now. When it is available, it will be downloadable from http://www.phptr.com/perens so check there for more information.

To be fully clear, this webpage is not the actual book.


12.1 Describing the Swap Area

Each active swap area, be it a file or partition, has a struct swap_info_struct describing the area. All the structures in the running system are stored in a statically declared array called swap_info which holds MAX_SWAPFILES entries, where MAX_SWAPFILES is statically defined as 32. This means that at most 32 swap areas can exist on a running system. The swap_info_struct is declared as follows in linux/swap.h:

struct swap_info_struct {
        unsigned int flags;
        kdev_t swap_device;
        spinlock_t sdev_lock;
        struct dentry * swap_file;
        struct vfsmount *swap_vfsmnt;
        unsigned short * swap_map;
        unsigned int lowest_bit;
        unsigned int highest_bit;
        unsigned int cluster_next;
        unsigned int cluster_nr;
        int prio;
        int pages;
        unsigned long max;
        int next;
};

Here is a small description of each of the fields in this quite sizable struct.

: flags This is a bit field with two possible values. SWP_USED is set if the swap area is currently active. SWP_WRITEOK is defined as 3, which covers the two least significant bits and therefore includes the SWP_USED bit. flags is set to SWP_WRITEOK when Linux is ready to write to the area, as an area must be active to be written to;
: swap_device The device corresponding to the partition used for this swap area is stored here. If the swap area is a file, this is NULL;
: sdev_lock As with many structures in Linux, this one has to be protected too. sdev_lock is a spinlock protecting the struct, principally the swap_map. It is locked and unlocked with swap_device_lock() and swap_device_unlock();
: swap_file This is the dentry for the actual special file that is mounted as a swap area. This could be the dentry for a file in the /dev/ directory, for example, in the case where a partition is mounted. This field is needed to identify the correct swap_info_struct when deactivating a swap area;
: swap_vfsmnt This is the vfsmount object corresponding to where the device or file for this swap area is stored;
: swap_map This is a large array with one entry for every swap entry, or page sized slot in the area. An entry is a reference count of the number of users of this page slot. If it is equal to SWAP_MAP_MAX, the slot is allocated permanently. If equal to SWAP_MAP_BAD, the slot will never be used;
: lowest_bit This is the lowest possible free slot available in the swap area. Linear scans for a free slot start from here to reduce the search space, as it is known that there are definitely no free slots below this mark;
: highest_bit This is the highest possible free slot available in this swap area. Similar to lowest_bit, there are definitely no free slots above this mark;
: cluster_next This is the offset of the next cluster of blocks to use. The swap area tries to have pages allocated in cluster blocks to increase the chance related pages will be stored together;
: cluster_nr This is the number of pages left to allocate in this cluster;
: prio Each swap area has a priority which is stored in this field. Areas are arranged in order of priority and determine how likely the area is to be used. By default the priorities are arranged in order of activation but the system administrator may also specify it using the -p flag when using swapon;
: pages As some slots on the swap file may be unusable, this field stores the number of usable pages in the swap area. This differs from max in that slots marked SWAP_MAP_BAD are not counted;
: max This is the total number of slots in this swap area;
: next This is the index in the swap_info array of the next swap area in the system.
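To make the roles of flags, swap_map, lowest_bit and highest_bit a little more concrete, here is a minimal user-space sketch of a linear scan for a free slot. It is not the kernel's allocator: struct swap_area_model, area_is_writable() and find_free_slot() are hypothetical names invented for this illustration, the values given for SWAP_MAP_MAX and SWAP_MAP_BAD are assumptions, and clustering and locking are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SWP_USED      1       /* area is active */
#define SWP_WRITEOK   3       /* active and writable; includes the SWP_USED bit */
#define SWAP_MAP_MAX  0x7fff  /* assumed value: slot is allocated permanently */
#define SWAP_MAP_BAD  0x8000  /* assumed value: slot will never be used */

/* Hypothetical model holding only the fields this sketch needs. */
struct swap_area_model {
        unsigned int flags;
        unsigned short *swap_map;   /* one reference count per slot */
        unsigned int lowest_bit;    /* no free slots below this offset */
        unsigned int highest_bit;   /* no free slots above this offset */
};

/* Return non-zero if both bits of SWP_WRITEOK are set in flags. */
int area_is_writable(const struct swap_area_model *si)
{
        assert(si != NULL);
        return (si->flags & SWP_WRITEOK) == SWP_WRITEOK;
}

/*
 * Scan linearly from lowest_bit to highest_bit for a slot whose count is 0.
 * On success the slot's count becomes 1 and its offset is returned.
 * Returns 0 on failure, which is unambiguous as long as slot 0 is reserved
 * and the caller keeps lowest_bit at 1 or above.
 */
unsigned int find_free_slot(struct swap_area_model *si)
{
        unsigned int offset;

        assert(si != NULL && si->swap_map != NULL);
        if (!area_is_writable(si))
                return 0;

        for (offset = si->lowest_bit; offset <= si->highest_bit; offset++) {
                if (si->swap_map[offset] == 0) {
                        si->swap_map[offset] = 1;
                        /* Everything up to and including offset is now in use. */
                        si->lowest_bit = offset + 1;
                        return offset;
                }
        }
        return 0;
}
```

Slots marked SWAP_MAP_BAD or holding a non-zero count are skipped simply because their entry is not 0, which is why a single comparison is enough in the loop.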

The areas, though stored in an array, are also kept in a pseudo list called swap_list, which is a very simple type declared as follows in linux/swap.h:

struct swap_list_t {
        int head;    /* head of priority-ordered swapfile list */
        int next;    /* swapfile to be used next */
};

The head field is the index of the highest priority swap area in use and next is the index of the next swap area that should be used. This allows areas to be arranged in order of priority when searching for a suitable area while still being looked up quickly in the array when necessary.
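The way an array can double as a priority ordered list is easier to see in code. The following is a rough user-space sketch in which each array entry stores the index of the next entry. The names area_model, list_model, link_area() and walk_by_priority() are hypothetical, and the use of -1 to mark the end of the list is an assumption made for this illustration.

```c
#include <assert.h>

#define MAX_SWAPFILES 32

/* Hypothetical subset of swap_info_struct: priority and list linkage only. */
struct area_model {
        int prio;
        int next;   /* index of the next area in the list, -1 at the end */
};

struct swap_list_model {
        int head;   /* index of the highest priority area */
        int next;   /* index of the area to be used next */
};

struct area_model areas[MAX_SWAPFILES];
struct swap_list_model list_model = { -1, -1 };  /* assumed: -1 means empty */

/* Link areas[type] into the list so the highest priority comes first. */
void link_area(int type, int prio)
{
        int i, prev = -1;

        assert(type >= 0 && type < MAX_SWAPFILES);
        areas[type].prio = prio;

        /* Find the first existing area whose priority is not higher. */
        for (i = list_model.head; i >= 0; i = areas[i].next) {
                if (prio >= areas[i].prio)
                        break;
                prev = i;
        }

        areas[type].next = i;
        if (prev < 0)
                list_model.head = list_model.next = type;
        else
                areas[prev].next = type;
}

/* Store the area indices in priority order in out[]; return how many. */
int walk_by_priority(int *out, int max)
{
        int i, n = 0;

        for (i = list_model.head; i >= 0 && n < max; i = areas[i].next)
                out[n++] = i;
        return n;
}
```

Because the links are array indices rather than pointers, any area can still be reached directly as areas[type] without walking the list.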

Each swap area is divided up into a number of page sized slots on disk which means that each slot is 4096 bytes on the x86 for example. The first slot is always reserved as it contains information about the swap area that should not be overwritten. The first 1 KiB of the swap area is used to store a disk label for the partition that can be picked up by userspace tools. The remaining space is used for information about the swap area which is filled when the swap area is created with the system program mkswap. The information is used to fill in a union swap_header which is declared as follows in linux/swap.h:

union swap_header {
        struct
        {
                char reserved[PAGE_SIZE - 10];
                char magic[10];
        } magic;
        struct
        {
                char         bootbits[1024];
                unsigned int version;
                unsigned int last_page;
                unsigned int nr_badpages;
                unsigned int padding[125];
                unsigned int badpages[1];
        } info;
};

A description of each of the fields follows:

: magic The magic part of the union is used just for identifying the "magic" string. The string exists to make sure there is no chance a partition that is not a swap area will be used and to decide what version of swap area it is. If the string is "SWAP-SPACE", it is version 1 of the swap file format. If it is "SWAPSPACE2", it is version 2. The large reserved array is just so that the magic string will be read from the end of the page;
: bootbits This is the reserved area containing information about the partition such as the disk label;
: version This is the version of the swap area layout;
: last_page This is the last usable page in the area;
: nr_badpages The number of known bad pages in the swap area is stored in this field;
: padding A disk sector is usually 512 bytes in size. The three fields version, last_page and nr_badpages make up 12 bytes and the padding fills up the remaining 500 bytes to cover one sector;
: badpages The remainder of the page is used to store the indices of up to MAX_SWAP_BADPAGES number of bad page slots. These slots are filled in by the mkswap system program if the -c switch is specified to check the area.
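As a small illustration of how the magic string at the end of the first page identifies the format, consider the following sketch. It works on a plain buffer in userspace. The function swap_format_version() and the constant MODEL_PAGE_SIZE are hypothetical names, and a 4096 byte page is assumed, as on the x86.

```c
#include <assert.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096   /* assumed page size, as on the x86 */
#define MAGIC_LEN       10     /* length of the magic string */

/*
 * Inspect the last MAGIC_LEN bytes of the first page of a would-be swap area.
 * Returns 1 or 2 for the swap format version, or 0 if neither magic string
 * is present, meaning the page is not a swap header.
 */
int swap_format_version(const unsigned char *page)
{
        const unsigned char *magic;

        assert(page != NULL);
        magic = page + MODEL_PAGE_SIZE - MAGIC_LEN;

        /* The strings are not NUL terminated on disk, so compare raw bytes. */
        if (memcmp(magic, "SWAP-SPACE", MAGIC_LEN) == 0)
                return 1;
        if (memcmp(magic, "SWAPSPACE2", MAGIC_LEN) == 0)
                return 2;
        return 0;
}
```

The offset PAGE_SIZE - 10 corresponds to the size of the reserved array in the magic part of the union, which is why the string is always found at the very end of the page.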

MAX_SWAP_BADPAGES is a compile time constant which varies if the struct changes. In its current form, with a 4096 byte page and a 4 byte long as on the x86, it is 637 entries, as given by the simple equation:

$$\mathrm{MAX\_SWAP\_BADPAGES} = \frac{\mathrm{PAGE\_SIZE} - 1024 - 512 - 10}{\mathrm{sizeof}(\mathrm{long})}$$

Here 1024 is the size of the bootbits area, 512 is the size of the sector covered by version, last_page, nr_badpages and the padding, and 10 is the size of the magic string identifying the format of the swap file.
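The arithmetic can be checked with a few lines of C. This is only a sketch: max_swap_badpages() is a hypothetical helper, and the page size and the size of a long are passed in as parameters because sizeof(long) on the machine running the example may differ from the 4 bytes assumed for the x86.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bad page entries that fit in the first page after subtracting
 * the 1024 byte bootbits area, the 512 byte sector holding version,
 * last_page, nr_badpages and padding, and the 10 byte magic string.
 * Integer division discards any leftover bytes.
 */
size_t max_swap_badpages(size_t page_size, size_t long_size)
{
        assert(long_size > 0);
        assert(page_size >= 1024 + 512 + 10);
        return (page_size - 1024 - 512 - 10) / long_size;
}
```

Calling max_swap_badpages(4096, 4) gives 2550 / 4, which truncates to the 637 entries quoted above.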


Mel 2004-02-15