After this documentation was released in July 2003, I was approached
by Prentice Hall and asked to write a book on the Linux VM under the Bruce Perens' Open Book Series.
The book is available and called simply "Understanding The Linux Virtual
Memory Manager". There is a lot of additional material in the book that is
not available here, including details on later 2.4 kernels, introductions
to 2.6, a whole new chapter on the shared memory filesystem, coverage of TLB
management, a lot more code commentary, countless other additions and
clarifications and a CD with lots of cool stuff on it. This material (although
now dated and lacking in comparison to the book) will remain available
although I obviously encourage you to buy the book from your favourite book
store :-) . As the book is under the Bruce Perens Open Book Series, it will
be available 90 days after appearing on the book shelves which means it
is not available right now. When it is available, it will be downloadable
from http://www.phptr.com/perens
so check there for more information.
To be fully clear, this webpage is not the actual book.
Each active swap area, be it a file or partition, has a struct
swap_info_struct describing the area. All the structures
in the running system are stored in a statically declared array called
swap_info which holds MAX_SWAPFILES, which is
statically defined as 32, entries. This means that at most 32 swap areas can
exist on a running system. The swap_info_struct is declared
as follows in linux/swap.h:

    struct swap_info_struct {
            unsigned int flags;
            kdev_t swap_device;
            spinlock_t sdev_lock;
            struct dentry * swap_file;
            struct vfsmount *swap_vfsmnt;
            unsigned short * swap_map;
            unsigned int lowest_bit;
            unsigned int highest_bit;
            unsigned int cluster_next;
            unsigned int cluster_nr;
            int prio;
            int pages;
            unsigned long max;
            int next;
    };

Here is a small description of each of the fields in this quite sizable struct.
- flags This is a bit field with two possible
values. SWP_USED is set if the swap area is currently
active. SWP_WRITEOK is defined as 3, the two least
significant bits, which includes the SWP_USED bit. The
flags field is set to SWP_WRITEOK when Linux is ready to write to
the area, as it must be active to be written to;
- swap_device The device corresponding to the partition used for
this swap area is stored here. If the swap area is a file, this is NULL;
- sdev_lock As with many structures in Linux, this one has to be
protected too. sdev_lock is a spinlock protecting the
struct, principally the swap_map. It is locked and unlocked
with swap_device_lock() and swap_device_unlock();
- swap_file This is the dentry for the actual special file
that is mounted as a swap area. This could be the dentry
for a file in the /dev/ directory for example in the case
a partition is mounted. This field is needed to identify the correct
swap_info_struct when deactivating a swap area;
- swap_vfsmnt This is the vfsmount object corresponding to
where the device or file for this swap area is stored;
- swap_map This is a large array with one entry for every swap entry,
or page sized slot in the area. An entry is a reference count of the number
of users of this page slot. If it is equal to SWAP_MAP_MAX,
the slot is allocated permanently. If equal to SWAP_MAP_BAD,
the slot will never be used;
- lowest_bit This is the lowest possible free slot available in
the swap area and is used to start from when linearly scanning to reduce
the search space. It is known that there are definitely no free slots
below this mark;
- highest_bit This is the highest possible free slot available in this
swap area. Similar to lowest_bit, there are definitely no
free slots above this mark;
- cluster_next This is the offset of the next cluster of blocks to
use. The swap area tries to have pages allocated in cluster blocks to
increase the chance related pages will be stored together;
- cluster_nr This is the number of pages left to allocate in this
cluster;
- prio Each swap area has a priority which is stored in this
field. Areas are arranged in order of priority and determine how likely
the area is to be used. By default the priorities are arranged in order
of activation but the system administrator may also specify it using
the -p flag when using swapon;
- pages As some slots on the swap file may be unusable, this field
stores the number of usable pages in the swap area. This differs from
max in that slots marked SWAP_MAP_BAD are
not counted;
- max This is the total number of slots in this swap area;
- next This is the index in the swap_info array of the
next swap area in the system.
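
As a rough illustration of how flags, swap_map, lowest_bit and highest_bit work together, the following is a minimal userspace sketch and not the kernel's own allocation code. The struct swap_area_sketch type and both function names are hypothetical and keep only the fields needed here. The value 1 for SWP_USED is the bit that SWP_WRITEOK is described as including.

```c
/* Flag values as described above: SWP_WRITEOK (3) includes the SWP_USED bit. */
#define SWP_USED    1
#define SWP_WRITEOK 3

/* Hypothetical, cut-down stand-in for struct swap_info_struct. */
struct swap_area_sketch {
    unsigned int flags;
    unsigned short *swap_map;   /* one reference count per page slot */
    unsigned int lowest_bit;    /* no free slots below this index */
    unsigned int highest_bit;   /* no free slots above this index */
};

/* Returns 1 only if both bits of SWP_WRITEOK are set in flags. */
static int area_is_writable(const struct swap_area_sketch *si)
{
    return (si->flags & SWP_WRITEOK) == SWP_WRITEOK;
}

/*
 * Linearly scan [lowest_bit, highest_bit] for a slot whose reference
 * count is zero. Returns the slot index, or 0 if the area is not
 * writable or no slot is free. Slot 0 holds the swap header, so 0 is
 * never a valid result.
 */
static unsigned int find_free_slot(const struct swap_area_sketch *si)
{
    unsigned int offset;

    if (!area_is_writable(si))
        return 0;
    for (offset = si->lowest_bit; offset <= si->highest_bit; offset++) {
        if (si->swap_map[offset] == 0)
            return offset;
    }
    return 0;
}
```

Bounding the scan with lowest_bit and highest_bit is what keeps the search space small. The sketch ignores clustering and locking, which the real code must handle under sdev_lock.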
The areas, though stored in an array, are also kept in a pseudo list called
swap_list which is a very simple type declared as follows in
linux/swap.h:

    struct swap_list_t {
            int head;       /* head of priority-ordered swapfile list */
            int next;       /* swapfile to be used next */
    };

The head field is the index of the highest priority swap area in use
and the next field is the index of the next swap area that should be used. This is
so areas may be arranged in order of priority when searching for a suitable
area but still looked up quickly in the array when necessary.
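
The pseudo list can be pictured as a chain of array indices. The sketch below walks that chain in userspace. It assumes that an index of -1 terminates the list, which the text above does not state. The struct swap_area_link type and walk_swap_list() are hypothetical names introduced only for illustration.

```c
#define MAX_SWAPFILES 32

struct swap_list_t {
    int head;   /* head of priority-ordered swapfile list */
    int next;   /* swapfile to be used next */
};

/* Hypothetical stand-in keeping only the fields the walk needs. */
struct swap_area_link {
    int prio;   /* priority of this area */
    int next;   /* index of the next area in swap_info; -1 ends the list (assumed) */
};

/*
 * Follow the next indices starting at list->head and record each
 * visited array index in out[], which must hold MAX_SWAPFILES entries.
 * Returns the number of areas visited. The count is capped at
 * MAX_SWAPFILES so a corrupted chain cannot loop forever.
 */
static int walk_swap_list(const struct swap_list_t *list,
                          const struct swap_area_link *swap_info,
                          int *out)
{
    int count = 0;
    int type;

    for (type = list->head; type >= 0 && count < MAX_SWAPFILES;
         type = swap_info[type].next)
        out[count++] = type;
    return count;
}
```

Because the links are plain indices into swap_info, an area can be reached either by following the priority order or by direct array lookup.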
Each swap area is divided up into a number of page sized slots on disk which
means that each slot is 4096 bytes on the x86 for example. The first slot is
always reserved as it contains information about the swap area that should
not be overwritten. The first 1 KiB of the swap area is used to store a
disk label for the partition that can be picked up by userspace tools. The
remaining space is used for information about the swap area which is filled
when the swap area is created with the system program mkswap. The
information is used to fill in a union swap_header which
is declared as follows in linux/swap.h:

    union swap_header {
            struct
            {
                    char reserved[PAGE_SIZE - 10];
                    char magic[10];
            } magic;
            struct
            {
                    char         bootbits[1024];
                    unsigned int version;
                    unsigned int last_page;
                    unsigned int nr_badpages;
                    unsigned int padding[125];
                    unsigned int badpages[1];
            } info;
    };

A description of each of the fields follows:
- magic The magic part of the union is used just for
identifying the "magic" string. The string exists to make sure there is no
chance a partition that is not a swap area will be used and to decide what
version of swap area it is. If the string is "SWAP-SPACE", it is version 1 of
the swap file format. If it is "SWAPSPACE2", it is version 2. The large
reserved array is just so that the magic string will be read from the end of
the page;
- bootbits This is the reserved area containing information about the
partition such as the disk label;
- version This is the version of the swap area layout;
- last_page This is the last usable page in the area;
- nr_badpages The known number of bad pages that exist in the swap area
are stored in this field;
- padding A disk sector is usually 512 bytes in size. The three
fields version, last_page and nr_badpages
make up 12 bytes and the padding fills up the remaining 500
bytes to cover one sector;
- badpages The remainder of the page is used to store the indices of up
to MAX_SWAP_BADPAGES number of bad page slots. These slots
are filled in by the mkswap system program if the -c
switch is specified to check the area.
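
The role of the magic string can be shown with a small userspace sketch that inspects the last 10 bytes of a buffer holding the first page of a swap area. The function name swap_magic_version() is hypothetical, and the page size is passed in explicitly because it differs between architectures.

```c
#include <stddef.h>
#include <string.h>

#define SWAP_MAGIC_LEN 10

/*
 * Inspect the final 10 bytes of the first page of a swap area.
 * Returns 1 for "SWAP-SPACE", 2 for "SWAPSPACE2" and 0 if the page
 * does not end in a recognised magic string.
 */
static int swap_magic_version(const unsigned char *page, size_t page_size)
{
    const unsigned char *magic;

    if (page_size < SWAP_MAGIC_LEN)
        return 0;
    /* The magic string sits at the very end of the page. */
    magic = page + page_size - SWAP_MAGIC_LEN;
    if (memcmp(magic, "SWAP-SPACE", SWAP_MAGIC_LEN) == 0)
        return 1;
    if (memcmp(magic, "SWAPSPACE2", SWAP_MAGIC_LEN) == 0)
        return 2;
    return 0;
}
```

memcmp() is used rather than strcmp() because the on-disk magic string is exactly 10 bytes with no terminating NUL.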
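MAX_SWAP_BADPAGES is a compile time constant which varies if the
struct changes but it is 637 entries in its current form as given by the simple
equation:

    MAX_SWAP_BADPAGES = (PAGE_SIZE - 1024 - 512 - 10) / sizeof(long)

Where 1024 is the size of the bootblock, 512 is the size of the sector holding
version, last_page, nr_badpages and the padding, and
10 is the size of the magic string identifying the format of the swap file.
With a PAGE_SIZE of 4096 and a 4-byte long, as on the x86, this gives
2550 / 4, rounded down to 637.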
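
The figure of 637 can be checked in two ways with the userspace sketch below. It assumes a 4096-byte page and 4-byte entries, as on the x86. The union swap_header_sketch type is a hypothetical replica of the layout using fixed-width 32-bit fields, and both function names are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed x86 page size */

/* Userspace replica of union swap_header with fixed-width 32-bit fields. */
union swap_header_sketch {
    struct {
        char reserved[SKETCH_PAGE_SIZE - 10];
        char magic[10];
    } magic;
    struct {
        char     bootbits[1024];
        uint32_t version;
        uint32_t last_page;
        uint32_t nr_badpages;
        uint32_t padding[125];
        uint32_t badpages[1];
    } info;
};

/*
 * Direct arithmetic: page size minus the bootblock, the sector of
 * fields plus padding, and the magic string, divided by the entry size.
 * page_size must be at least 1546 bytes or the subtraction wraps.
 */
static size_t max_badpages_by_arithmetic(size_t page_size, size_t entry_size)
{
    return (page_size - 1024 - 512 - 10) / entry_size;
}

/* Same figure from the layout: bytes between badpages[] and the magic string. */
static size_t max_badpages_by_layout(void)
{
    size_t start = offsetof(union swap_header_sketch, info.badpages);
    size_t end = offsetof(union swap_header_sketch, magic.magic);

    return (end - start) / sizeof(uint32_t);
}
```

Both approaches divide 2550 bytes by 4 and round down, which is why the constant changes whenever the struct layout or page size changes.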
Mel
2004-02-15