Derby On Disk Page Format
Derby stores table and index data in Containers, which currently map to files in the seg0 directory of the database. In the current Derby implementation there is a 1 to 1 mapping of containers to files. Two containers never map to a single file and 1 container never maps to multiple files.
Data is stored in pages within the container.
A page contains a set of records, which can be accessed by "slot", which defines the order of the records on the page, or by "id" which defines the identity of the records on the page. Clients access records by both slot and id, depending on their needs.
A Table or a BTree index provides a row-based access mechanism (row-based access interface is known as conglomerate). Rows are mapped to records in data pages; in case of a table, a single row can span multiple records in multiple pages.
A container can have three types of pages:
- Header Page - which is just a specialized version of the Alloc Page.
- Data Pages which hold data, and
- Alloc Pages which hold page allocation information. An Alloc page is a specialized verion of the Data page.
The container can be visualised as:
Header Page is currently always page 0 of the container. It contains information that raw store needs to maintain about the container once per container, and is currently implemented as an Alloc Page which "borrows" space from the alloc page for it's information. The original decision was that the designers did not want to waste a whole page for header information, so a part of the page was used and the first allocation map was put on the second half of it. See AllocPage.java for info about layout and borrowing.
Allocation Page - After page 0, all subsequent Allocation pages only have allocation bit maps.
Data Page Format
A data page is broken into five sections.
The formatId is a 4 bytes array, it contains the format Id of this page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.
The page header is a fixed size, 56 bytes.
|1 byte||boolean||is page an overflow page|
page status is either VALID_PAGE or INVALID_PAGE(a field maintained in base page)
page goes thru the following transition:
deallocated and free page are both INVALID_PAGE as far as BasePage
|8 bytes||long||pageVersion (a field maintained in base page)|
|2 bytes||unsigned short||number of slots in slot offset table|
|4 bytes||integer||next record identifier|
|4 bytes||integer||generation number of this page (Future Use)|
|4 bytes||integer||previous generation of this page (Future Use)|
|8 bytes||bipLocation||the location of the beforeimage page (Future Use)|
|2 bytes||unsigned short||number of deleted rows on page. (new release 2.0)|
|2 bytes||short||spare for future use|
|4 bytes||integer||spare for future use (encryption uses to write random bytes here).|
|8 bytes||long||spare for future use|
|8 bytes||long||spare for future use|
The records section contains zero or more records. Each record starts with a Record Header
Status bits for the record header
|compressed int||record identifier|
|compressed long||overflow page only if RECORD_OVERFLOW is set|
|compressed int||overflow id only if RECORD_OVERFLOW is set|
|compressed int||first field only if RECORD_HAS_FIRST_FIELD is set - otherwise 0|
|compressed int||number of fields in this portion - only if RECORD_OVERFLOW is false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0|
The Record Header is followed by one or more fields. Each field contains a Field Header and optional Field Data.
The status is 1 byte, it indicates the state of the field. A FieldHeader can be in the following states:
|fieldDataLength||The fieldDataLength is only set if the field is not NULL. It is the length of the field that is stored on the current page. The fieldDataLength is a variable length CompressedInt.|
Overflow page and overflow id are stored as field data. If
the overflow bit in status is set, the field data is the overflow
information. When the overflow bit is not set in status, then,
fieldData is the actually user data for the field. That means,
field header consists only field status, and field data length.
Slot Offset Table
The slot offset table is a table of 6 or 12 bytes per record, depending on the pageSize being less or greater than 64K:
|2 bytes (unsigned short) or 4 bytes (int)||page offset for the record that is assigned to the slot|
|2 bytes (unsigned short) or 4 bytes (int)||the length of the record on this page.|
|2 bytes (unsigned short) or 4 bytes (int)||the length of the reserved number of bytes for this record on this page.|
First slot is slot 0. The slot table grows backwards. Slots are never left empty.
8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents without the 8 bytes representing the checksum.
An allocation page of the file container extends a normal Stored page, with the exception that a hunk of space may be 'borrowed' by the file container to store the file header.
The borrowed space is not visible to the alloc page even though it is present in the page data array. It is accessed directly by the FileContainer. Any change made to the borrowed space is not managed or seen by the allocation page.
The reason for having this borrowed space is so that the container header does not need to have a page of its own.
An allocation page extends a stored page, the on disk format is different from a stored page in that N bytes are 'borrowed' by the container and the page header of an allocation page will be slightly bigger than a normal stored page. This N bytes are stored between the page header and the record space.
The reason why this N bytes can't simply be a row is because it needs
to be statically accessible by the container object to avoid a chicken
and egg problem of the container object needing to instantiate an alloc
page object before it can be objectified, and an alloc page object needing
to instantiate a container object before it can be objectified. So this
N bytes must be stored outside of the normal record interface yet it must
be settable because only the first alloc page has this borrowed space.
Other (non-first) alloc page have N == 0.
N is a byte that indicates the size of the borrowed space. Once an alloc page is initialized, the value of N cannot change.
The maximum space that can be borrowed by the container is 256 bytes.
The allocation pages are of the same page size as any other pages in the container. The first allocation page of the FileContainer starts at the first physical byte of the container. Subsequent allocation pages are chained via the nextAllocPageOffset. Each allocation page is expected to manage at least 1000 user pages (for 1K page size) so this chaining may not be a severe performance hit. The logical -> physical mapping of an allocation page is stored in the previous allocation page. The container object will need to maintain this mapping.
The following fields are stored in the page header:
|int||FormatId (Although 4 bytes are allocated, this uses only the first 2 bytes. Next 2 bytes are unused.)|
|StoredPageHeader||see Stored Page Header|
|long||nextAllocPageNumber - the next allocation page's number|
|long||nextAllocPageOffset - the file offset of the next allocation page|
|long||reserved1 - reserved for future usage|
|long||reserved2 - reserved for future usage|
|long||reserved3 - reserved for future usage|
|long||reserved4 - reserved for future usage|
|byte||N - the size of the borrowed container info|
|byte[N]||containerInfo - the content of the borrowed container info|
|AllocExtent||The one and only extent on this alloc page.|
The allocation page contains allocation extent rows. In this first cut implementation, there is only 1 allocation extent row per allocation page.
The allocation extent row is an externalizable object and is directly written on to the page by the alloc page. In other words, it will not be converted in to a storeableRow. This is to cut down overhead, enhance performance and gives more control of the size and layout of the allocation extent row to the alloc page.
Alloc Page detailed implementation notes
Create Container - an embryonic allocation page is formatted on disk by a special static function to avoid instantiating a full AllocPage object. This embryonic allocation has enough information that it can find the file header and not much else. Then the allocation page is properly initialized by creating the first extent.
Open Container - A static AllocPage method will be used to read off the container information directly from disk. Even if the first alloc page (page 0) is already in the page cache, it will not be used because cleaning the alloc page will introduce a deadlock if the container is not in the container cache. Long term, the first alloc page should probably live in the container cache rather than in the page cache.
Get Page - The first alloc page (page 0) will be read into the page cache. Continue to follow the alloc page chain until the alloc page that manages the specified page is found. From the alloc page, the physical offset of the specified page is located.
Cleaning alloc page - the alloc page is written out the same way any page is written out. The container object will provide a call back to the alloc page to write the current version of the container object back into the borrowed space before the alloc page itself is written out.
Cleaning the container object - get the the first alloc page, dirty it and clean it (which will cause it to call the container object to write itself out into the borrowed space). The versioning of the container is independent of the versioning of the alloc page. The container version is stored inside the borrowed space and is opaque to the alloc page.
For the fields in an allocation extent row.
An allocation extent row manages the page status of page in the extent. AllocExtent is externalizable and is written to the AllocPage directly, without being converted to a row first.
|long||extentOffset - the begin physical byte offset of the first page of this extent|
|long||extentStart - the first logical page mananged by this extent.|
|long||extentEnd - the last page this extent can ever hope to manage.|
|int||extentLength - the number of pages allocated in this extent|
extentStatus - status bits for the whole extent.
|int||preAllocLength - the number of pages that have been preallocated|
|long||reserved2 - reserved for future use|
|long||reserved3 - reserved for future use|
|FreePages(bit)||Bitmap of free pages. Bit[i] is ON if page is free for immediate (re)use.|
|unFilledPages(bit)||Bitmap of pages that have free space. Bit[i] is ON if page is likely to be < 1/2 full.|
org.apache.derby.iapi.services.io.FormatableBitSet is used to store the bit map. FormatableBitSet is an externalizable class.
A page can have the following logical state:
Free - a page that is free to be used
Valid - a page that is currently in use
There is another type of transitional pages which pages that have been allocated on disk but has not yet been used. These pages are Free.
Bit[K] freePages Bit[i] is ON iff page i maybe free for reuse. User must get the dealloc page lock on the free page to make sure the transaction.
K is the size of the bit array, it must be >= length.