Derby Write Ahead Log Format
Derby uses a Write Ahead Log to record all changes to the database. The Write Ahead Log (WAL) protocol requires the following rules to be followed:
- A page must be latched exclusively before it can be updated.
- While the latch is held, the update must be logged, and the page must be tagged with the identity of the log record (often known as the Log Sequence Number or LSN).
- When the page is about to be written to persistent storage, all log records up to and including the page's LSN must be forced to disk.
- Once the log records have been forced to disk, the cached page may be written to persistent storage, overwriting the previous version of the page.
The WAL protocol ensures that in the event of a system crash, database pages can be restored to a consistent state using the information contained in the log records. How this is done will be the subject of another paper.
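The four rules can be sketched as a toy model. This is a minimal illustration, not Derby's code: the `Log`, `Page`, `update`, and `writeToDisk` names are hypothetical, the log is an in-memory list whose index stands in for the LSN, and the "disk write" is just a returned string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the WAL rules; every name here is hypothetical, not Derby's API. */
final class WalSketch {
    /** In-memory log: the LSN of a record is simply its index in the list. */
    static final class Log {
        private final List<String> records = new ArrayList<>();
        private long flushedUpTo = -1;            // highest LSN treated as "on disk"

        synchronized long append(String record) {
            records.add(record);
            return records.size() - 1;            // LSN of the new record
        }

        synchronized void flush(long lsn) {       // force records up to lsn
            if (lsn > flushedUpTo) flushedUpTo = lsn;
        }

        synchronized long flushedUpTo() { return flushedUpTo; }
    }

    static final class Page {
        final ReentrantLock latch = new ReentrantLock();  // exclusive latch
        String contents = "";
        long pageLsn = -1;                        // LSN of the last logged update
    }

    /** Rules 1 and 2: latch the page, log the change, tag the page with the LSN. */
    static void update(Page page, Log log, String newContents) {
        page.latch.lock();
        try {
            long lsn = log.append("set page to " + newContents);
            page.contents = newContents;
            page.pageLsn = lsn;
        } finally {
            page.latch.unlock();
        }
    }

    /** Rules 3 and 4: force the log up to the page's LSN, then write the page. */
    static String writeToDisk(Page page, Log log) {
        page.latch.lock();
        try {
            log.flush(page.pageLsn);
            return page.contents;                 // stands in for the disk write
        } finally {
            page.latch.unlock();
        }
    }
}
```

The ordering inside `writeToDisk` is the point of the sketch: the log flush always precedes the page write, so the log on disk is never behind any page on disk.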
A good description of Write Ahead Logging, and how a log is typically implemented, can be found in Transaction Processing: Concepts and Techniques, by Jim Gray and Andreas Reuter, 1993, Morgan Kaufmann Publishers.
Derby implementation of the Write Ahead Log
Derby implements the Write Ahead Log using a non-circular file system file. Here are some comments about the current implementation of recovery:
Derby supports simple media recovery. It has support for full backup/restore and a very basic form of rollforward recovery (replay of logs using backup and archived log files).
1. Derby fully supports crash recovery. It uses Java to sync the log file correctly to support this.
2. I would say Derby supports media recovery. One can make a backup of the data and store it offline. Logs can be stored on a separate disk from the data, and if you lose your data disk then you can use rollforward recovery on the existing logs and the copy of the backup to bring your database up to the current point in time.
3. Derby does not support "point in time recovery". Someone may want to look at this in the future. Technically I don't think it would be very hard, as the logging system has the stuff to solve the hard problems. It does not have an idea about "time" - it just knows log sequence numbers, so someone would need to figure out what kind of interface a user really wants. A very user-unfriendly interface, recovery to a specific log sequence number, would not be very hard to implement. Anyone interested in this feature should add it to JIRA - I'll be happy to add technical comments on what needs to be done.
4. A reasonable next step in Derby's recovery progress would be to add a way to automatically move/copy log files offline once they are no longer needed by crash recovery and are only needed for media recovery. Some sort of Java stored procedure callout would seem most appropriate.
The 'log' is a stream of log records. The 'log' is implemented as a series of numbered log files. These numbered log files are logically continuous so a transaction can have log records that span multiple log files. A single log record cannot span more than one log file. The log file number is monotonically increasing.
The log belongs to a log factory of a RawStore. In the current implementation, each RawStore only has one log factory, so each RawStore only has one log (which is composed of multiple log files). At any given time, a log factory only writes new log records to one log file; this log file is called the 'current log file'.
A log file is named log<logNumber>.dat, where logNumber is the log file number.
With the default values, a new log file is created (this is known as log switch) when a log file grows beyond 1MB and a checkpoint happens when the amount of log written is 10MB or more from the last checkpoint.
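The default thresholds can be illustrated with a small counter. This is a sketch under assumptions: the class, field, and method names are hypothetical, 1MB is taken as 1024 * 1024 bytes, and the file name simply joins `log`, the log number, and `.dat`. A record is always written whole into the current file before any switch, in line with the rule that a log record never spans two log files.

```java
/** Sketch of the default thresholds; class and method names are hypothetical. */
final class LogSwitchSketch {
    static final long LOG_SWITCH_BYTES = 1024L * 1024L;         // 1MB (assumed binary MB)
    static final long CHECKPOINT_BYTES = 10L * 1024L * 1024L;   // 10MB

    long currentLogNumber = 1;
    long bytesInCurrentFile = 0;
    long bytesSinceCheckpoint = 0;
    int checkpointsTaken = 0;

    /** File name for a log number: "log", the number, then ".dat". */
    static String fileName(long logNumber) {
        return "log" + logNumber + ".dat";
    }

    /** Account for one log record of the given size in bytes. */
    void recordWritten(int size) {
        bytesInCurrentFile += size;
        bytesSinceCheckpoint += size;
        if (bytesInCurrentFile > LOG_SWITCH_BYTES) {
            currentLogNumber++;              // log switch: start a new file
            bytesInCurrentFile = 0;
        }
        if (bytesSinceCheckpoint >= CHECKPOINT_BYTES) {
            checkpointsTaken++;              // a checkpoint would be taken here
            bytesSinceCheckpoint = 0;
        }
    }
}
```

With these defaults, roughly ten log switches occur between two automatic checkpoints.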
RawStore exposes a checkpoint method which clients can call, or a checkpoint is taken automatically by the RawStore when:
- The log file grows beyond a certain size (configurable, default 1MB)
- RawStore is shut down and a checkpoint hasn't been done "for a while"
- RawStore is recovered and a checkpoint hasn't been done "for a while"
Log records are identified using LogCounter, which is an implementation of LogInstant, a Derby term for LSN. The LogCounter is made up of the log file number, and the byte offset of the log record within the log file. Within the stored log record a log counter is represented as a long. Outside the LogFactory the instant is passed around as a LogCounter (through its LogInstant interface).
The long is encoded so that the comparison operators <, == and > correctly tell whether one log instant is less than, equal to, or greater than another.
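One way to get this ordering property is to put the log file number in the high-order bits and the byte offset in the low-order bits. The sketch below assumes a 32/32 bit split and uses hypothetical names; the actual bit layout in Derby's LogCounter may differ.

```java
/** Sketch of packing (file number, offset) into one long; the 32/32 split is an assumption. */
final class LogInstantSketch {
    static final int OFFSET_BITS = 32;
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    /** Pack a file number and a byte offset so that plain long comparison orders instants. */
    static long makeLogInstant(long fileNumber, long offset) {
        // Keep the file number below 2^31 so the packed value stays non-negative.
        if (fileNumber < 0 || fileNumber > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("file number out of range");
        }
        if (offset < 0 || offset > OFFSET_MASK) {
            throw new IllegalArgumentException("offset out of range");
        }
        return (fileNumber << OFFSET_BITS) | offset;
    }

    static long fileNumber(long instant) {
        return instant >>> OFFSET_BITS;
    }

    static long offset(long instant) {
        return instant & OFFSET_MASK;
    }
}
```

Because the file number occupies the more significant bits, any record in log file 2 compares greater than every record in log file 1, and records within the same file are ordered by offset.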
Format of Write Ahead Log
An implementation of a file-based log is in org.apache.derby.impl.store.raw.log.LogToFile. This LogFactory is responsible for the formats of two kinds of file: the log file and the log control file. It is also responsible for the format of the log record wrapper.
Format of Log Control File
The log control file contains information about which log files are present and where the last checkpoint log record is located.
|Type|Description|
|---|---|
|int|format id set to FILE_STREAM_LOG_FILE|
|int|obsolete log file version|
|long|the log instant (LogCounter) of the last completed checkpoint|
|int|Derby major version|
|int|Derby minor version|
|int|subversion revision/build number|
|byte|Flags (beta flag (0 or 1), test durability flag (0 or 1))|
|byte|spare (value set to 0)|
|byte|spare (value set to 0)|
|byte|spare (value set to 0)|
|long|spare (value set to 0)|
|long|checksum for control data written|
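The field order above can be exercised with a small encoder. This is a sketch under stated assumptions: the format id value is a placeholder (the real FILE_STREAM_LOG_FILE constant is not given here), the checksum is computed with CRC32 over the preceding 40 bytes, and the byte order is big-endian. None of these choices are confirmed by the text, and all class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Sketch of the log control file layout; values marked "assumed" are not Derby's. */
final class ControlFileSketch {
    static final int FORMAT_ID = 0x1234;      // placeholder, not the real format id
    static final int BODY_LENGTH = 4 + 4 + 8 + 4 + 4 + 4 + 1 + 1 + 1 + 1 + 8;  // 40 bytes

    static byte[] encode(long checkpointInstant, int major, int minor,
                         int build, byte flags) {
        ByteBuffer buf = ByteBuffer.allocate(BODY_LENGTH + Long.BYTES);
        buf.putInt(FORMAT_ID);                // format id
        buf.putInt(0);                        // obsolete log file version
        buf.putLong(checkpointInstant);       // last completed checkpoint
        buf.putInt(major);                    // major version
        buf.putInt(minor);                    // minor version
        buf.putInt(build);                    // revision/build number
        buf.put(flags);                       // flags
        buf.put((byte) 0);                    // spare
        buf.put((byte) 0);                    // spare
        buf.put((byte) 0);                    // spare
        buf.putLong(0L);                      // spare
        buf.putLong(checksum(buf.array()));   // checksum over the 40 bytes above
        return buf.array();
    }

    /** CRC32 of the control data, excluding the checksum field (assumed algorithm). */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, BODY_LENGTH);
        return crc.getValue();
    }

    /** Validate the format id and checksum, then return the checkpoint instant. */
    static long decodeCheckpointInstant(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        if (buf.getInt(0) != FORMAT_ID) {
            throw new IllegalArgumentException("unexpected format id");
        }
        if (buf.getLong(BODY_LENGTH) != checksum(data)) {
            throw new IllegalArgumentException("checksum mismatch");
        }
        return buf.getLong(8);                // offset 8: after the two leading ints
    }
}
```

The encoded control data is 48 bytes in this layout, and the checkpoint instant is what recovery would read first to find where to start scanning the log.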
Format of the log file
The log file contains log records which record all the changes to the database. The complete transaction log is composed of a series of log files.
|Type|Description|
|---|---|
|int|Format id of this log file, set to FILE_STREAM_LOG_FILE.|
|int|Obsolete log file version - not used|
|long|Log file number - this number orders the log files in a series to form the complete transaction log|
|long|PrevLogRecord - log instant of the previous log record, in the previous log file.|
|[log record wrapper]*|one or more log records with wrapper|
|int|EndMarker - value of zero. A log record wrapper begins with the length of the log record, which is never zero, so a zero marks the end of the log records.|
|[int fuzzy end]*|zero or more ints of value 0, present if this log file has been recovered and an incomplete log record was set to zero.|
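As a small illustration of the fixed-size header that precedes the log records, the sketch below encodes and reads the first four fields. The format id is a placeholder value, big-endian byte order is assumed, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

/** Sketch of the log file header: int, int, long, long (24 bytes in total). */
final class LogFileHeaderSketch {
    static final int FORMAT_ID = 0x1234;             // placeholder, not the real format id
    static final int HEADER_LENGTH = 4 + 4 + 8 + 8;  // 24 bytes

    static byte[] encode(long logFileNumber, long prevLogRecord) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH);
        buf.putInt(FORMAT_ID);        // format id of this log file
        buf.putInt(0);                // obsolete log file version, not used
        buf.putLong(logFileNumber);   // orders the log files in the series
        buf.putLong(prevLogRecord);   // instant of the last record in the previous file
        return buf.array();
    }

    static long logFileNumber(byte[] header) {
        return ByteBuffer.wrap(header).getLong(8);
    }

    static long prevLogRecord(byte[] header) {
        return ByteBuffer.wrap(header).getLong(16);
    }
}
```

Under this layout the first log record wrapper starts at byte offset 24 of the file.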
Format of the log record wrapper
The log record wrapper provides information for the log scan.
|Type|Description|
|---|---|
|int|length - length of the log record (for forward scan)|
|long|instant - LogInstant of the log record|
|byte[length]|logRecord - byte array that is written by the FileLogger|
|int|length - length of the log record (for backward scan)|
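Writing the length on both sides of the record is what lets a scan move in either direction. The sketch below is an illustration only: it works on a byte array that holds just the wrapped records followed by the zero end marker (no file header), uses the byte offset within that array as a stand-in for the instant, and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch of the log record wrapper and of forward/backward scans over it. */
final class LogWrapperSketch {
    static final int OVERHEAD = 4 + 8 + 4;   // leading length + instant + trailing length

    /** Wrap each record, then append the int 0 end marker. */
    static byte[] encode(List<String> records) {
        List<byte[]> payloads = new ArrayList<>();
        int size = 4;                                     // room for the end marker
        for (String r : records) {
            byte[] p = r.getBytes(StandardCharsets.UTF_8);
            if (p.length == 0) throw new IllegalArgumentException("empty record");
            payloads.add(p);
            size += OVERHEAD + p.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] p : payloads) {
            long instant = buf.position();                // stand-in for a LogInstant
            buf.putInt(p.length);
            buf.putLong(instant);
            buf.put(p);
            buf.putInt(p.length);
        }
        buf.putInt(0);                                    // end marker
        return buf.array();
    }

    /** Forward scan: read the leading length, stop at the zero end marker. */
    static List<String> scanForward(byte[] log) {
        List<String> out = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(log);
        while (true) {
            int length = buf.getInt();
            if (length == 0) break;                       // end marker reached
            buf.getLong();                                // instant, unused here
            byte[] record = new byte[length];
            buf.get(record);
            buf.getInt();                                 // trailing length
            out.add(new String(record, StandardCharsets.UTF_8));
        }
        return out;
    }

    /** Backward scan: use the trailing length to find the start of each wrapper. */
    static List<String> scanBackward(byte[] log) {
        List<String> out = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(log);
        int pos = log.length - 4;                         // position of the end marker
        while (pos > 0) {
            int length = buf.getInt(pos - 4);             // trailing length of previous record
            int start = pos - OVERHEAD - length;          // start of that record's wrapper
            byte[] record = new byte[length];
            buf.position(start + 4 + 8);                  // skip leading length and instant
            buf.get(record);
            out.add(new String(record, StandardCharsets.UTF_8));
            pos = start;
        }
        return out;
    }
}
```

The forward scan needs only the leading length and the backward scan only the trailing one, which is why the wrapper stores the same value twice.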
The format of a log record
The log record describes every change to the persistent store.
|Type|Description|
|---|---|
|int|format_id, set to LOG_RECORD. The formatId is written by FormatIdOutputStream when this object is written out by writeObject|
| |loggable group - the loggable's group value.|
|TransactionId|xactId - The Transaction this log belongs to.|
|Loggable|op - the log operation|
Each loggable belongs to one or more groups of similar functionality.
Grouping is a way to quickly sort out log records that are interesting to different modules or different implementations.
When a module creates a loggable and sends it to the log file, it must mark this loggable with one or more of the following groups. If none fit, or if the loggable encompasses functionality that is not described in existing groups, then a new group should be introduced.
Grouping has no effect on how the record is logged or how it is treated in rollback or recovery.
The following groups are defined. This list serves as the registry of all loggable groups.
Pointers to relevant classes
|Package|File|Description|
|---|---|---|
|org.apache.derby.iapi.store.raw.log|LogFactory.java|The Java interface for the logging system module.|
|org.apache.derby.impl.store.raw.log|LogToFile.java|The implementation of LogFactory.java, also implementing Module. This is the one with the recovery code.|
|org.apache.derby.impl.store.raw.log|CheckpointOperation.java|A Log Operation that represents a checkpoint.|
|org.apache.derby.impl.store.raw.log|FileLogger.java|Deals with putting log records to disk. Writes log records to a log file as a stream (i.e. log records are added to the end of the file, with no concept of pages).|
|org.apache.derby.impl.store.raw.log|FlushedScan.java|Deals with scanning the log file. Scans the log, which is implemented by a series of log files. This log scan knows how to move across log files if it is positioned at the boundary of a log file and needs to getNextRecord.|
|org.apache.derby.impl.store.raw.log|FlushedScanHandle.java|More stuff dealing with scanning the log file.|
|org.apache.derby.impl.store.raw.log|Scan.java|More scan log file stuff. Scans the log, which is implemented by a series of log files. This log scan knows how to move across log files if it is positioned at the boundary of a log file and needs to getNextRecord.|
|org.apache.derby.impl.store.raw.log|StreamLogScan.java|More scan log file stuff. LogScan provides methods to read a log record and get its LogInstant in an already defined scan.|
|org.apache.derby.impl.store.raw.log|LogAccessFile.java|Lowest level of putting log records to disk. Wraps a RandomAccessFile to provide buffering on log writes.|
|org.apache.derby.impl.store.raw.log|LogAccessFileBuffer.java|Utility for LogAccessFile. A single buffer of data.|
|org.apache.derby.impl.store.raw.log|LogCounter.java|Log sequence number (LSN) implementation|
|org.apache.derby.impl.store.raw.log|LogRecord.java|The log record written out to disk.|
|org.apache.derby.impl.store.raw.log|ReadOnly.java|An alternate read-only implementation of LogFactory|