.\" The current font is \n(.f
.\" The current point size is \n(.s
.\" The current vertical line spacing is \n(.v
.\" The line length is \n(.l
.\" The page length is \n(.p
.\" The page offset is \n(.o
.\" need this RT call else, the subsequent pages are losing the indent
.RT
.ft B
.ce
M[a]fs - Plan 9 userspace file systems
.ft R
.sp
Mfs and Mafs want you to be able to understand them, so that you can be self-sufficient and fix a crash at two in the morning, or satisfy your need for speed or for a feature.
This empowerment is priceless for those with skin in the game.
.sp
Mfs and Mafs are user space file systems that provide system stability and security.
They are based on kfs.
.sp
As this document also aims to provide working knowledge, it liberally uses the actual commands and the relevant C data structure definitions to convey information.
.sp
Mfs writes synchronously to the disk. It ensures that every write it has acknowledged has been passed along to the disk driver.
Mafs writes asynchronously to the disk. It stores the writes in memory for a writer process to write to the disk at leisure.
.sp
This document uses the word M[a]fs to refer to both mfs and mafs.
.sp .ft B Mfs Workflow .ft R .sp .PS right; { Client: box invis height 4*boxht wid 2*boxwid "" "" "" "Chan.aux has" "file offset, etc."; "Client" at Client.n line from Client.ne to Client.se } move 2*boxwid { Multiple: box invis { " multiple" at Multiple.nw - 0,0.1 ljust " workers" at Multiple.sw + 0,0.1 ljust line <-> from Client to Multiple.w "9p" above } move 0.5*boxwid Abstractions: box invis "Abstractions" { "Directory" "File" at Abstractions.s } move 0.75*boxwid Datastructures: box invis "Data Structures" { "Dentry" at Datastructures.s } Buffercache: box invis "Buffer cache" "used blocks" with .sw at Datastructures.ne + 0.5i,0 Extents: box invis "Extents" "free blocks" with .nw at Datastructures.se + 0.5i,0 move 0.5*boxwid Inmemory: box invis "In-memory" "block contents" with .sw at Buffercache.n + 0.5i,0 down move 0.5*boxwid Disk: box "Disk" "blocks" height 1.5*boxht with .sw at Extents.se + 0.4i,0 } line <-> from Multiple.e to Abstractions.w - 0.1i,0 line <-> from Abstractions.e + 0.1i,0 to Datastructures.w - 0.2i,0 line <-> from Datastructures.e + 0,0.1i to Buffercache.w - 0.1i,0 line <-> from Datastructures.e - 0,0.1i to Extents.w line <-> from Buffercache.se + 0.1i,0 to Disk.w line <-> from Extents.e to Disk.w line <-> from Buffercache.s to Extents.n line <-> from Buffercache.ne - 0,0.1i to Inmemory.sw + 0,0.1i .PE .sp .ne 14 .ft B Mafs Workflow .ft R .sp .PS right; { Client: box invis height 4*boxht wid 2*boxwid "" "" "" "Chan.aux has" "file offset, etc."; "Client" at Client.n line from Client.ne to Client.se } move 2*boxwid { Multiple: box invis { " multiple" at Multiple.nw - 0,0.1 ljust " workers" at Multiple.sw + 0,0.1 ljust line <-> from Client to Multiple.w "9p" above } move 0.5*boxwid Abstractions: box invis "Abstractions" { "Directory" "File" at Abstractions.s } move 0.75*boxwid Datastructures: box invis "Data Structures" { "Dentry" at Datastructures.s } Buffercache: box invis "Buffer cache" "used blocks" with .sw at Datastructures.ne + 0.5i,0 
Extents: box invis "Extents" "free blocks" with .nw at Datastructures.se + 0.5i,0 move 0.5*boxwid Writer: box invis "writer" with .nw at Buffercache.ne + 0.4i,0 Inmemory: box invis "In-memory" "block contents" with .sw at Buffercache.n + 0.5i,0 down move 0.5*boxwid Disk: box "Disk" "blocks" height 1.5*boxht with .sw at Extents.se + 0.4i,0 } line <-> from Multiple.e to Abstractions.w - 0.1i,0 line <-> from Abstractions.e + 0.1i,0 to Datastructures.w - 0.2i,0 line <-> from Datastructures.e + 0,0.1i to Buffercache.w - 0.1i,0 line <-> from Datastructures.e - 0,0.1i to Extents.w line <- from Buffercache.se + 0.1i,0 to Disk.w line -> from Buffercache.e + 0.1i,0 to Writer.w line -> from Writer.s + 0,0.1i to Disk.n line <-> from Extents.e to Disk.w line <-> from Buffercache.s to Extents.n line <-> from Buffercache.ne - 0,0.1i to Inmemory.sw + 0,0.1i line <-> from Writer.n - 0,0.1i to Inmemory.s + 0.2i,0.1i .PE .sp .sp .ft B Disk Contents .ft R .sp M[a]fs organizes and saves content on a disk as directories and files, just like any other filesystem. .sp The unit of storage is a logical block (not physical sector) of data. Disk space is split into 512 byte logical blocks. .sp .ne 14 A sample disk of 2048 bytes with 4 blocks. .PS right { down; ." {box dashed; box dashed; box dashed; box dashed;} box height 4*boxht; move 0.2i; "disk of" " 2048 bytes" } move; move { move 0.5i; down; { Block0: box dashed; Block1: box dashed; Block2: box dashed; Block3: box dashed; } box height 4*boxht; move 0.2i "disk of" " 2048 bytes" "Block " at Block0.nw rjust "0 " at Block0.w rjust "1 " at Block1.w rjust "2 " at Block2.w rjust "3 " at Block3.w rjust } .PE .sp A block is stored to the disk with a Tag. .br .nf struct Tag { u8 type; /* Tfree, Tmagic, Tdentry, Tdata, Tind\fIn\fR */ u64 path; /* Qid.path, unique identifier of directory or file */ }; .fi .sp Every file or directory is represented on the disk by a directory entry (Dentry). 
A directory entry uses a block (Tag.type = Tdentry) and is uniquely identifiable by a Qid.
.sp
A file stores its contents in blocks with a Tag.type of Tdata.
A directory holds the directory entries of its children in blocks with a Tag.type of Tdentry.
.sp
The blocks used by a file or directory are listed in its directory entry.
As the list of blocks in a directory entry is too short to represent big files, the blocks are structured to use multiple levels of indirection as the file size increases.
.sp
A file's data blocks are identified by a tag of Tdata and that file's Qid.path.
A directory's data blocks are identified by a tag of Tdentry and the Qid.path of the child directory entry.
(Is this quirky? Should the child's directory entry have a tag with the parent's Qid.path?)
.sp
A block number of zero represents the end of the file's contents.
If a file is truncated, its data and indirect blocks are given up and dentry.dblocks[0] is set to 0.
.sp
M[a]fs does not store the last access time of a file or directory.
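.sp
The zero block number convention described above can be illustrated with a short C sketch. It is only an illustration under stated assumptions: ndirectused() is a hypothetical helper that is not part of M[a]fs, and it looks at the direct blocks only, ignoring the indirect blocks.
.nf
```c
#include <assert.h>
#include <stdint.h>

enum { Ndblock = 32 };	/* number of direct blocks in a Dentry, as in this document */

/*
 * Hypothetical helper, not part of M[a]fs: count the direct block
 * numbers in use. The scan stops at the first zero block number,
 * which marks the end of the file or directory contents.
 */
int
ndirectused(const uint64_t dblocks[Ndblock])
{
	int i;

	assert(dblocks != 0);
	for(i = 0; i < Ndblock; i++)
		if(dblocks[i] == 0)
			break;
	return i;
}
```
.fi
A directory whose direct blocks are 22, 24, 0, ... has two children in this scheme, and a truncated file, with dblocks[0] = 0, has none.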
.ne 20 .sp .nf The different types of blocks on a disk are: .br .nf enum { Tfree = 0, /* free block */ Tmagic, /* the first (zero'th) block holds a magic word */ Tdentry, /* directory entry */ /* Tind\fIn\fR are last, to allow for future increases */ Tdata, /* actual file contents */ Tind0, /* contains a list of Tdata block numbers for files and Tdentry block numbers for directories.*/ Tind1, /* contains a list of Tind0 block numbers */ Tind2, /* contains a list of Tind1 block numbers */ Tind3, /* contains a list of Tind2 block numbers */ Tind4, /* contains a list of Tind3 block numbers */ Tind5, /* contains a list of Tind4 block numbers, maximum file size 26 TiB */ }; .fi .sp A directory entry is defined as: .nf enum { Rawblocksize = 512ULL, /* real block size */ Ndblock = 32, /* number of direct blocks in a Dentry */ Niblock = 6, /* maximum depth of indirect blocks */ }; struct Qid9p1 { u32 version; u64 path; /* unique identifier */ }; struct Dentry1 { Qid9p1 qid; u64 size; /* 0 for directories. For files, size in bytes of the content */ u64 pdblkno; /* parent dentry absolute block number. 0 for root. */ u64 pqpath; /* parent qid.path */ u64 mtime; /* modified time in nano seconds from epoch */ u32 mode; /* same bits as defined in lib.h Dir.mode */ s16 uid; s16 gid; s16 muid; u64 dblocks[Ndblock]; /* direct blocks. 
*/ /* List of Tdata block numbers for files and Tdentry block numbers for directories */ /* Tag.type = Tdentry for directories and Tdata for files */ u64 iblocks[Niblock]; /* indirect blocks */ }; /* * Derived constants * Ndentryperblock: number of directory entries per block * Nindperblock: number of block pointers per block */ enum { Blocksize = Rawblocksize - sizeof(Tag), Namelen = (Blocksize-sizeof(Dentry1)), /* maximum size of the name of a file or directory */ Ndentryperblock = 1, /* Blocksize / sizeof(Dentry), */ Nindperblock = Blocksize / sizeof(u64), }; struct Dentry { struct Dentry1; char name[Namelen]; }; .fi .sp A directory entry once assigned is not given up until the parent directory is removed. It is zero'ed if the directory entry is removed. It is reused by the next directory entry created under that parent directory. This removes the need for garbage collection of directory entries on removals and also avoids zero block numbers in the middle of a directory entry's list of blocks. A zero block number while traversing a directory entry's dblocks or iblocks represents the end of directory or file contents. When a directory is removed, the parent will have a directory entry with a tag of Tdentry and Qpnone and the rest of the contents set to zero. .sp A directory's size is always zero. .sp .nf tests/6.sizes # shows the values of the above derived variables. 
Namelen 145 Ndblock 32 Niblock 6 Blocksize 503 Nindperblock 62 A Tind0 unit points to 1 data blocks (503 bytes) block points to 62 data blocks reli start 32 max 93 max size 94*Blocksize = 47282 bytes A Tind1 unit points to 62 data blocks (31186 bytes) block points to 3844 data blocks reli start 94 max 3937 max size 3938*Blocksize = 1980814 bytes = 1 MiB A Tind2 unit points to 3844 data blocks (1933532 bytes) block points to 238328 data blocks reli start 3938 max 242265 max size 242266*Blocksize = 121859798 bytes = 116 MiB A Tind3 unit points to 238328 data blocks (119878984 bytes) block points to 14776336 data blocks reli start 242266 max 15018601 max size 15018602*Blocksize = 7554356806 bytes = 7 GiB A Tind4 unit points to 14776336 data blocks (7432497008 bytes) block points to 916132832 data blocks reli start 15018602 max 931151433 max size 931151434*Blocksize = 468369171302 bytes = 436 GiB A Tind5 unit points to 916132832 data blocks (460814814496 bytes) block points to 56800235584 data blocks reli start 931151434 max 57731387017 max size 57731387018*Blocksize = 29038887670054 bytes = 26 TiB .fi .ne 30 .sp On an empty m[a]fs filesystem mounted at /n/mafs, the disk contents added by the below commands are: .nf mkdir /n/mafs/dir1 echo test > /n/mafs/dir1/file1 .fi .PS right bigboxht = boxht fieldht = 0.35*boxht { down { Bound: box height 10*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 64" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 64" box height fieldht invis "size 0" box height fieldht invis "pdblkno 3" box height fieldht invis "pqpath 63" box height fieldht invis "mtime 1653302180819962729" box height fieldht invis "mode 20000000777" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 19" box height fieldht invis " 1 0" box height fieldht invis " 2 0" box height fieldht invis 
"." box height fieldht invis "." box height fieldht invis "." box height fieldht invis " 30 0" box height fieldht invis " 31 0" box height fieldht invis "indirect blocks" box height fieldht invis " 0 0" box height fieldht invis " 1 0" box height fieldht invis " 2 0" box height fieldht invis " 3 0" box height fieldht invis " 4 0" box height fieldht invis " 5 0" box height fieldht invis "name dir1" "Block 18 contents: /dir1 Dentry" at Bound.nw + 0,0.1i ljust "Representation of a file in a directory: /dir1/file1" ljust at Bound.n + 0,0.3i } move 4*boxwid { down { Bound: box height 10*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 65" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 65" box height fieldht invis "size 5" box height fieldht invis "pdblkno 18" box height fieldht invis "pqpath 64" box height fieldht invis "mtime 1653302180823455071" box height fieldht invis "mode 666" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 20"; {"content is in Block 20" at last box.e + 1i,0 ljust} box height fieldht invis " 1 0" box height fieldht invis " 2 0" box height fieldht invis "." box height fieldht invis "." box height fieldht invis "." 
box height fieldht invis " 30 0" box height fieldht invis " 31 0" box height fieldht invis "indirect blocks" box height fieldht invis " 0 0" box height fieldht invis " 1 0" box height fieldht invis " 2 0" box height fieldht invis " 3 0" box height fieldht invis " 4 0" box height fieldht invis " 5 0" box height fieldht invis "name file1" "Block 19 contents: file1 Dentry" at Bound.nw + 0,0.1i ljust } .PE .sp Contents of block 20 are: .nf disk/block tests/test.1/disk 20 Tdata 65 test .fi .PS right Start: { down { Bound: box height 8.5*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 66" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 66" box height fieldht invis "size 0" box height fieldht invis "pdblkno 3" box height fieldht invis "pqpath 63" box height fieldht invis "mtime 1653302180819962729" box height fieldht invis "mode 20000000777" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 22" box height fieldht invis " 1 24" box height fieldht invis "." box height fieldht invis "." box height fieldht invis "." box height fieldht invis " 31 0" box height fieldht invis "indirect blocks" box height fieldht invis " 0 0" box height fieldht invis "." 
box height fieldht invis " 5 0" box height fieldht invis "name dir2" "Block 21 contents: /dir2 directory entry" at Bound.nw + 0,0.1i ljust "Representation of two files in a directory (/dir2/file1 and /dir2/file2)" ljust at Bound.nw + 0.2,0.3i } move 4*boxwid { down { Bound: box height 8.5*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 67" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 67" box height fieldht invis "size 5" box height fieldht invis "pdblkno 21" box height fieldht invis "pqpath 66" box height fieldht invis "mtime 1653302180823455071" box height fieldht invis "mode 666" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 23" box height fieldht invis " 1 0" box height fieldht invis "." box height fieldht invis "." box height fieldht invis "." box height fieldht invis " 31 0" box height fieldht invis "indirect blocks" box height fieldht invis " 0 0" box height fieldht invis "." box height fieldht invis " 5 0" box height fieldht invis "name file1" "Block 22 contents: file1 directory entry" at Bound.nw + 0,0.1i ljust } down move 9*bigboxht { down { Bound: box height 8.5*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 68" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 68" box height fieldht invis "size 5" box height fieldht invis "pdblkno 21" box height fieldht invis "pqpath 66" box height fieldht invis "mtime 1653302180823455071" box height fieldht invis "mode 666" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 25" box height fieldht invis " 1 0" box height fieldht invis "." box height fieldht invis "." box height fieldht invis "." 
box height fieldht invis " 31 0" box height fieldht invis "indirect blocks" box height fieldht invis " 0 0" box height fieldht invis "." box height fieldht invis " 5 0" box height fieldht invis "name file2" "Block 24 contents: file2 directory entry" at Bound.nw + 0,0.1i ljust } .PE .sp iblocks[0] holds the block number of a Tind0 block. A Tind0 block contains a list of Tdata block numbers for files or a list of Tdentry block numbers for directories. .sp iblocks[1] has the block number of a Tind1 block. A Tind1 block contains a list of Tind0 block numbers. .sp Similarly, for other iblocks[n] entries, iblocks[n] has the block number of a Tind\fIn\fR block. A Tind\fIn\fR block contains a list of Tind\fI(n-1)\fR block numbers. .sp .sp Relative index .sp The zero'th relative index in a directory entry is the first data block. The next relative index is the second data block of the directory entry, and so on. .sp tests/6.reli shows how a relative index (reli) is translated into an actual disk block number. 
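.sp
As a rough illustration of this translation, the following C sketch maps a relative index to either a dblocks slot or an iblocks level, plus the slot to follow in each indirect block. It is not the tests/6.reli source: the Reliloc type and relitoloc() are hypothetical names, and the constants Ndblock = 32, Niblock = 6 and Nindperblock = 62 are taken from the tests/6.sizes output in this document.
.nf
```c
#include <assert.h>
#include <stdint.h>

enum {
	Ndblock = 32,		/* direct blocks in a Dentry */
	Niblock = 6,		/* maximum depth of indirect blocks */
	Nindperblock = 62	/* block pointers per indirect block */
};

/* Hypothetical result type, for illustration only. */
typedef struct Reliloc Reliloc;
struct Reliloc
{
	int level;		/* -1: dblocks[rel], else iblocks[level] */
	uint64_t rel;		/* index into dblocks, or relative index under iblocks[level] */
	int idx[Niblock];	/* idx[n]: slot to follow in the Tind<n> block */
};

/* Returns 0 on success, -1 if reli is beyond the maximum file size. */
int
relitoloc(uint64_t reli, Reliloc *loc)
{
	uint64_t span, unit, r;
	int level, n;

	assert(loc != 0);
	for(n = 0; n < Niblock; n++)
		loc->idx[n] = 0;
	if(reli < Ndblock){
		loc->level = -1;
		loc->rel = reli;
		return 0;
	}
	r = reli - Ndblock;
	span = Nindperblock;	/* data blocks reachable through iblocks[level] */
	for(level = 0; level < Niblock; level++){
		if(r < span)
			break;
		r -= span;
		span *= Nindperblock;
	}
	if(level == Niblock)
		return -1;
	loc->level = level;
	loc->rel = r;
	unit = span / Nindperblock;	/* data blocks under one slot of the top block */
	for(n = level; n >= 0; n--){
		loc->idx[n] = (int)(r / unit);
		r %= unit;
		unit /= Nindperblock;
	}
	return 0;
}
```
.fi
With these constants, relitoloc(32, &loc) reports iblocks[0] with Tind0 slot 0, and any index at or beyond 57731387018 is rejected as past the maximum file size.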
.sp To find the actual block number where the first block (zero'th as zero indexed) of a file is stored: .nf tests/6.reli 0 # command, below is the output of this command reli 0 dblock[0] .fi .sp To find the actual block number where the second block of a file is stored: .nf tests/6.reli 1 reli 1 dblock[1] .fi .sp And so on, for the 32nd and 33rd blocks of a file: .nf tests/6.reli 31 reli 31 dblock[31] tests/6.reli 32 reli 32 iblock[0] Tind0 reli 0 is at [0] .fi .sp This is how the last block of a 26 TiB file would be stored: .nf tests/6.reli 57731387017 reli 57731387017 iblock[5] Tind5 reli 56800235583 is at [61] Tind4 reli 916132831 is at [61] Tind3 reli 14776335 is at [61] Tind2 reli 238327 is at [61] Tind1 reli 3843 is at [61] Tind0 reli 61 is at [61] .fi .sp .PS right Start: { down { Bound: box height 8.5*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 1 70" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 70" box height fieldht invis "size 2056192" box height fieldht invis "pdblkno 26" box height fieldht invis "pqpath 69" box height fieldht invis "mtime 1653302180819962729" box height fieldht invis "mode 20000000777" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 28" box height fieldht invis " 1 29" box height fieldht invis " 2 30" box height fieldht invis "." box height fieldht invis "." 
box height fieldht invis "indirect blocks" box height fieldht invis " 0 61" box height fieldht invis " 1 124" box height fieldht invis " 2 4031" box height fieldht invis " 3 0" box height fieldht invis "name 2MB.file" "Block 27 contents" at Bound.nw + 0,0.1i ljust "Representation of a 2 MB file (/dir3/2MB.file)" ljust at Bound.n + 0,0.3i } move 4*boxwid { down { Bound: box height 6*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdata 70" box height fieldht invis "0 0123456789"; {"contents of 2MB.file" at last box.e + 1i,0 ljust} "Block 28 contents" at Bound.nw + 0,0.1i ljust } .PE .PS right Start: { down { Bound: box height 9*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdentry 72" box height fieldht invis "qid.version 0" box height fieldht invis "qid.path 72" box height fieldht invis "size 26214400" box height fieldht invis "pdblkno 4045" box height fieldht invis "pqpath 71" box height fieldht invis "mtime 1653302180819962729" box height fieldht invis "mode 20000000777" box height fieldht invis "uid 10006" box height fieldht invis "gid -1" box height fieldht invis "muid 10006" box height fieldht invis "direct blocks" box height fieldht invis " 0 4195" box height fieldht invis " 1 4196" box height fieldht invis " 2 4197" box height fieldht invis "." box height fieldht invis "." 
box height fieldht invis " 31 4226" box height fieldht invis "indirect blocks" box height fieldht invis " 0 4228" box height fieldht invis " 1 4291" box height fieldht invis " 2 8198" box height fieldht invis " 3 0" box height fieldht invis "name 25MB.file" "Block 4046 contents" at Bound.nw + 0,0.1i ljust "Representation of a 25MB file (/dir4/25MB.file)" ljust at Bound.n + 0,0.3i } move 4*boxwid { down { Bound: box height 4*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdata 72" box height fieldht invis "0 0123456789"; {"starting contents" at last box.e + 1i,0 ljust;} box height fieldht invis "."; {"of 25MB.file" at last box.e + 1i,0 ljust} box height fieldht invis "." box height fieldht invis "." "Block 4195 contents" at Bound.nw + 0,0.1i ljust } move to Start - 0,9.5*bigboxht { down { Bound: box height 6*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tind0 72" box height fieldht invis " 0 4227" box height fieldht invis " 1 4229" box height fieldht invis " 2 4230" box height fieldht invis "." box height fieldht invis "." box height fieldht invis " 61 4289" "Block 4228 contents" at Bound.nw + 0,0.1i ljust } right move 4*boxwid { down { Bound: box height 4*bigboxht width 3.3*boxwid } move 0.1i box height fieldht invis "Tdata 72" box height fieldht invis "789"; {"more content" at last box.e + 1i,0 ljust} box height fieldht invis "."; {"of 25MB.file" at last box.e + 1i,0 ljust} box height fieldht invis "." "Block 4227 contents" at Bound.nw + 0,0.1i ljust } .PE .sp .TS box; c s s s c l c c l a a a . 
System Files
=
Block	Description	Directory entry	Content
_
0	magic
1	config		Y
2	super		Y
3	/	Y
_
4	/adm/	Y
5	/adm/config	Y
6	/adm/super	Y
_
7	/adm/users	Y
8	/adm/users		Y
_
9	/adm/bkp/	Y
10	/adm/bkp/config.0	Y
11	/adm/bkp/super.0	Y
12	/adm/bkp/root.0	Y
13	/adm/bkp/config.1	Y
14	/adm/bkp/super.1	Y
15	/adm/bkp/root.1	Y
_
16	/adm/ctl (virtual file, empty contents)	Y
17	/adm/frees	Y
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.sp
The /adm/ctl file is used to halt or sync the file system.
/adm/users is a read-write file; writing to it reloads the users.
The owner of the /adm/ctl file or any user belonging to the sys group can ream the disk.
.sp
There is no /adm/magic directory entry, as the block number of the magic block is zero and a zero block number in a directory entry signifies the end of the directory contents.
.sp
.sp
.ne 4
Backup blocks
.sp
Three copies of the config, super and root blocks are maintained: the original and two backups of each.
.sp
The backup block numbers on the disk are calculated during ream based on the disk size.
.sp
.TS
box;
c l c s
c l c c
l a a a .
		Backup Blocks
Block	Description	1	2
_
1	config	last block number -1	middle block number -1
2	super block (obsolete?)	last block number -2	middle block number -2
3	/	last block number -3	middle block number -3
.TE
.ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n
.sp
M[a]fs needs at least Nminblocks=17 blocks (8.5 KiB).
The middle block number is Nminblocks + ((nblocks - Nminblocks)/2), where nblocks is the total number of blocks.
.fi
.sp
kfs and cwfs use 8192 byte blocks. Hence, they store multiple directory entries (Dentry) per block.
They use slot numbers to identify a particular directory entry in a block of directory entries.
M[a]fs avoids that by using 512 byte blocks, and thus has only one directory entry per block.
This avoids locking up other sibling directory entries on access.
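.sp
The backup block arithmetic described above can be sketched in C as follows. This is a minimal sketch, not the ream code: middleblock() and backupblock() are hypothetical names, and it assumes that the "last block number" is nblocks-1. Only the middle block formula and Nminblocks = 17 come from this document.
.nf
```c
#include <assert.h>
#include <stdint.h>

enum { Nminblocks = 17 };	/* minimum number of blocks, as in this document */

/* Middle block number, using the formula given in the text. */
uint64_t
middleblock(uint64_t nblocks)
{
	assert(nblocks >= Nminblocks);
	return Nminblocks + (nblocks - Nminblocks)/2;
}

/*
 * Block number of a backup copy (hypothetical helper).
 * primary: 1 = config, 2 = super, 3 = root (/)
 * copy: 1 = counted back from the last block,
 *       2 = counted back from the middle block
 * Assumption: the last block number is nblocks-1.
 */
uint64_t
backupblock(uint64_t nblocks, int primary, int copy)
{
	uint64_t base;

	assert(primary >= 1 && primary <= 3);
	assert(copy == 1 || copy == 2);
	base = copy == 1 ? nblocks - 1 : middleblock(nblocks);
	return base - (uint64_t)primary;
}
```
.fi
Under these assumptions, a disk of 2048 blocks has its middle block at 1032, the first config backup at block 2046 and the second root backup at block 1029.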
.sp
.sp
.ft B
Buffer cache - Hash buckets with a circular linked list of Iobuf's for collisions.
.ft R
.sp
An Iobuf represents a block in memory. An Iobuf is unique to a block.
All disk interaction, except for free block management, happens through an Iobuf.
We read a block from the disk into an Iobuf.
To update a block on the disk, we write to an Iobuf, which, in turn, gets written to the disk.
.sp
An Iobuf is protected by a read-write lock (RWLock).
This ensures synchronization across multiple processes updating the same file.
.sp
getbuf(), putbuf(), putbufs() and putbuffree() are used to manage Iobuf's.
The contents of an Iobuf are not touched unless it is locked by getbuf().
It is unlocked by a putbuf(), putbufs() or putbuffree() call.
The Iobuf.dirties Ref is decremented by the mafs writer's dowrite() without a lock().
This avoids deadlocks between putbuf() and the writer, especially when the writer queue is full.
.sp
allocblock() allocates a free block into an Iobuf.
allocblocks() allocates a number of free blocks, each with its own Iobuf.
.sp
freeblock() erases the Iobuf and returns the block to the free block management routines.
.sp
Iobuf's are organized into a list of hash buckets to speed up access.
.sp
.nf
Hiob *hiob = nil;	/* array of nbuckets */

struct Hiob	/* Hash bucket */
{
	Iobuf* link;	/* least recently used Iobuf in the circular linked list */
	QLock;	/* controls access to this hash bucket */
};

struct Content	/* used to unmarshal the disk contents */
{
	union{
		u8 buf[Blocksize];
		u64 bufa[Nindperblock];
		Dentry d;
	};
	Tag;
};

struct Iobuf
{
	Ref;
	RWLock;	/* controls access to this Iobuf */
	u64 blkno;	/* block number on the disk, primary key */
	Iobuf *fore;	/* for lru */
	Iobuf *back;	/* for lru */
	union{
		u8 *xiobuf;	/* "real" buffer pointer */
		Content *io;	/* cast'able to contents */
	};

	/*
		This field is used by mafs to ensure that Iobufs are not reused
		while there are pending writes.

		dowrite() uses a Ref instead of a wlock() to mark Iobuf's with
		pending writes. Using a wlock() in dowrite() causes a deadlock
		with putwrite(), especially when the writer queue is full.

		getbuf() guarantees that even a free'ed block cannot be stolen
		until dirties == 0. This avoids dirty blocks being stolen for
		other block numbers.

		incref(dirties) only happens while holding a wlock() in putwrite().
	 */
	Ref dirties;	/* number of versions of this block yet to be written by the writer */
};
.fi
.sp
The Iobuf's are arranged into a list of hash buckets.
Each bucket points to a circular linked list of Iobuf's to handle collisions.
If all the Iobuf's in the circular linked list are locked, new Iobuf's are added to this linked list.
This circular list is ordered on a least recently used basis.
Iobuf's once added to this list are not removed.
When a block is not found in the list, the oldest unlocked Iobuf is reused for it.
.sp
Hiob hiob[nbuckets] is a valid representation of the list of hash buckets.
The block number is hashed to arrive at the relevant hash bucket index.
.sp
hiob[hash(block number)].link = Address of Iobuf0, where Iobuf0 is the least recently used Iobuf.
.PS
{
	right
	Iobuf0: box "Iobuf 0"; move
	Iobuf1: box "Iobuf 1"; move
	Iobuf2: box "Iobuf 2"
}
down
move; move
{
	right
	Iobufn: box "Iobuf n"; move
	Iobufn1: box "Iobuf n-1"; move
	Iobufn2: box "Iobuf n-2"
}
arrow from Iobuf0.ne - 0,0.05i to Iobuf1.nw - 0,0.05i
arrow from Iobuf1.sw + 0,0.05i to Iobuf0.se + 0,0.05i
arrow from Iobuf1.ne - 0,0.05i to Iobuf2.nw - 0,0.05i
arrow from Iobuf2.sw + 0,0.05i to Iobuf1.se + 0,0.05i
arrow from Iobufn.ne - 0,0.05i to Iobufn1.nw - 0,0.05i
arrow from Iobufn1.sw + 0,0.05i to Iobufn.se + 0,0.05i
arrow from Iobufn1.ne - 0,0.05i to Iobufn2.nw - 0,0.05i
arrow from Iobufn2.sw + 0,0.05i to Iobufn1.se + 0,0.05i
arrow dashed from Iobuf0.sw + 0.05i,0 to Iobufn.nw + 0.05i,0
arrow dashed from Iobufn.ne - 0.05i,0 to Iobuf0.se - 0.05i,0
arrow dashed from Iobuf2.sw + 0.05i,0 to Iobufn2.nw + 0.05i,0
arrow dashed from Iobufn2.ne - 0.05i,0 to Iobuf2.se - 0.05i,0
.PE
.sp
The size of the buffer cache is: number of hash buckets * collisions per hash bucket * block size.
With the values shown here, the approximate size of the buffer cache = Nbuckets * Ncollisions * Rawblocksize = 256 * 10 * 512 bytes = 1,310,720 bytes (1.25 MiB).
The -h parameter can be used to change the number of hash buckets.
.sp
If you have RAM to spare, increase Nbuckets instead of Ncollisions, as the hash index lookup is faster than searching through a linked list.
.sp
Iobuf.Ref is used to avoid locking up the hash bucket while a process is waiting for a lock on an Iobuf in that hash bucket.
.sp
Iobuf.Ref ensures that an Iobuf is not stolen in the window between a process letting go of the lock on the hash bucket and that process wlock()'ing the Iobuf.
We cannot hold the lock on the hash bucket until we wlock() the Iobuf, as that blocks other processes from using the hash bucket.
This could also result in a deadlock.
For example, suppose the directory entry is in block 18, which hashes to a hash index of 7.
A writer() locked the directory entry Iobuf and wants to add a data block 84 to the directory entry.
Block 84 hashes to the same hash index of 7.
Another process wanting to access the directory entry is waiting for a lock on that Iobuf.
While doing so, it holds the lock on the hash bucket.
The two processes are now deadlocked.
The first process cannot proceed until it can lock the hash bucket holding block 84, and it is still holding the lock on the directory entry in block 18.
The second process cannot lock block 18, and it is holding the lock on the hash bucket.
.nf
for locking a buffer:
	qlock(hash bucket);
	incref(buffer);
	qunlock(hash bucket);
	wlock(buffer);
	decref(buffer);

for stealing an unused buffer:
	qlock(hash bucket);
	find a buffer with ref == 0 and wlock()'able.
	qunlock(hash bucket);

for unlocking a buffer:
	wunlock(buffer);
.fi
.sp
.sp
.ne 10
.ft B
Asynchronous writes of Mafs
.ft R
.sp
The blocks to be written to a disk are stored in a linked list represented by:
.br
.nf
struct Dirties
{
	QLock lck;	/* controls access to this writer queue */
	Wbuf *head, *tail;	/* linked list of dirty blocks yet to be written to the disk */
	s32 n;	/* number of dirty blocks in this linked list */
	Rendez isfull;	/* write throttling */
	Rendez isempty;	/* writer does not have to keep polling to find work */
} drts = {0};

struct Wbuf
{
	u64 blkno;	/* block number on the disk, primary key */
	Wbuf *prev, *next;	/* writer queue */
	Iobuf *iobuf;	/* pointer to the used Iobuf in the buffer cache */
	union{
		u8 payload;	/* "real" contents */
		Content io;	/* cast'able to contents */
	};
};
.fi
.sp
A single writer process takes the blocks from the Dirties linked list on a FIFO (first-in-first-out) basis and writes them to the disk.
putbuf() and putbufs() add blocks to the end of this linked list, the writer queue.
.sp
The dirty blocks not yet written to the disk remain in the buffer cache and cannot be stolen when a need for a new Iobuf arises.
.sp
Free'd blocks are not written to the disk, to avoid writing blanks to a disk.
.sp
The writer throttles input when there are more than Npendingwrites waiting to be written.
This can be adjusted with the -w parameter.
.sp
The alternative to having a single writer process is to have each worker process write to the disk, as mfs does.
Synchronous writes throttle writes to the disk write speed.
With asynchronous writes, memory is used to hold the data until it is written to the disk.
This gives increased write throughput until memory fills up.
After that, writes happen at disk speed.
Asynchronous writes have the side effect of a single disk write queue.
.sp
The ideal Npendingwrites = ((UPS time in seconds)/2) * (disk speed in bytes/second) / Rawblocksize.
.sp
.sp
.ne 4
.ft B
Free blocks
.ft R
.sp
Free blocks are managed using Extents.
The list of free blocks is stored to the disk when shutting down.
If this state is not written, the file system needs to be checked and the list of free blocks updated.
.sp
When shutting down, the Extents are written to free blocks.
This information can be accessed from /adm/frees.
Also, fsok in the super block is set to 1.
M[a]fs does not start unless fsok is 1.
When fsok = 0, run the sanity check that the unused blocks and the free blocks in /adm/frees match up.
disk/reconcile identifies any missing blocks and any blocks that are marked as both used and free.
.sp
This process of fixing issues and setting fsok to 1 is manual.
There is no automatic file system checker as in other file systems.
This document aims to empower you with the knowledge to fix your file system issues instead of entrusting your precious data to an arbitrary decision maker such as the file system checker.
.sp
A tag of Tfree and Qpnone represents a free block.
If a directory entry is removed, the parent will have a zero'ed out child directory entry (Qid.path = 0) and a tag of Tdentry and Qpnone.
.sp
.sp
.ne 4
.ft B
Extents
.ft R
.sp
Free blocks and memory are managed using Extents, an abstraction used to manage a contiguous list of items.
.sp
An Extent represents a contiguous list of items.
An Extents is a list of such Extent's. .sp .nf struct Extent { struct Extent *low, *high; /* sorted by start */ u64 start; /* where this extent starts from */ u64 len; /* how many units in this extent */ /* circular least recently used linked list limited to Nlru items */ struct Extent *prev, *next; }; struct Extents { Extent *head; /* find the first block in a jiffy */ QLock lck; u32 n; /* number of extents */ Rendez isempty; /* fully used, nothing available */ u8 nlru; /* number of items in the lru linked list */ Extent *lru; /* least recently used extent in the circular lru linked list */ }; .fi .sp To allocate n items from Extents, we find the lowest (by block number or memory address) extent that can satisfy our request. If a bigger Extent is available, slice it and take the portion we need. .sp If there is no available Extent to satisfy our request, panic(). .sp allocblock() and freeblock() use balloc() and bfree() respectively. balloc() assigns blocks from an extent and bfree() adds them to an extent for next allocation. 
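.sp
The allocate and free behaviour described above can be sketched in C as follows. This is a simplified illustration, not the M[a]fs implementation: it keeps the extents in a small sorted array instead of the linked structures shown above, it has no locking or lru list, and Ext, Exts, extfree() and extalloc() are hypothetical names. Where M[a]fs would panic() on an unsatisfiable request, this sketch returns 0, on the assumption that item 0 is never free (block 0 holds the magic word).
.nf
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { Maxextents = 64 };	/* arbitrary limit for this sketch */

typedef struct Ext Ext;
struct Ext { uint64_t start, len; };

typedef struct Exts Exts;
struct Exts { Ext e[Maxextents]; int n; };	/* e[0..n-1], sorted by start */

/* Free len items from start, merging with adjacent extents. -1 if the array is full. */
int
extfree(Exts *x, uint64_t start, uint64_t len)
{
	int i;

	assert(x != 0 && len > 0);
	for(i = 0; i < x->n && x->e[i].start < start; i++)
		;
	if(i > 0 && x->e[i-1].start + x->e[i-1].len == start){
		x->e[i-1].len += len;	/* grow the previous extent */
		if(i < x->n && start + len == x->e[i].start){
			/* the freed range also touches the next extent: join both */
			x->e[i-1].len += x->e[i].len;
			memmove(&x->e[i], &x->e[i+1], (x->n-i-1)*sizeof(Ext));
			x->n--;
		}
		return 0;
	}
	if(i < x->n && start + len == x->e[i].start){
		x->e[i].start = start;	/* grow the next extent downwards */
		x->e[i].len += len;
		return 0;
	}
	if(x->n == Maxextents)
		return -1;
	memmove(&x->e[i+1], &x->e[i], (x->n-i)*sizeof(Ext));
	x->e[i].start = start;
	x->e[i].len = len;
	x->n++;
	return 0;
}

/* Take len items from the lowest extent that is big enough. 0 if none fits. */
uint64_t
extalloc(Exts *x, uint64_t len)
{
	uint64_t start;
	int i;

	assert(x != 0 && len > 0);
	for(i = 0; i < x->n; i++){
		if(x->e[i].len < len)
			continue;
		start = x->e[i].start;
		x->e[i].start += len;	/* slice off the front of the extent */
		x->e[i].len -= len;
		if(x->e[i].len == 0){
			memmove(&x->e[i], &x->e[i+1], (x->n-i-1)*sizeof(Ext));
			x->n--;
		}
		return start;
	}
	return 0;
}
```
.fi
For instance, starting from extents (100, 5) and (110, 3), freeing block 105 and the 4 blocks after it with extfree(&x, 105, 5) leaves the single extent (100, 13).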
.sp .PS # define field { [right; box invis $1 ljust; box invis $2 rjust; down] } # define field { [right; box $1 ljust; box $2 rjust; down] } define field { [right; box invis $1; box invis $2; down] } boxht = 0.5*boxht down { box invis "Extents at memory location 1" Extents: {box ht 3*boxht wid 2*boxwid} Lru: field("lru", "100") { " assuming that the Extent at 100 was used last" ljust at Lru.e } El: field("el","0") { " unlocked" ljust at El.e } field("n","3") } .PE .PS down move 4*boxht define extent { [ down Extent: {box ht 4*boxht wid 2*boxwid} field("blkno", $1) Len: field("len",$2) { right line dashed from Len.sw to Len.se } field("low",$3) High: field("high",$4) if $5 > 0 then { "Extent at" ljust above at Extent.nw "$5" ljust above at Extent.n } ] } { right extent("10", "1", "0", "200", 100); move extent("20", "3", "100", "300", 200); move extent("30", "2", "200", "0", 300); } down { move boxht*5 right box invis "+" box invis width 2 "freed block numbers" "11,12,13,14" box invis "=" } down move boxht*10 { right extent("10", "5", "0", "200", 100); move extent("20", "3", "100", "300", 200); move extent("30", "2", "200", "0", 300); } .PE .PS # ../tests/extents/addabove define delimiter { down line right 5 dashed move down 0.25 } define headingfield { [ right; Blkno: box invis $1; Len: box invis $2; ] } define order { down arrowwid=0.15 arrowht=0.15 arrow 0.25i at $1 } right Before: [ down Head: headingfield("blkno", "len", Blkno.w) { order(Head.w) } field("20", "3") ] { "Extents before" above ljust at Before.nw } [ right box invis "+" box invis "Block number 40" "followed" "by 3 free blocks" box invis "=" ] move After: [ down Headb: headingfield("blkno", "len", Blkno.w) { order(Headb.w) } field("20", "3") field("40", "4") ] { "Extents after" above ljust at After.nw } .PE .PS delimiter .PE .PS # ../tests/extents/mergeabove right Before: [ down Head: headingfield("blkno", "len", Blkno.w) { order(Head.w) } field("100", "5") field("110", "3") ] { "Extents before" 
above ljust at Before.nw } [ right box invis "+" box invis "Block number 105" "followed" "by 4 free blocks" box invis "=" ] move After: [ down Headb: headingfield("blkno", "len", Blkno.w) { order(Headb.w) } field("100", "13") ] { "Extents after" above ljust at After.nw } .PE .PS delimiter .PE .PS # ../tests/extents/mergeprevious right Before: [ down Head: headingfield("blkno", "len", Blkno.w) { order(Head.w) } field("105", "4") ] { "Extents before" above ljust at Before.nw } [ right box invis "+" box invis "Block number 101" "followed" "by 3 free blocks" box invis "=" ] move After: [ down Headb: headingfield("blkno", "len", Blkno.w) { order(Headb.w) } field("101", "8") ] { "Extents after" above ljust at After.nw } .PE .PS delimiter .PE .PS # ../tests/extents/mergenext right Before: [ down Head: headingfield("blkno", "len", Blkno.w) { order(Head.w) } field("101", "4") ] { "Extents before" above ljust at Before.nw } [ right box invis "+" box invis "Block number 105" "followed" "by 3 free blocks" box invis "=" ] move After: [ down Headb: headingfield("blkno", "len", Blkno.w) { order(Headb.w) } field("101", "8") ] { "Extents after" above ljust at After.nw } .PE .PS delimiter .PE .PS # ../tests/extents/addabove1 right Before: [ down Head: headingfield("blkno", "len", Blkno.w) { order(Head.w) } field("180", "4") ] { "Extents before" above ljust at Before.nw } [ right box invis "+" box invis "Block number 250" "followed" "by 3 free blocks" box invis "=" ] move After: [ down Headb: headingfield("blkno", "len", Blkno.w) { order(Headb.w) } field("180", "4") field("250", "4") ] { "Extents after" above ljust at After.nw } .PE .PS delimiter .PE .PS # ../tests/extents/addbelow right Before: [ down Head: headingfield("blkno", "len", Blkno.w) { order(Head.w) } field("250", "4") ] { "Extents before" above ljust at Before.nw } [ right box invis "+" box invis "Block number 180" "followed" "by 3 free blocks" box invis "=" ] move After: [ down Headb: headingfield("blkno", "len", 
Blkno.w) { order(Headb.w) } field("180", "4") field("250", "4") ] { "Extents after" above ljust at After.nw } .PE .sp Kfs stores the list of free blocks in a Tfrees block and the Superblock. Instead, M[a]fs uses block management routines, similar to pool.h, to allocate and monitor free blocks. On shutdown(), the block management routines (extents.[ch]) store their state into the free blocks. This can be read from /adm/frees. On startup, this is read back by the block management routines. After a crash, walk the directory structure with disk/used to identify the used blocks, derive the free blocks with disk/unused, and recreate /adm/frees with disk/updatefrees. .sp .sp .ne 12 .ft B Code details .ft R .sp .TS allbox; c c l a . Program Description _ disk/mfs Start mfs on a disk. disk/mafs Start mafs on a disk. disk/free List the free blocks. It reads the contents of /adm/frees. disk/used List the used blocks by traversing all directory entries. disk/block Show the contents of a block. disk/unused Given a list of used blocks, list the unused blocks. disk/updatefrees Update the contents of /adm/frees. .TE .sp .TS allbox; c c r l a r . File Description chatty9p _ 9p.c 9p transactions 2 sub.c initialization and super block related routines. 2 dentry.c encode/decode the file system abstraction into block operations. 3 iobuf.c routines on Iobufs. The bkp() routines operate on Iobufs. 5 extents.[ch] routines to manage the free blocks. 6 ctl.c /adm/ctl operations. tag.c routines to manage a relative index (reli) in a directory entry. blk.c routines to show blocks. writer.c disk writer routines. console.c obsolete. /adm/ctl is the console. .TE .ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n .in 0 .sp A Chan's state can get out of sync with the contents if another process changes the on-disk state. An Ephase error occurs when that happens. .sp For throughput, multiple processes are used to service 9p i/o requests when the -s flag is not used. .sp .sp .ne 4 .ft B Useful commands: .ft R .sp disk/mfs and disk/mafs have the same arguments.
The following commands use disk/mafs to avoid duplicating them for disk/mfs. .sp Ream and start a single-process M[a]fs on a disk, and mount it for use. .sp .nf mount -c <{disk/mafs -s -r mafs_myservice mydisk <[0=1]} /n/mafs_myservice .in 3n .br -s: use stdin and stdout for communication -r mafs_myservice: ream the disk using mafs_myservice as the service name mydisk: the disk to run mafs on .in 0 .fi .sp Ream and start multiple-process mafs on a disk. .sp .nf disk/mafs -r mafs_myservice mydisk mount -c /srv/mafs_myservice /n/mafs_myservice .fi .sp .ne 7 Ream and start mafs on a file. Also mount the file system at /n/mafs_myservice. .sp .nf dd -if /dev/zero -of myfile -bs 512 -count 128 # 64KB file mount -c <{disk/mafs -s -r mafs_service myfile <[0=1]} /n/mafs_myservice # to reuse the contents of myfile later, remove -r (ream) from the above command. mount -c <{disk/mafs -s myfile <[0=1]} /n/mafs_myservice .fi .sp Prepare and use a disk (/dev/sdF1) for mafs. .sp .nf disk/fdisk -bawp /dev/sdF1/data # partition the disk echo ' a fs 9 $-7 w p q' | disk/prep -b /dev/sdF1/plan9 # add an fs plan 9 partition to the disk disk/mafs -r mafs_sdF1 /dev/sdF1/fs # -r to ream the disk mount -c /srv/mafs_sdF1 /n/mafs_sdF1 # for using the mafs file system on the disk later on disk/mafs /dev/sdF1/fs # no -r mount -c /srv/mafs_sdF1 /n/mafs_sdF1 .fi .sp Start mafs on a 2MB file. The commands below create disk.file to use as a disk and mount the file system at /n/mafs_disk.file. .sp .nf dd -if /dev/zero -of disk.file -bs 512 -count 4096; mount -c <{disk/mafs -s -r mafs_disk.file \\ disk.file <[0=1]} /n/mafs_disk.file .fi .sp Start mafs on a RAM file. The commands below create a file in a ramfs file system to use as a disk. .sp .nf ramfs -m /n/mafs_ramfs touch /n/mafs_ramfs/file dd -if /dev/zero -of /n/mafs_ramfs/file -count 700 -bs 1m disk/mafs -r mafs_ramfs_file /n/mafs_ramfs/file mount -c /srv/mafs_ramfs_file /n/mafs_ramfs_file .fi .sp Sync M[a]fs.
This command does not return until all pending writes are written to the disk, so it can take a long time if the writer queue is long. .sp echo sync >> /n/mafs_myservice/adm/ctl .sp Stop M[a]fs. This command does not return until all pending writes are written to the disk, so it can take a long time if the writer queue is long. .sp echo halt >> /n/mafs_myservice/adm/ctl .sp Interpret the contents of a block based on its tag and write out the formatted block. .sp disk/block tests/test.0/disk 22 .sp Traverse the directory hierarchy and write out all the used block numbers. disk/reconcile uses the output of this to reconcile the list of used blocks with the list of free blocks. Starting from the root, disk/used walks down each directory entry and writes the linked blocks with invalid tags to stderr. (Why not write out the list of dirty blocks too, instead of using a different command for it?) .sp disk/used tests/test.0/disk .sp Show the list of free blocks from the contents of /adm/frees. disk/reconcile uses the output of this to reconcile the list of used blocks with the list of free blocks. .sp disk/free tests/test.0/disk .sp Read two lists of block numbers and flag the common and missing blocks. .sp .nf disk/reconcile -u <{disk/used tests/test.0/disk} \\ -F <{disk/free tests/test.0/disk} 32 .fi .sp .ne 3 disk/find traverses the directory hierarchy and identifies the file that a block number belongs to. .sp disk/find tests/test.0/disk 17 .sp .ne 3 Find the total number of blocks on a disk. .sp .nf dd -if /dev/sdF1/fs -bs 512 -iseek 1 -count 1 -quiet 1 | awk '$1 == "nblocks" { print $2 }' disk/block /dev/sdF1/fs 1 | awk '$1 == "nblocks" { print $2 }' .fi .sp .ne 5 Build the list of free blocks. This should match the contents of /adm/frees.
.sp .nf disk/unused <{disk/used /dev/sdF1/fs} 11721040049 # 11721040049 = total number of disk blocks disk/unused <{disk/used tests/test.0/disk} 32 # 32 = total number of disk blocks .fi .sp .ne 5 Change the contents of /adm/frees. .sp .nf disk/updatefrees tests/test.0/disk <{disk/unused <{disk/used tests/test.0/disk} 32} disk/updatefrees /dev/sdF1/fs <{disk/unused <{disk/used /dev/sdF1/fs} 11721040049} .fi .sp .ne 5 Sanity check: confirm that the file system is not corrupt by comparing the unused blocks with the free blocks; the two lists should match. In the command below, tests/test.0/disk is the disk and 32 is its total number of disk blocks. .sp .nf diff <{disk/unused -l <{disk/used tests/test.0/disk} 32} <{disk/free tests/test.0/disk} .fi .sp Change the service name without a ream. .sp .nf disk/block /dev/sdF1/fs 1 Tdata 2 size 6001172505088 nblocks 11721040049 backup config 1 to 11721040048 5860520032 backup super 2 to 11721040047 5860520031 backup root 3 to 11721040046 5860520030 service mafs_ddf_1 dd -if /dev/sdF1/fs -count 10 -skip 682 -bs 1 mafs_ddf_110+0 records in 10+0 records out dd -if <{echo m_ddf_1; cat /dev/zero} -of /dev/sdF1/fs -count 11 -oseek 682 -bs 1 7+0 records in 7+0 records out disk/block /dev/sdF1/fs 1 Tdata 2 size 6001172505088 nblocks 11721040049 backup config 1 to 11721040048 5860520032 backup super 2 to 11721040047 5860520031 backup root 3 to 11721040046 5860520030 service m_ddf_1 .fi .sp Change the magic phrase in the magic block. .sp .nf disk/block /dev/sdF1/fs 0 Tmagic 1 mafs device 512 dd -if /dev/sdF1/fs -count 16 -iseek 256 -bs 1 mafs device 512 20+0 records in 20+0 records out dd -if <{echo m[a]fs device; echo 512; cat /dev/zero} -of /dev/sdF1/fs -count 18 -oseek 256 -bs 1 18+0 records in 18+0 records out dd -if /dev/sdF1/fs -count 18 -iseek 256 -bs 1 m[a]fs device 512 18+0 records in 18+0 records out disk/block /dev/sdF1/fs 0 Tmagic 1 m[a]fs device 512 .fi .sp .sp .ne 20 .ft B Tests .ft R .sp .TS box; c l l a .
Program Description _ tests/regress.rc All regression tests tests/chkextents.rc Unit tests on extents tests/chkreli.rc Unit tests on relative index lookups _ tests/6.offsets Write a file using different offsets to test mafswrite() tests/6.sizes Show the effects of the different parameters tests/6.testextents Test extents.[ch] state changes tests/6.reli Translate relative index to block number on a disk .TE .sp The disk state tests below: .in 3n .br .ti 0 1. Initialize a disk for mafs. .br .ti 0 2. Run mfs or mafs on that disk. .br .ti 0 3. Stop mfs or mafs. .br .ti 0 4. Compare the contents with the expected contents (tests/test.0/blocks/*). .in 0 .sp .TS box; c s c l l a . Disk State = Test Description _ tests/test.0 empty disk tests/test.1 create a file /dir1/file1 and echo test into it tests/test.2 writes at different offsets to a file and then removes the file _ tests/test.3 write, read and delete files with sizes up to 16384 blocks tests/test.4 directory copy tests/test.5 fcp gzipped files _ tests/test.6 df tests/test.7 multiple processes working on the file system simultaneously tests/test.8 check backup block locations _ tests/test.9 examples used by this document tests/test.a write, read and delete a 100MB file tests/test.b duplicate of test.2 but seeded with random data _ tests/test.d seed with random data and do mkdir -p a/b/c/d/e/f/g/h tests/test.e seed with random data and test directory and file deletions .TE .sp .TS box; c s c l l a .
Extents behaviour = Test Description _ tests/extents/addabove Figure 1 of the Extents section tests/extents/addabove1 Figure 2 of the Extents section tests/extents/addbelow Figure 3 of the Extents section _ tests/extents/mergeabove Figure 4 of the Extents section tests/extents/mergenext Figure 5 of the Extents section tests/extents/mergeprevious Figure 6 of the Extents section .TE .ta 5n 10n 15n 20n 25n 30n 35n 40n 45n 50n 55n 60n 65n 70n 75n 80n .in 0 .sp .ne 3 To run all the regression tests: .br .nf cd tests/; ./regress.rc .fi .sp .ne 3 To loop through all the blocks of a test: .br .nf for(t in tests/test.2/blocks/^`{seq 0 39}*){ echo $t; echo '---------'; cat $t; echo } .fi .sp .sp .ft B Performance metrics .ft R .sp .nf ramfs -m /n/ramfs touch /n/ramfs/file cat /dev/zero | tput -p > /n/ramfs/file 172.49 MB/s 174.56 MB/s 163.50 MB/s 125.00 MB/s 102.99 MB/s 87.81 MB/s 77.78 MB/s 69.50 MB/s 63.71 MB/s 58.65 MB/s 54.72 MB/s dd -if /dev/zero -of /n/ramfs/file -count 700 -bs 1m disk/mfs -r mfs_ramfs_file /n/ramfs/file mount -c /srv/mfs_ramfs_file /n/mfs_ramfs_file cat /dev/zero | tput -p > /n/mfs_ramfs_file/zeros.file 6.26 MB/s 5.99 MB/s 5.90 MB/s echo halt >> /n/mfs_ramfs_file/adm/ctl; lc /srv unmount /n/mfs_ramfs disk/mafs -r mafs_ramafs_file /n/ramfs/file mount -c /srv/mafs_ramafs_file /n/mafs_ramafs_file cat /dev/zero | tput -p > /n/mafs_ramafs_file/zeros.file # throttles down to mfs speed 45.49 MB/s 31.52 MB/s 23.16 MB/s 24.54 MB/s echo halt >> /n/mafs_ramafs_file/adm/ctl; lc /srv unmount /n/ramfs .fi .sp .sp .ne 3 .ft B Limitations .ft R .sp As we use packed structs to store data to the disk, a disk with m[a]fs is not portable to a machine with a different byte order (endianness). .sp .sp .ft B Design considerations .ft R .sp For exclusive-use files (mode has the DMEXCL bit set), there is no timeout. .sp Use an fs(3) device for RAID or other configurations involving multiple disks. .sp Why are you not using a checksum to verify the contents?
.br Checksums are probabilistic and can be implemented as a bespoke application instead of complicating the file system implementation. .sp .sp .ft B Source .ft R .sp http://git.9front.org/plan9front/mafs/HEAD/info.html .sp .sp .ft B References .ft R .sp [1] Sean Quinlan, "A Cached WORM File System," Software--Practice and Experience, Vol. 21, No. 12, December 1991, pp. 1289-1299. .br [2] Ken Thompson, Geoff Collyer, "The 64-bit Standalone Plan 9 File Server."