Description of internals of SPAD filesystem
===========================================

Crash counts
------------

The filesystem doesn't use journaling; instead it uses crash counts. They work
in the following way:

The filesystem has a preallocated crash count table [disk_cct] that stores, for
each crash count (16-bit), a transaction count (32-bit). The filesystem has a
current crash count [disk_cc] that is incremented on each mount and decremented
on each unmount (thus, it counts crashes). On mount, disk_cc is loaded into
memory [memory_cc] and disk_cc is increased by one (but the old value is kept in
memory). disk_cct is also loaded into memory [memory_cct] and memory_cct at
index memory_cc is increased by one.

Directory entries and most other structures contain a pair [cc, txc] of crash
count (16-bit) and transaction count (32-bit). A directory entry is valid if
cct[entry->cc] >= entry->txc --- if it is not valid, it is skipped during
directory scans just as if it didn't exist. New entries are written with values
(entry->cc = memory_cc, entry->txc = memory_cct[memory_cc]). So they are seen as
valid in current directory scans, but in case of a crash, the disk will contain
a one-less value: disk_cct[entry->cc] == entry->txc - 1, so they will be
invalid. More entries can be atomically made valid by writing memory_cct to
disk_cct.

With some bit tricks, this test is also used for atomic deletion of files and
for the atomic create-and-delete operation during rename --- the validity test
is really (__s32)(cct[entry->cc] - entry->txc) >= 0 and the top bit of
entry->txc marks whether the entry is scheduled for delete (1) or create (0).

On sync, all metadata buffers are written, the disk cache is flushed, memory_cct
is written to disk and the disk cache is flushed again. memory_cct[memory_cc] is
then increased. If it would become greater than or equal to 2^31, it is not
increased; instead memory_cc is written to disk and memory_cc is increased.
Allocation pages have two versions and a (cc, txc) pair describing which one is
valid --- they use the same validity test as above to make sure that allocation
pages are consistent with the directory structure.

Clearing up some misunderstanding of filesystem limits
------------------------------------------------------

There is no limit of 2^31 transactions. If the transaction count overflows, a
new crash count is seamlessly used. The real limit on the number of
transactions is 2^47. There is another limit of 2^16 crashes.

Furthermore, the notion of a transaction doesn't have the same meaning as in
journaled filesystems. In a journaled filesystem, a transaction is one atomic
change (for example, allocating a file); with crash counts, many changes are
batched into one transaction. A new transaction is opened on these three
conditions:

* sync or fsync
* at periodic intervals (the interval can be set), to make sure that unsynced
  data are written out eventually
* when the filesystem runs out of space and some blocks need to be freed

So the limit is not really 2^47 modifications, but 2^47 syncs --- if the disk
can perform a sync in 10 ms, the limit of 2^47 syncs would overflow in about
44000 years.

Layout
======

Files & directories
-------------------

The filesystem has one allocation area used for both metadata and file content.
File descriptors are embedded directly in directories, saving one seek on file
opening or directory traversal. There is really only one important structure
--- fnode_block --- which forms part of a directory. An fnode_block has 512
bytes (but several of these may be grouped together in a contiguous area for a
larger directory). An fnode_block contains fnodes; each fnode contains the file
name, attributes, access times, size and the first two extents (or a pointer to
another fnode_block in case it is a directory). So the directory tree may
consist only of a tree of fnode blocks; no other structures are required.
Other structures:

In case of a hardlink, a special structure fixed_fnode_block is allocated and
the fnode is moved to it. Directory entries for this file contain only a dummy
fnode that has the filename, a pointer to the fixed_fnode_block and a flag
meaning that it is a hardlink.

In case a file has more than 2 extents, an anode structure is allocated --- it
contains 20 direct extents and 10 indirect pointers of increasing depth (up to
depth 10), just like a classical Unix filesystem --- except that they contain
extents, not arrays of blocks. Sparse files are not supported.

If a directory is large, a structure dnode is allocated that contains pointers
to specific fnode_blocks. Lookup of a file is simple: the file name is hashed
to a 32-bit hash, the first few hash bits are used to index an array in the
dnode and the file is found in the corresponding fnode_block. If an fnode_block
overflows anyway, it is first split; if it can't be split (there is only one
pointer to this fnode_block in the parent dnode), we allocate a dnode of a
lower level, indexed by different bits of the hash. If there are so many hash
collisions that no more dnodes can be allocated (i.e. all hash bits are used by
some dnode --- i.e. log2(size_of_array_in_dnode)*dnode_tree_depth >= 32), the
filesystem finally forms a chain of fnodes --- this is inefficient for lookup,
but better than giving up.

Extended attributes
-------------------

Each fnode can have up to 176 bytes of extended attributes. An attribute is
identified by a name (3 bytes), a length (1 byte) and content. Unlike in other
filesystems, extended attributes are not meant to be changeable by the user;
they are meant to be set by the filesystem driver to deal with specifics of a
given operating system. In this implementation, the extended attribute 'UNX'
contains features specific to Unix --- mode, uid, gid and file prealloc. An
additional extended attribute 'UNR' is created on block and char devices and
contains the rdev number. An attribute 'SYM' is created for symlinks.
Extended attributes are formed and parsed in such a way that the filesystem
driver and fsck skip unknown attributes, so it is possible to add new features
to the filesystem without breaking backward compatibility.

Allocation structures
---------------------

The filesystem contains several structures called apages, each describing some
variable range of its area. An apage contains a doubly-linked list of free
extents --- struct apage_entry --- each containing the starting sector and
length of the corresponding free area. Initially there is only one apage
containing an apage_entry for the whole disk. The doubly-linked list of
apage_entries is sorted by block number; however, the entries may be placed
anywhere in the apage. Only the list is sorted, not the entries' positions.

This structure allows lookup/add/delete of apage_entries with average O(sqrt(n))
complexity, where n is the size of the apage: to find an apage_entry for space
near a specific sector, we select sqrt(n) entries at pseudorandom positions and
choose the entry with the highest sector number that is still lower than the
requested one. From this entry we walk the sorted list pointers until we find an
entry with a sector number greater than the requested one. On average, we walk
sqrt(n) entries.

When an apage fills up, it is split. If it is too fragmented (such that a bitmap
would be more efficient than this list), it is converted to a bitmap. The
filesystem keeps enough spare apages for the worst-case scenario of free space
fragmentation. There is a structure --- apage_index --- containing pointers to
the individual apages. It contains first the pointers to used apages (with the
ranges corresponding to them) and finally the spare apages. struct apage and
apage_index have two versions and a (cc, txc) pair describing which one is
valid.
If an apage hasn't been modified since the last sync and it is about to be
modified, the active version is copied into the inactive version and (cc, txc)
is set in such a way that we can make the new version active with one write
into the crash count table --- so that it stays in sync with the directory
tree.

Superblock
----------

The superblock is located at sector 0x4000 (assuming 512-byte sectors); the
space below it is unused. The superblock contains general filesystem
information and pointers to other structures. There is also the txblock,
allocated at a variable position, usually right after the superblock; it
contains general filesystem information that changes during filesystem
operation --- for example the disk_cc value, and the (cc, txc) pair denoting
which apage_index is valid. The reason for moving volatile information to
another block is that if the txblock were damaged during a write, the
filesystem could still be recovered using the superblock.

All pointers to other blocks are in sectors (512-byte size) even if the
filesystem has a larger block size. If the block size is larger than 512,
pointers on disk have to be divisible by (block_size/512).

Allocation strategy
===================

The filesystem has three zones (metadata, small files and large files) and it
tries to satisfy an allocation from the appropriate zone. If the zone is full,
other zones are used. Each zone has a set of groups. The groups and zones have
no disk structures associated with them --- they are kept only in memory and
their purpose is to prevent fragmentation. Each directory has a pointer to its
current small-file and large-file zone, and its files are allocated from these
zones. When a zone fills up (or when it can't satisfy a large request), another
zone with less-than-average free space is found and assigned to the directory.
When extending an already existing file, the filesystem tries to allocate space
in a zone near the file's end; when there is no free space there, again, a
less-than-average zone is found (but the directory zone is not updated, except
when it is the file's first allocation in the large file zone). In the large
file zone, allocations are padded up to a cluster-size value (adjustable in
mkspadfs) to avoid free space fragmentation.

vim: textwidth=80