Using SPAD filesystem driver for Linux
======================================

Requirements
------------

Linux kernel 2.6 with modules enabled.
Block device with size from 16MB to 2^57 bytes.
Disk that can atomically write one sector (512 bytes), so that after a crash
the sector contains either the old or the new content.

Features
--------

- Uses crash counts instead of journaling to maintain consistency across
  crashes.
- 48-bit sector numbers. Block size from 512B to machine page size.
- Large directories are organized in a hashed structure similar to Fagin's
  extendible hashing. No btrees.
- Files without hardlinks and with at most 2 fragments are embedded directly in
  the directory, saving one seek on the open operation.
- Fragmented files are organized in trees of indirect blocks of increasing
  depth, as on a classical Unix filesystem, except that the trees contain
  extents instead of blocks.
- Free space is described by free block runs that form a sorted linked list.
  Average lookup/add/delete complexity in such an allocation page is sqrt(n).
  When this structure overflows, it is split. When free space is too
  fragmented, the structure is converted to a bitmap.

Compiling & installation
------------------------

Set "Support for Large Block Devices" (CONFIG_LBD) in your kernel configuration
if you want to use devices larger than 1TB (regardless of block size). With
this option set, the filesystem handles devices up to 2^57 bytes; without it,
it handles devices up to 2^40 bytes. Note that if you set this option on a
32-bit kernel, the whole filesystem will be somewhat slower because of 64-bit
calculations. If you change the option, recompile the kernel.

Optionally, set KERNELDIR in the Makefile to the path of your current kernel
source or to the build subdirectory in the modules directory.
By default, the path is taken from the uname -r command:
        KERNELDIR := /lib/modules/$(shell uname -r)/build

Type
        make
Type
        make install
(or copy the files mkspadfs and spadfsck somewhere in your path and copy
spadfs.ko somewhere in your module path).

Insert the filesystem driver with the command
        insmod spadfs.ko
Create a new filesystem on a block device with
        mkspadfs /dev/device
Mount it with
        mount -t spadfs /dev/device /mnt/mountpoint

Compiling into the kernel
-------------------------

Apply the patch spadfs.patch to the kernel source (you may need to adjust it
slightly if you use an older kernel).
Create the directory fs/spadfs in the kernel tree and copy all these files
there.
Run "make menuconfig"; SpadFS should appear in the list of filesystems.


Parameters
==========

mkspadfs [parameters] device_name [size]
----------------------------------------

If size is not specified, mkspadfs detects it. Size can have a suffix 'K', 'M',
'G' or 'T', which means that the number is in KiB, MiB, GiB or TiB.

--no-trim
        Do not discard the content of the block device.
--trim
        Discard the content of the block device.
--no-checksums
        Turn off checksums on metadata. Can be overridden with a mount option.
--checksums
        Turn on checksums on metadata (default). Can be overridden with a
        mount option.
--block-size
        Block size. The minimum is 512B, the maximum is 64KiB. The real
        maximum that the kernel can access is the page size of the machine.
        default: page size of the machine (or fnode-size or page-size if they
        are smaller)
--fnode-size
        Maximum size a directory can reach before the filesystem starts
        splitting it into hash pages. It must be >= block-size and
        <= page-size.
        default: 8KiB (or block-size if block-size > 8KiB or page-size if
        page-size < 8KiB)
--page-size
        Size of a page with allocation information and size of a directory
        hash page. It must be >= block-size and >= fnode-size. It can be at
        most 128KiB.
        default: 32KiB (or fnode-size or block-size if they are larger)
--cluster-size
        Files larger than the threshold are allocated in a different zone, in
        multiples of this value, to prevent fragmentation. It must be
        >= block-size.
        default: 32KiB
--cluster-threshold
        Threshold for using the cluster size.
        default: 128KiB (or cluster-size * 4)
--group-size
        Size of an allocation group. This has nothing to do with the layout of
        allocation information; groups are purely "virtual" --- they are kept
        only in kernel memory. Their purpose is to keep fragmentation down.
        default: 1/512 of the device size
--metadata-group-size
        Size of the zone for metadata. Rounded to a multiple of group-size.
        default: 1/64 of the device size
--smallfile-group-size
        Size of the zone for files smaller than cluster-threshold. Rounded to
        a multiple of group-size. The rest of the device is used for larger
        files.
        default: 1/8 of the device size
--reserve
        The amount of space reserved for root (in bytes).
        default: 2% - 0.5%
--
        No more options past this point; use it if your device name begins
        with -.

spadfsck [parameters] device_name
---------------------------------

Spadfsck need not be invoked after a crash, because the filesystem maintains
data consistency using crash counts. It should, however, be invoked if the
block device is damaged.

-a
-y
        Assume 'y' on all questions (except the dangerous ones, such as
        truncating the filesystem when part of the device is inaccessible).
-p
        Like -y, but skip some potentially destructive operations. This is
        used when running spadfsck automatically on each boot.
-n
        Open the device in read-only mode. Do not fix it, just print messages
        about errors.
-f
        Force checking even if there are no errors on the filesystem.
-r
        Nothing. For compatibility with e2fsck.
--mark
        Mark the filesystem for checking on the next reboot. This flag is
        automatically assumed when the user attempts to run spadfsck on a
        mounted filesystem.
--extend
        Extend the filesystem.
        You first need to extend the block device (for example with the
        lvextend command) and then run spadfsck with the --extend flag to
        extend the filesystem.
--set-reserve
        Set the number of bytes reserved for root to the specified value.
--memory
        Memory limit (in bytes or megabytes) for spadfsck. It caches
        previously read data until this limit is reached.
        default: 1/2 of available memory
--swapfile
        Store the free blocks bitmap in the specified file.
        Normally spadfsck stores the bitmap in memory (with long runs of 0s
        and 1s compressed). When the bitmap grows above the memory limit,
        spadfsck spills it to disk into unused parts of apages. When the
        apages are damaged, it has nowhere to store this information and its
        memory consumption can grow.
        This option specifies a file on another mounted device (or a raw
        partition) where the block allocation bitmap will be written during
        the check. With this option, spadfsck can check very large filesystems
        without excessive memory consumption.
--log
        Write a log to the specified file (on a different, mounted
        filesystem).
--sync-writes
        Do fsync after each write.
--rebuild-apages
        Rebuild allocation pages unconditionally.
--reset-crash-counts
        Reset crash counts over the whole filesystem.
--dont-store-cross-links
        Normally, when spadfsck finds cross-linked files, it stores
        information about each clash in memory so that it can print which
        files are cross-linked. On a large filesystem this can consume memory
        proportional to the filesystem size, so this option disables it.
--undelete
        Attempt to undelete deleted directories.
--undelete-scan-all
        Scan the whole filesystem (including the data area) when doing the
        undelete.
--cache
        Cache filesystem metadata in spadfsck memory. (default)
--nocache
        Do not cache filesystem metadata in memory (though it can still be
        cached in buffers if you use --nodirect).
--direct
        Use direct I/O.
--nodirect
        Use buffered I/O. (default)
--prefetch
        Prefetch blocks that will likely be needed in the future. (default)
--noprefetch
        Do not prefetch blocks.
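
The two-step procedure described under --extend can be sketched as a small
shell helper. This is only an illustration under stated assumptions: the
device path /dev/vg0/data, the 10G increment and the helper name grow_spadfs
are placeholders that do not come from this document, and the filesystem is
assumed to be unmounted while spadfsck runs. By default the helper only prints
the commands it would run.

```sh
#!/bin/sh
# Sketch: grow a SpadFS filesystem that lives on an LVM logical volume.
# Assumptions (not from this README): the device path, the size increment
# and the helper name grow_spadfs are placeholders; the filesystem is
# assumed to be unmounted while spadfsck runs.
#
# By default the helper only prints the commands (dry run).
# Set RUN=command in the environment to execute them for real.
grow_spadfs() {
    dev=$1      # block device holding the filesystem
    extra=$2    # amount to add, in lvextend -L syntax (e.g. 10G)

    # Step 1: extend the block device itself.
    ${RUN:-echo} lvextend -L "+$extra" "$dev" || return 1
    # Step 2: let spadfsck extend the filesystem to the new device size.
    ${RUN:-echo} spadfsck --extend "$dev"
}

# Dry run: print the two commands for a hypothetical logical volume.
grow_spadfs /dev/vg0/data 10G
```

Keeping the dry run as the default makes it easy to review the exact commands
before anything touches the device.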
Testing options:
These options are only for testing; they have no practical use. They are
enabled only if spadfsck is compiled with the TESTCODE symbol defined.
--recover-all-files
        Delete and recover all files on the filesystem.
        (normally only erroneous files are recovered)
--recover-all-directories
        Delete and recover all directories on the filesystem.
        (normally only erroneous directories are recovered)
--fragment-recovered-files
        Intentionally create fragments in recovered files.
        (normally fragments are created only when necessary)
--move-recovered-files
        Intentionally move the content of recovered files.
        (normally files are moved around only when resolving a cross-link)
--dont-sort-recovered-files
        Don't sort files when recovering them.
--make-apage-bitmaps
        When recovering apages, create them as bitmaps.
--always-swap
        Always swap to apages.
--debug-malloc
        Check malloc/free calls, add a redzone to the end of each block, and
        check for memory leaks when terminating.
--
        No more options past this point; use it if your device name begins
        with -.

Mount options
-------------

Specified with the -o option=value or -o option syntax in the mount command
or in the fstab file.

help
        Display help, do not mount.
uid=xxx
        Set the default uid of files that do not have the UNX attribute.
        (default 0)
gid=xxx
        Set the default gid of files that do not have the UNX attribute.
        (default 0)
umask=xxx
        Set the default mode of files that do not have the UNX attribute.
        (default is the inversion of the current process' umask)
prealloc_part=xxx
        Preallocate this fraction of an existing file's size --- i.e.
        prealloc_part=8 means to preallocate 1/8 of the file size on a write.
        (default 8)
prealloc_min=xxx
prealloc_max=xxx
        Minimum and maximum values for preallocation in bytes. The real
        preallocation will be the portion of the file size specified with
        prealloc_part, clamped to this interval.
        Note: If you set prealloc_min >= cluster-threshold, you force all
        files into the large-file group (this may or may not be intended).
        (defaults: prealloc_min 4096, prealloc_max 1048576)
xfer_size=xxx
        Report the optimal transfer size in st_blksize.
        cp and other applications copy files in blocks of this size.
        (default page size)
sync_time=xxx
        Sync after this interval in seconds. (default is 2 minutes)
no_checksums
checksums
        Disable or enable creating and verifying metadata checksums.
        Overrides the mkspadfs parameter.
ino64=no/yes/force
        Return 64-bit inode numbers. Unfortunately, this breaks some 32-bit
        userspace programs, so it is not recommended when 32-bit userspace is
        installed. The value force makes all inode numbers greater than 2^32.
usrquota
grpquota
        Use quotas.

Limitations (likely not fixable)
--------------------------------

* Inode numbers may not be unique on 32-bit systems. This is a Linux design
  problem that can only be fixed in the kernel. On 64-bit systems, inode
  numbers are unique.
* Symlink length is limited to 172 characters.
* No sparse files.
* There isn't (and likely won't be) any support for opening files by inode
  number instead of path, as needed by NFS servers.

vim: textwidth=80