UnixFS is a protocol-buffers-based format for describing files, directories and symlinks as Directed Acyclic Graphs (DAGs) in IPFS.
A node, a term borrowed from graph theory, is the smallest unit present in a graph. In UnixFS, there is a 1-to-1 mapping between nodes and blocks, so the two terms are used interchangeably in this document.
A node is addressed by a CID; in order to read a node, its CID is required. A CID includes two important pieces of information:

- the multicodec, which indicates how the node is encoded;
- the multihash, which identifies the hash function and digest of the block.

Thus, the block must be retrieved: that is, the bytes which, when hashed using the hash function specified in the multihash, give us the same multihash value back.
In UnixFS, a node can be encoded using two different multicodecs, listed below. More details are provided in the following sections:

- `raw` (`0x55`), which are single-block Files.
- `dag-pb` (`0x70`), which can be of any other type.

### Raw Nodes

The simplest nodes use `raw` encoding and are implicitly a File. They can be recognized because their CIDs are encoded using the `raw` (`0x55`) codec:
- Their content is the block body itself.
- Their size is the length of the block body (`Tsize` in the parent is equal to `blocksize`).

### dag-pb Nodes
More complex nodes use the `dag-pb` (`0x70`) encoding. These nodes require two steps of decoding. The first step is to decode the outer container of the block. This is encoded using the IPLD dag-pb specification, which can be summarized as follows:

```protobuf
message PBLink {
  // binary CID (with no multibase prefix) of the target object
  optional bytes Hash = 1;

  // UTF-8 string name
  optional string Name = 2;

  // cumulative size of target object
  optional uint64 Tsize = 3;
}

message PBNode {
  // refs to other objects
  repeated PBLink Links = 2;

  // opaque user data
  optional bytes Data = 1;
}
```
After decoding the node, we obtain a `PBNode`. This `PBNode` contains a `Data` field whose bytes require a second round of decoding: they form a protobuf message specified by the UnixFSV1 format:

```protobuf
message Data {
  enum DataType {
    Raw = 0;
    Directory = 1;
    File = 2;
    Metadata = 3;
    Symlink = 4;
    HAMTShard = 5;
  }

  required DataType Type = 1;
  optional bytes Data = 2;
  optional uint64 filesize = 3;
  repeated uint64 blocksizes = 4;
  optional uint64 hashType = 5;
  optional uint64 fanout = 6;
  optional uint32 mode = 7;    // opt-in, AKA UnixFS 1.5
  optional UnixTime mtime = 8; // opt-in, AKA UnixFS 1.5
}

message Metadata {
  optional string MimeType = 1;
}

message UnixTime {
  required int64 Seconds = 1;
  optional fixed32 FractionalNanoseconds = 2;
}
```
Summarizing, a `dag-pb` UnixFS node is an IPLD dag-pb protobuf whose `Data` field is a UnixFSV1 protobuf message. For clarity, this specification may represent these nested protobufs as one object; in that representation, it is implied that the `PBNode.Data` field is protobuf-encoded.
A dag-pb
UnixFS node supports different types, which are defined in
decode(PBNode.Data).Type
. Every type is handled differently.
### File type

A File is a container over an arbitrarily sized sequence of bytes. Files are either single-block or multi-block. A multi-block file is a concatenation of multiple child files.
#### `PBNode.Links` and `decode(PBNode.Data).blocksizes`

These sister-lists are the key reason IPLD dag-pb matters for files: they allow us to concatenate smaller files together.
Linked files are loaded recursively with the same process, following a DFS (depth-first search) order. Child nodes must be of type File: either a `dag-pb` File, or a `raw` block.
For example, consider this pseudo-JSON block:

```json
{
  "Links": [{"Hash": "Qmfoo"}, {"Hash": "Qmbar"}],
  "Data": {
    "Type": "File",
    "blocksizes": [20, 30]
  }
}
```
This indicates that this file is the concatenation of the Qmfoo
and Qmbar
files.
When reading a file represented with dag-pb
, the blocksizes
array gives us the
size in bytes of the partial file content present in children DAGs. Each index in
PBNode.Links
MUST have a corresponding chunk size stored at the same index
in decode(PBNode.Data).blocksizes
.
Implementers need to be extra careful to ensure the values in Data.blocksizes
are calculated by following the definition from Blocksize
.
This allows for fast indexing into the file. For example, if someone is trying
to read bytes 25 to 35, we can compute an offset list by summing all previous
indexes in blocksizes
, then do a search to find which indexes contain the
range we are interested in.
In the example above, the offset list would be [0, 20]
. Thus, we know we only need to download Qmbar
to get the range we are interested in.
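The range lookup just described can be sketched in Python. This is illustrative pseudocode, not part of the spec; the function name and argument shapes are assumptions:

```python
import bisect

def children_for_range(offsets, start, end):
    """Indexes of the children whose content overlaps the byte range [start, end).

    `offsets` is the offset list described above: offsets[i] is the byte
    position inside this file at which child i's content begins.
    """
    # last offset <= start marks the first child we need
    first = max(bisect.bisect_right(offsets, start) - 1, 0)
    # first offset >= end marks the first child we do NOT need
    last = bisect.bisect_left(offsets, end)
    return list(range(first, last))
```

With the example above (`offsets = [0, 20]`), bytes 25 to 35 resolve to child index `[1]`, i.e. only `Qmbar` needs to be fetched.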
A UnixFS parser MUST error if `blocksizes` and `Links` do not have the same length.
#### `decode(PBNode.Data).Data`

An optional array of bytes holding file content that comes before the content of the links. This must be taken into account when doing offset calculations: the length of `decode(PBNode.Data).Data` defines the value of the zeroth element of the offset list.
#### `PBNode.Links[].Name`

This field is only meaningful in Directory contexts and MUST be absent when creating a new file. For historical reasons, implementations parsing third-party data SHOULD accept links where this field is present but empty. If this field is present and non-empty, the file is invalid and the parser MUST error.
#### `decode(PBNode.Data).Blocksize`

This field is not directly present in the block; rather, it is a computable property of a dag-pb node, used in the parent node's `decode(PBNode.Data).blocksizes`. It is the length of the `decode(PBNode.Data).Data` field plus the sum of all links' `blocksizes`.
#### `decode(PBNode.Data).filesize`

If present, this field MUST be equal to the `Blocksize` computation above; otherwise, this file is invalid.
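A minimal sketch of the `Blocksize` computation and `filesize` check described above (the function name and argument shapes are hypothetical):

```python
def blocksize_and_check(data_bytes, child_blocksizes, filesize=None):
    """Compute a dag-pb File node's Blocksize and validate filesize.

    Blocksize is the length of decode(PBNode.Data).Data plus the sum of all
    links' blocksizes; filesize, when present, MUST equal it.
    """
    blocksize = len(data_bytes) + sum(child_blocksizes)
    if filesize is not None and filesize != blocksize:
        raise ValueError("filesize does not match computed Blocksize")
    return blocksize
```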
A file terminates a UnixFS content path. Any attempt to resolve a path past a file MUST error.
### Directory type

A Directory, also known as a folder, is a named collection of child nodes:

- Each entry of `PBNode.Links` is an entry (child) of the directory, and `PBNode.Links[].Name` gives you the name of that child.
- Two entries of `PBNode.Links` CANNOT have the same `Name`. If two identical names are present in a directory, the decoder MUST fail.

The minimum valid `PBNode.Data` field for a directory is as follows:
```json
{
  "Type": "Directory"
}
```
The remaining relevant values are covered in Metadata.
The canonical sorting order is lexicographical over the names.
In theory, there is no reason an encoder couldn't use another ordering. However, this loses some of its meaning when mapped into most file systems today, as most file systems consider directories to be unordered key-value objects.

A decoder SHOULD, if it can, preserve the order of the original files in the same way it consumed those names. However, when some implementations decode, modify and then re-encode, the original link order loses its meaning, given that there is no way to indicate which ordering was used originally.
Pop the left-most component of the path and try to match it to the `Name` of a child under `PBNode.Links`. If you find a match, remember the CID, but you MUST continue the search: if you find another match, you MUST error, since duplicate names are not allowed.

Assuming no errors were raised, you can continue path resolution on the remaining components, starting from the CID you remembered.
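Because duplicate names MUST produce an error, a resolver cannot stop at the first match. An illustrative sketch, where `links` is assumed to be a list of `(Name, Hash)` pairs:

```python
def resolve_component(links, name):
    """Match one path component against PBNode.Links.

    The whole list MUST be scanned: a second match makes the directory invalid.
    """
    matches = [cid for link_name, cid in links if link_name == name]
    if len(matches) > 1:
        raise ValueError("duplicate name in directory: " + name)
    if not matches:
        raise KeyError("no entry named: " + name)
    return matches[0]
```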
### Symlink type

A Symlink represents a POSIX symbolic link. A symlink MUST NOT have children.
The PBNode.Data.Data
field is a POSIX path that MAY be inserted in front of the
currently remaining path component stack.
There is no current consensus on how pathing over symlinks should behave. Some implementations return symlink objects and fail if a consumer tries to follow them through.
Symlink path resolution SHOULD follow the POSIX specification, over the current UnixFS path context, as much as is applicable.
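Given the lack of consensus noted above, the following is only one possible illustrative reading of the "insert in front of the remaining components" rule:

```python
def apply_symlink(target, remaining):
    """Push a symlink target's components onto the remaining component stack.

    Empty components from doubled or trailing slashes are dropped. A leading
    '/' (an absolute POSIX path) is not handled here, as its meaning inside a
    UnixFS DAG is undefined.
    """
    return [c for c in target.split("/") if c] + remaining
```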
### HAMT Directory

A HAMT Directory is a Hashed-Array-Mapped-Trie data structure representing a Directory. It is generally used to represent directories that cannot fit inside a single block. These are also known as "sharded directories", since they allow you to split large directories into multiple blocks, known as "shards".
- `decode(PBNode.Data).hashType` indicates the multihash function used to digest the path components for sharding. It MUST be murmur3-x64-64 (`0x22`).
- `decode(PBNode.Data).Data` is a bit field which indicates whether links are part of this HAMT or are its leaves. The usage of this field is unknown, given that the same information can be deduced from the link names.
- `decode(PBNode.Data).fanout` MUST be a power of two. It encodes the number of hash permutations used at each resolution step; the log base 2 of the `fanout` indicates how wide the bitmask on the hash will be at that step. `fanout` MUST be between 8 and 65536.

The field `Name` of an element of `PBNode.Links` for a HAMT starts with an uppercase hex-encoded prefix, which is `log2(fanout)` bits wide.
To resolve a path component inside a HAMT:

1. Hash the path component using the multihash function indicated by `decode(PBNode.Data).hashType`.
2. Pop the `log2(fanout)` lowest bits from the path component hash digest, then hex encode (using 0-F) those bits using little endian. Find the link whose `Name` starts with this hex-encoded prefix.
3. If the link's `Name` is exactly as long as the hex-encoded representation, follow the link and repeat step 2 with the child node and the remaining bit stack; the child node MUST be a HAMT directory, or else the directory is invalid. Otherwise, continue.
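The prefix computation in step 2 can be sketched as follows. This is illustrative only: it treats the digest as an integer and consumes its lowest bits first; a conforming implementation must use murmur3-x64-64 and match the spec's exact bit ordering.

```python
import math

def bucket_prefix(digest, fanout=256):
    """Pop log2(fanout) bits from the digest and hex-encode them (0-F, uppercase).

    Returns the prefix and the remaining digest bits for the next step.
    """
    bits = int(math.log2(fanout))
    width = (bits + 3) // 4                # hex characters needed for `bits` bits
    bucket = digest & (fanout - 1)         # lowest log2(fanout) bits
    return format(bucket, "0{}X".format(width)), digest >> bits
```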
### Tsize (child DAG size hint)

`Tsize` is an optional field in `PBNode.Links[]` which represents the precomputed size of the child DAG it points to. It provides a performance optimization: a hint about the total size of a child DAG that can be read without having to fetch any child nodes.
To compute the `Tsize` of a child DAG, sum the binary length of the outer dag-pb message with the block sizes of all nodes in the child DAG.
Examples of where Tsize
is useful:
An implementation SHOULD NOT assume that `Tsize` values are correct. The value is only a hint, providing a performance optimization for better UX.
Following the Robustness Principle, implementations SHOULD be able to decode nodes where the `Tsize` field is wrong (not matching the sizes of sub-DAGs), or partially or completely missing.
When total data size is needed for important purposes such as accounting, billing, and cost estimation, `Tsize` SHOULD NOT be used; instead, a full DAG walk SHOULD be performed.
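Such a full DAG walk can be sketched as follows, over a hypothetical in-memory map from CID to `(block_bytes, child_cids)`; a real implementation would fetch blocks from a blockstore or the network:

```python
def walked_size(store, cid):
    """Sum the serialized block lengths of the entire DAG rooted at `cid`.

    `store` maps a CID to a (block_bytes, child_cids) tuple.
    """
    block, children = store[cid]
    return len(block) + sum(walked_size(store, child) for child in children)
```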
UnixFS currently supports two optional metadata fields.
### mode

The `mode` is for persisting the file permissions in numeric notation [spec].

- If unspecified, this field defaults to `0755` for directories/HAMT shards and `0644` for all other types where applicable.
- The nine least significant bits represent `ugo-rwx` permissions.
- The next three bits represent the `setuid`, `setgid` and sticky bits.
- The remaining bits are reserved; implementations MUST preserve them when cloning a value and mask them out when interpreting one:

```
clonedValue = ( modifiedBits & 07777 ) | ( originalValue & 0xFFFFF000 )
interpretedValue = originalValue & 07777
```
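The two formulas above, written out in Python (where octal literals use the `0o` prefix):

```python
def interpreted_mode(original_value):
    # Only the low 12 bits (permissions plus setuid/setgid/sticky) are interpreted.
    return original_value & 0o7777

def cloned_mode(modified_bits, original_value):
    # Take the new low 12 bits, preserving the reserved upper bits of the original.
    return (modified_bits & 0o7777) | (original_value & 0xFFFFF000)
```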
### mtime

A two-element structure (`Seconds`, `FractionalNanoseconds`) representing the modification time in seconds relative to the unix epoch `1970-01-01T00:00:00Z`.

The two fields are:
- `Seconds` (always present, signed 64-bit integer): represents the number of seconds after or before the epoch.
- `FractionalNanoseconds` (optional, 32-bit unsigned integer): when specified, represents the fractional part of the `mtime` as a number of nanoseconds. The valid range for this value is the integers `[1, 999999999]`.

Implementations encoding or decoding wire representations MUST observe the following:
- An `mtime` structure with `FractionalNanoseconds` outside of the on-wire range `[1, 999999999]` is not valid. This includes a fractional value of `0`. Implementations encountering such values should consider the entire enclosing metadata block malformed and abort processing of the corresponding DAG.
- The `mtime` structure is optional; its absence implies *unspecified* rather than `0`.
- For ergonomic reasons, implementations MUST accept a fractional value of `0` as input, while at the same time they MUST ensure it is stripped from the final structure before encoding, satisfying the above constraints.

Implementations interpreting the `mtime`
metadata in order to apply it within a
non-IPFS target MUST observe the following:
- Given the conceptual difference between the value *unspecified* and `0`/`1970-01-01T00:00:00Z`, the distinction must be preserved within the target. For example, if no `mtime` structure is available, a web gateway must not render a `Last-Modified:` header.
- When a target requires an `mtime` (e.g. a FUSE interface) and no `mtime` is supplied, OR the supplied `mtime` falls outside of the target's accepted range:
  - When no `mtime` is specified or the resulting `UnixTime` is negative, implementations must assume `0`/`1970-01-01T00:00:00Z` (note that such values are not merely academic: e.g. the OpenVMS epoch is `1858-11-17T00:00:00Z`).
  - When the resulting `UnixTime` is larger than the target's range (e.g. a 32-bit vs 64-bit mismatch), implementations must assume the highest possible value in the target's range; in most cases, this would be `2038-01-19T03:14:07Z`.
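The interpretation rules above amount to clamping into the target's accepted range. A sketch for a hypothetical target with a signed 32-bit time range:

```python
def mtime_for_target(seconds, t_min=0, t_max=2**31 - 1):
    """Map a UnixFS mtime (None = unspecified) into a target that requires one."""
    if seconds is None or seconds < t_min:
        return t_min            # 0, i.e. 1970-01-01T00:00:00Z
    if seconds > t_max:
        return t_max            # 2^31 - 1, i.e. 2038-01-19T03:14:07Z for 32-bit targets
    return seconds
```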
Paths begin with `<CID>/` or `/ipfs/<CID>/`, where `<CID>` is a [multibase]-encoded CID. The CID encoding MUST NOT use a multibase alphabet that contains the `/` (`0x2F`) unicode codepoint. However, CIDs may use a multibase encoding with a `/` in the alphabet if the encoded CID does not contain `/` once encoded.
Everything following the CID is a collection of path components (some bytes)
separated by /
(0x2F
). UnixFS paths read from left to right, and are
inspired by POSIX paths.
- Path components MUST NOT contain `/` (`0x2F`) unicode codepoints, because that would break the path into two components.
- The `\` character may be used to trigger an escape sequence. However, escaping is currently broken and inconsistent across implementations. Until we agree on a specification for this, you SHOULD NOT use any escape sequences and/or non-ASCII characters.
Relative path components MUST be resolved before trying to work on the path:

- `.` points to the current node and MUST be removed.
- `..` points to the parent node and MUST be removed left to right. When removing a `..`, the path component on its left MUST also be removed. If there is no path component on the left, you MUST error to avoid out-of-bounds path resolution.

The following names SHOULD NOT be used:
- The `.` string, as it represents the self node in POSIX pathing.
- The `..` string, as it represents the parent node in POSIX pathing.
- The NULL (`0x00`) byte, as it is often used to signify string termination in C-compatible systems; many unix file systems do not accept this character in path components.

## mtime and mode Metadata Support in UnixFSv1.5

Metadata support in UnixFSv1.5 has been expanded to increase the number of possible use cases. These include `rsync` and filesystem-based package managers.
Several metadata systems were evaluated, as discussed in the following sections.
In this scheme, the existing Metadata
message is expanded to include additional
metadata types (mtime
, mode
, etc). It contains links to the actual file data,
but never the file data itself.
This was ultimately rejected for a number of reasons:

1. You would always need to retrieve an additional node to access file data, which limits possible optimizations; for example, small files are often inlined into the describing UnixFS `File` node. This would not be possible with an intermediate `Metadata` node.
2. The `File` node already contains some metadata (e.g. the file size), so metadata would be stored in multiple places. This complicates forwards compatibility with UnixFSv2, as mapping between metadata formats potentially requires multiple fetch operations.

Repeated `Metadata`
messages are added to UnixFS Directory
and HAMTShard
nodes,
the index of which indicates which entry they are to be applied to. Where entries are
HAMTShard
s, an empty message is added.
One advantage of this method is that, if we expand stored metadata to include entry
types and sizes, we can perform directory listings without needing to fetch further
entry nodes (excepting HAMTShard
nodes). However, without removing the storage of
this data elsewhere in the spec, we run the risk of having non-canonical data locations, and perhaps conflicting data, as we traverse trees containing both UnixFS v1 and v1.5 nodes.
This was rejected for the following reasons:
This adds new fields to the UnixFS Data
message to represent the various metadata fields.
It has the advantage of being simple to implement. Metadata is maintained whether the file is accessed directly via its CID or via an IPFS path that includes a containing directory. In addition, metadata is kept small enough that we can inline root UnixFS nodes into their CIDs so that we can end up fetching the same number of nodes if we decide to keep file data in a leaf node for deduplication reasons.
Downsides to this approach are:

1. Two users adding the same file will end up with different CIDs, e.g. due to the `mtime`s being different. If the content is stored in another node, its CID will be constant between the two users, but you can't navigate to it unless you have the parent node, which will be less available due to the proliferation of CIDs.

With this approach, we would maintain a separate data structure outside of the UnixFS tree to hold metadata.
This was rejected due to concerns about added complexity, recovery after system crashes while writing, and having to make extra requests to fetch metadata nodes when resolving CIDs from peers.
This scheme would see metadata stored in an external database. The downside is that metadata would not be transferred from one node to another when syncing, as Bitswap is not aware of the database, unlike in-tree metadata.
The integer portion of UnixTime is represented on the wire using a varint
encoding.
While this is inefficient for negative values, it avoids introducing zig-zag encoding.
Values before the year 1970 are exceedingly rare, and it is handy to have such cases stand out, while ensuring that the "usual" positive values remain easily readable. The varint representing the time of writing this text is 5 bytes long. It will remain so until October 26, 3058 (34,359,738,367).
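A protobuf varint carries 7 payload bits per byte, so the byte-length claim can be checked directly: 5 bytes cover values up to 2^35 - 1 = 34,359,738,367. The encoder below is a standard unsigned varint, shown for illustration:

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf varint (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))  # high bit set on all but the last byte
        if not n:
            return bytes(out)
```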
Fractional values are effectively a random number in the range 1 to 999,999,999.
In most cases, such values will exceed 2^28 (268,435,456) nanoseconds. Therefore,
the fractional part is represented as a 4-byte fixed32
,
as per Google's recommendation.
This section and included subsections are not authoritative.
- `@helia/unixfs`: implementation of a filesystem compatible with the Helia SDK
- `ipfs/boxo/../unixfs.proto`
- `boxo/files`
- `ipfs/boxo/ipld/unixfs`
- `go-ipld-prime` implementation: `ipfs/go-unixfsnode`
## Raw Example

In this example, we will build a `raw` file with the string `test` as its content.
First, hash the data:

```shell
$ echo -n "test" | sha256sum
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  -
```
Add the CID prefix:

```
f01551220
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```

- `f` is the multibase prefix; we need it because we are working with a hex CID. It is omitted for binary CIDs.
- `01` is the CID version, here 1.
- `55` is the codec; here we MUST use `raw` because this is a raw file.
- `12` is the hashing function used, here sha256.
- `20` is the digest length, 32 bytes.
- `9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08` is the digest we computed earlier.
Done. Assuming we stored this block in some implementation of our choice that makes it accessible to our client, we can try to decode it:

```shell
$ ipfs cat f015512209f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
test
```
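The same construction can be reproduced in Python using only the standard library; the byte values are exactly the ones enumerated above:

```python
import hashlib

data = b"test"
digest = hashlib.sha256(data).hexdigest()

# f (hex multibase) + 01 (CIDv1) + 55 (raw codec) + 12 (sha2-256) + 20 (32-byte digest)
cid = "f" + "01" + "55" + "12" + "20" + digest
print(cid)
# prints f015512209f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```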
The offset list isn't the only way to use `blocksizes` and reach a correct implementation, but it is a simple, canonical one. Python pseudocode to compute it looks like this:

```python
def offsetlist(node):
    unixfs = decodeDataField(node.Data)  # decode the UnixFSV1 protobuf
    if len(node.Links) != len(unixfs.Blocksizes):
        raise ValueError("unmatched sister-lists")  # error messages are implementation details
    # the node's own Data comes before any linked content
    cursor = len(unixfs.Data) if unixfs.Data else 0
    return [cursor] + [cursor := cursor + size for size in unixfs.Blocksizes[:-1]]
```
This tells you the offset inside this node at which the child at the corresponding index starts (using `[x, y)` ranges).
We gratefully acknowledge the following individuals for their valuable contributions, ranging from minor suggestions to major insights, which have shaped and improved this specification.