Data Structures

Data Structures Overview

fig: Block data structures.

Type Aliases

name | type
Amount | uint64
Graffiti | byte[MAX_GRAFFITI_BYTES]
HashDigest | byte[32]
Height | int64
Nonce | uint64
Round | int32
StateSubtreeID | byte
Timestamp | google.protobuf.Timestamp
VotingPower | uint64

Blockchain Data Structures

Block

Blocks are the top-level data structure of the Celestia blockchain.

name | type | description
header | Header | Block header. Contains primarily identification info and commitments.
availableDataHeader | AvailableDataHeader | Header of available data. Contains commitments to erasure-coded data.
availableData | AvailableData | Data that is erasure-coded for availability.
lastCommit | Commit | Previous block's Tendermint commit.

Header

Block header, which is fully downloaded by both full clients and light clients.

name | type | description
version | ConsensusVersion | The consensus version struct.
chainID | string | The CHAIN_ID.
height | Height | Block height. The genesis block is at height 1.
timestamp | Timestamp | Timestamp of this block.
lastHeaderHash | HashDigest | Previous block's header hash.
lastCommitHash | HashDigest | Previous block's Tendermint commit hash.
consensusHash | HashDigest | Hash of consensus parameters for this block.
AppHash | HashDigest | The state root after the previous block's transactions are applied.
availableDataOriginalSharesUsed | uint64 | The number of shares used in the original data square that are not tail padding.
availableDataRoot | HashDigest | Root of commitments to erasure-coded data.
proposerAddress | Address | Address of this block's proposer.

The size of the original data square, availableDataOriginalSquareSize, isn't explicitly declared in the block header. Instead, it is implicitly computed as the smallest power of 2 whose square is at least availableDataOriginalSharesUsed (in other words, its square is the smallest power of 4 that is at least availableDataOriginalSharesUsed).
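
For illustration, the implied square size could be computed as in the following Go sketch (the function name is illustrative, not part of the spec):

// computeOriginalSquareSize returns the smallest power of 2 whose square is
// at least sharesUsed. sharesUsed == 0 yields 1, the minimum square size.
func computeOriginalSquareSize(sharesUsed uint64) uint64 {
    k := uint64(1)
    for k*k < sharesUsed {
        k *= 2
    }
    return k
}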

The header hash is the hash of the serialized header.

AvailableDataHeader

name | type | description
rowRoots | HashDigest[] | Commitments to all erasure-coded data.
colRoots | HashDigest[] | Commitments to all erasure-coded data.

The row and column roots commit to the rows and columns of the erasure-coded data square for this block. The availableDataRoot of the header is computed using the compact row and column roots, as described in the 2D Reed-Solomon encoding scheme section below.

The number of row and column roots is each availableDataOriginalSquareSize * 2, and must be a power of 2. Note that the minimum availableDataOriginalSquareSize is 1 (not 0), therefore the number of row and column roots are each at least 2.

Implementations can prune rows containing only tail padding as they are implicitly available.

AvailableData

Data that is erasure-coded for data availability checks.

name | type | description
transactions | Transaction | Transactions are ordinary Cosmos SDK transactions. For example: they may modify the validator set and token balances.
payForBlobData | PayForBlobData | PayForBlob data. Transactions that pay for blobs to be included.
blobData | BlobData | Blob data is arbitrary user submitted data that will be published to the Celestia blockchain.

Commit

name | type | description
height | Height | Block height.
round | Round | Round. Incremented on view change.
headerHash | HashDigest | Header hash of the previous block.
signatures | CommitSig[] | List of signatures.

Timestamp

Timestamp is a type alias.

Celestia uses google.protobuf.Timestamp to represent time.

HashDigest

HashDigest is a type alias.

Output of the hashing function. Exactly 256 bits (32 bytes) long.

TransactionFee

name | type | description
tipRate | uint64 | The tip rate for this transaction.

Abstraction over transaction fees.

Address

Celestia supports secp256k1 keys where addresses are 20 bytes in length.

name | type | description
AccAddress | [20]byte | AccAddress is a wrapper around bytes meant to represent an account address.

CommitSig

enum CommitFlag : uint8_t {
    CommitFlagAbsent = 1,
    CommitFlagCommit = 2,
    CommitFlagNil = 3,
};
name | type | description
commitFlag | CommitFlag |
validatorAddress | Address |
timestamp | Timestamp |
signature | Signature |

Signature

name | type | description
r | byte[32] | r value of the signature.
s | byte[32] | s value of the signature.

ConsensusVersion

name | type | description
block | uint64 | The VERSION_BLOCK.
app | uint64 | The app version.

Serialization

Objects that are committed to or signed over require a canonical serialization. This is done using a deterministic (and thus, bijective) variant of protobuf defined here.
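
For illustration only, Go's protobuf runtime exposes a deterministic marshaling option; this is a sketch of how an implementation might request deterministic output (it fixes map ordering within a single library version), not the canonical encoding defined for the spec:

import "google.golang.org/protobuf/proto"

// marshalDeterministic serializes a protobuf message with deterministic
// marshaling options enabled.
func marshalDeterministic(msg proto.Message) ([]byte, error) {
    return proto.MarshalOptions{Deterministic: true}.Marshal(msg)
}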

Note: there are two requirements for a serialization scheme, should this need to be changed:

  1. Must be bijective.
  2. Serialization must include the length of dynamic structures (e.g. arrays with variable length).

Hashing

All protocol-level hashing is done using SHA-2-256 as defined in FIPS 180-4. SHA-2-256 outputs a digest that is 256 bits (i.e. 32 bytes) long.

Libraries implementing SHA-2-256 are available in Go (https://pkg.go.dev/crypto/sha256) and Rust (https://docs.rs/sha2).

Unless otherwise indicated explicitly, objects are first serialized before being hashed.

Merkle Trees

Merkle trees are used to authenticate various pieces of data across the Celestia stack, including transactions, blobs, the validator set, etc. This section provides an overview of the different tree types used, and specifies how to construct them.

Binary Merkle Tree

Binary Merkle trees are constructed in the same fashion as described in Certificate Transparency (RFC-6962), except for using a different hashing function. Leaves are hashed once to get leaf node values and internal node values are the hash of the concatenation of their children (either leaf nodes or other internal nodes).

Nodes contain a single field:

name | type | description
v | HashDigest | Node value.

The base case (an empty tree) is defined as the hash of the empty string:

node.v = 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

For leaf node node of leaf data d:

node.v = h(0x00, serialize(d))

For internal node node with children l and r:

node.v = h(0x01, l.v, r.v)

Note that rather than duplicating the last node if there are an odd number of nodes (the Bitcoin design), trees are allowed to be imbalanced. In other words, the height of each leaf may be different. For an example, see Section 2.1.3 of Certificate Transparency (RFC-6962).

Leaves and internal nodes are hashed differently: the one-byte 0x00 is prepended for leaf nodes while 0x01 is prepended for internal nodes. This avoids a second-preimage attack where an internal node is presented as a leaf, which would otherwise be possible since trees are allowed to have leaves at different heights.
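
As a concrete sketch of the hashing rules above, using Go's crypto/sha256 (function names are illustrative):

import "crypto/sha256"

// leafHash hashes a serialized leaf with the 0x00 domain-separation prefix.
func leafHash(serializedLeaf []byte) [32]byte {
    return sha256.Sum256(append([]byte{0x00}, serializedLeaf...))
}

// innerHash hashes the concatenation of two child node values with the 0x01
// domain-separation prefix.
func innerHash(left, right [32]byte) [32]byte {
    data := append([]byte{0x01}, left[:]...)
    data = append(data, right[:]...)
    return sha256.Sum256(data)
}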

BinaryMerkleTreeInclusionProof

name | type | description
siblings | HashDigest[] | Sibling hash values, ordered starting from the leaf's neighbor.

A proof for a leaf in a binary Merkle tree, as per Section 2.1.1 of Certificate Transparency (RFC-6962).
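
As a sketch of how such a proof could be verified, following the audit-path algorithm of RFC 6962 and reusing leafHash and innerHash from the sketch above (this assumes the verifier also knows the leaf's index and the total number of leaves; names are illustrative):

// verifyBinaryMerkleProof recomputes the root from a leaf and its sibling
// hashes and compares it to the expected root.
func verifyBinaryMerkleProof(leaf []byte, leafIndex, numLeaves uint64, siblings [][32]byte, root [32]byte) bool {
    if leafIndex >= numLeaves {
        return false
    }
    fn, sn := leafIndex, numLeaves-1
    h := leafHash(leaf)
    for _, sib := range siblings {
        if sn == 0 {
            return false // more siblings than the tree height allows
        }
        if fn%2 == 1 || fn == sn {
            h = innerHash(sib, h)
            // Skip levels where this subtree is the last, unpaired node.
            for fn%2 == 0 && fn != 0 {
                fn >>= 1
                sn >>= 1
            }
        } else {
            h = innerHash(h, sib)
        }
        fn >>= 1
        sn >>= 1
    }
    return sn == 0 && h == root
}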

Namespace Merkle Tree

Shares in Celestia are associated with a provided namespace. The Namespace Merkle Tree (NMT) is a variation of the Merkle Interval Tree, which is itself an extension of the Merkle Sum Tree. It allows for compact proofs around the inclusion or exclusion of shares with particular namespace IDs.

Nodes contain three fields:

name | type | description
n_min | Namespace | Min namespace in subtree rooted at this node.
n_max | Namespace | Max namespace in subtree rooted at this node.
v | HashDigest | Node value.

The base case (an empty tree) is defined as:

node.n_min = 0x0000000000000000
node.n_max = 0x0000000000000000
node.v = 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

For leaf node node of share data d:

node.n_min = d.namespace
node.n_max = d.namespace
node.v = h(0x00, d.namespace, d.rawData)

The d.namespace field here is the namespace of the leaf, which is a NAMESPACE_SIZE-long byte array.

Leaves in an NMT must be lexicographically sorted by namespace in ascending order.

For internal node node with children l and r:

node.n_min = min(l.n_min, r.n_min)
if l.n_min == PARITY_SHARE_NAMESPACE
  node.n_max = PARITY_SHARE_NAMESPACE
else if r.n_min == PARITY_SHARE_NAMESPACE
  node.n_max = l.n_max
else
  node.n_max = max(l.n_max, r.n_max)
node.v = h(0x01, l.n_min, l.n_max, l.v, r.n_min, r.n_max, r.v)

Note that the above snippet leverages the property that leaves are sorted by namespace: if l.n_min is PARITY_SHARE_NAMESPACE, then so are l.n_max, r.n_min, and r.n_max. By construction, either both the min and max namespace of a node are PARITY_SHARE_NAMESPACE, or neither is: if r.n_min is PARITY_SHARE_NAMESPACE, so is r.n_max.

For some intuition: the min and max namespace for subtree roots with at least one non-parity leaf (which includes the root of an NMT, as the right half of an NMT as used in Celestia will be parity shares) ignore the namespace ID for the parity leaves. Subtree roots with only parity leaves have their min and max namespace ID set to PARITY_SHARE_NAMESPACE. This allows for shorter proofs into the tree than if the namespace ID of parity shares was not ignored (which would cause the max namespace ID of the root to always be PARITY_SHARE_NAMESPACE).
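
A Go sketch of the leaf and internal-node rules above (the Namespace type, the parity namespace value, and all helper names are illustrative; hashing follows the same crypto/sha256 usage as the earlier binary Merkle tree sketch):

import (
    "bytes"
    "crypto/sha256"
)

// Namespace is an illustrative stand-in for the spec's NAMESPACE_SIZE-long
// namespace type.
type Namespace []byte

type nmtNode struct {
    nMin, nMax Namespace
    v          [32]byte
}

// nmtLeaf builds a leaf node: both namespace bounds are the share's
// namespace, and the value hashes the 0x00 prefix, namespace, and raw data.
func nmtLeaf(ns Namespace, rawData []byte) nmtNode {
    data := append([]byte{0x00}, ns...)
    data = append(data, rawData...)
    return nmtNode{nMin: ns, nMax: ns, v: sha256.Sum256(data)}
}

// nmtInner combines two children, ignoring the parity namespace for n_max as
// described above, and hashes with the 0x01 prefix.
func nmtInner(l, r nmtNode, parityNS Namespace) nmtNode {
    var n nmtNode
    n.nMin = minNamespace(l.nMin, r.nMin)
    switch {
    case bytes.Equal(l.nMin, parityNS):
        n.nMax = parityNS
    case bytes.Equal(r.nMin, parityNS):
        n.nMax = l.nMax
    default:
        n.nMax = maxNamespace(l.nMax, r.nMax)
    }
    data := []byte{0x01}
    data = append(data, l.nMin...)
    data = append(data, l.nMax...)
    data = append(data, l.v[:]...)
    data = append(data, r.nMin...)
    data = append(data, r.nMax...)
    data = append(data, r.v[:]...)
    n.v = sha256.Sum256(data)
    return n
}

func minNamespace(a, b Namespace) Namespace {
    if bytes.Compare(a, b) <= 0 {
        return a
    }
    return b
}

func maxNamespace(a, b Namespace) Namespace {
    if bytes.Compare(a, b) >= 0 {
        return a
    }
    return b
}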

A compact commitment can be computed by taking the hash of the serialized root node.

NamespaceMerkleTreeInclusionProof

name | type | description
siblingValues | HashDigest[] | Sibling hash values, ordered starting from the leaf's neighbor.
siblingMins | Namespace[] | Sibling min namespace IDs.
siblingMaxes | Namespace[] | Sibling max namespace IDs.

When verifying an NMT proof, the root hash is checked by reconstructing the root node root_node with the computed root_node.v (computed as with a plain Merkle proof) and the provided rootNamespaceMin and rootNamespaceMax as the root_node.n_min and root_node.n_max, respectively.

Erasure Coding

In order to enable trust-minimized light clients (i.e. light clients that do not rely on an honest-majority assumption for state validity), it is critical that light clients can determine whether the data in each block is available or not, without downloading the whole block itself. The technique used here was formally described in the paper Fraud and Data Availability Proofs: Maximising Light Client Security and Scaling Blockchains with Dishonest Majorities.

The remainder of the subsections below specify the 2D Reed-Solomon erasure coding scheme used, along with the format of shares and how available data is arranged into shares.

Reed-Solomon Erasure Coding

Note that while data is laid out in a two-dimensional square, rows and columns are erasure coded using a standard one-dimensional encoding.

Reed-Solomon erasure coding is used as the underlying coding scheme. Each row and column of availableDataOriginalSquareSize original shares (each SHARE_SIZE bytes) is extended with availableDataOriginalSquareSize parity shares, doubling its length.

Note that availableDataOriginalSquareSize may vary each block, and is decided by the block proposer of that block. Leopard-RS is a C library that implements the above scheme with quasilinear runtime.

2D Reed-Solomon Encoding Scheme

The 2-dimensional data layout is described in this section. The roots of NMTs for each row and column across four quadrants of data in a 2k * 2k matrix of shares, Q0 to Q3 (shown below), must be computed. In other words, 2k row roots and 2k column roots must be computed. The row and column roots are stored in the rowRoots and colRoots fields of the AvailableDataHeader.

fig: RS2D encoding: data quadrants.

The data of Q0 is the original data, and the remaining quadrants are parity data. Setting k = availableDataOriginalSquareSize, the original data first must be split into shares and arranged into a k * k matrix. Then the parity data can be computed.

Where A -> B indicates that B is computed using erasure coding from A:

  • Q0 -> Q1 for each row in Q0 and Q1
  • Q0 -> Q2 for each column in Q0 and Q2
  • Q2 -> Q3 for each row in Q2 and Q3

Note that the parity data in Q3 will be identical if it is vertically extended from Q1 or horizontally extended from Q2.
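
A sketch of this extension order, assuming a hypothetical rsEncode helper that Reed-Solomon-encodes k original shares into k parity shares (the square is represented as a 2k * 2k matrix of shares with Q0 in its top-left k * k corner; all names are illustrative):

// rsEncode is a hypothetical stand-in for a Reed-Solomon encoder: k shares
// in, k parity shares out.
var rsEncode func(shares [][]byte) [][]byte

// extendSquare fills Q1, Q2, and Q3 of a 2k * 2k share matrix from the
// original data in Q0.
func extendSquare(square [][][]byte, k int) {
    // Q0 -> Q1: extend each of the first k rows horizontally.
    for i := 0; i < k; i++ {
        copy(square[i][k:], rsEncode(square[i][:k]))
    }
    // Q0 -> Q2: extend each of the first k columns vertically.
    for j := 0; j < k; j++ {
        col := make([][]byte, k)
        for i := 0; i < k; i++ {
            col[i] = square[i][j]
        }
        parity := rsEncode(col)
        for i := 0; i < k; i++ {
            square[k+i][j] = parity[i]
        }
    }
    // Q2 -> Q3: extend each of the last k rows horizontally.
    for i := k; i < 2*k; i++ {
        copy(square[i][k:], rsEncode(square[i][:k]))
    }
}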

fig: RS2D encoding: extending data.

As an example, the parity data in the second column of Q2 (in striped purple) is computed by extending the original data in the second column of Q0 (in solid blue).

fig: RS2D encoding: extending a column.

Now that all four quadrants of the 2k * 2k matrix are filled, the row and column roots can be computed. To do so, each row/column is used as the leaves of an NMT, for which the compact root is computed (i.e. an extra hash operation over the NMT root is used to produce a single HashDigest). In this example, the fourth row root value is computed as the NMT root of the fourth row of Q0 and the fourth row of Q1 as leaves.

fig: RS2D encoding: a row root.

Finally, the availableDataRoot of the block Header is computed as the Merkle root of the binary Merkle tree with the row and column roots as leaves, in that order.
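
A sketch of this final step, reusing leafHash and innerHash (and crypto/sha256) from the binary Merkle tree sketch above; merkleRoot follows the RFC 6962 construction and all names are illustrative:

// merkleRoot computes the binary Merkle root of the given leaves: the split
// point is the largest power of two smaller than the number of leaves.
func merkleRoot(leaves [][]byte) [32]byte {
    switch len(leaves) {
    case 0:
        return sha256.Sum256(nil) // empty tree: hash of the empty string
    case 1:
        return leafHash(leaves[0])
    }
    k := 1
    for k*2 < len(leaves) {
        k *= 2
    }
    return innerHash(merkleRoot(leaves[:k]), merkleRoot(leaves[k:]))
}

// availableDataRoot commits to the row roots followed by the column roots.
func availableDataRoot(rowRoots, colRoots [][]byte) [32]byte {
    leaves := append(append([][]byte{}, rowRoots...), colRoots...)
    return merkleRoot(leaves)
}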

fig: Available data root.

Arranging Available Data Into Shares

The previous sections described how some original data, arranged into a k * k matrix, can be extended into a 2k * 2k matrix and committed to with NMT roots. This section specifies how available data (which includes transactions, PayForBlob transactions, and blobs) is arranged into the matrix in the first place.

Note that each share only has a single namespace, and that the list of concatenated shares is lexicographically ordered by namespace.

Then,

  1. For each of transactionData, intermediateStateRootData, PayForBlob transactions, serialize:
    1. For each request in the list:
      1. Serialize the request (individually).
      2. Compute the length of each serialized request, serialize the length, and prepend the serialized request with its serialized length.
    2. Split up the length/request pairs into (SHARE_SIZE - NAMESPACE_SIZE - SHARE_RESERVED_BYTES)-byte chunks.
    3. Create a share out of each chunk. This data has a reserved namespace ID, so the first NAMESPACE_SIZE+SHARE_RESERVED_BYTES bytes for these shares must be set specially.
  2. Concatenate the lists of shares in the order: transactions, intermediate state roots, PayForBlob transactions.

These shares are arranged in the first quadrant (Q0) of the (2 * availableDataOriginalSquareSize) * (2 * availableDataOriginalSquareSize) available data matrix in row-major order. In the example below, each reserved data element takes up exactly one share.

fig: Original data: reserved.

Each blob in the list blobData:

  1. Serialize the blob (individually).
  2. Compute the length of each serialized blob, serialize the length, and prepend the serialized blob with its serialized length.
  3. Split up the length/blob pairs into (SHARE_SIZE - NAMESPACE_SIZE)-byte chunks.
  4. Create a share out of each chunk. The first NAMESPACE_SIZE bytes for these shares is set to the namespace.
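
A Go sketch of the steps above (the uvarint length prefix and the zero padding of the final chunk are assumptions for illustration, not normative; names are illustrative):

import "encoding/binary"

// blobToShares length-prefixes a serialized blob, splits it into
// (SHARE_SIZE - NAMESPACE_SIZE)-byte chunks, and prepends the namespace to
// each chunk.
func blobToShares(namespaceID, serializedBlob []byte, shareSize, namespaceSize int) [][]byte {
    data := binary.AppendUvarint(nil, uint64(len(serializedBlob)))
    data = append(data, serializedBlob...)

    chunkSize := shareSize - namespaceSize
    var shares [][]byte
    for start := 0; start < len(data); start += chunkSize {
        end := start + chunkSize
        if end > len(data) {
            end = len(data)
        }
        chunk := make([]byte, chunkSize) // final chunk is zero-padded
        copy(chunk, data[start:end])
        share := append(append([]byte{}, namespaceID...), chunk...)
        shares = append(shares, share)
    }
    return shares
}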

Each blob is placed in the available data matrix, in row-major order, as follows:

  1. Place the first share of the blob at the next unused location in the matrix, then place the remaining shares in the following locations.

Transactions must commit to a Merkle root of a list of hashes that are each guaranteed (assuming the block is valid) to be subtree roots in one or more of the row NMTs. For additional info, see the rationale document for this section.

However, with only the rule above, interaction between the block producer and transaction sender may be required to compute a commitment to the blob the transaction sender can sign over. To remove interaction, blobs can optionally be laid out using a non-interactive default:

  1. Place the first share of the blob at the next unused location in the matrix whose column is aligned with the largest power of 2 that is not larger than the blob length or availableDataOriginalSquareSize, then place the remaining shares in the following locations unless there are insufficient unused locations in the row.
  2. If there are insufficient unused locations in the row, place the first share of the blob at the first column of the next row. Then place the remaining shares in the following locations. By construction, any blob whose length is greater than availableDataOriginalSquareSize will be placed in this way.
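
The column alignment in step 1 above could be computed as in this sketch (names are illustrative):

// nonInteractiveAlignment returns the largest power of 2 that is not larger
// than the blob length or the original square size; the blob's first share
// is placed at the next unused column that is a multiple of this value.
func nonInteractiveAlignment(blobLen, squareSize uint64) uint64 {
    limit := blobLen
    if squareSize < limit {
        limit = squareSize
    }
    align := uint64(1)
    for align*2 <= limit {
        align *= 2
    }
    return align
}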

In the example below, two blobs (of lengths 2 and 1, respectively) are placed using the aforementioned default non-interactive rules.

fig: Original data: blob.

The blob share commitment rules may introduce empty shares that do not belong to any blob (in the example above, the top-right share is empty). These are zeroes with a namespace ID equal to either TAIL_TRANSACTION_PADDING_NAMESPACE_ID if between a request with a reserved namespace ID and a blob, or the namespace ID of the previous blob if succeeded by a blob. See the rationale doc for more info.

Available Data

Transaction

Celestia transactions are Cosmos SDK transactions.

PayForBlobData

IndexWrapper

IndexWrappers are wrappers around PayForBlob transactions. They include additional metadata, added by the block proposer, that is committed to in the available data matrix.

name | type | description
tx | bytes | Actual transaction.
share_indexes | []uint32 | Share indexes (in row-major order) of the first share for each blob this transaction pays for. Needed for light verification of proper blob inclusion.
type_id | string | Type ID of the IndexWrapper transaction type. This is used for encoding and decoding IndexWrapper transactions. It is always set to "INDX".

BlobData

name | type | description
blobs | Blob[] | List of blobs.

Blob

name | type | description
namespaceID | NamespaceID | Namespace ID of this blob.
rawData | byte[] | Raw blob bytes.

State

The state of the Celestia chain is intentionally restricted to containing only account balances and the validator set metadata. Similar to other Cosmos SDK based chains, the state of the Celestia chain is maintained in a multistore. The root of the application state is committed to in the block header via the AppHash.

Consensus Parameters

Various consensus parameters are committed to in the block header, such as limits and constants.

name | type | description
version | ConsensusVersion | The consensus version struct.
chainID | string | The CHAIN_ID.
shareSize | uint64 | The SHARE_SIZE.
shareReservedBytes | uint64 | The SHARE_RESERVED_BYTES.
availableDataOriginalSquareMax | uint64 | The AVAILABLE_DATA_ORIGINAL_SQUARE_MAX.

In order to compute the consensusHash field in the block header, the above list of parameters is hashed.