The lihata format

0. introduction

Lihata is a simple list-hash-table language. There are a lot of languages out there already, with different strengths and drawbacks. It is hard to find the balance between the following properties:

compactness
transparency (easy to read and understand)
representation of complex data structures (ordered vs. unordered lists, key=value pairs, tables)

Lihata is yet another mix of features trying to find a different balance. For a comparison of lihata and other languages, check out the design decision document , which also defines the design goals .

The current paper attempts to describe the language, not the design decisions behind it; however, a link to the appropriate design decision section will always be provided in the following form: []

1. Tree, syntax

Lihata represents a tree of nodes. Each node has the following properties:

type (sometimes implicit)
name (optional, anonymous nodes are allowed)
content

The following types are defined:

text (default); prefix: te or no prefix
list (ordered list of children nodes); prefix: li
hash (unordered list of children nodes hashed by unique node names); prefix: ha
table (a matrix or children nodes); prefix: ta
symlink (a single text field describing which other node shall be used when this node is referenced)

Each type is identified by its two-letter prefix [].

The canonical format of a node specification is:

{type:name}={content};

There are some simplifications [], tho:

The "type:" part is optional; when not specified, text ("te:") is implied in most contexts [], except for a table, where list (as a row of table) is implied.
Even the name is optional; when not specified, the name is the empty string, which can be used as sort of anonymous node names [], even in a hash.
The equal sign is optional [].
Simple text nodes may be specified without the braces []. The rule of thumb here is: punctuation marks it needs protection, else it does not (even if it contains spaces, tabs or underscores).
The semicolon at the end is a separator (discussed in the next section).
type:name is just a text, the same rules apply as for the content (e.g. spaces are allowed).
But it is uncommon to use braces around type:name and users seldom chose names that contain characters that need escaping.

The shortest valid node description is the anonymous empty text node:

{};

Any combination between the shortest and the canonical format is accepted; a few examples:

my_text1={blah blah};
my_text2=blah blah;
li: = {
	item1;
	item2
	item3;
	item4;item5;
};

The above example declares two text nodes with the same content, an anonymous list with 5 anonymous text items.

White spaces between tokens are ignored by the parser, except for newline characters which are interpreted as separators. In practice this means unprotected text fields (including node names) may contain spaces, but leading and trailing whitespace will be stripped. Backslash or brace protection can be used when leading/trailing spaces should be kept.

1.1. Separator

Any two node specification is separated by a series of one or more separators []. Separators are:

semicolon
newline characters (\r and \n)

Note: because of taking a series of separators as one, using multiple separators will not result in empty nodes. This allows the user to use both semicolon and newline, place empty lines in a list, etc. When an empty node is required, the explicit form of "{}" must be used.

1.2. comments

Lines starting with hash mark (#), or with white spaces followed by hash mark are ignored by the parser (up to the end of the line).

1.3. escaping []

Lihata uses backslash as escape character.

On the following list any character (called the lihata control characters) preceded by a backslash is taken as the given literal character and is not interpreter by the parser:

backslash (\\ will result in a single literal backslash)
":" (colon)
";" (semicolon)
newline characters (otherwise they would be interpreted as separators!)
"{" (opening brace)
"}" (closing brace)
"=" (equal)

Another special character is the slash ("/"). It may appear anywhere in strings, but it is not a wise decision to put slashes in node names: symlinks use slash as path separator and slash in names must be protected with backslash in symlink values.

2. Node types

NOTE: minimal implementation requirements are given to explain the help understanding the purpose and real use cases of different constructions. Actual implementation shall provide the minimal requirements since lihata files will be designed accordingly, but may provide more or less features. Implementations providing less features may be suboptimal (slow, user code being longer) as users of the lihata format will chose constructions assuming all minimal requirements are provided.

2.1. te: text

The content of a text node is plain text of arbitrary length. The two ways of describing the text is with and without braces.

Without braces, short text can be specified with minimal overhead. The end of the string is the separator. When the string contains the separator, it must be protected by backslash. In case the separator is a series of separator characters longer than a single character, each separator character must be protected individually. Furthermore all other lihata control characters must be protected in similar manner.

When enclosed in braces, much less protection is required: the only special characters are the closing brace and backslash. Nevertheless escaping is applied even in the string [] , which means a backslash followed by any character is taken as that literal character. A specific common example is windows/dos paths, where double backslash is required (first backslash escaping the second).

A special case is empty string, which is described as {} and can not be described without the braces - making empty string explicit and different from missing key.

A text node never has children and must be the leaf of the tree.

2.2. li: list

The list node has zero or more children of nodes in an ordered list. Multiple instances of the same node name is allowed. List items are separated by the lihata separator and the list is always enclosed in braces.

Example:

li:names {
	li:first = { Ann; John; Jack; Lily }
	li:last  = { Smith; McAdam; }
	seed = 15
	desc = {First and last names to randomize full names; as described in: {5\}; seed is also specified as names/seed. }
}

List /names contains two other lists /names/first and /names/last. There are small differences in how optional syntax is used in each list:

names depends on newlines being separators to make its fields easier to read;
names/first and names/last are each specified in a single line since values are short
items of first and last are anonymous text items; since none of them contained any lihata control character, no braces are used around them
names/seed is a simple text node
names/desc contains some control characters (;, :, /, { and }); instead of protecting each, the whole string is enclosed in braces and only the closing brace is escaped.

Minimal implementation requirements:

fast: access items on the list one by one (as a single linked list), in order;
fast: rewinding to the first item;
unlink or delete an item;
insert/append an item;

2.3. ha: hash

Hash is very similar to list. The differences are: order of children is not preserved and children name must be unique among the given hash. Multiple instances of the same name will cause parse error. An anonymous node is a node with empty name and is not a special case - in other words, only one anonymous node can be placed on a hash.

Minimal implementation requirements:

fast: access of items by name;
fast: insert a new item;
iterate on all items in random order while number of items are not changing

Example:

ha:directory {
	ha:Jack Smith {
		phone = 555-12345678
		bday  = 01-02
	}
	ha:Ann Mary {
		phone = 555-4242424242
		bday  = 06-01
	}
}

2.4. ta: table

Table is a special list that implies []:

each child node is a list that specifies a row of the table;
each list has exactly the same number of items []

If any of the above implications is not met, parse error is generated. Rows are either explicitly typed li:, or if no type is specified, li: is implied instead of te:.

Since the table itself is a list of rows, ordering of the rows is preserved and rows may be anonymous. In some cases the user may decide to name rows, but minimal implementation requirements will not provide any acceleration for that use case. Cells of the table may be of any type, even full tables themselves.

For the cells the same naming considerations are applied (as a row is a list of cells): can be named, but the implementations are not required to provide acceleration for handling named cells.

Example:

ta:transformation mx {
	{0.5; 0.4; 0}
	{1.5; 0.5; 1}
	{1.1; 3.4; 0.5}
}

Table is usually implemented as an array of rows, rows being arrays of cells; minimal implementation requirements:

for rows of the table: all the list manipulation requirements (in a table-specific way: inserted rows should have the right amount of data)
for cells of a row: fast rewind and "next"; inserting or deleting a cell means inserting or deleting a column, and is support for this operation is not required; when implemented, newly created cells in other rows must have the anonymous empty text as content
for columns of the table: all the list manipulation requirements, except inserting or deleting; when it is implemented inserting a column without content for all existing rows specified, the anonymous empty string shall be used for the missing cells
fast: retrieve cells by integer column;row coordinates (with boundary check)
fast: number of columns and rows

Note: the implementation is not required to store data in row lists, but is free to use a native data structure. This includes the freedom to remove the abscarction of rows (as long as row names are kept); for example the trivial implementation in many languages is a 2 dimensional array of cells and a separate array of row names. List API is not required to work on tables or table rows.

2.5. sy: symlink

Symlink is introduced to make tables readable when cell data is not a single string but a large data structure. The content of the symlink is text that is interpreted as a relative or absolute path in the document during normal access: for almost all the normal calls, the fact that the node is a symlink is hidden. Special accessors are provided to check whether a node is a symlink and to manipulate symlink data. A symlink is broken when target does not exist or leads to a symlink loop. An empty symlink target string is equivalent to "." (pointing to the parent of the symlink) and is broken only if the symlink is the root of the document. Implementations must guarantee that at least 8 levels of symlinks is supported in a symlink chain.

Operations where symlinks are transparent (the caller doesn't find out the referred node is a symlink):

tree traversal: accessing children nodes
tree traversal: when a symlink name is referred in a path, .. is the parent of the symlink and not the target, children is children of the target (since sy: content is text, the node never has children nodes)
list access (when symlink is a list item): getting the first/next node operates on the list where the symlink is, not on a list of where the symlink points to; removing the list item removes the symlink, not the target
hash access: symlink has its own name, which is used for hashing; removing the list item removes the symlink, not the target
table cell: table operations and name are table local as expected
any access that requires target to exist will fail with a special broken symlink error; operations that do not require to deal with the target (e.g. removing or creating a symlink) will not fail if target does not exist

Special functions for accessing symlinks (n is a path or node; function names are examples; functionality may be split or merged among the functions but the implementation must implement all operations that are possible with this list of calls):

is_symlink(n) - returns true if n is a symlink
is_broken(n) - returns true if n is a broken symlink
has_symlink(n) - returns true if subtree n has symlinks
has_broken(n) - returns true if subtree n has broken symlinks
get_target(n) - returns the target node or path of symlink n or an error if n is not a symlink
set_target(n, target) - creates or sets the target node or path of symlink n or an error if n is not a symlink

A symlink is followed only in advanced tree walk functions. Lower levels (parser) should handle symlink as text. Thus, during parsing, it is normal to have broken symlinks:

a symlink can point to a node not yet reached by the parser ("forward reference", word "forward" interpreted in text stream context)
a final lihata document assembled from multiple lihata documents with tree merge - there is no requirement that a symlink in one source file can not point to a node defined in another
in a stream it is the responsibility of the application to define communication packages or transaction and to define how symlinks are interpreted (only within the transaction, or can reference old transactions, etc.)

Even in upper levels it is somewhat normal to have broken symlinks - the error generated is probably similar to any "path not found" error the application may get when following paths that lead to non-existent nodes even without symlink. It is similar as with a normal UNIX file system.

3. single-root tree conventions

A lihata tree has a single root and is a tree on low (node) level (the logical links introduced by symlink are considered to be on a higher level). The root node is always explicit and specified by the user.

There is no restriction of the type of the root node, which means a single anonymous string is already a valid lihata tree, and the smallest valid lihata tree is an anonymous empty string.

The root node may be anonymous.

The parser should stop parsing after the root node is closed. The caller should be responsible for deciding what to do with any remaining data. This behavior may be necessary for stream parsing.

3.1. splitting a tree

The implementation may provide tree split operations:

a subtree is copied or moved and taken as a separate lihata tree (root is the root of the subtree)
the whole tree is traversed and matching paths/nodes are copied or moved to a separate lihata tree with or without children nodes (recursively or not recursively); if a node on the 4th level matched, the 3 parents are also copied, but not recursively (since that would result in a full tree copy)
when splitting a table, the resulting table must be valid

3.2. merging trees

The implementation may provide tree merge operations. Lihata tree A may be merged into lihata tree B at any target node. The type of the target node B must match the type of the root node of A. In case A should be inserted/appended as a new subtree of B, first a new node with the appropriate type shall be created in B and A shall be merged with that node. This way insertion is not a special case.

From now on the root node of A is called src and the target node in B is called dst. When merging two nodes of the same type:

te: content of src is concatenated to content of dst at the end without introducing any character in between the two
li: items of src are appended one by one at the end of dst, keeping original ordering
ta: items of src are appended one by one at the end of dst, keeping original ordering; the number of columns in src and dst must match, else the merge fails
sy: if the target text of the two symlinks match byte-to-byte, one instance is kept, else the merge fails; symlinks with empty value will never match (this can be used to explicitly disable merging of subtrees [] )
ha: items of src are inserted one by one in random order in dst; when a collision is detected, depending on the item type:
- mismatching types: the merge fails with an error
- li: the src item is merged with the dst item as described in the normal li case
- ha: the src item is merged with the dst item as described in the normal ha case
- ta: the src item is merged with the dst item as described in the normal ta case
- sy: the src item is merged with the dst item as described in the normal sy case

An attempt to merge nodes with different type always fails. The merge should fail if src is among the ancestors of dst.

It is strongly recommended that any implementation provide the above merging as this enables config.d style loading of files, letting the user decide about scattering data across files, without the application hardwiring any file name.

3.3 path

A path is a slash separated list of node identifiers [] . When interpreted, the tree is traversed:

from the root (if the path starts with a '/') or;
from the current node (if the path does not start with '/' and is the value of a symlink).

While descending in the tree, on each level one of the nodes will be selected using the corresponding node identifier; if the identifier does not match on its level, an error is generated. If the traversal consumed all identifiers on the path, the last matching node is returned. The method of selecting the node depends on the node type of the current level:

in a list:
- num - the numth element of the list
- name: - the first element whose node name matches name
- name:num - the numth element whose node name matches name
in a hash only the name is specified in the node identifier.
in a table we use the row/column format (as a table is almost a list of lists), both row and column allowing all list identifier formats; a few examples:
- num1/num2 - the node in column num2 in row num1
- name:/num - the node in column num in the first row called name
- name:num1/num2 - the node in column num2 in the num1th row called name
- name1:num1/name2:num2 - the num2th node column called name2 in the num1th row called name1
NOTE: it is forbidden to address a row of a table, only table cells can be addressed - in other words: for a table, the path always contains 2 identifiers.

Numeric identifiers start from zero [] , i.e., the first element in a list is ./0.

Examples:

	li:root {
		li:foo {
			bar = aaaaaa
			      bbbbbb
			bar = cccccc
			  2 = dddddd
		}

		# points to aaaaaa
		sy:ppp = foo/0

		# points to bbbbbb
		sy:qqq = foo/1

		# points to aaaaaa
		sy:rrr = foo/bar:

		# points to cccccc
		sy:sss = foo/bar:1

		# invalid
		sy:ttt = foo/bar

		# points to dddddd
		sy:uuu = foo/1:
	}

The following identifiers will be interpreted as directives and do not depend on list node type:

. - the current level (for compatibility with file system notations)
.. - one level up

Backslash escaping works in paths: if a node name has colon in its name or is called "." or "..", the node identifier should use a backslash before the colon or each dot.

Implementations may provide API calls that take path arguments. These calls may accept extra features in paths, but symlink must not, to avoid implementation-specific lihata documents.

The implementation is expected to provide functionality similar to the following call:

path_resolve(path) - returns the unique lihata node that matches the path (using the symlink path rules described above with no extra features) - symlinks are followed

It is recommended that the implementation also provide similar API that accepts regexps and can return multiple nodes matching the regexp path. The implementation may also provide a version of path_resolve() that returns all matches (in case of lists/tables with named identifiers).

4. other conventions

4.1. tables with named columns

It may be a good convention to name table columns so that the user code does not depend on order. Lihata tries to keep things simple and relatively orthogonal, so there is no dedicated feature doing this. Examples how existing features can be used for the purpose:

use the first row as a header and name each cell node (this won't interfere with cell data); after loading the table build a hash telling which named column is which numbered column
use the first row as a header and make each cell a text node containing the column name - similar approach, more compact but less transparent
if the source code has a preconception of a native order of columns, a low level library function may reorder the columns of the table according to any naming solution
a design decision is whether an extra function to query a cell by name instead of column number is introduced, or the caller shall keep a translation table

Note: if tables are to be merged as described in 3.2., the best solution is seemingly reordering the columns during load or merge, and requires the library to understand the naming convention

4.2. streaming

Streaming formats that require a single root is always a hack of some sort. There are two recommended setups/workarounds for this with lihata

4.2.1. single-root stream

the root shall be a list; this expresses that an ordered sequence of transactions will follow
the name of the root may be the name and version of the protocol - useful if name/version data is short
alternatively the first transaction could be a hash with the name and version of the protocol - useful if there are a lot to send in version data (e.g. list of capabilities)
each child node of the root should be regarded as a transaction; a transaction should be processed only when the full child node is parsed. Technically, using the split method described in 3.1. each transaction can be handled as a separate lihata tree assuming that the parser stops parsing after the end of the root node (which is the end of the transaction node in this case); alternatively the parser may call hooks or return after parsing each node on a specific/preconfigured level.
if the root node is closed, the connection should be terminated

Transactions may be of any node type, and the type:name pair of the transaction should identify the (operation) type of the transaction. For example changing states is easy with a hash, sending bulk, unprocessed textual payload is easy with a text node, sending lists of data is easy with a list node, etc.

Alternatively, if a transaction has more meta-data, transactions may be declared to always be hashes, introducing a new level for meta-data. In this case the name of the transaction hash should be ignored and name, type, version, session-id/cookie and payload of the transaction shall be the fields of the hash.

This method is better if each transaction is a new connection over a connection-oriented protocol (e.g. TCP) or each transaction is an atomic message (e.g. UDP). TODO: example

4.2.2. multi-root stream

A multi-root setup simply puts each transaction in a lihata root node. The event parser (and in fact the dom parser as well) stops parsing after the root node is closed. The caller should then evaluate the transaction and reinitialize the parser so that the next transaction can be parsed.

Inside the transaction, some of the considerations of 4.2.1. apply.

This method is useful if there are multiple requests and answers shall be sent in a connection-oriented protocol (e.g. TCP).

4.3. data types and marshaling

Often native binary data (C structs in our examples) shall be written or read in textual format. When using lihata for the purpose, the following conventions are suggested.

Fields of the structure should map 1:1 to fields of a lihata hash. Since lihata doesn't offer extra syntax for describing the type of the payload, and validation of the file format shall be performed, such type data needs to be stored separately. The simplest way is defining a special named field that is put in each data hash, a name that a native field can not have, which holds type info. Best practice is to make these fields symlink to central li: nodes describing the format. To make things more compact, it's cheaper to introduce a new level that describes the format and contains a lot of record strictly following that format. An example for this convention:

li:data set {
	ha:formats {
		li:phone {
			name=string
			phone=int64
		}
	}
	li:phone directory {
		sy:format=../formats/phone
		ha: {
			name=Jack Smith
			phone=1177182
		}
		ha: {
			name=Susanne Groove
			phone=424242
		}
	}
}

At the end the C struct may be:

struct phone_directory {
	char *name;
	uint64_t phone;
}

The marshaler could be directed to load all data under "/phone directory" into a newly allocated array and return the number of entries.

If the format description is part of each data file, the data files are self-describing - no versioning is needed and the code dealing with the marshaling can decide whether the format is acceptable or not.

A different approach, more useful if there are a huge amount of data of the same types scattered in multiple files, is to have all format description in a single file and rely of the config.d-style merging of the loader. The same self-describing concept is applied, not to individual files but the whole directory.

4.4. data encoding

Lihata stores byte streams that can contain any (binary) that makes up a valid lihata document (when parsed byte-by-byte) with the only exception being the ASCII \0 character, which is forbidden anywhere in the file [].

In case user application would like to use different encodings it should store encoding info as a text node; for example the root could be a li with a first child te:encoding=whatever.

A common, but explicitly unsupported case is the UTF-8 BOM (the UTF identifier cookie bytes in the beginning of a document), since the document must start with a comment or the root node. This means a lihata document that contains more than a single text node may start with one of:

a sequence of whitespaces
a hashmark (comment)
a type identifier (li:, ha:, ta: or sy:)

Of course user application may wrap the lihata document in whatever other format, including an UTF-8 frame starting with a BOM, as long as framing is stripped by the application before passing the document to the standard lihata parser. However, in this case the full document will not be compatible with other applications and libs parsing or emitting lihata documents. Thus it is strongly recommended to avoid using UTF-8 BOM or encoding other headers and use a lihata node for storing them instead.