The lihata format

0. introduction

Lihata is a simple list-hash-table language. There are a lot of languages out there already, with different strengths and drawbacks. It is hard to find the balance between the following properties: Lihata is yet another mix of features trying to find a different balance. For a comparison of lihata and other languages, check out the design decision document , which also defines the design goals .

The current paper attempts to describe the language, not the design decisions behind it; however, a link to the appropriate design decision section will always be provided in the following form: []

1. Tree, syntax

Lihata represents a tree of nodes. Each node has the following properties:

The following types are defined:

Each type is identified by its two-letter prefix [].

The canonical format of a node specification is:

{type:name}={content};
There are some simplifications [], tho: The shortest valid node description is the anonymous empty text node:
{};
Any combination between the shortest and the canonical format is accepted; a few examples:
my_text1={blah blah};
my_text2=blah blah;
li: = {
	item1;
	item2
	item3;
	item4;item5;
};
The above example declares two text nodes with the same content, an anonymous list with 5 anonymous text items.

White spaces between tokens are ignored by the parser, except for newline characters which are interpreted as separators. In practice this means unprotected text fields (including node names) may contain spaces, but leading and trailing whitespace will be stripped. Backslash or brace protection can be used when leading/trailing spaces should be kept.

1.1. Separator

Any two node specification is separated by a series of one or more separators []. Separators are: Note: because of taking a series of separators as one, using multiple separators will not result in empty nodes. This allows the user to use both semicolon and newline, place empty lines in a list, etc. When an empty node is required, the explicit form of "{}" must be used.

1.2. comments

Lines starting with hash mark (#), or with white spaces followed by hash mark are ignored by the parser (up to the end of the line).

1.3. escaping []

Lihata uses backslash as escape character.

On the following list any character (called the lihata control characters) preceded by a backslash is taken as the given literal character and is not interpreter by the parser:

Another special character is the slash ("/"). It may appear anywhere in strings, but it is not a wise decision to put slashes in node names: symlinks use slash as path separator and slash in names must be protected with backslash in symlink values.

2. Node types

NOTE: minimal implementation requirements are given to explain the help understanding the purpose and real use cases of different constructions. Actual implementation shall provide the minimal requirements since lihata files will be designed accordingly, but may provide more or less features. Implementations providing less features may be suboptimal (slow, user code being longer) as users of the lihata format will chose constructions assuming all minimal requirements are provided.

2.1. te: text

The content of a text node is plain text of arbitrary length. The two ways of describing the text is with and without braces.

Without braces, short text can be specified with minimal overhead. The end of the string is the separator. When the string contains the separator, it must be protected by backslash. In case the separator is a series of separator characters longer than a single character, each separator character must be protected individually. Furthermore all other lihata control characters must be protected in similar manner.

When enclosed in braces, much less protection is required: the only special characters are the closing brace and backslash. Nevertheless escaping is applied even in the string [] , which means a backslash followed by any character is taken as that literal character. A specific common example is windows/dos paths, where double backslash is required (first backslash escaping the second).

A special case is empty string, which is described as {} and can not be described without the braces - making empty string explicit and different from missing key.

A text node never has children and must be the leaf of the tree.

2.2. li: list

The list node has zero or more children of nodes in an ordered list. Multiple instances of the same node name is allowed. List items are separated by the lihata separator and the list is always enclosed in braces.

Example:

li:names {
	li:first = { Ann; John; Jack; Lily }
	li:last  = { Smith; McAdam; }
	seed = 15
	desc = {First and last names to randomize full names; as described in: {5\}; seed is also specified as names/seed. }
}
List /names contains two other lists /names/first and /names/last. There are small differences in how optional syntax is used in each list:

Minimal implementation requirements:

2.3. ha: hash

Hash is very similar to list. The differences are: order of children is not preserved and children name must be unique among the given hash. Multiple instances of the same name will cause parse error. An anonymous node is a node with empty name and is not a special case - in other words, only one anonymous node can be placed on a hash.

Minimal implementation requirements:

Example:
ha:directory {
	ha:Jack Smith {
		phone = 555-12345678
		bday  = 01-02
	}
	ha:Ann Mary {
		phone = 555-4242424242
		bday  = 06-01
	}
}

2.4. ta: table

Table is a special list that implies []: If any of the above implications is not met, parse error is generated. Rows are either explicitly typed li:, or if no type is specified, li: is implied instead of te:.

Since the table itself is a list of rows, ordering of the rows is preserved and rows may be anonymous. In some cases the user may decide to name rows, but minimal implementation requirements will not provide any acceleration for that use case. Cells of the table may be of any type, even full tables themselves.

For the cells the same naming considerations are applied (as a row is a list of cells): can be named, but the implementations are not required to provide acceleration for handling named cells.

Example:

ta:transformation mx {
	{0.5; 0.4; 0}
	{1.5; 0.5; 1}
	{1.1; 3.4; 0.5}
}
Table is usually implemented as an array of rows, rows being arrays of cells; minimal implementation requirements:

Note: the implementation is not required to store data in row lists, but is free to use a native data structure. This includes the freedom to remove the abscarction of rows (as long as row names are kept); for example the trivial implementation in many languages is a 2 dimensional array of cells and a separate array of row names. List API is not required to work on tables or table rows.

2.5. sy: symlink

Symlink is introduced to make tables readable when cell data is not a single string but a large data structure. The content of the symlink is text that is interpreted as a relative or absolute path in the document during normal access: for almost all the normal calls, the fact that the node is a symlink is hidden. Special accessors are provided to check whether a node is a symlink and to manipulate symlink data. A symlink is broken when target does not exist or leads to a symlink loop. An empty symlink target string is equivalent to "." (pointing to the parent of the symlink) and is broken only if the symlink is the root of the document. Implementations must guarantee that at least 8 levels of symlinks is supported in a symlink chain.

Operations where symlinks are transparent (the caller doesn't find out the referred node is a symlink):

Special functions for accessing symlinks (n is a path or node; function names are examples; functionality may be split or merged among the functions but the implementation must implement all operations that are possible with this list of calls):

A symlink is followed only in advanced tree walk functions. Lower levels (parser) should handle symlink as text. Thus, during parsing, it is normal to have broken symlinks:

Even in upper levels it is somewhat normal to have broken symlinks - the error generated is probably similar to any "path not found" error the application may get when following paths that lead to non-existent nodes even without symlink. It is similar as with a normal UNIX file system.

3. single-root tree conventions

A lihata tree has a single root and is a tree on low (node) level (the logical links introduced by symlink are considered to be on a higher level). The root node is always explicit and specified by the user.

There is no restriction of the type of the root node, which means a single anonymous string is already a valid lihata tree, and the smallest valid lihata tree is an anonymous empty string.

The root node may be anonymous.

The parser should stop parsing after the root node is closed. The caller should be responsible for deciding what to do with any remaining data. This behavior may be necessary for stream parsing.

3.1. splitting a tree

The implementation may provide tree split operations:

3.2. merging trees

The implementation may provide tree merge operations. Lihata tree A may be merged into lihata tree B at any target node. The type of the target node B must match the type of the root node of A. In case A should be inserted/appended as a new subtree of B, first a new node with the appropriate type shall be created in B and A shall be merged with that node. This way insertion is not a special case.

From now on the root node of A is called src and the target node in B is called dst. When merging two nodes of the same type:

An attempt to merge nodes with different type always fails. The merge should fail if src is among the ancestors of dst.

It is strongly recommended that any implementation provide the above merging as this enables config.d style loading of files, letting the user decide about scattering data across files, without the application hardwiring any file name.

3.3 path

A path is a slash separated list of node identifiers [] . When interpreted, the tree is traversed: While descending in the tree, on each level one of the nodes will be selected using the corresponding node identifier; if the identifier does not match on its level, an error is generated. If the traversal consumed all identifiers on the path, the last matching node is returned. The method of selecting the node depends on the node type of the current level: Numeric identifiers start from zero [] , i.e., the first element in a list is ./0.

Examples:

	li:root {
		li:foo {
			bar = aaaaaa
			      bbbbbb
			bar = cccccc
			  2 = dddddd
		}

		# points to aaaaaa
		sy:ppp = foo/0

		# points to bbbbbb
		sy:qqq = foo/1

		# points to aaaaaa
		sy:rrr = foo/bar:

		# points to cccccc
		sy:sss = foo/bar:1

		# invalid
		sy:ttt = foo/bar

		# points to dddddd
		sy:uuu = foo/1:
	}
The following identifiers will be interpreted as directives and do not depend on list node type: Backslash escaping works in paths: if a node name has colon in its name or is called "." or "..", the node identifier should use a backslash before the colon or each dot.

Implementations may provide API calls that take path arguments. These calls may accept extra features in paths, but symlink must not, to avoid implementation-specific lihata documents.

The implementation is expected to provide functionality similar to the following call:

It is recommended that the implementation also provide similar API that accepts regexps and can return multiple nodes matching the regexp path. The implementation may also provide a version of path_resolve() that returns all matches (in case of lists/tables with named identifiers).

4. other conventions

4.1. tables with named columns

It may be a good convention to name table columns so that the user code does not depend on order. Lihata tries to keep things simple and relatively orthogonal, so there is no dedicated feature doing this. Examples how existing features can be used for the purpose:

Note: if tables are to be merged as described in 3.2., the best solution is seemingly reordering the columns during load or merge, and requires the library to understand the naming convention

4.2. streaming

Streaming formats that require a single root is always a hack of some sort. There are two recommended setups/workarounds for this with lihata

4.2.1. single-root stream

Transactions may be of any node type, and the type:name pair of the transaction should identify the (operation) type of the transaction. For example changing states is easy with a hash, sending bulk, unprocessed textual payload is easy with a text node, sending lists of data is easy with a list node, etc.
  • Alternatively, if a transaction has more meta-data, transactions may be declared to always be hashes, introducing a new level for meta-data. In this case the name of the transaction hash should be ignored and name, type, version, session-id/cookie and payload of the transaction shall be the fields of the hash.

    This method is better if each transaction is a new connection over a connection-oriented protocol (e.g. TCP) or each transaction is an atomic message (e.g. UDP). TODO: example

    4.2.2. multi-root stream

    A multi-root setup simply puts each transaction in a lihata root node. The event parser (and in fact the dom parser as well) stops parsing after the root node is closed. The caller should then evaluate the transaction and reinitialize the parser so that the next transaction can be parsed.

    Inside the transaction, some of the considerations of 4.2.1. apply.

    This method is useful if there are multiple requests and answers shall be sent in a connection-oriented protocol (e.g. TCP).

    4.3. data types and marshaling

    Often native binary data (C structs in our examples) shall be written or read in textual format. When using lihata for the purpose, the following conventions are suggested.

    Fields of the structure should map 1:1 to fields of a lihata hash. Since lihata doesn't offer extra syntax for describing the type of the payload, and validation of the file format shall be performed, such type data needs to be stored separately. The simplest way is defining a special named field that is put in each data hash, a name that a native field can not have, which holds type info. Best practice is to make these fields symlink to central li: nodes describing the format. To make things more compact, it's cheaper to introduce a new level that describes the format and contains a lot of record strictly following that format. An example for this convention:

    li:data set {
    	ha:formats {
    		li:phone {
    			name=string
    			phone=int64
    		}
    	}
    	li:phone directory {
    		sy:format=../formats/phone
    		ha: {
    			name=Jack Smith
    			phone=1177182
    		}
    		ha: {
    			name=Susanne Groove
    			phone=424242
    		}
    	}
    }
    
    At the end the C struct may be:
    struct phone_directory {
    	char *name;
    	uint64_t phone;
    }
    
    The marshaler could be directed to load all data under "/phone directory" into a newly allocated array and return the number of entries.

    If the format description is part of each data file, the data files are self-describing - no versioning is needed and the code dealing with the marshaling can decide whether the format is acceptable or not.

    A different approach, more useful if there are a huge amount of data of the same types scattered in multiple files, is to have all format description in a single file and rely of the config.d-style merging of the loader. The same self-describing concept is applied, not to individual files but the whole directory.

    4.4. data encoding

    Lihata stores byte streams that can contain any (binary) that makes up a valid lihata document (when parsed byte-by-byte) with the only exception being the ASCII \0 character, which is forbidden anywhere in the file [].

    In case user application would like to use different encodings it should store encoding info as a text node; for example the root could be a li with a first child te:encoding=whatever.

    A common, but explicitly unsupported case is the UTF-8 BOM (the UTF identifier cookie bytes in the beginning of a document), since the document must start with a comment or the root node. This means a lihata document that contains more than a single text node may start with one of:

    Of course user application may wrap the lihata document in whatever other format, including an UTF-8 frame starting with a BOM, as long as framing is stripped by the application before passing the document to the standard lihata parser. However, in this case the full document will not be compatible with other applications and libs parsing or emitting lihata documents. Thus it is strongly recommended to avoid using UTF-8 BOM or encoding other headers and use a lihata node for storing them instead.