The lihata format
0. introduction
Lihata is a simple list-hash-table language. There are a lot of languages
out there already, with different strengths and drawbacks. It is hard
to find the balance between the following properties:
- compactness
- transparency (easy to read and understand)
- representation of complex data structures (ordered vs. unordered lists, key=value pairs, tables)
Lihata is yet another mix of features trying to find a different balance.
For a comparison of lihata and other languages, check out the
design decision document , which
also defines the design goals .
The current paper attempts to describe the language, not the design
decisions behind it; however, a link to the appropriate design decision
section will always be provided in the following form:
[]
1. Tree, syntax
Lihata represents a tree of nodes. Each node has the following properties:
- type (sometimes implicit)
- name (optional, anonymous nodes are allowed)
- content
The following types are defined:
- text (default); prefix: te or no prefix
- list (ordered list of children nodes); prefix: li
- hash (unordered list of children nodes hashed by unique node names); prefix: ha
- table (a matrix or children nodes); prefix: ta
- symlink (a single text field describing which other node shall be used when this node is referenced)
Each type is identified by its two-letter prefix [].
The canonical format of a node specification is:
{type:name}={content};
There are some simplifications [], tho:
- The "type:" part is optional; when not specified, text ("te:") is implied in most contexts [], except for a table, where list (as a row of table) is implied.
- Even the name is optional; when not specified, the name is the empty string, which can be used as sort of anonymous node names [], even in a hash.
- The equal sign is optional [].
- Simple text nodes may be specified without the braces []. The rule of thumb here is: punctuation marks it needs protection, else it does not (even if it contains spaces, tabs or underscores).
- The semicolon at the end is a separator (discussed in the next section).
- type:name is just a text, the same rules apply as for the content (e.g. spaces are allowed).
- But it is uncommon to use braces around type:name and users seldom chose names that contain characters that need escaping.
The shortest valid node description is the anonymous empty text node:
{};
Any combination between the shortest and the canonical format is accepted;
a few examples:
my_text1={blah blah};
my_text2=blah blah;
li: = {
item1;
item2
item3;
item4;item5;
};
The above example declares two text nodes with the same content, an anonymous
list with 5 anonymous text items.
White spaces between tokens are ignored by the
parser, except for newline characters which are interpreted as separators.
In practice this means unprotected text fields (including node names)
may contain spaces, but leading and trailing whitespace will be stripped.
Backslash or brace protection can be used when leading/trailing spaces
should be kept.
1.1. Separator
Any two node specification is separated by a series of one or more separators [].
Separators are:
- semicolon
- newline characters (\r and \n)
Note: because of taking a series of separators as one, using multiple
separators will not result in empty nodes. This allows the user
to use both semicolon and newline, place empty lines in a list, etc.
When an empty node is required, the explicit form of "{}" must be used.
1.2. comments
Lines starting with hash mark (#), or with white spaces followed by hash
mark are ignored by the parser (up to the end of the line).
1.3. escaping []
Lihata uses backslash as escape character.
On the following list any character (called the lihata control characters)
preceded by a backslash is taken as the given literal character and is
not interpreter by the parser:
- backslash (\\ will result in a single literal backslash)
- ":" (colon)
- ";" (semicolon)
- newline characters (otherwise they would be interpreted as separators!)
- "{" (opening brace)
- "}" (closing brace)
- "=" (equal)
Another special character is the slash ("/"). It may appear anywhere in
strings, but it is not a wise decision to put slashes in node names: symlinks
use slash as path separator and slash in names must be protected with backslash
in symlink values.
2. Node types
NOTE: minimal implementation requirements are given to explain the help
understanding the purpose and real use cases of different constructions.
Actual implementation shall provide the minimal requirements since
lihata files will be designed accordingly, but may provide more or
less features. Implementations providing less features may be suboptimal
(slow, user code being longer) as users of the lihata format will chose
constructions assuming all minimal requirements are provided.
2.1. te: text
The content of a text node is plain text of arbitrary length. The two ways
of describing the text is with and without braces.
Without braces, short text can be specified with minimal overhead. The end
of the string is the separator. When the string contains the separator,
it must be protected by backslash. In case the separator is a series of
separator characters longer than a single character, each separator character
must be protected individually. Furthermore all other lihata control characters
must be protected in similar manner.
When enclosed in braces, much less protection is required: the only special
characters are the closing brace and backslash. Nevertheless escaping is applied even
in the string [] , which means
a backslash followed by any character is taken as that literal character.
A specific common example is windows/dos paths, where double backslash
is required (first backslash escaping the second).
A special case is empty string, which is described as {} and can not
be described without the braces - making empty string explicit and
different from missing key.
A text node never has children and must be the leaf of the tree.
2.2. li: list
The list node has zero or more children of nodes in an ordered list. Multiple
instances of the same node name is allowed. List items are separated by
the lihata separator and the list is always enclosed in braces.
Example:
li:names {
li:first = { Ann; John; Jack; Lily }
li:last = { Smith; McAdam; }
seed = 15
desc = {First and last names to randomize full names; as described in: {5\}; seed is also specified as names/seed. }
}
List /names contains two other lists /names/first and /names/last. There
are small differences in how optional syntax is used in each list:
- names depends on newlines being separators to make its fields easier to read;
- names/first and names/last are each specified in a single line since values are short
- items of first and last are anonymous text items; since none of them contained any lihata control character, no braces are used around them
- names/seed is a simple text node
- names/desc contains some control characters (;, :, /, { and }); instead of protecting each, the whole string is enclosed in braces and only the closing brace is escaped.
Minimal implementation requirements:
- fast: access items on the list one by one (as a single linked list), in order;
- fast: rewinding to the first item;
- unlink or delete an item;
- insert/append an item;
2.3. ha: hash
Hash is very similar to list. The differences are: order of children is
not preserved and children name must be unique among the given hash.
Multiple instances of the same name will cause parse error. An anonymous
node is a node with empty name and is not a special case - in
other words, only one anonymous node can be placed on a hash.
Minimal implementation requirements:
- fast: access of items by name;
- fast: insert a new item;
- iterate on all items in random order while number of items are not changing
Example:
ha:directory {
ha:Jack Smith {
phone = 555-12345678
bday = 01-02
}
ha:Ann Mary {
phone = 555-4242424242
bday = 06-01
}
}
2.4. ta: table
Table is a special list that implies []:
- each child node is a list that specifies a row of the table;
- each list has exactly the same number of items []
If any of the above implications is not met, parse error is generated. Rows
are either explicitly typed li:, or if no type is specified, li: is implied instead
of te:.
Since the table itself is a list of rows, ordering of the rows is
preserved and rows may be anonymous. In some cases the user may
decide to name rows, but minimal implementation requirements will not
provide any acceleration for that use case. Cells of the table
may be of any type, even full tables themselves.
For the cells the same naming considerations are applied (as a row is a
list of cells): can be named, but the implementations are not required
to provide acceleration for handling named cells.
Example:
ta:transformation mx {
{0.5; 0.4; 0}
{1.5; 0.5; 1}
{1.1; 3.4; 0.5}
}
Table is usually implemented as an array of rows, rows being
arrays of cells; minimal implementation requirements:
- for rows of the table: all the list manipulation requirements (in a table-specific way: inserted rows should have the right amount of data)
- for cells of a row: fast rewind and "next"; inserting or deleting a cell means inserting or deleting a column, and is support for this operation is not required; when implemented, newly created cells in other rows must have the anonymous empty text as content
- for columns of the table: all the list manipulation requirements, except inserting or deleting; when it is implemented inserting a column without content for all existing rows specified, the anonymous empty string shall be used for the missing cells
- fast: retrieve cells by integer column;row coordinates (with boundary check)
- fast: number of columns and rows
Note: the implementation is not required to store data in row lists, but
is free to use a native data structure. This includes the freedom to remove
the abscarction of rows (as long as row names are kept); for example the
trivial implementation in many languages is a 2 dimensional array of cells
and a separate array of row names. List API is not required to work on
tables or table rows.
2.5. sy: symlink
Symlink is introduced to make tables readable when cell data is not
a single string but a large data structure. The content of the symlink
is text that is interpreted as a relative or absolute path in the
document during normal access: for almost all the normal calls, the fact
that the node is a symlink is hidden. Special accessors are
provided to check whether a node is a symlink and to manipulate symlink
data. A symlink is broken when target does not exist or leads to a
symlink loop. An empty symlink target string is equivalent to "."
(pointing to the parent of the symlink) and is broken only if
the symlink is the root of the document.
Implementations must guarantee that at least 8 levels of symlinks
is supported in a symlink chain.
Operations where symlinks are transparent (the caller doesn't
find out the referred node is a symlink):
- tree traversal: accessing children nodes
- tree traversal: when a symlink name is referred in a path, .. is the parent of the symlink and not the target, children is children of the target (since sy: content is text, the node never has children nodes)
- list access (when symlink is a list item): getting the first/next node operates on the list where the symlink is, not on a list of where the symlink points to; removing the list item removes the symlink, not the target
- hash access: symlink has its own name, which is used for hashing; removing the list item removes the symlink, not the target
- table cell: table operations and name are table local as expected
- any access that requires target to exist will fail with a special broken symlink error; operations that do not require to deal with the target (e.g. removing or creating a symlink) will not fail if target does not exist
Special functions for accessing symlinks (n is a path or node; function
names are examples; functionality may be split or merged among the functions
but the implementation must implement all operations that are possible
with this list of calls):
- is_symlink(n) - returns true if n is a symlink
- is_broken(n) - returns true if n is a broken symlink
- has_symlink(n) - returns true if subtree n has symlinks
- has_broken(n) - returns true if subtree n has broken symlinks
- get_target(n) - returns the target node or path of symlink n or an error if n is not a symlink
- set_target(n, target) - creates or sets the target node or path of symlink n or an error if n is not a symlink
A symlink is followed only in advanced tree walk functions. Lower levels
(parser) should handle symlink as text. Thus, during parsing,
it is normal to have broken symlinks:
- a symlink can point to a node not yet reached by the parser ("forward reference", word "forward" interpreted in text stream context)
- a final lihata document assembled from multiple lihata documents with tree merge - there is no requirement that a symlink in one source file can not point to a node defined in another
- in a stream it is the responsibility of the application to define communication packages or transaction and to define how symlinks are interpreted (only within the transaction, or can reference old transactions, etc.)
Even in upper levels it is somewhat normal to have broken symlinks - the error
generated is probably similar to any "path not found" error the application
may get when following paths that lead to non-existent nodes even without symlink.
It is similar as with a normal UNIX file system.
3. single-root tree conventions
A lihata tree has a single root and is a tree on low (node) level
(the logical links introduced by symlink are considered to
be on a higher level). The root node is always explicit and specified
by the user.
There is no restriction of the type of the root node, which
means a single anonymous string is already a valid lihata tree, and
the smallest valid lihata tree is an anonymous empty string.
The root node may be anonymous.
The parser should stop parsing after the root node is closed. The caller
should be responsible for deciding what to do with any remaining data.
This behavior may be necessary for stream parsing.
3.1. splitting a tree
The implementation may provide tree split operations:
- a subtree is copied or moved and taken as a separate lihata tree (root is the root of the subtree)
- the whole tree is traversed and matching paths/nodes are copied or moved to a separate lihata tree with or without children nodes (recursively or not recursively); if a node on the 4th level matched, the 3 parents are also copied, but not recursively (since that would result in a full tree copy)
- when splitting a table, the resulting table must be valid
3.2. merging trees
The implementation may provide tree merge operations.
Lihata tree A may be merged into lihata tree B at any target
node. The type of the target node B must match the type of
the root node of A. In case A should be inserted/appended
as a new subtree of B, first a new node with the appropriate type
shall be created in B and A shall be merged with that node. This
way insertion is not a special case.
From now on the root node of A is called src and the target
node in B is called dst.
When merging two nodes of the same type:
- te: content of src is concatenated to content of dst at the end without introducing any character in between the two
- li: items of src are appended one by one at the end of dst, keeping original ordering
- ta: items of src are appended one by one at the end of dst, keeping original ordering; the number of columns in src and dst must match, else the merge fails
- sy: if the target text of the two symlinks match byte-to-byte, one instance is kept, else the merge fails; symlinks with empty value will never match (this can be used to explicitly disable merging of subtrees [] )
- ha: items of src are inserted one by one in random order in dst; when a collision is detected, depending on the item type:
- mismatching types: the merge fails with an error
- li: the src item is merged with the dst item as described in the normal li case
- ha: the src item is merged with the dst item as described in the normal ha case
- ta: the src item is merged with the dst item as described in the normal ta case
- sy: the src item is merged with the dst item as described in the normal sy case
An attempt to merge nodes with different type always fails. The merge should
fail if src is among the ancestors of dst.
It is strongly recommended that any implementation provide the above merging
as this enables config.d style loading of files, letting the user decide
about scattering data across files, without the application hardwiring
any file name.
3.3 path
A path is a slash separated list of node identifiers [] .
When interpreted, the tree is traversed:
- from the root (if the path starts with a '/') or;
- from the current node (if the path does not start with '/' and is the value of a symlink).
While descending in the tree, on each level one of the nodes will be selected
using the corresponding node identifier; if the identifier does not match
on its level, an error is generated. If the traversal consumed all identifiers
on the path, the last matching node is returned. The method of selecting the
node depends on the node type of the current level:
- in a list:
- num - the numth element of the list
- name: - the first element whose node name matches name
- name:num - the numth element whose node name matches name
- in a hash only the name is specified in the node identifier.
- in a table we use the row/column format (as a table is almost a list of lists), both row and column allowing all list identifier formats; a few examples:
- num1/num2 - the node in column num2 in row num1
- name:/num - the node in column num in the first row called name
- name:num1/num2 - the node in column num2 in the num1th row called name
- name1:num1/name2:num2 - the num2th node column called name2 in the num1th row called name1
NOTE: it is forbidden to address a row of a table, only table cells can be addressed - in other words: for a table, the path always contains 2 identifiers.
Numeric identifiers start from zero [] , i.e., the first element in a list is ./0.
Examples:
li:root {
li:foo {
bar = aaaaaa
bbbbbb
bar = cccccc
2 = dddddd
}
# points to aaaaaa
sy:ppp = foo/0
# points to bbbbbb
sy:qqq = foo/1
# points to aaaaaa
sy:rrr = foo/bar:
# points to cccccc
sy:sss = foo/bar:1
# invalid
sy:ttt = foo/bar
# points to dddddd
sy:uuu = foo/1:
}
The following identifiers will be interpreted as directives and do not depend
on list node type:
- . - the current level (for compatibility with file system notations)
- .. - one level up
Backslash escaping works in paths: if a node name has colon in its name or
is called "." or "..", the node identifier should use a backslash before
the colon or each dot.
Implementations may provide API calls that take path arguments. These
calls may accept extra features in paths, but symlink must not, to
avoid implementation-specific lihata documents.
The implementation is expected to provide functionality similar to the following
call:
- path_resolve(path) - returns the unique lihata node that matches the path (using the symlink path rules described above with no extra features) - symlinks are followed
It is recommended that the implementation also provide similar API
that accepts regexps and can return multiple nodes matching the regexp path.
The implementation may also provide a version of path_resolve() that returns all
matches (in case of lists/tables with named identifiers).
4. other conventions
4.1. tables with named columns
It may be a good convention to name table columns so that the user code
does not depend on order. Lihata tries to keep things simple and relatively
orthogonal, so there is no dedicated feature doing this. Examples how
existing features can be used for the purpose:
- use the first row as a header and name each cell node (this won't interfere with cell data); after loading the table build a hash telling which named column is which numbered column
- use the first row as a header and make each cell a text node containing the column name - similar approach, more compact but less transparent
- if the source code has a preconception of a native order of columns, a low level library function may reorder the columns of the table according to any naming solution
- a design decision is whether an extra function to query a cell by name instead of column number is introduced, or the caller shall keep a translation table
Note: if tables are to be merged as described in 3.2., the best solution is
seemingly reordering the columns during load or merge, and requires
the library to understand the naming convention
4.2. streaming
Streaming formats that require a single root is always a hack of some sort.
There are two recommended setups/workarounds for this with lihata
4.2.1. single-root stream
- the root shall be a list; this expresses that an ordered sequence of transactions will follow
- the name of the root may be the name and version of the protocol - useful if name/version data is short
- alternatively the first transaction could be a hash with the name and version of the protocol - useful if there are a lot to send in version data (e.g. list of capabilities)
- each child node of the root should be regarded as a transaction; a transaction should be processed only when the full child node is parsed. Technically, using the split method described in 3.1. each transaction can be handled as a separate lihata tree assuming that the parser stops parsing after the end of the root node (which is the end of the transaction node in this case); alternatively the parser may call hooks or return after parsing each node on a specific/preconfigured level.
- if the root node is closed, the connection should be terminated
Transactions may be of any node type, and the type:name pair of the transaction
should identify the (operation) type of the transaction. For example
changing states is easy with a hash, sending bulk, unprocessed textual
payload is easy with a text node, sending lists of data is easy with a list
node, etc.
Alternatively, if a transaction has more meta-data, transactions may be
declared to always be hashes, introducing a new level for meta-data. In this
case the name of the transaction hash should be ignored and name, type, version,
session-id/cookie and payload of the transaction shall be the fields of the
hash.
This method is better if each transaction is a new connection over
a connection-oriented protocol (e.g. TCP) or each transaction
is an atomic message (e.g. UDP).
TODO: example
4.2.2. multi-root stream
A multi-root setup simply puts each transaction in a lihata root node.
The event parser (and in fact the dom parser as well) stops parsing after
the root node is closed. The caller should then evaluate the transaction
and reinitialize the parser so that the next transaction can be parsed.
Inside the transaction, some of the considerations of 4.2.1. apply.
This method is useful if there are multiple requests and answers shall
be sent in a connection-oriented protocol (e.g. TCP).
4.3. data types and marshaling
Often native binary data (C structs in our examples) shall be written or
read in textual format. When using lihata for the purpose, the following
conventions are suggested.
Fields of the structure should map 1:1 to fields of a lihata hash. Since
lihata doesn't offer extra syntax for describing the type of the payload,
and validation of the file format shall be performed, such type data needs
to be stored separately. The simplest way is defining a special named
field that is put in each data hash, a name that a native field can not have, which
holds type info. Best practice is to make these fields symlink to central
li: nodes describing the format. To make things more compact, it's
cheaper to introduce a new level that describes the format and contains
a lot of record strictly following that format. An example for this convention:
li:data set {
ha:formats {
li:phone {
name=string
phone=int64
}
}
li:phone directory {
sy:format=../formats/phone
ha: {
name=Jack Smith
phone=1177182
}
ha: {
name=Susanne Groove
phone=424242
}
}
}
At the end the C struct may be:
struct phone_directory {
char *name;
uint64_t phone;
}
The marshaler could be directed to load all data under "/phone directory" into
a newly allocated array and return the number of entries.
If the format description is part of each data file, the data files
are self-describing - no versioning is needed and the code dealing
with the marshaling can decide whether the format is acceptable or
not.
A different approach, more useful if there are a huge amount of data
of the same types scattered in multiple files, is to have all format
description in a single file and rely of the config.d-style merging
of the loader. The same self-describing concept is applied, not
to individual files but the whole directory.
4.4. data encoding
Lihata stores byte streams that can contain any (binary) that makes up
a valid lihata document (when parsed byte-by-byte) with the only
exception being the ASCII \0 character, which is forbidden anywhere in
the file [].
In case user application would like to use different encodings it should
store encoding info as a text node; for example the root could be a li
with a first child te:encoding=whatever.
A common, but explicitly unsupported case is the UTF-8 BOM (the UTF identifier
cookie bytes in the beginning of a document), since the document must start
with a comment or the root node. This means a lihata document that contains
more than a single text node may start with one of:
- a sequence of whitespaces
- a hashmark (comment)
- a type identifier (li:, ha:, ta: or sy:)
Of course user application may wrap the lihata document in whatever
other format, including an UTF-8 frame starting with a BOM, as long
as framing is stripped by the application before passing the document
to the standard lihata parser. However, in this case the full
document will not be compatible with other applications and libs
parsing or emitting lihata documents. Thus it is strongly recommended
to avoid using UTF-8 BOM or encoding other headers and use a lihata node
for storing them instead.