Lihata design decisions

0. Goals

Lihata should (in order of importance):

1. Comparing lihata to other languages

1.1. xml

Xml can represent arbitrary trees. It requires a single root, just as lihata does. Xml has less escaping overhead, especially in text content. Xml node attributes can be regarded as a hash with unique keys, embedded nodes are similar to lihata lists, and text content corresponds to lihata te nodes. In this model, list vs. hash vs. text all have entirely different syntax, based not on keywords but on punctuation marks. However, in xml an attribute cannot have a subtree, which makes this mental model non-orthogonal.

There is no compact way of representing tables in xml.

Lihata is much more compact than xml, which needs a lot of closing tags. Xml documents also usually start with a static header (xml declaration, document type).

1.2. JSON

Of all the formats described in this section, JSON is the closest to lihata. JSON supports lists and hashes very similar to lihata's; however, it distinguishes them by different brackets instead of keywords, which makes the file less readable for those not familiar with the syntax. There is no simple (small-overhead) way of representing tables. Strings require quoting, which is a large overhead for the most common cases (at least in lihata's target domain). This price is paid to let JSON distinguish data types such as number, string and boolean - a feature not supported by lihata.

1.3. wiki

A typical wiki format defines a large number of control sequences as series of punctuation marks. This makes the format much more compact than xml, but many users find the sequences hard to remember. Some of these sequences also depend on indentation (for example lists) - forgetting a space at the start of a line breaks the list.

Such wiki formats usually define lists and tables, both in a compact and human readable format, but they don't define hashes. Nodes can be embedded, allowing the document to form a tree.

1.4. html

Html offers the tree structure, text nodes, lists and tables. The syntax of modern html is not very compact and suffers from the redundancy of attributes already mentioned under xml. Html is mentioned here only because it offers a few good examples:

2. generic design decisions

2.1. minimalistic syntax vs. convenient syntax

Unfortunately I couldn't find a single format for representing lists, hashes, tables and text without too many compromises on overhead for some of them. However, it was an important design decision that the actual format(s) of the different types should be rooted in a few simple, central rules. This leads to a design that allows multiple ways of expressing things; combining them in specific ways may help decrease the syntax overhead for each data type. However, these should not be restrictive special cases: any combination should be allowed anywhere.

2.2. type prefixes

As mentioned in [R7], data types should be selected by short keywords, not by punctuation marks. The keywords li, ha, ta, te and sy are each the first two characters of the corresponding type name (list, hash, table, text, symlink). They are probably easier to remember than exotic combinations of punctuation (as in the typical wiki format), and no longer to type.
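As a rough illustration of the five keywords, a small tree could look like the sketch below. All node names and values are invented for this example, and the symlink value is assumed to be a path to another node.

```
ha:example {
	te:greeting = hello world
	li:colors { red; green; blue }
	ha:size { width = 10; height = 20 }
	ta:grid {
		{1; 2}
		{3; 4}
	}
	sy:favorite = /colors/0
}
```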

2.3. Anonymous nodes

Allowing anonymous objects on any level is an important feature of lihata. This is one of the major moves that made it possible to lay down a simple, orthogonal syntax while keeping overhead low, and it is fully aligned with [R5], [R6] and [R9].

However, anonymity should not increase code complexity on the parsing or application side, or complicate the mental model. Treating an anonymous node as a node whose name is the empty string seems to be the right decision here.

The vast majority ([R6]) of anonymous nodes will live in lists (including table rows). Many root nodes will lack a name as well. In theory a hash can also have an anonymous node (at most one), but it's far less useful there.
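A minimal sketch of where anonymous nodes typically show up, with invented content. The root list, its two text items and the hash inside it are all anonymous, while the child of the hash is named. Writing an anonymous hash as ha:{ is an assumption modelled on the li:{ form.

```
li:{
	first anonymous text item
	second anonymous text item
	ha:{
		key = value
	}
}
```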

2.4. no content type

An important design decision is that, unlike JSON, lihata has no content types. This obviously helps maintain [R5], but more importantly [R6] as well. In this regard lihata is more similar to xml. There are a few shortcomings of this decision:

Besides simplicity, another benefit is flexibility: it is easy to store type information in the existing lihata syntax, using a list or hash somewhere. This allows unusual uses, such as a field that may have multiple types, or a field whose type depends on some other properties of the record.
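One way type information could be carried in the tree itself is sketched below. The type/value convention and all field names are assumptions of this example, not part of lihata; as far as the parser is concerned every leaf here is plain text.

```
li:fields {
	ha:port {
		type = integer
		value = 8080
	}
	ha:hostname {
		type = string
		value = example.org
	}
}
```

The application, not the format, decides what to do with the type leaf, which is what makes per-record or multi-typed fields possible.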

2.5. documentation

The technical specification of the format does not describe the reason for each design decision - this helps keep that document short. However, common use cases and conventions are described there, even if those are not hardwired in parsers and code: in the final result such conventions are as important as the actual syntax.

If the format works as expected, most users will not read the design document at all.

2.6. binary files, encoding

The character format of lihata is the byte. The main reason for this is simplicity: the API works with C-style char *. For the same reason the \0 character is not allowed anywhere in the file (so that the API doesn't need to have string length parameters and the standard string libs can be used on the data).

Still, lihata is mainly a text format, so it is probably better to use base64 or a similar encoding instead of dumping binary data in the middle of the document, especially since even with brace protection such strings would need at least the closing brace escaped. Another workaround is to implement (on application level) a more complex escaping on top of lihata's, one that encodes the \0 character as some escape sequence.

However, keeping the format fully text even on application level has benefits: exporting the tree in line-oriented text formats opens up the text processing tool domain (grep, sed, awk, etc.), which would be nearly inaccessible with binary content.
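The base64 approach could look like the following sketch. The node names (icon, encoding, data) are hypothetical and the payload is simply the base64 form of the bytes "hello world". The value is brace protected to be on the safe side, since base64 padding uses the equal sign; the base64 alphabet has no closing brace, so nothing inside the braces needs escaping.

```
ha:icon {
	encoding = base64
	data = {aGVsbG8gd29ybGQ=}
}
```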

Internationalization: most encodings (e.g. latin1, latin2, utf-8) safely fit into this frame, as they do not use the ASCII \0. The only risk is an encoding that remaps }, causing unprotected brace-string termination.

Thus lihata does not define default encoding or a specific way of describing document encoding:

3. syntax design decisions

3.1. separators and indentation

Using newlines and semicolons as separators is a choice similar to some procedural languages like shell and awk. The reason is simple: in many cases (lists, hashes, rows of tables) the most natural way is to have each record/item/row on a single line.

Forcing the user to add an extra separator (as in JSON) would increase the overhead. However, using newlines only would make it impossible to describe a list or hash in a single line - especially impractical in case of a table row. With the current design a table row is not a special case, and the same feature often makes short lists more readable as one-liners.

Another alternative would be to rely on indentation, as make or python do. This design was decided against, because it restricts the user to One Good Way of indenting/representing data. If One Good Way exists, it is per user or even per task - the flexibility of switching between one-liners and verbose multi-line representation must be kept. (Enforcing one specific indentation is simply a different school, which benefits from seeing all instances of the data/code in the same style/format.)

This results in ignoring white space between tokens, which removes any dependence on indentation and gives flexibility in arranging the file as a table or "ASCII art".
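The flexibility described above can be sketched as follows; node names and values are invented. The same list is written once with newline separators and once as a one-liner with semicolons, and the hash shows free-form alignment relying on white space between tokens being ignored.

```
li:examples {
	# one item per line
	li:colors {
		red
		green
		blue
	}

	# the same list as a one-liner
	li:colors { red; green; blue }

	# values aligned into a column
	ha:limits {
		min     = 0
		max     = 100
		default = 50
	}
}
```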

3.2. escaping

Lihata tries to minimize control characters used by the syntax (but not at any price, see 2.1). Quoting with double or single quotes has various drawbacks. Instead, lihata chooses to implement backslash escaping and braces, because they are already there as part of the syntax. The backslash filter is applied the same way regardless of the context - even for brace protected strings, where the only control character to be protected is the closing brace. This only means that protecting other control characters in a brace string is optional; it is implemented this way to keep things orthogonal.

The backslash protection method has its own shortcomings, and would not have been introduced if the brace method alone had been enough (but the closing brace needs to be protected...). For example, if a lihata tree is embedded in a shell script or other media, a backslash avalanche is triggered: the number of backslashes increases or decreases depending on how deeply the lihata tree is embedded. However, [R6] applies here: it will be far more common to have lihata trees in separate files or pipes.

A combination of brace protection, backslash protection and separators/indentation (described in 3.1.) results in a situation where experienced users can write lihata files without too much escaping or syntax overhead [R5]. For example, node names can contain spaces without any escaping. A user who does not understand the mechanisms behind this can just write down anything, and in most cases it will do what is expected. However, such systems often fail. In accordance with [R10], lihata provides a fallback: a safe syntax that always works. This is the brace protected string, where the only corner case is the closing brace, making this format "[R6]-friendly".
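The protection methods described above might be combined as in the sketch below. The strings are invented, and the example only follows the rules stated in this section; it has not been checked against a parser.

```
li:strings {
	# plain text, spaces need no escaping
	hello world

	# brace protected: the semicolon is ordinary text here
	{one; two}

	# inside braces only the closing brace has to be escaped
	{a \} b}

	# backslash protection without braces
	one\; two
}
```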

3.3. the = sign

The equal sign in the syntax is very handy for letting users specify lists and hashes (key=value pairs) in a compact way. Not having the equal sign would result in constructions like:
	li:root {
		li:foo {
			key1 {value1}
			key2 {value1}
		}

		li:foo2 {
			{key1} value1
			{key2} value1
		}
	}
or would require keys and/or values to not contain space (and make space be the separator).

To avoid having the equal sign as a special case, it's an optional separator between name and value. This removes the syntax overhead in the above case when defining /foo and /foo2 - however, some may reason that the format is more uniform when the = is always used, and the user may choose to do so. Note: the above example, although ugly, is a valid lihata tree.
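For comparison, key-value pairs written with the optional = sign might look like the following sketch. With the separator in place, neither the key nor the value needs brace protection, even when it contains spaces; the second pair is invented to illustrate that.

```
li:foo {
	key1 = value1
	long key name = value with spaces
}
```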

3.4. default text and default list

The choice to make text type default is for [R6]: leaves of the tree are most often strings, and the extra overhead of always describing the type is noticeable:
		li:foo {
			te:key1 = value1
			te:key2 = value1
		}
Unfortunately a clumsy special case sneaks in when describing tables. For table rows, the only type that makes sense is the list type. Expecting the user to explicitly add the type tag for each row would be as bad as requiring explicit text tags in the previous case, so nodes directly under a table default to list instead of text. An alternative would have been to lie about table rows, making the ta payload special. That probably wouldn't have made the mental model any simpler.
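The difference the default makes for table rows can be sketched like this; the table names and cell values are invented, and the explicit form is assumed to be accepted as an equivalent spelling.

```
li:examples {
	# rows with an explicit list tag
	ta:verbose {
		li:{a; b; c}
		li:{d; e; f}
	}

	# the same rows relying on the default list type
	ta:compact {
		{a; b; c}
		{d; e; f}
	}
}
```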

3.5. tables

In this regard, wiki is the right example: a table should be represented in some tabular format. Looking at a table even from a distance should make it obvious that it is a table ([R9]). While JSON and xml can be made to represent tables without extra support from the format, the syntax overhead is rather big.

A trivial way to do the same in lihata without explicit support would be to just embed li in li. While this remains available, explicit table support is also introduced to make sure library and lihata file implementations follow the same common rule. The construction implemented in the design covers one specific case, which is more like a matrix than a simple table. The only difference from the custom li/li setup is that the parser checks the number of columns and that rows are really lists.
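A sketch of the two approaches side by side, with invented names and values. In the custom li/li setup nothing in the format stops one row from being shorter than the others; in a ta, going by the description above, the parser is expected to insist on the same number of columns in every row.

```
li:examples {
	# custom li/li setup: the second row is shorter and nobody checks
	li:ragged {
		li:{1; 2; 3}
		li:{4; 5}
	}

	# explicit table: every row is a list with three columns
	ta:matrix {
		{1; 2; 3}
		{4; 5; 6}
	}
}
```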

3.6. shortcuts

The specific set of shortcuts (which tokens are optional and what the rest would mean) is designed to fit [R5] and [R6]. This is most obviously demonstrated by the representation of an anonymous text node, which is nothing more than the text itself. The following cases are anticipated to be the most common:
li:examples {
	# anonymous text node
	blah blah blah

	# key-value pairs
	key=value

	# named list
	li:listname {
		i1
		i2
		i3
	}

	# anon list
	li:{
		i1
		i2
		i3
	}
}

3.7. path

Path syntax was chosen after careful consideration of the following:

In lists, the order of elements matters (one may argue that this is the defining characteristic of lists). Therefore it makes sense to define a syntax that provides access to the numth element of a list, and to assign the least possible overhead to this syntax.

Furthermore, elements may be named, and names are not required to be unique within a given list. The name:num format is provided to allow the user to select a specific named element from the list (the numth element called name).

The colon is chosen as separator here partly because it is already used elsewhere in the syntax, so there is no need to "taint" another character with special behavior, and partly because it allows regexps in paths without too much escaping.

The only volatile form is name: without num, which will always return the first match: there is no obvious advantage in returning any other match, and this sticks to the trivial user assumption of how the simple name form should work.

In a hash, names are unique and order is not guaranteed: only addressing by name makes sense.

The purpose of tables is to grant fast access to cells by col/row coords, thus the num1/num2 format must be provided. Here the slash was chosen as separator because even though a table is not the same as a list of lists (internally, they may be implemented via completely different data structures), the syntax to describe them is similar, so it was deemed advantageous for the path access syntax to be similar as well.

Sometimes rows or columns of the table will be named. It makes most sense if the rules for named list elements apply here as well.

Paths count from 0 in lists and in tables. The user may expect counting from 1, and the original proposal indeed did that. Unfortunately this would split the API: native binary access functions count from 0 (this is standard in C and many other languages) while path variants would count from 1. Paths were therefore switched to count from 0.
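The addressing rules above can be sketched on a small invented tree. The paths in the trailing comments omit the root, following the /foo notation used in 3.3, and assume that the num in name:num counts from 0 like plain indices do. Only diagonal table cells are shown, so the example does not depend on whether the row or the column comes first.

```
li:root {
	li:names {
		foo = first
		bar = second
		foo = third
	}
	ta:grid {
		{a; b; c}
		{d; e; f}
	}
}

# /names/1      -> second  (element number 1 of the list)
# /names/foo:0  -> first   (occurrence 0 of the name foo)
# /names/foo:1  -> third   (occurrence 1 of the name foo)
# /grid/0/0     -> a
# /grid/1/1     -> e
```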

3.8. merge

In practice config.d is very convenient for users and maintainers. Aggregating multiple lihata files from a config.d directory means plugging them together. Tree merge is a generalized version of this method, allowing the end user or maintainer to split the tree wherever it seems useful, without having to worry about the code: the code makes no assumptions. As long as two subtrees have the same structure, they can be merged.

However, this feature is a double-edged sword. With a large number of subtrees to merge, especially when split across multiple directories (system settings, user settings, project-specific settings, etc.), accidental merges may happen if two nodes unintentionally have the same name. For example, the root is a hash of named subtrees; in some cases the user wants to be able to split a subtree into two files. If the two subtrees have the same name on the root hash level, they will be merged. In other applications the content of a subtree can technically be merged with another subtree of the same kind, but on a higher level the resulting content will be useless; still, if the user unintentionally gives two subtrees the same name, the merge will take place.

The user could protect against this by defining a node with the same name in all such subtrees, making sure that node causes a collision upon merge. With standard lihata nodes this is hardly possible; the closest thing is using symlinks with random values and hoping that those values are unique (which is hard to guarantee).

A solution to this problem would be to introduce a special node type that does not hold data but causes a collision. This would cost the overhead of a new node type on all levels of the library and user applications. Another solution would be to add an extra field to hashes that controls collision; this would introduce overhead on the syntax level. The solution chosen for lihata is a special feature of existing nodes: symlinks with an empty value will not merge. The user shall place an empty symlink in all such hashes, using the same name for the symlinks.
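A sketch of the empty-symlink guard, showing two separate files in one listing. The node names (profile, no_merge, color, size) are hypothetical, and spelling the empty symlink value as {} is an assumption. Without the guard, the two profile hashes would be merged into one; with it, the two empty symlinks of the same name refuse to merge, so the accidental name clash shows up as a collision.

```
# file 1
ha:root {
	ha:profile {
		sy:no_merge = {}
		color = red
	}
}

# file 2, reusing the name profile by accident
ha:root {
	ha:profile {
		sy:no_merge = {}
		size = large
	}
}
```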