Lihata C API

1. Introduction

Lihata C API is a set of libraries layered on top of each other in a way the user can choose how many their application code wants to use. The layers are:

core constants and types and associated utility functions (lihata.h)
event parser (parser.h) - depends on core
DOM parser that builds a tree (dom.h) - depends on the event parser
tree utils (tree.h) - depends on the DOM parser

Size, code complexity and API complexity of the layers are increasing from the core toward the higher levels.

Each level of the stack is designed to be reasonably reentrant:

the application may maintain multiple lihata documents
threaded apps: concurrent read/write operations on different documents will work
threaded apps: concurrent read operations are guaranteed to work on a single document
threaded apps: concurrent read while write will not work on a single document: the application should not attempt to change a document while other threads are reading it
threaded apps: concurrent write operations will not work on a single document and the application is responsible for implementing locking mechanism

This is achieved by using document handles on all levels. These handles store all internal data and parser states of the document and the library does not depend on global variables. For many read operations (most notably tree print functions) any call-local storage is allocated on the stack so multiple concurrent calls on the same non-changing tree shall work.

No layer of the library depends on threading or thread libraries. For non-threaded applications the above concurrency rules restrict operations that can be executed from callback functions. In threaded applications the write operations shall put exclusive read-write lock on the lihata document to make sure no read sessions are confused by the write. For example a recursive tree print of a lihata document in one thread may produce undefined behavior if the tree is changed during the descend.

2. The core

The core library is not useful alone, but is required by all above layers. It is dealing with basic types and constants derived from the specification and possible parse errors.

3. The event parser

The event parser consumes the document character by character in a non-blocking manner. The caller is responsible for feeding the parser until it returns an error or PE_STOP. Normally the caller shall pass on the document without any filtering (even passing the EOF). If the parser detects an error or a valid EOF (root node closed), it stops parsing and returns PE_STOP upon subsequent calls. This means the event parser ignores anything beyond the root node.

Parsing a stream with a single root node behaves exactly as parsing a file document. When parsing a streams without a root node, the caller should reinitialize the event parser after the root node is closed.

Detecting when the root node is closed can be implemented in different ways:

check if two subsequent calls return PE_SUCCESS and PE_STOP
trace depth of nesting in the event handler; when the outmost node is closed, the root is closed

The event parser will generate open/close events in pair for nodes that may have children. The close event is anonymous - if the event callback needs to pair up close events with node events, or remember the node name or type at close, it should maintain its own internal database of open nodes. Node types that can not have children will trigger only a single event that will hold all properties of the node (i.e. text or symlink nodes will not have open/close events but a single textdata event).

The event parser is not required to remember more than the maximum two textual tokens and the type of the current node. This allows the caller to set an upper limit of memory usage of the event parser. However, this also means the event parser can not do all checks and validation of the document the lihata specification requires, and it is the responsibility of the caller to implement those:

keys in a hash must be unique
children of a table must be lists
rows of a table should have the same length

Note: the DOM parser does implement all these checks.

The event parser is recommended when the application is not required to have random access on the data but can process it sequentially. This often means the document consists of lists/tables and the application expects a specific order of data; or the document is built of hashes and the application sets the value of variables by name. The simplest example expects a single list or hash node as the root with setting nodes of known order and/or name.

NOTE: This should not mean the order of nodes matters under a hash.

4. The DOM parser

4.1. generic concepts: documents, trees, nodes

The DOM parser uses the output of the event parser to build an in-memory representation of a lihata document. To distiguish between them the latter will be called lihata stream in this document, and will mean a lihata document read from or written to a file, a network connection, a pipe, etc; document will always mean the in-memory repersentation.

The sequence for loading a stream into a document (lht_doc_t *):

allocate a new empty document using lht_dom_init()
parse a lihata stream into the document:
- feeding the stream in lht_dom_parser_char();
- or loading the stream from a named file using lht_dom_load();
- or loading the stream from an open FILE * using lht_dom_load_stream();
handle errors and discard the document if necessary

There are accessors for many non-trivial functionality, but there are no accessors for the trivial ones. Instead, the relevant fields of the public struct (lht_doc_t) should be read (or in rare cases written) directly. The most important fields are:

lht_node_t *root: lihata documents are required to have a single root node; in a non-empty document this is the root node.
lht_node_t *unlinked: a linked list of nodes which are part of the document but are not part of the tree

The same rule applies on nodes; the node struct (lht_node_t) has the following public fields:

->name: the lihata name of the node (malloc()'d for each node)
->type: node type enum
->data: holds the value of the lihata node in one of the data structs depending on type
->next: the next sibling of the node (on the same level of the tree), or NULL if this node is the last sibling of its level
->parent: pointer to the parent node, or NULL if the current node is the root of the document (or the floating subtree, see later)
->doc: pointer to the document the node is part of (or NULL for floating nodes/subtrees, see later)
location information (see later)

After loading a lihata stream the document is integral which means all nodes are part of the tree and are accessible from the root. There are four ways a node or subtree can be removed from the tree:

unlinking: Tree manipulation, especially removing or relocating parts of the tree may result in situations where some nodes are not part of the tree but are still part of the document - such unlinked nodes are temporarily stored in the unlinked list. It is impossible to save these nodes into a stream when exporting a document, since they are not part of the single rooted tree. The unlink feature is useful for memory management: the application may keep pointers to nodes which are not part of the tree while the lib can keep track of such nodes and free them when the document is uninitialized. In other words, node pointers stay valid until the uninitialization of a document even if a node is unlinked from the tree.
Function lht_tree_unlink() unlinks a subtree within its document. The node is placed on the unlinked list and node->parent will be NULL. This also means the children of the node can not be accessed from the document root anymore.
moving between documents: TODO
detaching: Detaching unlinks the node from the tree, modifies the ->doc field (recursively) so that they won't point to the document and the subtree is not inserted in the document's unlinked list. In other words the subtree is fully detached from the document but pointers pointing into the subtree stay valid. A new document is allocated where the subtree is moved, the original node being the root. The detach_doc field of the new document is set to 1. NOTE: this is only a special case of moving between documents - the only difference is that the target document is automatically allocated instead of using an existing one. When such automatically allocated document becomes empty (root node moved or deleted), it is automatically destroyed. The caller can acquire the dummy document pointer from the node's ->doc field, but the pointer may become invalid when the document becomes empty. Manually setting the detach_doc field to 0 will turn off automatic destroy and transfers the duty of uninitializing the document to the caller. The generic (type-independent) call is lht_tree_detach().
deleting: The node and its subtree are unlinked from the tree, detached from the document and destroyed (all memory allocated for the subtree is free()'d). Any pointer still pointing into the subtree become invalid. Done by calling lht_tree_del() on a node.

Note: the above calls take a node pointer to be unlinked/detached/deleted/moved. Getting such node pointers is usually done by resolving paths using the path API or manually descending the tree using lower level calls and/or struct fields.

It is possible to create floating nodes with the low level node creating calls. A floating node is not part of any document and does not have location information. All administration regarding to floating nodes is the duty of the caller. A floating subtree can be destroyed by deleting it using lht_tree_del().

4.2. location

A document is most often loaded from a stream and not created from scratch by the caller. During loading, the DOM parser attaches location information to each node (stream name and line number). After merging trees the location references of a single document may point to different stream names.

The following three read-only fields are used to store the location in the node_t struct:

file_name - the name of the stream
line - line number of the node
col - column number within that line

To save memory and speed up location handling, each document has a hash table of all stream names ever referenced by any location of any node that ever have been part of the document. file_name may be NULL (e.g. for nodes created after the load operation).

4.3. data

The ->data field of node_t holds the value of the node in the following format:

->type == text: ->data.text.value is a read-write char * malloc()'d;
->type == symlink: same as text;
->type == list: ->data.list->first and ->data.list->last are pointers to the first and last, respectively;
->type == hash: ->data.hash->tbl is the genht string->pointer hash (pointers are lht_node_t *);
->type == table: ->data.table->cols and ->data.table->rows are the number of columns and rows in use, ->r is an array of rows (each row an array of lht_node_t * cells), ->row_names is a char * array of names of the rows.

NOTE: anything else than text and symlink value should be accessed using high level accessors as described in section 5.

4.4. doc uninit

When a document is no longer needed, lht_dom_uninit() should be called on it to free up all resources, including the unlinked list.

4.5. iterator

Two functions and a mostly opaque struct are provided for iterating over the children of a node (regardless of the type of the node). The loop should call lht_dom_first() before the first iteration and lht_dom_next() for subsequent calls. The state pointer (lht_dom_iterator_t *) usually points to a local variable which doesn't need to be initialized before the first iteration or cleaned up after the last iteration.

The iterator considers only the direct (first-level) children of the node.

The caller is free to break the loop. Inserting new children or deleting existing children under the node may cause undefined behaviour.

4.6. debug/diganostic

Function lht_dom_pnode() is provided for printing a node non-recursively and function lht_dom_ptree() for printing a subtree recursively. Both calls get a node argument, an output FILE * and a prefix which is inserted before each line printed. Output is a line based textual format intended for diagnostics, often including internal states/pointers.

4.7. building a tree

Allocating a new floating node is possible using the lht_dom_node_alloc(). A floating node can become part of a tree by:

becoming the root of an empty document (node->doc = doc; doc->root = node) or can be
inserted or appended to a list
inserted in a hash table
replacing an existing node in a tree by the new node (see 4.8)

4.8. per-type calls

The following sections describe special, type-specific accessors. They are useful for low level operations, but very often can be avoided by using higher level API described later. For example instead of using the node->next field in a list or addressing from row;col loops in a table, the iterator API can visit each child of a node regardless of the node type. When seeking a node or descending the tree, the path API is the type-independent generic way.

4.8.1. list

Calls lht_err_t lht_dom_list_append() and lht_dom_list_insert() are provided for growing lists. Counting the number of elements on a list is done by int lht_dom_list_len(). There is no type-independent API for these.

Looking up children nodes can be done using the higher level path API, or the list-specific calls lht_tree_list_nth(), lht_tree_list_nthname().

Deleting/detaching/unlinking/replacing children can be done by the generic API or by lht_tree_list_del_nth, lht_err_t lht_tree_list_del_child, lht_tree_list_detach_nth, lht_err_t lht_tree_list_detach_child, lht_tree_list_replace_child.

4.8.2. hash

Inserting a node in a hash is done using lht_dom_hash_put(). Children nodes can be counted by lht_tree_hash_len(). There is no high level equivalent for these calls.

Looking up a node in a hash, by name is possible using lht_dom_hash_get() or by using the path API.

Deleting/detaching/replacing children can be done by the generic API or by lht_tree_hash_del_child(), lht_tree_hash_detach_child(), lht_tree_hash_replace_child().

4.8.3. table

There is no direct way of inserting nodes in a table. Instead, a row or column of anonymous empty strings can be inserted using lht_tree_table_ins_row() and lht_tree_table_ins_col() then any empty (or non-empty) cell can be replaced using the generic lht_tree_replace() (after looking up the node pointer of the cell).

It is possible to delete a whole row or column using lht_tree_table_del_row() and lht_tree_table_del_col(). When a single cell is deleted using lht_tree_del() it is replaced by the anonymous empty string.

Looking up the node pointer of a cell can be done by the path API or by lht_dom_table_cell() and lht_tree_table_named_cell().

4.9. node replace

Any node of a tree can be replaced with a detached node using lht_tree_replace(). The replace operation detaches the original node from the tree (and its document) and links the new node in place. The detached old node (and subtree) becomes floating and shall be destroyed (or inserted in another document) by the caller.