Lihata C API
1. Introduction
Lihata C API is a set of libraries layered on top of each other in a way
the user can choose how many their application code wants to use. The layers
are:
- core constants and types and associated utility functions (lihata.h)
- event parser (parser.h) - depends on core
- DOM parser that builds a tree (dom.h) - depends on the event parser
- tree utils (tree.h) - depends on the DOM parser
Size, code complexity and API complexity of the layers are increasing from
the core toward the higher levels.
Each level of the stack is designed to be reasonably reentrant:
- the application may maintain multiple lihata documents
- threaded apps: concurrent read/write operations on different documents will work
- threaded apps: concurrent read operations are guaranteed to work on a single document
- threaded apps: concurrent read while write will not work on a single document: the application should not attempt to change a document while other threads are reading it
- threaded apps: concurrent write operations will not work on a single document and the application is responsible for implementing locking mechanism
This is achieved by using document handles on all levels. These handles
store all internal data and parser states of the document and the library
does not depend on global variables. For many read operations (most notably
tree print functions) any call-local storage is allocated on the stack so
multiple concurrent calls on the same non-changing tree shall work.
No layer of the library depends on threading or thread libraries. For
non-threaded applications the above concurrency rules restrict operations
that can be executed from callback functions. In threaded applications
the write operations shall put exclusive read-write lock on the lihata
document to make sure no read sessions are confused by the write. For example
a recursive tree print of a lihata document in one thread may produce undefined
behavior if the tree is changed during the descend.
2. The core
The core library is not useful alone, but is required by all above
layers. It is dealing with basic types and constants derived from
the specification and possible parse errors.
3. The event parser
The event parser consumes the document character by character in
a non-blocking manner. The caller is responsible for feeding the parser
until it returns an error or PE_STOP. Normally the caller shall
pass on the document without any filtering (even passing the EOF). If
the parser detects an error or a valid EOF (root node closed), it stops
parsing and returns PE_STOP upon subsequent calls. This means
the event parser ignores anything beyond the root node.
Parsing a stream with a single root node behaves exactly as parsing a
file document. When parsing a streams without a root node, the caller
should reinitialize the event parser after the root node is closed.
Detecting when the root node is closed can be implemented in different
ways:
- check if two subsequent calls return PE_SUCCESS and PE_STOP
- trace depth of nesting in the event handler; when the outmost node is closed, the root is closed
The event parser will generate open/close events in pair for nodes that may
have children. The close event is anonymous - if the event callback needs to
pair up close events with node events, or remember the node name or type
at close, it should maintain its own internal database of open nodes.
Node types that can not have children will trigger only a single event
that will hold all properties of the node (i.e. text or symlink nodes
will not have open/close events but a single textdata event).
The event parser is not required to remember more than the maximum
two textual tokens and the type of the current node. This allows the
caller to set an upper limit of memory usage of the event parser. However,
this also means the event parser can not do all checks and validation
of the document the lihata specification requires, and it is the
responsibility of the caller to implement those:
- keys in a hash must be unique
- children of a table must be lists
- rows of a table should have the same length
Note: the DOM parser does implement all these checks.
The event parser is recommended when the application is not
required to have random access on the data but can process it
sequentially. This often means the document consists of lists/tables
and the application expects a specific order of data; or
the document is built of hashes and the application sets
the value of variables by name. The simplest example expects
a single list or hash node as the root with setting nodes of
known order and/or name.
NOTE: This should not mean the order of nodes matters under
a hash.
4. The DOM parser
4.1. generic concepts: documents, trees, nodes
The DOM parser uses the output of the event parser to build
an in-memory representation of a lihata document. To distiguish between
them the latter will be called lihata stream in this document, and
will mean a lihata document read from or written to a file, a network connection,
a pipe, etc; document will always mean the in-memory repersentation.
The sequence for loading a stream into a document (lht_doc_t *):
- allocate a new empty document using lht_dom_init()
- parse a lihata stream into the document:
- feeding the stream in lht_dom_parser_char();
- or loading the stream from a named file using lht_dom_load();
- or loading the stream from an open FILE * using lht_dom_load_stream();
- handle errors and discard the document if necessary
There are accessors for many non-trivial functionality, but there are
no accessors for the trivial ones. Instead, the relevant fields of the public
struct (lht_doc_t) should be read (or in rare cases written) directly. The
most important fields are:
- lht_node_t *root: lihata documents are required to have a single root node; in a non-empty document this is the root node.
- lht_node_t *unlinked: a linked list of nodes which are part of the document but are not part of the tree
The same rule applies on nodes; the node struct (lht_node_t) has the
following public fields:
- ->name: the lihata name of the node (malloc()'d for each node)
- ->type: node type enum
- ->data: holds the value of the lihata node in one of the data structs depending on type
- ->next: the next sibling of the node (on the same level of the tree), or NULL if this node is the last sibling of its level
- ->parent: pointer to the parent node, or NULL if the current node is the root of the document (or the floating subtree, see later)
- ->doc: pointer to the document the node is part of (or NULL for floating nodes/subtrees, see later)
- location information (see later)
After loading a lihata stream the document is integral
which means all nodes are part of the tree and are accessible from
the root. There are four ways a node or subtree can be removed
from the tree:
Note: the above calls take a node pointer to be unlinked/detached/deleted/moved.
Getting such node pointers is usually done by resolving paths using the
path API or manually descending the tree using lower level calls and/or
struct fields.
It is possible to create floating nodes with the
low level node creating calls. A floating node is not part of
any document and does not have location information. All administration
regarding to floating nodes is the duty of the caller. A floating
subtree can be destroyed by deleting it using lht_tree_del().
4.2. location
A document is most often loaded from a stream and not
created from scratch by the caller. During loading, the DOM parser
attaches location information to each node (stream name and line number).
After merging trees the location references of a single document may
point to different stream names.
The following three read-only fields are used to store the location in
the node_t struct:
- file_name - the name of the stream
- line - line number of the node
- col - column number within that line
To save memory and speed up location handling, each document has a hash
table of all stream names ever referenced by any location of any
node that ever have been part of the document. file_name may be
NULL (e.g. for nodes created after the load operation).
4.3. data
The ->data field of node_t holds the value of the node in the following format:
- ->type == text: ->data.text.value is a read-write char * malloc()'d;
- ->type == symlink: same as text;
- ->type == list: ->data.list->first and ->data.list->last are pointers to the first and last, respectively;
- ->type == hash: ->data.hash->tbl is the genht string->pointer hash (pointers are lht_node_t *);
- ->type == table: ->data.table->cols and ->data.table->rows are the number of columns and rows in use, ->r is an array of rows (each row an array of lht_node_t * cells), ->row_names is a char * array of names of the rows.
NOTE: anything else than text and symlink value should be accessed
using high level accessors as described in section 5.
4.4. doc uninit
When a document is no longer needed, lht_dom_uninit() should
be called on it to free up all resources, including the unlinked list.
4.5. iterator
Two functions and a mostly opaque struct are provided for iterating
over the children of a node (regardless of the type of the node).
The loop should call lht_dom_first() before the first
iteration and lht_dom_next() for subsequent calls.
The state pointer (lht_dom_iterator_t *) usually points to
a local variable which doesn't need to be initialized before
the first iteration or cleaned up after the last iteration.
The iterator considers only the direct (first-level) children
of the node.
The caller is free to break the loop. Inserting new children
or deleting existing children under the node may cause undefined
behaviour.
4.6. debug/diganostic
Function lht_dom_pnode() is provided for printing a node
non-recursively and function lht_dom_ptree() for
printing a subtree recursively. Both calls get a node argument,
an output FILE * and a prefix which is inserted before each line
printed. Output is a line based textual format intended for diagnostics,
often including internal states/pointers.
4.7. building a tree
Allocating a new floating node is possible using the
lht_dom_node_alloc(). A floating node can become
part of a tree by:
- becoming the root of an empty document (node->doc = doc; doc->root = node) or can be
- inserted or appended to a list
- inserted in a hash table
- replacing an existing node in a tree by the new node (see 4.8)
4.8. per-type calls
The following sections describe special, type-specific accessors. They
are useful for low level operations, but very often can be avoided
by using higher level API described later. For example instead of
using the node->next field in a list or addressing from row;col
loops in a table, the iterator API can visit each child of a node
regardless of the node type. When seeking a node or descending the
tree, the path API is the type-independent generic way.
4.8.1. list
Calls lht_err_t lht_dom_list_append() and
lht_dom_list_insert() are provided for growing lists.
Counting the number of elements on a list is done by
int lht_dom_list_len(). There is no type-independent
API for these.
Looking up children nodes can be done using the higher level
path API, or the list-specific calls lht_tree_list_nth(),
lht_tree_list_nthname().
Deleting/detaching/unlinking/replacing children can be done by the
generic API or by lht_tree_list_del_nth,
lht_err_t lht_tree_list_del_child,
lht_tree_list_detach_nth,
lht_err_t lht_tree_list_detach_child,
lht_tree_list_replace_child.
4.8.2. hash
Inserting a node in a hash is done using lht_dom_hash_put().
Children nodes can be counted by lht_tree_hash_len().
There is no high level equivalent for these calls.
Looking up a node in a hash, by name is possible using
lht_dom_hash_get() or by using the path API.
Deleting/detaching/replacing children can be done by the
generic API or by
lht_tree_hash_del_child(),
lht_tree_hash_detach_child(),
lht_tree_hash_replace_child().
4.8.3. table
There is no direct way of inserting nodes in a table. Instead,
a row or column of anonymous empty strings can be inserted using
lht_tree_table_ins_row() and
lht_tree_table_ins_col() then any empty (or non-empty)
cell can be replaced using the generic lht_tree_replace()
(after looking up the node pointer of the cell).
It is possible to delete a whole row or column using
lht_tree_table_del_row() and lht_tree_table_del_col().
When a single cell is deleted using lht_tree_del() it is replaced
by the anonymous empty string.
Looking up the node pointer of a cell can be done by the
path API or by lht_dom_table_cell() and
lht_tree_table_named_cell().
4.9. node replace
Any node of a tree can be replaced with a detached node using
lht_tree_replace(). The replace
operation detaches the original node from the tree (and its document)
and links the new node in place. The detached old node (and subtree)
becomes floating and shall be destroyed (or inserted in another document)
by the caller.