gentrex - tutorial

1. Introduction, environment

1.1. terminology

1.2. The example application and the example tree

The example application implements the following hard-wired tree:

Examples of this document assume the above tree.

Example queries can be run locally using the example_app shipped in src/ or online .

1.3. Queries and iterations

In this simplistic test application, queries are always evaluated from the root of the example tree. Real applications may chose to start evaluation from whichever node of a tree. The evaluation performs the following operations:
  1. execute the evaluation from the root node provided by the caller
  2. set map variables as specified in the query script
  3. return the value of the evaluation (which is FALSE or a reference to a node)

A query is an expression written in the gentrex language. The language is a mixture of C-style expressions, regex syntax and tree operators.

Some queries may yield multiple results (by doing searches on the tree). The caller may run such a query in a loop, listing all non-false results until there are no more results. This is exactly how the example app works. Of course some applications may be interested in the first match only so they do not run multiple iterations of evaluation.

In the example app map settings are reset to the initial state between iterations. The initial state is specified on the command line (or the web interface).

2. Simple match queries

The simplest form of gentrex queries ask whether the tree from a given root node have a specific predefined shape. These queries are sort of like hardwired masks: the mask either matches to the tree or not. There is always a single result.

2.1. Match the root node

Consider the following query:
[name == A]

Evaluation starts by defining a Node Under Test (NUT) - at the beginning, this is the root node passed on by the caller (the root of the example tree in the example app). The [] block is a tag_expr which is evaluated on the NUT. It says "return NUT if it's name is A", else return FALSE. Since the root of the example tree has a name tag whose value is "A", the root node is the result.

The very same query can be written using double quotes around the tag value:

[name == "A"]

This is useful if the value contains non-alphanumeric characters.

tag_expr is an expression that may contain operators. The unary ! operator inverts the boolean value of the expression:

[!name == A]

will result in FALSE: it asserts NUT is a node whose name is not A, which is false. In the command line example app "false" appears as no output.

Since our root node is called "A", we can also match the root node by saying "it is not called Z" (or any other name would do that differs from "A"):

[!name == Z]

An even simpler method matching the root is specifying a tag_expr that would match any node. This is a corner-case of the syntax, the empty tag expression, which is defined as "matches any existing node":

[]

2.2. Complex tag_exprs

tag_expr also features logic operators || ("or") and && ("and"). For example
[name=="A" || name=="B"]

will still match the root, as it allows both "A" and "B" to be the name. It is possible to combine them using parenthesis to specify precedence:

[(name=="A" || name=="B") && (!name=="C")]

Evaluation is lazy:

In case of tag_expr this merely means some speedup (as long as the caller doesn't have side effects, such as debug prints, on the string match operation).

2.3. Tree operators: changing the NUT

So far we've been matching the root various ways. In practical queries it's more often that whole subtrees should be matched, not a single root node. For that, we need to relocate (hop) the NUT. The following example will result in the first child of the root node (if it exists):
[] . []

The first [] makes sure the root is matched - more accurately: match [] on NUT, the result is the NUT if it matches (and [] always matches). The "." operator means "make the NUT the first child of the current NUT". The second [] matches on the new NUT. The result of the expression is the new NUT, node B. (Strictly speaking what really happens is that the "." operator evaluates it's left side first, which will result in FALSE or in a node; if it's a node, it sets NUT to this node and evaluates the right side.)

Node E is "the first child of the first child" is obviously

[] . [] . []

The only way these queries may fail is when the structure of the tree doesn't match (there's no child node under the root or under the first child of the child). On the example tree the following expression will evaluate to false for that reason:

[] > []

The ">" operator is similar to the "." operator and means "the right sibling of NUT". As there are no siblings of the root node, it will evaluate to false.

In practice such queries are interested in whether the subtree matches a given structure and tag data. The following example (that won't work in the example app since there are no "color" and "shape" tags defined there) demonstrates such a query:

[!shape == ""] . [color=="red" || color=="blue"] > []

It matches root only if root has a non-empty shape tag, the first child of the root is red or blue (color-wise), and there is a second child too, although there is no restriction as of what tags this second child shall have.

There is also a parent operator called "^". The following expression evaluates to the root only if it's first child is called B:

[] . [name == "B"] ^ []

The '[] . [name == "B"]' part matches only if the first child is called B; the value of the expression is node B, tho. The rest moves NUT back to the parent using the "^" operator and the always-match tag_expr. At the end, the value of the expression is either false, or the NUT.

If query root is the actual root of the tree (like in the example app), the following expression must evaluate to false:

[] ^ []

(Since the root node can not have a parent.)

Finally the / operator says "any of the children of the current node". Unlike the above tree operators that will always evaluate to a single definite node, the / operator may evaluate to multiple nodes, and thus will be discussed in section 4.

2.4. Logic operators

So far example queries contained a single tag_expr or a series of tag_exprs with NUT hops between them. Gentrex also supports logic operators || and && on the outer level. For example the following two expressions are equivalent:
	[name==A || name==B]
	[name==A] || [name==B]
	

The second variant first evaluates its left side on NUT if it's false, evaluates the right side. The result of the || operator is the first part that evaluated to true. Similar considerations apply to &&, except its "laziness" is inverted (won't evaluate the right side if the left side is false).

The real use of expression level logics is that each side of the operator starts out from the same NUT and can change its own NUT. For example the following expression "searches" the first 3 children of the root node for node "C":

([] . [name==C]) || ([] . [] > [name==C])  || ([] . [] > [] > [name==C])
Note how each section of the || starts describing the pattern from root, which is the NUT for the || in this example. Using parenthesis, the NUT can be "saved":
 [] . (([name == C]) || ([] > [name == C]) || ([] > [] > [name == C])) 

the first "[] ." combo sets the NUT to the first child of the root, and the three-fold || expression is evaluated on this new NUT.

The "!" operator negates the result.

![]

is false if there's a root;

!![]

is double negation which results in the original value. Taking the "search-for-C" example:

 !([] . (([name == E]) || ([] > [name == E]) || ([] > [] > [name == E]))) 

this will be true only if none of the first three children of the root is called "E". When it is true, it evaluates to B, since that's the leftmost expression of the ||.

3. The map

3.1. Backreferences

It is often useful to save results of a complex expression and do further analysis on it. Consider a slightly modified version of the previous example that "searched" for node C under the root:
([] . r=(([name == C]) || ([] > [name == C]) || ([] > [] > [name == C])))

This version features a "name=expr" part, which saves the node found by that expression in the map under name "r". This does not change the final value of the expression in any way, merely saves a node in a map. However, this node can be used later in the expression to reference the node found here:

([] . r=(([name == C]) || ([] > [name == C]) || ([] > [] > [name == C]))) && (r . [name==H]) 

This will not only find node C, it will also check whether it's first child is H. This can not be done easily without such backreference: the NUT for the right side of the && defaults back to the root node, as that was the generic context of the && operator. Note how the reference to r stands instead of a match to the NUT; such a reference changes NUT to the referenced node immediately, without any check.

One step further we can also say C must not be the third child of A:

([] . r=(x=([name == C]) || ([] > [name == C]) || ([] > [] > [name == C]))) && (r . [name==H]) && (!x) 

This stores the first child in x, and the result of the "search" in r. If the first child is C, x will be C, and the expression after the last && evaluates to false, making the whole expression false. If the first child is not C, x is FALSE, making that last check true. However, this is getting tricky, as the following example fails:

([] . r=(([name == C]) || ([] > [name == C]) || x=([] > [] > [name == C]))) && (r . [name==H]) && (!x) 

It is supposed to be the same, except checking for the 3rd child not to be C. The reason it fails is that the second child is C, so || gets lazy and does not evaluate the part that involves setting the value of x. If the value of a map slot is uninitialized, it's not true or false - negating it will not make it true. A trivial workaround is to move the 'x=...' part at the beginning of the ||, ensuring it will be evaluated no matter how the rest of the || goes.

Another workaround is the set q false as the first step of the expression. There's no dedicated "false" operator; however, it's easy to generate a value that's true: force-match a node; then negating it will result in a value that's false:

x=![]

Now that this evaluates to false, x is false, the next step is to make it part of the evaluation. It needs to run before anything else, and the rest should run even if the value of this part is false. Operator || does exactly that: 'x=![] || something' will:

(x=![]) || ([] . r=(([name == C]) || ([] > [name == C]) || x=([] > [] > [name == C]))) && (r . [name==H]) && (!x) 

But does it have to be this complicated? No, there's a simpler, more straight forwards solution, without backreference for x:

([] . r=(([name == C]) || ([] > [name == C]) || ([] > [] > [!name == C]))) && (r . [name==H]) () 

x was introduced only to demonstrate map variable initialization.

Also, in section 4, a simpler and more generic syntax for searching is introduced.

Note: in the command line tool it's possible to print map variables using the command line arguments -p "qw" (where qw is a set of the lowercase map names that should be printed after each result). The online version has checkboxes for each map variable for the same purpose.

3.2. Detailing results

Another use of the map is to save specific nodes for the caller. The expression evaluates to a single node, but the caller is often interested in the steps the result is produced; this will be demonstrated in section 4 in details. A simple example is:
[] . ((q=w=[name==C]) || q=[] > w=[name==C])

This expression evaluates to C if C is the first or second children of the root. In any case, q points to the first child, and w points to node C. The caller doesn't need to look up children or implement branches, just fetch q and w.

3.3. Passing parameters

Such communication between the caller and the evaluation is bidirectional: the caller may initialize map variables before the call. For example if map variable r is set to C, the following query will result in node I:
r . [] > []

Note: in the command line tool it's possible to initialize map variables using the command line arguments r=C (in general map=NODE). The online version has input boxes for each map variable for the same purpose.

The example app resets the map to the same initial state before each query.

3.4. self reference

Section 3.3. demonstrated how self reference works as a basis of tree operators. However, sometimes a backreferenced node should be evaluated using a tag_expr. Or in other words, the backreferenced node itself should be the expression, not it's relative, thus a tree operator is needed that does not hop but stays in place, setting the NUT to the referenced node. This operator is called ':' and is demonstrated by the following not-very-elegant example:
([] . q=[]) && (q : [!name==D]) 

This takes the first child of the root on the left side of &&, saving its value in q; then checks q to make sure its name is not D, on the right side.

Actually both sides of : can is an expression, so the left side does not have to be a reference and the right side does not have to be a tag_expr:

([] . []) : [!name==D] 
This selects the first child of the root, and makes it the NUT for the right side of the ':' operator. Since the ':' operator is just a tree operator, similar to ., the parenthesis is not even required on the left side:
[] . [] : [!name==D] 

This could be translated into: "there's a root (first []), which has a child (second []) and the name of that child is not D". Of course as a further simplification we get back to the canonical form that does the same:

[] . [!name==D] 

which suggests the self reference is really useful in combination of back referencing map variables.

4. Simple searches: loop operator

4.1. Sibling loop (zero, one or more times)

The example in section 3 "searching" the first 3 children of the root node is not extensible. If we wanted to search all children, as the tree grows and more children appears under the root node, the expression would need to grow too. To overcome this limitation, gentrex offers loop operators. Loop operators can be appended to tree operators to express that the tree hopping should happen multiple times. For example iterating over all children of the root is:
[] . [] >* []

This expression says "there is a root node (1st []), it has a child (2nd []); repeat the > operator on this child zero or more times to get to the final NUT(3rd [])". The first child of the root is found only because * allows the > operator to perform zero times as well. It is very much like the * operator in regex.

This expression has multiple results: the looped operator translates into an actual loop that will keep on yielding results until there's no more siblings. On the example tree, it will list all three children of the root, B, C and D.

Using this iterator, the simplest forms of "find the child of root that is called C":

[] . [] >* [name == C]

The same can be done using backreference and a logic operator:

 ([] . [] >* n=[]) && (n:[name == C]) 

Semantically what >* does is taking the result of "[] . []"; then it also takes all its siblings, one by one, and evaluate the whole expression with different amount of hops substituted into the >* operator. In some sense this is generating all possible combinations of the three explicit nodes specified by the tag expressions. The number of combinations (or iterations) depends only on the shape of the tree. For each iteration the whole expression is evaluated. If the result would be false, the next iteration is performed automatically until it evaluates to true or there are no more combinations to check. Effectively this will list all the results on combinations for whcih the expression evaluated to true followed by a single false result that terminates the list.

This mechanism allows listing all non-C nodes with a simple expression:

[] . [] >* [!name == C]

However, determining that there was no C node under the root at all is harder. The following expression does not solve the problem, it does the same as the previous one:

!([] . [] >* n=[name == C])

TODO: This is a problem that will be addressed in the next release of gentrex.

4.2. Other one direction loops (zero, one or more times)

Looping the > operator formulate a mechanism that hops in "horizontal" direction under a given parent. The vertical equivalent is looping the . operator:
[] .* []

returns a list of first-childs including the root node (A, B, E), and

[] .+ []

returns a list of first-childs not including the root node (B, E).

It is also possible to loop the parent node. The left side of && of the folowoing expression selects C/H/N, the right side loops parents until C is found:

([]. [] > [] . [] . t=[]) && (t ^* r=[name == C])

The expression is true if there's a parent called C, but evaluates to N (the left side of the logic operator). C is saved in r.

Note: there's a much simpler way selecting N, as it will be demonstrated in the following section.

4.3. The / operator

Slash is a tree operator that iterates over all children of the given NUT, even without a loop operator. The following expression selects the child node of root that is called C:
[] / [name==C]

The / operator is a loop itself, that operates on a single level, iterating through nodes whose direct parent is the NUT. Without a loop operator, the / operator is the same as the combination of a . and a >* operator; the above example written out longer is:

[] . [] >* [name==C]

For this use the / operator would not be required in gentrex. However, combining the . and the > operators it is not possible to build a full search of a subtree. A looped / operator will do a BFS (breadth-first search), keeping the loop count to the relative distence of the current level from the original NUT. A full BFS for node N from the root is:

[] /* [name == N]
The iteration counter is 0 for the root (since we used * for the loop, root is considered too); 1 for B, C and D, 2 for E, F, G, H, I, J, K, L and M. For N it is 3, and for Y it is 4. This will be kmore significant from section 4.5.

4.4. Nested loops

Let's decide if the tree matches the following criteria: The most straight-forward way to script this is just write all criterias in:
 ([!name == C] /* [name == C] /* l=[name == Q]) && !(l . []) 

The left side of && makes sure root is not C but there is a node C under root and there's a Q somewhere under C. The right side with the backreference using 'l' is required to make sure Q is a leaf (it shall not have children).

The left side is an example of nested looping. The [!name == C] /* [name == C] part of the expression will be evaluated to all nodes under the root, including the root: 26 nodes total. From all these nodes, there will be exactly one that matches, node C. When this happens, NUT is set to C and the /* l=[name == Q] loop is run on the new NUT. This will iterate over 10 nodes (including C) and will be true exactly once. This is the only time the left side of && is true, and the right side us run. Thus the total number of iterations taken is about 26+10, 36. Later in this chapter another example will demonstrate how easy it is to modify the above expression to get 26*10=260 iterations.

It is possible to cut back on iterations taken while evaluating the expression: if root is not C, and this is ensured by the first tag_expr, there's no need to consider the root node in the first loop:

 ([!name == C] /+ [name == C] /* l=[name == Q]) && !(l . []) 

Since each node has a single name, the leaf node can't be called both C and Q; this modification cuts another iteration:

 ([!name == C] /+ [name == C] /+ l=[name == Q]) && !(l . []) 

There is another, more substantial change in the meaning of the new expression: it explicitly requires that there are at least three levels of the path that mathces: the root, a level with C and a level with Q. The /* allowed match-on-level-0, and the requirement of multiple levels sneaked in through names being unique and that we required three differnet names.

If we are looking for all nodes under the root that has Q in its subtree, we should use the following expression:

 ([] /+ r=[] /* l=[name == Q])
This will be true three times, for r values C, I and Q. It is not true for A, because the frist loop operator is + (that requires r to be different from the root), but is true for Q, since the second loop operator allows level-0, so that r == l.

The total number of iterations is 25*25=625, from which only 3 will be result in a true value. This is an important consideration: nested the looped operator, especially BFS operations, can easily increase the number of iterations considreably. When this is a concern, especially on large trees, care should be taken to formulate the expression in alternate ways:

([] /+ l=[name == Q] ^* r=[]) && !(r:[name == A])
This takes all nodes under the root (25 iterations), picks the only one whose name is Q (1), and iterates over all its parents (4, since ^* allows the node itself to be an iteration). This means 25+4=29 iterations total. Since the original problem did not want the root to be found, an && operator had to be added with the right side backreferencing the parent making sure it's not the root. The other side-effect of this rewirte is that results are listed in reverse order.

The "&& with backreference from the right side to the left side" often can be replaced with the "self reference" operator. A more compact form of the above example is:

([] /+ l=[name == Q] ^* r=[]) : [!name == A]
All considerations for the left side of ':' are the same they were for the left side of the &&. The value of the left side is the same as the value of the r (the last NUT in that expression). The ':' makes the right side expression evaluate with the NUT set to the result of the left side. Since the final result of the expression is the result of the right side, the effect is the same as before. This is an idiom for filtering the result of searches.

4.5. Numeric ranges

Beside the open ended + and * loop operators, gentrex also supports ranged loops in regex-style, using the {} syntax:
operator syntax meaning (in iteration count)
{3} only the 3rd match (or level for /)
{4-4} only the 4th match (or level for /)
{2-4} second, third and fourth match (or level for /)
{0} 0th match; this is a special case, this means the current NUT (and si equivalent to the : operator) for the following operators: > . ^ /

If numeric ranges would allow infinity, the * operator could be written as {0-inf} and the + operator could be written as {1-inf}. While the syntax of gentrex doesn't allow inf, the grammar does exactly those substitutions for + and * and the virtual machine implements only the {integer-integer} notation.

For example listing the first level of nodes below the root is

[] /{1} []

which is equivalent to

[] / []

Similar to that, the first sibling of node stored in map slot 'r' is

 r >{1} [] 

or

 r > [] 

The following expression will list E only:

 [] .{2-4} []
It doesn't list A or B, because those are level 0 and 1 for the child operator; if E would have a child, that would be listed too, and it's grandchild under that node too.

Since iteration count is the distance from the NUT for the / operator, the following expressions will list Y and Z (the only nodes being on level 4 from A):

[] /{4} []
[] /{4-8} []

5. Case studies

This section features random examples that try to demonstrate how gentrex expressions are constructed through a series of steps. This approach is about building simpler expressions which are easier to test and then combine them.

5.1. list all leaf nodes

First let's list all nodes:
[] /* [] 

We have to use * here for the corner case where the root is the only node in the tree (it should be accepted as a leaf then).

Next, we need to decide whether each node is a leaf. A node is a leaf if it does not have children - or in other words, if it does not have a first child. Getting the first child of a node is:

[] . []

Running this on root will result in B. If we negate the expression we get true only if there was no first child (it's false from the root on the example tree):

!([] . [])

The final step is combining te two: run the second expression on each node in the root. The obvious combination would be:

([] /* [])  &&  !([] . [])

but this does not work: the right side of && has the same NUT as the left side, so any time the left side takes the next node, the right side checks wheter root is a leaf.

We need to pass on information between the two sides. One way is using backreference: save the current NUT at the end of the left side and backreference that node from the right:

([] /* n=[])  &&  !(n . [])

A slightly trickier solution is to use the self-reference operator, which can be translated to "take the final NUT of the left side and apply the right-side expression on it":

([] /* []) : !([] . [])

An alternative translation would be "test that there is no children of the node (right side) found by the left side".

5.2. list all non-leaf nodes

The trivial answer is to invert the result of 5.1.:

!(([] /* n=[])  &&  !(n . []))

However, it seemingly fails on the self-referenced variant:

!(([] /* []) : !([] . []))

The result of the expression in the first case is derived as:

Inverting the self-referenced version fails to work because the ":" operator evaluates to the right side. If we are interested in the parent (the node that did not have a child), we definetly need the left side's NUT as the result.

The && based expression can be simplified by removing the double inversion:

([] /* n=[])  &&  (n . [])

which translates to "find all node n that does have a child and return n".

5.3. regex "anchors" vs. gentrex explicit searches

Let's decide whether a tree has a node combination so that a node's name is C and its first child is called H. In gentrex, this is:
[] /* [name == C] . [name == H]

An analog example in regex would be "search for character C followed by H", which is written as:

CH

Now accept C and H only if C is the root; in gentrex:

[name == C] . [name == H]

and in regex:

^CH

The difference is conceptual: regex performs a search on the whole range of the input string by default and tries to match the pattern against any portion of it. Gentrex always starts matching at the root. In regex we need to explicitly mark the beginning (or end) of the string, sort of anchoring the expression there; in gentrex we need to explicitly mark searches.

5.4. combination of nodes

List all "paths" in the tree that has 3 nodes a, b and c so that b is in the subtree of a and c is in the subtree of c.

A simpler problem is listing all "paths" which consists of a and b where b is in the subtree of a:

[] /* a=[] /+ b=[]

The first /* operator will BFS all nodes, including the root and saves them in map variable a. The /+ part runs a BFS under a to find a b that differs from a (because we used + instead of * here, b must be at least on level 1 from a). Note: since we used nested loops, the search iterates a lot: 26 in the first /* multiplied by the number of nodes in each subtree (ends up at multiple hundreds of iterations for this small tree).

Also note that the list of results is ordered because both / operators perform BFS in deterministic order.

Adding a third level is trivial:

[] /* a=[] /+ b=[] /+ c=[]