File format vs. VCS vs. script and manual editing
The problem
The file a CAD program loads and saves is likely to end up in a Version
Control System (VCS) sooner or later. To make VCS more meaningful, the
changes to the file need to be minimal and relevant. For example in a
load-save round trip (i.e. load the file in the CAD and save it in another
file without any modification) the diff should be empty. If the user adds a new
object, the diff should contain only lines added for the new object. (And maybe
a few lines of changes in an index, if there's an index, but I believe it's
better if there's no index.) Likewise, if an object is deleted, it is expected
to result a few lines removed, preferably at a single location.
Terminology used in this document:
- file is the data file, represented in a structured text format;
for the whole document except for the last chapter, any format may work:
xml, json, ini, lihata, etc.
- node a data node; may be:
- a leaf node with payload: an actual data item (like a number)
- a hash that contains an unordered set of other named nodes
- a list is an ordered list of named or unnamed nodes
The file is a tree; even an ini file is a tree with 1 or 2 levels (bottom
level is a hash of text nodes, top level, groups, represent another hash). An
xml is a list of lists with text nodes and non-text-node attributes
representing a hash.
Problem #0: order
To achieve the above goals, both gschem and pcb and pcb-rnd are prudent:
objects are ordered in memory. This is necessary because in an unordered
list or hash if the number of objects changes (objects added or removed)
the saving order may shuffle, causing a lot of unnecessary changes.
This problem is already solved.
Problem #1: canonical form
Our file formats are plain text. While there are certain constructs that have
strict format, e.g. a string in quotes can be described only one way without
changing the content of the string, other parts have alternatives:
- white spaces and other field separators
- numerical formats: is 8 the same as 08 or 8.0?
- pcb flag formats, old/new style: the same object can be described
using different syntax constructions, sometimes resulting the same
object
This problem is seemingly solved by using a canonical form: even if the
loader may accept alternatives, the save code always generates the same formatted
output, down to the last whitespace indentation.
Problem #2: hand editing, external scripts
However, the canonical format is not always the most convenient for the user.
Sometimes different indentation, maybe even locally, may improve readability.
Unfortunately a CAD round-trip brings back the file to canonical format, removing
any user formatting. Or in other words, the user needs to learn the syntax
of the file format and how the canonical format looks like, else user
changes to a file may cause VCS noise later.
Even worse, 3rd party scripts need to learn the canonical format too, for the
same reason.
The solution
One solution is to merge the canonical format and the syntax, so the resulting
specification is so strict that there are no alternatives but everything can
be written using only a single, well described format. This solves problem
#1 and #2, but creates problem #3: a text file format as rigid as a binary
format would be. While it could be still parsed by text tools easily, it
would be much harder to emit properly using the same tools or even harder to
edit manually, with a text editor. In this sense it may start losing its
most important text-file-property: free form, to some extent.
An alternative is that the CAD program doesn't save canonical format over
an existing file but tries to accommodate to the format the file already features.
This solves problem #0, #1, #2 and #3, allowing the user (or 3rd party scripts)
to use whatever formatting they prefer, with clean round trips.
How to implement it (in theory)
A few unusual features are required:
- F1: the parser needs to be aware of otherwise unimportant
characters (e.g. whitespace in indentation)
- F2: if (payload) data is to be converted to a native format by
the program, the original text representation needs to be saved;
e.g. the input file contains 4.99999999, the program loads it to
a float and later saved back, it may end up as 4.999997836;
instead if the data did not change, the original text shall be
saved back; this also solves the 005 vs 5 vs 5.0 problem
- F3: new data should be inserted, obsolete data should be removed
without reordering stationary data
The obvious approach
An obvious approach would be to smarten the parser and store delimiters
(e.g. whitespace) and the original text version for each data node, along with
the order of nodes in the original file.
The less obvious approach
A second approach is to ignore these details on load and keep on storing data
in native binary format in memory. On save, start parsing the original file
(or a snapshot of it taken load-time), and:
- any non-data-payload character should be copied the the output;
this includes whitespace, delimiters, comments, syntax sugar;
this solves F1
- if a node did not change, the parser copies it byte-to-byte to
the output (solves F2 partially - no format change if data is
unchanged in memory);
- value change check: while parsing, convert the original value to
native format again and compare, which solves F2: the same
string->native conversion should yield the same native result;
if our in-memory representation differs, it has changed and the
new value should be copied to the output instead of the old value;
the format of the new value doesn't matter, can be canonical, since
the diff is already caused by the fact that the value has changed.
- if the language supports ordered lists: before emitting list items,
the input list must be read through (pre-read) and the in-memory and
just-parse lists need to be compared to detect changes in order;
existing list items are copied as described above; list items
deleted in-memory are ignored; list items added in-memory are
written out (solves F3)
- if the language supports unordered hashes: check each key parsed;
if it exists in-memory, copy/write it normally; if it doesn't,
delete it; add new hash items (the ones present in-memory but
not seen when the parser reaches the end of the hash; solves F3)
This may sound more expensive than the first approach, because the parser needs
to run again on save. However, the costs of the previous approach are
high too:
- essentially the whole file needs to be stored in-memory
(for whitespace), but not in one bug chunk but in many
little chunks among with the data nodes (which leads to
allocation overhead); the original values need to be stored in
text too, to solve the numeric problems
- either the same string->native conversion needs to be done using
stored text values; or the code needs to keep track of which node
changed using some data-changed bit
- the same list/hash merging needs to be done
If the syntax-sugar overhead of the file format is low, majority of all
bytes will be actual node data (need to be stored) and delimiters/indentation
(need to be stored). Then it might be cheaper to store it all in one bug chunk
and re-parse it (without building a DOM) than storing the same data in small
chunks.
An even less obvious approach
Take the previous method, but assume the application does not change the
order of objects on ordered lists and each object has some sort of unique
identifier (so two identical looking objects can't be on a list).
This makes list merging much simpler and doable in a single pass, keeping
a list item pointer with the in-memory list:
- initialization: set the pointer to the first element of the list
- iteration: parse the first (next) item of the list
- check if it exists in-memory and is the first
- if yes, copy from the original file, increase the pointer by
one element
- if not, check if it got deleted (not present on the list
in-memory) and skip it if so; pointer stays in place
- if not, check if it is on the in-memory list as a later entry;
if so, insert all preceding entries from the in-memory list,
increasing the pointer by one for each element written
- repeat the iteration until there are no more items on the input list
- check if there are items remaining from the pointer, if so output
them (they will end up at the end of the list in the output file)
- finished: the output file's list now matches the in-memory list.
Empty lists may need special (but trivial) treatment.
While copying an existing node, it is possible to "pick up" style
e.g. indentation. When a new item needs to be added, such picked up local
style info can be used to generate output that looks similar to other items
in its local environment. For example the first time a list or hash item
needs to be added is right after at least one item of the given hash or
list is already parsed, which means the style of an item is already know.
How it is implemented in practice
Using the last approach from the above list, with an utility library for
lihata. The full list of requirements (on the application and the tree
structure):
- the root element of the input file and the in-memory representation
must fully match
- list items must have names that are unique to the list, even across
item removal from the list within a session (so a newly added item
never has the same name as a past item parsed back from the file)
- the application must not change the order of elements on a list -
new elements may be inserted anywhere and old elements may be
removed from anywhere but the order of any two remaining old
elements must be preserved