pcb-rnd knowledge pool
File format vs. VCS vs. script and manual editing
persistent by Tibor 'Igor2' Palinkas on 2016-08-23 | Tags: insight, file format, format, native, diff, vcs, formatting |
Abstract: Persistent formatting on save: the key to get our native file format diff-friendly, thus VCS-friendly.
The problem
The file a CAD program loads and saves is likely to end up in a Version Control System (VCS) sooner or later. To make VCS more meaningful, the changes to the file need to be minimal and relevant. For example in a load-save round trip (i.e. load the file in the CAD and save it in another file without any modification) the diff should be empty. If the user adds a new object, the diff should contain only lines added for the new object. (And maybe a few lines of changes in an index, if there's an index, but I believe it's better if there's no index.) Likewise, if an object is deleted, it is expected to result a few lines removed, preferably at a single location.
Terminology used in this document:
- file is the data file, represented in a structured text format; for the whole document except for the last chapter, any format may work: xml, json, ini, lihata, etc.
-
node
a data node; may be:
- a leaf node with payload : an actual data item (like a number)
- a hash that contains an unordered set of other named nodes
- a list is an ordered list of named or unnamed nodes
The file is a tree; even an ini file is a tree with 1 or 2 levels (bottom level is a hash of text nodes, top level, groups, represent another hash). An xml is a list of lists with text nodes and non-text-node attributes representing a hash.
Problem #0: order
To achieve the above goals, both gschem and pcb and pcb-rnd are prudent: objects are ordered in memory. This is necessary because in an unordered list or hash if the number of objects changes (objects added or removed) the saving order may shuffle, causing a lot of unnecessary changes.
This problem is already solved.
Problem #1: canonical form
Our file formats are plain text. While there are certain constructs that have strict format, e.g. a string in quotes can be described only one way without changing the content of the string, other parts have alternatives:
- white spaces and other field separators
- numerical formats: is 8 the same as 08 or 8.0?
- pcb flag formats, old/new style: the same object can be described using different syntax constructions, sometimes resulting the same object
This problem is seemingly solved by using a canonical form: even if the loader may accept alternatives, the save code always generates the same formatted output, down to the last whitespace indentation.
Problem #2: hand editing, external scripts
However, the canonical format is not always the most convenient for the user. Sometimes different indentation, maybe even locally, may improve readability. Unfortunately a CAD round-trip brings back the file to canonical format, removing any user formatting. Or in other words, the user needs to learn the syntax of the file format and how the canonical format looks like, else user changes to a file may cause VCS noise later.
Even worse, 3rd party scripts need to learn the canonical format too, for the same reason.
The solution
One solution is to merge the canonical format and the syntax, so the resulting specification is so strict that there are no alternatives but everything can be written using only a single, well described format. This solves problem #1 and #2, but creates problem #3: a text file format as rigid as a binary format would be. While it could be still parsed by text tools easily, it would be much harder to emit properly using the same tools or even harder to edit manually, with a text editor. In this sense it may start losing its most important text-file-property: free form, to some extent.
An alternative is that the CAD program doesn't save canonical format over an existing file but tries to accommodate to the format the file already features. This solves problem #0, #1, #2 and #3, allowing the user (or 3rd party scripts) to use whatever formatting they prefer, with clean round trips.
How to implement it (in theory)
A few unusual features are required:
- F1: the parser needs to be aware of otherwise unimportant characters (e.g. whitespace in indentation)
- F2: if (payload) data is to be converted to a native format by the program, the original text representation needs to be saved; e.g. the input file contains 4.99999999, the program loads it to a float and later saved back, it may end up as 4.999997836; instead if the data did not change, the original text shall be saved back; this also solves the 005 vs 5 vs 5.0 problem
- F3: new data should be inserted, obsolete data should be removed without reordering stationary data
The obvious approach
An obvious approach would be to smarten the parser and store delimiters (e.g. whitespace) and the original text version for each data node, along with the order of nodes in the original file.
The less obvious approach
A second approach is to ignore these details on load and keep on storing data in native binary format in memory. On save, start parsing the original file (or a snapshot of it taken load-time), and:
- any non-data-payload character should be copied to the output; this includes whitespace, delimiters, comments, syntax sugar; this solves F1
- if a node did not change, the parser copies it byte-to-byte to the output ( solves F2 partially - no format change if data is unchanged in memory);
- value change check: while parsing, convert the original value to native format again and compare, which solves F2 : the same string-> native conversion should yield the same native result; if our in-memory representation differs, it has changed and the new value should be copied to the output instead of the old value; the format of the new value doesn't matter, can be canonical, since the diff is already caused by the fact that the value has changed.
- if the language supports ordered lists: before emitting list items, the input list must be read through (pre-read) and the in-memory and just-parse lists need to be compared to detect changes in order; existing list items are copied as described above; list items deleted in-memory are ignored; list items added in-memory are written out ( solves F3 )
- if the language supports unordered hashes: check each key parsed; if it exists in-memory, copy/write it normally; if it doesn't, delete it; add new hash items (the ones present in-memory but not seen when the parser reaches the end of the hash; solves F3 )
This may sound more expensive than the first approach, because the parser needs to run again on save. However, the costs of the previous approach are high too:
- essentially the whole file needs to be stored in-memory (for whitespace), but not in one big chunk but in many little chunks among with the data nodes (which leads to allocation overhead); the original values need to be stored in text too, to solve the numeric problems
- either the same string-> native conversion needs to be done using stored text values; or the code needs to keep track of which node changed using some data-changed bit
- the same list/hash merging needs to be done
If the syntax-sugar overhead of the file format is low, majority of all bytes will be actual node data (need to be stored) and delimiters/indentation (need to be stored). Then it might be cheaper to store it all in one big chunk and re-parse it (without building a DOM) than storing the same data in small chunks.
An even less obvious approach
Take the previous method, but assume the application does not change the order of objects on ordered lists and each object has some sort of unique identifier (so two identical looking objects can't be on a list). This makes list merging much simpler and doable in a single pass, keeping a list item pointer with the in-memory list:
- initialization: set the pointer to the first element of the list
- iteration: parse the first (next) item of the list
-
check if it exists in-memory and is the first
- if yes, copy from the original file, increase the pointer by one element
- if not, check if it got deleted (not present on the list in-memory) and skip it if so; pointer stays in place
- if not, check if it is on the in-memory list as a later entry; if so, insert all preceding entries from the in-memory list, increasing the pointer by one for each element written
- repeat the iteration until there are no more items on the input list
- check if there are items remaining from the pointer, if so output them (they will end up at the end of the list in the output file)
- finished: the output file's list now matches the in-memory list.
Empty lists may need special (but trivial) treatment.
While copying an existing node, it is possible to "pick up" style e.g. indentation. When a new item needs to be added, such picked up local style info can be used to generate output that looks similar to other items in its local environment. For example the first time a list or hash item needs to be added is right after at least one item of the given hash or list is already parsed, which means the style of an item is already known.
How it is implemented in practice
Using the last approach from the above list, with an utility library for lihata. The full list of requirements (on the application and the tree structure):
- the root element of the input file and the in-memory representation must fully match
- list items must have names that are unique to the list, even across item removal from the list within a session (so a newly added item never has the same name as a past item parsed back from the file)
- the application must not change the order of elements on a list - new elements may be inserted anywhere and old elements may be removed from anywhere but the order of any two remaining old elements must be preserved
Conclusions after 1.5 years of production use
While the above approach works in practice, there are a few suboptimal properties, thus I am considering to implement an alternative to experiment with.