pcb-rnd knowledge pool


File format vs. VCS vs. script and manual editing

persistent by Tibor 'Igor2' Palinkas on 2016-08-23

Tags: insight, file format, format, native, diff, vcs, formatting

node source



Abstract: Persistent formatting on save: the key to get our native file format diff-friendly, thus VCS-friendly.


The problem

The file a CAD program loads and saves is likely to end up in a Version Control System (VCS) sooner or later. To make VCS more meaningful, the changes to the file need to be minimal and relevant. For example in a load-save round trip (i.e. load the file in the CAD and save it in another file without any modification) the diff should be empty. If the user adds a new object, the diff should contain only lines added for the new object. (And maybe a few lines of changes in an index, if there's an index, but I believe it's better if there's no index.) Likewise, if an object is deleted, it is expected to result a few lines removed, preferably at a single location.

Terminology used in this document:

The file is a tree; even an ini file is a tree with 1 or 2 levels (bottom level is a hash of text nodes, top level, groups, represent another hash). An xml is a list of lists with text nodes and non-text-node attributes representing a hash.

Problem #0: order

To achieve the above goals, both gschem and pcb and pcb-rnd are prudent: objects are ordered in memory. This is necessary because in an unordered list or hash if the number of objects changes (objects added or removed) the saving order may shuffle, causing a lot of unnecessary changes.

This problem is already solved.

Problem #1: canonical form

Our file formats are plain text. While there are certain constructs that have strict format, e.g. a string in quotes can be described only one way without changing the content of the string, other parts have alternatives:

This problem is seemingly solved by using a canonical form: even if the loader may accept alternatives, the save code always generates the same formatted output, down to the last whitespace indentation.

Problem #2: hand editing, external scripts

However, the canonical format is not always the most convenient for the user. Sometimes different indentation, maybe even locally, may improve readability. Unfortunately a CAD round-trip brings back the file to canonical format, removing any user formatting. Or in other words, the user needs to learn the syntax of the file format and how the canonical format looks like, else user changes to a file may cause VCS noise later.

Even worse, 3rd party scripts need to learn the canonical format too, for the same reason.

The solution

One solution is to merge the canonical format and the syntax, so the resulting specification is so strict that there are no alternatives but everything can be written using only a single, well described format. This solves problem #1 and #2, but creates problem #3: a text file format as rigid as a binary format would be. While it could be still parsed by text tools easily, it would be much harder to emit properly using the same tools or even harder to edit manually, with a text editor. In this sense it may start losing its most important text-file-property: free form, to some extent.

An alternative is that the CAD program doesn't save canonical format over an existing file but tries to accommodate to the format the file already features. This solves problem #0, #1, #2 and #3, allowing the user (or 3rd party scripts) to use whatever formatting they prefer, with clean round trips.

How to implement it (in theory)

A few unusual features are required:

The obvious approach

An obvious approach would be to smarten the parser and store delimiters (e.g. whitespace) and the original text version for each data node, along with the order of nodes in the original file.

The less obvious approach

A second approach is to ignore these details on load and keep on storing data in native binary format in memory. On save, start parsing the original file (or a snapshot of it taken load-time), and:

This may sound more expensive than the first approach, because the parser needs to run again on save. However, the costs of the previous approach are high too:

If the syntax-sugar overhead of the file format is low, majority of all bytes will be actual node data (need to be stored) and delimiters/indentation (need to be stored). Then it might be cheaper to store it all in one big chunk and re-parse it (without building a DOM) than storing the same data in small chunks.

An even less obvious approach

Take the previous method, but assume the application does not change the order of objects on ordered lists and each object has some sort of unique identifier (so two identical looking objects can't be on a list). This makes list merging much simpler and doable in a single pass, keeping a list item pointer with the in-memory list:

Empty lists may need special (but trivial) treatment.

While copying an existing node, it is possible to "pick up" style e.g. indentation. When a new item needs to be added, such picked up local style info can be used to generate output that looks similar to other items in its local environment. For example the first time a list or hash item needs to be added is right after at least one item of the given hash or list is already parsed, which means the style of an item is already known.

How it is implemented in practice

Using the last approach from the above list, with an utility library for lihata. The full list of requirements (on the application and the tree structure):

Conclusions after 1.5 years of production use

While the above approach works in practice, there are a few suboptimal properties, thus I am considering to implement an alternative to experiment with.