pcb-rnd knowledge pool


Code change statistics

chgstat by Tibor 'Igor2' Palinkas on 2017-11-17

Tags: insight, project, management, statistics, fork, history




Abstract: The code change statistic is an attempt to measure how much of the code base has changed since the fork. It is calculated by a script, using svn metadata.


NEWS: 75%

On November 5th, 2018, we reached 75%. This means that at the moment, at least 3/4 of the code lines are either new or have undergone some significant change.

As this number rises, the accuracy decreases. It is still an under-estimate because of a lot of noise factors, especially the code-move-mask effect: if we move some code in a refactoring and it contained old code, we have to mark the whole commit 'old', which counts some new code as old too.

Because of the increasing inaccuracy and the administrative overhead, I decided to stop calculating this value at 75%.

This statistic has fulfilled its purpose: it shows that pcb-rnd has made a major change, not just in words, features and bugfixes, but in pure statistics as well.

Below is the old document explaining how this statistic was calculated.

1. About

Code change has many aspects. The other day a user asked me to list what's new/different/more in pcb-rnd compared to pcb mainline, and I realized we have so many new things that a complete list would be so long that nobody would bother to read it.

It is possible to highlight a few features, but the selection is subjective. For example, what took the most time was the data model rewrite, which led to removing all layer and footprint limitations and to a unified padstack model. This cost many, many hundreds of hours of developer time. On the other hand, many of our users love fp_wget, which transparently integrates edakrill and gedasymbols into the library window - that feature literally cost 2 weekends, total.

The code change statistic is just one aspect, an objective one. But interpretation of the method and the result is obviously subjective. My interpretation is that it shows how many lines of pcb-rnd code do not look similar to their corresponding mainline code lines. This means code lines that got a major change, or code lines that are new in pcb-rnd and never existed in mainline.

2. Technical details: how it is calculated

The chgstat script runs svn blame on all source files to get each source line prefixed by the svn revision number in which it last changed. The main idea is that if a line changed after the fork, it's new; if it's part of the original import, it's old. At the end, the percentage is derived as new/(old+new).

The fine print: we mask out a lot of changes. For example, if a file gets split into two files, a file gets renamed, code gets moved within a file, or code gets reindented, we tell chgstat to treat that revision as old.
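The counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the real chgstat script: the fork revision and the masked revision numbers below are made up, and the blame text is a hand-written stand-in for real svn blame output.

```python
FORK_REV = 1                # revision of the original import (example value)
MASKED_REVS = {1200}        # example revisions masked as 'old' (splits, renames, reindents)

def classify(blame_output, fork_rev=FORK_REV, masked=MASKED_REVS):
    """Count old vs. new lines from `svn blame`-style output.

    Each input line looks like: '  1234   author   some code'.
    A line is 'new' only if its last-change revision is after the
    fork AND that revision is not masked out."""
    old = new = 0
    for line in blame_output.splitlines():
        fields = line.split(None, 2)
        if not fields:
            continue
        rev = int(fields[0])
        if rev > fork_rev and rev not in masked:
            new += 1
        else:
            old += 1
    return old, new

# Hand-written sample standing in for `svn blame` output:
blame = """\
     1  igor2  int main(void) {
  4021  igor2  pcb_hid_init();
  1200  igor2  moved_old_code();
  5310  igor2  new_feature();
"""
old, new = classify(blame)
print(old, new, round(100.0 * new / (old + new)))  # -> 2 2 50
```

Note how revision 1200 is counted as old despite being newer than the fork: masking a code-move revision this way is exactly why the final percentage is an under-estimate.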

This method is not perfect, and it has some trivial shortcomings. We follow a rather conservative policy: when we are in doubt, we rather classify the revision as 'old'. This, combined with the masking effect, means the errors of the method favor 'old'. In other words, the result is a lower estimate of the actual change.

So what does 1% mean? Our current codebase (as of November 2017) is about 230k lines, so 1% means about 2300 lines of code.

3. Why do we have this statistic?

The only reason is to have something that is remotely objective. I started to do this stat at around 25%. Back then the code modularization and namespace cleanup contributed the most. It was fun to watch the number rise as new I/O plugins were added and the data model was rewritten.

It also somewhat shows the sheer amount of low level codecraft that takes place under each high level feature addition or infrastructure refactoring. It shows that the huge changes in pcb-rnd are not only theoretical: they are not only about thinking things over and designing mechanisms, but also about a lot of typing.

But it's important to keep this statistic in its right place, and not overestimate its importance. What makes pcb-rnd so good is not how many lines of code we put in, but what the code does and how much effort we invest.

4. Fork vs. rewrite

At any time you are reading this article, we have a rather high code change percentage. At the moment of writing, it's over 65%. Users sometimes ask: "so if we/you changed more than half/two-thirds/three-quarters of the code, wouldn't it have been easier to just rewrite it from scratch?". The short answer is clearly no, forking was the better way. The longer answer is:

With cschem we take a different approach and write it from scratch. The reason is that gschem/gnetlist/lepton is not very close to what we want to end up with. The differences are large, both on the design level and on the actual code level.