pcb-rnd knowledge pool
Code change statistics
chgstat by Tibor 'Igor2' Palinkas on 2017-11-17
Tags: insight, project, management, statistics, fork, history
Abstract: The code change statistics is an attempt to measure how much of the code base has changed since the fork. It is calculated by a script, using svn metadata.
On November 5th, 2018, we reached 75%. This means 75% of the code base is either new or has undergone some significant change since the fork.
As this number rises, the accuracy decreases. It's still an under-estimate because of a lot of noise factors, especially because of the code-move-mask effect: if we move some code in a refactoring and that code contained old code, we have to mark the whole commit 'old', which counts some new code as old too.
Because of the increasing inaccuracy and the administrative overhead, I decided to stop calculating this value at 75%.
This statistics has fulfilled its purpose: it shows that pcb-rnd made a major change, not just in words, features and bugfixes, but in pure statistics as well.
Below is the old document explaining how this statistics was calculated.
Code change has many aspects. The other day a user asked me to list what's new/different/more in pcb-rnd compared to pcb mainline, and I realized we have so many new things that a complete list would be so long nobody would bother to read it.
It is possible to highlight a few features, but the selection is subjective. For example, what took much time is the data model rewrite, which led to removing all layer and footprint limitations and to a unified padstack model. This cost many hundreds of hours of developer time. On the other hand, many of our users love fp_wget, which transparently integrates edakrill and gedasymbols in the library window - that feature literally cost two weekends, total.
The code change statistics is just one aspect, an objective one. But the interpretation of the method and the result is obviously subjective. My interpretation is that it shows how many lines of pcb-rnd code do not look similar to their corresponding mainline code lines. This means code lines that got a major change, or code lines that are new in pcb-rnd and never existed in mainline.
2. Technical details: how it is calculated
The chgstat script runs svn blame on all source files to get each source line prefixed by the svn revision number in which it last changed. The main idea is that if a line changed after the fork, it's new; if it's part of the original import, it's old. At the end, the percentage is derived as new/(old+new).
The fine print: we mask out a lot of changes. For example, if a file gets split into two files, a file gets renamed, code gets moved within a file, or code gets reindented, we tell chgstat to take that revision as old.
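The counting idea can be sketched in a few lines of Python. This is a hypothetical illustration only, not the real chgstat script; the fork revision, the masked revision set and the sample blame output below are all made up for the example:

```python
# Sketch of the chgstat idea: given `svn blame` output lines of the form
# "REV AUTHOR source-text", count a line as 'new' if its last-change revision
# postdates the fork and is not in the explicit old-mask, else count it 'old'.

FORK_REV = 1          # assumption: revision of the original import
MASKED_OLD = {200}    # assumption: e.g. a reindent-only commit, forced to 'old'

def change_percentage(blame_lines, fork_rev=FORK_REV, masked=MASKED_OLD):
    old = new = 0
    for line in blame_lines:
        rev = int(line.split(None, 2)[0])   # first field is the revision number
        if rev <= fork_rev or rev in masked:
            old += 1
        else:
            new += 1
    return 100.0 * new / (old + new)        # new/(old+new), as a percentage

# Fabricated blame output: one original line, one real change, one reindent.
blame = [
    "1 igor2 /* original import */",
    "100 igor2 int new_feature(void);",
    "200 igor2 int old_code_reindented(void);",
]
print(round(change_percentage(blame), 1))   # prints 33.3 (1 new of 3 lines)
```

Note how the r200 line counts as old even though blame attributes it to a post-fork revision: that is the masking described above, and it is also the source of the code-move-mask under-estimate.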
This method is not perfect. The most trivial shortcomings are:
- If there is a real change in r100 and then an indentation change in r200, we have to tell chgstat to account the r200 changes to the old code; this masks the new lines of r100, counting them as old too.
- The script works on physical source lines - including empty lines and even the license banners. These lines rarely change, so even if we changed all actual code of the original source, we still wouldn't see 100% in the statistics, because of the unchanged empty lines and license banners.
- On the other hand, when a new source file is added, it contains a license banner and empty lines too, and these are counted as new code lines (this probably offsets the previous loss).
- Some changes, like namespace cleanup (massive renames), are hard to categorize; when done right, they don't change how pcb-rnd works, but they make the code look different. Because of how the change statistics is defined, how much actual work a good-quality rename requires, and how much it improves the code base, we count these as change.
We follow a rather conservative policy: when in doubt, we classify the revision as 'old'. This, combined with the masking effect, suggests that the errors of the method favor 'old'. In other words, the result is a lower estimate of the actual change.
So what does 1% mean? Our current code base (as of November 2017) is about 230k lines, so 1% means about 2300 lines of code.
3. Why do we have this stats?
The only reason is to have something that's remotely objective. I started doing this stat at around 25%. Back then the code modularization and namespace cleanup contributed the most. It was fun to see how the number rose as we added new I/O plugins and rewrote the data model.
It also somewhat shows the sheer amount of low level codecraft that takes place under each high level feature addition or infrastructure refactoring. It shows that making the huge changes in pcb-rnd is not only theoretical; it's not only about thinking things over and designing mechanisms, but also about a lot of typing.
But it's important to keep this statistics in its right place and not overestimate its importance. What makes pcb-rnd so good is not how many lines of code we put in, but what the code does and how much effort we invest.
4. Fork vs. rewrite
Whenever you are reading this article, we have a rather high code change percentage. At the moment of writing, it's over 65%. Users sometimes ask: "so if we/you changed more than half/two-thirds/three-quarters of the code, wouldn't it have been easier to just rewrite it from scratch?". The short answer is clearly no, the fork was the better way. The longer answer is:
- Even though a lot of code has changed, some essential code remained. Some infrastructure or low level code was just right, and I am glad I did not have to rewrite those parts.
- In a sense, pcb already did mostly what we needed, so taking it and making the modifications was cheaper than starting from scratch. Remember: code change stat includes everything from small changes to new code.
- Unlike the git people, we have a different development model: instead of working for 2 years in isolation and then presenting the results, we are working in public. Additions of code, redesign of critical infrastructure, changes to the data model all go on the live version - there are no branches. (This forces developers to be a bit more careful about how specific changes are carried out, but it pays back many times in early user testing.) This method requires software that is already there and already works. That's not cheap to get if we write it from scratch.
With cschem we take a different approach and write it from scratch. The reason is that gschem/gnetlist/lepton is not very close to what we want to get at the end. The differences are large, both on design level and on actual code level.