minilibs vs. megalibs

1. Introduction

Life is so simple when we write our first few dozen programs. They are usually small, focused on some simple and well defined task. We learn how to use the standard library shipped with the language of our choice, and usually ignore the rest of the world. Sometimes we reinvent and reimplement code that is already available in 3rd party libraries - but it is all fine, as these projects are throw-away example programs, byproducts of the learning curve.

When we start writing production code, we face the usual dilemmas of programmers: should we implement a piece of functionality locally, or depend on a 3rd party library? And if we depend on a library, should we pick a small, specialized one or a large, popular one?

This article collects considerations that may help finding a sustainable balance in these questions. The conclusion presents a practical solution to some of the problems discussed.

2. The hidden/unexpected knowledge

Even smallish parts or subproblems often turn out to be knotty and time consuming to implement. It's usually relatively easy to implement a prototype, but things get more complicated if the code has to be efficient, portable, generic and behave reasonably in corner cases.

This suggests depending more on libraries. The programmer who wrote the general purpose library for the subproblem probably spent much time on it, evaluating different algorithms and APIs. The library has probably already received a lot of testing and would work much better out-of-the-box than new code.

However, these are assumptions, and should not be trusted blindly. Sometimes the library is not implemented with the same requirements/preferences as the project that tries to use it. Sometimes the library is simply not implemented properly. A well designed API does not automatically mean a proper implementation or thoroughly tested code. Popularity doesn't imply quality either. The user should download the source of the library and evaluate its code quality.

If the library is old, with a lot of users on a lot of systems, it doesn't only mean that it's well tested; it often means it's also large: some users had special needs, and when the library does 99% of what a user needs, it's hard to resist adding the missing 1%. This may encumber the library with code (and API) for rarely used special cases that may later interfere with each other or with the normal case code.

Large code is hard to review. It's often more expensive to read the code of a library than to just implement an alternative from scratch (for the relevant part). Unfortunately there is a third option: use the lib without review, blindly trusting the "brand" (e.g. "it's large, it's popular, it can't be that bad").

This in turn may lead to another problem: what if the library doesn't do exactly what is needed in the given use? The match between what the library offers and what the application needs is obviously never 100%. The application code may need to do some extra work (e.g. converting its own idea of the world to the API of the lib); it needs to accommodate. This works well, up to a point, but sometimes the difference is so large that it'd be easier to change the library and save the extra cruft. If the library is small and easy to understand, this is a real option, while it rarely happens with libs that span hundreds of thousands of source lines.

This sometimes leads to more application code spent on the interfacing than what a local implementation of the functionality would cost.

The argument against reimplementation is exactly the hidden knowledge accumulated in the library: all the experience in handling corner cases, all the testing on different systems, etc.

From another perspective, this hidden knowledge is probably better to be collected and maintained at one place, in a library, instead of many places (local implementations in different applications).

3. Reuse vs. dependency

Libraries are usually used as dependencies. The project assumes the right version of the library is already installed on the system. Dependencies may cause various problems for end users, distribution packagers or even developers.

3.1. Getting the dependency

Dependencies are usually hosted by a 3rd party. There are obvious risks involved: the hosting may be down exactly when the dependency is needed, the project may move or disappear altogether, or the published content may change. Thus some software seeks alternative solutions, such as bundling the dependency or importing its source into the project's own tree. These alternatives obviously do not work as well with megalibs.

3.2. Dependency versioning

Libraries evolve, APIs change. Programs are usually developed on a few developer boxes with at most a few versions of the dependent libs installed. It is sometimes hard to determine which version intervals of the libs are really suitable for the application. This may lead to requiring a specific version, or a too-new version, "just to be on the safe side" (instead of testing with all available versions).

Large libraries have more parts that can change, and because of the extra interdependencies, it is more likely that one change triggers another. Because of their sheer size and complexity, mega libraries are less likely to stabilize, i.e. reach a state where the developers say "done, finished, the library already supports everything we ever wanted it to". This leads to a constant stream of changes and version number bumps, which makes it harder to determine API compatibility precisely.

3.3. API changes and forward porting

The other issue this leads to (in an oversimplified way) is developer time wasted on forward porting the application. For example, take an application that uses gtk for its GUI. It started out using gtk1.2; a considerable amount of developer time was spent on understanding and learning the API and developing the initial code. About 3 years after gtk1.2 was released, a new version, gtk2.0, came out.

After some time, systems stopped supporting gtk1.2 and the developers had to spend time forward porting the gtk1.2 GUI module to gtk2.0. Some widgets were removed, new widgets were introduced, support infrastructure and APIs changed. All in all, the developers had to spend their time forward porting the application not because they were necessarily fond of the new features, nor because there was strong user demand for the new features of gtk2, but often only because when major OS distributions stop supporting gtk1.2, an application that depends on gtk1.2 is also left behind.

Gtk3.0 followed gtk2.0 9 years later. The same cycle repeats: as of now (2016) we can happily use gtk2.0 on most systems, but there are signs that applications should switch to gtk3.0.

So an application that depends on gtk needs to budget developer time for forward porting the GUI. Users of the application need to spend some cycles learning a new GUI over and over. Not because they want a new GUI every 6..10 years, but because they have to follow the trends or the application bit rots and becomes unusable. There is no way around this: even gtk1.2 is too big to be fully shipped with a random application, and forward porting gtk1.2 itself to new libc and X or windows versions would be even more expensive.

If an application depends on multiple large libraries and tools (gtk, glib, guile, boost, autotools, etc.), each such dependency will have its own forced-forward-port cycle. Since such large libraries often have more developers than small projects or applications do, this may result in a situation where a big chunk of the application developer's time is spent on following dependencies.

Fortunately there is an alternative. When picking dependencies for a new project, estimate how much developer time will be available in the long run and, looking at the history and trends of the dependency, estimate how much time it will take just to keep up with it. It may turn out that going for the given dependency is more expensive in the long run than going for another solution that has higher first-time costs but pays off later. To avoid burning excess developer time on forward porting, consider picking fewer and smaller dependencies, preferring libraries with a stable API, or choosing libraries small enough to be bundled with the application.

3.4. Security

Among the costs of dependencies, security should be considered too. Much of our software runs on networked hosts. Running any sort of software increases the surface through which the host can be attacked. Ideally the developer writes the software with great care not to leave exploitable vulnerabilities in the release. With large software the user won't review the code and needs to trust the developer on this.

If the release depends on libraries, the programmer will follow similar rules: more-or-less blindly depend on the quality of libraries above a certain size. However, there's an extra level of risk involved: even if both the application alone and the library alone are safe, the combination may be unsafe. A complex, opaque API and the sheer volume of the library source code may make the application programmer use the API without fully understanding what happens in the library code, which in turn may lead to unforeseen vulnerabilities.

4. Additive vs. subtractive design

In manufacturing mechanical parts, processes are often categorized as additive or subtractive. The classic way is subtractive machining: take a large chunk of raw material and keep on removing mass and deforming the part until the remaining material is shaped as the design requires. Additive methods do the inverse: start from void and add material to build up the part - 3d printers or casting work like that.

Building software using libraries is often a surprisingly similar process. Some libraries are designed as a set of small, weakly dependent parts that can be freely combined. The user employs the additive process: they select the "raw material" their program needs and add it the way the program needs it. There's usually very little excess material added in this process, and most often such burr is tolerable.

Other libraries are designed to provide a large set of interdependent infrastructure; when the program starts using one part, it can hardly avoid using others. This is the subtractive method: by default all features are used and the user needs to manually remove (turn off, or even work around) excess features that are not needed or are sometimes even harmful.

Whether a library ends up being used in a subtractive or additive way is more or less hardwired in the design of the library API. The subtractive method is hard to avoid when using megalibs like glib or gtk. On the other hand, the additive method is easier to achieve with minilibs, as they often focus on a single problem.

5. Conclusion

The programmer should always carefully evaluate the alternatives before settling on a given library (or a local implementation of the feature). Choosing to depend on a large library may have long term effects on the project; these effects should not be ignored. Factors like the popularity of the library or the large amount of features offered should not be reasons to skip evaluating the risks introduced by the size of the library and its API. The evaluation should not consider short term effects only, but long term effects too (e.g. release cycles of the library and expected forward porting efforts).

The outcome of the evaluation may indeed be to go with a large lib. In other cases it turns out using a small, specialized library is better. For reference, a collection of such mini libraries follows in the appendix.

6. Appendix: a collection of minilibs



data types

genht - 620 sloc, standard: C89/C99, cross platform parts: no, type generic: yes
    Dynamic hash tables with user specified key and value type, user supplied "compare keys" and "hash key" functions.
    Alternative for: glib2 [445542]

genlist - 813 sloc, standard: C89, cross platform parts: no, type generic: yes
    Doubly linked lists with or without automatic element allocation. List item type is supplied by the caller. Works well with both dynamic and static memory allocation.
    Alternative for: glib2 [445542]

genvector - 579 sloc, standard: C89, cross platform parts: no, type generic: yes
    Dynamic arrays (vectors) of user specified element type: grow and shrink as needed, insert, append, delete elements. Dynamic string is just an instance of a vector configured for char *. Options for: maintaining a terminator record, initializing/constructing/destructing records with user callbacks, user supplied alloc()/realloc()/free().
    Alternative for: glib2 [445542]


util-lib

genhvbox - 627 sloc, standard: C89, cross platform parts: no, type generic: no
    Display independent hbox/vbox packer: specify a tree of boxes with constraints (minimal sizes, alignment, padding, spacing) and the host window size, and it calculates box and widget coordinates and sizes.

genregex - 881 sloc, standard: C89, cross platform parts: no, type generic: no
    Reentrant, compile-execute regular expressions for strings and binary data.

gentrex - 1245 sloc, standard: C89, cross platform parts: no, type generic: no
    Regular-expression-like language designed for trees. The tree data structure is provided by the user (over callback functions). Gentrex can find or list matching nodes.

libacas - 1203 sloc, standard: C89, cross platform parts: yes, type generic: no
    Atomic compare-and-swap based locks, multi-reader-writer thread safe queues.
    Alternative for: glib2 [445542]

libporty - 3218 sloc, standard: C89, cross platform parts: yes, type generic: no
    Portability layer: file system, date/time/sleep, networking (async networking, files, stdio).
    Alternative for: apr [69708], glib2 [445542]

lihata - 3607 sloc, standard: C89, cross platform parts: no, type generic: no
    File format parser for lihata; parses a text tree of lists, hashes, tables and texts. The event parser is about 550 sloc, on which an optional DOM is built. The event parser is designed to be suitable for embedded environments.
    Alternative for: libxml2 [169445], expat [11469], json-c [2985]

libendstr - 243 sloc, standard: C89/C99, cross platform parts: no, type generic: no
    Functions to ease building strings by appending. The API is similar to the standard library's str* and mem* functions but is designed to track the end of the string.

gensexpr - 535 sloc, standard: C89, cross platform parts: no, type generic: no
    Parses s-expressions; optionally loads them into a DOM, allowing the user to extend the tree node type with custom fields. Can export the DOM.


util

scconfig - 13538 sloc, standard: C89, cross platform parts: yes, type generic: no
    Executes a set of interdependent compile-run-compare and run-compare tests to build a database of properties of the host (and cross compilation target) system. E.g. find the C compiler, figure out whether certain libc calls work, or work properly, or how exactly they behave in corner cases, find the compilation/link flags of 3rd party libraries.
    Alternative for: autotools, cmake [195477]

sphash - 492 sloc, standard: C89, cross platform parts: no, type generic: no
    Finds a collision-free hash function and generates a hash table in C code for a set of compile-time-known strings. Such a table can be used to convert a string to an integer in O(1) time.

Legend

size: lines of source code as calculated by sloccount. Excludes regression test code and auxiliary utilities in the case of libs.

standard: all projects are implemented in C; this column shows the oldest C standard the code aims to comply with.

has cross platform parts: most of these libs are 100% portable plain C code with no dependencies. Some of them need to depend on calls that differ from system to system. These libs are marked as having "cross platform parts", which means some of the code is implemented multiple times, each instance for one of the systems supported.

type generic: applicable to data types only; the code can be configured or instantiated for different user supplied types. The code is type safe (different instances have different types that don't mix, e.g. a "string-to-int" hash lookup is a different type from an "int-to-pointer" lookup).

alternative for: the project is a good alternative for the well known public project(s) listed in this column. Since some of these projects have a huge code base, a minilib is often an alternative for only parts of the project. The project size is calculated with sloccount, not counting regression tests and auxiliary utilities wherever possible.