minilibs vs. megalibs

1. Introduction

Life is so simple when we write our first few dozen programs. They are usually small, focused on some simple and well defined task. We learn how to use the standard library shipped with the language of our choice, and usually ignore the rest of the world. Sometimes we reinvent and reimplement code that is already available in 3rd party libraries - but it is all fine, as these projects are throw-away example programs, byproducts of the learning curve.

When we start writing production code, we face the usual dilemmas of programmers: should we implement a piece of functionality locally, or depend on a 3rd party library? And if we depend on a library, should we pick a small, specialized one or a large, popular one?

This article collects considerations that may help finding a sustainable balance in these questions. The conclusion presents a practical solution to some of the problems discussed.

2. The hidden/unexpected knowledge

Even smallish parts or subproblems often turn out to be knotty and time consuming to implement. It's usually relatively easy to implement a prototype, but things get more complicated if the code has to be efficient, portable, generic and behave reasonably in corner cases.

This suggests depending more on libraries. The programmer who wrote the general purpose library for the subproblem probably spent much time on it, evaluating different algorithms and APIs. The library has probably already received a lot of testing and would work much better out-of-the-box than new code.

However, these are assumptions, and should not be trusted blindly. Sometimes the library is not implemented with the same requirements/preferences as the project that tries to use it. Sometimes the library is simply not implemented properly. A well designed API does not automatically mean a proper implementation or thoroughly tested code. Popularity doesn't imply quality either. The user should download the source of the library and evaluate its code quality.

If the library is old, with a lot of users on a lot of systems, it doesn't only mean that it's well tested; it often means it's also large: some users had special needs, and when the library does 99% of what a user needs, it's hard to resist adding the missing 1%. This may encumber the library with code (and API) for rarely used special cases that may later interfere with each other or with the normal case code.

Large code is hard to review. It's often more expensive to read the code of a library than to just implement an alternative from scratch (for the relevant part). Unfortunately there is a third option: use the lib without review, blindly trusting the "brand" (e.g. "it's large, it's popular, it can't be that bad").

This in turn may lead to another problem: what if the library doesn't do exactly what is needed in the given use? The match between what the library offers and what the application needs is obviously never 100%. The application code may need to do some extra work (e.g. converting its own idea of the world to the API of the lib); it needs to accommodate. This works well, up to a point, but sometimes the difference is so large that it'd be easier to change the library and save the extra cruft. If the library is small and easy to understand, this is a real option, while it rarely happens with libs that span hundreds of thousands of source lines.

This sometimes leads to more application code spent on the interfacing than what a local implementation of the functionality would cost.

The argument against reimplementation is exactly the hidden knowledge accumulated in the library: all the experience in handling corner cases, all the testing on different systems, etc.

From another perspective, this hidden knowledge is probably better to be collected and maintained at one place, in a library, instead of many places (local implementations in different applications).

3. Reuse vs. dependency

Libraries are usually used as dependencies. The project assumes the right version of the library is already installed on the system. Dependencies may cause various problems for end users, distribution packagers or even developers.

3.1. Getting the dependency

Dependencies are usually hosted by a 3rd party. There are obvious risks involved: the hosting may be down exactly when the dependency is needed, the project may move or disappear altogether, or the published content may change. Thus some software seeks alternative solutions, such as bundling the dependency or importing its source into the project's own tree. These alternatives obviously do not work as well with megalibs.

3.2. Dependency versioning

Libraries evolve, APIs change. Programs are usually developed on a few developer boxes with at most a few versions of the dependent libs installed. It is sometimes hard to determine which version intervals of the libs are really suitable for the application. This may lead to requiring a specific version, or a too-new version, "just to be on the safe side" (instead of testing with all available versions).

Large libraries have more parts that can change, and because of the extra interdependencies, it is more likely that one change triggers another. Because of their sheer size and complexity, mega libraries are less likely to stabilize, i.e. reach a state where the developers say "done, finished, the library already supports everything we ever wanted it to". This leads to a constant stream of changes and version number bumps, which makes it harder to determine API compatibility precisely.

3.3. API changes and forward porting

The other issue this leads to (in an oversimplified way) is developer time wasted on forward porting the application. For example, take an application that uses gtk for its GUI. It started out using gtk1.2; a considerable amount of developer time was spent on understanding and learning the API and developing the initial code. About 3 years after gtk1.2 was released, a new version, gtk2.0, came out.

After some time, systems stopped supporting gtk1.2 and the developers had to spend time forward porting the gtk1.2 GUI module to gtk2.0. Some widgets were removed, new widgets were introduced, support infrastructure and APIs changed. All in all, the developers had to spend their time forward porting the application not because they were necessarily fond of the new features, nor because there was strong user demand for the new features of gtk2, but often only because when major OS distributions stop supporting gtk1.2, an application that depends on gtk1.2 is also left behind.

Gtk3.0 followed gtk2.0 9 years later. The same cycle repeats: as of now (2016) we can happily use gtk2.0 on most systems, but there are signs that applications should switch to gtk3.0.

So an application that depends on gtk needs to budget developer time for forward porting the GUI. Users of the application need to spend some cycles learning a new GUI over and over. Not because they want a new GUI every 6..10 years, but because they have to follow the trends or the application bit rots and becomes unusable. There is no way around this: even gtk1.2 is too big to be fully shipped with a random application, and forward porting gtk1.2 itself to new libc and X or windows versions would be even more expensive.

If an application depends on multiple large libraries and tools (gtk, glib, guile, boost, autotools, etc.), each such dependency will have its own forced-forward-port cycle. Since such large libraries often have more developers than small projects or applications do, this may result in a situation where a big chunk of the application developer's time is spent on following dependencies.

Fortunately there is an alternative. When picking dependencies for a new project, estimate how much developer time will be available in the long run and, looking at the history and trends of the dependency, estimate how much time it will take just to keep up with it. It may turn out that going for the given dependency is more expensive in the long run than going for another solution that has higher first-time costs but pays off later. To avoid burning excess developer time on forward porting, consider picking fewer and smaller dependencies, preferring libraries with a stable API, or choosing libraries small enough to be bundled with the application.

3.4. Security

Among the costs of dependencies, security should be considered too. Much of our software runs on networked hosts. Running any sort of software increases the surface through which the host can be attacked. Ideally the developer writes the software with great care not to leave exploitable vulnerabilities in the release. With large software the user won't review the code and needs to trust the developer on this.

If the release depends on libraries, the programmer will follow similar rules: more-or-less blindly depend on the quality of libraries above a certain size. However, there's an extra level of risk involved: even if both the application alone and the library alone are safe, the combination may be unsafe. A complex, opaque API and the sheer volume of the library source code may make the application programmer use the API without fully understanding what happens in the library code, which in turn may lead to unforeseen vulnerabilities.

4. Additive vs. subtractive design

In manufacturing mechanical parts, processes are often categorized as additive or subtractive. The classic way is subtractive machining: take a large chunk of raw material and keep on removing mass and deforming the part until the remaining material is shaped as the design requires. Additive methods do the inverse: start from void and add material to build up the part - 3d printers or casting work like that.

Building software using libraries is often a surprisingly similar process. Some libraries are designed as a set of small, weakly dependent parts that can be freely combined. The user employs the additive process: they select the "raw material" their program needs and add it the way the program needs it. There's usually very little excess material added in this process, and most often such burr is tolerable.

Other libraries are designed to provide a large set of interdependent infrastructure; when the program starts using one part, it can hardly avoid using others. This is the subtractive method: by default all features are used and the user needs to manually remove (turn off, or even work around) excess features that are not needed or are sometimes even harmful.

Whether a library ends up being used in a subtractive or additive way is more or less hardwired in the design of the library API. The subtractive method is hard to avoid when using megalibs like glib or gtk. On the other hand, the additive method is easier to achieve with minilibs, as they often focus on a single problem.

5. Conclusion

The programmer should always carefully evaluate the alternatives before settling on a given library (or a local implementation of the feature). Choosing to depend on a large library may have long term effects on the project; these effects should not be ignored. Factors like the popularity of the library or the large amount of features offered should not be reasons to skip evaluating the risks introduced by the size of the library and its API. The evaluation should not consider short term effects only, but long term effects too (e.g. release cycles of the library and expected forward porting efforts).

The outcome of the evaluation may indeed be to go with a large lib. In other cases it turns out using a small, specialized library is better. For reference, a collection of such mini libraries follows in the appendix.

6. Appendix: a collection of minilibs



data types

genht - 620 sloc, standard: C89/C99, cross platform parts: no, type generic: yes
    Dynamic hash tables with user specified key and value type, user supplied "compare keys" and "hash key" functions.
    Alternative for: glib2 [445542]

genlist - 813 sloc, standard: C89, cross platform parts: no, type generic: yes
    Doubly linked lists with or without automatic element allocation. List item type is supplied by the caller. Works well with both dynamic and static memory allocation.
    Alternative for: glib2 [445542]

genvector - 579 sloc, standard: C89, cross platform parts: no, type generic: yes
    Dynamic arrays (vectors) of user specified element type: grow and shrink as needed, insert, append, delete elements. Dynamic string is just an instance of a vector configured for char *. Options for: maintaining a terminator record, initializing/constructing/destructing records with user callbacks, user supplied alloc()/realloc()/free().
    Alternative for: glib2 [445542]


util-lib

genhvbox - 627 sloc, standard: C89, cross platform parts: no, type generic: no
    Display independent hbox/vbox packer: specify a tree of boxes with constraints (minimal sizes, alignment, padding, spacing) and the host window size, and it calculates box and widget coordinates and sizes.

genregex - 881 sloc, standard: C89, cross platform parts: no, type generic: no
    Reentrant, compile-execute regular expressions for strings and binary data.

gentrex - 1245 sloc, standard: C89, cross platform parts: no, type generic: no
    Regular-expression-like language designed for trees. The tree data structure is provided by the user (over callback functions). Gentrex can find or list matching nodes.

libacas - 1203 sloc, standard: C89, cross platform parts: yes, type generic: no
    Atomic compare-and-swap based locks, multi-reader-writer thread safe queues.
    Alternative for: glib2 [445542]

libporty - 3218 sloc, standard: C89, cross platform parts: yes, type generic: no
    Portability layer: file system, date/time/sleep, networking (async networking, files, stdio).
    Alternative for: apr [69708], glib2 [445542]

lihata - 3607 sloc, standard: C89, cross platform parts: no, type generic: no
    File format parser for lihata; parses a text tree of lists, hashes, tables and texts. The event parser is about 550 sloc, on which an optional DOM is built. The event parser is designed to be suitable for embedded environments.
    Alternative for: libxml2 [169445], expat [11469], json-c [2985]

libendstr - 243 sloc, standard: C89/C99, cross platform parts: no, type generic: no
    Functions to ease building strings by appending. The API is similar to the standard library's str* and mem* functions but is designed to track the end of the string.

gensexpr - 535 sloc, standard: C89, cross platform parts: no, type generic: no
    Parses s-expressions; optionally loads them into a DOM, allowing the user to extend the tree node type with custom fields. Can export the DOM.


util

scconfig - 13538 sloc, standard: C89, cross platform parts: yes, type generic: no
    Executes a set of interdependent compile-run-compare and run-compare tests to build a database of properties of the host (and cross compilation target) system. E.g. find the C compiler, figure out whether certain libc calls work, or work properly, or how exactly they behave in corner cases, find the compilation/link flags of 3rd party libraries.
    Alternative for: autotools, cmake [195477]

sphash - 492 sloc, standard: C89, cross platform parts: no, type generic: no
    Finds a collision-free hash function and generates a hash table in C code for a set of compile-time-known strings. Such a table can be used to convert a string to an integer in O(1) time.

Legend

size: lines of source code as calculated by sloccount. Excludes regression test code and auxiliary utilities in the case of libs.

standard: all projects are implemented in C; this column shows the oldest C standard the code aims to comply with.

has cross platform parts: most of these libs are 100% portable plain C code with no dependencies. Some of them need to depend on calls that differ from system to system. These libs are marked as having "cross platform parts", which means some of the code is implemented multiple times, each instance for one of the systems supported.

type generic: applicable to data types only; the code can be configured or instantiated for different user supplied types. The code is type safe (different instances have different types that don't mix, e.g. a "string-to-int" hash lookup is a different type from an "int-to-pointer" lookup).

alternative for: the project is a good alternative for the well known public project(s) listed in this column. Since some of these projects have a huge code base, a minilib is often an alternative for only parts of the project. The project size is calculated with sloccount, not counting regression tests and auxiliary utilities wherever possible.