minilibs vs. megalibs
1. Introduction
Life is so simple when we write our first few dozen programs. They are usually
small, focused on some simple and well defined task. We learn how to use
the standard library shipped with the language of our choice, and usually
ignore the rest of the world. Sometimes we reinvent and reimplement code
that is already available in 3rd party libraries - but it is all fine:
these projects are throw-away example programs, byproducts of the learning
curve.
When we start writing production code, we face the usual dilemmas of
programmers:
- Complexity: software tends to grow big, and complexity does not increase
linearly with code size. Past a certain point, code complexity
poses a major risk to the stability of the project. Some projects
simply collapse under their own weight (see also
the second-system effect).
- Hidden or unexpected knowledge in subtasks; hide vs. expose, opacity vs. transparency.
- Reuse vs. dependencies: one tool for fighting complexity is modularization
and reusing existing, well tested and well documented 3rd party
modules/libraries. However, sometimes this does not pay off.
This article collects considerations that may help in finding a sustainable
balance in these questions. The conclusion presents a practical solution
to some of the problems discussed.
2. The hidden/unexpected knowledge
Even smallish parts or subproblems often turn out to be knotty and
time consuming to implement. It is usually relatively easy
to implement a prototype, but things get more complicated if the
code has to be efficient, portable, generic and behave reasonably in
corner cases.
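Even a task as mundane as "read a line of text" illustrates this. Below is a
minimal sketch (not taken from any library discussed here) that tries to
behave reasonably in the corner cases a quick prototype usually ignores:
arbitrarily long lines, a last line not terminated by a newline, and
allocation failure.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one line of arbitrary length from f; return a malloc()'d string
   without the trailing newline, or NULL on EOF or allocation failure.
   Corner cases handled: lines longer than any fixed buffer (the buffer
   is grown by doubling), a last line without '\n', realloc() failing. */
static char *read_line(FILE *f)
{
	size_t used = 0, alloced = 16;
	char *buf = malloc(alloced), *tmp;
	int c;

	if (buf == NULL)
		return NULL;

	while (((c = fgetc(f)) != EOF) && (c != '\n')) {
		if (used + 1 >= alloced) {
			alloced *= 2;
			tmp = realloc(buf, alloced);
			if (tmp == NULL) {
				free(buf);
				return NULL;
			}
			buf = tmp;
		}
		buf[used++] = (char)c;
	}

	if ((c == EOF) && (used == 0)) { /* EOF before any character: no line */
		free(buf);
		return NULL;
	}
	buf[used] = '\0';
	return buf;
}
```

A prototype would be a five-liner with a fixed buffer; the difference
between that and the above is exactly the kind of hidden knowledge a good
library accumulates.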
This suggests depending more on libraries. The programmer who wrote the
general purpose library for the subproblem probably spent much time on it,
evaluating different algorithms and APIs. The library probably got
a lot of testing already and would work much better out-of-the-box than new
code.
However, these are assumptions, and should not be trusted blindly. Sometimes
the library is not implemented along the same requirements/preferences
as the project that tries to use it. Sometimes the library is simply not
implemented properly. A well designed API does not automatically mean
a proper implementation or thoroughly tested code. Popularity doesn't
imply quality either. The user should download the source of the
library and evaluate code quality.
If the library is old, with a lot of users on a lot of systems, it doesn't
only mean that it's well tested; it often means it's also large: some
users had special needs, and when the library does 99% of what the user needs,
it's hard to resist adding the missing 1%. This may encumber the library with
code (and API) for rarely used special cases that may later on interfere
with each other or with the normal case code.
Large code is hard to review. It's often more expensive to read the code
of a library than just implementing an alternative from scratch (for the
relevant part). Unfortunately, there is also a third option: use the lib
without review, blindly trusting the "brand" (e.g. "it's large, it's
popular, it can not be that bad").
This in turn may lead to another problem: what if the library doesn't do
exactly what is needed in the given use case? The match between what the library
offers and what the application needs is obviously never 100%. The application
code may need to do some extra work (e.g. converting its own idea of the
world to the API of the lib); it needs to accommodate. This works well, up to
a point, but sometimes the difference is so large that it'd be easier to change
the library and spare the extra cruft. If the library is small and easy to
understand, this is a real option, while it rarely happens with libs that
span multiple 100k source lines.
This sometimes leads to more application code spent on the interfacing than
what it would cost to have a local implementation of the functionality.
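The interfacing overhead can be sketched like this (all names below are
hypothetical, invented for illustration - they are not any real library's
API): the application keeps its own point type, the imagined library insists
on its own, so every call needs conversion glue in both directions.

```c
/* The application's own idea of the world: */
struct app_point { double x, y; };

/* The (hypothetical) library's idea of the same thing: */
struct lib_vec2 { float coord[2]; };

/* Imagined library call - stands in for any real API the app needs. */
static struct lib_vec2 lib_vec2_add(struct lib_vec2 a, struct lib_vec2 b)
{
	struct lib_vec2 r;
	r.coord[0] = a.coord[0] + b.coord[0];
	r.coord[1] = a.coord[1] + b.coord[1];
	return r;
}

/* Glue: conversions in both directions on every call. With enough
   calls, this interfacing layer can outgrow a local implementation
   of the functionality itself. */
static struct lib_vec2 app2lib(struct app_point p)
{
	struct lib_vec2 v;
	v.coord[0] = (float)p.x;
	v.coord[1] = (float)p.y;
	return v;
}

static struct app_point lib2app(struct lib_vec2 v)
{
	struct app_point p;
	p.x = v.coord[0];
	p.y = v.coord[1];
	return p;
}

static struct app_point app_point_add(struct app_point a, struct app_point b)
{
	return lib2app(lib_vec2_add(app2lib(a), app2lib(b)));
}
```

Here the glue is already longer than a local two-line addition would be, and
it silently degrades doubles to floats - a typical cost of accommodating a
mismatched API.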
The argument against reimplementation is exactly the hidden knowledge accumulated
in the library: all the experience in handling corner cases, all the testing
on different systems, etc.
From another perspective, this hidden knowledge is probably better
collected and maintained in one place, in a library, instead of in many places
(local implementations in different applications).
3. Reuse vs. dependency
Libraries are usually used as dependencies. The project assumes the right
version of the library is already installed on the system. Dependencies
may cause various problems for end users, distribution packagers or even
developers.
3.1. Getting the dependency
Dependencies are usually hosted by a 3rd party. There are obvious risks
involved:
- bit rot: hosting services, libraries disappearing
- material removed for copyright or patent or other legal reasons
- a specific, old version (needed for the app) not hosted any more
- the user has downloaded the tarball and has to install it offline
(e.g. on a secure network); if missing dependencies turn up only during
./configure, in multiple iterations, this may make the process more
painful than necessary.
Thus some software seeks alternative solutions:
- depend on minilibs and ship the dependencies with the software; in the case
of minilibs of a few hundred lines, losing dynamic linking and library sharing
among executables may not be a big deal; the project may be
self-contained this way; see also: svn:externals
- same, but let ./configure choose between the local (self-contained)
version and the system installed version, if the latter is present
- provide a tarball with all dependencies collected
These alternatives obviously do not work as well with megalibs.
3.2. Dependency versioning
Libraries evolve, APIs change. Programs are usually developed on a few
developer boxes with at most a few versions of the dependent libs installed.
It is sometimes hard to map which version intervals of the libs are really
suitable for the application. This may lead to requiring a specific version
or too new a version, "just to be on the safe side" (instead of testing with
all available versions).
Large libraries have more parts that can change, and because of the extra
interdependencies, it is more likely that one change triggers another. Because
of their sheer size and complexity, mega libraries are less likely to
stabilize, i.e. reach a state where developers say "done, finished, the
library already supports everything we ever wanted it to". This leads
to a constant stream of changes and bumps of version numbers, which makes
it harder to determine API compatibility precisely.
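The usual coping mechanism looks like the sketch below. The version macros
are made up for illustration - they do not belong to any specific library,
though many real libs provide similar ones.

```c
/* A typical pattern: the lib encodes its version as one comparable
   number so the application can guard API differences at compile time.
   All names here are hypothetical. */
#define MYLIB_VER(maj, min, pat)  ((maj) * 10000L + (min) * 100L + (pat))

/* What mylib's header would define for release 2.4.1: */
#define MYLIB_VERSION MYLIB_VER(2, 4, 1)

/* Application side: pick the right call depending on the lib version.
   Every such guard accumulated in the code base makes it harder to
   state precisely which version interval of the lib the application
   really supports. */
#if MYLIB_VERSION >= MYLIB_VER(2, 0, 0)
static const char *app_api_in_use(void) { return "2.x API"; }
#else
static const char *app_api_in_use(void) { return "legacy 1.x API"; }
#endif
```

A fast-moving megalib forces the application to grow many such branches;
a stabilized minilib needs none.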
3.3. API changes and forward porting
The other issue this leads to (put in an oversimplified way) is the waste of
developer time on forward porting the application. For example, take an
application that used gtk for its GUI. It started by using gtk1.2. A considerable
amount of developer time was spent on understanding and learning the API and
developing the initial code. About 3 years after gtk1.2 was released, a new
version, gtk2.0, came out.
After some time systems stopped supporting gtk1.2 and developers had to
spend time forward porting the gtk1.2 GUI module to gtk2.0. Some widgets
were removed, new widgets were introduced, support infrastructure and the API
changed. All in all, developers had to spend their time forward porting
their application not because they were necessarily fond of the new features,
nor because there was strong user demand for the actual new features of
gtk2, but often only because once major OS distributions stopped supporting
gtk1.2, an application that depends on gtk1.2 was also left behind.
Gtk3.0 followed gtk2.0 after 9 years. The same cycle repeats: as of now
(2016) we can still happily use gtk2.0 on most systems, but there are signs
that applications should switch to gtk3.0.
So an application that depends on gtk needs to reserve some developer time
for forward porting the GUI. Users of the said application need to spare
some cycles learning a new GUI over and over. Not because they want to have
a new GUI every 6..10 years, rather because they have to follow the
trends or the application bit rots and becomes unusable. There is no
way around this: even gtk1.2 is too big to be fully shipped with a random
application, and forward porting gtk1.2 itself to new libc and X or windows
versions would be even more expensive.
If an application depends on multiple large libraries and tools,
gtk, glib, guile, boost, autotools, etc., each such dependency will
have its own force-forward-port cycle. As such large libraries often
have more developers than small projects or applications, this may result
in a system where a big chunk of the application developer's time is spent on
following dependencies.
Fortunately there is an alternative. When picking dependencies for a new
project, estimate how much developer time will be available in the long run
and, looking at the history and trends of the dependency, estimate how much
time it will take just to keep up with that dependency. It may turn out that
going for the given dependency is more expensive in the long run than
going for another solution that has higher first-time costs but pays off
later. To avoid burning excess developer time on forward porting, consider:
- choosing smaller libs that have already stabilized;
- and/or libs that have much longer rewrite cycles;
- and/or even rewriting some code locally;
- and/or abandoning some requirements and just omitting features and code
that would introduce hard-to-sustain forward porting paths.
3.4. Security
Among the costs of dependencies, security should be considered too. Much of
our software runs on networked hosts. Running any sort of software increases
the potential attack surface of the host. Ideally the developer
writes the software with great care not to leave exploitable vulnerabilities
in the release. With large software the user won't review the code and needs
to trust the developer on this.
If the release depends on libraries, the programmer will follow similar rules:
more-or-less blindly trust the quality of libraries above a certain size.
However, there's an extra level of risk involved: even if both the application
alone and the library alone are safe, the combination may be unsafe. A complex,
opaque API and the sheer volume of the library source code may lead the
application programmer to use the API without fully understanding what happens
in the library code, which in turn may lead to unforeseen vulnerabilities.
4. Additive vs. subtractive design
In manufacturing mechanical parts, processes are often categorized as
additive or subtractive. The classic way is subtractive machining: take
a large chunk of raw material and keep on removing mass and deforming the part
until the remaining material is shaped as the design requires. Additive
methods do the inverse: start from void and add material to build up the part -
3d printers or casting work like that.
Building software using libraries is often a surprisingly similar process.
Some libraries are designed as a set of small, weakly dependent parts
that can be freely combined. The user employs the additive process: they select
the "raw material" the program needs and add it the way the program needs
it. There's usually very little excess material added in this process,
and most often such burr is tolerable.
Other libraries are designed to provide a large set of interdependent
infrastructure; when the program starts using one part, it can hardly avoid
using the others. This is the subtractive method: by default all features are
used and the user needs to manually remove (turn off, or even work around)
excess features that are not needed or are sometimes even harmful.
Whether a library ends up used in subtractive or additive ways is more
or less hardwired in the design of the library API.
The subtractive method is hard to avoid when using megalibs like glib or
gtk. On the other hand, the additive method is
easier to achieve with minilibs, as they often focus on a single problem.
5. Conclusion
The programmer should always carefully evaluate the alternatives before
settling on a given library (or a local implementation of the feature).
Choosing to depend on a large library may have long term effects on
the project; these effects should not be ignored. Factors like the popularity
of the library or the large amount of features offered should not be
reasons to skip evaluating the risks that may be introduced by
the size of the library and its API. The evaluation should not consider
short term effects only, but long term effects too (e.g. release
cycles of the library and expected forward porting efforts).
The outcome of the evaluation may indeed be to go with a large lib.
In other cases it turns out using a small, specialized library is
better. For reference, a collection of such mini libraries follows
in the appendix.
6. Appendix: a collection of minilibs
data structs

name      | size (sloc) | standard | has cross platform parts | type generic | description | alternative for [size in sloc]
genht     | 669 | C89/C99 | no | yes | Dynamic hash tables with user specified key and value type, user supplied "compare keys" and "hash key" functions. | glib2 [445542]
genlist   | 846 | C89 | no | yes | Doubly linked lists with or without automatic element allocation. List item type is supplied by the caller. Works well with both dynamic and static memory allocation. | glib2 [445542]
genvector | 652 | C89 | no | yes | Dynamic arrays (vectors) of user specified element type: grow and shrink as needed, insert, append, delete elements. Dynamic string is just an instance of a vector configured for char *. Options for: maintaining a terminator record, initialize/construct/destruct records with user callbacks, user supplied alloc()/realloc()/free(). | glib2 [445542]
libuundo  | 385 | C89 | no | no  | Undo/redo lists with callbacks. | -
libualloc | 682 | C89 | no | no  | Modular memory allocator for predictable allocation patterns. | glib2 [445542]
genprique | 235 | C89 | no | yes | Type-generic priority queue (fibonacci heap). | glib2 [445542]
geometry

name        | size (sloc) | standard | has cross platform parts | type generic | description | alternative for [size in sloc]
genhvbox    | 651 | C89 | no | no  | Display independent hbox/vbox packer: specify a tree of boxes with constraints (minimal sizes, alignment, padding, spacing) and the host window size, and it calculates box and widget coordinates and sizes. | -
genrtree    | 708 | C89 | no | yes | Type-generic rtrees for storing and efficiently searching rectangles in the 2d plane. | -
libusteiner | 617 | C89 | no | no  | Simple approximation to minimal 2D Euclidean Steiner Tree, based on spring forces. | -
util-lib

name       | size (sloc) | standard | has cross platform parts | type generic | description | alternative for [size in sloc]
genregex   | 881  | C89 | no  | no | Reentrant, compile-execute regular expressions for strings and binary data. | -
gentrex    | 1245 | C89 | no  | no | Regular-expression-like language designed for trees. The tree data structure is provided by the user (over callback functions). Gentrex can find or list matching nodes. | -
libacas    | 537  | C89 | yes | no | Atomic compare-and-swap based locks, multi-reader-writer thread safe queues. | glib2 [445542]
libporty   | 3234 | C89 | yes | no | Portability layer: file system, date/time/sleep, networking (async networking, files, stdio). | apr [69708] glib2 [445542]
lihata     | 3798 | C89 | no  | no | File format parser for lihata; parses a text tree of lists, hashes, tables and texts. The event parser is about 550 sloc, on which an optional DOM is built. The event parser is designed to be suitable for embedded environments. | libxml2 [169445] expat [11469] json-c [2985]
libendstr  | 243  | C89/C99 | no | no | Functions to ease building strings by appending. The API is similar to the standard library's str* and mem* functions but is designed to track the end of the string. | -
gensexpr   | 596  | C89 | no  | no | Parse s-expressions; optionally load them into a DOM, allowing the user to extend the tree node type with custom fields. Export the DOM. | -
libminuid  | 248  | C89 | no  | no | Portable mini UID generation based on large random numbers (no MAC address needed). | -
puplug     | 1506 | C89 | no  | no | Portable Micro Plugins. Handles static linked and dynamic loaded plugins, plugin dependencies. Optional support for scconfig. | -
fungw      | 4961 | C89 | no  | no | Function call based gateway between C and other languages. Also suitable for building an event dispatch system. | -
libfawk    | 5716 | C89 | no  | no | Turing complete, procedural scripting languages as a reentrant, multi-script C library: an AWK dialect, a BASIC dialect and a Pascal dialect. No dependencies. Per language embeddable single-c-source versions are 2500..2600 sloc. | lua [16564]
libusearch | 340  | C89 | no  | no | Generic, callback based graph search algorithm (A*) - does not depend on data representation, does not need a full graph built in memory. | -
libulzw    | 186  | C89 | no  | no | Tiny lib for lzw compression/decompression. It is possible to link only the compress or the decompress code. Reentrant API. | -
util

name     | size (sloc) | standard | has cross platform parts | type generic | description | alternative for [size in sloc]
scconfig | 22968 | C89 | yes | no | Execute a set of interdependent compile-run-compare and run-compare tests to build a database of the host (and cross compilation target) system. E.g. find the C compiler, figure out whether certain libc calls work or work properly or how exactly they behave in corner cases, find out the compilation/link flags of 3rd party libraries. | autotools cmake [195477]
sphash   | 492   | C89 | no  | no | Find a collision-free hash function and generate a hash table in C code for a set of compile-time-known strings. Such a table can be used to convert string to integer in O(1) time. | -
byaccic  | 8052  | C89 | no  | no | Generate reentrant grammars and lexers. Byaccic is similar to yacc(1), but generates properly symbol-prefixed, reentrant code with context pointers by default. Ureglex is similar to lex(1), but generates reentrant push code. | byacc [10018] bison [13691] flex [13610]
Legend
size: lines of source code as calculated by sloccount. Excludes
regression test code and auxiliary utilities in case of libs.
standard: all projects are implemented in C; this column shows
the oldest C standard the code aims to comply with.
has cross platform parts: most of these libs are 100% portable
plain C code with no dependencies. Some of them need to depend on
calls that differ from system to system. These libs are marked as
having "cross platform parts", which means some of the code is implemented
multiple times, each instance for one of the systems supported.
type generic: applicable to data types only; the code can
be configured or instantiated for different user supplied types. The
code is type safe (different instances have different types that
don't mix, e.g. a "string-to-int" hash lookup is different from
an "int-to-pointer" lookup).
alternative for: the project is a good alternative to
the well known public project(s) listed in this column. Since some of
these projects have a huge code base, a minilib is often an alternative for
parts of the project only. The project size is calculated with sloccount,
not counting regression tests and auxiliary utilities wherever possible.
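As an illustration of what the "type generic" column means in practice, here
is a minimal sketch of the macro-instantiation technique in C89. This is not
the actual API of genht, genvector or any other lib listed above - the names
are invented; it only shows how one generic definition yields independent,
type safe instances.

```c
/* Each macro expansion generates a new struct and new functions for one
   concrete element type, so the instances are type safe: the compiler
   rejects mixing an int stack with a double stack. */
#define GENSTACK_DECLARE(name, type, maxlen) \
	struct name { type data[maxlen]; int used; }; \
	static void name ## _push(struct name *s, type v) { s->data[s->used++] = v; } \
	static type name ## _pop(struct name *s) { return s->data[--s->used]; }

/* Two independent instances of the same generic code: */
GENSTACK_DECLARE(istack, int, 16)
GENSTACK_DECLARE(dstack, double, 16)
```

Passing a struct dstack * to istack_push() is a compile time error - the
type safety the legend refers to. Real minilibs typically go further: user
supplied allocators, no fixed maximum size, overflow checks.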