Chapter 7: Strings

7.1. String concatenation

Two strings can be concatenated using the @ operator, which creates a new string by simply gluing the two operands together:
example program ch7_ex1
function main(ARGV)
{
	a = "foo";
	b = "bar";
	c = a @ b;
	fawk_print_cell(a);
	fawk_print_cell(b);
	fawk_print_cell(c);
}

7.2. String conversions

When a string is used as an operand of an arithmetic operator, it is converted to a number. If the conversion fails (i.e. the string is not a number), the value 0 is used; no error or warning is emitted:
example program ch7_ex2
function main(ARGV)
{
# prints number 6
	fawk_print_cell("5" + 1);

# prints 5 twice, first as a string, then as a number; this is
# how a string can be explicitly converted to a number
	a = "5";
	b = a + 0;
	fawk_print_cell(a);
	fawk_print_cell(b);

# prints number 0 because "oops" is converted to 0
	fawk_print_cell("oops" + 0);
}

When the @ operator is used with a numeric operand, that operand is first converted to a string, because the result of the @ operator is always a string:
example program ch7_ex3
function main(ARGV)
{
# prints 5 as number
	a = 5;
	fawk_print_cell(a);

# prints 5wow as string
	b = a @ "wow";
	fawk_print_cell(b);

# prints 5 as string - this is how a number is explicitly converted into a string
	c = a @ "";
	fawk_print_cell(c);

# num-to-str conversion happens even if both operands are nums: this prints 50
	fawk_print_cell(5 @ 0);

# this will print 100, because str "50" is converted back to number for the
# multiplication
	fawk_print_cell((5 @ 0) * 2);
}
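
The conversion rules of this section can also be modelled outside of fawk. The following Python sketch is only an illustration, not libfawk code: to_num(), to_str() and concat() are hypothetical helper names, formatting numbers with %g is an assumption, and strings that are only partly numeric (such as "5abc") are treated as failed conversions purely for simplicity, since the text above does not specify that case.

```python
def to_num(value):
    """Model of the implicit string-to-number conversion (hypothetical helper)."""
    if isinstance(value, (int, float)):
        return float(value)
    try:
        return float(value)
    except ValueError:
        # Failed conversion: the value 0 is used, no error or warning
        return 0.0


def to_str(value):
    """Model of the implicit number-to-string conversion (assumes %g formatting)."""
    if isinstance(value, str):
        return value
    return "%g" % value


def concat(a, b):
    """Model of the @ operator: both operands become strings, result is a string."""
    return to_str(a) + to_str(b)


# Mirrors ch7_ex2: "5" + 1 and "oops" + 0
six = to_num("5") + 1
zero = to_num("oops") + 0

# Mirrors ch7_ex3: 5 @ "wow", 5 @ "", 5 @ 0 and (5 @ 0) * 2
wow = concat(5, "wow")
five = concat(5, "")
fifty = concat(5, 0)
hundred = to_num(fifty) * 2
```

In this model the explicit conversions of the examples fall out of the same two helpers: adding 0 forces to_num(), and concatenating "" forces to_str().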

7.3. Reference counting

NOTE: this section is of interest to programmers trying to understand the implementation internals of libfawk. Reference counting does not have any visible effect on scripting.

Strings are the only objects in fawk that are reference counted. Reference counting means that each string in memory has a counter holding the number of its users. When the script copies a string, no real copy is created; only the reference counter is incremented. When a string is no longer used (e.g. it was a local variable within a function), it is destroyed and all related memory is freed. This makes copying large strings cheap.

However, this makes strings immutable (read-only): an existing string is never modified in place; any modification has to be made on a newly allocated string, which is similar to copy-on-write.

fawk_print_cell() prints the reference counter.
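
The scheme described in this section can be illustrated with a toy model. The Python sketch below is not the libfawk implementation: the class RefString and its methods copy(), release() and append() are hypothetical names, and the model only demonstrates the three properties stated above (copying increments a counter, the last release frees the string, modification allocates a new string).

```python
class RefString:
    """Toy model of a reference-counted, immutable string (not libfawk code)."""

    def __init__(self, text):
        self.text = text
        self.refs = 1        # one user: whoever created the string
        self.freed = False

    def copy(self):
        # "Copying" only increments the counter; no character data is duplicated
        self.refs += 1
        return self

    def release(self):
        # One user is done with the string; free it when no users are left
        self.refs -= 1
        if self.refs == 0:
            self.text = None
            self.freed = True

    def append(self, suffix):
        # A modification never touches the shared data: a new string is allocated
        return RefString(self.text + suffix)


a = RefString("foo")
b = a.copy()          # b and a are the same object, with two users
c = a.append("bar")   # c is a separate string; a still holds "foo"
```

Because the shared data is never changed, handing out the same object to several users is safe, which is what makes the cheap copy possible.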

7.4. Substrings

Substrings can be generated using the AWK standard substr() builtin:
example program ch7_ex4
function main(ARGV)
{
	s = "hello";
	
	# prints 'e'
	fawk_print(substr(s, 2, 1));

	# these print 'h'
	fawk_print(substr(s, 1, 1));
	fawk_print(substr(s, 0, 1));
	fawk_print(substr(s, -3, 1));

	# these print empty string
	fawk_print_cell(substr(s, 2, 0));
	fawk_print_cell(substr(s, 6, 1));
	fawk_print_cell(substr(s, 6));
	fawk_print_cell(substr(s, 100));
}
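
The index handling shown in the comments of the example can be summarized with a small model. The Python sketch below is not the libfawk implementation: the clamping rule (a start position below 1 is treated as 1, and the length is limited to what remains of the string) is inferred only from the outputs listed in the example, and the treatment of a negative length is an assumption.

```python
def substr(s, start, length=None):
    """Model of substr() with 1-based start, inferred from the ch7_ex4 comments."""
    n = len(s)
    if start < 1:
        start = 1              # positions before the string are clamped to 1
    if start > n:
        return ""              # starting past the end yields an empty string
    avail = n - start + 1      # characters available from start to the end
    if length is None or length > avail:
        length = avail         # no length, or too long: take the rest
    elif length < 0:
        length = 0             # assumption: negative length behaves like 0
    return s[start - 1:start - 1 + length]


s = "hello"
results = [substr(s, 2, 1), substr(s, 1, 1), substr(s, 0, 1), substr(s, -3, 1)]
empties = [substr(s, 2, 0), substr(s, 6, 1), substr(s, 6), substr(s, 100)]
```

Here results holds 'e' followed by three times 'h', and every entry of empties is the empty string, matching the comments in the example program.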

7.5. Text blocks

In many cases a script needs to print or build strings from static string literals and dynamically calculated data. Most often the ratio of string literals to calculated data allows a script written with the traditional approach to look good. However, in some rare cases there are huge blocks of verbatim string data, often with newlines and indentation, that require only a few fields to be filled in.

A typical example is generating a Makefile like this, with an option to replace -O3 with -g:

LDFLAGS =
LDLIBS = -lm
CFLAGS = -I.. -Wall -O3

all: main

main: main.o
	$(CC) $(LDFLAGS) -o main main.o $(LDLIBS)

main.o: main.c
	$(CC)  -c $(CFLAGS) -o main.o main.c

With the usual string tools, storing the Makefile template in the script would make it unreadable because of the \n newlines, and the tab indentation would not be obvious either.

fawk provides a text block feature to make such scripts more readable. A text block is basically a verbatim copy of a potentially multiline text string, with local escapes for inserting calculated expressions. The syntax is:

[[~string~]]

or

[[~string1~expr~string2~]]

or in general:

[[~ string or ~expr~ ... ~]]

The first character after the [[ is the escape character, then the string starts. A string lasts until the escape character. If the escape character is followed by a ]], the text block is closed, else the parser switches mode and parses an expression until the next escape character, where it switches back to string mode. Any number of strings and expressions may be put in a text block, in arbitrary order.

In the end, the string parts are converted to string literals and the whole text block is converted into an expression that concatenates its parts with the string concatenation operator (@). For the second syntax example above, this means: "string1" @ expr @ "string2".
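
The mode switching and the resulting concatenation can be sketched as a small parser. The Python code below is only a model of the rules described above, not the fawk parser: parse_text_block() and render() are hypothetical names, and the backslash escaping used when rendering string literals is an assumption made for display purposes.

```python
def parse_text_block(src):
    """Split a text block into ("str", text) and ("expr", text) parts."""
    if not src.startswith("[[") or len(src) < 3:
        raise ValueError("a text block starts with [[ and an escape character")
    esc = src[2]               # first character after [[ is the escape character
    pos = 3
    parts = []
    while True:
        # String mode: copy everything verbatim up to the next escape character
        end = src.find(esc, pos)
        if end < 0:
            raise ValueError("unterminated text block")
        parts.append(("str", src[pos:end]))
        if src.startswith("]]", end + 1):
            return parts       # escape character followed by ]] closes the block
        # Expression mode: everything up to the next escape character
        pos = end + 1
        end = src.find(esc, pos)
        if end < 0:
            raise ValueError("unterminated expression")
        expr = src[pos:end].strip()
        if not expr:
            raise ValueError("empty expression is not accepted")
        parts.append(("expr", expr))
        pos = end + 1          # back to string mode


def render(parts):
    """Join the parts with @, quoting string parts (escaping is an assumption)."""
    out = []
    for kind, text in parts:
        if kind == "str":
            quoted = (text.replace("\\", "\\\\")
                          .replace('"', '\\"')
                          .replace("\n", "\\n"))
            out.append('"' + quoted + '"')
        else:
            out.append(text)
    return " @ ".join(out)


expression = render(parse_text_block("[[~string1~expr~string2~]]"))
```

For the second syntax example the model yields "string1" @ expr @ "string2", the same expression as given in the text.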

Note: since every character of the string part is copied verbatim, except for the escape character, the text block may contain spaces, tabs, newlines and non-printable characters (other than \0).

Note: empty expr is not accepted (because it would lead to "str" @ @ "str", which is invalid).

The programmer is free to choose the escape character: it can be any ASCII character. The only restriction is that once an escape character is chosen for a text block, that character cannot be used in the string or expression parts of that block, because it is always interpreted as the escape character; there is no way to escape the escape character.

Typical choices for the escape character:

The above Makefile example using a text block:
example program ch7_ex5
function main(ARGV)
{

# change this to 0 to get -O3
dbg = 1;

fawk_print([[^
LDFLAGS =
LDLIBS = -lm
CFLAGS = -I.. -Wall ^(dbg ? "-g" : "-O3")^

all: main

main: main.o
	$(CC) $(LDFLAGS) -o main main.o $(LDLIBS)

main.o: main.c
	$(CC)  -c $(CFLAGS) -o main.o main.c
^]]);
}

In this example ^ was chosen as the escape character because it does not normally appear in make(1) syntax. The expression, a standard ternary expression evaluating to a string, is put in parentheses. This is not mandatory but is best practice, both for readability and to make sure operator precedence does not interfere with the expression internals.