The functions in this section look at or change the text of one or more strings. Optional parameters are enclosed in square brackets ("[" and "]").
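index(in, find)
This searches the string in for the first occurrence of the string find, and returns the position in characters where that occurrence begins in the string in. For example: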
$ awk 'BEGIN { print index("peanut", "an") }'
-| 3
If find is not found, index returns zero.
(Remember that string indices in awk start at one.)
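As a small sketch of how the zero return value might be used as a "not found" test (the shell function name has_substring is hypothetical, chosen only for this illustration):

```shell
# has_substring is a hypothetical helper name used only for this sketch.
has_substring() {
    awk -v s="$1" -v t="$2" 'BEGIN {
        pos = index(s, t)          # 0 means that t does not occur in s
        if (pos == 0)
            print "not found"
        else
            print "found at " pos
    }'
}

has_substring "peanut" "an"     # prints: found at 3
has_substring "peanut" "xyz"    # prints: not found
```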
length([string])
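This gives you the number of characters in string. If string is a number, the length of the digit string representing that number is returned. For example, length("abcde") is five. By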
contrast, length(15 * 35) works out to three. How? Well, 15 * 35 =
525, and 525 is then converted to the string "525", which has
three characters.
If no argument is supplied, length returns the length of $0.
In older versions of awk, you could call the length function
without any parentheses. Doing so is marked as "deprecated" in the
POSIX standard. This means that while you can do this in your
programs, it is a feature that can eventually be removed from a future
version of the standard. Therefore, for maximal portability of your
awk programs, you should always supply the parentheses.
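The three cases described above can be tried side by side. This is only a sketch; the function name length_demo and the sample input are made up for the illustration:

```shell
# length_demo is a hypothetical name; the input line is sample data.
length_demo() {
    echo "hello world" | awk '{
        print length("abcde")     # a string argument
        print length(15 * 35)     # 525 is converted to the string "525"
        print length()            # no argument: the length of $0
    }'
}

length_demo
# prints:
# 5
# 3
# 11
```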
match(string, regexp)
The match function searches the string, string, for the
longest, leftmost substring matched by the regular expression,
regexp. It returns the character position, or index, of
where that substring begins (one, if it starts at the beginning of
string). If no match is found, it returns zero.
The match function sets the built-in variable RSTART to
the index. It also sets the built-in variable RLENGTH to the
length in characters of the matched substring. If no match is found,
RSTART is set to zero, and RLENGTH to -1.
For example:
awk '{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where != 0)
            print "Match of", regex, "found at", \
                  where, "in", $0
    }
}'
This program looks for lines that match the regular expression stored in
the variable regex. This regular expression can be changed. If the
first word on a line is `FIND', regex is changed to be the
second word on that line. Therefore, given:
FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.
awk prints:
Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.
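RSTART and RLENGTH are commonly combined with substr to pull out the text that matched. The following is a minimal sketch of that idea; the function name first_number and the sample input are assumptions made for the illustration:

```shell
# first_number is a hypothetical helper: it prints the first run of
# digits found on each input line.
first_number() {
    awk '{
        # match() sets RSTART and RLENGTH as a side effect.
        if (match($0, /[0-9]+/))
            print substr($0, RSTART, RLENGTH)
        else
            print "no match (RSTART=" RSTART ", RLENGTH=" RLENGTH ")"
    }'
}

printf 'order 1234 shipped\nno digits here\n' | first_number
# prints:
# 1234
# no match (RSTART=0, RLENGTH=-1)
```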
split(string, array [, fieldsep])
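This divides string into pieces separated by fieldsep, and stores the pieces in array. The first piece is stored in
array[1], the second piece in array[2], and so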
forth. The string value of the third argument, fieldsep, is
a regexp describing where to split string (much as FS can
be a regexp describing where to split input records). If
the fieldsep is omitted, the value of FS is used.
split returns the number of elements created.
The split function splits strings into pieces in a
manner similar to the way input lines are split into fields. For example:
split("cul-de-sac", a, "-")
splits the string `cul-de-sac' into three fields using `-' as the
separator. It sets the contents of the array a as follows:
a[1] = "cul"
a[2] = "de"
a[3] = "sac"
The value returned by this call to
split is three.
As with input field-splitting, when the value of fieldsep is
" ", leading and trailing whitespace is ignored, and the elements
are separated by runs of whitespace.
Also as with input field-splitting, if fieldsep is the null string, each
individual character in the string is split into its own array element.
(This is a gawk-specific extension.)
Recent implementations of awk, including gawk, allow
the third argument to be a regexp constant (/abc/), as well as a
string (d.c.). The POSIX standard allows this as well.
Before splitting the string, split deletes any previously existing
elements in the array array (d.c.).
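The separator rules and the clearing of old array elements can be checked with a short program. This is a sketch only; the function name split_demo and the array name parts are hypothetical:

```shell
# split_demo and the array name "parts" are hypothetical names.
split_demo() {
    awk 'BEGIN {
        # A single space as separator: leading and trailing whitespace
        # is ignored, and runs of whitespace separate the elements.
        n = split("  alpha   beta  ", parts, " ")
        print n, parts[1], parts[2]

        # A regexp constant as separator: split on commas or semicolons.
        n = split("a,b;c", parts, /[,;]/)
        print n, parts[1], parts[2], parts[3]

        # Reusing the array: elements left by the earlier call are deleted.
        n = split("x", parts, ",")
        print n, parts[1], ((2 in parts) ? "stale" : "cleared")
    }'
}

split_demo
# prints:
# 2 alpha beta
# 3 a b c
# 1 x cleared
```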
sprintf(format, expression1,...)
This returns (without printing) the string that printf would
have printed out with the same arguments
(see section Using printf Statements for Fancier Printing).
For example:
sprintf("pi = %.2f (approx.)", 22/7)
returns the string "pi = 3.14 (approx.)".
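Because sprintf returns its result instead of printing it, the formatted text can be stored and processed further. A minimal sketch, with the function name sprintf_demo and the variable name label invented for the illustration:

```shell
# sprintf_demo and the variable "label" are hypothetical names.
sprintf_demo() {
    awk 'BEGIN {
        # Left-justified string, fixed-point number, zero-padded integer.
        label = sprintf("%-6s|%5.1f|%03d", "temp", 21.456, 7)
        print label
        print length(label)       # the result is an ordinary string
    }'
}

sprintf_demo
# prints:
# temp  | 21.5|007
# 16
```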
sub(regexp, replacement [, target])
The sub function alters the value of target.
It searches this value, which is treated as a string, for the
leftmost longest substring matched by the regular expression, regexp,
extending this match as far as possible. Then the entire string is
changed by replacing the matched text with replacement.
The modified string becomes the new value of target.
This function is peculiar because target is not simply
used to compute a value, and not just any expression will do: it
must be a variable, field or array element, so that sub can
store a modified value there. If this argument is omitted, then the
default is to use and alter $0.
For example:
str = "water, water, everywhere"
sub(/at/, "ith", str)
sets str to "wither, water, everywhere", by replacing the
leftmost, longest occurrence of `at' with `ith'.
The sub function returns the number of substitutions made (either
one or zero).
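The return value makes it easy to tell whether a substitution took place. A brief sketch (the function name sub_demo is hypothetical):

```shell
# sub_demo is a hypothetical name used only for this sketch.
sub_demo() {
    awk 'BEGIN {
        str = "water, water, everywhere"
        n = sub(/at/, "ith", str)     # the first match is replaced
        print n, str
        n = sub(/xyz/, "ith", str)    # no match: str is left unchanged
        print n, str
    }'
}

sub_demo
# prints:
# 1 wither, water, everywhere
# 0 wither, water, everywhere
```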
If the special character `&' appears in replacement, it
stands for the precise substring that was matched by regexp. (If
the regexp can match more than one string, then this precise substring
may vary.) For example:
awk '{ sub(/candidate/, "& and his wife"); print }'
changes the first occurrence of `candidate' to `candidate
and his wife' on each input line.
Here is another example:
awk 'BEGIN {
str = "daabaaa"
sub(/a*/, "c&c", str)
print str
}'
-| dcaacbaaa
This shows how `&' can represent a non-constant string, and also
illustrates the "leftmost, longest" rule in regexp matching
(see section How Much Text Matches?).
The effect of this special character (`&') can be turned off by putting a
backslash before it in the string. As usual, to insert one backslash in
the string, you must write two backslashes. Therefore, write `\\&'
in a string constant to include a literal `&' in the replacement.
For example, here is how to replace the first `|' on each line with
an `&':
awk '{ sub(/\|/, "\\&"); print }'
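The two behaviors of `&' can be compared directly. The sketch below assumes a sample input line and uses the hypothetical names amp_demo and line:

```shell
# amp_demo and the variable "line" are hypothetical names.
amp_demo() {
    printf 'a|b|c\n' | awk '{
        line = $0
        sub(/\|/, "[&]", line)     # & stands for the matched text
        print line
        line = $0
        sub(/\|/, "\\&", line)     # \\& produces a literal ampersand
        print line
    }'
}

amp_demo
# prints:
# a[|]b|c
# a&b|c
```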
Note: As mentioned above, the third argument to sub must
be a variable, field or array reference.
Some versions of awk allow the third argument to
be an expression which is not an lvalue. In such a case, sub
would still search for the pattern and return zero or one, but the result of
the substitution (if any) would be thrown away because there is no place
to put it. Such versions of awk accept expressions like
this:
sub(/USA/, "United States", "the USA and Canada")
For historical compatibility,
gawk will accept erroneous code,
such as in the above example. However, using any other non-changeable
object as the third parameter will cause a fatal error, and your program
will not run.
gsub(regexp, replacement [, target])
This is similar to the sub function, except gsub replaces
all of the longest, leftmost, non-overlapping matching
substrings it can find. The `g' in gsub stands for
"global," which means replace everywhere. For example:
awk '{ gsub(/Britain/, "United Kingdom"); print }'
replaces all occurrences of the string `Britain' with `United
Kingdom' for all input records.
The gsub function returns the number of substitutions made. If
the variable to be searched and altered, target, is
omitted, then the entire input record, $0, is used.
As in sub, the characters `&' and `\' are special,
and the third argument must be an lvalue.
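Since gsub returns the number of substitutions, it can also be used to count matches. A small sketch with made-up input (the function name gsub_demo is hypothetical):

```shell
# gsub_demo is a hypothetical name; the input line is sample data.
gsub_demo() {
    echo "Britain, Great Britain and Britain again" | awk '{
        n = gsub(/Britain/, "United Kingdom")   # target omitted: $0 is used
        print n
        print
    }'
}

gsub_demo
# prints:
# 3
# United Kingdom, Great United Kingdom and United Kingdom again
```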
gensub(regexp, replacement, how [, target])
gensub is a general substitution function. Like sub and
gsub, it searches the target string target for matches of
the regular expression regexp. Unlike sub and
gsub, the modified string is returned as the result of the
function, and the original target string is not changed. If
how is a string beginning with `g' or `G', then it
replaces all matches of regexp with replacement.
Otherwise, how is a number indicating which match of regexp
to replace. If no target is supplied, $0 is used instead.
gensub provides an additional feature that is not available
in sub or gsub: the ability to specify components of
a regexp in the replacement text. This is done by using parentheses
in the regexp to mark the components, and then specifying `\n'
in the replacement text, where n is a digit from one to nine.
For example:
$ gawk '
> BEGIN {
> a = "abc def"
> b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
> print b
> }'
-| def abc
As described above for sub, you must type two backslashes in order
to get one into the string.
In the replacement text, the sequence `\0' represents the entire
matched text, as does the character `&'.
This example shows how you can use the third argument to control
which match of the regexp should be changed.
$ echo a b c a b c |
> gawk '{ print gensub(/a/, "AA", 2) }'
-| a b c AA b c
In this case, $0 is used as the default target string.
gensub returns the new string as its result, which is
passed directly to print for printing.
If the how argument is a string that does not begin with `g' or
`G', or if it is a number that is less than zero, only one
substitution is performed.
gensub is a gawk extension; it is not available
in compatibility mode (see section Command Line Options).
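The following sketch combines a numeric how argument with parenthesized components, and shows that the target is left unchanged. It assumes gawk is installed and is skipped otherwise; the function name swap_first_pair is hypothetical:

```shell
# swap_first_pair is a hypothetical helper. It needs gawk, because
# gensub is not available in other versions of awk.
swap_first_pair() {
    gawk '{
        new = gensub(/([^ ]+) ([^ ]+)/, "\\2 \\1", 1)   # first match only
        print new
        print $0            # the original record is not modified
    }'
}

if command -v gawk >/dev/null 2>&1; then
    echo "hello world again" | swap_first_pair
fi
# prints (when gawk is available):
# world hello again
# hello world again
```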
substr(string, start [, length])
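This returns a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example,
substr("washington", 5, 3) returns "ing".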
If length is not present, this function returns the whole suffix of
string that begins at character number start. For example,
substr("washington", 5) returns "ington". The whole
suffix is also returned
if length is greater than the number of characters remaining
in the string, counting from character number start.
Note: The string returned by substr cannot be
assigned to. Thus, it is a mistake to attempt to change a portion of
a string, like this:
string = "abcdef"
# try to get "abCDEf", won't work
substr(string, 3, 3) = "CDE"
or to use substr as the third argument of sub or gsub:
gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
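One way to get the intended effect is to rebuild the string from its pieces, or to copy the substring into a variable before calling gsub. This is a sketch under those assumptions; the names substr_demo, rec, head and tail are hypothetical:

```shell
# substr_demo, rec, head and tail are hypothetical names.
substr_demo() {
    awk 'BEGIN {
        # Replace characters 3 through 5 by concatenating the part
        # before them, the new text, and the part after them.
        string = "abcdef"
        string = substr(string, 1, 2) "CDE" substr(string, 6)
        print string

        # Substitute only from character 5 onward: copy that part into
        # a variable, change the copy, then put the record back together.
        rec = "xyz--xyz--xyz"
        head = substr(rec, 1, 4)
        tail = substr(rec, 5)
        gsub(/xyz/, "pdq", tail)
        print head tail
    }'
}

substr_demo
# prints:
# abCDEf
# xyz--pdq--pdq
```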
tolower(string)
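This returns a copy of string, with each upper-case character in the string replaced with its corresponding lower-case character. Non-alphabetic characters are left unchanged. For example,
tolower("MiXeD cAsE 123") returns "mixed case 123".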
toupper(string)
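This returns a copy of string, with each lower-case character in the string replaced with its corresponding upper-case character. Non-alphabetic characters are left unchanged. For example,
toupper("MiXeD cAsE 123") returns "MIXED CASE 123".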
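A typical use of these functions is comparing strings without regard to case. A minimal sketch (the shell function name same_ignoring_case is hypothetical):

```shell
# same_ignoring_case is a hypothetical helper name.
same_ignoring_case() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # Convert both strings to one case before comparing them.
        if (tolower(a) == tolower(b))
            print "same"
        else
            print "different"
    }'
}

same_ignoring_case "MiXeD cAsE" "mixed CASE"    # prints: same
same_ignoring_case "MiXeD cAsE" "mixed bag"     # prints: different
```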
More About `\' and `&' with sub, gsub and gensub
When using sub, gsub or gensub, and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of escape processing going on.
First, there is the lexical level, which is when awk reads
your program, and builds an internal copy of your program that can
be executed.
Then there is the run-time level, when awk actually scans the
replacement string to determine what to generate.
At both levels, awk looks for a defined set of characters that
can come after a backslash. At the lexical level, it looks for the
escape sequences listed in section Escape Sequences.
Thus, for every `\' that awk will process at the run-time
level, you type two `\'s at the lexical level.
When a character that is not valid for an escape sequence follows the
`\', Unix awk and gawk both simply remove the initial
`\', and put the following character into the string. Thus, for
example, "a\qb" is treated as "aqb".
At the run-time level, the various functions handle sequences of `\' and `&' differently. The situation is (sadly) somewhat complex.
Historically, the sub and gsub functions treated the two
character sequence `\&' specially; this sequence was replaced in
the generated text with a single `&'. Any other `\' within
the replacement string that did not precede an `&' was passed
through unchanged. To illustrate with a table:
This table shows both the lexical level processing, where
an odd number of backslashes becomes an even number at the run time level,
and the run-time processing done by sub.
(For the sake of simplicity, the rest of the tables below only show the
case of even numbers of `\'s entered at the lexical level.)
The problem with the historical approach is that there is no way to get a literal `\' followed by the matched text.
The 1992 POSIX standard attempted to fix this problem. The standard
says that sub and gsub look for either a `\' or an `&'
after the `\'. If either one follows a `\', that character is
output literally. The interpretation of `\' and `&' then becomes
like this:
This would appear to solve the problem. Unfortunately, the phrasing of the standard is unusual. It says, in effect, that `\' turns off the special meaning of any following character, but that for anything other than `\' and `&', such special meaning is undefined. This wording leads to two problems.
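Backslashes must now be doubled in the replacement string, breaking historical awk programs.
To make sure that an awk program is portable, every character
in the replacement string must be preceded with a
backslash.(11)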
The POSIX standard is under revision.(12) Because of the above problems, proposed text for the revised standard reverts to rules that correspond more closely to the original existing practice. The proposed rules have special cases that make it possible to produce a `\' preceding the matched text.
In a nutshell, at the run-time level, there are now three special sequences of characters, `\\\&', `\\&' and `\&', whereas historically, there was only one. However, as in the historical case, any `\' that is not part of one of these three sequences is not special, and appears in the output literally.
gawk 3.0 follows these proposed POSIX rules for sub and
gsub.
Whether these proposed rules will actually become codified into the
standard is unknown at this point. Subsequent gawk releases will
track the standard and implement whatever the final version specifies;
this book will be updated as well.
The rules for gensub are considerably simpler. At the run-time
level, whenever gawk sees a `\', if the following character
is a digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise,
no matter what the character after the `\' is, that character will
appear in the generated text, and the `\' will not.
Because of the complexity of the lexical and run-time level processing,
and the special cases for sub and gsub,
we recommend using gawk and gensub whenever you have
to do substitutions.