Re: config files substitution with awk

From: Paul Eggert
Subject: Re: config files substitution with awk
Date: Sun, 26 Nov 2006 09:54:36 -0800
User-agent: Gnus/5.1008 (Gnus v5.10.8) Emacs/21.4 (gnu/linux)

Ralf Wildenhues writes:

> after having studied 
> a bit, the following thought comes to mind:  The gawk.texi section about
> differences in implementations is much more detailed than the Autoconf
> one currently is (and I hope ever will be, lest we increase it a lot);
> OTOH, our manual contains information not present in the gawk one
> (readily to me, at least).  How about adding a pointer to the gawk
> manual part, feeding our extra knowledge that way, and removing these
> bits from autoconf.texi after the following gawk release?  IMHO it's
> better to have one coherent source of information in this case.
> FWIW, I'd volunteer to go over this thread and collect things to feed to
> bug-gawk eventually.

That sounds reasonable to me.  But let's file the gawk patch first,
and once it's accepted we can then make the change to the autoconf
manual.  That way, if the gawk patch isn't accepted we'll still
have the info.

> The shell interpolation (and the workaround for `index in array') again
> require the use of a special marker.  I hope `|#_!!_#|' is sufficient
> for both uses (and mentioned this in autoconf.texi).

It's probably a bit safer to use an auxiliary array instead.
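
A minimal sketch of the auxiliary-array idea, assuming only traditional
awk features.  Instead of comparing values against a marker string, or
relying on the `index in array' expression that some old awks lack, a
second array records which keys are set.  The function name and sample
keys are hypothetical; only the name S_is_set comes from this thread.

```shell
# Hypothetical demo: test membership through an auxiliary array.
is_set_demo() {
  awk 'BEGIN {
    S["foo"] = "bar"
    S["empty"] = ""      # set, but to the empty string
    # The for-in loop is old syntax; build the helper array once.
    for (key in S) S_is_set[key] = 1
    n = split("foo empty missing", names, " ")
    for (i = 1; i <= n; i++)
      print names[i] ": " (S_is_set[names[i]] ? "set" : "unset")
  }'
}
```

A key whose value is the empty string is still reported as set, which a
comparison against "" could not distinguish from a missing key.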

> Since Paolo's tests define what happens with `@var1@var2@', I added a
> note in the manual that users should not rely on this.  (This is to
> avoid the need to distinguish the "special substitutions" still done
> with sed, and to allow us to change the implementation again later on.)

If I am understanding things correctly, why not just make the 'awk'
solution compatible with 'sed', so that @var1@var2@ substitutes only
var1?  Doesn't the current code do that already?
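
A simplified sketch of the behavior in question, modelled on the
split-on-"@" loop in the patch quoted further down.  The function name
and the values "A" and "B" are made up for illustration, and the
file-substitution handling is left out.  When a key is substituted, the
field after it is skipped, so the closing "@" of one variable cannot
double as the opening "@" of the next.

```shell
# Hypothetical demo of left-to-right, non-overlapping @VAR@ substitution.
subst_demo() {
  awk '
  BEGIN {
    S["var1"] = "A"; S["var2"] = "B"
    for (key in S) S_is_set[key] = 1
  }
  {
    line = $0
    nfields = split(line, field, "@")
    len = length(field[1])          # length of the text already handled
    for (i = 2; i < nfields; i++) {
      key = field[i]
      keylen = length(key)
      if (S_is_set[key]) {
        value = S[key]
        # Replace "@key@" (keylen + 2 characters) with the value.
        line = substr(line, 1, len) value substr(line, len + keylen + 3)
        # Skip the next field: its leading "@" was just consumed.
        len += length(value) + length(field[++i])
      } else
        len += 1 + keylen
    }
    print line
  }'
}
```

With these sample values, the input line `@var1@var2@' comes out as
`Avar2@', that is, only var1 is substituted.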

> An independent open question to me is that, if AC_PROG_AWK was used
> anyway by, whether we should then prefer $AWK over awk,
> for efficiency.  WDYT?

I'd say so; there's little point to testing awk twice.
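
One way this could look, as a sketch only: fall back to plain awk when
AWK is unset or empty.  The wrapper name run_awk is hypothetical and is
not part of the installed patch, which calls awk directly.

```shell
# Hypothetical wrapper: prefer the awk that AC_PROG_AWK recorded in $AWK,
# otherwise use whatever "awk" is first in PATH.
run_awk() {
  ${AWK:-awk} "$@"
}
```

For example, `run_awk -f "$tmp/subs.awk"' would then use the
implementation that configure already found.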

> Is this patch ok for Autoconf now, so I can start asking bug-gawk?

It looks good; just a couple of minor tweaks.  Assuming the 'awk'
solution is compatible with 'sed' with respect to @var1@var2@, we
needn't document that issue.  Second, use that auxiliary variable,
which I called "S_is_set".  I installed this:

2006-11-26  Ralf Wildenhues  <address@hidden>

        Rewrite config files generation: avoid quadratic growth in
        the number of substituted variables by using awk instead of sed
        for the bulk of the substitutions.
        * NEWS: Mention this.
        * doc/autoconf.texi (Setting Output Variables): `|#_!!_#|' is also
        forbidden in the output (and thus input) file.
        * lib/autoconf/status.m4 (_AC_AWK_LITERAL_LIMIT): New macro.
        (_AC_OUTPUT_FILES_PREPARE): Instead of several sed scripts,
        generate just one large awk script for substitutions,
        eliminating much of the earlier complexity, while adding some
        new complexity.  Only expand the substitution templates at
        configure time, for smaller configure script size.  If
        _AC_SUBST_FILES are used, test 'awk' for working getline support
        at config.status time.  If absent, interpolate through the
        shell.  The awk script was written with much help
        from Paolo Bonzini and Paul Eggert.
        (_AC_SED_CMD_NUM, _AC_SED_DELIM_NUM, _AC_SED_FRAG): Removed.
        (_AC_SED_FRAG_NUM): Likewise.
        (_AC_SUBST_CMDS): Renamed from...
        (_AC_SED_CMDS): ...this.
        * tests/ (Substitute a 2000-byte string): Also
        substitute a line with 1000 words, and a variable with several
        long lines.
        (Substitute and define special characters): Test awk special
        characters, and put substitution input strings address@hidden@' in the
        output, to test that no recursion happens; test several other
        combinations from Paolo Bonzini.
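
The getline check described in the entry above can be tried standalone.
This sketch reuses the probe command from the patch below; the function
names and messages are hypothetical.

```shell
# Succeeds if this awk accepts "getline <file" without a syntax or
# runtime error; the probe reads nothing from /dev/null.
awk_has_getline() {
  awk 'BEGIN { getline <"/dev/null" }' </dev/null 2>/dev/null
}

# Report which path config.status would take under that assumption.
report_getline() {
  if awk_has_getline; then
    echo "getline usable: read files inside awk"
  else
    echo "getline missing: interpolate through the shell"
  fi
}
```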

Index: NEWS
RCS file: /cvsroot/autoconf/autoconf/NEWS,v
retrieving revision 1.413
diff -p -u -r1.413 NEWS
--- NEWS        17 Nov 2006 20:01:04 -0000      1.413
+++ NEWS        26 Nov 2006 17:47:14 -0000
@@ -1,5 +1,7 @@
 * Major changes in Autoconf 2.61a (??)
+** config.status now uses awk instead of sed for most substitutions, for speed.
 * Major changes in Autoconf 2.61 (2006-11-17)
Index: doc/autoconf.texi
RCS file: /cvsroot/autoconf/autoconf/doc/autoconf.texi,v
retrieving revision 1.1109
diff -p -u -r1.1109 autoconf.texi
--- doc/autoconf.texi   21 Nov 2006 20:54:35 -0000      1.1109
+++ doc/autoconf.texi   26 Nov 2006 17:47:15 -0000
@@ -8352,8 +8352,8 @@ is called.  The value can contain newlin
 The substituted value is not rescanned for more output variables;
 occurrences of @samp{@@@var{variable}@@} in the value are inserted
 literally into the output file.  (The algorithm uses the special marker
-@code{|#_!!_#|} internally, so the substituted value cannot contain
+@code{|#_!!_#|} internally, so neither the substituted value nor the
+output file may contain @code{|#_!!_#|}.)
 If @var{value} is given, in addition assign it to @var{variable}.
Index: lib/autoconf/status.m4
RCS file: /cvsroot/autoconf/autoconf/lib/autoconf/status.m4,v
retrieving revision 1.119
diff -p -u -r1.119 status.m4
--- lib/autoconf/status.m4      17 Nov 2006 21:04:54 -0000      1.119
+++ lib/autoconf/status.m4      26 Nov 2006 17:47:15 -0000
@@ -311,6 +311,16 @@ dnl One cannot portably go further than 
+# ---------------------
+# Evaluate the maximum number of characters to put in an awk
+# string literal, not counting escape characters.
+# Some awk's have small limits, such as Solaris and AIX awk.
 # ------------------------
 # Create the sed scripts needed for CONFIG_FILES.
@@ -319,90 +329,80 @@ dnl One cannot portably go further than 
 # The intention is to have readable config.status and configure, even
 # though this m4 code might be scaring.
-# This code was written by Dan Manthey.
+# This code was written by Dan Manthey and rewritten by Ralf Wildenhues.
 # This macro is expanded inside a here document.  If the here document is
 # closed, it has to be reopened with "cat >>$CONFIG_STATUS <<\_ACEOF".
-# Set up the sed scripts for CONFIG_FILES section.
-dnl ... and define _AC_SED_CMDS, the pipeline which executes them.
-m4_define([_AC_SED_CMDS], [])dnl
-# No need to generate the scripts if there are no CONFIG_FILES.
-# This happens for instance when ./config.status config.h
+[# Set up the scripts for CONFIG_FILES section.
+# No need to generate them if there are no CONFIG_FILES.
+# This happens for instance with `./config.status config.h'.
 if test -n "$CONFIG_FILES"; then
-m4_pushdef([_AC_SED_FRAG_NUM], 0)dnl Fragment number.
-m4_pushdef([_AC_SED_CMD_NUM], 2)dnl Num of commands in current frag so far.
-m4_pushdef([_AC_SED_DELIM_NUM], 0)dnl Expected number of delimiters in file.
-m4_pushdef([_AC_SED_FRAG], [])dnl The constant part of the current fragment.
+dnl For AC_SUBST_FILE, check for usable getline support in awk,
+dnl at config.status execution time.
+dnl Otherwise, do the interpolation in sh, which is slower.
+dnl Without any AC_SUBST_FILE, omit all related code.
+dnl Note the expansion is double-quoted for readability.
+[[if awk 'BEGIN { getline <"/dev/null" }' </dev/null 2>/dev/null; then
+  ac_cs_awk_getline=:
+  ac_cs_awk_pipe_init=
+  ac_cs_awk_read_file='
+      while ((getline aline < (F[key])) > 0)
+       print(aline)
+      close(F[key])'
+  ac_cs_awk_pipe_fini=
+  ac_cs_awk_getline=false
+  ac_cs_awk_pipe_init="print \"cat <<'|#_!!_#|'\""
+  ac_cs_awk_read_file='
+      print "|#_!!_#|"
+      print "cat " F[key]
+      '$ac_cs_awk_pipe_init
+  ac_cs_awk_pipe_fini='END { print "|#_!!_#|" }'
+dnl Define the pipe that does the substitution.
-[# Create sed commands to just substitute file output variables.
-m4_foreach_w([_AC_Var], m4_defn([_AC_SUBST_FILES]),
-[dnl End fragments at beginning of loop so that last fragment is not ended.
-m4_if(m4_eval(_AC_SED_CMD_NUM + 3 > _AC_SED_CMD_LIMIT), 1,
-[dnl Fragment is full and not the last one, so no need for the final un-escape.
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-  m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF
+[m4_define([_AC_SUBST_CMDS], [|
+if $ac_cs_awk_getline; then
+  awk -f "$tmp/subs.awk"
+  awk -f "$tmp/subs.awk" | $SHELL
+[| awk -f "$tmp/subs.awk"])])dnl
+echo 'BEGIN {' >"$tmp/subs.awk"
-]m4_define([_AC_SED_CMD_NUM], 2)m4_define([_AC_SED_FRAG])dnl
-])dnl Last fragment ended.
-m4_define([_AC_SED_CMD_NUM], m4_eval(_AC_SED_CMD_NUM + 3))dnl
-[/^[    address@hidden@[        ]*$/{
-r $]_AC_Var[
+[# Create commands to substitute file output variables.
+  echo "cat >>$CONFIG_STATUS <<_ACEOF"
+  echo 'cat >>"\$tmp/subs.awk" <<\CEOF'
+  echo "$ac_subst_files" | sed 's/.*/F@<:@"&"@:>@="$&"/'
+  echo "CEOF"
+  echo "_ACEOF"
+} >conf$$
+. ./conf$$
+rm -f conf$$
-# Remaining file output variables are in a fragment that also has non-file
-# output varibles.
-m4_define([_AC_SED_FRAG], [
-m4_ifdef([_AC_SUBST_VARS], [m4_defn([_AC_SUBST_VARS]) ])address@hidden@],
-[m4_if(_AC_SED_DELIM_NUM, 0,
-[m4_if(_AC_Var, address@hidden@],
-[dnl The whole of the last fragment would be the final deletion of `|#_!!_#|'.
-m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
+  echo "cat >conf$$subs.awk <<_ACEOF"
+  echo "$ac_subst_vars" | sed 's/.*/&!$&$ac_delim/'
+  echo "_ACEOF"
+} >conf$$
+ac_delim_num=`echo "$ac_subst_vars" | grep -c '$'`
 ac_delim='%!_!# '
 for ac_last_try in false false false false false :; do
-  cat >conf$$subs.sed <<_ACEOF
-m4_if(_AC_Var, address@hidden@],
-      [m4_if(m4_eval(_AC_SED_CMD_NUM + 2 <= _AC_SED_CMD_LIMIT), 1,
-             [m4_define([_AC_SED_FRAG], [ end]m4_defn([_AC_SED_FRAG]))])],
-[m4_define([_AC_SED_CMD_NUM], m4_incr(_AC_SED_CMD_NUM))dnl
-m4_define([_AC_SED_DELIM_NUM], m4_incr(_AC_SED_DELIM_NUM))dnl
-      m4_if(_AC_Var, address@hidden@], m4_if(_AC_SED_CMD_NUM, 2, 2, 
+  . ./conf$$
-dnl Do not use grep on conf$$subs.sed, since AIX grep has a line length limit.
-  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.sed | grep -c X` = 
+dnl Do not use grep on conf$$subs.awk, since AIX grep has a line length limit.
+  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.awk | grep -c X` = 
$ac_delim_num; then
   elif $ac_last_try; then
     AC_MSG_ERROR([could not make $CONFIG_STATUS])
@@ -410,51 +410,104 @@ dnl Do not use grep on conf$$subs.sed, s
     ac_delim="$ac_delim!$ac_delim _$ac_delim!! "
+rm -f conf$$
 dnl Similarly, avoid grep here too.
-ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.sed`
+ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.awk`
 if test -n "$ac_eof"; then
   ac_eof=`echo "$ac_eof" | sort -nru | sed 1q`
   ac_eof=`expr $ac_eof + 1`
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
+dnl Initialize an awk array of substitutions, keyed by variable name.
+dnl First read a whole (potentially multi-line) substitution,
+dnl and construct `S["VAR"]='.  Then, split it into pieces that fit
+dnl in an awk literal.  Each piece then gets active characters escaped
+dnl (if we escape earlier we risk splitting inside an escape sequence).
+dnl Output as separate string literals, joined with backslash-newline.
+dnl Eliminate the newline after `=' in a second script, for readability.
+dnl Notes to the main part of the awk script:
+dnl - the unusual FS value helps prevent running into the limit of 99 fields,
+dnl - we avoid sub/gsub because of the \& quoting issues, see
+dnl - Writing `$ 0' prevents expansion by both the shell and m4 here.
+dnl m4-double-quote most of the scripting for readability.
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF$ac_eof
+cat >>"\$tmp/subs.awk" <<\CEOF$ac_eof
 sed '
-s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g
-s/^/s,@/; s/!/@,|#_!!_#|/
-t n
-s/'"$ac_delim"'$/,g/; t
-s/$/\\/; p
-N; s/^.*\n//; s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g; b n
-' >>$CONFIG_STATUS <conf$$subs.sed
-rm -f conf$$subs.sed
+t line
+s/'"$ac_delim"'$//; t gotline
+N; b line
+s/^/S["/; s/!.*/"]=/; p
+t more
+t notlast
+s/["\\]/\\&/g; s/\n/\\n/g
+s/^/"/; s/$/"/
+s/["\\]/\\&/g; s/\n/\\n/g
+s/^/"/; s/$/"\\/
+b more
+' <conf$$subs.awk | sed '
+  N
+  s/\n//
+rm -f conf$$subs.awk
-]m4_if(_AC_Var, address@hidden@],
-[m4_if(m4_eval(_AC_SED_CMD_NUM + 2 > _AC_SED_CMD_LIMIT), 1,
-[m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
+cat >>"\$tmp/subs.awk" <<CEOF
+  for (key in S) S_is_set[key] = 1
+  FS = ""
+[  \$ac_cs_awk_pipe_init])[
+  line = $ 0
+  nfields = split(line, field, "@")
+  substed = 0
+  len = length(field[1])
+  for (i = 2; i < nfields; i++) {
+    key = field[i]
+    keylen = length(key)
+    if (S_is_set[key]) {
+      value = S[key]
+      line = substr(line, 1, len) "" value "" substr(line, len + keylen + 3)
+      len += length(value) + length(field[++i])
+      substed = 1
+    } else
+      len += 1 + keylen
+  }
+[[  if (nfields == 3 && !substed) {
+    key = field[2]
+    if (F[key] != "" && line ~ /^[      address@hidden@[        ]*$/) {
+      \$ac_cs_awk_read_file
+      next
+    }
+  }]])[
+  print line
-m4_define([_AC_SED_FRAG], [
-])m4_define([_AC_SED_DELIM_NUM], 0)m4_define([_AC_SED_CMD_NUM], 2)dnl
+]dnl end of double-quoted part
 # VPATH may cause trouble with some makes, so we remove $(srcdir),
 # ${srcdir} and @srcdir@ from VPATH if srcdir is ".", strip leading and
@@ -554,7 +607,7 @@ m4_foreach([_AC_Var], [srcdir, abs_srcdi
 m4_ifndef([AC_DATAROOTDIR_CHECKED], [$ac_datarootdir_hack
-" $ac_file_inputs m4_defn([_AC_SED_CMDS])>$tmp/out
+" $ac_file_inputs m4_defn([_AC_SUBST_CMDS]) >$tmp/out
 [test -z "$ac_datarootdir_hack$ac_datarootdir_seen" &&
Index: tests/
RCS file: /cvsroot/autoconf/autoconf/tests/,v
retrieving revision 1.72
diff -p -u -r1.72
--- tests/    28 Oct 2006 09:41:07 -0000      1.72
+++ tests/    26 Nov 2006 17:47:15 -0000
@@ -539,18 +539,26 @@ AT_CLEANUP
 # Solaris 9 /usr/ucb/sed that rejects commands longer than 4000 bytes.  HP/UX
 # sed dumps core around 8 KiB.  However, POSIX says that sed need not
 # handle lines longer than 2048 bytes (including the trailing newline).
-# So we'll just test a 2000-byte value.
+# So we'll just test a 2000-byte value, and for awk, we test a line with
+# almost 1000 words, and one variable with 4 lines of 500 bytes each.
 AT_SETUP([Substitute a 2000-byte string])
 AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
 AC_SUBST([foo], ]m4_for([n], 1, 100,, ....................)[)
+AC_SUBST([bar], "]m4_for([n], 1, 100,, @ @ @ @ @ @ @ @ @ @@)[")
+AC_SUBST([baz], "]m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... ....)
@@ -558,6 +566,11 @@ AT_CHECK_AUTOCONF
 AT_CHECK([cat Foo], 0, m4_for([n], 1, 100,, ....................)
+AT_CHECK([cat Bar], 0, m4_for([n], 1, 100,, @ @ @ @ @ @ @ @ @ @@)
+AT_CHECK([cat Baz], 0, m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... 
@@ -584,25 +597,57 @@ AT_CLEANUP
 ## Substitute and define special characters.  ##
 ## ------------------------------------------ ##
-# Use characters special to the shell, sed, and M4.
+# Use characters special to the shell, sed, awk, and M4.
 AT_SETUP([Substitute and define special characters])
 AT_DATA([], address@hidden@
address@hidden@@notsubsted@@baz@ stray @ and more@@@baz@
address@hidden @address@hidden
address@hidden @address@hidden@
address@hidden @baz@@baz@
+        @file@  
-[[foo="AS@&address@hidden([[X*'[]+ ", `\($foo]])"
+[[foo="AS@&address@hidden([[X*'[]+ ",& &`\($foo \& \\& \\\& \\\\& \ \\ \\\]])"
+bar="@foo@ @baz@"
-AC_DEFINE([foo], [[X*'[]+ ", `\($foo]], [Awful value.])
+AC_DEFINE([foo], [[X*'[]+ ",& &`\($foo]], [Awful value.])
-AT_CHECK([cat Foo], 0, [[X*'[]+ ", `\($foo
+AT_CHECK([cat Foo], 0, [[X*'[]+ ",& &`\($foo \& \\& \\\& \\\\& \ \\ \\\
address@hidden@ @baz@@address@hidden stray @ and more@@bla
address@hidden@ @address@hidden@baz
address@hidden@ @address@hidden
address@hidden@ @address@hidden@
address@hidden blabaz
address@hidden blabaz@
address@hidden blabla
-AT_CHECK_DEFINES([[#define foo X*'[]+ ", `\($foo
+AT_CHECK_DEFINES([[#define foo X*'[]+ ",& &`\($foo

