config files substitution with awk

From: Ralf Wildenhues
Subject: config files substitution with awk
Date: Sun, 19 Nov 2006 21:03:08 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

If you have F config files, each with L lines in which substitutions
apply, and S substituted variables, then the overall work for creating
all config files currently scales roughly as
  F * (c1 * (L + S) + c2 * (L * S)) + c3 * S

c1 is larger than c2, but the c2 term causes the most work for large
packages.  Automake 1.10 was changed to decrease L in packages, which
shows when you have many AM_CONDITIONALs.  The patch below kills the
L * S term and replaces it by L * log(S) (if the awk implementation is
worth its name).  A rough check indicated that c3 increases a bit, but
c1 and c2 are likely lower now, and the configure script gets smaller
as well.
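
To illustrate why the L * S term goes away, here is a minimal sketch of
the lookup idea.  It is not the script from the patch: the function name
and the two sample variables are made up, and it scans each line left to
right instead of using sub() with a skip prefix.  Each @VAR@ token costs
one array lookup instead of one pass per substituted variable.

```shell
# Hypothetical helper: substitute @VAR@ tokens using an awk array lookup.
# Simplified sketch only, with two hard-coded sample variables.
subst_with_awk() {
  awk '
  BEGIN {
    # One array entry per substituted variable.  "key in S" is an
    # associative lookup, so its cost should not grow linearly in S.
    S["prefix"] = "/usr/local"
    S["CC"] = "gcc"
  }
  {
    out = ""
    rest = $0
    # Text already moved to "out" is never scanned again, so the
    # substituted values are not expanded recursively.
    while (match(rest, /@[a-zA-Z_][a-zA-Z_0-9]*@/) > 0) {
      key = substr(rest, RSTART + 1, RLENGTH - 2)
      if (key in S) {
        out = out substr(rest, 1, RSTART - 1) S[key]
        rest = substr(rest, RSTART + RLENGTH)
      } else {
        # Unknown name: keep everything before the closing @, which
        # may itself be the start of the next token.
        out = out substr(rest, 1, RSTART + RLENGTH - 2)
        rest = substr(rest, RSTART + RLENGTH - 1)
      }
    }
    print out rest
  }'
}

printf '%s\n' 'CC = @CC@' 'prefix = @prefix@ @unknown@' | subst_with_awk
```

Because replacement text is appended with plain concatenation, this
variant does not need to escape & in the values; the patch uses sub()
and therefore does.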

For example[1], take a large package with 871 substituted variables, of
which 2*136 are produced by AM_CONDITIONAL, and roughly 210 Makefiles.
Timings of './config.status' for those Makefiles (no headers, no depfiles):
- with Automake-1.9.6:
78.54user 9.32system 1:38.60elapsed 89%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2551217minor)pagefaults 0swaps
- with Automake 1.10 (no superfluous $(*_TRUE)/$(*_FALSE) settings):
56.11user 8.31system 1:16.51elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2284709minor)pagefaults 0swaps
- additionally with the Autoconf patch below:
11.24user 3.62system 0:21.89elapsed 67%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+935332minor)pagefaults 0swaps

In comparison, for CVS coreutils the win is less pronounced:
- before the patch (full config.cache, warm CPU cache):
$ \time ./config.status --recheck  # to prove there is no regression
1.34user 1.78system 0:09.15elapsed 34%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+765545minor)pagefaults 0swaps
$ \time ./config.status     # again, without depfiles and config headers
1.20user 1.08system 0:03.72elapsed 61%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+262175minor)pagefaults 0swaps
- with the patch:
$ \time ./config.status --recheck
1.50user 1.58system 0:08.73elapsed 35%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+755682minor)pagefaults 0swaps
$ \time ./config.status     # again, without depfiles and config headers
0.69user 0.86system 0:02.98elapsed 52%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+218260minor)pagefaults 0swaps

The assumption underlying the patch is that AC_PROG_AWK finds an awk
that can handle the resulting script.  I did not see an easy way to
write it portably to ancient awk, but it worked fine on the several
systems I tested.  Obviously this may present a bootstrapping issue for
GNU awk; but as I understand it [4], systems with such a limited awk
have another vendor (n|m)awk which is capable enough.

There are hopefully not many packages that expand AC_PROG_AWK inside a
shell conditional (which would break our 'AC_REQUIRE([AC_PROG_AWK])').
In any case, AM_INIT_AUTOMAKE requires it, so all packages using
Automake should be fine.

This patch should be tested by OpenServer users[2], once it's settled
a bit more.  For example, I don't know whether my testsuite additions
run afoul of some limitation of their awk or sed (but I sure hope not).

Regarding known limitations: AIX awk has a limit for string literals at
398 characters, not counting escape characters, and Solaris awk (not
nawk) at 148.  However, both handle much longer strings through
concatenation, so the matter of initialization is merely one of
splitting the substituted values often enough.  Note however that
Solaris awk is too ancient to handle the script anyway, so the current
setting of _AC_AWK_LITERAL_LIMIT is overcautious.  IRIX nawk fails if I
make the multi-line long string 4000 bytes long.  Further note that
because the code uses sed to set up the substitution script, we retain
the length limitations sed imposes (and it currently pulls in a whole
substituted value, across newlines).
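
The splitting into short literals can be sketched as follows.  This is
an illustration under stated assumptions, not the code from the patch:
the helper name is hypothetical, it relies on 'awk -v' (so it is not
for ancient awk), it assumes the value contains no '"', '\' or newline
(so no escaping is needed), and it puts all pieces on one line rather
than joining them with backslash-newline.

```shell
# Hypothetical helper: emit an awk assignment for variable $1 with
# value $2, split into string literals of at most $3 characters.
# Assumes the value needs no escaping (no ", no \, no newline).
emit_awk_assignment() {
  printf '%s\n' "$2" | awk -v name="$1" -v limit="$3" '
  {
    printf "S[\"%s\"] =", name
    if ($0 == "")
      printf " \"\""
    # Adjacent string literals are concatenated by awk, so the pieces
    # add up to the original value again.
    for (i = 1; i <= length($0); i += limit)
      printf " \"%s\"", substr($0, i, limit)
    printf "\n"
  }'
}

# Show the generated assignment, then run it to see the joined value.
emit_awk_assignment foo abcdefghij 4
awk "BEGIN { $(emit_awk_assignment foo abcdefghij 4); print S[\"foo\"] }"
```

With a limit of 4 the first command prints
S["foo"] = "abcd" "efgh" "ij" and the second prints abcdefghij; a real
limit would be chosen below the 398 and 148 character figures above.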

One current drawback: AC_SUBST_FILE shows a noticeable regression,
because awk's system function is called for each such substitution.
The autoconf.texi note leaves me uncertain what we can portably expect
from awk's getline.  But my experience is that few packages use many
AC_SUBST_FILE, rendering this irrelevant in practice.
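
For comparison, a getline-based variant might look like the sketch
below.  It assumes a POSIX awk; whether older awks handle this form of
getline is exactly the open question above.  The function name, the
'@frag@' token and the temporary file are made up for the example, and
file names containing backslashes would need extra care with 'awk -v'.

```shell
# Hypothetical demo: replace a line consisting only of @frag@ (plus
# optional blanks) by the contents of file $1, without forking cat.
insert_file_with_getline() {
  awk -v file="$1" '
  /^[ \t]*@frag@[ \t]*$/ {
    # Read the file line by line inside awk itself.
    while ((getline line < file) > 0)
      print line
    close(file)   # allow the same file to be read again later
    next
  }
  { print }'
}

frag=${TMPDIR:-/tmp}/frag.$$
printf '%s\n' 'line one' 'line two' >"$frag"
printf '%s\n' 'before' '@frag@' 'after' | insert_file_with_getline "$frag"
rm -f "$frag"
```

This avoids one process per substitution, at the price of depending on
getline from a file.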

Also, '@file@'s are currently substituted even if there is more than
just white space surrounding them (but the rest of the line _is_
killed).  I'm
not certain yet what would be the most desirable semantics (which should
then be documented), but I think it would be possible to improve upon the
current ones.

It is possible to remove the restriction that the output may not contain
'|#_!!_#|', but I haven't done that yet.  (I'm using it as well to
disable field splitting, to get around the HP-UX awk limitation of 99

Currently the c3*S step can scale quadratically in the maximal line
length, as the large sed script pulls in a whole substituted variable
(including internal newlines), and I'm not sure if sed implementations
are smart enough for this.  If this exceeds sed limits, then that will
need to be rewritten before it can be used.  The scaling and the pulling
in can be eliminated, but at the cost of a more complex sed script.

Also, there is a quadratic scaling in the number of substitutions per
line.  I guess it's not worth fighting this unless someone sees an easy
and portable way out.

Is it necessary to 'chmod +x' a file before sourcing it ('. ./file')?

The removing of the temporary files in this whole macro is done merely
because the testsuite checks for presence of files before the final trap
is run (which would have removed conf$$*).

With this patch, it would even be quite easy to add support for reliable
(and fast) recursive substitutions.  IOW, we could add support for
  AC_SUBST_RECURSIVE([foo], address@hidden@ @address@hidden)

and bar and baz will be replaced as well.  If there is desire for such a
macro, that is.  (Note the 'skip' in the awk script is currently used to
prevent an infinite loop on @notsubsted@ only; taking the size of the
replacement into account requires some hackery wrt. the escaping of &.)
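
A rough sketch of what recursive substitution could look like, assuming
a value of foo that mentions two other substituted variables, bar and
baz.  Everything here is hypothetical: the function name, the sample
values and the per-line budget of 100 expansions, which stands in for a
proper guard against self-referential values.  It scans left to right
and is not derived from the awk script in the patch.

```shell
# Hypothetical sketch: values may themselves contain @VAR@ tokens,
# which are expanded too.  A per-line budget stops runaway recursion.
subst_recursive() {
  awk '
  BEGIN {
    S["foo"] = "@bar@ and @baz@"
    S["bar"] = "BAR"
    S["baz"] = "BAZ"
    S["loop"] = "@loop@"   # self-referential on purpose
  }
  {
    out = ""
    rest = $0
    budget = 100
    while (match(rest, /@[a-zA-Z_][a-zA-Z_0-9]*@/) > 0) {
      key = substr(rest, RSTART + 1, RLENGTH - 2)
      if ((key in S) && budget-- > 0) {
        out = out substr(rest, 1, RSTART - 1)
        # Put the value back in front of the unscanned text, so
        # tokens inside the value are seen by the next iteration.
        rest = S[key] substr(rest, RSTART + RLENGTH)
      } else {
        # Unknown name or budget used up: leave the token as it is.
        out = out substr(rest, 1, RSTART + RLENGTH - 2)
        rest = substr(rest, RSTART + RLENGTH - 1)
      }
    }
    print out rest
  }'
}

printf '%s\n' 'x=@foo@' | subst_recursive
```

Since the value is concatenated rather than passed to sub(), no
escaping of & is needed in this variant.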

The diff below was generated with the proposed --more-readable patch[3]
to diffutils, for, umm, readability. :-)

OK to apply?  I think we'll be better off with a couple more tests, too,
to appear later.

Now if we can figure out a way to use make-time includes in Automake in
a mostly backward-compatible way (there are quite a few related issues
to deal with first) ...  ;-)


[1] OpenMPI
    but with the kill-trailing-spaces change undone (I don't like it if
    the vi paragraph move command '{' doesn't skip over the whole patch).

2006-11-19  Ralf Wildenhues  <address@hidden>

        Rewrite config files generation: replace quadratic growth in
        the number of substituted variables with loglinear growth by
        using awk instead of sed for the bulk of the substitutions.
        * lib/autoconf/status.m4 (_AC_AWK_LITERAL_LIMIT)
        Require AC_PROG_AWK if config files are generated.
        (_AC_OUTPUT_FILES_PREPARE): Instead of several sed scripts,
        generate just one large awk script for substitutions,
        eliminating much of the earlier complexity, while adding some
        new complexity.  Only expand the substitution templates at
        configure time, for smaller configure script size.
        (_AC_SUBST_CMDS): Renamed from...
        (_AC_SED_CMDS): ...this.
        (_AC_DELIM_NUM): Renamed from...
        (_AC_SED_DELIM_NUM): ...this.
        (_AC_SED_CMD_NUM, _AC_SED_FRAG, _AC_SED_FRAG_NUM): Removed.
        (AC_OUTPUT): Use _AC_OUTPUT_FILES_REQUIRE if needed.
        * tests/ (Substitute a 2000-byte string): Also
        substitute a line with 1000 words, and a variable with several
        long lines.
        (Substitute and define special characters): Also substitute
        ampersands, and put substitution input strings address@hidden@' in the
        output, to test that no recursion happens.
        * NEWS: Update.

--- NEWS        2006-11-18 04:04:11.000000000 +0100
+++ NEWS        2006-11-19 20:05:10.000000000 +0100
@@ -1,5 +1,9 @@
 * Major changes in Autoconf 2.61a (??)
+** config.status now uses awk for substitutions, for improved scaling
+  with the number of substituted variables.  This change requires that
+  AC_PROG_AWK finds a non-ancient awk program.
 * Major changes in Autoconf 2.61 (2006-11-17)
--- lib/autoconf/status.m4      2006-11-18 04:04:15.000000000 +0100
+++ lib/autoconf/status.m4      2006-11-19 20:34:35.000000000 +0100
@@ -311,6 +311,29 @@
+# ---------------------
+# Evaluate the maximum number of characters to put in an awk
+# string literal, not counting escape characters.
+# Some awk's have small limits, such as Solaris and AIX awk.
+# ---------------------
+# ------------------------
 # ------------------------
 # Create the sed scripts needed for CONFIG_FILES.
@@ -319,7 +342,7 @@
 # The intention is to have readable config.status and configure, even
 # though this m4 code might be scaring.
-# This code was written by Dan Manthey.
+# This code was written by Dan Manthey and rewritten by Ralf Wildenhues.
 # This macro is expanded inside a here document.  If the here document is
 # closed, it has to be reopened with "cat >>$CONFIG_STATUS <<\_ACEOF".
@@ -328,81 +351,44 @@
 # Set up the sed scripts for CONFIG_FILES section.
-dnl ... and define _AC_SED_CMDS, the pipeline which executes them.
-m4_define([_AC_SED_CMDS], [])dnl
+dnl ... and define _AC_SUBST_CMDS, the pipeline which executes them.
+m4_define([_AC_SUBST_CMDS], [| $AWK -f "$tmp/subs.awk" ])dnl
 # No need to generate the scripts if there are no CONFIG_FILES.
 # This happens for instance when ./config.status config.h
 if test -n "$CONFIG_FILES"; then
+echo 'BEGIN {' >"$tmp/subs.awk"
-m4_pushdef([_AC_SED_FRAG_NUM], 0)dnl Fragment number.
-m4_pushdef([_AC_SED_CMD_NUM], 2)dnl Num of commands in current frag so far.
-m4_pushdef([_AC_SED_DELIM_NUM], 0)dnl Expected number of delimiters in file.
-m4_pushdef([_AC_SED_FRAG], [])dnl The constant part of the current fragment.
-[# Create sed commands to just substitute file output variables.
-m4_foreach_w([_AC_Var], m4_defn([_AC_SUBST_FILES]),
-[dnl End fragments at beginning of loop so that last fragment is not ended.
-m4_if(m4_eval(_AC_SED_CMD_NUM + 3 > _AC_SED_CMD_LIMIT), 1,
-[dnl Fragment is full and not the last one, so no need for the final un-escape.
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-  m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF
-]m4_define([_AC_SED_CMD_NUM], 2)m4_define([_AC_SED_FRAG])dnl
-])dnl Last fragment ended.
-m4_define([_AC_SED_CMD_NUM], m4_eval(_AC_SED_CMD_NUM + 3))dnl
-[/^[    address@hidden@[        ]*$/{
-r $]_AC_Var[
-# Remaining file output variables are in a fragment that also has non-file
-# output varibles.
-m4_define([_AC_SED_FRAG], [
-m4_ifdef([_AC_SUBST_VARS], [m4_defn([_AC_SUBST_VARS]) ])address@hidden@],
-[m4_if(_AC_SED_DELIM_NUM, 0,
-[m4_if(_AC_Var, address@hidden@],
-[dnl The whole of the last fragment would be the final deletion of `|#_!!_#|'.
-m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
-ac_delim='%!_!# '
-for ac_last_try in false false false false false :; do
-  cat >conf$$subs.sed <<_ACEOF
-m4_if(_AC_Var, address@hidden@],
-      [m4_if(m4_eval(_AC_SED_CMD_NUM + 2 <= _AC_SED_CMD_LIMIT), 1,
-             [m4_define([_AC_SED_FRAG], [ end]m4_defn([_AC_SED_FRAG]))])],
-[m4_define([_AC_SED_CMD_NUM], m4_incr(_AC_SED_CMD_NUM))dnl
-m4_define([_AC_SED_DELIM_NUM], m4_incr(_AC_SED_DELIM_NUM))dnl
-      m4_if(_AC_Var, address@hidden@], m4_if(_AC_SED_CMD_NUM, 2, 2, 
-dnl Do not use grep on conf$$subs.sed, since AIX grep has a line length limit.
-  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.sed | grep -c X` = 
+[# Create commands to substitute file output variables.
+  echo "cat >>$CONFIG_STATUS <<_ACEOF"
+  echo 'cat >>"\$tmp/subs.awk" <<\CEOF'
+  echo "$ac_subst_files" | sed 's/.*/F@<:@"&"@:>@ = "$&"/'
+  echo "CEOF"
+  echo "_ACEOF"
+} >conf$$
+chmod +x conf$$
+. ./conf$$
+rm -f conf$$
+  echo "cat >conf$$subs.awk <<_ACEOF"
+  echo "$ac_subst_vars" | sed 's/.*/&!$&$ac_delim/'
+  echo "_ACEOF"
+} >conf$$
+chmod +x conf$$
+ac_delim_num=`echo "$ac_subst_vars" | grep -c '$'`
+ac_delim='%!_!# '
+for ac_last_try in false false false false false :; do
+  . ./conf$$
+dnl Do not use grep on conf$$subs.awk, since AIX grep has a line length limit.
+  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.awk | grep -c X` = 
$ac_delim_num; then
   elif $ac_last_try; then
     AC_MSG_ERROR([could not make $CONFIG_STATUS])
@@ -410,51 +396,95 @@
     ac_delim="$ac_delim!$ac_delim _$ac_delim!! "
+rm -f conf$$
 dnl Similarly, avoid grep here too.
-ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.sed`
+ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.awk`
 if test -n "$ac_eof"; then
   ac_eof=`echo "$ac_eof" | sort -nru | sed 1q`
   ac_eof=`expr $ac_eof + 1`
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF$ac_eof
-sed '
-s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g
-s/^/s,@/; s/!/@,|#_!!_#|/
-t n
-s/'"$ac_delim"'$/,g/; t
-s/$/\\/; p
-N; s/^.*\n//; s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g; b n
-' >>$CONFIG_STATUS <conf$$subs.sed
-rm -f conf$$subs.sed
-]m4_if(_AC_Var, address@hidden@],
-[m4_if(m4_eval(_AC_SED_CMD_NUM + 2 > _AC_SED_CMD_LIMIT), 1,
-[m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
-m4_define([_AC_SED_FRAG], [
-])m4_define([_AC_SED_DELIM_NUM], 0)m4_define([_AC_SED_CMD_NUM], 2)dnl
+dnl Initialize an awk array of substitutions, keyed by variable name.
+dnl First read a whole (potentially multi-line) substitution,
+dnl and construct `S["VAR"] ='.  Then, escape '@' in the value,
+dnl and split it into pieces that fit in an awk literal.
+dnl Each piece then gets active characters escaped:
+dnl    "       -> \"
+dnl    \       -> \\
+dnl    newline -> \n
+dnl    &       -> \\&  (otherwise & will be active in awk's sub)
+dnl (if we escape earlier we risk splitting inside an escape sequence).
+dnl Output as separate string literals, joined with backslash-newline.
+dnl Eliminate the newline after `=' in a second script, for readability.
+dnl m4-double-quote most of the scripting for readability.
+cat >>"\$tmp/subs.awk" <<\CEOF$ac_eof
+sed '
+t line
+s/'"$ac_delim"'$//; t gotline
+N; b line
+s/^/S["/; s/!.*/"] = /; p
+s/^.*!//; s/@/@|#_!!_#|/g
+t more
+t notlast
+s/["\\]/\\&/g; s/\n/\\n/g; s/&/\\\\&/g
+s/^/"/; s/$/"/
+s/["\\]/\\&/g; s/\n/\\n/g; s/&/\\\\&/g
+s/^/"/; s/$/"\\/
+b more
+' <conf$$subs.awk | sed '
+  N
+  s/\n//
+rm -f conf$$subs.awk
+  FS = "[|]#_!!_#[|]"
+/@[a-zA-Z_][a-zA-Z_0-9]*@/ {
+  skip = ""
+  while (match($ 0, skip "@[a-zA-Z_][a-zA-Z_0-9]*@") > 0) {
+    l = length(skip "")
+    key = substr($ 0, RSTART + 1 + l, RLENGTH - 2 - l)
+    if (key in S) {
+      sub("@" key "@", S[key])
+    } else {
+      if (key in F) {  # match only ^[  address@hidden@[        ]$ ?
+        system("cat <" F[key])
+       next
+      } else {
+       add = ""
+       for (i=0; i<RSTART; i++)
+         add = add "."
+       skip = skip add
+      }
+    }
+  }
+  gsub("[|]#_!!_#[|]", "")
+  print
+]dnl end of double-quoted part
 # VPATH may cause trouble with some makes, so we remove $(srcdir),
 # ${srcdir} and @srcdir@ from VPATH if srcdir is ".", strip leading and
@@ -554,7 +583,7 @@
 m4_ifndef([AC_DATAROOTDIR_CHECKED], [$ac_datarootdir_hack
-" $ac_file_inputs m4_defn([_AC_SED_CMDS])>$tmp/out
+" $ac_file_inputs m4_defn([_AC_SUBST_CMDS])>$tmp/out
 [test -z "$ac_datarootdir_hack$ac_datarootdir_seen" &&
@@ -1069,6 +1098,8 @@
 [dnl Dispatch the extra arguments to their native macros.
          [AC_CONFIG_COMMANDS(default, [$2], [$3])])dnl
--- tests/    2006-10-28 01:17:47.000000000 +0200
+++ tests/    2006-11-19 20:46:46.000000000 +0100
@@ -539,18 +539,26 @@
 # Solaris 9 /usr/ucb/sed that rejects commands longer than 4000 bytes.  HP/UX
 # sed dumps core around 8 KiB.  However, POSIX says that sed need not
 # handle lines longer than 2048 bytes (including the trailing newline).
-# So we'll just test a 2000-byte value.
+# So we'll just test a 2000-byte value, and for awk, we test a line with
+# almost 1000 words, and one variable with 4 lines of 500 bytes each.
 AT_SETUP([Substitute a 2000-byte string])
 AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
 AC_SUBST([foo], ]m4_for([n], 1, 100,, ....................)[)
+AC_SUBST([bar], "]m4_for([n], 1, 100,, . . . . . . . . . ..)[")
+AC_SUBST([baz], "]m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... ....)
@@ -558,6 +566,11 @@
 AT_CHECK([cat Foo], 0, m4_for([n], 1, 100,, ....................)
+AT_CHECK([cat Bar], 0, m4_for([n], 1, 100,, . . . . . . . . . ..)
+AT_CHECK([cat Baz], 0, m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... 
@@ -589,20 +602,26 @@
 AT_SETUP([Substitute and define special characters])
 AT_DATA([], address@hidden@
address@hidden@@notsubsted@@baz@ stray @ and more@@@baz@
-[[foo="AS@&address@hidden([[X*'[]+ ", `\($foo]])"
+[[foo="AS@&address@hidden([[X*'[]+ ",& &`\($foo]])"
+bar="@foo@ @baz@"
-AC_DEFINE([foo], [[X*'[]+ ", `\($foo]], [Awful value.])
+AC_DEFINE([foo], [[X*'[]+ ",& &`\($foo]], [Awful value.])
-AT_CHECK([cat Foo], 0, [[X*'[]+ ", `\($foo
+AT_CHECK([cat Foo], 0, [[X*'[]+ ",& &`\($foo
address@hidden@ @baz@@address@hidden stray @ and more@@bla
-AT_CHECK_DEFINES([[#define foo X*'[]+ ", `\($foo
+AT_CHECK_DEFINES([[#define foo X*'[]+ ",& &`\($foo
