Re: config files substitution with awk

From: Ralf Wildenhues
Subject: Re: config files substitution with awk
Date: Tue, 21 Nov 2006 19:48:22 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

[ apologies for the resend ]

* Paul Eggert wrote on Tue, Nov 21, 2006 at 06:30:07PM CET:
> +In traditional Awk, @code{FS} must be a string containing just one
> +ordinary character, and similarly for the field-separator argument to
> address@hidden

Thanks.  FWIW, Solaris awk seems to use only the first character of FS,
rather than erroring out (which was not obvious to me from the above
description).  I see two ways out:
- Choose for FS a character unlikely to occur often; I'd guess # or ~
  should work?
- Do something like
    sed 's/~/|#_!!_#|/g' | awk -f "$tmp/subs.awk" | sed 's/|#_!!_#|/~/g'
  to work around this (with FS="~" in the awk script).
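
A minimal sketch of the second alternative, assuming a POSIX shell, sed
and awk.  The input line is made up for illustration, and the awk program
here only reports NF instead of running the real subs.awk; it shows that
the line is never split on "~" and that the "~" characters survive the
round trip:

```shell
# Hypothetical input line that happens to contain the separator character.
input='prefix=~/inst and ~/build'

# Hide each "~" behind the marker, run awk with a one-character FS,
# then turn the marker back into "~".
output=$(printf '%s\n' "$input" |
  sed 's/~/|#_!!_#|/g' |
  awk 'BEGIN { FS = "~" } { print NF ": " $0 }' |
  sed 's/|#_!!_#|/~/g')

printf '%s\n' "$output"
```

The cost is two extra sed processes per generated file, which is why the
first alternative may still be preferable.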


I have another Solaris awk issue, and don't know how to get around
this easily: it supports `index in array' only in for statements:

$ awk 'END { v="x"; F[v]=1; if (v in F) print v; }' </dev/null
awk: syntax error near line 1
awk: illegal statement near line 1

Interestingly, this difference is not mentioned in autoconf.texi, nor
in the gawk.texi or files of GNU awk.

Note that rewriting this to, say, test for nonempty 'array[index]'
instead (and using a marker to distinguish empty replacement strings)
could be quite memory-intensive, due to all the new array members
created on the way, so I'd prefer not to go that way, but I admit to
not having tested this.  Looping over array members has the wrong work
complexity, so that would be a step in the wrong direction as well.
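
To illustrate the memory concern: in a POSIX awk, merely referencing
'array[index]' creates that element, while the `in' test does not.  A
small sketch (array contents are made up; it needs an awk that accepts
`in' outside of for statements, so it will not run on the Solaris awk
discussed here):

```shell
counts=$(awk 'BEGIN {
  S["a"] = "x"

  # Membership test with "in": does not create S["missing"].
  if ("missing" in S) print "unexpected"
  n_in = 0; for (k in S) n_in++

  # Testing the value instead: the reference itself creates S["missing"].
  if (S["missing"] != "") print "unexpected"
  n_ref = 0; for (k in S) n_ref++

  print n_in, n_ref
}' </dev/null)

printf '%s\n' "$counts"
```

The second count is one larger than the first, so every candidate name
looked up this way would leave a new, empty array member behind.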

One further portability note, colors the description of
'next' in blue, the notation for a nonportable feature.  OTOH, the V7
manual describes it.  Hmm.

FWIW, below's what I have currently, with above issues not addressed.



- get rid of sub/gsub completely.  Besides avoiding the \& quoting
  issue, it also shaves another 8 seconds off ./config.status time
  for the large example package:
3.20user 3.66system 0:13.84elapsed 49%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+925981minor)pagefaults 0swaps

  Which only reconfirms that hashing is faster than regular expression
  matching. ;-)

- As a consequence, the |#_!!_#| marker can be avoided entirely.
  I left it in as the setting for FS for now, pending the question above.

- parenthesize the file name argument to getline, as suggested in the
  gawk manual.

- Added a few more comments about curiosities in the script.

- Only match AC_SUBST_FILEs if they are alone on the line (except for
  white space).  I don't particularly care about being precise enough
  to give an error for a line such as
    @substed_var@ @substed_file@
  when $substed_var happens to be empty.  Does anyone else?
  (It would require looping once for F[] and once for S[].)

- Some more tests from Paolo's examples.
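
For reference, the \& quoting issue mentioned in the first item can be
reproduced with a few lines.  This is only a sketch with a made-up
variable name and value, and it uses `awk -v', which assumes a POSIX awk:

```shell
value='a & b'   # hypothetical substitution value containing "&"

# gsub treats "&" in the replacement as "the matched text".
with_gsub=$(printf '%s\n' 'x=@foo@' |
  awk -v v="$value" '{ gsub(/@foo@/, v); print }')

# Plain string surgery with index/substr inserts the value verbatim.
with_substr=$(printf '%s\n' 'x=@foo@' |
  awk -v v="$value" '{
    i = index($0, "@foo@")
    if (i > 0)
      $0 = substr($0, 1, i - 1) v substr($0, i + length("@foo@"))
    print
  }')

printf '%s\n%s\n' "$with_gsub" "$with_substr"
```

The gsub variant re-inserts `@foo@' in place of the `&', whereas the
substr variant needs no escaping of the value at all.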

2006-11-21  Ralf Wildenhues  <address@hidden>

        Rewrite config files generation: avoid quadratic growth in
        the number of substituted variables by using awk instead of sed
        for the bulk of the substitutions.
        * lib/autoconf/status.m4 (_AC_AWK_LITERAL_LIMIT): New macro.
        (_AC_OUTPUT_FILES_PREPARE): Instead of several sed scripts,
        generate just one large awk script for substitutions,
        eliminating much of the earlier complexity, while adding some
        new complexity.  Only expand the substitution templates at
        configure time, for smaller configure script size.  The awk
        script was written with help from Paolo Bonzini and Paul Eggert. 
        (_AC_SUBST_CMDS): Renamed from...
        (_AC_SED_CMDS): ...this.
        (_AC_DELIM_NUM): Renamed from...
        (_AC_SED_DELIM_NUM): ...this.
        (_AC_SED_CMD_NUM, _AC_SED_FRAG, _AC_SED_FRAG_NUM): Removed.
        * tests/ (Substitute a 2000-byte string): Also
        substitute a line with 1000 words, and a variable with several
        long lines.
        (Substitute and define special characters): Test awk special
        characters, and put substitution input strings address@hidden@' in the
        output, to test that no recursion happens; test several other
        combinations from Paolo Bonzini.
        * doc/autoconf.texi (Setting Output Variables): The marker
        `|#_!!_#|' can appear in the substituted files again.
        * NEWS: Update.
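
The core of the generated awk script is the loop that splits each input
line at "@" and looks up every candidate name in the array S.  A cut-down
sketch that can be run on its own, assuming a POSIX awk; the contents of
S and the input lines are invented for the example, and file
substitutions (F[]) and the FS setting are left out:

```shell
result=$(printf '%s\n' 'x=@foo@ y=@bar@ z=@foo@' '@lit@ stays literal' |
  awk '
    BEGIN {
      S["foo"] = "FOO"      # hypothetical substitutions
      S["lit"] = "@foo@"    # value that looks like another variable
    }
    {
      nfields = split($0, field, "@")
      len = length(field[1])          # length of the text handled so far
      for (i = 2; i < nfields; i++) {
        key = field[i]
        keylen = length(key)
        if (key in S) {
          # Replace "@key@" and skip over the text after the closing "@".
          $0 = substr($0, 1, len) S[key] substr($0, len + keylen + 3)
          len += length(S[key]) + length(field[++i])
        } else {
          # Not a known name: its trailing "@" may open the next candidate.
          len += 1 + keylen
        }
      }
      print
    }')

printf '%s\n' "$result"
```

Because the loop walks the pieces of the original line rather than the
rewritten $0, a substituted value such as "@foo@" is not scanned again.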

--- NEWS        2006-11-20 18:42:44.000000000 +0100
+++ NEWS        2006-11-21 18:52:53.000000000 +0100
@@ -1,5 +1,8 @@
 * Major changes in Autoconf 2.61a (??)
+** config.status now uses awk for substitutions, for improved scaling
+  with the number of substituted variables.
 * Major changes in Autoconf 2.61 (2006-11-17)
--- doc/autoconf.texi   2006-11-17 18:49:39.000000000 +0100
+++ doc/autoconf.texi   2006-11-21 18:57:28.000000000 +0100
@@ -8351,9 +8351,7 @@
 is called.  The value can contain newlines.
 The substituted value is not rescanned for more output variables;
 occurrences of @samp{@@@var{variable}@@} in the value are inserted
-literally into the output file.  (The algorithm uses the special marker
address@hidden|#_!!_#|} internally, so the substituted value cannot contain
+literally into the output file.
 If @var{value} is given, in addition assign it to @var{variable}.
--- lib/autoconf/status.m4      2006-11-20 18:42:44.000000000 +0100
+++ lib/autoconf/status.m4      2006-11-21 18:41:50.000000000 +0100
@@ -311,6 +311,16 @@
+# ---------------------
+# Evaluate the maximum number of characters to put in an awk
+# string literal, not counting escape characters.
+# Some awk's have small limits, such as Solaris and AIX awk.
 # ------------------------
 # Create the sed scripts needed for CONFIG_FILES.
@@ -319,7 +329,7 @@
 # The intention is to have readable config.status and configure, even
 # though this m4 code might be scaring.
-# This code was written by Dan Manthey.
+# This code was written by Dan Manthey and rewritten by Ralf Wildenhues.
 # This macro is expanded inside a here document.  If the here document is
 # closed, it has to be reopened with "cat >>$CONFIG_STATUS <<\_ACEOF".
@@ -328,81 +338,42 @@
 # Set up the sed scripts for CONFIG_FILES section.
-dnl ... and define _AC_SED_CMDS, the pipeline which executes them.
-m4_define([_AC_SED_CMDS], [])dnl
+dnl ... and define _AC_SUBST_CMDS, the pipeline which executes them.
+m4_define([_AC_SUBST_CMDS], [| awk -f "$tmp/subs.awk" ])dnl
 # No need to generate the scripts if there are no CONFIG_FILES.
 # This happens for instance when ./config.status config.h
 if test -n "$CONFIG_FILES"; then
+echo 'BEGIN {' >"$tmp/subs.awk"
-m4_pushdef([_AC_SED_FRAG_NUM], 0)dnl Fragment number.
-m4_pushdef([_AC_SED_CMD_NUM], 2)dnl Num of commands in current frag so far.
-m4_pushdef([_AC_SED_DELIM_NUM], 0)dnl Expected number of delimiters in file.
-m4_pushdef([_AC_SED_FRAG], [])dnl The constant part of the current fragment.
-[# Create sed commands to just substitute file output variables.
-m4_foreach_w([_AC_Var], m4_defn([_AC_SUBST_FILES]),
-[dnl End fragments at beginning of loop so that last fragment is not ended.
-m4_if(m4_eval(_AC_SED_CMD_NUM + 3 > _AC_SED_CMD_LIMIT), 1,
-[dnl Fragment is full and not the last one, so no need for the final un-escape.
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-  m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF
-]m4_define([_AC_SED_CMD_NUM], 2)m4_define([_AC_SED_FRAG])dnl
-])dnl Last fragment ended.
-m4_define([_AC_SED_CMD_NUM], m4_eval(_AC_SED_CMD_NUM + 3))dnl
-[/^[    address@hidden@[        ]*$/{
-r $]_AC_Var[
+[# Create commands to substitute file output variables.
+  echo "cat >>$CONFIG_STATUS <<_ACEOF"
+  echo 'cat >>"\$tmp/subs.awk" <<\CEOF'
+  echo "$ac_subst_files" | sed 's/.*/F@<:@"&"@:>@ = "$&"/'
+  echo "CEOF"
+  echo "_ACEOF"
+} >conf$$
+. ./conf$$
+rm -f conf$$
-# Remaining file output variables are in a fragment that also has non-file
-# output varibles.
-m4_define([_AC_SED_FRAG], [
-m4_ifdef([_AC_SUBST_VARS], [m4_defn([_AC_SUBST_VARS]) ])address@hidden@],
-[m4_if(_AC_SED_DELIM_NUM, 0,
-[m4_if(_AC_Var, address@hidden@],
-[dnl The whole of the last fragment would be the final deletion of `|#_!!_#|'.
-m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
-ac_delim='%!_!# '
-for ac_last_try in false false false false false :; do
-  cat >conf$$subs.sed <<_ACEOF
-m4_if(_AC_Var, address@hidden@],
-      [m4_if(m4_eval(_AC_SED_CMD_NUM + 2 <= _AC_SED_CMD_LIMIT), 1,
-             [m4_define([_AC_SED_FRAG], [ end]m4_defn([_AC_SED_FRAG]))])],
-[m4_define([_AC_SED_CMD_NUM], m4_incr(_AC_SED_CMD_NUM))dnl
-m4_define([_AC_SED_DELIM_NUM], m4_incr(_AC_SED_DELIM_NUM))dnl
-      m4_if(_AC_Var, address@hidden@], m4_if(_AC_SED_CMD_NUM, 2, 2, 
-dnl Do not use grep on conf$$subs.sed, since AIX grep has a line length limit.
-  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.sed | grep -c X` = 
+  echo "cat >conf$$subs.awk <<_ACEOF"
+  echo "$ac_subst_vars" | sed 's/.*/&!$&$ac_delim/'
+  echo "_ACEOF"
+} >conf$$
+ac_delim_num=`echo "$ac_subst_vars" | grep -c '$'`
+ac_delim='%!_!# '
+for ac_last_try in false false false false false :; do
+  . ./conf$$
+dnl Do not use grep on conf$$subs.awk, since AIX grep has a line length limit.
+  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.awk | grep -c X` = 
$ac_delim_num; then
   elif $ac_last_try; then
     AC_MSG_ERROR([could not make $CONFIG_STATUS])
@@ -410,51 +381,92 @@
     ac_delim="$ac_delim!$ac_delim _$ac_delim!! "
+rm -f conf$$
 dnl Similarly, avoid grep here too.
-ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.sed`
+ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.awk`
 if test -n "$ac_eof"; then
   ac_eof=`echo "$ac_eof" | sort -nru | sed 1q`
   ac_eof=`expr $ac_eof + 1`
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF$ac_eof
-sed '
-s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g
-s/^/s,@/; s/!/@,|#_!!_#|/
-t n
-s/'"$ac_delim"'$/,g/; t
-s/$/\\/; p
-N; s/^.*\n//; s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g; b n
-' >>$CONFIG_STATUS <conf$$subs.sed
-rm -f conf$$subs.sed
-]m4_if(_AC_Var, address@hidden@],
-[m4_if(m4_eval(_AC_SED_CMD_NUM + 2 > _AC_SED_CMD_LIMIT), 1,
-[m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
-m4_define([_AC_SED_FRAG], [
-])m4_define([_AC_SED_DELIM_NUM], 0)m4_define([_AC_SED_CMD_NUM], 2)dnl
+dnl Initialize an awk array of substitutions, keyed by variable name.
+dnl First read a whole (potentially multi-line) substitution,
+dnl and construct `S["VAR"] ='.  Then split it into pieces that fit
+dnl in an awk literal.  Each piece then gets active characters escaped:
+dnl (if we escape earlier we risk splitting inside an escape sequence).
+dnl Output as separate string literals, joined with backslash-newline.
+dnl Eliminate the newline after `=' in a second script, for readability.
+dnl Notes to the main part of the awk script:
+dnl - the unusual FS value helps to avoid the limit of 99 fields,
+dnl - the space in `$ 0' avoids expansion by m4,
+dnl - we avoid sub/gsub because of the \& quoting issues, see
+dnl m4-double-quote most of the scripting for readability.
+cat >>"\$tmp/subs.awk" <<\CEOF$ac_eof
+sed '
+t line
+s/'"$ac_delim"'$//; t gotline
+N; b line
+s/^/S["/; s/!.*/"] = /; p
+t more
+t notlast
+s/["\\]/\\&/g; s/\n/\\n/g
+s/^/"/; s/$/"/
+s/["\\]/\\&/g; s/\n/\\n/g
+s/^/"/; s/$/"\\/
+b more
+' <conf$$subs.awk | sed '
+  N
+  s/\n//
+rm -f conf$$subs.awk
+  FS = "[|]#_!!_#[|]"
+  nfields = split($ 0, field, "@")
+  len = length(field[1])
+  for (i = 2; i < nfields; i++) {
+    key = field[i]
+    keylen = length(key)
+    if (key in S) {
+      $ 0 = substr($ 0, 1, len) "" S[key] "" substr($ 0, len + keylen + 3)
+      len += length(S[key]) + length(field[++i])
+    } else {
+      len += 1 + keylen
+      if (key in F && $ 0 ~ "^[         ]*@" key "@[    ]*$") {
+        while ((getline aline < (F[key])) > 0)
+          print(aline)
+        close(F[key])
+        next
+      }
+    }
+  }
+  print
+]dnl end of double-quoted part
 # VPATH may cause trouble with some makes, so we remove $(srcdir),
 # ${srcdir} and @srcdir@ from VPATH if srcdir is ".", strip leading and
@@ -554,7 +566,7 @@
 m4_ifndef([AC_DATAROOTDIR_CHECKED], [$ac_datarootdir_hack
-" $ac_file_inputs m4_defn([_AC_SED_CMDS])>$tmp/out
+" $ac_file_inputs m4_defn([_AC_SUBST_CMDS])>$tmp/out
 [test -z "$ac_datarootdir_hack$ac_datarootdir_seen" &&
--- tests/    2006-11-08 18:41:56.000000000 +0100
+++ tests/    2006-11-21 18:38:48.000000000 +0100
@@ -539,18 +539,26 @@
 # Solaris 9 /usr/ucb/sed that rejects commands longer than 4000 bytes.  HP/UX
 # sed dumps core around 8 KiB.  However, POSIX says that sed need not
 # handle lines longer than 2048 bytes (including the trailing newline).
-# So we'll just test a 2000-byte value.
+# So we'll just test a 2000-byte value, and for awk, we test a line with
+# almost 1000 words, and one variable with 4 lines of 500 bytes each.
 AT_SETUP([Substitute a 2000-byte string])
 AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
 AC_SUBST([foo], ]m4_for([n], 1, 100,, ....................)[)
+AC_SUBST([bar], "]m4_for([n], 1, 100,, @ @ @ @ @ @ @ @ @ @@)[")
+AC_SUBST([baz], "]m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... ....)
@@ -558,6 +566,11 @@
 AT_CHECK([cat Foo], 0, m4_for([n], 1, 100,, ....................)
+AT_CHECK([cat Bar], 0, m4_for([n], 1, 100,, @ @ @ @ @ @ @ @ @ @@)
+AT_CHECK([cat Baz], 0, m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... 
@@ -584,25 +597,57 @@
 ## Substitute and define special characters.  ##
 ## ------------------------------------------ ##
-# Use characters special to the shell, sed, and M4.
+# Use characters special to the shell, sed, awk, and M4.
 AT_SETUP([Substitute and define special characters])
 AT_DATA([], address@hidden@
address@hidden@@notsubsted@@baz@ stray @ and more@@@baz@
address@hidden @address@hidden
address@hidden @address@hidden@
address@hidden @baz@@baz@
+        @file@  
-[[foo="AS@&address@hidden([[X*'[]+ ", `\($foo]])"
+[[foo="AS@&address@hidden([[X*'[]+ ",& &`\($foo \& \\& \\\& \\\\& \ \\ \\\]])"
+bar="@foo@ @baz@"
-AC_DEFINE([foo], [[X*'[]+ ", `\($foo]], [Awful value.])
+AC_DEFINE([foo], [[X*'[]+ ",& &`\($foo]], [Awful value.])
-AT_CHECK([cat Foo], 0, [[X*'[]+ ", `\($foo
-AT_CHECK_DEFINES([[#define foo X*'[]+ ", `\($foo
+AT_CHECK([cat Foo], 0, [[X*'[]+ ",& &`\($foo \& \\& \\\& \\\\& \ \\ \\\
address@hidden@ @baz@@address@hidden stray @ and more@@bla
address@hidden@ @address@hidden@baz
address@hidden@ @address@hidden
address@hidden@ @address@hidden@
address@hidden blabaz
address@hidden blabaz@
address@hidden blabla
+AT_CHECK_DEFINES([[#define foo X*'[]+ ",& &`\($foo
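
The file-substitution branch of the script (a line consisting only of
@name@ and white space is replaced by the contents of a file) can also be
tried in isolation.  This is a sketch, not the patch code: it assumes a
POSIX awk and mktemp, the fragment file and the name "file" are made up,
and the key is taken from split() rather than from the main loop:

```shell
tmpdir=$(mktemp -d)
printf 'line one\nline two\n' >"$tmpdir/frag"

file_result=$(printf '%s\n' 'before' '  @file@  ' 'after' |
  awk -v frag="$tmpdir/frag" '
    BEGIN { F["file"] = frag }    # hypothetical AC_SUBST_FILE entry
    {
      n = split($0, field, "@")
      key = field[2]
      if (n == 3 && (key in F) && $0 ~ ("^[ \t]*@" key "@[ \t]*$")) {
        # Parenthesize the file name argument to getline, as the gawk
        # manual suggests, and close the file so it can be read again.
        while ((getline aline < (F[key])) > 0)
          print aline
        close(F[key])
        next
      }
      print
    }')

rm -rf "$tmpdir"
printf '%s\n' "$file_result"
```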
