Re: config files substitution with awk

From: Ralf Wildenhues
Subject: Re: config files substitution with awk
Date: Fri, 24 Nov 2006 19:28:04 +0100
User-agent: Mutt/1.5.13 (2006-08-11)

Hello again, seems to be seeing some outage, so I may not have caught all
related messages.

* Paul Eggert wrote on Tue, Nov 21, 2006 at 10:04:01PM CET:
> Ralf Wildenhues <address@hidden> writes:
> > - Choose for FS a character unlikely to occur often; I'd guess # or ~
> >   should work?
> Yes.  I'd prefer something like that to invoking sed twice extra.
> You could even use control-G, say.  (But please see below.)

FWIW, one extra invocation would suffice, as sed is already used before
the awk.  But control-G it is for now.
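
As an illustration of the FS choice, here is a minimal sketch, assuming a
POSIX awk that understands octal escapes in string literals; the helper
name `count_fields' is made up for this example and is not part of the
patch.  With FS set to control-G, a line with many blank-separated words
is still a single field, so the 99-field limit of some old awks should
not be hit.

```shell
# Hypothetical helper: print the field count of each input line,
# with FS set to control-G (octal 007).
count_fields() {
  awk 'BEGIN { FS = "\007" } { print NF }'
}

# Build one line of 200 blank-separated words and count its fields.
words=`awk 'BEGIN { for (i = 1; i <= 200; i++) printf "w%d ", i; print "" }'`
printf '%s\n' "$words" | count_fields
```

With the default FS the same line would be split into 200 fields.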

> > I have another Solaris awk issue, and don't know how to get around
> > this easily: it supports `index in array' only in for statements:
> That should be documented.  I'll install the patch enclosed at the
> end of this message.

Thanks for this change, and all your work on documenting awk specialties
in the Autoconf manual.  Please forgive me if the following sounds
heretical, but after having studied
a bit, the following thought comes to mind:  The gawk.texi section about
differences in implementations is much more detailed than the Autoconf
one currently is (and I hope ever will be, lest we increase it a lot);
OTOH, our manual contains information not present in the gawk one
(readily to me, at least).  How about adding a pointer to the gawk
manual part, feeding our extra knowledge that way, and removing these
bits from autoconf.texi after the following gawk release?  IMHO it's
better to have one coherent source of information in this case.

FWIW, I'd volunteer to go over this thread and collect things to feed to
bug-gawk eventually.

> > Note that rewriting this to, say, test for nonempty 'array[index]'
> > instead (and using a marker to distinguish empty replacement strings)
> > could be quite memory-intensive, due to all the new array members
> > created on the way, so I'd prefer not to go that way, but I admit to
> > not having tested this.
> I wouldn't worry about this unless it's demonstrably bad.

Well, with the test cases I use, it's not too bad:  About a second more
for the large example, only little above the noise level.  Typically for
a from Automake, each per-target compilation rule and each
included .Po file will generate additional entries.  In any case memory
growth is at most proportional to config file size.
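
For reference, a minimal sketch of the marker approach, assuming a POSIX
awk; the helper name `subst_lookup' and the sample keys are invented for
this example.  Empty replacements are tagged with the marker once, so
that a plain test for a nonempty `S[key]' can stand in for `key in S'
outside of a for statement.  The price is that every looked-up key
becomes a new (empty) array member, which is the memory growth mentioned
above.

```shell
# Hypothetical helper: report for each input word whether it is a
# substituted variable, without using `key in S' outside a for loop.
subst_lookup() {
  awk '
  BEGIN {
    S["prefix"] = "/usr/local"
    S["empty"] = ""
    # Tag empty replacements so that they differ from unset keys.
    for (key in S)
      if (S[key] == "")
        S[key] = "|#_!!_#|"
  }
  {
    key = $0
    # Note: merely referencing S[key] creates the array member.
    if (S[key] != "") {
      value = S[key]
      if (value == "|#_!!_#|")
        value = ""
      print key " -> [" value "]"
    } else
      print key " is not substituted"
  }'
}

printf '%s\n' prefix empty unknown | subst_lookup
```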

(IOW, we should revisit this issue together with the next complexity
reduction, but not earlier.  That one invokes only one awk process for
all config files, in order to replace the `F * (L + S)' term with a
`F * L' term for `./config.status' execution, by exploiting awks that
can divert output.  Autoconf isn't ready for this just yet, and it's
probably not pressing yet, either.)

> > +        while ((getline aline < (F[key])) > 0)
> I don't think this'll work with Solaris /bin/awk; it has only plain
> 'getline', with no support for | or <.  (Old traditional Awk didn't
> have getline at all.)
> One way to work around this would be to pipe the output of 'awk' into
> 'sh', and have 'sh' do the interpolation by calling 'cat'.  Or we
> could go back to using 'sed' for file interpolation.

Going back to sed sounds like a loss to me: that way, we retain most of
the complexity of the current M4 code in status.m4, which is undesirable
IMVHO (with an apology to Dan).  Piping through 'sh' works but causes
quite some overhead (about 6 seconds in the large test).  So...

> In either case, we could use getline if our dynamic test succeeds.

... yes, that is desirable.
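
A minimal sketch of file interpolation with getline, assuming an awk
that supports both `getline var < file' and `-v' (so not Solaris
/bin/awk); the helper name `interpolate_file' and the literal `@file@'
line are assumptions made for this example only.

```shell
# Hypothetical helper: replace a line consisting of "@file@" by the
# contents of the file named in $1, and copy all other lines.
interpolate_file() {
  awk -v path="$1" '
  $0 == "@file@" {
    # getline returns 1 per line read, 0 at EOF, and -1 on error.
    while ((getline aline < path) > 0)
      print aline
    close(path)
    next
  }
  { print }'
}

demo=${TMPDIR-/tmp}/getline-demo.$$
printf 'one\ntwo\n' >"$demo"
printf 'before\n@file@\nafter\n' | interpolate_file "$demo"
rm -f "$demo"
```

The parentheses around the getline expression matter: without them,
`<' and `>' are easily parsed differently by different awks.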

> Or, if we can't come up with a better solution, perhaps we should go
> back to using AC_PROG_AWK.  Perhaps that's simpler.  After all, we'd
> have to go back quite a ways to find a host without a modern Awk.  But
> if we go this route, perhaps we should check that the Awk that we use
> actually has all the features we need.

It sounds wrong to me to check for a feature and then simply error out
when the check fails.  Besides, it may also make our case harder for the
gawk package.

So let's see what we can do:
- For packages that do not use AC_SUBST_FILE, we can just write a
  portable-to-ancient-awk script.
- For packages that use AC_SUBST_FILE, we test awk for getline support.
  If yes, use that (for efficiency), if no, interpolate through $SHELL.

This helps keep the overhead low for packages not using AC_SUBST_FILE,
and for all packages on sane systems.  It has the small disadvantage
that status.m4 will end up being a bit more complicated again.  Oh well.

The patch below implements this strategy.  The getline test is rather
cheap, so I chose to do it at config.status time rather than configure
time, for simplicity and to keep the test precise even when, say, $PATH
is different when config.status is reinvoked by `make'.
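
The run-time check can be sketched as follows.  The awk one-liner is the
probe used in the patch below; the function and variable names around it
are hypothetical and only serve this example.

```shell
# Probe modeled on the test in the patch: a zero exit status
# suggests that this awk accepts `getline <file'.
# The function name is hypothetical.
probe_awk_getline() {
  if awk 'BEGIN { getline <"/dev/null" }' </dev/null 2>/dev/null; then
    echo :
  else
    echo false
  fi
}

# The result is either `:' or `false', so it can be run as a command.
awk_getline=`probe_awk_getline`
if $awk_getline; then
  echo "awk getline works: interpolate files within awk"
else
  echo "no usable getline: pipe the awk output through the shell"
fi
```

Since the probe runs whichever `awk' is first in the current $PATH, the
answer matches the awk that does the substitution in the same run.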

The shell interpolation (and the workaround for `index in array') again
require the use of a special marker.  I hope `|#_!!_#|' is sufficient
for both uses (and mentioned this in autoconf.texi).
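
A minimal sketch of the shell interpolation, assuming a POSIX sh and an
awk with `-v'; the helper name `interpolate_via_sh' and the literal
`@file@' line are made up for this example, and the file name is assumed
not to contain single quotes.  awk wraps ordinary lines in here-documents
delimited by the marker and emits a `cat' command for the file; the
result is a small shell script that sh then executes.

```shell
# Hypothetical helper: like file interpolation with getline, but
# awk only writes a shell script and sh does the actual file reading.
interpolate_via_sh() {
  awk -v path="$1" -v q="'" '
  # Open a quoted here-document for the literal lines.
  BEGIN { print "cat <<" q "|#_!!_#|" q }
  $0 == "@file@" {
    print "|#_!!_#|"                   # close the here-document
    print "cat " q path q              # let the shell copy the file
    print "cat <<" q "|#_!!_#|" q      # and reopen it
    next
  }
  { print }
  END { print "|#_!!_#|" }' | sh
}

demo=${TMPDIR-/tmp}/sh-interp-demo.$$
printf 'one\ntwo\n' >"$demo"
printf 'before\n@file@\nafter\n' | interpolate_via_sh "$demo"
rm -f "$demo"
```

This also shows why the marker must not occur in the input or in a
substituted value: a line consisting of just the marker would end the
here-document early.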

Since Paolo's tests define what happens with `@var1@var2@', I added a
note in the manual that users should not rely on this.  (This is to
avoid the need to distinguish the "special substitutions" still done
with sed, and to allow us to change the implementation again later on.)

An independent open question to me is that, if AC_PROG_AWK was used
anyway by, whether we should then prefer $AWK over awk,
for efficiency.  WDYT?

Due to the `$ 0' issue (see other message), that is used only for
reading the current line now.

Is this patch ok for Autoconf now, so I can start asking bug-gawk?


2006-11-24  Ralf Wildenhues  <address@hidden>

        Rewrite config files generation: avoid quadratic growth in
        the number of substituted variables by using awk instead of sed
        for the bulk of the substitutions.
        * lib/autoconf/status.m4 (_AC_AWK_LITERAL_LIMIT): New macro.
        (_AC_OUTPUT_FILES_PREPARE): Instead of several sed scripts,
        generate just one large awk script for substitutions,
        eliminating much of the earlier complexity, while adding some
        new complexity.  Only expand the substitution templates at
        configure time, for smaller configure script size.  If
        _AC_SUBST_FILES are used, test 'awk' for working getline support
        at config.status time.  If absent, interpolate through the
        shell.  The awk script was written with much help
        from Paolo Bonzini and Paul Eggert.
        (_AC_SED_CMD_NUM, _AC_SED_DELIM_NUM, _AC_SED_FRAG): Removed.
        (_AC_SED_FRAG_NUM): Likewise.
        (_AC_SUBST_CMDS): Renamed from...
        (_AC_SED_CMDS): ...this.
        * tests/ (Substitute a 2000-byte string): Also
        substitute a line with 1000 words, and a variable with several
        long lines.
        (Substitute and define special characters): Test awk special
        characters, and put substitution input strings address@hidden@' in the
        output, to test that no recursion happens; test several other
        combinations from Paolo Bonzini.
        * doc/autoconf.texi (Makefile Substitutions): `@var1@var2@' is
        undefined for two substituted variables.
        (Setting Output Variables): `|#_!!_#|' is also forbidden in the
        output (and thus input) file.
        * NEWS: Update.

--- NEWS        2006-11-23 20:01:04 -0000
+++ NEWS        2006-11-23 20:20:28 -0000
@@ -1,5 +1,8 @@
 * Major changes in Autoconf 2.61a (??)
+** config.status now uses awk for substitutions, for improved scaling
+  with the number of substituted variables.
 * Major changes in Autoconf 2.61 (2006-11-17)
--- doc/autoconf.texi   2006-11-23 20:09:38.000000000 +0100
+++ doc/autoconf.texi   2006-11-23 21:14:48.000000000 +0100
@@ -2183,8 +2183,10 @@
 substitute a particular variable into the output files, the macro
 @code{AC_SUBST} must be called with that variable name as an argument.
 Any occurrences of @samp{@@@var{variable}@@} for other variables are
-left unchanged.  @xref{Setting Output Variables}, for more information
-on creating output variables with @code{AC_SUBST}.
+left unchanged.  The input @samp{@@@var{variable1}@@@var{variable2}@@}
+with two substituted variables is missing a @samp{@@} and causes undefined
+output.  @xref{Setting Output Variables}, for more information on creating
+output variables with @code{AC_SUBST}.
 A software package that uses a @command{configure} script should be
 distributed with a file @file{}, but no makefile; that
@@ -8352,8 +8354,8 @@
 The substituted value is not rescanned for more output variables;
 occurrences of @samp{@@@var{variable}@@} in the value are inserted
 literally into the output file.  (The algorithm uses the special marker
-@code{|#_!!_#|} internally, so the substituted value cannot contain
+@code{|#_!!_#|} internally, so neither the substituted value nor the
+output file may contain @code{|#_!!_#|}.)
 If @var{value} is given, in addition assign it to @var{variable}.
--- lib/autoconf/status.m4      2006-11-18 04:04:15.000000000 +0100
+++ lib/autoconf/status.m4      2006-11-23 20:00:23.000000000 +0100
@@ -311,6 +311,16 @@
+# ---------------------
+# Evaluate the maximum number of characters to put in an awk
+# string literal, not counting escape characters.
+# Some awk's have small limits, such as Solaris and AIX awk.
 # ------------------------
 # Create the sed scripts needed for CONFIG_FILES.
@@ -319,90 +329,80 @@
 # The intention is to have readable config.status and configure, even
 # though this m4 code might be scaring.
-# This code was written by Dan Manthey.
+# This code was written by Dan Manthey and rewritten by Ralf Wildenhues.
 # This macro is expanded inside a here document.  If the here document is
 # closed, it has to be reopened with "cat >>$CONFIG_STATUS <<\_ACEOF".
-# Set up the sed scripts for CONFIG_FILES section.
-dnl ... and define _AC_SED_CMDS, the pipeline which executes them.
-m4_define([_AC_SED_CMDS], [])dnl
-# No need to generate the scripts if there are no CONFIG_FILES.
-# This happens for instance when ./config.status config.h
+[# Set up the scripts for CONFIG_FILES section.
+# No need to generate them if there are no CONFIG_FILES.
+# This happens for instance with `./config.status config.h'.
 if test -n "$CONFIG_FILES"; then
-m4_pushdef([_AC_SED_FRAG_NUM], 0)dnl Fragment number.
-m4_pushdef([_AC_SED_CMD_NUM], 2)dnl Num of commands in current frag so far.
-m4_pushdef([_AC_SED_DELIM_NUM], 0)dnl Expected number of delimiters in file.
-m4_pushdef([_AC_SED_FRAG], [])dnl The constant part of the current fragment.
+dnl For AC_SUBST_FILE, check for usable getline support in awk,
+dnl at config.status execution time.
+dnl Otherwise, do the interpolation in sh, which is slower.
+dnl Without any AC_SUBST_FILE, omit all related code.
+dnl Note the expansion is double-quoted for readability.
+[[if awk 'BEGIN { getline <"/dev/null" }' </dev/null 2>/dev/null; then
+  ac_cs_awk_getline=:
+  ac_cs_awk_pipe_init=
+  ac_cs_awk_read_file='
+      while ((getline aline < (F[key])) > 0)
+       print(aline)
+      close(F[key])'
+  ac_cs_awk_pipe_fini=
+  ac_cs_awk_getline=false
+  ac_cs_awk_pipe_init="print \"cat <<'|#_!!_#|'\""
+  ac_cs_awk_read_file='
+      print "|#_!!_#|"
+      print "cat " F[key]
+      '$ac_cs_awk_pipe_init
+  ac_cs_awk_pipe_fini='END { print "|#_!!_#|" }'
+dnl Define the pipe that does the substitution.
-[# Create sed commands to just substitute file output variables.
-m4_foreach_w([_AC_Var], m4_defn([_AC_SUBST_FILES]),
-[dnl End fragments at beginning of loop so that last fragment is not ended.
-m4_if(m4_eval(_AC_SED_CMD_NUM + 3 > _AC_SED_CMD_LIMIT), 1,
-[dnl Fragment is full and not the last one, so no need for the final un-escape.
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-  m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF
+[m4_define([_AC_SUBST_CMDS], [|
+if $ac_cs_awk_getline; then
+  awk -f "$tmp/subs.awk"
+  awk -f "$tmp/subs.awk" | $SHELL
+[| awk -f "$tmp/subs.awk"])])dnl
+echo 'BEGIN {' >"$tmp/subs.awk"
-]m4_define([_AC_SED_CMD_NUM], 2)m4_define([_AC_SED_FRAG])dnl
-])dnl Last fragment ended.
-m4_define([_AC_SED_CMD_NUM], m4_eval(_AC_SED_CMD_NUM + 3))dnl
-[/^[    address@hidden@[        ]*$/{
-r $]_AC_Var[
+[# Create commands to substitute file output variables.
+  echo "cat >>$CONFIG_STATUS <<_ACEOF"
+  echo 'cat >>"\$tmp/subs.awk" <<\CEOF'
+  echo "$ac_subst_files" | sed 's/.*/F@<:@"&"@:>@="$&"/'
+  echo "CEOF"
+  echo "_ACEOF"
+} >conf$$
+. ./conf$$
+rm -f conf$$
-# Remaining file output variables are in a fragment that also has non-file
-# output varibles.
-m4_define([_AC_SED_FRAG], [
-m4_ifdef([_AC_SUBST_VARS], [m4_defn([_AC_SUBST_VARS]) ])address@hidden@],
-[m4_if(_AC_SED_DELIM_NUM, 0,
-[m4_if(_AC_Var, address@hidden@],
-[dnl The whole of the last fragment would be the final deletion of `|#_!!_#|'.
-m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
-ac_delim='%!_!# '
-for ac_last_try in false false false false false :; do
-  cat >conf$$subs.sed <<_ACEOF
-m4_if(_AC_Var, address@hidden@],
-      [m4_if(m4_eval(_AC_SED_CMD_NUM + 2 <= _AC_SED_CMD_LIMIT), 1,
-             [m4_define([_AC_SED_FRAG], [ end]m4_defn([_AC_SED_FRAG]))])],
-[m4_define([_AC_SED_CMD_NUM], m4_incr(_AC_SED_CMD_NUM))dnl
-m4_define([_AC_SED_DELIM_NUM], m4_incr(_AC_SED_DELIM_NUM))dnl
-      m4_if(_AC_Var, address@hidden@], m4_if(_AC_SED_CMD_NUM, 2, 2, 
-dnl Do not use grep on conf$$subs.sed, since AIX grep has a line length limit.
-  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.sed | grep -c X` = 
+  echo "cat >conf$$subs.awk <<_ACEOF"
+  echo "$ac_subst_vars" | sed 's/.*/&!$&$ac_delim/'
+  echo "_ACEOF"
+} >conf$$
+ac_delim_num=`echo "$ac_subst_vars" | grep -c '$'`
+ac_delim='%!_!# '
+for ac_last_try in false false false false false :; do
+  . ./conf$$
+dnl Do not use grep on conf$$subs.awk, since AIX grep has a line length limit.
+  if test `sed -n "s/.*$ac_delim\$/X/p" conf$$subs.awk | grep -c X` = $ac_delim_num; then
   elif $ac_last_try; then
     AC_MSG_ERROR([could not make $CONFIG_STATUS])
@@ -410,51 +410,110 @@
     ac_delim="$ac_delim!$ac_delim _$ac_delim!! "
+rm -f conf$$
 dnl Similarly, avoid grep here too.
-ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.sed`
+ac_eof=`sed -n '/^CEOF[[0-9]]*$/s/CEOF/0/p' conf$$subs.awk`
 if test -n "$ac_eof"; then
   ac_eof=`echo "$ac_eof" | sort -nru | sed 1q`
   ac_eof=`expr $ac_eof + 1`
-dnl Increment fragment number.
-m4_define([_AC_SED_FRAG_NUM], m4_incr(_AC_SED_FRAG_NUM))dnl
-dnl Record that this fragment will need to be used.
-m4_defn([_AC_SED_CMDS])[| sed -f "$tmp/subs-]_AC_SED_FRAG_NUM[.sed" ])dnl
-cat >"\$tmp/subs-]_AC_SED_FRAG_NUM[.sed" <<\CEOF$ac_eof
-sed '
-s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g
-s/^/s,@/; s/!/@,|#_!!_#|/
-t n
-s/'"$ac_delim"'$/,g/; t
-s/$/\\/; p
-N; s/^.*\n//; s/[,\\&]/\\&/g; s/@/@|#_!!_#|/g; b n
-' >>$CONFIG_STATUS <conf$$subs.sed
-rm -f conf$$subs.sed
+dnl Initialize an awk array of substitutions, keyed by variable name.
+dnl First read a whole (potentially multi-line) substitution,
+dnl and construct `S["VAR"]='.  Then, split it into pieces that fit
+dnl in an awk literal.  Each piece then gets active characters escaped
+dnl (if we escape earlier we risk splitting inside an escape sequence).
+dnl Output as separate string literals, joined with backslash-newline.
+dnl Eliminate the newline after `=' in a second script, for readability.
+dnl Notes to the main part of the awk script:
+dnl - the unusual FS value helps prevent running into the limit of 99 fields,
+dnl - we avoid sub/gsub because of the \& quoting issues, see
+dnl - Writing `$ 0' prevents expansion by both the shell and m4 here.
+dnl m4-double-quote most of the scripting for readability.
+cat >>"\$tmp/subs.awk" <<\CEOF$ac_eof
+sed '
+t line
+s/'"$ac_delim"'$//; t gotline
+N; b line
+s/^/S["/; s/!.*/"]=/; p
+t more
+t notlast
+s/["\\]/\\&/g; s/\n/\\n/g
+s/^/"/; s/$/"/
+s/["\\]/\\&/g; s/\n/\\n/g
+s/^/"/; s/$/"\\/
+b more
+' <conf$$subs.awk | sed '
+  N
+  s/\n//
+rm -f conf$$subs.awk
-]m4_if(_AC_Var, address@hidden@],
-[m4_if(m4_eval(_AC_SED_CMD_NUM + 2 > _AC_SED_CMD_LIMIT), 1,
-[m4_define([_AC_SED_CMDS], m4_defn([_AC_SED_CMDS])[| sed 's/|#_!!_#|//g' ])],
+cat >>"\$tmp/subs.awk" <<CEOF
+  for (key in S) {
+    if (S[key] == "")
+      S[key] = "|#_!!_#|"
+  }
+  FS = ""
+[  \$ac_cs_awk_pipe_init])[
+  line = $ 0
+  nfields = split(line, field, "@")
+  substed = 0
+  len = length(field[1])
+  for (i = 2; i < nfields; i++) {
+    key = field[i]
+    keylen = length(key)
+    if (S[key] != "") {
+      if (S[key] == "|#_!!_#|")
+        value = ""
+      else
+        value = S[key]
+      line = substr(line, 1, len) "" value "" substr(line, len + keylen + 3)
+      len += length(value) + length(field[++i])
+      substed = 1
+    } else
+      len += 1 + keylen
+  }
+[[  if (nfields == 3 && !substed) {
+    key = field[2]
+    if (F[key] != "" && line ~ /^[      address@hidden@[        ]*$/) {
+      \$ac_cs_awk_read_file
+      next
+    }
+  }]])[
+  print line
-m4_define([_AC_SED_FRAG], [
-])m4_define([_AC_SED_DELIM_NUM], 0)m4_define([_AC_SED_CMD_NUM], 2)dnl
+]dnl end of double-quoted part
 # VPATH may cause trouble with some makes, so we remove $(srcdir),
 # ${srcdir} and @srcdir@ from VPATH if srcdir is ".", strip leading and
@@ -554,7 +613,7 @@
 m4_ifndef([AC_DATAROOTDIR_CHECKED], [$ac_datarootdir_hack
-" $ac_file_inputs m4_defn([_AC_SED_CMDS])>$tmp/out
+" $ac_file_inputs m4_defn([_AC_SUBST_CMDS]) >$tmp/out
 [test -z "$ac_datarootdir_hack$ac_datarootdir_seen" &&
--- tests/    28 Oct 2006 19:41:07 -0000      1.72
+++ tests/    23 Nov 2006 20:20:39 -0000
@@ -539,18 +539,26 @@
 # Solaris 9 /usr/ucb/sed that rejects commands longer than 4000 bytes.  HP/UX
 # sed dumps core around 8 KiB.  However, POSIX says that sed need not
 # handle lines longer than 2048 bytes (including the trailing newline).
-# So we'll just test a 2000-byte value.
+# So we'll just test a 2000-byte value, and for awk, we test a line with
+# almost 1000 words, and one variable with 4 lines of 500 bytes each.
 AT_SETUP([Substitute a 2000-byte string])
 AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
+AT_DATA([], address@hidden@
 AC_SUBST([foo], ]m4_for([n], 1, 100,, ....................)[)
+AC_SUBST([bar], "]m4_for([n], 1, 100,, @ @ @ @ @ @ @ @ @ @@)[")
+AC_SUBST([baz], "]m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... ....)
@@ -558,6 +566,11 @@
 AT_CHECK([cat Foo], 0, m4_for([n], 1, 100,, ....................)
+AT_CHECK([cat Bar], 0, m4_for([n], 1, 100,, @ @ @ @ @ @ @ @ @ @@)
+AT_CHECK([cat Baz], 0, m4_for([n], 1, 4,, m4_for([m], 1, 25,, ... ... ... ... 
@@ -584,25 +597,57 @@
 ## Substitute and define special characters.  ##
 ## ------------------------------------------ ##
-# Use characters special to the shell, sed, and M4.
+# Use characters special to the shell, sed, awk, and M4.
 AT_SETUP([Substitute and define special characters])
 AT_DATA([], address@hidden@
address@hidden@@notsubsted@@baz@ stray @ and more@@@baz@
address@hidden @address@hidden
address@hidden @address@hidden@
address@hidden @baz@@baz@
+        @file@  
-[[foo="AS@&address@hidden([[X*'[]+ ", `\($foo]])"
+[[foo="AS@&address@hidden([[X*'[]+ ",& &`\($foo \& \\& \\\& \\\\& \ \\ \\\]])"
+bar="@foo@ @baz@"
-AC_DEFINE([foo], [[X*'[]+ ", `\($foo]], [Awful value.])
+AC_DEFINE([foo], [[X*'[]+ ",& &`\($foo]], [Awful value.])
-AT_CHECK([cat Foo], 0, [[X*'[]+ ", `\($foo
+AT_CHECK([cat Foo], 0, [[X*'[]+ ",& &`\($foo \& \\& \\\& \\\\& \ \\ \\\
address@hidden@ @baz@@address@hidden stray @ and more@@bla
address@hidden@ @address@hidden@baz
address@hidden@ @address@hidden
address@hidden@ @address@hidden@
address@hidden blabaz
address@hidden blabaz@
address@hidden blabla
-AT_CHECK_DEFINES([[#define foo X*'[]+ ", `\($foo
+AT_CHECK_DEFINES([[#define foo X*'[]+ ",& &`\($foo
