bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4


From: Assaf Gordon
Subject: bug#26879: end-of-line issue with cygwin 4.4-1 sed 4.4
Date: Fri, 12 May 2017 22:17:09 -0400

Hello,

Replying to a few technical topics:

> On May 12, 2017, at 05:17, Dick Dunbar <address@hidden> wrote:
> 
> I want to render filenames emitted by a program ( not find) in single
> quotes so that no special characters are interpreted by the shell:
>   ( space, $, etc )

This could be a bit trickier than it seems.

Let's start with an easy case: you know in advance that file names contain
neither newlines (CR/LF) nor single quotes.
In that case, the following would work (on Linux):

  find -type f | sed -e "s/^/'/" -e "s/\$/'/"

In your cygwin case, where '\r' might be added at the end-of-line
before the '\n', we can simply discard it:

  find -type f | tr -d '\r' | sed -e "s/^/'/" -e "s/\$/'/"

But there's a problem: a single-quoted string in the shell cannot contain
a single-quote character. If you have a file like this:

  touch "a'b"

Then you'll need to specifically escape it (by switching to double-quotes):

  $ touch "a'b" 'c$d' "e f"
  $ find -type f | tr -d '\r' | sed -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/"
  './e f'
  './a'"'"'b'
  './c$d'

Note that all these examples don't need NUL/-print0/-z because we assume
in advance that no file name contains newlines.
I'm also ignoring the extra complications of CRLF vs LF
(and the possibility of a filename actually containing '\r').
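
The quoting pipeline above can be wrapped in a small function and checked
by letting the shell parse a quoted line back. This is only a sketch:
'quote_lines' is a hypothetical helper name, and it still assumes one
file name per line (no embedded newlines).

```shell
# Hypothetical helper: wrap each input line in single quotes, escaping
# embedded single quotes as '"'"' (close quote, double-quoted quote, reopen).
# Assumes one file name per line, i.e. no newlines inside names.
quote_lines() {
    tr -d '\r' | sed -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/"
}

# printf stands in for the program that emits the file names.
quoted=$(printf '%s\n' "a'b" 'c$d' 'e f' | quote_lines)
printf '%s\n' "$quoted"

# Round trip: let the shell parse the first quoted line back into a string.
eval "first=$(printf '%s\n' "$quoted" | sed -n 1p)"
printf '%s\n' "$first"
```

If the round trip prints the original name (a'b), the quoting survived
shell parsing.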

----

At the risk of repeating myself, I'll just mention again that perhaps
it's worth asking *why* you want to protect the file names from special
characters?
If the goal is eventually to pass them to some other program,
then consider whether your pipeline/script can be reworked
to use 'xargs -0', which will pass the filenames directly (without shell
involvement), so there will be no problem with special shell characters.

E.g. to invoke a program once per file (The '%' will be replaced with the 
filename):

  $ find -type f -print0 | xargs -0 -I% echo ==%==
  ==./e f==
  ==./a'b==
  ==./c$d==
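
The same idea can be tried without creating any files. In this sketch,
printf stands in for 'find -print0' (an assumption made only to keep the
demo self-contained), and 'xargs -0' is assumed to be available, as it is
in GNU findutils.

```shell
# Emit three NUL-terminated names, then hand each one to a command.
# No shell ever re-parses the names, so ', $ and spaces need no quoting.
out=$(printf '%s\0' 'e f' "a'b" 'c$d' | xargs -0 -n 1 printf '==%s==\n')
printf '%s\n' "$out"
```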

----


> On May 12, 2017, at 05:26, Dick Dunbar <address@hidden> wrote:

> It is still unexplained how sed correctly finds the end-of-line correctly
> when there are no control characters at all.  ( \r, \n )

Sed works like so (more-or-less, some technical details omitted for brevity):

1. Sed reads the input until the END-OF-LINE character (\n or NUL).
2. It puts the bytes into something called "pattern space",
   WITHOUT the EOL character.
3. Any operation you perform (e.g. s/$/foo/) is done
   on the pattern-space (which does *not* contain the EOL character).
4. After executing all the sed commands,
   sed prints the content of the pattern space.
   IF the input line had an EOL character (which was removed),
   sed then prints the END-OF-LINE character again.

If sed does not encounter the EOL character, it reads until the
end of the input/end-of-file, performs all the sed commands, then prints
the content without adding any EOL characters.
(A side note: input without terminating EOL is not POSIX-compatible.
See here for an interesting discussion about how different sed implementations
deal with lines without EOL: https://bugs.gnu.org/26574 )
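
The cycle described above can be modelled with a plain shell loop. This is
a toy illustration, not how sed is implemented: 'toy_sed' and 'hex' are
hypothetical helper names, and the only "command" it runs is the
equivalent of s/$/X/.

```shell
# Hypothetical helper: print stdin as one continuous hex string.
hex() { od -An -v -tx1 | tr -d ' \n'; }

# Toy model of the sed cycle, hard-wired to behave like s/$/X/.
toy_sed() {
    while :; do
        # Steps 1+2: read up to the EOL; 'read' stores the line WITHOUT it.
        if IFS= read -r line; then
            had_eol=yes
        else
            had_eol=no                  # hit end-of-input before any EOL
            [ -n "$line" ] || break     # nothing left to process
        fi
        # Step 3: operate on the "pattern space" (no EOL inside it).
        line="${line}X"
        # Step 4: print the content, re-adding the EOL only if one was read.
        printf '%s' "$line"
        if [ "$had_eol" = yes ]; then
            printf '\n'
        else
            break
        fi
    done
}

printf 'aaa\nbbb\n' | toy_sed | hex; echo   # last line ends with 0a
printf 'aaa\nbbb'   | toy_sed | hex; echo   # last line has no 0a
```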

To give some concrete examples:

---

printf "aaa\nbbb\nccc\n" | sed 's/[something]//'

   Above, sed uses '\n' as the EOL character. It reads
   3 lines, and performs the operation 's///' on each
   of them (once 'aaa', once 'bbb', once 'ccc').

   The character '\n' (ASCII \x0A) is NEVER stored in the buffer,
   and you can't modify it with 's///'.

---

printf "aaa\n" | sed 's/$/\n/'

  Above, sed reads the line until the '\n' (the content is 'aaa').
  The 's' command replaces the end of the line with '\n'.
  The buffer (="pattern space" in sed lingo) becomes "aaa\n".
  sed prints it, and ALSO prints another EOL (as it does for every line).
  The result is one additional empty line in the output.


---

printf "aaa" | sed 's/./b/g'

   Above, the input did not contain an EOL character ('\n').
   sed reads until the end of the input, performs the operation,
   then prints the output ('bbb') without adding a newline.
   (This is not universal for all sed implementations.)

---

printf "aaa\nbbb\nccc\n" | sed -z 's/$/x/'

   Above, 'sed -z' expects a NUL as the EOL character - but there is none
   in the input - so it treats it like the previous example:
   reads the *entire* input, and performs the operation on it.
   The '\n' bytes (ASCII \x0A) have no special meaning in this case:
   sed treats them like any other bytes.
   The output will be: "aaa\nbbb\nccc\nx" .
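
To see the difference between the two modes byte by byte, the output can
be piped through od. A small sketch, assuming GNU sed (for '-z'); 'hex' is
a hypothetical helper name, and the variable names are only for
illustration.

```shell
# Hypothetical helper: print stdin as one continuous hex string.
hex() { od -An -v -tx1 | tr -d ' \n'; }

# Line mode: X (58) lands before every newline (0a).
line_mode=$(printf 'aaa\nbbb\n' | sed 's/$/X/' | hex)

# -z mode with no NUL in the input: the whole input is one "line",
# so X lands once, after the final newline (GNU sed behaviour).
z_mode=$(printf 'aaa\nbbb\n' | sed -z 's/$/X/' | hex)

printf '%s\n%s\n' "$line_mode" "$z_mode"
```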


> On May 12, 2017, at 15:30, Dick Dunbar <address@hidden> wrote:
> 
> 1. I hadn't realized sed had a -z option.  Here's how I used it:
>    find -print0 | sed -ze "s/^/'/" -e "s/\$/'\n/"


I hope that after the explanation above, you see that this example
won't do what you wanted: the sed command  "s/\$/'\n/"
will replace the end of the buffer (="pattern space") with "'\n",
but AFTER sed prints it, it will ALSO print the EOL character,
which is NUL (because of "-z").

To generalize:
If you use 'sed -z': both input EOL and output EOL will be NUL.
If you don't use 'sed -z', both input EOL and output EOL will be '\n'.
You can't easily mix them (i.e. have sed read input EOL as NUL,
but output '\n' EOL).
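
The stray NUL is easy to make visible. In this sketch (GNU sed assumed),
printf stands in for 'find -print0', and 'hex' is a hypothetical helper
name.

```shell
# Hypothetical helper: print stdin as one continuous hex string.
hex() { od -An -v -tx1 | tr -d ' \n'; }

# Adding '\n' inside the s command: every record now ends in
# quote (27), newline (0a) AND the NUL (00) that sed -z prints as EOL.
z_out=$(printf 'a\0b\0' | sed -z "s/\$/'\n/" | hex)
printf '%s\n' "$z_out"

# Leaving the EOL alone in sed and converting it afterwards avoids that.
fixed=$(printf 'a\0b\0' | sed -z -e "s/^/'/" -e "s/\$/'/" | tr '\0' '\n')
printf '%s\n' "$fixed"
```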

The only common tool I'm familiar with that can
use different EOL characters for input and output is awk, using
something like:

  find -print0 | gawk -vRS="\0" -vORS="\n" '{ print "file = " $0 }'

But I wouldn't recommend it.

Instead, I'd recommend the following:

   find [criteria] -print0 | tr -d '\r' \
       | sed -z 's/SOMETHING//' | tr '\0' '\n'

And a complete command:

  $ touch "a'b" 'c$d' 'e f' "$(printf 'g\nh')"
  $ find -type f -print0 \
         | tr -d '\r' \
         | sed -z -e "s/'/'\"'\"'/g" -e "s/^/'/" -e "s/\$/'/" \
         | tr '\0' '\n'
  './e f'
  './a'"'"'b'
  './g
  h'
  './c$d'


I think the above "should work", but I haven't tested it on cygwin.
(Comments from others are very welcome.)

regards,
 - assaf
