bug#26082: sed bug: . (dot) does not match all characters.


From: Assaf Gordon
Subject: bug#26082: sed bug: . (dot) does not match all characters.
Date: Tue, 14 Mar 2017 00:01:47 +0000
User-agent: Mutt/1.5.23 (2014-03-12)

tag 26082 notabug
close 26082
stop

Hello Roger,

On Mon, Mar 13, 2017 at 10:57:32AM +0100, Roger Wolff wrote:
> I spent half an hour preparing the below bugreport and
> while reading the bug submission guidelines the last hint
> asks me not to report this issue.

No doubt frustrating. But I'll say it is very much appreciated
when people take the time to read through the bug-reporting
guidelines and submit a detailed, useful bug report -
thank you for doing that.


> I must say that I find it inconsistent to, in the first
> items you report that GNU sed does not adhere to POSIX
> because POSIX behaviour is "stupid", and then in the last
> "known issue"/"frequently reported bug" you decide to
> adhere to the stupid POSIX defined behaviour.

Generally speaking, the GNU coding standards recommend adhering
to existing standards when possible, extending them when useful, and
violating them if deemed necessary:
https://www.gnu.org/prep/standards/standards.html#Compatibility
https://www.gnu.org/prep/standards/standards.html#Non_002dGNU-Standards

The same goes for GNU sed:
It tries to adhere to POSIX as much as possible, and provides many
useful extensions where they aren't in conflict with POSIX.
The decision to violate POSIX is not made lightly, and so far there are
very few, limited cases where POSIX behaviour was deemed
clearly inferior (the "N" command comes to mind as one such case).
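
For illustration, here is a small sketch of the "N" difference. It is not
taken from the bug report: the sample input and variable names are made up,
and it assumes GNU sed, which switches to the POSIX behaviour when the
POSIXLY_CORRECT environment variable is set.

```shell
# Three input lines: N joins lines 1+2, then finds no next line on line 3.

# GNU sed (default): the leftover line "c" is still printed before exiting.
gnu_out=$(printf 'a\nb\nc\n' | sed 'N;s/\n/+/')
echo "$gnu_out"

# POSIX mode: sed exits without printing the leftover pattern space.
posix_out=$(printf 'a\nb\nc\n' | POSIXLY_CORRECT=1 sed 'N;s/\n/+/')
echo "$posix_out"
```

With an odd number of input lines, the default output keeps the last line,
while the POSIX-mode output silently loses it.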

Matching invalid multibyte sequences is one case where all GNU
programs adhere to the POSIX standard (GNU libc, grep, sed, coreutils, gawk, etc.).

It is not likely that this behaviour will change without much further
discussion.


> Unix is about being able to quickly do powerful things
> with the basic tools that do one thing and do it well.
>
> Sed is now failing in that respect, because this
> supposedly simple task took me WAY too long.

There is one critical missing piece of information
that would have made it 'just work' for you:
In the C/POSIX locale, *any* octet is a valid character,
e.g.:
  LC_ALL=C sed '.....'
does process invalid multibyte sequences.
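
A minimal sketch of the difference follows. The sample input is invented
for illustration, and the second command assumes an en_US.UTF-8 locale is
installed on the system. The byte 0xE9 (octal 351) on its own is not a
valid UTF-8 sequence:

```shell
# C locale: every byte is a character, so '.' matches all three bytes.
c_out=$(printf 'a\351b\n' | LC_ALL=C sed 's/./X/g')
echo "$c_out"

# UTF-8 locale (assumed to be installed): with GNU sed the stray 0xE9 byte
# is not a valid character, so '.' skips it and it passes through unchanged.
printf 'a\351b\n' | LC_ALL=en_US.UTF-8 sed 's/./X/g' | od -c
```

The first command replaces all three bytes; in the second, the od output
shows whether the invalid byte survived between the two replaced characters.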

That 'trick' was not clearly explained in the manual,
but we recently added a new section dealing
specifically with multibyte characters (and invalid sequences as well).

I've created a temporary snapshot of the updated manual (which will appear when the next sed version is released), here:
http://download-mirror.savannah.gnu.org/releases/sed/sed-4.4.10-0580.html#Locale-Considerations
If you get a chance, please review this section,
and let us know if that would've helped in your situation,
and if you think other items might be added.


Two more things:
1. sed-4.4 has recently been released with some bug fixes and
performance improvements. I recommend upgrading to this newer version.

2. The need to sanitize possibly invalid multibyte/UTF-8 input
is clear. In GNU coreutils, we are working on a new program
called 'unorm' which will detect and work around invalid input
(and would also perform other related UTF-8 processing).
It is not yet ready, but comments on the work-in-progress
are very welcome:
https://lists.gnu.org/archive/html/coreutils/2016-09/msg00011.html
https://lists.gnu.org/archive/html/coreutils/2016-10/msg00001.html

regards,
- assaf





