bug-gnu-emacs

bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM


From: Eli Zaretskii
Subject: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 04 Jul 2022 14:31:01 +0300

> From: Lars Ingebrigtsen <larsi@gnus.org>
> Cc: rgm@gnu.org,  schwab@linux-m68k.org,  48324@debbugs.gnu.org
> Date: Mon, 04 Jul 2022 12:34:29 +0200
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > I see that it's actually 6 bytes _including_ the BOM.  So I think this
> > is confusing: if we are going to return a string with the BOM, we
> > should not count the BOM as part of the LENGTH bytes.  Because if I
> > requested to get characters which fit into N bytes, I should get those
> > N bytes of payload.  Or maybe we should have an optional argument to
> > control whether LENGTH includes or excludes the BOM.
> 
> If the caller has asked for a max number of bytes in a coding system
> that includes a BOM, then the BOM has to be counted -- otherwise the
> bytes won't fit into whatever field the protocol they're using limits
> the string to.

You obviously have a very specific use case in mind.  But there are
others.  Moreover, UTF and BOM is a special case, where the prefix is
known in advance.  Other encodings, notably from the ISO-2022 family,
are harder because the exact shift-in sequence is not always easy to
guess.

Which is why I thought a way to control this aspect could be needed.
But we could just document the subtlety and wait for someone to come
up with a practical scenario where it would be needed.
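
To make the two readings of LENGTH concrete, here is a minimal sketch in
Python rather than Emacs Lisp.  It is not the implementation under
discussion: the function name limit_encoded and the flag count_bom are
hypothetical, and UTF-16LE with a BOM is assumed only because its prefix
is known in advance.

```python
import codecs


def limit_encoded(text: str, max_bytes: int, count_bom: bool = True) -> bytes:
    """Encode the longest prefix of TEXT as UTF-16LE preceded by a BOM.

    Hypothetical helper, for illustration only.
    count_bom=True:  BOM plus payload must fit in max_bytes.
    count_bom=False: only the payload is limited to max_bytes.
    """
    bom = codecs.BOM_UTF16_LE  # the two bytes FF FE
    budget = max_bytes - len(bom) if count_bom else max_bytes
    if budget < 0:
        # Not even the BOM fits, so return nothing at all.
        return b""
    payload = b""
    for ch in text:
        # "utf-16-le" is the BOM-less codec; encode one character at a
        # time so a surrogate pair is never split.
        encoded = ch.encode("utf-16-le")
        if len(payload) + len(encoded) > budget:
            break
        payload += encoded
    return bom + payload


print(len(limit_encoded("abcdef", 6)))                   # BOM counted
print(len(limit_encoded("abcdef", 6, count_bom=False)))  # BOM excluded
```

With a limit of 6, the first call yields 6 bytes in total (the BOM plus
two characters), while the second yields 8 (the BOM plus six bytes of
payload).  A fixed-width protocol field needs the first behavior; a
caller who asked for N bytes of payload expects the second.  This works
only because the prefix has a fixed, known size, which does not hold
for the ISO-2022 family.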

> (And we don't have a -without-signature variant, do we?)

We do: utf-16le and utf-16be.

> > In any case, we should mention this aspect in the doc string, I think.
> 
> Yes.  But should we have -without-signature variants for utf-16?  Then
> the doc string could recommend using that if the caller wants BOM-less
> bytes.

See above.




