--- Begin Message ---
Subject: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 20:50:58 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
Hi
When I was editing XHTML and HTML files, I wanted to make sure the BOM
was written out to the file in order to make it easier for the browser
to detect the UTF-8 encoding. Therefore I changed the coding system for
the file buffer to utf-8-with-signature-dos (since I am working on a
Windows System) before saving the file.
After some time I was surprised to find that the browser (IE11) did not
report UTF-8 as the file's encoding. A hexdump of my (X)HTML file showed
that the BOM was indeed missing.
Apparently, when a "utf-8" string appears in a <meta charset="utf-8">
element (even a commented-out one, see below) or in an <?xml version="1.0"
encoding="utf-8"?> declaration, Emacs switches the file coding system to
utf-8 when it saves the file, even if utf-8-with-signature was
specified explicitly beforehand. This looks like a bug to me, because there
is then no longer any way to restore the BOM using Emacs.
I was not sure whether my bug is related to bug #8282, so I decided to
report it (again).
My Emacs version is 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10, on
Windows 8.1 x64.
I run Emacs in text mode only, inside a Cygwin console.
This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)
With XML, the problem can be reproduced in its most basic form with the
following steps:
- Create a new file with C-x C-f in the current directory. Name it
test.txt for example.
- Switch to fundamental mode with M-x fundamental-mode.
- Type the text '<?xml version="1.0"' (without the surrounding single
quotes).
- Switch the coding system to one that includes the BOM: C-x RET f
utf-8-with-signature-dos.
- Verify the current coding system with C-h Shift-c RET: Yes, the
coding system for the file buffer is as specified before.
- Type C-x k to kill the help buffer if necessary and save the file with
C-x C-s.
- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax
-t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written
at the beginning of the file.
- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'
- Now save the file and check again: The coding system for the buffer
has changed to utf-8-dos and the BOM has disappeared from the file!
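For reference, the hexdump check in the steps above can also be scripted. The following is a minimal Python sketch, not part of the original report; the helper name has_utf8_bom is hypothetical, and the script only looks at the first three bytes of the file:

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM (bytes EF BB BF)."""
    with open(path, "rb") as f:
        # codecs.BOM_UTF8 is the three-byte sequence b'\xef\xbb\xbf'.
        return f.read(3) == codecs.BOM_UTF8
```

Calling has_utf8_bom("test.txt") after each save gives the same information as the od command: with the behaviour described above, it would be expected to report the BOM after the first save and report it missing after the second.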
Now the steps for HTML:
- Create a new file test1.txt in the current directory.
- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
<head>
<title>Test</title>
</head>
<body>
</body>
</html>
- Change the coding system to utf-8-with-signature-dos and save the file.
- Verify that the coding system for the buffer is correct and the BOM is
really written: Yes, it is.
- Insert the following *comment* between <head> and <title>: <!-- <meta
charset="utf-8"> -->
- Save the file and verify: The coding system has changed to utf-8-dos
and the BOM has vanished, even though the declaration is only a comment
and has no effect!
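One possible explanation for the comment case, stated purely as an assumption: if the charset declaration is found by a plain pattern scan over the text rather than by parsing the HTML, a commented-out <meta> element is indistinguishable from a live one. The Python sketch below illustrates that effect with a hypothetical pattern; it is not the code Emacs uses, and the names META_CHARSET and declared_charset are made up for the illustration.

```python
import re

# Hypothetical pattern, NOT taken from Emacs: it looks for a charset
# attribute inside a <meta ...> tag anywhere in the text and knows
# nothing about HTML comments.
META_CHARSET = re.compile(r'<meta\s+[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(html):
    """Return the first charset named in a <meta> tag, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


live = '<head>\n<meta charset="utf-8">\n<title>Test</title>'
commented = '<head>\n<!-- <meta\ncharset="utf-8"> -->\n<title>Test</title>'

# Both calls find the declaration, because the scan does not skip comments.
print(declared_charset(live))
print(declared_charset(commented))
```

A scan of this kind treats both inputs alike, which would be consistent with the behaviour seen in the last step; whether Emacs actually works this way is not established by this report.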
Regards
Simon
P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
of 2015-04-10 on desktop-new
Configured using:
`configure
--srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
--prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
--docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
--with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
-fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
-fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
CPPFLAGS= LDFLAGS='
Important settings:
value of $LANG: en_US.UTF-8
locale-coding-system: utf-8-unix
Major mode: Help
Minor modes in effect:
tooltip-mode: t
electric-indent-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
buffer-read-only: t
column-number-mode: t
line-number-mode: t
transient-mark-mode: t
Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish
Load-path shadows:
None found.
Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)
Memory information:
((conses 16 81797 4691)
(symbols 48 17091 0)
(miscs 40 73 387)
(strings 32 11233 4887)
(string-bytes 1 291872)
(vectors 16 7587)
(vector-slots 8 342125 27930)
(floats 8 57 393)
(intervals 56 834 26)
(buffers 960 21))
--- End Message ---
--- Begin Message ---
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 11 Aug 2018 12:15:31 +0300
> Date: Wed, 8 Aug 2018 11:47:48 +0200
> From: Vincent Lefevre <address@hidden>
> Cc: Glenn Morris <address@hidden>, Simon Ledergerber <address@hidden>,
> Eli Zaretskii <address@hidden>, Alain Schneble <address@hidden>,
> address@hidden
>
> On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > > Now reported with "fix this or get removed from the distribution"
> > > severity at <https://bugs.debian.org/883434>.
> >
> > I'm curious to see if the OP's "grave" severity settings will stick.
> > "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
> >
> > makes the package in question unusable or mostly so, or causes data
> > loss, or introduces a security hole allowing access to the accounts
> > of users who use the package.
> >
> > The only part that could arguably apply is "causes data loss", but even
> > that is stretching the meaning of those words, I think.
>
> Actually there's the issue that the coding system (in Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only two bytes are removed, this introduces breakage in
> other applications, and it can take much time to the user to find
> the cause.
>
> Emacs should not change the coding system when not needed, and when
> it needs to, it must make sure to have a confirmation from the user.
I agree with the last paragraph, so I've now fixed the remaining issue
of this bug (with HTML files) on the emacs-26 branch.
However, I would respectfully request that in the future bug reports
be accurate and fair in the assigned severity, and in particular make
sure that the severity matches the actual behavior as judged
objectively.
In this case, I cannot but express my extreme surprise to see such a
minor issue described as "grave". The alleged data loss is minor, if
it exists at all (the BOM is not data important for the user, nor data
whose loss cannot be easily repaired). The unspecified "breakage in
other applications" cannot be considered without the missing details,
but in general I'd be surprised to hear about modern applications
(browsers?) that really need a BOM in UTF-8 encoded HTML files to the
degree that the lack of BOM causes them to "break" in some way; if
they do, it could arguably be a bug in those applications.
Bottom line: artificially and unreasonably increasing the severity
level doesn't help the motivation to fix the bug, and if anything, has
the opposite effect of ignoring the source of the bug report as not
serious. I'm sure we don't want that, certainly not for bugs reported
by Debian.
Thanks.
--- End Message ---