emacs-bug-tracker
[debbugs-tracker] bug#20623: closed (XML and HTML files with encoding/ch


From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#20623: closed (XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save)
Date: Sat, 11 Aug 2018 09:16:01 +0000

Your message dated Sat, 11 Aug 2018 12:15:31 +0300
with message-id <address@hidden>
and subject line Re: bug#20623: XML and HTML files with 
encoding/charset="utf-8" declaration loose BOM; Coding system is reset from 
utf-8-with-signature to utf-8 on save
has caused the debbugs.gnu.org bug report #20623,
regarding XML and HTML files with encoding/charset="utf-8" declaration lose 
BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
20623: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=20623
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 20:50:58 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0

Hi

When I was editing XHTML and HTML files, I wanted to make sure the BOM was written out to the file in order to make it easier for the browser to detect the UTF-8 encoding. Therefore I changed the coding system for the file buffer to utf-8-with-signature-dos (since I am working on a Windows system) before saving the file.

After some time I was surprised to find that the browser (IE11) did not report UTF-8 as the file's encoding. Checking a hex dump of my (X)HTML file, I saw that the BOM was indeed missing.

Apparently, when the string "utf-8" appears in a <meta charset="utf-8"> element (even a commented-out one; see below) or in an <?xml version="1.0" encoding="utf-8"?> declaration, Emacs switches the file's coding system to utf-8 when it saves the file, even if utf-8-with-signature was specified explicitly beforehand. This looks like a bug to me, because there is then no way to restore the BOM from within Emacs.
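Until the behaviour is fixed, the BOM can be put back outside Emacs. The following is a minimal sketch only, assuming a POSIX-like shell with head, od, tr and mktemp (for example Cygwin Bash); the function name add_utf8_bom is hypothetical and not part of any tool. Note that an affected Emacs would presumably strip the BOM again on the next save.

```shell
#!/bin/sh
# Hypothetical helper: prepend the UTF-8 BOM (EF BB BF) to a file
# unless the file already starts with it.
add_utf8_bom() {
    # Render the first three bytes as lowercase hex without separators.
    if [ "$(head -c 3 "$1" | od -An -t x1 | tr -d ' \n')" != "efbbbf" ]; then
        bom_tmp=$(mktemp) || return 1
        # Octal escapes are used because not every printf understands \xEF.
        { printf '\357\273\277'; cat "$1"; } > "$bom_tmp" && cat "$bom_tmp" > "$1"
        rm -f "$bom_tmp"
    fi
}
```

Calling add_utf8_bom basic.xhtml a second time leaves the file unchanged, because the check at the top sees the BOM that is already there.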

I was not sure whether my bug is related to bug #8282, so I decided to report it (again).

My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on Windows 8.1 x64.

I am running Emacs in text (terminal) mode only, inside a Cygwin console.

This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)

With XML, the problem can be reproduced in the most basic way with the following steps:

- Create a new file with C-x C-f in the current directory. Name it test.txt for example.

- Switch to fundamental mode with M-x fundamental-mode.

- Type the text '<?xml version="1.0"' (without the surrounding single quotes).

- Set the coding system to one that includes the BOM: C-x RET f utf-8-with-signature-dos.

- Verify the current coding system with C-h C RET (describe-coding-system): yes, the coding system for the file buffer is as specified before.

- Type C-x k to kill the help buffer if necessary and save the file with C-x C-s.

- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax -t xCaz test.txt' will also do: the UTF-8 BOM 'EF BB BF' was written at the beginning of the file.

- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'

- Now save the file and check again: the coding system for the buffer has changed to utf-8-dos and the BOM has disappeared from the file!
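The hex-dump check in the steps above can be scripted so that it is easy to repeat after every save. This is a minimal sketch, assuming a POSIX-like shell with head, od and tr (as in Cygwin Bash); the function name has_utf8_bom is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "bom" if the file starts with the
# UTF-8 BOM (EF BB BF), otherwise print "no-bom".
has_utf8_bom() {
    # Read only the first three bytes and render them as lowercase hex.
    first3=$(head -c 3 "$1" | od -An -t x1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        echo "bom"
    else
        echo "no-bom"
    fi
}
```

Running has_utf8_bom test.txt after each C-x C-s shows at which step the signature disappears.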

Now the steps for HTML:

- Create a new file test1.txt in the current directory.

- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
    </body>
</html>

- Change the coding system to utf-8-with-signature-dos and save the file.

- Verify that the coding system for the buffer is correct and the BOM is really written: Yes, it is.

- Insert the following *comment* between <head> and <title>: <!-- <meta charset="utf-8"> -->

- Save the file and verify: the coding system has changed to utf-8-dos and the BOM has vanished, even though the declaration is just a comment and has no effect!
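The HTML case suggests a purely textual match that does not take comments into account. The sketch below only illustrates that kind of match; it is not the code Emacs uses, and the function name declared_charset as well as the 1024-byte window are assumptions made for the example. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper: report the first charset="..." value found near
# the top of a file. Comments are not parsed, so a commented-out <meta>
# element matches just like a live one.
declared_charset() {
    head -c 1024 "$1" \
        | grep -i -o 'charset="[^"]*"' \
        | head -n 1 \
        | sed 's/^[^"]*"//; s/"$//'
}
```

For the test1.txt document above, including the commented-out <meta> line, such a scan still reports a declared charset, which matches the behaviour observed on save.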

Regards

Simon

P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
 of 2015-04-10 on desktop-new
Configured using:
 `configure
 --srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
 --docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
 --with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
 CPPFLAGS= LDFLAGS='

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
 (symbols 48 17091 0)
 (miscs 40 73 387)
 (strings 32 11233 4887)
 (string-bytes 1 291872)
 (vectors 16 7587)
 (vector-slots 8 342125 27930)
 (floats 8 57 393)
 (intervals 56 834 26)
 (buffers 960 21))




--- End Message ---
--- Begin Message ---
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Sat, 11 Aug 2018 12:15:31 +0300

> Date: Wed, 8 Aug 2018 11:47:48 +0200
> From: Vincent Lefevre <address@hidden>
> Cc: Glenn Morris <address@hidden>, Simon Ledergerber <address@hidden>,
>       Eli Zaretskii <address@hidden>, Alain Schneble <address@hidden>,
>       address@hidden
> 
> On 2017-12-04 12:38:57 -0500, Stefan Monnier wrote:
> > > Now reported with "fix this or get removed from the distribution"
> > > severity at <https://bugs.debian.org/883434>.
> > 
> > I'm curious to see if the OP's "grave" severity settings will stick.
> > "Grave" is defined in https://www.debian.org/Bugs/Developer#severities as:
> > 
> >     makes the package in question unusable or mostly so, or causes data
> >     loss, or introduces a security hole allowing access to the accounts
> >     of users who use the package.
> > 
> > The only part that could arguably apply is "causes data loss", but even
> > that is stretching the meaning of those words, I think.
> 
> Actually there's the issue that the coding system (in Emacs sense)
> is changed, but also the fact that this change is invisible to the
> user (mainly because the BOM is usually not visible), which makes
> the issue even worse. Basically, this is invisible data corruption.
> Even though only three bytes are removed, this introduces breakage in
> other applications, and it can take the user a long time to find
> the cause.
> 
> Emacs should not change the coding system when not needed, and when
> it needs to, it must make sure to have a confirmation from the user.

I agree with the last paragraph, so I've now fixed the remaining issue
of this bug (with HTML files) on the emacs-26 branch.

However, I would respectfully request that in the future bug reports
be accurate and fair in the assigned severity, and in particular make
sure that the severity matches the actual behavior as judged
objectively.

In this case, I cannot but express my extreme surprise to see such a
minor issue described as "grave".  The alleged data loss is minor, if
it exists at all (the BOM is not data important for the user, nor data
whose loss cannot be easily repaired).  The unspecified "breakage in
other applications" cannot be considered without the missing details,
but in general I'd be surprised to hear about modern applications
(browsers?) that really need a BOM in UTF-8 encoded HTML files to the
degree that the lack of BOM causes them to "break" in some way; if
they do, it could arguably be a bug in those applications.

Bottom line: artificially and unreasonably increasing the severity
level doesn't help the motivation to fix the bug; if anything, it has
the opposite effect, because the source of the bug report ends up being
dismissed as not serious.  I'm sure we don't want that, certainly not
for bugs reported by Debian.

Thanks.


--- End Message ---
