emacs-bug-tracker

From: GNU bug Tracking System
Subject: [debbugs-tracker] bug#20623: closed (XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save)
Date: Fri, 15 Dec 2017 09:10:02 +0000

Your message dated Fri, 15 Dec 2017 11:08:50 +0200
with message-id <address@hidden>
and subject line Re: bug#20623: XML and HTML files with 
encoding/charset="utf-8" declaration loose BOM; Coding system is reset 
from utf-8-with-signature to utf-8 on save
has caused the debbugs.gnu.org bug report #20623,
regarding XML and HTML files with encoding/charset="utf-8" declaration lose 
BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
20623: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=20623
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 20:50:58 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0

Hi

When I was editing XHTML and HTML files, I wanted to make sure the BOM was written to the file, to make it easier for the browser to detect the UTF-8 encoding. I therefore changed the coding system of the file buffer to utf-8-with-signature-dos (I am working on a Windows system) before saving the file.

After some time I was surprised to find that the browser (IE11) didn't report UTF-8 as the file's encoding. Checking a hex dump of my (X)HTML file, I saw that the BOM was indeed missing.

Apparently, when a "UTF-8" string appears in a <meta charset="utf-8"> declaration (even if commented out, see below) or in an <?xml version="1.0" encoding="utf-8"?> declaration, Emacs switches the file coding system to utf-8 when it saves the file, even if utf-8-with-signature was specified explicitly before. This looks like a bug to me, because there is then no way to restore the BOM using Emacs.
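
To make the difference between the two coding systems concrete, here is a small sketch in Python (not Emacs Lisp). It assumes that Python's "utf-8-sig" codec is a fair stand-in for Emacs's utf-8-with-signature; the sample text is made up for illustration. It shows that the two encodings differ only in the three leading BOM bytes, so an encoding="utf-8" declaration is consistent with either one.

```python
import codecs

# Sample content with a DOS line ending (illustrative only).
text = '<?xml version="1.0" encoding="utf-8"?>\r\n'

without_bom = text.encode("utf-8")      # stand-in for Emacs utf-8-dos
with_bom = text.encode("utf-8-sig")     # stand-in for utf-8-with-signature-dos

# The signature variant only prepends the three BOM bytes EF BB BF.
assert with_bom == codecs.BOM_UTF8 + without_bom

# Both byte strings decode to the same text once the BOM is accounted for.
assert with_bom.decode("utf-8-sig") == without_bom.decode("utf-8") == text
```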

I was not sure whether my bug is related to bug #8282, so I decided to report it (again).

My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on Windows 8.1 x64.

I am running Emacs in text-mode only inside a Cygwin console.

This is my .emacs.d/init.el:
(line-number-mode)
(column-number-mode)
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)

With XML, the problem can be reproduced in the most basic way with the following steps:

- Create a new file with C-x C-f in the current directory. Name it test.txt for example.

- Switch to fundamental mode with M-x fundamental-mode.

- Type the text '<?xml version="1.0"' (without the surrounding single quotes).

- Switch the coding system to one that includes the BOM: C-x RET f utf-8-with-signature-dos.

- Verify the current coding system with C-h Shift-c RET (that is, C-h C): yes, the coding system for the file buffer is as specified before.

- Type C-x k to kill the help buffer if necessary and save the file with C-x C-s.

- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax -t xCaz test.txt' also works: the UTF-8 BOM 'EF BB BF' was written at the beginning of the file.

- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'

- Now save the file and check again: the coding system for the buffer has changed to utf-8-dos and the BOM has disappeared from the file!
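
The BOM check in the steps above can also be scripted instead of reading a hex dump by eye. This is a minimal sketch, assuming Python 3 is available; has_utf8_bom is a hypothetical helper name and is not part of Emacs or Cygwin.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at PATH starts with the UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        # Read only the first three bytes; the BOM can appear nowhere else.
        return f.read(3) == codecs.BOM_UTF8
```

Calling has_utf8_bom("test.txt") after each save in the steps above would be expected to return True after the first save and False after the second, if the behavior is as described.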

Now the steps for HTML:

- Create a new file test1.txt in the current directory.

- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
    </body>
</html>

- Change the coding system to utf-8-with-signature-dos and save the file.

- Verify that the coding system for the buffer is correct and the BOM is really written: Yes, it is.

- Insert the following *comment* between <head> and <title>: <!-- <meta charset="utf-8"> -->

- Save the file and verify: the coding system has changed to utf-8-dos and the BOM has vanished, even though the declaration is just a comment and has no effect!
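
One plausible explanation for the commented-out case is that the declaration is found by a purely textual scan that knows nothing about HTML comments. The sketch below illustrates that idea in Python; it is not Emacs's actual code, and both the pattern and the name declared_charset are assumptions made up for this example.

```python
import re

# Hypothetical pattern: find charset=... inside any <meta ...> tag.
# It matches on raw text and has no notion of HTML comments.
META_CHARSET = re.compile(
    r"""<meta\b[^>]*\bcharset\s*=\s*["']?([-\w]+)""", re.IGNORECASE
)


def declared_charset(head):
    """Return the lower-cased charset named in HEAD, or None if none is found."""
    m = META_CHARSET.search(head)
    return m.group(1).lower() if m else None


# The declaration is found even though it sits inside a comment.
assert declared_charset('<head><!-- <meta charset="utf-8"> --></head>') == "utf-8"
assert declared_charset("<head><title>Test</title></head>") is None
```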

Regards

Simon

P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
 of 2015-04-10 on desktop-new
Configured using:
 `configure
 --srcdir=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5
 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
 --docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
 --with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/build=/usr/src/debug/emacs-24.5-1
 -fdebug-prefix-map=/home/kbrown/src/cygemacs/emacs-24.5-1.x86_64/src/emacs-24.5=/usr/src/debug/emacs-24.5-1'
 CPPFLAGS= LDFLAGS='

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Auto-saving...done
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
 (symbols 48 17091 0)
 (miscs 40 73 387)
 (strings 32 11233 4887)
 (string-bytes 1 291872)
 (vectors 16 7587)
 (vector-slots 8 342125 27930)
 (floats 8 57 393)
 (intervals 56 834 26)
 (buffers 960 21))




--- End Message ---
--- Begin Message ---
Subject: Re: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration loose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Fri, 15 Dec 2017 11:08:50 +0200

> Date: Sun, 10 Dec 2017 21:17:00 +0200
> From: Eli Zaretskii <address@hidden>
> Cc: address@hidden, address@hidden, address@hidden
> 
> I would like to propose the following alternative patch, which accepts
> utf-8-with-signature and utf-8-hfs as variants of utf-8 for the
> purposes of encoding of XML files.  Comments?  Do we want a similar
> treatment for UTF-16?  (That doesn't seem to be required by the bug
> report, and UTF-16 in XML files is non-standard anyway.  But what
> about HTML?)

No further comments, so I've pushed the change and I'm marking this
bug done.
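
The idea described in the quoted proposal (treating utf-8-with-signature and utf-8-hfs as variants of utf-8) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the pushed patch: the function names, the end-of-line suffix handling, and the fallback branch are all hypothetical.

```python
# Coding systems treated as variants of utf-8, per the quoted proposal.
UTF8_VARIANTS = {"utf-8", "utf-8-with-signature", "utf-8-hfs"}
EOL_SUFFIXES = ("-unix", "-dos", "-mac")


def split_eol(coding):
    """Split a name like 'utf-8-dos' into ('utf-8', '-dos')."""
    for suffix in EOL_SUFFIXES:
        if coding.endswith(suffix):
            return coding[: -len(suffix)], suffix
    return coding, ""


def coding_for_save(buffer_coding, declared):
    """Pick the coding system to save with (hypothetical decision logic)."""
    if declared is None:
        return buffer_coding
    base, eol = split_eol(buffer_coding)
    # If the file declares utf-8 and the buffer already uses a utf-8
    # variant, keep the buffer's choice so the BOM survives.
    if declared.lower() == "utf-8" and base in UTF8_VARIANTS:
        return buffer_coding
    # Otherwise follow the declaration, keeping the end-of-line convention.
    return declared.lower() + eol
```

With this logic, coding_for_save("utf-8-with-signature-dos", "utf-8") returns "utf-8-with-signature-dos", which is the outcome the original report asks for.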


--- End Message ---
