bug#20623: XML and HTML files with encoding/charset="utf-8" declaration

From: Simon Ledergerber
Subject: bug#20623: XML and HTML files with encoding/charset="utf-8" declaration lose BOM; Coding system is reset from utf-8-with-signature to utf-8 on save
Date: Thu, 21 May 2015 20:50:58 +0200
User-agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0


When editing XHTML and HTML files, I wanted to make sure the BOM was written to the file so that the browser could detect the UTF-8 encoding more easily. I therefore changed the coding system of the file buffer to utf-8-with-signature-dos (since I am working on a Windows system) before saving the file.

After some time I was surprised to find that the browser (IE11) did not report UTF-8 as the file's encoding. A hex dump of my (X)HTML file showed that the BOM was indeed missing.

Apparently, when a "utf-8" string appears in a <meta charset="utf-8"> element (even a commented-out one, see below) or in an <?xml version="1.0" encoding="utf-8"?> declaration, Emacs switches the file coding system to utf-8 when it saves the file, even if utf-8-with-signature was explicitly specified beforehand. This looks like a bug to me, because it leaves no way to restore the BOM from within Emacs.
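Until this is resolved, the BOM can at least be put back outside of Emacs. The following is a minimal Python sketch of that idea, not something Emacs provides; the function name ensure_utf8_bom is made up for illustration, and the sketch assumes the file is small enough to read into memory.

```python
import codecs


def ensure_utf8_bom(path):
    """Prepend the UTF-8 BOM (EF BB BF) to `path` unless it is already there.

    Hypothetical helper for illustration. Returns True if the file was modified.
    """
    # Read raw bytes so that line endings (CRLF) are left untouched.
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + data)
    return True
```

Going by the behaviour described in this report, saving the file from Emacs again would presumably strip the BOM once more, so this would have to be the last step before the file is published.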

I was not sure whether this is related to bug #8282, so I decided to report it (again).

My Emacs version is: 24.5.1 (x86_64-unknown-cygwin) of 2015-04-10 on Windows 8.1 x64.

I am running Emacs in text mode only, inside a Cygwin console.

This is my .emacs.d/init.el:
(setq-default fill-column 80)
(setq-default buffer-file-coding-system 'utf-8-dos)
(setq-default indent-tabs-mode nil)

With XML, the problem can be reproduced in the most basic way with the following steps:

- Create a new file with C-x C-f in the current directory. Name it test.txt for example.

- Switch to fundamental mode with M-x fundamental-mode.

- Type the text '<?xml version="1.0"' (without the surrounding single quotes).

- Switch the coding system to one that includes the BOM: C-x RET f utf-8-with-signature-dos.

- Verify the current coding system with C-h Shift-c RET: Yes, the coding system for the file buffer is as specified before.

- Type C-x k to kill the help buffer if necessary and save the file with C-x C-s.

- Check the file with a hex editor. Under the Cygwin Bash shell, 'od -Ax -t xCaz test.txt' will also do it: The UTF-8 BOM 'EF BB BF' was written at the beginning of the file.

- Complete the rest of the XML declaration as follows: ' encoding="utf-8"?>'

- Now save the file and check again: The encoding system for the buffer has changed to utf-8-dos and the BOM has disappeared from the file!
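For anyone who wants to repeat the "check the file" steps without a hex editor, here is a minimal Python sketch. The names has_utf8_bom and bom_demo are made up for illustration. The demo does not involve Emacs at all; it only writes two small files itself to show what the check reports in each case.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def bom_demo():
    """Write one file with a BOM and one without, then check both."""
    text = '<?xml version="1.0" encoding="utf-8"?>\r\n'
    with tempfile.TemporaryDirectory() as tmp:
        with_bom = os.path.join(tmp, "with_bom.txt")
        without_bom = os.path.join(tmp, "without_bom.txt")
        # "utf-8-sig" is Python's codec name for UTF-8 with a leading BOM.
        with open(with_bom, "w", encoding="utf-8-sig", newline="") as f:
            f.write(text)
        with open(without_bom, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        return has_utf8_bom(with_bom), has_utf8_bom(without_bom)
```

Calling has_utf8_bom("test.txt") after each save in the steps above would be expected to give True after the first save and False after the second, matching what od shows.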

Now the steps for HTML:

- Create a new file test1.txt in the current directory.

- Fill it with the following simple and yet incomplete HTML5 document:
<!doctype html>

- Change the coding system to utf-8-with-signature-dos and save the file.

- Verify that the coding system for the buffer is correct and the BOM is really written: Yes, it is.

- Insert the following *comment* between <head> and <title>: <!-- <meta charset="utf-8"> -->

- Save the file and verify: The coding system has changed to utf-8-dos and the BOM has vanished, even though the declaration is just a comment and has no effect!
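To illustrate why a commented-out declaration might still be picked up, here is a small Python sketch of a purely pattern-based charset detector. This is not Emacs's code: the regular expressions, the 1024-character limit and the name declared_charset are all assumptions made for the example. It only shows that a text search for the declaration cannot tell a comment from live markup.

```python
import re

# Hypothetical patterns, written for this example only.
_XML_DECL = re.compile(
    r'<\?xml\b[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)', re.IGNORECASE)
_HTML_META = re.compile(
    r'<meta\b[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_charset(text, limit=1024):
    """Return the charset named near the start of `text`, or None.

    Only the first `limit` characters are searched (an arbitrary choice).
    """
    head = text[:limit]
    for pattern in (_XML_DECL, _HTML_META):
        match = pattern.search(head)
        if match:
            return match.group(1).lower()
    return None
```

With this sketch, declared_charset('<!-- <meta charset="utf-8"> -->') yields "utf-8" just as an uncommented element would, while the incomplete declaration '<?xml version="1.0"' yields None. That mirrors the two reproductions above, where the coding system only changes once the declaration text is complete.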



P. S. Information as reported by M-x report-emacs-bug:
In GNU Emacs 24.5.1 (x86_64-unknown-cygwin)
 of 2015-04-10 on desktop-new
Configured using:
 --prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
 --docdir=/usr/share/doc/emacs --htmldir=/usr/share/doc/emacs/html -C
 --with-x=no 'CFLAGS=-ggdb -O2 -pipe -Wimplicit-function-declaration

Important settings:
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix

Major mode: Help

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  buffer-read-only: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Beginning of buffer [3 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
Mark set [2 times]
Mark set [2 times]
Saving file /cygdrive/c/users/.../html_basics/basic.xhtml...
Wrote /cygdrive/c/users/.../html_basics/basic.xhtml
No docstring slot for help-mode-setup
No docstring slot for help-mode-finish

Load-path shadows:
None found.

(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
help-fns mail-prsvr mail-utils misearch multi-isearch mule-diag
help-mode easymenu regexp-opt sgml-mode xterm time-date tooltip electric
uniquify ediff-hook vc-hooks lisp-float-type tabulated-list newcomment
lisp-mode prog-mode register page menu-bar rfn-eshadow timer select
mouse jit-lock font-lock syntax facemenu font-core frame cham georgian
utf-8-lang misc-lang vietnamese tibetan thai tai-viet lao korean
japanese hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese case-table epa-hook jka-cmpr-hook help simple abbrev
minibuffer nadvice loaddefs button faces cus-face macroexp files
text-properties overlay sha1 md5 base64 format env code-pages mule
custom widget hashtable-print-readable backquote make-network-process
dbusbind gfilenotify multi-tty emacs)

Memory information:
((conses 16 81797 4691)
 (symbols 48 17091 0)
 (miscs 40 73 387)
 (strings 32 11233 4887)
 (string-bytes 1 291872)
 (vectors 16 7587)
 (vector-slots 8 342125 27930)
 (floats 8 57 393)
 (intervals 56 834 26)
 (buffers 960 21))
