From: Pavel Smerk
Subject: [Lynx-dev] bugreport: dumping utf8 html to utf8 text malforms \c5\a0 character before a new line
Date: Tue, 6 Oct 2009 23:52:57 +0200
User-agent: Mutt/1.4.2.2i

        Hello all,

having the following HTML code

<html><head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
</head><body>
&#352;&#352;
</body></html>

in the file in.html and running the following command

lynx -dump -display_charset=utf-8 -assume_charset=utf-8 -nomargins in.html > out.txt

one gets back the following five bytes in the file out.txt

C5 A0 C5 0A 0A

where the second C5 is only the first byte of the correct two-byte UTF-8
sequence C5 A0. Maybe the A0 byte is deleted by some trimming of end-of-line
spaces. That would be rather surprising, though, because A0 on its own is not
a valid UTF-8 character, and in this case both the input and the output are
UTF-8. And, of course, a lone C5 is not a valid UTF-8 character either, which
means that the output is not even a valid UTF-8 file.
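
The following is only a rough sketch for checking the claim, not Lynx's actual code. It confirms that the five reported bytes are not valid UTF-8, and it shows that a byte-wise trim which treats A0 as a space (A0 is the no-break space in ISO-8859-1) would produce exactly this output. The helper names first_invalid_offset and naive_trim are made up for illustration, and the trimming behaviour is just my guess at the cause.

```python
# Bytes reported in out.txt, and the bytes a correct dump would contain.
observed = bytes.fromhex("C5 A0 C5 0A 0A")
correct = "\u0160\u0160\n\n".encode("utf-8")   # C5 A0 C5 A0 0A 0A


def first_invalid_offset(data):
    """Return the offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def naive_trim(line):
    """Hypothetical byte-wise trim that treats 0x20 and 0xA0 as spaces.

    This is an assumption about the cause, not code taken from Lynx.
    """
    return line.rstrip(b"\x20\xa0")


# The text line before trimming: two U+0160 characters, i.e. C5 A0 C5 A0.
line = "\u0160\u0160".encode("utf-8")

# Trim the line, then append the two newlines seen in the dump.
dumped = naive_trim(line) + b"\n\n"

print(first_invalid_offset(correct))    # None: the expected output is valid
print(first_invalid_offset(observed))   # 2: the lone C5 at offset 2 is invalid
print(dumped == observed)               # True: the naive trim reproduces the bug
```

If this guess is right, trimming would have to work on decoded characters rather than on raw bytes whenever the display charset is UTF-8.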

Nevertheless, thank you for the great piece of software. :-)

With regards,

Pavel Smerk



