
Re: [Groff] mom : unicode in .INCLUDE'd files


From: Mike Bianchi
Subject: Re: [Groff] mom : unicode in .INCLUDE'd files
Date: Sun, 23 Jul 2017 08:23:51 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

This library purports to be a way to approach the problem ...

  
https://www.autoitconsulting.com/site/development/utf-8-utf-16-text-encoding-detection-library/
 

        UTF-8 and UTF-16 Text Encoding Detection Library
        by Jonathan Bennett | Aug 23, 2014 | Development |

This post shows how to detect UTF-8 and UTF-16 text and presents a fully
functional C++ and C# library that can be used to help with the detection.

I recently had to upgrade the text file handling feature of AutoIt to better
handle text files where no byte order mark (BOM) was present.  The older
version of code I was using worked fine for UTF-8 files (with or without BOM)
but it wasn't able to detect UTF-16 files without a BOM.  I tried to use the
IsTextUnicode Win32 API function but this seemed extremely unreliable and
wouldn't detect UTF-16 Big-Endian text in my tests.

Note that, especially for UTF-16 detection, there is always an element of
ambiguity.  This post by Raymond shows that however you try to detect an
encoding, there will always be some sequence of bytes that makes your guesses
look stupid.
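
To make the ambiguity concrete, here is a small Python illustration.  It is
not taken from the linked post or from Raymond's; it only shows that the same
bytes can be perfectly valid in two encodings, so validity alone cannot decide
between them.  The sample string is an arbitrary choice.

```python
# Ten bytes of plain ASCII (an arbitrary sample, chosen for its even length).
ascii_bytes = b"plain text"

# The bytes are valid UTF-8 and decode to the text a human would expect.
as_utf8 = ascii_bytes.decode("utf-8")

# The very same bytes are also valid UTF-16 little-endian: each pair of
# ASCII bytes forms one 16-bit code unit, and no such pair can land in the
# surrogate range, so decoding never fails.
as_utf16 = ascii_bytes.decode("utf-16-le")

print(len(as_utf8), "characters when read as UTF-8")
print(len(as_utf16), "characters when read as UTF-16-LE")
```

Both decodes succeed, which is why a detector has to fall back on statistics
(newlines, null bytes) rather than on validity alone.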

Here are the detection methods I'm currently using for the various types of
text file.  The checks are performed in this order:

    BOM
    UTF-8
    UTF-16 (newline)
    UTF-16 (null distribution)
        :
        :
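
For anyone who wants to experiment, the four checks listed above can be
sketched in Python roughly as follows.  This is my own guess at the general
idea, not the library's code: the function names, the thresholds (0.7 and
0.1), and the rule that pure-ASCII data is not reported as UTF-8 are all
assumptions made for illustration.

```python
import codecs

# BOM signatures for the encodings in the list above.  UTF-32 is ignored
# here, so a UTF-32-LE BOM (FF FE 00 00) would be misread as UTF-16-LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def check_bom(data):
    """Check 1: look for a byte order mark at the start of the data."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def check_utf8(data):
    """Check 2: valid UTF-8 containing at least one multi-byte sequence.

    Assumption: pure ASCII is not reported as UTF-8, so that ASCII text
    stored as UTF-16 (which is also valid UTF-8) reaches the later checks.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8" if any(b >= 0x80 for b in data) else None


def check_utf16_newline(data):
    """Check 3: newline code units that appear in only one byte order."""
    le = be = 0
    for i in range(0, len(data) - 1, 2):
        unit = data[i:i + 2]
        if unit == b"\n\x00":
            le += 1
        elif unit == b"\x00\n":
            be += 1
    if le and not be:
        return "utf-16-le"
    if be and not le:
        return "utf-16-be"
    return None


def check_utf16_nulls(data, high=0.7, low=0.1):
    """Check 4: compare null-byte frequency at even and odd offsets.

    The thresholds are arbitrary illustration values.
    """
    even, odd = data[0::2], data[1::2]
    if not even or not odd:
        return None
    even_nulls = even.count(0) / len(even)
    odd_nulls = odd.count(0) / len(odd)
    if odd_nulls >= high and even_nulls <= low:
        return "utf-16-le"   # high (zero) byte comes second
    if even_nulls >= high and odd_nulls <= low:
        return "utf-16-be"   # high (zero) byte comes first
    return None


def detect_encoding(data):
    """Run the checks in the listed order; None means no confident guess."""
    return (check_bom(data) or check_utf8(data)
            or check_utf16_newline(data) or check_utf16_nulls(data))
```

With this sketch, detect_encoding("abc".encode("utf-16-be")) falls through to
the null-distribution check, while plain ASCII input yields None and would
need a default chosen by the caller.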

--
 Mike Bianchi


