od command and unicode
From: Alain Williams
Subject: od command and unicode
Date: Thu, 4 Dec 2014 15:00:43 +0000
User-agent: Mutt/1.5.20 (2009-12-10)

I am increasingly using Unicode multi-byte characters (often UTF-8) in web
pages, etc. There are times when I get something and need to work out what it
is; sometimes things are wrongly encoded, eg in ISO-8859-1 when they should be
UTF-8. It can be hard to look at a string of bytes and work out the Unicode
code point from them.
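To show the arithmetic that has to be done by hand today, here is a minimal Python sketch that decodes a 2-byte UTF-8 sequence into its code point. It is only an illustration: the function name is made up, and it covers the 2-byte case only.

```python
def decode_two_byte_utf8(b0: int, b1: int) -> int:
    """Decode a 2-byte UTF-8 sequence (110xxxxx 10yyyyyy) into a code point."""
    # Lead bytes 0xC0 and 0xC1 would be overlong encodings, so start at 0xC2.
    if not 0xC2 <= b0 <= 0xDF:
        raise ValueError("first byte is not a valid 2-byte UTF-8 lead byte")
    if b1 & 0xC0 != 0x80:
        raise ValueError("second byte is not a UTF-8 continuation byte")
    # Keep the low 5 bits of the lead byte and the low 6 bits of the continuation.
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


# od -c shows bytes in octal, so the input here is octal 302 243.
print(f"U+{decode_two_byte_utf8(0o302, 0o243):04X}")  # U+00A3
```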
Suggestion: the 'od' command should do decoding and print out Unicode code
points.
I propose the '-u' option to do this. This would work in a similar way to '-x'.
** BEWARE ** below is the UTF-8 encoding for a pound sign (GBP), U+00A3.
This might wrap horribly in your mail reader.
Eg:
echo 'They cost £1 each' | od -cu
0000000        T        h        e        y                 c        o        s
               t               302      243        1                 e        a
              54       68       65       79       20       63       6f       73
              74       20       A3                31       20       65       61
0000020        c        h       \n
              63       68       0a
0000023
Notes:
* The UTF-8 encoding for the pound symbol takes 2 bytes, which are displayed
  as 2 octal values. There is no change here to the '-c' output other than
  increased spacing (to a width of 9).
* The Unicode code point representation also fits within 9 places.
* The pound symbol takes 1 place on the Unicode line, although it is 2 bytes
  on the line above. This gives a mismatch; maybe the 'A3' should have a
  marker (eg '.') following it to show this, eg:
      20 A3 . 31
The lower line of each pair gives the Unicode code points in hex, as is
traditional. I have suppressed leading zeros, but they could be put in, eg:
    0054 0068 0065 0079 0020 0063 006f 0073
    0074 0020 00A3 0031 0020 0065 0061
or:
    000054 000068 000065 000079 000020 000063 00006f 000073
    000074 000020 0000A3 000031 000020 000065 000061
Maybe there should be an option to specify how many hex digits are wanted?
How does the output look when a multi-byte character is split between 2 lines
of '-c' output?
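To make the layout questions concrete, here is a rough Python sketch of the proposed display. It is not od code; every name in it (utf8_len, codepoint_cells, byte_cell, dump_rows, the digits parameter, the '??' placeholder for undecodable bytes) is an assumption made for illustration, and it handles UTF-8 only. It builds one cell per input byte: the code point goes under the first byte of a sequence and a '.' marker under each continuation byte. One possible answer to the split question then falls out naturally: the code point is printed on the row where the sequence starts, and the markers carry over to the next row.

```python
def utf8_len(lead: int) -> int:
    """Length of the UTF-8 sequence a lead byte announces, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def codepoint_cells(data: bytes, digits: int = 2, marker: str = ".") -> list:
    """One cell per byte: hex code point, continuation marker, or '??' if undecodable."""
    cells = []
    i = 0
    while i < len(data):
        n = utf8_len(data[i])
        try:
            ch = data[i:i + n].decode("utf-8")  # strict decoding of one sequence
        except UnicodeDecodeError:
            ch = ""
        if len(ch) == 1:
            cells.append(format(ord(ch), "x").zfill(digits))
            cells.extend([marker] * (n - 1))
            i += n
        else:
            cells.append("??")  # invalid or truncated sequence: flag this byte only
            i += 1
    return cells


def byte_cell(b: int) -> str:
    """Simplified '-c' style cell: newline escape, printable ASCII, else 3-digit octal."""
    if b == 0x0A:
        return "\\n"
    if 0x20 <= b < 0x7F:
        return chr(b)
    return f"{b:03o}"


def dump_rows(data: bytes, per_row: int = 16, width: int = 9, digits: int = 2) -> list:
    """Return the byte line and code point line for each row, plus the final offset."""
    cps = codepoint_cells(data, digits)
    lines = []
    for off in range(0, len(data), per_row):
        chunk = data[off:off + per_row]
        lines.append(f"{off:07o}" + "".join(byte_cell(b).rjust(width) for b in chunk))
        lines.append(" " * 7 + "".join(c.rjust(width) for c in cps[off:off + per_row]))
    lines.append(f"{len(data):07o}")
    return lines


print("\n".join(dump_rows("They cost £1 each\n".encode("utf-8"))))
```

Passing digits=4 or digits=6 gives the zero-padded variants discussed above.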
Which Unicode encoding should be used?
* One way would be to look at $LANG, which I set to 'en_GB.utf8' - so use UTF-8.
* This could be overridden with -U or --unicode-encoding options, eg:
      --unicode-encoding=iso-8859-1
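The selection rule could be sketched as below, in Python. This is an assumption-laden illustration rather than a description of how od behaves: pick_encoding is a made-up name, checking LC_ALL and LC_CTYPE before LANG follows the usual POSIX locale precedence rather than anything stated above, and falling back to ASCII when no codeset is given is purely a guess at a sensible default.

```python
import codecs
import os


def pick_encoding(override=None, environ=None) -> str:
    """Return a normalised codec name from an explicit override or the locale variables."""
    env = os.environ if environ is None else environ
    if override:
        name = override  # what --unicode-encoding=... would supply
    else:
        # Locale names look like language[_territory][.codeset][@modifier].
        locale_name = env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or ""
        codeset = locale_name.partition(".")[2].partition("@")[0]
        name = codeset or "ascii"  # assumed fallback, eg for LANG=C
    # codecs.lookup raises LookupError for an unknown encoding name.
    return codecs.lookup(name).name


print(pick_encoding(environ={"LANG": "en_GB.utf8"}))
print(pick_encoding(override="iso-8859-1", environ={"LANG": "en_GB.utf8"}))
```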
It might be nice to add a -C option that simply outputs non-control characters
as they are, ie leaves it up to the terminal driver to interpret them.
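A minimal sketch of what such a '-C' style output might do, in Python. The function name is hypothetical, and treating Unicode general category "Cc" as the definition of a control character, and showing those as octal escapes, are both assumptions.

```python
import unicodedata


def passthrough_non_control(text: str) -> str:
    """Leave non-control characters as they are; show control characters as octal escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc":  # C0/C1 control characters
            out.append(f"\\{ord(ch):03o}")
        else:
            out.append(ch)  # the terminal decides how to render it
    return "".join(out)


print(passthrough_non_control("They cost £1 each\n"))
```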
This would make my life much easier.
See:
http://en.wikipedia.org/wiki/Unicode
Discuss.
--
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT
Lecturer.
+44 (0) 787 668 0256 http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information:
http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>