Re: Is there a way to print unicode characters and the actual code?
From: Assaf Gordon
Subject: Re: Is there a way to print unicode characters and the actual code?
Date: Sun, 25 Feb 2018 00:27:57 -0700
User-agent: Mutt/1.5.24 (2015-08-30)
Hello,
On Sat, Feb 24, 2018 at 08:12:01PM -0600, Peng Yu wrote:
> > $ od -An -tx1 -ta -tc <<< 'exámple'
> > 65 78 c3 a1 6d 70 6c 65 0a
> > e x C ! m p l e nl
> > e x 303 241 m p l e \n
Interestingly, FreeBSD's od(1) does support multibyte characters:
$ printf "ex\303\241mple\n" | LC_ALL=en_CA.UTF-8 od -An -tx1c
65 78 c3 a1 6d 70 6c 65 0a
e x á ** m p l e \n
Adding this functionality to coreutils is definitely on my TODO list (the most
recent patch includes a partially working implementation, but it is far from
complete).
> At this moment, I wrote some python code to do this, which prints both
> the decoded code as well as the encoded code in both hex and binary
> numbers in TSV format.
If you don't care about alignment, a simple perl script can do it:
$ printf "ex\303\241mple\n" \
| perl -C -MEncode -lne '$a=unpack("H*",encode("utf8",$_));
$a=~s/(..)/$1 /g;
print $a,"\n",$_'
65 78 c3 a1 6d 70 6c 65
exámple
If you do care about alignment, a slightly longer perl script works:
$ printf "ex\303\241mple\n" \
| perl -C -MEncode -lne '($hex,$txt)=("","");
    foreach $c (split//) {
      $a=unpack("H*",encode("utf8",$c));
      $a=~s/(..)/$1 /g;
      $hex.=$a;
      $l=length($a)/3-1;
      $txt.=$c."  ".("** " x $l);
    } ;
    print $hex,"\n",$txt'
65 78 c3 a1 6d 70 6c 65
e  x  á  ** m  p  l  e
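Since you are already using python, the same aligned layout can be sketched
there as well. This is only a rough equivalent of the perl script above; the
function name aligned_dump is made up for illustration, and it assumes every
character occupies one terminal column (wide CJK characters would drift).

```python
def aligned_dump(text):
    """Return (hex_row, char_row), one 3-column cell per UTF-8 byte."""
    hex_cells = []
    char_cells = []
    for ch in text:
        encoded = ch.encode("utf-8")
        hex_cells.extend(f"{b:02x}" for b in encoded)
        # The first cell holds the character itself; each continuation
        # byte gets a "**" placeholder, as FreeBSD's od does.
        char_cells.append(ch)
        char_cells.extend("**" for _ in encoded[1:])
    hex_row = " ".join(hex_cells)
    char_row = " ".join(cell.ljust(2) for cell in char_cells).rstrip()
    return hex_row, char_row


if __name__ == "__main__":
    for row in aligned_dump("exámple"):
        print(row)
```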
> $ ./dumpunicode0.py <<< á
> á 0xe1 0b11100001 0xa1c3 0b1010000111000011
> \n 0xa 0b1010 0xa 0b1010
In your example code you print one character per line
(which is not exactly what you previously asked about).
If one character per line is fine, the following sed+perl would work:
$ printf "ex\303\241mple\n" \
| sed 's/./&\n/g' \
| perl -lne '$a=unpack("H*");$a=~s/(..)/$1 /g;print $_,"\t",$a'
e 65
x 78
á c3 a1
m 6d
p 70
l 6c
e 65
Or sed+awk:
$ printf "ex\303\241mple\n" \
| sed 's/./&\n/g' \
| LC_ALL=C awk 'BEGIN{for(n=0;n<256;n++)ord[sprintf("%c",n)]=n}
{
  n=split($0,a,"");
  printf "%s\t", $0 ;
  # Index loop: "for (i in a)" does not guarantee byte order.
  for (i=1;i<=n;i++) {
    printf "%x ",ord[a[i]]
  } ;
  printf "\n"
}'
e 65
x 78
á c3 a1
m 6d
p 70
l 6c
e 65
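For comparison, a python sketch of the same one-character-per-line output,
with the code point added as a middle column since that is what your script
prints. The function name per_char_lines and the U+XXXX column format are my
own choices here, not something any existing tool produces.

```python
def per_char_lines(text):
    """Yield 'char<TAB>code point<TAB>UTF-8 bytes in hex' per character."""
    for ch in text:
        hex_bytes = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        yield f"{ch}\tU+{ord(ch):04X}\t{hex_bytes}"


if __name__ == "__main__":
    for row in per_char_lines("exámple"):
        print(row)
```

The bytes are printed in stream order (c3 a1), not as a single integer, so
there is no endianness question when reading the last column.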
And, if you don't care much about regular ASCII values, but want to
easily detect multibyte characters (and octal is acceptable), this
simple command would work:
$ printf "ex\303\241mple\n" \
| sed 's/./&\n/g' | sed -n 'p;l' | sed 's/\$$//' | paste - -
e e
x x
á \303\241
m m
p p
l l
e e
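The octal view is also easy to approximate in python. A minimal sketch,
assuming you only want non-ASCII characters escaped: unlike sed's 'l'
command it leaves control characters and backslashes alone. The name
octal_escape is hypothetical, and str.isascii() needs python 3.7 or later.

```python
def octal_escape(ch):
    """Return ch unchanged if ASCII, else its UTF-8 bytes as octal escapes."""
    if ch.isascii():
        return ch
    return "".join(f"\\{b:03o}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in "exámple":
        print(f"{ch}\t{octal_escape(ch)}")
```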
HTH,
- assaf