coreutils

Re: wc enhancement (character frequency table)


From: Pádraig Brady
Subject: Re: wc enhancement (character frequency table)
Date: Tue, 24 May 2011 12:01:57 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 24/05/11 11:32, Stefan Rueger wrote:
> Dear wc authors and coreutils maintainers,
> 
> 
> I needed a character breakdown table of all the characters in a file and
> have added corresponding options to wc (-b and -M). I have attached the
> corresponding wc.c source, modified from coreutils-8.5. Feel free to put
> this modified code into future versions of wc if you think it is useful
> to others...
> 
> Best wishes
> 
> 
> Stefan Rueger
> 
> 
> 
> Example 1: Character breakdown table of a file (the binary wc)
> 
> $ ./wc -b wc | head -30
> 
>  93661 110506 wc
>       0: 32458 nul 3796 soh 2454 stx 1858 etx 3220 eot 1597 enq 934 ack  634 bel
>       8:  3208 bs   497 ht   477 lf  1119 vt   666 ff   522 cr  575 so   595 si
>      10:   634 dle  547 dc1  623 dc2  571 dc3  352 dc4  182 nak 172 syn  165 etb
>      18:   324 can  172 em   116 sub  168 esc  328 fs   151 gs  203 rs   149 us
>      20:  1003 sp   156 !     82 "    518 #   1073 $    350 %   252 &    174 '
>      28:   129 (     95 )     75 *     92 +    234 ,    217 -   548 .    298 /
>      30:   266 0    227 1    188 2    112 3    275 4    111 5   119 6    111 7
>      38:   162 8    150 9    342 :    399 ;    170 <    207 =   124 >    143 ?
>      40:   307 @    372 A    249 B    589 C    606 D    661 E   123 F    240 G
>      48:   169 H    644 I    139 J    105 K    342 L    272 M   242 N    241 O
>      50:   436 P    168 Q    318 R    719 S    358 T    361 U   216 V    214 W
>      58:   234 X    102 Y     81 Z     94 [    140 \    198 ]    72 ^   1895 _
>      60:   148 `    809 a    262 b    619 c    432 d   1189 e   351 f    429 g
>      68:   508 h    769 i     45 j    123 k    657 l    376 m   787 n    824 o
>      70:   354 p    169 q    827 r    920 s   1544 t    863 u   216 v    149 w
>      78:   183 x    194 y     63 z     62 {    128 |    134 }    90 ~     89 del
>      80:     1 CC1    0 CC2    0 BPH    0 NBH    0 IND    0 NEL   0 SSA    0 ESA
>      88:     0 CTS    2 CTJ    0 LTS    4 PLF    0 PLB    0 RLF   0 SS2    0 SS3
>      c0:     0 À      1 Á      0 Â      2 Ã      0 Ä      3 Å     0 Æ      0 Ç
>      c8:     0 È      8 É      0 Ê      4 Ë      0 Ì     37 Í     0 Î      0 Ï
>      d0:    34 Ð      0 Ñ      0 Ò      0 Ó      0 Ô      0 Õ     0 Ö      0 ×
>      e0:     0 à      3 á      0 â      0 ã      0 ä      0 å     0 æ      0 ç
>     108:     2 Ĉ      4 ĉ      0 Ċ      1 ċ      0 Č      1 č     0 Ď      0 ď
>     120:     0 Ġ      0 ġ      0 Ģ      0 ģ      0 Ĥ      1 ĥ     0 Ħ      0 ħ
>     128:     0 Ĩ      0 ĩ      0 Ī      0 ī      2 Ĭ      0 ĭ     0 Į      0 į
>     138:     0 ĸ      0 Ĺ      0 ĺ      0 Ļ      1 ļ      0 Ľ     0 ľ     28 Ŀ
>     180:     0 ƀ      0 Ɓ      0 Ƃ      0 ƃ      0 Ƅ      6 ƅ     0 Ɔ      0 Ƈ
>     188:     0 ƈ      2 Ɖ      0 Ɗ      3 Ƌ      0 ƌ      2 ƍ     1 Ǝ      1 Ə
>     190:     1 Ɛ      0 Ƒ      0 ƒ      0 Ɠ      0 Ɣ      0 ƕ     0 Ɩ      0 Ɨ
> 
> The locale is utf8, so there are many ill-formed characters in the binary
> file wc, and the character breakdown only lists the legal characters.
> 
> 
> Example 2: Byte breakdown table of the file (doing a character count in locale C)
> 
> $ LANG=C ./wc -b wc
> 
> 110506 110506 wc
>  0: 32458 nul 3796 soh 2454 stx 1858 etx 3220 eot 1597 enq 934 ack  634 bel
>  8:  3208 bs   497 ht   477 lf  1119 vt   666 ff   522 cr  575 so   595 si 
> 10:   634 dle  547 dc1  623 dc2  571 dc3  352 dc4  182 nak 172 syn  165 etb
> 18:   324 can  172 em   116 sub  168 esc  328 fs   151 gs  203 rs   149 us 
> 20:  1003 sp   156 !     82 "    518 #   1073 $    350 %   252 &    174 '  
> 28:   129 (     95 )     75 *     92 +    234 ,    217 -   548 .    298 /  
> 30:   266 0    227 1    188 2    112 3    275 4    111 5   119 6    111 7  
> 38:   162 8    150 9    342 :    399 ;    170 <    207 =   124 >    143 ?  
> 40:   307 @    372 A    249 B    589 C    606 D    661 E   123 F    240 G  
> 48:   169 H    644 I    139 J    105 K    342 L    272 M   242 N    241 O  
> 50:   436 P    168 Q    318 R    719 S    358 T    361 U   216 V    214 W  
> 58:   234 X    102 Y     81 Z     94 [    140 \    198 ]    72 ^   1895 _  
> 60:   148 `    809 a    262 b    619 c    432 d   1189 e   351 f    429 g  
> 68:   508 h    769 i     45 j    123 k    657 l    376 m   787 n    824 o  
> 70:   354 p    169 q    827 r    920 s   1544 t    863 u   216 v    149 w  
> 78:   183 x    194 y     63 z     62 {    128 |    134 }    90 ~     89 del
> 80:   253 $80   85 $81   67 $82  442 $83  163 $84  443 $85  85 $86   49 $87
> 88:   103 $88 1126 $89   49 $8a  799 $8b   43 $8c  388 $8d  94 $8e   67 $8f
> 90:   408 $90  250 $91   57 $92  152 $93   86 $94  160 $95  52 $96   69 $97
> 98:    63 $98   34 $99   63 $9a   58 $9b   73 $9c   59 $9d  74 $9e   56 $9f
> a0:   118 $a0  121 $a1   63 $a2   82 $a3   52 $a4   42 $a5  48 $a6   54 $a7
> a8:    57 $a8   79 $a9   37 $aa   30 $ab  100 $ac   57 $ad  71 $ae   38 $af
> b0:    84 $b0   37 $b1   36 $b2   70 $b3  123 $b4   60 $b5 187 $b6   46 $b7
> b8:    97 $b8   48 $b9   60 $ba   86 $bb  102 $bc   54 $bd  49 $be  326 $bf
> c0:   287 $c0   83 $c1   71 $c2  227 $c3  164 $c4   23 $c5 135 $c6  577 $c7
> c8:   137 $c8  121 $c9   71 $ca   60 $cb   86 $cc   82 $cd  39 $ce   44 $cf
> d0:   230 $d0   59 $d1   89 $d2  118 $d3  101 $d4   48 $d5  88 $d6   49 $d7
> d8:   119 $d8   74 $d9   95 $da   97 $db   92 $dc   65 $dd  55 $de   67 $df
> e0:   132 $e0   41 $e1   31 $e2   58 $e3  138 $e4  159 $e5 103 $e6   97 $e7
> e8:   441 $e8  220 $e9   69 $ea   88 $eb  176 $ec   34 $ed  38 $ee   54 $ef
> f0:   146 $f0   73 $f1  109 $f2   65 $f3  109 $f4   37 $f5  99 $f6   70 $f7
> f8:   128 $f8   68 $f9   61 $fa   83 $fb   97 $fc   75 $fd 109 $fe 2072 $ff
> 
> 
> Example 3: Just the character frequencies in 16 columns
> 
> $ ./wc  -M16 wc | head -20
> 
>  93661 110506 wc
>       0: 32458 3796 2454 1858 3220 1597 934 634 3208 497 477 1119 666 522 575  595
>      10:   634  547  623  571  352  182 172 165  324 172 116  168 328 151 203  149
>      20:  1003  156   82  518 1073  350 252 174  129  95  75   92 234 217 548  298
>      30:   266  227  188  112  275  111 119 111  162 150 342  399 170 207 124  143
>      40:   307  372  249  589  606  661 123 240  169 644 139  105 342 272 242  241
>      50:   436  168  318  719  358  361 216 214  234 102  81   94 140 198  72 1895
>      60:   148  809  262  619  432 1189 351 429  508 769  45  123 657 376 787  824
>      70:   354  169  827  920 1544  863 216 149  183 194  63   62 128 134  90   89
>      80:     1    0    0    0    0    0   0   0    0   2   0    4   0   0   0    0
>      c0:     0    1    0    2    0    3   0   0    0   8   0    4   0  37   0    0
>      d0:    34    0    0    0    0    0   0   0    0   0   0    0   0   0   0    0
>      e0:     0    3    0    0    0    0   0   0    0   0   0    0   0   0   0    0
>     100:     0    0    0    0    0    0   0   0    2   4   0    1   0   1   0    0
>     120:     0    0    0    0    0    1   0   0    0   0   0    0   2   0   0    0
>     130:     0    0    0    0    0    0   0   0    0   0   0    0   1   0   0   28
>     180:     0    0    0    0    0    6   0   0    0   2   0    3   0   2   1    1
>     190:     1    0    0    0    0    0   0   0    0   0   0    0   0   0   0    0
>     1a0:     0    1    0    0    0    0   0   0    0   0   2    0   0   0   0    0
>     1b0:     0    0    0    0    0    0   0   0    1   0   0    0   0   0   0    0
>     1c0:     0    0    0    1    0   78   0   0    0   4   0    0   0   1   0    0
>     1e0:     0    1    0    2    0    0   0   0    0   0   0    0   0   0   0    0
>     200:     0    0    0    2    0    2   0   0    1  14   0    4   0   0   0    0
>     210:     1    0    0    0    0    0   0   0    0   0   0    0   0   0   0    1
>     220:     0    1    0    0    0    0   0   0    0   0   0    0   0   0   0    0
>     230:     0    0    1    0    0    0   0   0    0   0   0    0   0   0   0    0
>     240:     0    0    0    0    0    0   0   0    0   2   0    2   0   0   0    0
>     250:     0    1    0    0    0    0   0   0    0   0   0    0   0   0   0    0
>     260:     0    1    0    0    0    0   0   0    0   0   0    0   0   0   0    0
>     270:     0    0    1    0    0    0   0   0    0   0   0    0   0   0   0    0
> 
> 
> 
> Example 4: The ten most frequent characters in a file
> 
> $ ./wc -b1 wc | tail -n +2 | sort -k 2rn | head 
> 
>       0: 32458 nul
>       1:  3796 soh
>       4:  3220 eot
>       8:  3208 bs 
>       2:  2454 stx
>      5f:  1895 _  
>       3:  1858 etx
>       5:  1597 enq
>      74:  1544 t  
>      65:  1189 e  
> 
> 

This seems like a fairly good fit for wc; however,
I'm not sure I've ever needed that functionality myself.
Could you describe some use cases?

Also, you can already get a similar frequency analysis of text
from existing tools:

sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -rn | column -x
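A byte-oriented variant, closer to the `LANG=C` table in Example 2, might look like the sketch below. The function name `bytefreq` is hypothetical, and the sketch assumes only standard `od`, `tr`, `grep`, `sort` and `uniq`. Unlike the sed pipeline, it also counts newlines and other control bytes, and it prints each byte as two hex digits rather than as a character.

```shell
# bytefreq FILE: print "count hexbyte" lines, most frequent byte first.
# Hypothetical helper for illustration; not part of coreutils.
bytefreq() {
    od -An -v -tx1 < "$1" |   # dump every byte as two hex digits, no offsets
        tr -s ' ' '\n' |      # put each byte on its own line
        grep . |              # drop the empty lines left by leading blanks
        sort | uniq -c |      # count occurrences of each byte value
        sort -k1,1rn -k2,2    # highest count first, then by byte value
}
```

For the ten most frequent bytes, as in Example 4, one could run `bytefreq wc | head`.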

cheers,
Pádraig.


