Re: wc enhancement (character frequency table)
From: Pádraig Brady
Subject: Re: wc enhancement (character frequency table)
Date: Tue, 24 May 2011 12:01:57 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 24/05/11 11:32, Stefan Rueger wrote:
> Dear wc authors and coreutil maintainers,
>
>
> I needed a character breakdown table of all the characters in a file and
> have added corresponding options to wc (-b and -M). I have attached the
> corresponding wc.c source modified from coreutils-8.5. Feel free to put
> this modified code into further versions of wc if you think this is useful
> to others...
>
> Best wishes
>
>
> Stefan Rueger
>
>
>
> Example 1: Character breakdown table of a file (the binary wc)
>
> $ ./wc -b wc | head -30
>
> 93661 110506 wc
> 0: 32458 nul 3796 soh 2454 stx 1858 etx 3220 eot 1597 enq 934 ack 634 bel
> 8: 3208 bs 497 ht 477 lf 1119 vt 666 ff 522 cr 575 so 595 si
> 10: 634 dle 547 dc1 623 dc2 571 dc3 352 dc4 182 nak 172 syn 165 etb
> 18: 324 can 172 em 116 sub 168 esc 328 fs 151 gs 203 rs 149 us
> 20: 1003 sp 156 ! 82 " 518 # 1073 $ 350 % 252 & 174 '
> 28: 129 ( 95 ) 75 * 92 + 234 , 217 - 548 . 298 /
> 30: 266 0 227 1 188 2 112 3 275 4 111 5 119 6 111 7
> 38: 162 8 150 9 342 : 399 ; 170 < 207 = 124 > 143 ?
> 40: 307 @ 372 A 249 B 589 C 606 D 661 E 123 F 240 G
> 48: 169 H 644 I 139 J 105 K 342 L 272 M 242 N 241 O
> 50: 436 P 168 Q 318 R 719 S 358 T 361 U 216 V 214 W
> 58: 234 X 102 Y 81 Z 94 [ 140 \ 198 ] 72 ^ 1895 _
> 60: 148 ` 809 a 262 b 619 c 432 d 1189 e 351 f 429 g
> 68: 508 h 769 i 45 j 123 k 657 l 376 m 787 n 824 o
> 70: 354 p 169 q 827 r 920 s 1544 t 863 u 216 v 149 w
> 78: 183 x 194 y 63 z 62 { 128 | 134 } 90 ~ 89 del
> 80: 1 CC1 0 CC2 0 BPH 0 NBH 0 IND 0 NEL 0 SSA 0 ESA
> 88: 0 CTS 2 CTJ 0 LTS 4 PLF 0 PLB 0 RLF 0 SS2 0 SS3
> c0: 0 À 1 Á 0 Â 2 Ã 0 Ä 3 Å 0 Æ 0 Ç
> c8: 0 È 8 É 0 Ê 4 Ë 0 Ì 37 Í 0 Î 0 Ï
> d0: 34 Ð 0 Ñ 0 Ò 0 Ó 0 Ô 0 Õ 0 Ö 0 ×
> e0: 0 à 3 á 0 â 0 ã 0 ä 0 å 0 æ 0 ç
> 108: 2 Ĉ 4 ĉ 0 Ċ 1 ċ 0 Č 1 č 0 Ď 0 ď
> 120: 0 Ġ 0 ġ 0 Ģ 0 ģ 0 Ĥ 1 ĥ 0 Ħ 0 ħ
> 128: 0 Ĩ 0 ĩ 0 Ī 0 ī 2 Ĭ 0 ĭ 0 Į 0 į
> 138: 0 ĸ 0 Ĺ 0 ĺ 0 Ļ 1 ļ 0 Ľ 0 ľ 28 Ŀ
> 180: 0 ƀ 0 Ɓ 0 Ƃ 0 ƃ 0 Ƅ 6 ƅ 0 Ɔ 0 Ƈ
> 188: 0 ƈ 2 Ɖ 0 Ɗ 3 Ƌ 0 ƌ 2 ƍ 1 Ǝ 1 Ə
> 190: 1 Ɛ 0 Ƒ 0 ƒ 0 Ɠ 0 Ɣ 0 ƕ 0 Ɩ 0 Ɨ
>
> The locale is utf8, so there are many ill-formed characters in the binary
> file wc, and the character breakdown only lists the legal characters.
>
>
> Example 2: Byte breakdown table of the file (doing a character count in
> locale C)
>
> $ LANG=C ./wc -b wc
>
> 110506 110506 wc
> 0: 32458 nul 3796 soh 2454 stx 1858 etx 3220 eot 1597 enq 934 ack 634 bel
> 8: 3208 bs 497 ht 477 lf 1119 vt 666 ff 522 cr 575 so 595 si
> 10: 634 dle 547 dc1 623 dc2 571 dc3 352 dc4 182 nak 172 syn 165 etb
> 18: 324 can 172 em 116 sub 168 esc 328 fs 151 gs 203 rs 149 us
> 20: 1003 sp 156 ! 82 " 518 # 1073 $ 350 % 252 & 174 '
> 28: 129 ( 95 ) 75 * 92 + 234 , 217 - 548 . 298 /
> 30: 266 0 227 1 188 2 112 3 275 4 111 5 119 6 111 7
> 38: 162 8 150 9 342 : 399 ; 170 < 207 = 124 > 143 ?
> 40: 307 @ 372 A 249 B 589 C 606 D 661 E 123 F 240 G
> 48: 169 H 644 I 139 J 105 K 342 L 272 M 242 N 241 O
> 50: 436 P 168 Q 318 R 719 S 358 T 361 U 216 V 214 W
> 58: 234 X 102 Y 81 Z 94 [ 140 \ 198 ] 72 ^ 1895 _
> 60: 148 ` 809 a 262 b 619 c 432 d 1189 e 351 f 429 g
> 68: 508 h 769 i 45 j 123 k 657 l 376 m 787 n 824 o
> 70: 354 p 169 q 827 r 920 s 1544 t 863 u 216 v 149 w
> 78: 183 x 194 y 63 z 62 { 128 | 134 } 90 ~ 89 del
> 80: 253 $80 85 $81 67 $82 442 $83 163 $84 443 $85 85 $86 49 $87
> 88: 103 $88 1126 $89 49 $8a 799 $8b 43 $8c 388 $8d 94 $8e 67 $8f
> 90: 408 $90 250 $91 57 $92 152 $93 86 $94 160 $95 52 $96 69 $97
> 98: 63 $98 34 $99 63 $9a 58 $9b 73 $9c 59 $9d 74 $9e 56 $9f
> a0: 118 $a0 121 $a1 63 $a2 82 $a3 52 $a4 42 $a5 48 $a6 54 $a7
> a8: 57 $a8 79 $a9 37 $aa 30 $ab 100 $ac 57 $ad 71 $ae 38 $af
> b0: 84 $b0 37 $b1 36 $b2 70 $b3 123 $b4 60 $b5 187 $b6 46 $b7
> b8: 97 $b8 48 $b9 60 $ba 86 $bb 102 $bc 54 $bd 49 $be 326 $bf
> c0: 287 $c0 83 $c1 71 $c2 227 $c3 164 $c4 23 $c5 135 $c6 577 $c7
> c8: 137 $c8 121 $c9 71 $ca 60 $cb 86 $cc 82 $cd 39 $ce 44 $cf
> d0: 230 $d0 59 $d1 89 $d2 118 $d3 101 $d4 48 $d5 88 $d6 49 $d7
> d8: 119 $d8 74 $d9 95 $da 97 $db 92 $dc 65 $dd 55 $de 67 $df
> e0: 132 $e0 41 $e1 31 $e2 58 $e3 138 $e4 159 $e5 103 $e6 97 $e7
> e8: 441 $e8 220 $e9 69 $ea 88 $eb 176 $ec 34 $ed 38 $ee 54 $ef
> f0: 146 $f0 73 $f1 109 $f2 65 $f3 109 $f4 37 $f5 99 $f6 70 $f7
> f8: 128 $f8 68 $f9 61 $fa 83 $fb 97 $fc 75 $fd 109 $fe 2072 $ff
>
>
> Example 3: Just the character frequencies in 16 columns
>
> $ ./wc -M16 wc | head -20
>
> 93661 110506 wc
> 0: 32458 3796 2454 1858 3220 1597 934 634 3208 497 477 1119 666 522 575 595
> 10: 634 547 623 571 352 182 172 165 324 172 116 168 328 151 203 149
> 20: 1003 156 82 518 1073 350 252 174 129 95 75 92 234 217 548 298
> 30: 266 227 188 112 275 111 119 111 162 150 342 399 170 207 124 143
> 40: 307 372 249 589 606 661 123 240 169 644 139 105 342 272 242 241
> 50: 436 168 318 719 358 361 216 214 234 102 81 94 140 198 72 1895
> 60: 148 809 262 619 432 1189 351 429 508 769 45 123 657 376 787 824
> 70: 354 169 827 920 1544 863 216 149 183 194 63 62 128 134 90 89
> 80: 1 0 0 0 0 0 0 0 0 2 0 4 0 0 0 0
> c0: 0 1 0 2 0 3 0 0 0 8 0 4 0 37 0 0
> d0: 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> e0: 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 100: 0 0 0 0 0 0 0 0 2 4 0 1 0 1 0 0
> 120: 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0
> 130: 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 28
> 180: 0 0 0 0 0 6 0 0 0 2 0 3 0 2 1 1
> 190: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 1a0: 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0 0
> 1b0: 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
> 1c0: 0 0 0 1 0 78 0 0 0 4 0 0 0 1 0 0
> 1e0: 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0
> 200: 0 0 0 2 0 2 0 0 1 14 0 4 0 0 0 0
> 210: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
> 220: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 230: 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
> 240: 0 0 0 0 0 0 0 0 0 2 0 2 0 0 0 0
> 250: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 260: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 270: 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
>
>
>
> Example 4: The ten most frequent characters in a file
>
> $ ./wc -b1 wc | tail -n +2 | sort -k 2rn | head
>
> 0: 32458 nul
> 1: 3796 soh
> 4: 3220 eot
> 8: 3208 bs
> 2: 2454 stx
> 5f: 1895 _
> 3: 1858 etx
> 5: 1597 enq
> 74: 1544 t
> 65: 1189 e
>
>

This seems like a fairly good fit for wc.
However, I'm not sure I've ever needed that functionality.
Could you describe some use cases?

You can also get a similar frequency analysis of text
using existing tools:

  sed 's/\(.\)/\1\n/g' | sort | uniq -c | sort -rn | column -x
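
For comparison, roughly the same tally as the pipeline above can be
sketched in a few lines of Python. This is only an illustration and
has nothing to do with the proposed wc patch; the function name
char_frequencies is made up for this example, and it counts every
character it is given, newlines included.

```python
from collections import Counter


def char_frequencies(text):
    """Return (character, count) pairs, most frequent first.

    Hypothetical helper for illustration only; it is not part of wc
    or coreutils.
    """
    counts = Counter(text)
    # Sort by descending count, then by character so ties come out
    # in a stable, predictable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Small demonstration on a literal string.
for ch, n in char_frequencies("hello world"):
    print(f"{n:7d} {ch!r}")
```

To run it over a file, one could pass the decoded file contents to
char_frequencies instead of the literal string.
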
cheers,
Pádraig.