Re: wc enhancement (character frequency table)
From: Stefan Rueger
Subject: Re: wc enhancement (character frequency table)
Date: Tue, 24 May 2011 14:57:51 +0100
User-agent: Mutt/1.5.20 (2009-06-14)
wc -b tells me how often each character appears in a file (a breakdown).
That is a trivial question with any number of everyday applications:
- Does the input file have embedded nul characters in its text?
- Or any other control characters some programme might choke on?
- How many ^L (ff) characters are there in the text (number of pages)?
- Is the del character used in the text file?
- Have my beloved accented characters turned into esc sequences in this
output?
- How many esc characters are there? (Am I seeing VT-100 control sequences
here?)
- What are the line delimiters? lf? cr? cr-lf? Or a mixture of cr-lf and lf?
- Are there irregularities in the input/output of a program?
- Does the number of "<"s match the number of ">"s in the XML output of a
program?
- Is there a matching number of round brackets, curly brackets?
- Is the number of tabs thrice the number of lfs? (Do lines have 4
columns?)
- Does the number of semicolons match the number of equal signs?
There are so many programs with constraints in their input/output...
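
To make the checks above concrete, here is a minimal Python sketch of a frequency table. It is not the proposed wc -b and does not reproduce its output format; it counts bytes rather than characters, and the names byte_frequencies and line_endings are made up for illustration.

```python
from collections import Counter


def byte_frequencies(data: bytes) -> Counter:
    """Count how often each byte value (0..255) occurs in data."""
    return Counter(data)


def line_endings(data: bytes) -> dict:
    """Count cr-lf pairs, lone lf bytes and lone cr bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # lf not preceded by cr
        "cr": data.count(b"\r") - crlf,  # cr not followed by lf
    }


if __name__ == "__main__":
    sample = b"<a>(x)</a>\r\n<b>\ty\t</b>\n\x00"
    freq = byte_frequencies(sample)
    print("nul bytes:", freq[0])
    print("form feeds:", freq[0x0C])
    print("angle brackets match:", freq[ord("<")] == freq[ord(">")])
    print("round brackets match:", freq[ord("(")] == freq[ord(")")])
    print("line endings:", line_endings(sample))
```

Equal counts only show that brackets could be balanced, not that they are properly nested, which is all a frequency table can say.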
- Which language(s) am I likely to encounter in this text file?
- What kind of file might this be?
- Pure ASCII, UTF-8, some odd encoding or binary?
- XML? (lots of < and >), LaTeX? (lots of \), etc.
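
The file-type questions can be sketched the same way. The following is a rough Python illustration, not part of the proposed patch: classify_bytes and markup_hints are hypothetical names, and the rule "any nul byte means binary" is an assumption (it would misclassify UTF-16 text, for instance).

```python
from collections import Counter


def classify_bytes(data: bytes) -> str:
    """Return a coarse guess: 'empty', 'binary', 'ascii', 'utf-8' or 'other'."""
    if not data:
        return "empty"
    freq = Counter(data)
    if freq[0] > 0:
        # Assumption: a nul byte means the file is not plain text.
        return "binary"
    if all(byte < 0x80 for byte in freq):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


def markup_hints(data: bytes) -> dict:
    """Share of bytes that are angle brackets or backslashes."""
    freq = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return {
        "angle_brackets": (freq[ord("<")] + freq[ord(">")]) / total,
        "backslashes": freq[ord("\\")] / total,
    }


if __name__ == "__main__":
    print(classify_bytes(b"plain text\n"))
    print(markup_hints(b"<p>hello</p>"))
```

What share of angle brackets or backslashes counts as "lots" is left open here; any threshold would be a guess.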
Yes, sure, one can write a perl/python/sed script or sh pipeline for
almost any of these questions, but "wc -b" is such a simple concept.
And simple things ought to be simple.
wc -M
results in a "fingerprint" of character frequencies for any file
(it corresponds to -m, which just counts all characters). It is the same as
-b but omits the column with the printed character. In particular, the
one-column output of -M1 is good for automated processing, for
example computing entropy, computing similarity between files (how much
have these possibly binary files changed at character granularity?),
file type guessing...
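
As an illustration of that kind of downstream processing, here is a small Python sketch that builds a fingerprint and computes entropy and a similarity score from it. Assumptions: the fingerprint is a list of 256 byte counts (the proposed -M counts characters, not bytes), cosine similarity is just one possible measure, and all function names are invented for this example.

```python
import math
from collections import Counter


def fingerprint(data: bytes) -> list:
    """One count per byte value 0..255, in order."""
    counts = Counter(data)
    return [counts[value] for value in range(256)]


def entropy_bits(fp: list) -> float:
    """Shannon entropy in bits per byte of the distribution in fp."""
    total = sum(fp)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in fp if c)


def cosine_similarity(fp_a: list, fp_b: list) -> float:
    """Cosine of the angle between two fingerprints (1.0 = same proportions)."""
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm = math.sqrt(sum(a * a for a in fp_a)) * math.sqrt(sum(b * b for b in fp_b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    old = fingerprint(b"hello world\n")
    new = fingerprint(b"hello there world\n")
    print("entropy (bits/byte):", entropy_bits(old))
    print("similarity:", cosine_similarity(old, new))
```

Two files with the same bytes in a different order get a similarity of 1.0, so this measures change in composition only, not change in content.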
I find both variants useful enough to use them regularly, especially for
sanity checks on constrained input.
Cheers,
Stefan
PS: Just a pity that the current wc does not return a count of
"ill-formed" characters (application: is this file well-formed UTF8?).
Would be a trivial addition to wc, albeit one I have not coded.
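
For what it is worth, such a count could be sketched in Python as below. This is not wc code and not something I am claiming the patch does; count_ill_formed_utf8 is a hypothetical name, and what counts as one "ill-formed" unit simply follows the span Python's UTF-8 decoder reports for each error.

```python
def count_ill_formed_utf8(data: bytes) -> int:
    """Count the spans that Python's strict UTF-8 decoder rejects."""
    count = 0
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the input is well-formed
        except UnicodeDecodeError as err:
            count += 1
            # err.end is relative to the slice and always at least 1,
            # so this skips the offending bytes and makes progress.
            pos += err.end
    return count


if __name__ == "__main__":
    print(count_ill_formed_utf8("na\u00efve".encode("utf-8")))
    print(count_ill_formed_utf8(b"abc\xffdef"))
```

Re-slicing after every error is quadratic in the worst case, which is fine for a sanity check but not for large, badly damaged files.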