help-gnu-emacs

Re: How to generate a wordlist for a document

From: Arnaldo Mandel
Subject: Re: How to generate a wordlist for a document
Date: Wed, 17 Aug 2011 10:44:13 -0300

On Tue, Aug 16, 2011 at 10:33 AM, Richard Fieldsend wrote:
Hi Thorsten,
you haven't mentioned which OS you are running, or whether you want to include LaTeX commands.  Assuming that you are only interested in the text of the document I would recommend the following steps:

1) For each of the files in your multi-file document run 'detex' to remove all of the TeX and LaTeX formatting.
2) Compile a single file containing the detex'd versions of the files using cat:

Actually, not needed.  detex follows \input and \include commands.

cat file1 >> completefile

3) You can then make the file one word per line, then sort it and make each term appear just once by doing the following:

grep -o -E '\w+' *sourcefile* | sort | uniq > output

Actually, detex can give a wordlist, so the pipeline reduces to

detex -w mainfile.tex | sort -u

Within emacs, I would use it in dired, keying ! at the mainfile and typing

detex -w * | sort -u

Of course, if one uses this a lot, one can always wrap it into an emacs function or a shell command.
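As a minimal sketch of the shell-command route, the pipeline could be wrapped in a function like the one below. The name texwords is made up for illustration; it is not part of detex or Emacs. The sketch assumes detex is installed and on the PATH.

```shell
# texwords: hypothetical helper name, not part of detex or Emacs.
# Prints the unique words of a LaTeX document, one per line.
texwords() {
    if [ "$#" -ne 1 ]; then
        echo "usage: texwords MAINFILE.tex" >&2
        return 2
    fi
    # detex -w emits one word per line; sort -u sorts and drops duplicates.
    detex -w "$1" | sort -u
}
```

Put in a shell startup file, it would be called as: texwords mainfile.tex > wordlist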

Arnaldo

If you need word frequency information then you can make uniq prepend the number of occurrences by passing it the -c option.
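A rough sketch of the frequency count, assuming the input is already one word per line. The function name wordfreq and the sample words are made up for illustration.

```shell
# wordfreq: hypothetical helper name. Reads one word per line on stdin.
# uniq only collapses adjacent duplicates, so the input is sorted first;
# uniq -c prefixes each distinct word with its count;
# sort -rn then lists the most frequent words first.
wordfreq() {
    sort | uniq -c | sort -rn
}

# Sample input standing in for the one-word-per-line output of the earlier step.
printf '%s\n' the cat the dog the cat | wordfreq
```

The width of the count column depends on the uniq implementation, so any script that reads this output should split on whitespace rather than on fixed columns.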

For the record, this doesn't lowercase anything, so the same word is likely to appear more than once with different capitalization (e.g. "The" and "the").
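One possible way around that, sketched here as an assumption rather than something from the steps above, is to fold everything to lowercase with tr before sorting. The function name lcwords is hypothetical.

```shell
# lcwords: hypothetical helper name. Reads one word per line on stdin.
# tr folds uppercase letters to lowercase, so "The" and "the" become the
# same line; sort -u then sorts and keeps a single copy of each word.
lcwords() {
    tr '[:upper:]' '[:lower:]' | sort -u
}

# Sample input: two words, each in two capitalizations.
printf '%s\n' The the Cat cat | lcwords
```

Note that this also lowercases proper nouns and acronyms, which may or may not be what you want in a wordlist.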

HTH

Richard

----- Original Message -----
To: help-gnu-emacs@gnu.org
Cc:
Sent: Monday, 15 August 2011, 22:20
Subject: How to generate a wordlist for a document

Hi list,
how do I generate a word list for a document in Emacs (in my case a
multi-file LaTeX document)?
(By wordlist I mean a list of all unique words in the document)
Thanks for any hints
Thorsten