Re: Most used words in current buffer
From: Udyant Wig
Subject: Re: Most used words in current buffer
Date: Wed, 18 Jul 2018 15:06:56 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
On 07/18/2018 12:11 AM, Emanuel Berg wrote:
> Do it!
>
> But if you can let go of the Elisp requirement here are some examples
> how to do it with everyday GNU/Unix tools:
>
> https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file
I went ahead and did it. I obtained many solutions, in fact. Only
today did I check the link above.
First, of the solutions in Emacs Lisp, this one came out as the
quickest:
---
(require 'cl-lib)

(defun buffer-most-used-words-1 (n)
  "Return a list of the N most used words in the current buffer."
  (let ((counts (make-hash-table :test #'equal))
        (words (split-string (buffer-string)))
        sorted-counts)
    ;; Count each word case-insensitively.
    (dolist (word words)
      (let ((key (downcase word)))
        (puthash key (1+ (gethash key counts 0)) counts)))
    ;; Collect (WORD COUNT) pairs, then sort by descending count.
    (cl-loop for word being the hash-keys of counts
             using (hash-values count)
             do (push (list word count) sorted-counts))
    (setq sorted-counts (cl-sort sorted-counts #'> :key #'cl-second))
    ;; Clamp N so that a buffer with fewer than N distinct words
    ;; does not make `cl-subseq' signal an error.
    (mapcar #'cl-first
            (cl-subseq sorted-counts 0 (min n (length sorted-counts))))))
---
Briefly, it obtains a list of the strings in the buffer, hashes them,
puts the words and their counts in a list, sorts it, and lists the first
N words. (I had also written solutions (1) using alists; (2) using the
handy AVL tree library I found among the Emacs Lisp files in the Emacs
distribution; and (3) reading the words directly and hashing them. None
beat the above.)
The function is suffixed with '-1' because it is the core of
another, interactive function, which takes the list generated above and
displays it nicely in another buffer.
I was curious about possible solutions in other languages. I wrote
programs in both Common Lisp and Python, based on the essential hash
table approach. While a lot faster than the Emacs Lisp solution above,
they were left behind by this old Awk solution (also using hashing) I
found in the classic /The Unix Programming Environment/ by Kernighan and
Pike:
---
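#!/bin/sh
# Count every whitespace-separated word in the input files (or stdin),
# then print the ten most frequent words, one per line.
# `sort -k2,2nr' is the POSIX spelling of the older `sort +1 -nr'.
awk '    { for (i = 1; i <= NF; i++) num[$i]++ }
END      { for (word in num) print word, num[word] }
' "$@" | sort -k2,2nr | head -10 | awk '{ print $1 }'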
---
I appended the final awk stage so that only the words are printed,
without the counts. I wrapped it up in an Emacs command to display the
words in another buffer, just like my original Emacs Lisp solution above.
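For comparison, the hash-table approach shared by the solutions above
can be sketched in a few lines of Python. This is only a minimal
illustration under stated assumptions, not the Python program mentioned
earlier: the function name most_used_words and its default of 10 are
hypothetical, and words are taken to be whitespace-separated and
compared case-insensitively, as in the Emacs Lisp version.

```python
from collections import Counter


def most_used_words(text, n=10):
    """Return the n most frequent whitespace-separated words in text.

    Hypothetical helper for illustration; not taken from the thread.
    """
    # Lowercase each word so that "The" and "the" share one count,
    # mirroring the `downcase' call in the Emacs Lisp version.
    counts = Counter(word.lower() for word in text.split())
    # most_common sorts by descending count and simply returns fewer
    # than n entries when there are fewer than n distinct words.
    return [word for word, _ in counts.most_common(n)]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    print(most_used_words(sample, 2))
```

Counter does the counting that the explicit gethash/puthash loop does
in Emacs Lisp and that the num[$i]++ array does in awk.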
Udyant Wig
--
We make our discoveries through our mistakes: we watch one another's
success: and where there is freedom to experiment there is hope to
improve.
-- Arthur Quiller-Couch