help-gnu-emacs

Re: Any faster way to find frequency of words?


From: Eric Abrahamsen
Subject: Re: Any faster way to find frequency of words?
Date: Sun, 09 May 2021 07:56:09 -0700
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

Jean Louis <bugs@gnu.support> writes:

> I am interested if there is some better way for Emacs Lisp to find
> frequency of words.
>
> Purpose is to create HTML clickable tag clouds similar to image tag
> clouds. But I will invoke Perl from Emacs to generate it. For that, I
> have to analyze the text first.

Is there any particular improvement you're trying to make?

> (setq text "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam
> lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit. Maecenas congue ligula ac quam
> viverra nec consectetur ante hendrerit..")
>
> (defun text-alphabetic-only (text)
>   "Return alphabetic characters from TEXT."
>   (replace-regexp-in-string "[^[:alpha:]]" " " text))
>
> (defun word-frequency (text &optional length)
>   "Returns word frequency as hash from TEXT."
>   (let* ((hash (make-hash-table :test 'equal))
>        (text (text-alphabetic-only text))
>        (words (split-string text " " t " ")))

I guess I'd suggest working in a buffer and using Emacs' syntax-aware
functions, e.g. `forward-word' together with `buffer-substring'. Then you
can fine-tune the definition of a word through the local syntax table.
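
As a rough sketch of that idea (not tested against large texts), the
function below walks the current buffer with `forward-word' and counts
each word in a hash table. The name `my-word-frequency-in-buffer' and the
MIN-LENGTH argument are made up for illustration, and it uses
`buffer-substring-no-properties' instead of plain `buffer-substring' so
the keys carry no text properties. The default of 3 mirrors the
(> (length word) 2) check in the quoted code.

```elisp
(defun my-word-frequency-in-buffer (&optional min-length)
  "Return a hash table mapping downcased words in the current buffer to counts.
Words shorter than MIN-LENGTH (default 3) are skipped.
Hypothetical helper, for illustration only."
  (let ((counts (make-hash-table :test #'equal))
        (min-length (or min-length 3)))
    (save-excursion
      (goto-char (point-min))
      ;; `forward-word' returns nil once no further word can be found,
      ;; which ends the loop.
      (while (forward-word 1)
        (let* ((end (point))
               ;; Step back over the word just crossed to find its start.
               (start (save-excursion (forward-word -1) (point)))
               (word (downcase
                      (buffer-substring-no-properties start end))))
          (when (>= (length word) min-length)
            ;; The third argument to `gethash' is the default for new keys.
            (puthash word (1+ (gethash word counts 0)) counts)))))
    counts))
```

To run it on a string, insert the string into a temporary buffer first,
for example (with-temp-buffer (insert text) (my-word-frequency-in-buffer)).
What counts as a word then depends on the syntax table of that buffer.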

>     (mapc (lambda (word)
>           (when (> (length word) 2)
>             (let ((word (downcase word)))
>               (if (numberp (gethash word hash))
>                   (puthash word (1+ (gethash word hash)) hash)
>                 (puthash word 1 hash)))))

While hash tables are probably best for very large texts, alists are
nice because you can use place-setting with a default, simplifying the
above to:

(cl-incf (alist-get word frequency-alist 0 nil #'equal))
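
For context, here is a minimal sketch of how that one-liner might sit in
a complete function. The function name `my-word-frequency-alist' and the
final sort step are assumptions added for illustration; `cl-incf' comes
from cl-lib, and the TESTFN argument of `alist-get' needs Emacs 26.1 or
later.

```elisp
(require 'cl-lib)   ; provides `cl-incf'

(defun my-word-frequency-alist (words)
  "Return an alist of (WORD . COUNT) for the list of strings WORDS.
Words of two characters or fewer are skipped, as in the quoted code.
Hypothetical helper, for illustration only."
  (let ((frequency-alist nil))
    (dolist (word words)
      (when (> (length word) 2)
        ;; DEFAULT is 0, so a missing key starts at 0 and becomes 1.
        ;; TESTFN must be `equal' because the keys are strings.
        (cl-incf (alist-get (downcase word) frequency-alist
                            0 nil #'equal))))
    ;; Most frequent words first; `sort' is destructive on lists,
    ;; so use its return value.
    (sort frequency-alist (lambda (a b) (> (cdr a) (cdr b))))))
```

It could be called as
(my-word-frequency-alist (split-string text "[^[:alpha:]]+" t)).
Each lookup scans the alist, so this is quadratic in the number of
distinct words, which is why a hash table is the safer choice for very
large texts.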

Eric


