Re: "Readability" feature in eww

From: David Engster
Subject: Re: "Readability" feature in eww
Date: Mon, 03 Nov 2014 22:37:36 +0100
User-agent: Gnus/5.13001 (Ma Gnus v0.10) Emacs/24.3.91 (gnu/linux)

Lars Magne Ingebrigtsen writes:
> The `R' command in eww will try to find the parts of the current page
> where most of the text is, and only display that part.  This makes all
> the menus and stuff disappear, and you don't have to page forever to
> find the actual article on newspaper sites.
> This is a heuristic, of course, so it can be tweaked endlessly.  The
> current algorithm just gives most words a positive score, HTML markup a
> negative score, and words inside <a> tags a negative score.  For such a
> simple algorithm, it seems to give pretty good results.
> But tweaking is necessary for it to be ... better.  If anybody has ideas
> for tweaks or better algorithms, please be my guest and have at it.
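
For illustration, the heuristic quoted above could be sketched roughly as
follows. This is Python rather than Emacs Lisp, and it is not the eww
code: the penalty value, the tree structure, and every name here
(MARKUP_PENALTY, TreeBuilder, best_node) are assumptions made up for the
example. Each word counts +1, each word inside an <a> counts -1, each
element costs a fixed penalty, and the subtree with the highest total
wins.

```python
from html.parser import HTMLParser

MARKUP_PENALTY = 2  # assumed cost per element; the post gives no numbers
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.words = 0       # words directly in this node, outside any <a>
        self.link_words = 0  # words directly in this node, inside an <a>
        self.score = 0


class TreeBuilder(HTMLParser):
    """Builds a rough element tree and counts words per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag in VOID_TAGS:
            return  # void elements never contain text
        self.current = node
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag; ignore stray closers.
        target = self.current
        while target is not self.root and target.tag != tag:
            target = target.parent
        if target is self.root:
            return
        # Implicitly close everything between the current node and the target.
        node = self.current
        while node is not target.parent:
            if node.tag == "a":
                self.link_depth -= 1
            node = node.parent
        self.current = target.parent

    def handle_data(self, data):
        count = len(data.split())
        if self.link_depth:
            self.current.link_words += count
        else:
            self.current.words += count


def score(node):
    """Score a subtree: words positive, link words and markup negative."""
    total = node.words - node.link_words - MARKUP_PENALTY
    for child in node.children:
        total += score(child)
    node.score = total
    return total


def best_node(html):
    """Return the node whose subtree has the highest score."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    score(builder.root)
    best, stack = builder.root, [builder.root]
    while stack:
        node = stack.pop()
        if node.score > best.score:
            best = node
        stack.extend(node.children)
    return best
```

Because link-heavy subtrees score negative, a parent that contains both
the navigation and the article ends up below the article container
itself, which is why the maximum tends to land on the article.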

I looked into this a bit years ago when I was working on emacs-w3m's
'shimbun' feature for Gnus. I took a peek at the algorithm used by the
'boilerplate' library[1], but never got around to implementing it. Since
I mostly needed it for reading blogs, I coded a quick solution that
looks at the 'generator' meta-tag and extracts the main content for
CMSes like WordPress, TypePad, or Blogspot/Blogger, which was already
enough for me.
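
A minimal sketch of the 'generator' approach, in Python for brevity. It
is not the shimbun code: the class names in CONTENT_CLASS are guesses at
what default themes of these systems commonly use, and the helper names
(detect_generator, content_class_for) are hypothetical.

```python
from html.parser import HTMLParser

# Assumed mapping from generator name to the CSS class of the element
# holding the main content. Real sites and themes may well differ.
CONTENT_CLASS = {
    "wordpress": "entry-content",
    "blogger": "post-body",
    "typepad": "entry-body",
}


class GeneratorFinder(HTMLParser):
    """Records the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "generator":
            self.generator = (attrs.get("content") or "").strip()


def detect_generator(html):
    """Return the generator string of a page, or None if there is none."""
    finder = GeneratorFinder()
    finder.feed(html)
    finder.close()
    return finder.generator


def content_class_for(html):
    """Return the assumed content class for the page's CMS, or None."""
    generator = (detect_generator(html) or "").lower()
    for name, css_class in CONTENT_CLASS.items():
        if name in generator:
            return css_class
    return None
```

Once the class is known, the extraction step only has to keep the
element carrying it and drop the rest of the page.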

It'd be great if you could make this extraction method flexible, similar
to the 'washing' feature in Gnus, so that users could hook their own
methods for extracting the main content into eww. The user would provide
an extraction function and a corresponding regexp that is matched
against the URL, or optionally also against the page source, to catch
things like the 'generator' meta-tag.
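
The proposed hook could look something like the sketch below, written
in Python only to show the shape of the idea. Nothing here exists in
eww or Gnus; register_extractor, find_extractor, and
extract_main_content are hypothetical names. Each entry pairs an
extraction function with a URL regexp and an optional regexp for the
page source, and the first matching entry wins.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Extractor:
    url_regexp: str
    function: Callable[[str], str]
    source_regexp: Optional[str] = None  # only checked when given


EXTRACTORS: List[Extractor] = []


def register_extractor(url_regexp, function, source_regexp=None):
    """Add a user-supplied extraction function to the registry."""
    EXTRACTORS.append(Extractor(url_regexp, function, source_regexp))


def find_extractor(url, source):
    """Return the first function whose regexps match, or None."""
    for entry in EXTRACTORS:
        if not re.search(entry.url_regexp, url):
            continue
        if entry.source_regexp and not re.search(
                entry.source_regexp, source, re.IGNORECASE):
            continue
        return entry.function
    return None


def extract_main_content(url, source, fallback=lambda text: text):
    """Run the matching extractor, or the fallback if none matches."""
    function = find_extractor(url, source) or fallback
    return function(source)
```

The fallback is where a generic heuristic would go, so that
site-specific functions only take over for pages they recognize.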


[1] http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
