emacs-devel

Re: debbugs.gnu.org search


From: Maxim Nikulin
Subject: Re: debbugs.gnu.org search
Date: Tue, 31 Aug 2021 18:56:39 +0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0

On 30/08/2021 22:30, Glenn Morris wrote:
Maxim Nikulin wrote:
and links to raw mail messages, e.g. debbugs.gnu.org/db/17/17678.html

At the top of the page, and the bottom, in bold red text, is a link
"Click here to see this page with the latest information and nicer formatting."

Thank you, I had not noticed this link. I believed that links to raw messages were indexed by mistake instead of the regular pages. My expectations were based on what I saw in bug reports on bugs.debian.org. My point is that it is still inconvenient for both humans (an intermediate page) and search engines (their heuristics are not powerful enough to recognize the valuable parts).

For a suggestion for further improvement, see
https://lists.gnu.org/r/help-debbugs/2020-12/msg00026.html

Though this branch of the "mail vs. web UI" discussion about communication with users and contributors is rather off-topic, the link you provided might explain why some users prefer a web UI for interaction: technical details of the communication are not available to crawlers (e.g. forensic ones), although unfortunately they are still fully available to site owners. An alternative variant of your link:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=43073
"#43073 Trim/hide full email headers on debbugs"

For this particular query I expect to get

     *#29645 Feature Request: Locale aware formatting*
     https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29645

It's the second result?!

Glad to see that the recipe works for someone. I am not so lucky. See below for details.

https://debbugs.gnu.org/robots.txt
Disallow: /cgi/

Disallowed due to performance reasons.

The "static" pages are indexed, and contain prominent (though clearly not
prominent enough) links to the "dynamic" pages.

I suspected that indexing was broken intentionally. However, the static pages could still be better formatted, in my opinion (and have a footer with a prominent last-update time, to avoid confusion about recent updates).
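
The effect of the quoted rule can be checked with Python's standard urllib.robotparser. This is only a minimal sketch: the "User-agent: *" line is my assumption (only the Disallow line is quoted above), and "ExampleBot" is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: only "Disallow: /cgi/" is quoted in this thread;
# the "User-agent: *" line is a guess needed to make the rule apply.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def is_crawlable(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if a well-behaved crawler may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "https://debbugs.gnu.org/db/17/17678.html",
        "https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29645",
    ):
        print(url, is_crawlable(url))
```

Under these assumptions the "static" /db/ page is allowed while the "dynamic" /cgi/ page is refused, which matches what the search engines appear to index.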

I have not noticed any special HTTP header sent by bugs.debian.org that may alleviate server load during scanning by search engines. I am unsure whether "cache-control: public, max-age=600" plays a significant role.
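
For reference, the header value mentioned above only says that any cache may keep a response for 600 seconds. A rough sketch of reading it (the function name is mine, and the parser deliberately ignores quoted directive values that contain commas):

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict.

    Simplified: directives without an argument map to None, and commas
    inside quoted arguments are not handled.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') or None
    return directives


if __name__ == "__main__":
    parsed = parse_cache_control("public, max-age=600")
    # max-age is given in seconds, so this is a ten-minute freshness window.
    print(parsed, int(parsed["max-age"]) / 60, "minutes")
```

Ten minutes of cacheability helps only if a shared cache sits in front of the server, so by itself it hardly explains the difference in crawler load.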

There is also a simple search on https://debbugs.gnu.org/
and a complex one (I agree the interface is weird) on
https://debbugs.gnu.org/cgi/search.cgi

I intentionally preserved the following in my previous message.

5. And debbugs.gnu.org search sucks.  Or at least I suck
at trying to find anything using it.

My impression is that "general purpose" search engines are usually able to provide more relevant results due to handling of common typos, synonyms, etc. At least while there are no precise criteria for a filter...

So the problem is not really reproducible, and thus harder to debug. Namely, duckduckgo does not show #29645 for me on the first page at all. Maybe it depends on the region from which the request originates.

There is another issue with "static" debbugs pages when they are indexed by search engines: poor metadata. The HTML TITLE element contains just "GNU bug report logs - #29645". Debbugs does not provide '<meta name="description" content="">' or similar information.
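
To illustrate what a crawler has to work with, here is a small sketch based on the standard html.parser module that pulls out the TITLE, the H1 text and the meta description. SAMPLE_PAGE is a made-up fragment modelled on the title and heading mentioned in this message, not the actual debbugs markup.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect <title>, <h1> text and <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = ""
        self.description = None  # stays None when the meta tag is absent
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1 += data


def extract_metadata(html):
    parser = MetadataParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "h1": parser.h1.strip(),
        "description": parser.description,
    }


# Hypothetical page fragment, for illustration only.
SAMPLE_PAGE = """<html><head><title>GNU bug report logs - #29645</title></head>
<body><h1>GNU bug report logs - #29645 Feature Request: Locale aware formatting</h1>
</body></html>"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_PAGE))
```

With such a page only the H1 carries the bug subject, so an engine that takes its result title from TITLE has nothing informative to show.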

+ duckduckgo (unsure concerning particular underlying engine in my case,
  maybe bing):
  - no #29645 in results
  - title is not informative: "GNU bug report logs - #27544"
  - summary is either enumeration of headers or a part of message
    body. In the former case it is impossible to estimate relevance
    of particular result.

  https://html.duckduckgo.com/html?q=site%3Adebbugs.gnu.org+emacs+locale+number
+ google
  - title often (but not always) is taken from H1 element,
    so it is usually much better
    "GNU bug report logs - #29645 Feature Request: Locale aware ..."
    Unfortunately it is trimmed.
  - summary: often useless raw headers
    "... X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
    ... Request: Locale aware formatting To: bug-gnu-emacs@HIDDEN
    Content-Type: ..."
  https://www.google.com/search?q=site%3Adebbugs.gnu.org+emacs+locale+number
+ yandex
  - #29645 sometimes is present, sometimes it is not
  - title is useless: "GNU bug report logs - #29645"
  - summary: a snippet from report is hardly noticeable since
    bug status is placed earlier
    "Package: emacs; Severity: wishlist; Reported by: Gustaf
    Waldemarson ... A while ago I started looking for some simple way
    of writing numbers correctly formatted to the locale. Specifically,
    I wanted the output to use the locale's..."
    It may be completely useless though:
    "Package: emacs; Reported by: Jan Synacek <jsynacek@HIDDEN>;
    merged with #3219 ... Information forwarded to bug-gnu-emacs@HIDDEN:
    bug#40007; Package emacs. Full text available. Merged 3219 4123 9589
    13675 15555..."
  https://yandex.ru/search/?text=site%3Adebbugs.gnu.org+emacs+locale+number
+ bing
  - no #29645
  - titles are useless: "GNU bug report logs - #5618"
  - summary varies from message body snippets to just
    "information forwarded to bug-gnu-emacs@HIDDEN: bug#3229; Package
    emacs. Full text available"
  https://www.bing.com/search?q=site%3Adebbugs.gnu.org+emacs+locale+number
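
For anyone who wants to repeat the comparison, the query URLs above can be generated as in the following sketch. The endpoints and parameter names are simply copied from the links in this message; the helper name is mine.

```python
from urllib.parse import urlencode

QUERY = "site:debbugs.gnu.org emacs locale number"

# (endpoint, name of the query parameter), as used in the links above.
ENGINES = {
    "duckduckgo": ("https://html.duckduckgo.com/html", "q"),
    "google": ("https://www.google.com/search", "q"),
    "yandex": ("https://yandex.ru/search/", "text"),
    "bing": ("https://www.bing.com/search", "q"),
}


def search_url(engine, query=QUERY):
    """Build the search URL for one of the engines listed in ENGINES."""
    endpoint, param = ENGINES[engine]
    # urlencode() escapes ":" as %3A and turns spaces into "+".
    return endpoint + "?" + urlencode({param: query})


if __name__ == "__main__":
    for name in ENGINES:
        print(search_url(name))
```

Results still vary between requests and, apparently, between regions, so the same URL may not reproduce what I saw.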

My conclusion is that debbugs.gnu.org is not friendly to search engines, so relevant results are not guaranteed, and it is hard to estimate whether a particular item may be useful by looking at its title and summary. It does not matter whether DebBugs, GitLab, or SourceHut is used as the bug tracker if the robots.txt file does not allow indexing of the pages whose descriptions are friendly to users and to search engines.
