emacs-devel

Re: Distribution statistics for ELPA and EMMS


From: Philip Kaludercic
Subject: Re: Distribution statistics for ELPA and EMMS
Date: Tue, 19 Sep 2023 22:06:02 +0000

Akib Azmain Turja <akib@disroot.org> writes:

> Philip Kaludercic <philipk@posteo.net> writes:
>
>> Adam Porter <adam@alphapapa.net> writes:
>>
>>> [I just noticed this message from a few months ago.]
>>>
>>> On 7/16/23 21:25, Richard Stallman wrote:
>>>> [[[ To any NSA and FBI agents reading my email: please consider    ]]]
>>>> [[[ whether defending the US Constitution against all enemies,     ]]]
>>>> [[[ foreign or domestic, requires you to follow Snowden's example. ]]]
>>>> We could have two options for downloading, one which is "for a real
>>>> user" and one which is "for periodic testing".
>>>> The only difference would be that the former increments the user
>>>> download count and the latter does not.
>>>
>>> I like this idea, but it seems like it would be hard to enforce.  It
>>> could even go the other way, i.e. have Emacs send a query string or
>>> header when installing a package manually, which could be logged and
>>> used to filter the download logs later.  But even that might be harder
>>> than it seems, e.g. if I call a command like:
>>>
>>>   emacs --eval "(package-install FOO)"
>>>
>>> ...to non-interactively install a package into a local directory for
>>> testing, how far, and in how many places, would some kind of flag need
>>> to be propagated to end up in the server's logs?
>>
>> There is an inherent unreliability in these kinds of statistics that has
>> to be accepted.  The question is therefore are issues like these
>> significant or would they skew the results.  This has to be considered
>> under a false-positive and a false-negative approach, depending on what
>> we want to measure.
>
> How are these numbers going to be useful?  This can't be a measure of
> "popularity."

Yes, they are at best an indicator.  A malicious person could always
manipulate them, unless considerable effort is put into verifying the
information -- which not only comes at the cost of time but also is
likely to decrease the amount of available information.

> Say, for example, the package "git-commit" is 11th most downloaded
> package on MELPA.  Is it really popular?  Few people install it
> explicitly.  Only one package depends on it, which is Magit, a super
> popular package.  So git-commit is automatically installed as a
> dependency when Magit is installed.

We should be able to solve that problem by adding a query string to the
request, as Adam suggests:

https://elpa.gnu.org/packages/poker-0.2.tar?selected=yes
https://elpa.gnu.org/packages/seq-2.24.tar?selected=no
https://elpa.gnu.org/packages/project-0.10.0.tar?selected=yes&upgrade=yes
etc.

Given this information, you know that the user doesn't object to having
this information used (depending on whether this is an opt-in or
opt-out thing), which version is being fetched, whether it is a
dependency or not, and whether it was an upgrade.
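
To make that concrete, here is a rough sketch of how a log-processing
script on the server side might pull those fields out of a request URL.
This is only an illustration: the function name `classify_download` is
made up, the parameter names `selected` and `upgrade` are taken from the
example URLs above, and I am assuming that tarballs are always named
NAME-VERSION.tar with the version starting with a digit.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumption: tarballs are named NAME-VERSION.tar, with VERSION starting
# with a digit and containing no hyphen.
_TARBALL = re.compile(r"^(?P<name>.+)-(?P<version>[0-9][^/]*)\.tar$")


def classify_download(url):
    """Return a dict describing one download request, or None if the
    URL does not look like a package tarball."""
    parts = urlsplit(url)
    filename = parts.path.rsplit("/", 1)[-1]
    match = _TARBALL.match(filename)
    if match is None:
        return None
    query = parse_qs(parts.query)

    def flag(key):
        # True/False when the parameter is present, None when it is
        # missing (i.e. the client did not tell us).
        values = query.get(key)
        if not values:
            return None
        return values[0] == "yes"

    return {
        "package": match["name"],
        "version": match["version"],
        "selected": flag("selected"),
        "upgrade": flag("upgrade"),
    }
```

A request without any query string would yield None for both flags, so
clients that don't participate can be told apart from dependencies.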

> And also, packages that get more frequent update are downloaded more
> than whose update less frequently.  So its indeed possible for a less
> popular but frequently updated package gets more downloaded than a
> mature well written more popular package.

We can remember upgrade-counts over the last week, year and all time.
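
As a sketch of what I mean (the function name and the exact window
lengths of 7 and 365 days are my own assumptions, nothing that exists
on the server today), given the timestamps of the upgrade requests for
one package:

```python
from datetime import datetime, timedelta


def upgrade_counts(upgrade_times, now):
    """Count upgrade requests in the last week, the last year and
    over all time, relative to NOW."""
    week_ago = now - timedelta(days=7)
    year_ago = now - timedelta(days=365)
    return {
        "week": sum(1 for t in upgrade_times if week_ago < t <= now),
        "year": sum(1 for t in upgrade_times if year_ago < t <= now),
        "all": sum(1 for t in upgrade_times if t <= now),
    }
```

Comparing the weekly figure against the all-time one would show
whether a high count comes from frequent releases or from many users.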

> And also there are straight.el, Elpaca and Quelpa guys who don't use the
> ELPA at all.

Of course, hence "inherent unreliability", though I would be surprised
if the choice of package manager has a strong causal effect on what
packages one uses (setting aside that from-source package managers can
install unreleased packages that are not distributed in any archive).

>>                      If it is all about dopamine-boosting, I think a
>> false-positive approach would be better ;^)
>
> OK...
>
> (while t
>   (package-install 'eat)
>   (package-delete (cadr (assoc 'eat package-alist))))
>
> Soon: Eat is the most popular terminal emulator.  xD

Good point (though just asynchronously spamming the right URL would be
more efficient).  My idea would be to count an IP address only once per
day, regardless of how many requests it sent, and also to use a list of
excluded addresses, such as Tor exit nodes, to filter the statistics.
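
Roughly like this, as a minimal sketch.  The record layout (IP address,
timestamp, package name) and the function name are hypothetical, and I
am reading "once per day" as once per package per day:

```python
from collections import Counter
from datetime import datetime


def daily_unique_downloads(records, excluded_ips):
    """Count downloads per package, counting each IP address at most
    once per package per calendar day and skipping excluded addresses.

    RECORDS is an iterable of (ip, timestamp, package) tuples."""
    seen = set()
    counts = Counter()
    for ip, timestamp, package in records:
        if ip in excluded_ips:
            continue  # e.g. known Tor exit nodes
        key = (ip, timestamp.date(), package)
        if key in seen:
            continue  # this address was already counted today
        seen.add(key)
        counts[package] += 1
    return counts
```

With this, the loop above would add at most one download of Eat per day
from a given address, no matter how often it runs.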

This approach, together with the fact that from-source package
managers wouldn't participate unless they are actively instructed to do
so, is a further argument for a false-negative approach.


