Re: Architecture to reduce download time when pulling multiple packages

From: Josh Marshall
Subject: Re: Architecture to reduce download time when pulling multiple packages – historic success with magnet URLs, BTIHs, & Aria2c!
Date: Sun, 15 Oct 2023 14:21:59 -0400

So it sounds like my first steps are to re-implement the downloads
using aria2c.  This would affect the minimum base package, no?  Can I
get some buy-in from maintainers that such changes are acceptable?

On Fri, Oct 13, 2023 at 2:06 PM James R. Haigh (+ML.GNU.Guix
subaddress) wrote:
> Hi Josh,
> At Z-0400=2023-10-13Fri12:36:01, Josh Marshall sent:
> > This is to parallelize connections which should never hurt downloading but 
> > can help.  Mirroring would be parallelizing for providing packages, what I 
> > want to implement is to parallelize obtaining packages.  Server side vs 
> > client side.
>         If you are going to do something like this, please use a 
> torrent architecture like BitTorrent or GNUnet – I suggest Aria2c as a very 
> good CLI download backend that can be daemonised and sent instructions over a 
> socket to add, pause, remove downloads, etc., and it supports magnet URLs 
> including the existing nontorrent servers (via ‘as’ parameters, iirc.).
>         I actually implemented this in a local copy of APT Daemon many years 
> ago (circa 2011), but the change was not accepted upstream to Launchpad 
> (because I was not on the bleeding edge; I was too slow to keep up with the 
> upstream development).  My fork got forgotten about, because to get the full 
> benefit the server would have had to have added a BitTorrent Info Hash (BTIH) 
> to the metadata of each package, along with the MD5, SHA-256, etc. that it 
> already did (not a big ask, really).  That said, without the full benefit of 
> having the metadata, it did provide immediate benefit and I used it for many 
> years, not upgrading my Ubuntu 11.04 Natty Narwhal that I was using back then 
> until I really had to.
>         The immediate benefit that it provided was exactly as you described: 
> It allowed parallelisation of nontorrent downloads, be it from the same 
> server or from multiple mirrors.  Iirc., I achieved this by simply passing 
> the download list to Aria2c in daemon mode.  I think I also converted all the 
> HTTP URLs to ‘as’ parameters in magnet links, so that multiple mirrors could 
> be passed using multiple ‘as’ parameters in each magnet link.  Then I simply 
> relied on Aria2c being amazing at parallelising everything that I had given 
> it!  I then also implemented progress updates such that APT Daemon could 
> reflect where Aria2c was up to.
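
A minimal sketch of the client-side conversion described above: each mirror URL becomes one ‘as’ (acceptable source) parameter of a single magnet link. The function name `build_magnet` and the example.org mirrors are hypothetical, and whether a given backend actually fetches from ‘as’ sources is an assumption here (the original recollection was hedged with "iirc").

```python
from urllib.parse import quote, parse_qsl


def build_magnet(name, mirror_urls, btih=None):
    """Build a magnet URI with one 'as' parameter per mirror URL.

    name        -- display name of the package file ('dn' parameter)
    mirror_urls -- HTTP(S) URLs that all serve the same file
    btih        -- optional hex BitTorrent info hash, if the metadata has one
    """
    if not mirror_urls and btih is None:
        raise ValueError("need at least one mirror URL or a BTIH")
    parts = []
    if btih is not None:
        # 'xt' (exact topic) carries the info hash when it is known.
        parts.append("xt=urn:btih:" + btih.lower())
    parts.append("dn=" + quote(name, safe=""))
    for url in mirror_urls:
        # Percent-encode the whole URL so its own '&' and '?' stay intact.
        parts.append("as=" + quote(url, safe=""))
    return "magnet:?" + "&".join(parts)


if __name__ == "__main__":
    link = build_magnet(
        "hello-2.12.tar.gz",
        [
            "https://mirror1.example.org/hello-2.12.tar.gz",
            "https://mirror2.example.org/hello-2.12.tar.gz",
        ],
    )
    print(link)
    # Parsing the query string back recovers the original mirror URLs.
    print(parse_qsl(link.split("?", 1)[1]))
```

Without a BTIH the link only lists plain download sources; passing `btih=` later adds the swarming identifier without changing the rest.
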
>         The way I implemented this using Aria2c and magnet URLs meant that if 
> additional hashes were known, they could be used as well, and so if the 
> server metadata simply added BTIHs, it would allow swarming 
> to occur, which in turn would massively reduce load on the central servers, 
> and allow anyone who wants to be a mirror to be one simply by seeding 
> indefinitely.  A default share ratio of 1.0 means that no user is a burden on 
> the network, unless they deliberately change that.  Users can donate to the 
> running costs of the project simply by increasing their share ratio, which 
> adds another means of contribution that they may find easier than the others.
>         Anyone keen to keep old packages online can simply seed them 
> indefinitely, so this is also really great for archival purposes.  Even if 
> the central project loses interest in the old packages and deletes them, 
> anyone else can keep them up.  The hashes ensure that they have not been 
> tampered with.
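
To illustrate the tamper check mentioned above, here is a small sketch that compares a downloaded file against the SHA-256 from repository metadata. The helper names are hypothetical, and the expected digest is assumed to come from metadata that is itself trusted.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large packages need not fit in RAM."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(stream, expected_hex):
    """Return True only if the stream's SHA-256 equals the published digest."""
    return sha256_of_stream(stream) == expected_hex.strip().lower()


if __name__ == "__main__":
    payload = b"package contents fetched from any mirror or peer"
    published = hashlib.sha256(payload).hexdigest()  # stands in for metadata
    print(matches_metadata(io.BytesIO(payload), published))
    print(matches_metadata(io.BytesIO(payload + b"!"), published))
```

For a real download, open the file with `open(path, "rb")` and pass the file object as `stream`.
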
>         There is also a really cool benefit that occurs, or can occur, on a 
> LAN.  An entire network of computers can all swarm locally with each other, 
> so that each package only needs downloading through the metered last-mile 
> bottleneck from the WAN precisely once – provided that local 
> broadcasting is supported.  I think this requires Avahi, and I seem to 
> remember that Aria2c supports this, but I am not certain.  I don't ever 
> remember getting this bit working but also I did not try hard because it 
> would have required the metadata that I didn't have until after download, so 
> even if I got it working it would not have been directly useful unless the 
> APT repositories that I was using would include the BTIHs.
>         So yeah, loads of great benefits to this architecture, and I 
> highly recommend it: convert all existing URLs to magnet links (can be done 
> client-side as I did; or server-side); optionally add any additional mirrors 
> as additional ‘as’ parameters (again client-side or server-side); add ‘btih’ 
> parameters to the magnet links (the BTIH must be included in the server 
> metadata to get the full benefit of the swarming, but conversion to magnet 
> link format can be done client-side or server-side); then simply pass all 
> this to a really good parallelising backend such as Aria2c; then update any 
> progress data and relay pause, resume, cancel, etc. to the backend.
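
The last two steps of that recipe (hand the downloads to the backend, then relay pause, resume, cancel and progress) could look roughly like this against aria2's JSON-RPC interface. This sketch only builds the request bodies; sending them to a daemon started with `--enable-rpc` is left out, and the Python helper names are hypothetical.

```python
import json


def rpc_request(method, params, request_id, secret=None):
    """Build one JSON-RPC 2.0 request body for an aria2 RPC method."""
    full_params = list(params)
    if secret is not None:
        # aria2 takes the RPC secret as the first parameter, prefixed "token:".
        full_params.insert(0, "token:" + secret)
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "aria2." + method,
        "params": full_params,
    })


def add_download(mirror_urls, request_id, secret=None):
    # All URIs in one addUri call must point at the same file; aria2 then
    # treats them as alternative sources for that single download.
    return rpc_request("addUri", [list(mirror_urls)], request_id, secret)


def pause_download(gid, request_id, secret=None):
    return rpc_request("pause", [gid], request_id, secret)


def resume_download(gid, request_id, secret=None):
    return rpc_request("unpause", [gid], request_id, secret)


def cancel_download(gid, request_id, secret=None):
    return rpc_request("remove", [gid], request_id, secret)


def query_progress(gid, request_id, secret=None):
    # Ask only for the fields needed to draw a progress bar.
    keys = ["status", "completedLength", "totalLength"]
    return rpc_request("tellStatus", [gid, keys], request_id, secret)


def progress_fraction(status):
    """Turn a tellStatus result into a 0.0-1.0 fraction.

    aria2 reports lengths as decimal strings; the total is "0" until known.
    """
    total = int(status["totalLength"])
    done = int(status["completedLength"])
    return done / total if total else 0.0
```

The `gid` is the identifier that aria2 returns from `addUri`, so a front end would store it per package and reuse it for every later pause, resume, cancel or progress call.
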
>         One final note, as I am sure that there are a lot of GNUnet fans on 
> this list, is that I would try Aria2c first to see how well it can work, and 
> then try GNUnet or whatever else once you have a standard to benchmark 
> against.  Both are Free Software, so no concern there.  Aria2c is an 
> all-round download manager CLI that works with or without swarming, i.e. it 
> is just as good at HTTPS as it is BitTorrent, and can do both at the same 
> time.  GNUnet has the advantage of working from SHA-256 iirc., which is 
> generally already included in the metadata of the repositories of various 
> distributions, but I think it lacks a lot of other features and stability and 
> ecosystem of alternative backends, compared to the BitTorrent network.
>         Of course, there is no harm in including other hashes along with 
> BTIH, to allow people to experiment with alternative backends, while always 
> ensuring that what works works well.  Another hash that may be useful to 
> include is the Tiger Tree Hash, which is structurally very similar to BTIH, 
> but stronger, iirc.
>         The first thing that the Guix project can do to signal interest in 
> this architecture is to simply include the BTIH of each package in the 
> repository metadata.  Be it in magnet URL form or not does not matter because 
> the client can later convert that as needed.  The important thing is an 
> authoritative statement in metadata that this version of this package has 
> this BTIH.  Once that metadata is available, the game is on to implement 
> swarming support, be it with Aria2c as a backend (as I recommend at least 
> starting with) or otherwise.
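
For reference, a sketch of how a BTIH for a single package file could be derived on the server side, assuming the BitTorrent v1 scheme: the SHA-1 of the bencoded "info" dictionary. The function names and the default piece length are illustrative choices, and the file is held in memory for brevity.

```python
import hashlib


def bencode(value):
    """Minimal bencoder for ints, byte/text strings, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [
            (key.encode("utf-8") if isinstance(key, str) else key, item)
            for key, item in value.items()
        ]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def btih_single_file(name, data, piece_length=262144):
    """Return the hex v1 info hash for one file given as bytes."""
    # 'pieces' is the concatenation of the SHA-1 of each fixed-size piece.
    pieces = b"".join(
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    )
    info = {
        "length": len(data),
        "name": name,
        "piece length": piece_length,
        "pieces": pieces,
    }
    return hashlib.sha1(bencode(info)).hexdigest()


if __name__ == "__main__":
    print(btih_single_file("hello-2.12.tar.gz", b"example package bytes"))
```

Because the file name and piece length are part of the hashed dictionary, the same file yields a different BTIH if either changes, which is one reason an authoritative value in the repository metadata matters.
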
>         I know that this architecture works well from first-hand experience 
> with APT Daemon written in Python.  The only failure I had with it was lack 
> of upstream support.  So I consider it important to first attain the upstream 
> approval before really investing more time into this.  I seem to remember 
> suggesting this to the Nix project many years ago and didn't get anywhere, 
> and now I don't have the energy to try to improve upstream projects if they 
> reject my ideas, so I'll be interested to see whether you have any success 
> with your attempt to do the same.
>         Good luck! ;-)
> Kind regards,
> James.
> --
> Wealth doesn't bring happiness, but poverty brings sadness.
> Sent from Debian with Claws Mail, using email subaddressing as an alternative 
> to error-prone heuristical spam filtering.
> Postal: James R. Haigh, Middle Farm, Vennington, nr. Westbury, nr. 
> Shrewsbury, Salop, SY5 9RG, Britain
