
[bug#33600] Using a CDN or some other mirror?


From: Chris Marusich
Subject: [bug#33600] Using a CDN or some other mirror?
Date: Sat, 08 Dec 2018 19:33:17 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)

Hi everyone,

Ludovic Courtès writes:

> Ludovic Courtès writes:
>
> [...] I’m thinking about using a similar setup, but hosting the mirror
> on some Big Corp CDN or similar.  Chris Marusich came up with a setup
> along these lines a while back:
>
>   https://lists.gnu.org/archive/html/guix-devel/2016-03/msg00312.html
>
> Compared to Chris’s setup, given that ‘guix publish’ now provides
> ‘Cache-Control’ headers (that was not the case back then, see
> <https://lists.gnu.org/archive/html/guix-devel/2016-03/msg00360.html>),
> caching in the proxy should Just Work.
>
> I would like us to set up such a mirror for berlin and then have
> ci.guix.info point to that.  The project should be able to pay the
> hosting fees.
>
> Thoughts?

Regarding DNS, it would be nice if we could use an official GNU
subdomain.  If we can't use a GNU subdomain, we should at least make
sure we have some kind of DNS auto-renewal set up so that nobody can
poach our domain names.  And the operators should take appropriate
precautions when sharing any credentials used for managing it all.

Regarding CDNs, I definitely think it's worth a try!  Even Debian is
using CloudFront (cloudfront.debian.net).  In fact, email correspondence
suggests that as of 2013, Amazon may even have been paying for it!

https://lists.debian.org/debian-cloud/2013/05/msg00071.html

I wonder if Amazon would be willing to pay for our CloudFront
distribution if we asked them nicely?

In any case, before deciding to use Amazon CloudFront for ci.guix.info,
it would be prudent to estimate the cost.  CloudFront, like most Amazon
AWS services, is a "pay for what you use" model.  The pricing is here:

https://aws.amazon.com/cloudfront/pricing

To accurately estimate the cost, we need to know how many requests we
expect to receive, and how many bytes we expect to transfer out, during
a single month.  Do we have information like this for berlin today?
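
Once those two numbers are known, the arithmetic is simple.  Here is a rough sketch, assuming a flat price per 10,000 requests and a flat price per gigabyte transferred out.  The function name and the prices in the usage example are placeholders, not real CloudFront rates; the real rates vary by region and volume tier, so they should be read off the pricing page above.

```shell
# estimate_cdn_cost REQUESTS GIGABYTES_OUT PRICE_PER_10K_REQUESTS PRICE_PER_GB
# Hypothetical helper: flat-rate monthly estimate, ignoring regional
# and tiered pricing.  All four inputs must come from the caller.
estimate_cdn_cost () {
    awk -v req="$1" -v gb="$2" -v p_req="$3" -v p_gb="$4" 'BEGIN {
        request_cost  = (req / 10000) * p_req
        transfer_cost = gb * p_gb
        printf "requests: %.2f\ntransfer: %.2f\ntotal: %.2f\n",
               request_cost, transfer_cost, request_cost + transfer_cost
    }'
}
```

For example, "estimate_cdn_cost 1000000 500 0.01 0.1" (made-up prices) prints a request cost of 1.00, a transfer cost of 50.00, and a total of 51.00.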

Although I don't doubt that a CDN will perform better than what we have
now, I think we should measure the performance so that we know the money
spent is actually providing a benefit.  Ideally, we would have data from
before and after the change showing how availability and performance
differ.  Apart from anecdotes, what data do we have for making that
comparison?  For example, the following information could be useful:

  * Network load on the origin server(s)
  * Clients' latency to (the addresses pointed to by) ci.guix.info
  * Clients' throughput while downloading substitutes from ci.guix.info

We don't log or collect client metrics, and that's fine.  It could be
useful to add code to Guix to measure things like this when the user
asks to do so, but perhaps it isn't necessary.  It may be good enough if
people just volunteer to manually gather some information and share it.
For example, you can define a shell function like this:

--8<---------------cut here---------------start------------->8---
measure_get () {
curl -L \
     -o /dev/null \
     -w "url_effective: %{url_effective}\\n\
http_code: %{http_code}\\n\
num_connects: %{num_connects}\\n\
num_redirects: %{num_redirects}\\n\
remote_ip: %{remote_ip}\\n\
remote_port: %{remote_port}\\n\
size_download: %{size_download} B\\n\
speed_download: %{speed_download} B/s\\n\
time_appconnect: %{time_appconnect} s\\n\
time_connect: %{time_connect} s\\n\
time_namelookup: %{time_namelookup} s\\n\
time_pretransfer: %{time_pretransfer} s\\n\
time_redirect: %{time_redirect} s\\n\
time_starttransfer: %{time_starttransfer} s\\n\
time_total: %{time_total} s\\n" \
"$1"
}
--8<---------------cut here---------------end--------------->8---

See "man curl" for the meaning of each metric.

You can then use this function to measure a substitute download.  Here's
an example in which I download a large substitute (linux-libre) from one
of my machines in Seattle:

--8<---------------cut here---------------start------------->8---
$ measure_get https://berlin.guixsd.org/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19 2>/dev/null
url_effective: https://berlin.guixsd.org/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19
http_code: 200
num_connects: 1
num_redirects: 0
remote_ip: 141.80.181.40
remote_port: 443
size_download: 69899433 B
speed_download: 4945831.000 B/s
time_appconnect: 0.885277 s
time_connect: 0.459667 s
time_namelookup: 0.254210 s
time_pretransfer: 0.885478 s
time_redirect: 0.000000 s
time_starttransfer: 1.273994 s
time_total: 14.133584 s
$ 
--8<---------------cut here---------------end--------------->8---

Here, it took 0.459667 - 0.254210 = 0.205457 seconds (about 205 ms) to
establish the TCP connection after the DNS lookup.  The average
throughput was 4945831 bytes per second (about 40 megabits per second,
where 1 megabit = 10^6 bits).  It seems my connection to berlin is
already pretty good!
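
The two calculations above can be sketched as small shell helpers, so that others can apply them to their own curl output.  The function names are made up for illustration; the inputs are the time_namelookup, time_connect and speed_download values that curl reports.

```shell
# tcp_connect_ms TIME_NAMELOOKUP TIME_CONNECT
# Time spent establishing the TCP connection after the DNS lookup,
# in milliseconds.
tcp_connect_ms () {
    awk -v dns="$1" -v conn="$2" 'BEGIN { printf "%.1f\n", (conn - dns) * 1000 }'
}

# to_mbps SPEED_DOWNLOAD
# Convert bytes per second to megabits per second (1 megabit = 10^6 bits).
to_mbps () {
    awk -v bps="$1" 'BEGIN { printf "%.1f\n", bps * 8 / 1000000 }'
}
```

With the numbers from the output above, "tcp_connect_ms 0.254210 0.459667" prints 205.5 and "to_mbps 4945831.000" prints 39.6.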

We can get more information about latency by using a tool like mtr:

--8<---------------cut here---------------start------------->8---
$ sudo mtr -c 10 --report-wide --tcp -P 443 berlin.guixsd.org
Start: 2018-12-08T16:57:40-0800
HOST: localhost.localdomain                        Loss%   Snt   Last   Avg  Best  Wrst StDev
[... I've omitted the intermediate hops because they aren't relevant ...]
 13.|-- 141.80.181.40                                 0.0%    10  205.0 201.9 194.9 212.8   5.6
--8<---------------cut here---------------end--------------->8---

My machine's latency to berlin is about 202 ms, which matches what we
calculated above.

For experimentation, I've set up a CloudFront distribution at
berlin-mirror.marusich.info that uses berlin.guixsd.org as its origin
server.  Let's repeat these steps to measure the performance of the
distribution from my machine's perspective (before I did this, I made
sure the GET would result in a cache hit by downloading the substitute
once before and verifying that the same remote IP address was used):

--8<---------------cut here---------------start------------->8---
$ measure_get https://berlin-mirror.marusich.info/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19 2>/dev/null
url_effective: https://berlin-mirror.marusich.info/nar/gzip/1bq783rbkzv9z9zdhivbvfzhsz2s5yac-linux-libre-4.19
http_code: 200
num_connects: 1
num_redirects: 0
remote_ip: 13.32.254.57
remote_port: 443
size_download: 69899433 B
speed_download: 9821474.000 B/s
time_appconnect: 0.607593 s
time_connect: 0.532417 s
time_namelookup: 0.511086 s
time_pretransfer: 0.608029 s
time_redirect: 0.000000 s
time_starttransfer: 0.663578 s
time_total: 7.117266 s
$ sudo mtr -c 10 --report-wide --tcp -P 443 berlin-mirror.marusich.info
Start: 2018-12-08T17:04:48-0800
HOST: localhost.localdomain                        Loss%   Snt   Last   Avg  Best  Wrst StDev
[... I've omitted the intermediate hops because they aren't relevant ...]
 14.|-- server-52-84-21-199.sea32.r.cloudfront.net    0.0%    10   19.8  20.3  14.3  28.9   4.9
--8<---------------cut here---------------end--------------->8---

Establishing the TCP connection took about 21 ms (0.532417 - 0.511086 =
0.021331 seconds, which matches the mtr output), and the throughput was
9821474 bytes per second (about 79 megabits per second).  (On this
machine, 100 Mbps is the current link speed, according to dmesg output.)
This means that in my case, when using CloudFront, the latency is about
10x lower and the throughput (for a cache hit) is about 2x higher than
when using berlin.guixsd.org directly!
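
Since the throughput figure only holds for a cache hit, it helps to be able to check for one.  Besides comparing remote IP addresses, another way is to look at the X-Cache response header that CloudFront adds (for example "Hit from cloudfront" or "Miss from cloudfront").  The following is only a sketch: the function name is made up, and it assumes the response headers are piped in on standard input.

```shell
# cache_status: read HTTP response headers on stdin and print
# "hit", "miss", or "unknown" based on CloudFront's X-Cache header.
# Header names are matched case-insensitively, since HTTP/2 responses
# use lowercase names.  "RefreshHit from cloudfront" counts as a hit.
cache_status () {
    awk 'BEGIN { status = "unknown" }
         {
             line = tolower($0)
             sub(/\r$/, "", line)
             if (line ~ /^x-cache:/) {
                 if (line ~ /hit from cloudfront/) {
                     status = "hit"
                 } else if (line ~ /miss from cloudfront/) {
                     status = "miss"
                 }
             }
         }
         END { print status }'
}
```

For example, "curl -s -o /dev/null -D - URL | cache_status" performs a GET, dumps the response headers to standard output, and reports whether CloudFront served the response from its cache.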

It would be interesting to see what the performance is for others.
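
A single download is a noisy sample, so anyone sharing numbers may want to repeat the measurement a few times and report a summary.  A minimal sketch, assuming one time_total value (in seconds) per line on standard input; the function name is made up:

```shell
# summarize_times: read one number per line on stdin and print the
# sample count, minimum, average, and maximum.
summarize_times () {
    awk 'NF {
             x = $1 + 0
             if (n == 0 || x < min) min = x
             if (n == 0 || x > max) max = x
             sum += x
             n++
         }
         END {
             if (n == 0) { print "no samples"; exit 1 }
             printf "n=%d min=%.3f avg=%.3f max=%.3f\n", n, min, sum / n, max
         }'
}

# Example usage (performs real downloads, so it is left commented out):
#   for i in 1 2 3 4 5; do
#       curl -L -s -o /dev/null -w '%{time_total}\n' "$url"
#   done | summarize_times
```

Note that when measuring a CDN this way, the first request may be a cache miss, so it can be worth reporting it separately from the rest.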

Ricardo Wurmus writes:

> Large ISPs also provide CDN services.  I already contacted Deutsche
> Telekom so that we can compare their CDN offer with the Amazon CloudFront
> setup that Chris has configured.

That's great!  There are many CDN services out there.  I am unfamiliar
with most of them.  It will be good to see how Deutsche Telekom's
offering compares to CloudFront.

FYI, CloudFront has edge locations in the following parts of the world:

https://aws.amazon.com/cloudfront/features/

Hartmut Goebel writes:

> On 03.12.2018 at 17:12, Ludovic Courtès wrote:
>> Thus, I’m thinking about using a similar setup, but hosting the mirror
>> on some Big Corp CDN or similar.
>
> Isn't this a contradiction: Building a free infrastructure relying on
> servers from some Big Corporation? Let alone the privacy concerns
> raised when delivering data via some Big Corporation.
>
> If delivering "packages" works via static data without requiring any
> additional service, we could ask universities to host Guix, too. IMHO
> this is a much preferred solution since this is a decentralized publish
> infrastructure already in place for many GNU/Linux distributions.

I understand your concern about using a third-party service.  However,
we wouldn't be using a CDN as a "software substitute", which is one of
the primary risks of using a web service today:

https://www.gnu.org/philosophy/who-does-that-server-really-serve.html

Instead, we would be using a CDN as a performance optimization that is
transparent to a Guix user.  You seem unsettled by the idea of
entrusting any part of substitute delivery to a third party, but
concretely what risks do you foresee?

Regarding your suggestion to ask universities to host mirrors (really,
caching proxies), I think it could be a good idea.  As Leo mentioned,
the configuration to set up an NGINX caching proxy of Hydra (or berlin)
is freely available in maintenance.git.  Do you think we could convince
some universities to host caching proxies that just run an NGINX web
server using those configurations?

If we can accomplish that, it may still be helpful.  If there is
interest in going down this path, I can explore some possibilities in
the Seattle area.  If the university-owned caching proxies are easily
discoverable (i.e., we list them on the website), then users might
manually set their substitute URL to point to one that's close by.
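
To illustrate how a user might pick a nearby proxy by hand, here is a small sketch.  It assumes the user has already measured each candidate (for instance with time_connect from curl) and has one "URL SECONDS" pair per line; the function name and the mirror URLs in the usage note are hypothetical.

```shell
# fastest_mirror: read "URL SECONDS" lines on stdin and print the URL
# with the smallest measured time.  Lines with fewer than two fields
# are ignored.
fastest_mirror () {
    awk 'NF >= 2 && (best == "" || $2 + 0 < best_time) {
             best = $1
             best_time = $2 + 0
         }
         END { if (best != "") print best }'
}
```

The chosen URL could then be passed to Guix with the existing --substitute-urls option, for example "guix build --substitute-urls=https://mirror.example.edu hello", where mirror.example.edu stands in for a real proxy.  Substitutes fetched this way still have to be signed by a key the system administrator has authorized.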

Going further, if our DNS provider supports something like "geolocation
routing" for DNS queries, we might even be able to create DNS records
for ci.guix.info that point to those universities' caching proxies.  In
this way, when a user resolves ci.guix.info, they would get the address
of a university-owned caching proxy close by.  This could have the
benefits of requiring less money than a full-fledged CDN like Amazon
CloudFront, and also decentralizing the substitute delivery, while still
remaining transparent to Guix users.  However, it would still require us
to rely on a third-party DNS service.

For example, Amazon Route 53 provides this sort of geolocation routing:

https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-policy.html#routing-policy-geo

I wouldn't be surprised if there are other DNS providers out there who
offer something similar.  However, I also wouldn't be surprised if the
overall performance of CloudFront turns out to be better.

David Thompson writes:

> If AWS CloudFront is the path chosen, it may be worthwhile to follow
> the "infrastructure as code" practice and use CloudFormation to
> provision the CloudFront distribution and any other supporting
> resources. The benefit is that there would be a record of exactly
> *how* the project is using these commercial services and the setup
> could be easily reproduced.  The timing is interesting here because I
> just attended the annual AWS conference on behalf of my employer and
> while I was there I felt inspired to write a Guile API for building
> CloudFormation "stacks".  You can see a small sample of what it does
> here: https://gist.github.com/davexunit/db4b9d3e67902216fbdbc66cd9c6413e

Nice!  That seems useful.  I will have to play with it.  I created my
distributions manually using the AWS Management Console, since it's
relatively easy to do.  I agree it would be better to practice
"infrastructure as code."

On that topic, I've also heard good things about Terraform by HashiCorp,
which is available under the Mozilla Public License 2.0:

https://github.com/hashicorp/terraform

Here is a comparison of Terraform and CloudFormation:

https://www.terraform.io/intro/vs/cloudformation.html

I looked briefly into packaging Terraform for Guix.  It's written in Go.
It seems possible, but I haven't invested enough time yet.

As a final option, since the AWS CLI is already packaged in Guix, we
could just drive CloudFormation or CloudFront directly from the CLI.

Meiyo Peng writes:

> I like the idea of IPFS. We should try it. It would be great if it works
> well.
>
> If at some point we need to set up traditional mirrors like other major
> GNU/Linux distros, I can contact my friends in China to set up mirrors in
> several universities. I was a member of address@hidden, which provides the
> largest FLOSS mirror in China.

IPFS would be neat.  So would GNUnet.  Heck, even a publication
mechanism using good old BitTorrent would be nice.  All of these would
require changes to Guix, I suppose.

A CDN would require no changes to Guix, and that's part of why it's so
appealing.

-- 
Chris
