bug-wget
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] no_proxy domain matching


From: Tim Rühsen
Subject: Re: [PATCH] no_proxy domain matching
Date: Wed, 20 Nov 2019 18:47:03 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2

On 20.11.19 12:41, Tomas Hozza wrote:
> On 7. 11. 2019 21:30, Tim Rühsen wrote:
>> On 07.11.19 15:21, Tomas Hozza wrote:
>>> Hi.
>>>
>>> In RHEL-8, we ship a wget version that suffers from bug fixed by [1]. The 
>>> fix resolved issue with matching subdomains when no_proxy domain definition 
>>> was prefixed with dot, e.q. "no_prefix=.mit.edu". As part of backporting 
>>> the fix to RHEL, I wanted to create an upstream test for no_prefix 
>>> functionality. However I found that there is still one corner case, which 
>>> is not handled by the current upstream code and honestly I'm not sure what 
>>> is the intended domain matching behavior in that case. Man page is also not 
>>> very specific in this regard.
>>>
>>> The corner case is as follows:
>>> - no_proxy=.mit.edu
>>> - download URL is e.g. "http://mit.edu/file1";
>>>
>>> In this case the proxy settings are used, because domains don't match due 
>>> to the leftmost dot in no_proxy domain definition. This is either intended 
>>> or corner case that was not considered. One could argue, that if the 
>>> no_proxy is set to ".mit.edu", then leftmost dot means that the proxy 
>>> settings should not apply only to subdomains of "mit.edu", but proxy 
>>> settings should still apply to "mit.edu" domain itself. From my point of 
>>> view, after reading wget man page, I don't think that the leftmost dost in 
>>> no_proxy definition has any special meaning.
>>
>> Hello Tomas,
>>
>> hard to decide how to handle this. I personally would like to see a
>> match with curl's behavior (see https://github.com/curl/curl/issues/1208).
>>
>> Given the docs from GNU emacs, you are right. "no_proxy=.mit.edu" means
>> "mit.edu and subdomains" are excluded from proxy settings.
>> (see https://www.gnu.org/software/emacs/manual/html_node/url/Proxies.html)
>>
>> The caveat with emacs' behavior is that you cannot exclude just all
>> subdomains of mit.edu without mit.edu itself. Effectively, that creates
>> a corner case that can't be handled at all. (but if curl also does it
>> that way, let's go for it).
>>
>> Maybe you can find out about the current no_proxy behavior of typical
>> and wide-spread tools (regarding leftmost dot) !? Once we have that
>> information, we can make a confident decision.
>>
>> Regards, Tim
>
> Hi Tim.
>
> It took me some time to go through the current situation and to be honest, it 
> is kind of a mess. While each tool handles the no_proxy env a little bit 
> differently, there are some similarities. Nevertheless I was not able to find 
> any standard.
>
> curl's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: can not match only the host; can not match only the domain and 
> subdomains
>
> current wget's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will NOT match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: can not match only the host
>
> wget's behavior with proposed patch:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: can not match only the host; can not match only the domain and 
> subdomains
> - it would be consistent with curl's behavior
>
> emacs's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will NOT match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: can not match only subdomains
>
> python httplib2's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will NOT match the domain and subdomains e.g. "www.mit.edu" or 
> "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: can not match only subdomains
>
> To sum it up. Each approach has some downsides. Given the change that I 
> provided, wget's behavior would be consistent with curl's behavior. However 
> it will have more downsides that it currently has, specifically it will loose 
> the ability to not to match the host, but only domain and subdomains. Emacs's 
> behavior is similar to Python's httplib2 behavior regarding the leftmost dot.
>
> Honestly I have a soft preference for keeping the current wget's behavior. 
> But I admit that making the behavior consistent with curl's behavior makes 
> sense. Please let me know how you would like to proceed.
>
> To make the behavior consistent with curl, the previously attached changes 
> should be OK. If you find those new conditions too complicated, I can try to 
> rethink it, but I already tried to make it as little complicated as possible 
> and at the same time trying to not completely rewrite the function.
>
> If you'll decide to keep the current behavior, I'll modify the test that I 
> added to cope with the behavior.

Great work, Tomas !

Wow, didn't think it's so messed up :-(

We should definitely document your results, e.g. in the wget manual.

If we keep the current behavior, we could adjust it with a new option or
a new env variable 'WGET_NO_PROXY_MODE'. Which could take well-defined
values like 'curl', 'emacs', 'wget' (the default), and maybe a new one
('strict') with none of the detected downsides.

Looks a bit over-engineered, but it means that wget can easily adopt to
existing environments. And the code seems pretty straight forward.

Let's see if some more opinions come in.

Regards, Tim



reply via email to

[Prev in Thread] Current Thread [Next in Thread]