Re: [PATCH] no_proxy domain matching

bug-wget

From:	Tim Rühsen
Subject:	Re: [PATCH] no_proxy domain matching
Date:	Wed, 20 Nov 2019 18:47:03 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2

On 20.11.19 12:41, Tomas Hozza wrote:
> On 7. 11. 2019 21:30, Tim Rühsen wrote:
>> On 07.11.19 15:21, Tomas Hozza wrote:
>>> Hi.
>>>
>>> In RHEL-8, we ship a wget version that suffers from a bug fixed by [1]. The
>>> fix resolved an issue with matching subdomains when the no_proxy domain
>>> definition was prefixed with a dot, e.g. "no_proxy=.mit.edu". As part of
>>> backporting the fix to RHEL, I wanted to create an upstream test for the
>>> no_proxy functionality. However, I found that there is still one corner
>>> case which is not handled by the current upstream code, and honestly I'm
>>> not sure what the intended domain matching behavior is in that case. The
>>> man page is also not very specific in this regard.
>>>
>>> The corner case is as follows:
>>> - no_proxy=.mit.edu
>>> - download URL is e.g. "http://mit.edu/file1"
>>>
>>> In this case the proxy settings are used, because the domains don't match
>>> due to the leftmost dot in the no_proxy domain definition. This is either
>>> intended or a corner case that was not considered. One could argue that if
>>> no_proxy is set to ".mit.edu", the leftmost dot means that only subdomains
>>> of "mit.edu" are exempt from the proxy settings, while the proxy settings
>>> still apply to the "mit.edu" domain itself. From my point of view, after
>>> reading the wget man page, I don't think that the leftmost dot in the
>>> no_proxy definition has any special meaning.
>>
>> Hello Tomas,
>>
>> hard to decide how to handle this. I personally would like to see a
>> match with curl's behavior (see https://github.com/curl/curl/issues/1208).
>>
>> Given the docs from GNU emacs, you are right. "no_proxy=.mit.edu" means
>> "mit.edu and subdomains" are excluded from proxy settings.
>> (see https://www.gnu.org/software/emacs/manual/html_node/url/Proxies.html)
>>
>> The caveat with emacs' behavior is that you cannot exclude just all
>> subdomains of mit.edu without mit.edu itself. Effectively, that creates
>> a corner case that can't be handled at all. (but if curl also does it
>> that way, let's go for it).
>>
>> Maybe you can find out how typical, widespread tools currently handle
>> no_proxy (regarding the leftmost dot)? Once we have that information, we
>> can make a confident decision.
>>
>> Regards, Tim
>
> Hi Tim.
>
> It took me some time to go through the current situation and to be honest, it 
> is kind of a mess. While each tool handles the no_proxy env a little bit 
> differently, there are some similarities. Nevertheless I was not able to find 
> any standard.
>
> curl's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: cannot match only the host; cannot match only the domain and
>   subdomains
>
> current wget's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will NOT match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: cannot match only the host
>
> wget's behavior with proposed patch:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: cannot match only the host; cannot match only the domain and
>   subdomains
> - it would be consistent with curl's behavior
>
> emacs's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will NOT match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: cannot match only subdomains
>
> Python httplib2's behavior:
> - "no_proxy=.mit.edu"
>   - will match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - "no_proxy=mit.edu"
>   - will NOT match the domain and subdomains, e.g. "www.mit.edu" or
>     "www.subdomain.mit.edu"
>   - will match the host "mit.edu"
> - downside: cannot match only subdomains
>
> To sum it up: each approach has some downsides. With the change that I
> provided, wget's behavior would be consistent with curl's behavior. However,
> it would have more downsides than it currently has; specifically, it would
> lose the ability to match only the domain and subdomains without also
> matching the host. Emacs's behavior is similar to Python's httplib2 behavior
> regarding the leftmost dot.
>
> Honestly, I have a soft preference for keeping wget's current behavior.
> But I admit that making the behavior consistent with curl's behavior makes
> sense. Please let me know how you would like to proceed.
>
> To make the behavior consistent with curl, the previously attached changes
> should be OK. If you find those new conditions too complicated, I can try to
> rethink them, but I already tried to keep them as simple as possible without
> completely rewriting the function.
>
> If you decide to keep the current behavior, I'll modify the test that I
> added to cope with that behavior.

Great work, Tomas !

Wow, didn't think it's so messed up :-(

We should definitely document your results, e.g. in the wget manual.

If we keep the current behavior, we could make it adjustable with a new option
or a new env variable 'WGET_NO_PROXY_MODE', which could take well-defined
values like 'curl', 'emacs', 'wget' (the default), and maybe a new one
('strict') with none of the detected downsides.

Looks a bit over-engineered, but it means that wget can easily adapt to
existing environments. And the code seems pretty straightforward.
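
To illustrate the idea, here is a rough sketch in Python rather than wget's
C. It is not taken from any patch. The function names are hypothetical, the
per-mode table only encodes the behaviors listed in Tomas' survey above, and
the 'strict' semantics (a bare name matches only the host, a dotted name
matches only subdomains) are just one possible reading of "none of the
detected downsides".

```python
import os

# (matches_exact_host, matches_subdomains) for each pattern form.
# "dotted" is a pattern like ".mit.edu", "bare" is a pattern like "mit.edu".
# The curl/wget/emacs rows follow the survey in this thread; the "strict"
# row is an assumption made for this sketch.
MODES = {
    "curl":   {"dotted": (True, True),  "bare": (True, True)},
    "wget":   {"dotted": (False, True), "bare": (True, True)},
    "emacs":  {"dotted": (True, True),  "bare": (True, False)},
    "strict": {"dotted": (False, True), "bare": (True, False)},
}


def no_proxy_match(host, pattern, mode="wget"):
    """Return True if `host` is covered by one no_proxy `pattern`."""
    host, pattern = host.lower(), pattern.lower()
    dotted = pattern.startswith(".")
    domain = pattern[1:] if dotted else pattern
    # Unknown modes fall back to the current wget behavior.
    rules = MODES.get(mode, MODES["wget"])
    exact_ok, sub_ok = rules["dotted" if dotted else "bare"]
    if host == domain:
        return exact_ok
    if host.endswith("." + domain):
        return sub_ok
    return False


def bypass_proxy(host, environ=os.environ):
    """Check `host` against every entry of a comma-separated no_proxy list."""
    mode = environ.get("WGET_NO_PROXY_MODE", "wget")
    patterns = [p.strip() for p in environ.get("no_proxy", "").split(",")]
    return any(no_proxy_match(host, p, mode) for p in patterns if p)
```

With "no_proxy=.mit.edu", bypass_proxy("mit.edu", ...) is False in the default
'wget' mode and True with WGET_NO_PROXY_MODE=curl, which is exactly the corner
case from the start of this thread. In the assumed 'strict' mode, one would
list both "mit.edu" and ".mit.edu" to cover the host and its subdomains.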

Let's see if some more opinions come in.

Regards, Tim
