[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale
From: |
Ángel González |
Subject: |
Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale |
Date: |
Thu, 20 Nov 2014 20:17:29 +0100 |
User-agent: |
Thunderbird |
Darshit Shah wrote:
In short, I;m not sure if we want to use c_strcasecmp for in http.c or
recur.c since domain names allowed to be non-ASCII. And if I'm not
wrong, even the HTTP headers may contain non-ASCII data. In such cases
would c_strcasecmp work correctly?
I expect them to be in punycode at that point. Maybe I read it too fast.
And if it isn't, it would only work -by chance- if the page uses the same
encoding as your locale.
I don't think the headers would be a problem. Note that most cases only
deal with the ascii case-unsensitiveness (eg. html5 explicitely notes that).
Tim Ruehsen wrote:
On Thursday 20 November 2014 00:12:08 Ángel González wrote:
I am attaching it here fwiw. I generally changed them on a few more
places, although I think
some of your edits to init.c are incorrect, as well as those on
progress.c: as they are
user-parameters, they _might_ be introduced in the user locale (they
would misteriously fail
when run under C locale in cron, though. I'm not so sure it should be
supported).
Please be more specific.
I would have included the diff of the diffs, but that would have been
really ugly to read. :)
Seemed better to provide the other patch.
Imaging user input --level=INF (or --level=inf) will be compared with "inf".
The turkish people will be used to enter the correct char in this case, namely
'I' or 'i' and not 'İ' or 'ı'. Else most programs would simply break. In this
case a ASCII comparison (c_str...) is absolutely ok. Using locale-aware
comparison would not work (well, the user could try it out since he gets
immediate response by Wget).
Then we can get rid of the strcasecmp comparisons left.
Notwithstanding with keeping parameters in user-locale case, I made the
accepts list C-case.
That's the most arguable one, but doesn't seem sensible to change the
code to support that.
I think this is not correct.
This comment was related to in_acclist. That function can perform
comparisons
in several ways:
- Using fnmatch_nocase (this function explicitely uses C locale)
- Using match_tail() (I had changed it to c_strcasecmp, as it was used
for eg. cookies)
- Using a strcasecmp
Thus, I changed the third branch to c_strcasecmp
The accepts and regexes are filename related.
Filenames are not limited to ASCII. What we have to do here is a normalization
to UTF-8 (using the users locale). Filenames/pathes found in HTML or CSS also
have to be converted to UTF-8 (using the page's locale). These UTF-8 strings
have to be compared with an appropriate function. str(n)casecmp would not be
correct here, a byte-by-byte comparison like c_str(n)casecmp is better but not
perfect. libunistring has functions for that.
You are probably right, which means we should redo this part of the code.
I would suggest that I push my patch.
We still have two weeks to inspect the changes... if in doubt, let's set up a
test case. Just give an example of what could go wrong and we can simply try
it out.
Ok, I rebased my patch on top of what you committed.
Changes:
-Remove strcase(n)cmp from cmpt.c -> Shouldn't be a problem
- Domain comparisons on cookies.c -> Perhaps not ok
- css-url.c and ftp-basic.c -> hardcoded ascii text , shall be c_strcasecmp
- ftp.c -> it's a user-provided glob, we can keep strcasecmp
- hash -> Mixed. make_nocase_string_hash_table is used for hosts, tags
and attributes
- html-parse.c -> the tags are only C-insensitive. Shall be changed
- html-url.c -> same with attrs
- parse_content_range -> this is a header reply. Change
- persistent_available_p (1) -> maybe not ok
- persistent_available_p (2-3) -> these are C constants. Change
- ensure_extension -> only called with either .html or .css. I would
change it.
- netrc -> Domain. Maybe not ok
- recur -> idem
- openssl.c -> clarifies a comment
- url.c -> We are comapring with a scheme, thus C comparison.
- utils.c -> Discussed above
So while some are completely clear, we should clarify hosts comparison.
Regards
0001-Replace-strcasecmp-and-strncasecmp-with-c_strcasecmp.patch
Description: Text Data
- Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale, (continued)
- Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale, Tim Ruehsen, 2014/11/20
- Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale, Giuseppe Scrivano, 2014/11/20
- Re: [Bug-wget] Removing form feeds from sources, Tim Ruehsen, 2014/11/20
- Re: [Bug-wget] Removing form feeds from sources, Giuseppe Scrivano, 2014/11/20
- Re: [Bug-wget] Removing form feeds from sources, Tim Ruehsen, 2014/11/20
- Re: [Bug-wget] Removing form feeds from sources, Darshit Shah, 2014/11/21
Re: [Bug-wget] [PATCH] Fix possible issues running in a turkish locale, Ángel González, 2014/11/19