Re: [Bug-wget] bad filenames (again)
From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Mon, 24 Aug 2015 15:44:09 +0200
User-agent: KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )
On Saturday 22 August 2015 00:39:01 Andries E. Brouwer wrote:
> On Fri, Aug 21, 2015 at 08:54:28PM +0200, Tim Rühsen wrote:
> > > Content-Disposition: attachment;
> > > filename="20101202_%EB...%A8-%EB%B0%B1_.sgf"
> > > This encodes a valid utf-8 filename, and that name should be used.
> > > So wget should save this file under the name
> > > 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
> >
> > This is a different issue. Here we are talking about the encoding of HTTP
> > headers, especially 'filename' values within Content-Disposition HTTP
> > header. Wget simply does not parse this correctly - it is just not coded
> > in. It is just Wget missing some code here (worth opening a separate
> > bug).
> Good, saved for later.
Just implemented (or let's say fixed) Content-Disposition in wget2. It now
saves the file as
20101202_농심신라면배_바둑(다카오신지9단-백_.sgf
Content-Disposition (filename, filename*) is standardized, but browsers seem
to behave and parse very differently, ignoring the standards.
See
http://stackoverflow.com/questions/93551/how-to-encode-the-filename-parameter-of-content-disposition-header-in-http
(answer 2 from Martin Ørding-Thomsen)
But that's just FYI. Different issue.
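For illustration only, here is a minimal Python sketch of the decoding step
discussed above: percent-decoding a 'filename' value and accepting the result
only if the bytes are valid UTF-8. This is not the wget2 code; the function
name decode_disposition_filename is hypothetical, and the sample input uses
only the visible tail of the filename quoted above.

```python
from urllib.parse import unquote


def decode_disposition_filename(raw: str) -> str:
    """Percent-decode a Content-Disposition filename value as UTF-8.

    Hypothetical helper: if the decoded bytes are not valid UTF-8,
    the raw value is returned unchanged instead of guessing.
    """
    try:
        # errors="strict" raises UnicodeDecodeError on invalid UTF-8.
        return unquote(raw, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return raw


if __name__ == "__main__":
    # Tail of the filename from the header quoted above.
    print(decode_disposition_filename("%EB%B0%B1_.sgf"))
```

Returning the raw value on failure is just one possible choice; a real client
might instead try another charset or sanitize the name.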
> > If the server AND the document do not explicitly specify the character
> > encoding, there still is one - namely the default. Has been ISO-8859-1
> > a while ago. AFAIR, HTML5 might have changed that (too late for me now
> > to look it up).
>
> Yes - that is our main difference. You read the standard and find there
> what everyone is supposed to do, or what the default is.
> I download stuff from the net and encounter lots of things people do,
> that are perhaps not according to the most recent standard,
> and may differ from the default.
>
> As a consequence I prefer to base the decision about what to do
> on the form of the filename (ASCII / UTF-8 / other), not on the
> headers encountered on the way to this file.
I guess we can find an easy agreement.
1. Wget has to obey the defaults. If that fails, or we find a well-known
misbehavior (server/document fault), handle it automatically.
That's how we try to do it now.
2. If a problem still arises, the user should be able to intervene, using
special command line options to fine-tune Wget's behavior.
Of course we try our best, so that 2. is normally not necessary.
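The order of precedence in the two points above could be sketched roughly as
follows (Python, purely illustrative). The names decode_name, declared and
user_charset are hypothetical and do not correspond to actual Wget options;
the ISO-8859-1 default is the one mentioned earlier in this thread, and the
fallback on failure is just one assumed way of "handling it automatically".

```python
DEFAULT_CHARSET = "iso-8859-1"  # default assumed from the discussion above


def decode_name(raw: bytes, declared=None, user_charset=None) -> str:
    """Decode bytes following the two-step policy sketched in the mail.

    user_charset: explicit user choice (point 2), wins over everything.
    declared:     charset stated by the server or document, if any.
    """
    if user_charset is not None:
        # Point 2: the user intervenes; errors are not hidden here.
        return raw.decode(user_charset)
    charset = declared or DEFAULT_CHARSET
    try:
        # Point 1: obey the declared charset, else the default.
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Assumed automatic handling: ISO-8859-1 maps every byte,
        # so this fallback cannot fail.
        return raw.decode("iso-8859-1")
```

For example, decode_name(b"\xe9", declared="utf-8") falls back to ISO-8859-1
because the byte is not valid UTF-8, while passing user_charset skips all
automatic handling.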
You already gave some examples; one of them (the Content-Disposition example)
has already led to an optimization (I'll transfer the code to Wget 1.x soon).
The other two obeyed the standards (one had f*cked up content, but that didn't
touch Wget's functionality).
I would ask you to give more examples of websites that you think aren't
standard and/or where Wget has problems parsing out the links.
That would be 50% of the work.
> (By the way, I checked my conjecture that iconv from UTF-8
> to UTF-8 need not be the identity map, and that is indeed the case.
> On my Ubuntu machine iconv from UTF-8 to UTF-8 converts NFD to NFC.)
We should have a 'shortcut', so if to-charset and from-charset are the same,
we don't convert.
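A rough sketch of such a shortcut, in Python for brevity (not Wget code; the
names same_charset and convert are hypothetical). Charset names are compared
after alias resolution, so "UTF-8" and "utf8" count as the same. Note that
Python's own codecs do not normalize, so this only illustrates the control
flow that would keep an iconv-based conversion from turning NFD into NFC.

```python
import codecs
import unicodedata


def same_charset(a: str, b: str) -> bool:
    """True if both names resolve to the same codec (aliases included)."""
    return codecs.lookup(a).name == codecs.lookup(b).name


def convert(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Convert between charsets, skipping the work when they are equal."""
    if same_charset(from_charset, to_charset):
        # Shortcut: return the input bytes untouched.
        return data
    return data.decode(from_charset).encode(to_charset)


if __name__ == "__main__":
    # "e" followed by a combining acute accent (NFD form), encoded as UTF-8.
    nfd = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")
    assert convert(nfd, "UTF-8", "utf8") == nfd  # bytes are left as-is
```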
Tim