Re: [Bug-wget] bad filenames (again)

From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Tue, 25 Aug 2015 14:59:58 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Mon, Aug 24, 2015 at 03:44:09PM +0200, Tim Ruehsen wrote:

> Just implemented (or let's say fixed) Content-Disposition in wget2. It now
> saves the file as
> 20101202_농심신라면배_바둑(다카오신지9단-백_.sgf


> Content-Disposition (filename, filename*) is standardized, but browsers seem
> to behave/parse very differently, ignoring standards.

Yes. On the web a general phenomenon is that non-specialists create websites.
They know nothing about standards, but fiddle until it works (say, with IE6).
Also Microsoft does/did not respect standards.

A consequence is that practice is more important than theory.
One has to try for robust solutions.

> > I prefer to base the decision about what to do on the form
> > of the filename (ASCII / UTF-8 / other), not on the
> > headers encountered on the way to this file.
> I guess we can find an easy agreement.
> 1. Wget has to obey the defaults. If it fails or we find a well-known
> misbehavior (server/document fault), handle it automatically.
> That's how we try to do it now.
> 2. If a problem still arises, the user should be able to intercept, using
> special command line options for fine-tuning Wget's behavior.

Yes, whatever the user says, we do; the case where options have been given
is unproblematic.

That leaves your point 1. I am not sure what you think the defaults are.

My basic example is the %-encoded pure ASCII URL, referring to a non-text
object. How should wget save the object? There is zero charset information.
My answer today (after conversation with Eli) is:
"Decode the %-encoded string. The last part is the suggested filename.
If it is ASCII, use that ASCII name (where valid for the OS).
If it is UTF-8 (but not ASCII), use it when the locale is UTF-8,
otherwise convert (if possible) or escape.  If it is not UTF-8, escape."

[That is, I would recognize only what is easy to recognize,
and prefer not to rely on any headers. Prefer not to convert
except possibly in the UTF-8 case.]
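
The rule quoted above could be sketched roughly as follows, in Python. This
is only an illustration, not wget code: the function names, the
locale_is_utf8 flag and the example.com URLs are made up for the example, the
"where valid for the OS" check is left out, and in a non-UTF-8 locale the
sketch always escapes instead of attempting a conversion.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def classify(name: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'other', judging only by the bytes."""
    try:
        name.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        name.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def escape(name: bytes) -> str:
    """%-escape every byte that is not printable ASCII, and '%' itself."""
    out = []
    for b in name:
        if 0x20 <= b <= 0x7E and b != 0x25:
            out.append(chr(b))
        else:
            out.append("%%%02X" % b)
    return "".join(out)


def suggested_filename(url: str, locale_is_utf8: bool) -> str:
    # Decode the %-encoded path first, then take the last component.
    raw_path = unquote_to_bytes(urlsplit(url).path)
    raw = raw_path.rsplit(b"/", 1)[-1]
    kind = classify(raw)
    if kind == "ascii":
        # Assumption: no check for OS-specific forbidden characters here.
        return raw.decode("ascii")
    if kind == "utf-8" and locale_is_utf8:
        return raw.decode("utf-8")
    # Simplification: escape rather than convert to the locale charset.
    return escape(raw)


if __name__ == "__main__":
    print(suggested_filename("http://example.com/a/kn%C3%A4ckebr%C3%B6d.jpg", True))
    print(suggested_filename("http://example.com/a/kn%E4ckebr%F6d.jpg", True))
```

Note that no header and no charset guess enters into this: the only question
asked of the decoded bytes is whether they are ASCII, valid UTF-8, or
neither.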

How does your answer differ?
Some ancient docs say that ISO-8859-1 is a default. Even if such docs
have not yet been replaced, I feel that they no longer reflect current
practice. ISO-8859-x is dying. All the web should converge to Unicode,
whatever that may be.

The relevant example might be that
I have the impression that you are happy with "knäckebröd.jpg"
but I would be unhappy with that (although it happens to be correct),
since guessing and conversion are involved.
Guessing may not be so bad, but guessing and converting is terrible:
it can be really complicated to retrieve the original filename
after an incorrect conversion.


Another URL:
This is about holidays near the beautiful city Győr in Hungary.
But what happened with the city? Its name was written in ISO-8859-2,
using 0xf5, and that was %-escaped to %f5, and that was again
%-escaped to %25f5.

I would undo the %-escape and see pure ASCII, and save as
What would you do?
The page has <meta charset="ISO-8859-2" />
The headers have Content-Type: text/html without charset information.


Similarly http://www.matklubben.se/recept/lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l
has the %-encoded version of "Lchf kn%e4ckebr%f6d mandelmj%f6l"
which again encoded the ISO-8859-1 version of lchf knäckebröd mandelmjöl.

Such double encodings are not uncommon.
But as a first approximation I think wget should not try to recognize them.
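
To make the double encoding concrete, here is a small Python sketch that
peels the layers off by hand. The helper name peel is made up for the
illustration, and the "Gy%25f5r" string is constructed from the description
of the Győr case, not copied from the actual URL. It is not a proposal that
wget do this automatically.

```python
from urllib.parse import unquote_to_bytes


def peel(s: str, rounds: int) -> bytes:
    """Undo %-escaping `rounds` times and return the resulting bytes."""
    data = s.encode("ascii")
    for _ in range(rounds):
        data = unquote_to_bytes(data)
    return data


part = "lchf+kn%25e4ckebr%25f6d+mandelmj%25f6l"

once = peel(part, 1)   # still pure ASCII: b'lchf+kn%e4ckebr%f6d+mandelmj%f6l'
twice = peel(part, 2)  # now contains the bytes 0xe4 and 0xf6

print(once)
print(twice.decode("iso-8859-1"))  # '+' is left alone by unquote_to_bytes

# Constructed example for the ISO-8859-2 case: 0xf5 is 'ő' in that charset.
print(peel("Gy%25f5r", 2).decode("iso-8859-2"))
```

After one round of unescaping the name is pure ASCII, so a rule that looks
only at the form of the name stops there. The second round and the choice of
ISO-8859-1 or ISO-8859-2 both need knowledge that comes from outside the URL.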


ends in %C0%B6的%D1的%C0.HTM - this is an %-encoding using fat %-signs (U+ff05).

You see that one can encounter all levels of messiness.
