Re: [Nmh-workers] What should nhm do with busted Subject: lines?
From: Ken Hornstein
Subject: Re: [Nmh-workers] What should nhm do with busted Subject: lines?
Date: Wed, 04 Nov 2015 19:08:38 -0500
>So I got an e-mail from an Outlook abuser that had some UTF-8 smart
>quote characters in the Subject: line - sans RFC2047 encoding, just
>bare UTF-8 characters, naked as the day they were typed, plonked in the
>middle of the line.
>
>What *should* nmh do here (given that we don't have a way to tell it
>was UTF-8 versus an ISO8859-N or 2022 or what-have-you)?
Technically ... those are legal nowadays. See RFC 6532. That's a
message/global message.
What should we do? We should deal with it. I think we might not do so
well right now. Okay, fine, what does 'deal with it' mean? Well ...
technically the only valid 'raw' 8-bit characters in headers are UTF-8.
But I am aware that some busticated MUAs still send raw 8-bit data in
other character sets.
I see two possible ways to deal with it better:
1) Assume any unencoded 8-bit characters in email headers are UTF-8. Treat
as UTF-8, which means converting to local character set if necessary.
If it turns out those bytes are not UTF-8, then either they'll fail
character conversion or end up as mojibake on a user's terminal (well,
they'll probably end up as the Unicode replacement character, U+FFFD).
2) Do 1), except check first to see if all of the 8-bit sequences are
valid UTF-8 encoding (it's possible for an arbitrary sequence of
8-bit characters to be a valid UTF-8 encoded sequence, but very unlikely).
If it's all valid, treat as 1). Otherwise use substitution characters
for everything 8-bit.
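
To make the difference between the two options concrete, here is a rough
sketch in Python (not nmh code, and not a proposed patch; the function names
and the '?' substitution character are made up for illustration). It assumes
the header value arrives as raw bytes and that "8-bit" means any byte >= 0x80.

```python
def decode_assume_utf8(raw: bytes) -> str:
    """Option 1: treat every unencoded 8-bit byte as UTF-8.

    Invalid sequences are not rejected; the decoder replaces them with
    U+FFFD, the Unicode replacement character.
    """
    return raw.decode("utf-8", errors="replace")


def decode_validate_first(raw: bytes, substitute: str = "?") -> str:
    """Option 2: accept the header as UTF-8 only if all of it is valid.

    If strict decoding fails, keep the 7-bit ASCII bytes and replace every
    8-bit byte with a substitution character (hypothetical choice: '?').
    """
    try:
        # Strict mode raises if any 8-bit sequence is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return "".join(chr(b) if b < 0x80 else substitute for b in raw)


if __name__ == "__main__":
    smart = "Re: \u201cbusted\u201d Subject".encode("utf-8")
    latin1 = "caf\u00e9 menu".encode("iso-8859-1")
    for raw in (smart, latin1):
        print(repr(decode_assume_utf8(raw)), repr(decode_validate_first(raw)))
```

Both functions return the same thing for a header that really is UTF-8; they
only differ in what the user sees when it is not. Converting the result to
the local character set would be a separate step after this.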
--Ken