chicken-hackers

Re: Patch for CHICKEN 6 uri-generic


From: Peter Bex
Subject: Re: Patch for CHICKEN 6 uri-generic
Date: Tue, 27 Aug 2024 14:47:19 +0200

Hi Ivan,

I'd like to get something committed, since this is blocking porting efforts
of the rest of the web stack, which I'd like to work on during the
Gosling CHICKEN event.  Would you object if I just commit what I have
now?  We can make improvements on this as we go.
Note that CHICKEN 6 isn't officially out yet anyway, but it'd be nice
if most of the important eggs already work on the day it's released.

Cheers,
Peter

On Wed, May 22, 2024 at 11:33:08AM -0700, Ivan Raikov wrote:
> Hello Peter,
> 
> Thanks a lot for the patch! Overall it looks ok, but it has been quite
> a while since I have had to deal with UTF-8 at this level of detail,
> so I don't really understand all the bitwise operations and range
> comparisons. I am wondering if it is possible to factor out the
> UTF-8-specific logic into a separate module and let it be invoked by
> the uri-generic parsing routines. Also, I think it would be
> tremendously helpful to use named constants, as I don't quite know the
> significance of #x800 or #x10000. Perhaps CHICKEN 6 already offers the
> definitions and routines to make this code more readable? I will try
> to install CHICKEN 6 and actually run the code with your patch soon.
> 
> Thanks,
> Ivan
> 
> On Thu, May 16, 2024 at 2:52 AM Peter Bex <peter@more-magic.net> wrote:
> >
> > On Wed, May 15, 2024 at 02:44:07PM +0200, Peter Bex wrote:
> > > Unfortunately, it also means we must now choose to reject certain URIs
> > > (at least in uri-common) by raising an exception instead of allowing them
> > > to be decoded.  These are for invalid UTF-8 encoded characters, either
> > > because they're a truncated byte sequence or because they encode a
> > > character in too many bytes.
> >
> > I realised that there was a bug in this code, since "eat-rest-chars"
> > would consume the percent-encoded bytes and then they'd get discarded
> > in case the set of characters to decode doesn't contain the decoded
> > character in question.
> >
> > After trying this out, I noticed that the code actually worked, to my
> > astonishment.  Then I realised that this was because the code would
> > still cons the UTF-8 *leading byte* back onto the result, and then
> > traverse the cdr of the UTF-8 tail bytes as the rest list, which would
> > pass through unprocessed.
> >
> > I added test cases for both situations and changed the code to detect
> > UTF-8 continuation bytes without a leading byte and bail out in such
> > cases, or other unforeseen and unhandled cases (the "else" in the main
> > decoder).  And of course tweaked the eat-rest-chars code and callers to
> > always restore all of the undecoded bytes.
> >
> > Still not super happy with using "values" here and the way we're passing
> > in somewhat redundant information about the first consumed byte.  I realise
> > an alternative would be to pass the success continuation as an argument
> > but I don't think that makes the code much clearer.
> >
> > Cheers,
> > Peter
> 
> 
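For reference, the two constants asked about above, #x800 and #x10000, are the smallest code points that need three and four bytes in UTF-8. A decoder can compare the decoded value against these thresholds to detect a character encoded in too many bytes. The sketch below illustrates that check together with the truncated-sequence and stray-continuation-byte cases described in the thread. It is written in Python rather than the egg's Scheme, is not the patch itself, and every name in it (decode_utf8_char, MIN_THREE_BYTE, and so on) is hypothetical.

```python
# Smallest code point that legitimately needs N bytes in UTF-8.
# All names here are hypothetical, chosen only for illustration.
MIN_TWO_BYTE = 0x80
MIN_THREE_BYTE = 0x800     # the #x800 mentioned in the thread
MIN_FOUR_BYTE = 0x10000    # the #x10000 mentioned in the thread
MAX_CODE_POINT = 0x10FFFF


def decode_utf8_char(data, start=0):
    """Decode one UTF-8 character from a bytes object.

    Returns (code_point, number_of_bytes_consumed). Raises ValueError on
    a continuation byte without a leading byte, a truncated sequence, or
    an overlong encoding.
    """
    lead = data[start]
    if lead < 0x80:                       # 0xxxxxxx: plain ASCII
        return lead, 1
    if lead & 0xE0 == 0xC0:               # 110xxxxx: two-byte sequence
        length, minimum, value = 2, MIN_TWO_BYTE, lead & 0x1F
    elif lead & 0xF0 == 0xE0:             # 1110xxxx: three-byte sequence
        length, minimum, value = 3, MIN_THREE_BYTE, lead & 0x0F
    elif lead & 0xF8 == 0xF0:             # 11110xxx: four-byte sequence
        length, minimum, value = 4, MIN_FOUR_BYTE, lead & 0x07
    else:
        # Either a 10xxxxxx continuation byte with no leading byte,
        # or a byte in 0xF8..0xFF, which never appears in UTF-8.
        raise ValueError("invalid leading byte")

    tail = data[start + 1:start + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if byte & 0xC0 != 0x80:           # every tail byte must be 10xxxxxx
            raise ValueError("truncated sequence")
        value = (value << 6) | (byte & 0x3F)

    if value < minimum:
        # The value would have fit in fewer bytes.
        raise ValueError("overlong encoding")
    if value > MAX_CODE_POINT or 0xD800 <= value <= 0xDFFF:
        raise ValueError("code point out of range")
    return value, length
```

In a percent-decoder, the input bytes would come from the %XX escapes; on ValueError the caller can either raise or put all of the undecoded bytes back, which is the behaviour the thread settles on.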


