chicken-hackers

Re: Patch for CHICKEN 6 uri-generic


From: Ivan Raikov
Subject: Re: Patch for CHICKEN 6 uri-generic
Date: Sat, 31 Aug 2024 18:51:47 -0700

Hello Peter,

Thanks for your patience, and apologies for blocking the porting of the
web stack. It has been a really busy summer for me. I think pct-encode
and pct-decode contain many undocumented constants, which makes them
difficult to understand for someone unfamiliar with UTF-8 encoding. Let
me read through and try to understand and at least annotate the logic
over the next couple of days, so that it can be related to the UTF-8
byte sequence syntax in RFC 3629.

Thanks,
Ivan
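
The two constants queried earlier in the thread, #x800 and #x10000, are the first code points that need three and four bytes respectively in the UTF-8 layout of RFC 3629. Below is a minimal sketch of how such boundaries could be given names. It is written in Python rather than the Scheme of the patch, and the constant names and the utf8_length helper are hypothetical; they are not taken from uri-generic or from CHICKEN 6.

```python
# Hypothetical names for the RFC 3629 range boundaries (not from the patch).
FIRST_TWO_BYTE = 0x80        # U+0080 is the first code point needing 2 bytes
FIRST_THREE_BYTE = 0x800     # U+0800 is the first code point needing 3 bytes
FIRST_FOUR_BYTE = 0x10000    # U+10000 is the first code point needing 4 bytes
LAST_CODE_POINT = 0x10FFFF   # upper limit of the Unicode code space


def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for the given code point."""
    if not 0 <= code_point <= LAST_CODE_POINT:
        raise ValueError("code point out of range")
    if code_point < FIRST_TWO_BYTE:
        return 1
    if code_point < FIRST_THREE_BYTE:
        return 2
    if code_point < FIRST_FOUR_BYTE:
        return 3
    return 4
```

With names like these, a comparison such as "less than #x800" reads as "fits in at most two bytes", which is the kind of annotation being asked for.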

On Tue, Aug 27, 2024 at 5:47 AM Peter Bex <peter@more-magic.net> wrote:
>
> Hi Ivan,
>
> I'd like to get something committed, this is blocking porting efforts
> of the rest of the web stack, which I'd like to work on during the
> Gosling CHICKEN event.  Would you object if I just commit what I have
> now?  We can make improvements on this as we go.
> Note that CHICKEN 6 isn't officially out yet anyway, but it'd be nice
> if most of the important eggs already work on the day it's released.
>
> Cheers,
> Peter
>
> On Wed, May 22, 2024 at 11:33:08AM -0700, Ivan Raikov wrote:
> > Hello Peter,
> >
> > Thanks a lot for the patch! Overall it looks ok, but it has been quite
> > a while since I have had to deal with UTF-8 at this level of detail,
> > so I don't really understand all the bitwise operations and range
> > comparisons. I am wondering if it is possible to factor out the
> > UTF-8-specific logic into a separate module and let it be invoked by
> > the uri-generic parsing routines. Also, I think it would be
> > tremendously helpful to use named constants, as I don't quite know the
> > significance of #x800 or #x10000. Perhaps CHICKEN 6 already offers the
> > definitions and routines to make this code more readable? I will try
> > to install CHICKEN 6 and actually run the code with your patch soon.
> >
> > Thanks,
> > Ivan
> >
> > On Thu, May 16, 2024 at 2:52 AM Peter Bex <peter@more-magic.net> wrote:
> > >
> > > On Wed, May 15, 2024 at 02:44:07PM +0200, Peter Bex wrote:
> > > > Unfortunately, it also means we must now choose to reject certain URIs
> > > > (at least in uri-common) by raising an exception instead of allowing
> > > > them to be decoded.  These are for invalid UTF-8 encoded characters,
> > > > either because they're a truncated byte sequence or because they
> > > > encode a character in too many bytes.
> > >
> > > I realised that there was a bug in this code, since "eat-rest-chars"
> > > would consume the percent-encoded bytes and then they'd get discarded
> > > in case the set of characters to decode doesn't contain the decoded
> > > character in question.
> > >
> > > After trying this out, I noticed that the code actually worked, to my
> > > astonishment.  Then I realised that this was because the code would
> > > still cons the UTF-8 *leading byte* back onto the result, and then
> > > traverse the cdr of the UTF-8 tail bytes as the rest list, which would
> > > pass through unprocessed.
> > >
> > > I added test cases for both situations and changed the code to detect
> > > UTF-8 continuation bytes without a leading byte and bail out in such
> > > cases, or other unforeseen and unhandled cases (the "else" in the main
> > > decoder).  And of course tweaked the eat-rest-chars code and callers to
> > > always restore all of the undecoded bytes.
> > >
> > > Still not super happy with using "values" here and the way we're passing
> > > in somewhat redundant information about the first consumed byte.  I
> > > realise an alternative would be to pass the success continuation as an
> > > argument but I don't think that makes the code much clearer.
> > >
> > > Cheers,
> > > Peter
> >
> >
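
The rejection cases described in the quoted messages (a truncated byte sequence, a character encoded in too many bytes, and a continuation byte with no leading byte) can be illustrated with a small strict decoder. This is only a sketch in Python, not the Scheme code from the patch: decode_utf8_strict and its error messages are hypothetical, and it does not model eat-rest-chars or the restoring of undecoded percent-encoded bytes.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8, raising ValueError on malformed input (sketch only)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Pick the sequence length, the payload bits of the leading byte,
        # and the smallest code point that legitimately needs that length.
        if lead < 0x80:                    # 0xxxxxxx: ASCII
            length, cp, minimum = 1, lead, 0
        elif 0xC0 <= lead <= 0xDF:         # 110xxxxx
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:         # 1110xxxx
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:         # 11110xxx
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            # 0x80..0xBF is a continuation byte without a leading byte;
            # 0xF8..0xFF never appears in UTF-8.
            raise ValueError("unexpected byte at offset %d" % i)
        if i + length > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:        # must look like 10xxxxxx
                raise ValueError("missing continuation byte")
            cp = (cp << 6) | (byte & 0x3F)
        if cp < minimum:
            raise ValueError("overlong encoding (too many bytes)")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("code point not allowed by RFC 3629")
        out.append(chr(cp))
        i += length
    return "".join(out)
```

For example, decode_utf8_strict(b"\xe2\x82\xac") returns "€", while b"\xe2\x82" (truncated), b"\xc0\xaf" (overlong) and b"\x80" (stray continuation byte) each raise ValueError.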


