Re: [Chicken-users] Basic abnf usage?
Thu, 16 Apr 2015 22:35:00 +0200
sorry for the late reply, got busy :-)
On 28 March 2015 22:18 CET, Matt Gushee wrote:
> On Sat, Mar 28, 2015 at 5:33 AM, Moritz Heidkamp <address@hidden
>> ah, that's what you are referring to, I see! It's like that because I
>> didn't want to force a utf8 dependency on the user.
> Maybe in that case it would be good if the API doc said something like:
> "comparse is compatible with UTF-8, but many of the built-in combinators do
> not work with UTF-8 characters, so you may need to construct your own. For
> example: ..."
Sure, we can do that! I haven't mentioned it so far because this property
is implicit in how CHICKEN core strings work. Also, note that Comparse
can parse arbitrary inputs, not just character strings / sequences,
e.g. you can use it to parse a list of symbols into a structure. No
arbitrary limitations and all that :-)
>> This doesn't work because `sake' is a string and Comparse operates on
>> the byte level by default (in accordance with CHICKEN core string
> Well, yes. And, though I understand that if you *know* your programs
> only need to process single-byte characters, it is convenient (and
> better performance-wise, though I wonder how much in the year 2015) to
> equate characters with bytes.
It's not quite that simple: characters may be encoded in many ways.
UTF-8 is far from the only widely used encoding and is not ideal in all
cases, e.g. the algorithmic complexity of some operations on UTF-8
encoded strings (such as indexing by code point) is objectively worse
than on UTF-32 encoded strings. And it will remain so even till the year
2050. No guarantees on what happens after that, though!
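
To make the complexity point concrete, here is a rough sketch in Python (not CHICKEN, purely for illustration). The helper names `utf8_codepoint_at` and `utf32_codepoint_at` are made up for this example, and the UTF-8 helper assumes well-formed input. Finding the n-th code point in UTF-8 means walking the bytes from the start, while in UTF-32 it is a fixed-offset lookup:

```python
def utf8_codepoint_at(data: bytes, n: int) -> str:
    """Return the n-th code point of well-formed UTF-8 data (linear scan)."""
    i = 0      # byte offset
    seen = 0   # code points passed so far
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many bytes this code point occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        else:
            width = 4
        if seen == n:
            return data[i:i + width].decode("utf-8")
        i += width
        seen += 1
    raise IndexError(n)


def utf32_codepoint_at(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-32 (little-endian) data (constant time)."""
    if not 0 <= n < len(data) // 4:
        raise IndexError(n)
    # Every code point is exactly 4 bytes, so the offset is simply 4 * n.
    return data[4 * n:4 * n + 4].decode("utf-32-le")


text = "sake 酒!"
u8 = text.encode("utf-8")
u32 = text.encode("utf-32-le")
```

Both helpers return the same character for the same index; the difference is only in how much work is needed to get there, and in how much space each encoding takes.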
> I'm of the opinion (shared by many I18n experts, if I'm not mistaken)
> that a high-level language in the 21st century should have in its core
> a rock-solid character abstraction that is never, ever conflated with
> a byte.
The character abstraction actually is rock-solid even in CHICKEN 4
already: A character object represents a Unicode codepoint in an
encoding independent way.
> There are a lot of things I love about Chicken, but the (IMHO
> obsolete) string implementation is not one of them.
Yeah, strings being equivalent to u8vectors / blobs is a bit messy at
times. I think this is something worth addressing in CHICKEN 5. It would
be a rather invasive change, though, and so far nobody seems inclined to
put in the effort.
For the time being, it's safest to mainly think of strings as byte
arrays, i.e. mentally (or actually, as the utf8 egg does internally)
replace the `string-' prefix with a `byte-' prefix in the core string
Whenever you work with actual strings, use the extension for the
respective encoding you're dealing with (though currently the only fully
supported option with CHICKEN 4 is UTF-8 via the utf8 egg AFAIK).
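
The "strings are byte arrays" view can be sketched like this (again in Python, only as an analogy for what happens with core strings in CHICKEN 4; the variable names are made up for the example). The byte length and the character length of the same text differ, and cutting at a byte offset can land in the middle of a character:

```python
word = "酒"
raw = word.encode("utf-8")            # the byte-level view of the string

byte_length = len(raw)                    # counts bytes
char_length = len(raw.decode("utf-8"))    # counts code points

# Slicing at a byte offset can cut a multi-byte character in half.
first_byte = raw[:1]
try:
    first_byte.decode("utf-8")
    splits_cleanly = True
except UnicodeDecodeError:
    splits_cleanly = False
```

This is why an encoding-aware layer (such as the utf8 egg) is needed whenever the code cares about characters rather than bytes.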
>> We could create a comparse-utf8 egg to facilitate this. It's not
>> currently on my agenda but I will put it in my Comparse notes for future
>> reference. If you feel inclined to create one, I'm happy to provide you
>> with code review and feedback!
> I was thinking about that.
That'd be great! I would suggest making it a separate egg so that we
keep the utf8 dependency optional. To give an example of why such
dependencies matter: Medea used to depend on the utf8 egg for some
things but we removed that dependency when we wanted to package it up
for a mobile application. The reason was that the utf8 egg bundles some
extra case map files which we didn't need at all but which would have
complicated the packaging considerably.
> I looked through the source code to see what string handling functions
> are used that are not provided by the utf8 egg, and thus would need to
> be reimplemented. So far I've found:
> Substring/shared is not too big a deal, but that KMP stuff is a bit
> daunting. Maybe I'll look into it if I have time. I do like the comparse
> API, and would like to be able to use it.
Substring/shared is not a big deal, indeed, since it's essentially an
alias of the regular substring procedure in CHICKEN 4 anyway. And as
John pointed out, you don't really need to touch any of the KMP
Thinking about it some more, I believe it would be necessary to expose
the char-seq-cursor API to be able to provide a fully featured
comparse-utf8 module. Let me know if that turns out to be the case --
I'd be happy to extract a comparse-lolevel module for that purpose :-)