
Re: [Bug-apl] Regex support


From: Elias Mårtenson
Subject: Re: [Bug-apl] Regex support
Date: Fri, 22 Sep 2017 18:28:24 +0800

I made the changes needed to use UTF-32 instead. It turned out that the PCRE version 1 API I was using does not properly support UTF-32 patterns (only match data). Thus, I changed the code to use version 2 instead.

I have attached the two files that I changed. It works, as can be seen in the example below, but it's nowhere near complete.

      '(..)..(..)⍱$' ⎕RE "footesting⌽⍱"
┏→━━━━━━━━┓
┃"st" "g⌽"┃
┗∊━━━━━━━━┛

Now, there are two changes I would like to see:
The reason I haven't implemented these myself is that I find the current code absolutely awful, especially all the duplicated code to deallocate PCRE structures. In Lisp I'd use UNWIND-PROTECT (or try/finally in Java), but in C++ I think I have to declare a new class with a destructor to handle this, correct? Is there anyone who would like to clean this up?
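
The destructor-based cleanup asked about above is usually done in C++ with RAII, and std::unique_ptr with a custom deleter avoids writing a new class per resource type. The following is only a sketch: Resource, resource_create and resource_free are hypothetical stand-ins for a C-style handle and its free function, not part of PCRE or GNU APL. With PCRE2 the deleters would presumably wrap pcre2_code_free and pcre2_match_data_free.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a C-style handle (e.g. a compiled pattern).
struct Resource { int id; };

// Counts handles that have been created but not yet freed.
static int live_resources = 0;

// Hypothetical C-style allocate/free pair.
Resource* resource_create(int id) { ++live_resources; return new Resource{id}; }
void resource_free(Resource* r)   { --live_resources; delete r; }

// Deleter that forwards to the C-style free function.
struct ResourceDeleter {
    void operator()(Resource* r) const { resource_free(r); }
};
using ResourcePtr = std::unique_ptr<Resource, ResourceDeleter>;

// Every return path (and any exception) releases both handles,
// so no cleanup code has to be repeated before each return.
bool use_resources(bool fail_early) {
    ResourcePtr code(resource_create(1));
    ResourcePtr match(resource_create(2));
    assert(live_resources == 2);

    if (fail_early) {
        return false;                     // both handles freed here
    }
    return code->id + match->id == 3;     // and here
}
```

This plays the same role as UNWIND-PROTECT or try/finally: the cleanup is attached to the scope rather than to each exit point.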

Regards,
Elias

On 21 September 2017 at 19:39, Juergen Sauermann <address@hidden> wrote:
Hi Elias,

the UTF8_constructors look OK, but it can be tricky to properly interpret indices (the elements of sub in your code) of
UTF8-encoded strings (i.e. whether they mean code points or byte offsets).

My feeling is that you should avoid UTF8_strings completely and go for the UTF32 option of the library (assuming that
UTF32 means code points encoded as 32-bit integers). APL character strings are almost UTF32 strings (except for gaps between
the code points) and they avoid all the bit shifting needed for UTF8 strings.
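
To illustrate the byte-offset versus code-point ambiguity: if a matcher reports offsets into a UTF-8 buffer as byte positions, they have to be converted before they can index an APL character vector. A minimal sketch, assuming well-formed UTF-8 input; the function name is made up for illustration and is not from GNU APL or PCRE.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns how many code points start before byte_offset in a
// well-formed UTF-8 string, i.e. the code-point index that
// corresponds to that byte offset.
std::size_t byte_offset_to_codepoint_index(const std::string& utf8,
                                           std::size_t byte_offset) {
    assert(byte_offset <= utf8.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((b & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "g⌽⍱" (bytes 67 E2 8C BD E2 8D B1), byte offset 4 is code-point index 2. With UTF-32 data the two notions coincide, which is the appeal of the suggestion above.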

Best Regards,
/// Jürgen


On 09/21/2017 12:09 PM, Elias Mårtenson wrote:
I've implemented the bare minimum needed to get regexes working through a ⎕RE function. I've attached the diff.

I really need Jürgen to take a look at this, since my code that constructs the return value cannot possibly be correct. There must be a better way to handle this which does not involve conversion back and forth between std::string.

Also, I have the result in a UTF-8-encoded C string, and I try to create a UTF8_string from it like this:

    Value_P field_value(UTF8_string(field.c_str()), LOC);

However, when I test this in APL I get the following result:

      '(..)..(..)$' ⎕RE 'sdklfjfj⍉'
┏→━━━━━━━━━━┓
┃"lf" "jâ\215\211"┃
┗∊━━━━━━━━━━┛

It seems the UTF-8 conversion is not done correctly by the UTF8_string constructor. What did I do wrong?
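
The garbled output is consistent with the three UTF-8 bytes of ⍉ (U+2349: E2 8D 89) each being shown as a separate character: â is U+00E2, and \215 and \211 are 8D and 89 in octal. So the bytes appear to have been stored without being decoded into code points. The sketch below shows what such a decoding step does in general; it is a standalone illustration rather than GNU APL's own conversion code, and decode_utf8 is a name invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decodes UTF-8 into a sequence of code points. Malformed or
// truncated sequences yield U+FFFD. Overlong forms and surrogates
// are not rejected; this is a sketch, not a validating decoder.
std::vector<char32_t> decode_utf8(const std::string& in) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }   // stray byte

        if (i + extra >= in.size()) {                    // truncated input
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);                 // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Small usage check: "j⍉" is four bytes but two code points.
void decode_utf8_example() {
    const std::vector<char32_t> r = decode_utf8("j\xE2\x8D\x89");
    assert(r.size() == 2);
}
```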

Regards,
Elias

On 21 September 2017 at 11:38, Xiao-Yong Jin <address@hidden> wrote:

> On Sep 20, 2017, at 9:19 PM, Peter Teeson <address@hidden> wrote:
>
> (These days performance can hardly be a compelling argument
> with multiple many-core CPU chips.)

This kind of argument for APL is exactly why Fortran is still alive and well.




Attachment: Quad_RE.cc
Description: Text Data

Attachment: ax_path_lib_pcre.m4
Description: application/m4

