poke-devel

## Questions about how to implement pickles/utf8.pk

 From: Mohammad-Reza Nabipoor Subject: Questions about how to implement pickles/utf8.pk Date: Wed, 16 Sep 2020 23:40:55 +0430

Hi,

I tried to write a pickle to poke UTF-8 (`utf8.pk`) and came up with two
different types:

```poke
/*
 * len  from     to        byte[0]   byte[1]   byte[2]   byte[3]
 * 1    U+0000   U+007F    0xxxxxxx
 * 2    U+0080   U+07FF    110xxxxx  10xxxxxx
 * 3    U+0800   U+FFFF    1110xxxx  10xxxxxx  10xxxxxx
 * 4    U+10000  U+10FFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
 *
 * ref: https://en.wikipedia.org/wiki/UTF-8
 */

deftype UTF8_CodePoint = uint<21>;

deftype UTF8_1 =
  union
  {
    byte[1] d1 : (d1[0] & 0x80) == 0;

    byte[2] d2 : (d2[0] & 0xe0) == 0xc0 && (d2[1] & 0xc0) == 0x80;

    byte[3] d3 : (d3[0] & 0xf0) == 0xe0 &&
                 (d3[1] & 0xc0) == 0x80 &&
                 (d3[2] & 0xc0) == 0x80;

    byte[4] d4 : (d4[0] & 0xf8) == 0xf0 &&
                 (d4[1] & 0xc0) == 0x80 &&
                 (d4[2] & 0xc0) == 0x80 &&
                 (d4[3] & 0xc0) == 0x80;
  };

deftype UTF8_2 =
  union
  {
    struct
    {
      byte[1] d : (d[0] & 0x80) == 0;
    };

    struct
    {
      byte[2] d : (d[0] & 0xe0) == 0xc0 && (d[1] & 0xc0) == 0x80;
    };

    struct
    {
      byte[3] d : (d[0] & 0xf0) == 0xe0 &&
                  (d[1] & 0xc0) == 0x80 &&
                  (d[2] & 0xc0) == 0x80;
    };

    struct
    {
      byte[4] d : (d[0] & 0xf8) == 0xf0 &&
                  (d[1] & 0xc0) == 0x80 &&
                  (d[2] & 0xc0) == 0x80 &&
                  (d[3] & 0xc0) == 0x80;
    };
  };
```
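
As a cross-check of the constraints above, here is a minimal Python sketch (not Poke, and not part of `utf8.pk`) that applies the same bit masks. The function names `utf8_sequence_length` and `matches_union_alternative` are hypothetical and exist only for this illustration.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length (1-4) implied by a lead byte, or 0 if none matches."""
    if (lead & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx
        return 4
    return 0                    # continuation byte (10xxxxxx) or 11111xxx


def matches_union_alternative(data: bytes) -> bool:
    """True if data is shaped like one alternative: a lead byte plus 10xxxxxx bytes."""
    if not data:
        return False
    n = utf8_sequence_length(data[0])
    if n == 0 or len(data) != n:
        return False
    # Every byte after the lead must be a continuation byte.
    return all((b & 0xC0) == 0x80 for b in data[1:])
```

Like the union constraints, these masks only check the shape of the sequence. They do not reject overlong encodings (for example `0xC0 0x80`) or surrogate code points, so a stricter pickle would need extra constraints.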

## Question 1

How can I figure out the active field in a union?

For these types (`UTF8_*`) I can determine it from the `size` attribute of the
instance, but I think there should be a general mechanism.

An example:

```poke
defun utf8_decode = (UTF8_1 x) UTF8_CodePoint:
  {
    if (x'size == 1#B)
      return x.d1[0];

    if (x'size == 2#B)
      return (x.d2[0] & 0x1f) <<. 6 | (x.d2[1] & 0x3f);

    if (x'size == 3#B)
      return (x.d3[0] & 0x0f) <<. 12 | (x.d3[1] & 0x3f) <<. 6 |
             (x.d3[2] & 0x3f);

    if (x'size == 4#B)
      return (x.d4[0] & 0x07) <<. 18 | (x.d4[1] & 0x3f) <<. 12 |
             (x.d4[2] & 0x3f) <<. 6  | (x.d4[3] & 0x3f);
  }

```
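
To sanity-check the bit arithmetic independently of Poke, the same decoding can be sketched in Python and compared against Python's built-in codec. This is only an illustration under the same assumptions as the function above (the input is already a well-shaped 1 to 4 byte sequence); it does not answer the question about unions.

```python
def utf8_decode(data: bytes) -> int:
    """Decode one UTF-8 sequence (1-4 bytes) into a code point, selecting by length."""
    n = len(data)
    if n == 1:
        return data[0]
    if n == 2:
        return (data[0] & 0x1F) << 6 | (data[1] & 0x3F)
    if n == 3:
        return (data[0] & 0x0F) << 12 | (data[1] & 0x3F) << 6 | (data[2] & 0x3F)
    if n == 4:
        return ((data[0] & 0x07) << 18 | (data[1] & 0x3F) << 12
                | (data[2] & 0x3F) << 6 | (data[3] & 0x3F))
    raise ValueError("expected a sequence of 1 to 4 bytes")


# Compare against the built-in codec for one character of each length.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_decode(ch.encode("utf-8")) == ord(ch)
```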

## Question 2

If I want to define a `decode` method for `UTF8_1` instead of the `utf8_decode`
function, how can I access the `size` attribute?

## Question 3

I prefer `UTF8_2` over `UTF8_1`, because I always have to deal with only one
field. From the user's point of view, it is a variable-length array (1-4 bytes).

How can I access the `d` field?
Or if you think that my question is insane, could you please explain why?

Currently this doesn't work:

```poke
// Inside `UTF8_2` type
method decode = UTF8_CodePoint:
  {
    if (d'size == 1#B)
      return d[0];

    if (d'size == 2#B)
      return (d[0] & 0x1f) <<. 6 | (d[1] & 0x3f);

    if (d'size == 3#B)
      return (d[0] & 0x0f) <<. 12 | (d[1] & 0x3f) <<. 6 | (d[2] & 0x3f);

    if (d'size == 4#B)
      return (d[0] & 0x07) <<. 18 | (d[1] & 0x3f) <<. 12 |
             (d[2] & 0x3f) <<. 6  | (d[3] & 0x3f);
  }
```

## Question 4

I cannot write a `utf8_encode` function for `UTF8_1`, because union construction
does not work (similar to the problem with pinned structs, [Bug 26527][2]).

How do you write an encode function?

This does not work:

```poke
defun utf8_encode = (UTF8_CodePoint cp) UTF8_1:
  {
    /* The upper bounds are inclusive: U+007F, U+07FF, U+FFFF.  */
    if (cp <= 0x7f)
      return UTF8_1 {d1 = [cp as byte]};

    if (cp <= 0x7ff)
      return UTF8_1 {
        d2 = [
          (0xc0 | (cp .>> 6 & 0x1f)) as byte,
          (0x80 | (cp       & 0x3f)) as byte,
        ]
      };

    if (cp <= 0xffff)
      return UTF8_1 {
        d3 = [
          (0xe0 | (cp .>> 12 & 0x0f)) as byte,
          (0x80 | (cp .>> 6  & 0x3f)) as byte,
          (0x80 | (cp        & 0x3f)) as byte,
        ]
      };

    return UTF8_1 {
      d4 = [
        (0xf0 | (cp .>> 18 & 0x07)) as byte,
        (0x80 | (cp .>> 12 & 0x3f)) as byte,
        (0x80 | (cp .>> 6  & 0x3f)) as byte,
        (0x80 | (cp        & 0x3f)) as byte,
      ]
    };
  }
```
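
Independently of the union-construction problem, the range selection and bit packing can be checked with a small Python sketch (again not Poke, just a reference to compare byte values against). According to the table at the top, the upper bounds U+007F, U+07FF and U+FFFF are inclusive, so the sketch compares with `<=`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) as a UTF-8 byte sequence."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Check both sides of every length boundary against the built-in codec.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

The sketch does not treat surrogates (U+D800 to U+DFFF) specially; a strict encoder would reject them.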

## Question 5

Is there any other approach to poke UTF8?

BTW you can download `utf8.pk` at [1] (it's plain-text, you can `wget` it).
(I don't attach it because I've changed some names)

Regards,