[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistrin

bug-libunistring
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistrin

From:	Ben Pfaff
Subject:	[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Date:	Sat, 13 Nov 2010 17:09:16 -0800
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
Bruno Haible <address@hidden> writes:

> Hello Ben,
>
>> > Are there other cases where the forward iteration behaviour does
>> > not allow an equivalent O(1) backward iteration?
>> 
>> No, that's the only one.  I have a corrected version of
>> u8-mbtouc-aux.c here, along with a draft of a reverse-iterating
>> version.  A test program that exhaustively tests all of the
>> possibilities in forward and reverse order reports that it works
>> OK.
>
> Cool! Very nice.
>
>> Here's the diff for the u8-mbtoc-aux.c that I've got here so far
>
> You find the fixed u8_mbtouc* code already in gnulib:
> http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=commitdiff;h=8cada094a301d3f78c086ef0291e8ca88cbe7a1d
> Just run an 'autogen.sh' in libunistring, and you can rebuild
> libunistring with it. Or work directly in gnulib with a testdir.
>
> Now, I'd like to see your changes to u8-prev.c. And then, of course,
> the new backward iteration primitives.

I've been working from gnulib.  I'm appending my initial change.
It adds a function under the name u8_mb_prev_uc() as you had
suggested.  I'd follow up with modules for u16, u32, etc., of
course, and probably a u8_mb_prev_ucr() analogous to u8_mbtoucr().

But I hadn't been working on u8_prev() because I assumed that its
behavior on ill-formed sequences was intentional.  It is, for
example, analogous to the behavior of u8_next(), which also just
gives up for an ill-formed sequence.  (I'm not sure, therefore,
that u8_prev() should be called out specifically in the manual
when u8_next() is not.)  Do you want to change the behavior of
u8_next()?  Presumably u8_strmbtouc() that u8_next() builds upon
would also want to change in that case.

Thanks,

Ben

--8<--------------------------cut here-------------------------->8--

From: Ben Pfaff <address@hidden>
Date: Sat, 13 Nov 2010 17:03:12 -0800
Subject: [PATCH] New module 'u8-mb-prev-uc'.

* lib/unistr.in.h (u8_mb_prev_uc): New declaration.
(u8_mb_prev_uc_aux): New declaration.
* lib/unistr/u8-mb-prev-uc.c: New file.
* lib/unistr/u8-mb-prev-uc-aux.c: New file.
* tests/test-u8-mb-prev-uc.c: New file.
* modules/u8-mb-prev-uc: New file.
* modules/u8-mb-prev-uc-tests: New file.
---
 ChangeLog                          |   11 ++
 lib/unistr.in.h                    |   23 +++
 lib/unistr/u8-mb-prev-uc-aux.c     |  128 ++++++++++++++++
 lib/unistr/u8-mb-prev-uc.c         |  139 +++++++++++++++++
 modules/unistr/u8-mb-prev-uc       |   28 ++++
 modules/unistr/u8-mb-prev-uc-tests |   14 ++
 tests/unistr/test-u8-mb-prev-uc.c  |  288 ++++++++++++++++++++++++++++++++++++
 7 files changed, 631 insertions(+), 0 deletions(-)
 create mode 100644 lib/unistr/u8-mb-prev-uc-aux.c
 create mode 100644 lib/unistr/u8-mb-prev-uc.c
 create mode 100644 modules/unistr/u8-mb-prev-uc
 create mode 100644 modules/unistr/u8-mb-prev-uc-tests
 create mode 100644 tests/unistr/test-u8-mb-prev-uc.c

diff --git a/ChangeLog b/ChangeLog
index e2bd35c..5ed6cff 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,14 @@
+2010-11-13  Ben Pfaff  <address@hidden>
+
+       New module 'u8-mb-prev-uc'.
+       * lib/unistr.in.h (u8_mb_prev_uc): New declaration.
+       (u8_mb_prev_uc_aux): New declaration.
+       * lib/unistr/u8-mb-prev-uc.c: New file.
+       * lib/unistr/u8-mb-prev-uc-aux.c: New file.
+       * tests/test-u8-mb-prev-uc.c: New file.
+       * modules/u8-mb-prev-uc: New file.
+       * modules/u8-mb-prev-uc-tests: New file.
+
 2010-11-13  Bruno Haible  <address@hidden>
 
        rename test: Add comments.
diff --git a/lib/unistr.in.h b/lib/unistr.in.h
index e574f94..0a7f860 100644
--- a/lib/unistr.in.h
+++ b/lib/unistr.in.h
@@ -294,6 +294,29 @@ extern int
        u32_mbtoucr (ucs4_t *puc, const uint32_t *s, size_t n);
 #endif
 
+#if GNULIB_UNISTR_U8_MB_PREV_UC || HAVE_LIBUNISTRING
+# if !HAVE_INLINE
+extern int
+       u8_mb_prev_uc (ucs4_t *puc, const uint8_t *s, size_t n);
+# else
+extern int
+       u8_mb_prev_uc_aux (ucs4_t *puc, const uint8_t *s, size_t n);
+static inline int
+u8_mb_prev_uc (ucs4_t *puc, const uint8_t *s, size_t n)
+{
+  uint8_t c = s[n - 1];
+
+  if (c < 0x80)
+    {
+      *puc = c;
+      return 1;
+    }
+  else
+    return u8_mb_prev_uc_aux (puc, s, n);
+}
+# endif
+#endif
+
 /* Put the multibyte character represented by UC in S, returning its
    length.  Return -1 upon failure, -2 if the number of available units, N,
    is too small.  The latter case cannot occur if N >= 6/2/1, respectively.  */
diff --git a/lib/unistr/u8-mb-prev-uc-aux.c b/lib/unistr/u8-mb-prev-uc-aux.c
new file mode 100644
index 0000000..ca49c9a
--- /dev/null
+++ b/lib/unistr/u8-mb-prev-uc-aux.c
@@ -0,0 +1,128 @@
+/* Conversion UTF-8 to UCS-4.
+   Copyright (C) 2001-2002, 2006-2007, 2009-2010 Free Software Foundation, Inc.
+   Written by Ben Pfaff <address@hidden>, 2010,
+   based on code by Bruno Haible <address@hidden>, 2001.
+
+   This program is free software: you can redistribute it and/or modify it
+   under the terms of the GNU Lesser General Public License as published
+   by the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+/* Specification.  */
+#include "unistr.h"
+
+#if defined IN_LIBUNISTRING || HAVE_INLINE
+
+int
+u8_mb_prev_uc_aux (ucs4_t *puc, const uint8_t *s, size_t n)
+{
+  uint8_t c_1 = s[n - 1];
+
+#if CONFIG_UNICODE_SAFETY
+  if (c_1 <= 0xbf)
+#endif
+    {
+      if (n >= 2)
+        {
+          uint8_t c_2 = s[n - 2];
+
+          if ((c_2 ^ 0x80) >= 0x40)
+            {
+#if CONFIG_UNICODE_SAFETY
+              if (c_2 >= 0xc2 && c_2 < 0xe0)
+#endif
+                {
+                  *puc = ((unsigned int) (c_2 & 0x1f) << 6)
+                         | (unsigned int) (c_1 ^ 0x80);
+                  return 2;
+                }
+#if CONFIG_UNICODE_SAFETY
+              if (c_2 >= 0xe0 && c_2 < 0xf8)
+                {
+                  /* incomplete multibyte character */
+                  *puc = 0xfffd;
+                  return 2;
+                }
+#endif
+            }
+          else if (n >= 3)
+            {
+              uint8_t c_3 = s[n - 3];
+
+              if ((c_3 ^ 0x80) >= 0x40)
+                {
+#if CONFIG_UNICODE_SAFETY
+                  if ((c_3 == 0xe0 && c_2 >= 0xa0)
+                      || (c_3 >= 0xe1 && c_3 < 0xed)
+                      || (c_3 == 0xed && c_2 < 0xa0)
+                      || (c_3 >= 0xee && c_3 < 0xf0))
+#endif
+                    {
+                      *puc = ((unsigned int) (c_3 & 0x0f) << 12)
+                             | (unsigned int) ((c_2 ^ 0x80) << 6)
+                             | (unsigned int) (c_1 ^ 0x80);
+                      return 3;
+                    }
+#if CONFIG_UNICODE_SAFETY
+                  if (c_3 >= 0xe0 && c_3 < 0xf8)
+                    {
+                      /* 0xe0: overlong sequence.
+                         0xe1...0xec: not reached.
+                         0xed: UTF-16 surrogate.
+                         0xee...0xef: not reached.
+                         0xf0...0xf7: incomplete multibyte character. */
+                      *puc = 0xfffd;
+                      return 3;
+                    }
+#endif
+                }
+              else if (n >= 4)
+                {
+                  uint8_t c_4 = s[n - 4];
+
+                  if ((c_4 ^ 0x80) >= 0x40)
+                    {
+#if CONFIG_UNICODE_SAFETY
+                      if ((c_4 == 0xf0 && c_3 >= 0x90)
+                          || (c_4 >= 0xf1 && c_4 < 0xf4)
+                          || (c_4 == 0xf4 && c_3 < 0x90))
+#endif
+                        {
+                          *puc = (unsigned int) ((c_4 & 0x07) << 18)
+                                 | (unsigned int) ((c_3 ^ 0x80) << 12)
+                                 | (unsigned int) ((c_2 ^ 0x80) << 6)
+                                 | (unsigned int) (c_1 ^ 0x80);
+                          return 4;
+                        }
+#if CONFIG_UNICODE_SAFETY
+                      if (c_4 >= 0xf0 && c_4 < 0xf8)
+                        {
+                          /* 0xf0: overlong sequence.
+                             0xf1...0xf3: not reached.
+                             0xf4...0xf7: invalid code point above U+10FFFF */
+                          *puc = 0xfffd;
+                          return 4;
+                        }
+#endif
+                    }
+                }
+            }
+        }
+    }
+
+  /* invalid or incomplete multibyte character */
+  *puc = 0xfffd;
+  return 1;
+}
+
+#endif
diff --git a/lib/unistr/u8-mb-prev-uc.c b/lib/unistr/u8-mb-prev-uc.c
new file mode 100644
index 0000000..12a99ef
--- /dev/null
+++ b/lib/unistr/u8-mb-prev-uc.c
@@ -0,0 +1,139 @@
+/* Conversion UTF-8 to UCS-4.
+   Copyright (C) 2001-2002, 2006-2007, 2009-2010 Free Software Foundation, Inc.
+   Written by Ben Pfaff <address@hidden>, 2010,
+   based on code by Bruno Haible <address@hidden>, 2001.
+
+   This program is free software: you can redistribute it and/or modify it
+   under the terms of the GNU Lesser General Public License as published
+   by the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+
+#if defined IN_LIBUNISTRING
+/* Tell unistr.h to declare u8_mb_prev_uc as 'extern', not 'static inline'.  */
+# include "unistring-notinline.h"
+#endif
+
+/* Specification.  */
+#include "unistr.h"
+
+#if !HAVE_INLINE
+
+int
+u8_mb_prev_uc (ucs4_t *puc, const uint8_t *s, size_t n)
+{
+  uint8_t c_1 = s[n - 1];
+
+  if (c_1 < 0x80)
+    {
+      *puc = c_1;
+      return 1;
+    }
+
+#if CONFIG_UNICODE_SAFETY
+  if (c_1 <= 0xbf)
+#endif
+    {
+      if (n >= 2)
+        {
+          uint8_t c_2 = s[n - 2];
+
+          if ((c_2 ^ 0x80) >= 0x40)
+            {
+#if CONFIG_UNICODE_SAFETY
+              if (c_2 >= 0xc2 && c_2 < 0xe0)
+#endif
+                {
+                  *puc = ((unsigned int) (c_2 & 0x1f) << 6)
+                         | (unsigned int) (c_1 ^ 0x80);
+                  return 2;
+                }
+#if CONFIG_UNICODE_SAFETY
+              if (c_2 >= 0xe0 && c_2 < 0xf8)
+                {
+                  /* incomplete multibyte character */
+                  *puc = 0xfffd;
+                  return 2;
+                }
+#endif
+            }
+          else if (n >= 3)
+            {
+              uint8_t c_3 = s[n - 3];
+
+              if ((c_3 ^ 0x80) >= 0x40)
+                {
+#if CONFIG_UNICODE_SAFETY
+                  if ((c_3 == 0xe0 && c_2 >= 0xa0)
+                      || (c_3 >= 0xe1 && c_3 < 0xed)
+                      || (c_3 == 0xed && c_2 < 0xa0)
+                      || (c_3 >= 0xee && c_3 < 0xf0))
+#endif
+                    {
+                      *puc = ((unsigned int) (c_3 & 0x0f) << 12)
+                             | (unsigned int) ((c_2 ^ 0x80) << 6)
+                             | (unsigned int) (c_1 ^ 0x80);
+                      return 3;
+                    }
+#if CONFIG_UNICODE_SAFETY
+                  if (c_3 >= 0xe0 && c_3 < 0xf8)
+                    {
+                      /* 0xe0: overlong sequence.
+                         0xe1...0xec: not reached.
+                         0xed: UTF-16 surrogate.
+                         0xee...0xef: not reached.
+                         0xf0...0xf7: incomplete multibyte character. */
+                      *puc = 0xfffd;
+                      return 3;
+                    }
+#endif
+                }
+              else if (n >= 4)
+                {
+                  uint8_t c_4 = s[n - 4];
+
+                  if ((c_4 ^ 0x80) >= 0x40)
+                    {
+#if CONFIG_UNICODE_SAFETY
+                      if ((c_4 == 0xf0 && c_3 >= 0x90)
+                          || (c_4 >= 0xf1 && c_4 < 0xf4)
+                          || (c_4 == 0xf4 && c_3 < 0x90))
+#endif
+                        {
+                          *puc = (unsigned int) ((c_4 & 0x07) << 18)
+                                 | (unsigned int) ((c_3 ^ 0x80) << 12)
+                                 | (unsigned int) ((c_2 ^ 0x80) << 6)
+                                 | (unsigned int) (c_1 ^ 0x80);
+                          return 4;
+                        }
+#if CONFIG_UNICODE_SAFETY
+                      if (c_4 >= 0xf0 && c_4 < 0xf8)
+                        {
+                          /* 0xf0: overlong sequence.
+                             0xf1...0xf3: not reached.
+                             0xf4...0xf7: invalid code point above U+10FFFF */
+                          *puc = 0xfffd;
+                          return 4;
+                        }
+#endif
+                    }
+                }
+            }
+        }
+    }
+
+  /* invalid or incomplete multibyte character */
+  *puc = 0xfffd;
+  return 1;
+}
+
+#endif
diff --git a/modules/unistr/u8-mb-prev-uc b/modules/unistr/u8-mb-prev-uc
new file mode 100644
index 0000000..2a12805
--- /dev/null
+++ b/modules/unistr/u8-mb-prev-uc
@@ -0,0 +1,28 @@
+Description:
+Look at last character in UTF-8 string.
+
+Files:
+lib/unistr/u8-mb-prev-uc.c
+lib/unistr/u8-mb-prev-uc-aux.c
+
+Depends-on:
+unistr/base
+
+configure.ac:
+gl_MODULE_INDICATOR([unistr/u8-mb-prev-uc])
+gl_LIBUNISTRING_MODULE([0.9.4], [unistr/u8-mb-prev-uc])
+
+Makefile.am:
+if LIBUNISTRING_COMPILE_UNISTR_U8_MB_PREV_UC
+lib_SOURCES += unistr/u8-mb-prev-uc.c unistr/u8-mb-prev-uc-aux.c
+endif
+
+Include:
+"unistr.h"
+
+License:
+LGPL
+
+Maintainer:
+Bruno Haible, Ben Pfaff
+
diff --git a/modules/unistr/u8-mb-prev-uc-tests 
b/modules/unistr/u8-mb-prev-uc-tests
new file mode 100644
index 0000000..66a593a
--- /dev/null
+++ b/modules/unistr/u8-mb-prev-uc-tests
@@ -0,0 +1,14 @@
+Files:
+tests/unistr/test-u8-mb-prev-uc.c
+tests/macros.h
+
+Depends-on:
+unistr/u8-mbtouc
+
+configure.ac:
+
+Makefile.am:
+TESTS += test-u8-mb-prev-uc
+check_PROGRAMS += test-u8-mb-prev-uc
+test_u8_mb_prev_uc_SOURCES = unistr/test-u8-mb-prev-uc.c
+test_u8_mb_prev_uc_LDADD = $(LDADD) $(LIBUNISTRING)
diff --git a/tests/unistr/test-u8-mb-prev-uc.c 
b/tests/unistr/test-u8-mb-prev-uc.c
new file mode 100644
index 0000000..4cab451
--- /dev/null
+++ b/tests/unistr/test-u8-mb-prev-uc.c
@@ -0,0 +1,288 @@
+/* Test of u8_mb_prev_uc() function.
+   Copyright (C) 2010 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+/* Written by Ben Pfaff, 2010.  */
+
+#include <config.h>
+
+#include "unistr.h"
+
+#include <assert.h>
+
+#include "macros.h"
+
+struct uc
+  {
+    /* UTF-8 representation. */
+    const uint8_t *s;
+    int n;
+
+    /* Code point. */
+    ucs4_t uc;
+  };
+
+/* Print the N code points and their representations in UC on stderr, preceded
+   by TITLE. */
+static void
+print_ucs (const char *title, const struct uc *uc, size_t n)
+{
+  fprintf (stderr, "%s:", title);
+  for (; n-- > 0; uc++)
+    {
+      size_t i;
+
+      fprintf (stderr, " <");
+      for (i = 0; i < uc->n; i++)
+        {
+          if (i > 0)
+            putc (' ', stderr);
+          fprintf (stderr, "%02x", (unsigned int) uc->s[i]);
+        }
+      fprintf (stderr, "> U+%04X", (unsigned int) uc->uc);
+    }
+  putc ('\n', stderr);
+}
+
+/* Reverses the order of the N elements of UC. */
+static void
+reverse_ucs (struct uc *uc, size_t n)
+{
+  size_t i;
+
+  for (i = 0; i < n / 2; i++)
+    {
+      size_t j = n - (i + 1);
+      struct uc tmp = uc[i];
+      uc[i] = uc[j];
+      uc[j] = tmp;
+    }
+}
+
+static bool
+equal_ucs (const struct uc *a, size_t n_a, const struct uc *b, size_t n_b)
+{
+  if (n_a != n_b)
+    return false;
+  for (; n_a-- > 0; a++, b++)
+    if (a->n != b->n || a->s != b->s || a->uc != b->uc)
+      return false;
+  return true;
+}
+
+#define MAX_LENGTH 16
+
+/* Checks that the N units in S yield the same code points whether iterated
+   in the forward or reverse direction. */
+static void
+check_bidirectionally (const uint8_t *s, int n)
+{
+  struct uc ucf[16];
+  struct uc ucr[16];
+  int n_ucf, n_ucr;
+  int used;
+
+  assert (n <= SIZEOF (ucf));
+  assert (n <= SIZEOF (ucr));
+
+  /* Translate units to code points forward. */
+  used = 0;
+  n_ucf = 0;
+  while (used < n)
+    {
+      struct uc *uc = &ucf[n_ucf++];
+      uc->s = &s[used];
+      uc->n = u8_mbtouc (&uc->uc, uc->s, n - used);
+      ASSERT (uc->n >= 1);
+      ASSERT (uc->n <= n - used);
+      used += uc->n;
+    }
+
+  /* Translate units to code points backward. */
+  used = 0;
+  n_ucr = 0;
+  while (used < n)
+    {
+      struct uc *uc = &ucr[n_ucr++];
+      uc->n = u8_mb_prev_uc (&uc->uc, s, n - used);
+      ASSERT (uc->n >= 1);
+      ASSERT (uc->n <= n - used);
+      used += uc->n;
+      uc->s = &s[n - used];
+    }
+  reverse_ucs (ucr, n_ucr);
+
+  /* Check that the results were the same. */
+  if (!equal_ucs (ucf, n_ucf, ucr, n_ucr))
+    {
+      fprintf (stderr, "%s:%d: forward and reverse differ\n",
+               __FILE__, __LINE__);
+      print_ucs ("forward", ucf, n_ucf);
+      print_ucs ("reverse", ucr, n_ucr);
+      fflush (stderr);
+      abort ();
+    }
+}
+
+#if CONFIG_UNICODE_SAFETY
+/* This test exhaustively compares how u8_mbtouc() and u8_mb_prev_uc() treat
+   all UTF-8 well-formed and ill-formed sequences that are MAX_LENGTH units or
+   shorter.  To do so in a reasonable amount of time, it uses a trick: many
+   UTF-8 unit values are in classes whose members are all treated the same way.
+   Thus, it is only necessary to test one member of each class. */
+static void
+exhaustive_test (int max_length)
+{
+  /* The units to test. */
+  static const uint8_t units[] = {
+    /* The smallest value in each class. (Any other member or members would
+       work as well). */
+    0x00, 0x80, 0x90, 0xa0, 0xc0, 0xc2, 0xe0, 0xe1, 0xed, 0xee, 0xf0, 0xf1,
+    0xf4, 0xf5,
+
+    /* The UTF-8 units that make up U+FFFD, since that is such a special value
+       for these routines. */
+    0xef, 0xbf, 0xbd
+  };
+  uint8_t s[16];
+  int length;
+  int n;
+
+  assert (max_length <= SIZEOF (s));
+
+  n = 1;
+  for (length = 0; length <= max_length; length++)
+    {
+      int i;
+
+      n *= SIZEOF (units);
+      for (i = 0; i < n; i++)
+        {
+          int r, j;
+
+          r = i;
+          for (j = 0; j < length; j++)
+            {
+              s[j] = units[r % SIZEOF (units)];
+              r /= SIZEOF (units);
+            }
+          check_bidirectionally (s, length);
+        }
+    }
+}
+#endif  /* CONFIG_UNICODE_SAFETY */
+
+static void
+do_well_formed_test (const uint8_t *start, uint8_t *s, int n)
+{
+  if (n == 0)
+    {
+      check_bidirectionally (start, s - start);
+      return;
+    }
+
+  /* Test single-byte characters. */
+  s[0] = 0;
+  do_well_formed_test (start, s + 1, n - 1);
+
+  s[0] = 0x41;
+  do_well_formed_test (start, s + 1, n - 1);
+
+  /* Test 2-byte characters. */
+  if (n >= 2)
+    {
+      s[0] = 0xc2;
+      s[1] = 0xb0;
+      do_well_formed_test (start, s + 2, n - 2);
+    }
+
+  /* Test 3-byte characters. */
+  if (n >= 3)
+    {
+      s[0] = 0xe0;
+      s[1] = 0xa0;
+      s[2] = 0xa5;
+      do_well_formed_test (start, s + 3, n - 3);
+
+      s[0] = 0xe5;
+      s[1] = 0xbf;
+      s[2] = 0x81;
+      do_well_formed_test (start, s + 3, n - 3);
+
+      s[0] = 0xed;
+      s[1] = 0x9f;
+      s[2] = 0x99;
+      do_well_formed_test (start, s + 3, n - 3);
+    }
+
+  /* Test 4-byte characters. */
+  if (n >= 4)
+    {
+      s[0] = 0xf0;
+      s[1] = 0x90;
+      s[2] = 0xbb;
+      s[3] = 0x80;
+      do_well_formed_test (start, s + 4, n - 4);
+
+      s[0] = 0xf2;
+      s[1] = 0x80;
+      s[2] = 0xbf;
+      s[3] = 0x80;
+      do_well_formed_test (start, s + 4, n - 4);
+
+      s[0] = 0xf4;
+      s[1] = 0x8f;
+      s[2] = 0x80;
+      s[3] = 0xbf;
+      do_well_formed_test (start, s + 4, n - 4);
+    }
+}
+
+/* Checks iteration through all possible sets of UTF-8 sequence lengths with
+   no more than MAX_LENGTH units. */
+static void
+well_formed_test (int max_length)
+{
+  uint8_t s[16];
+  int length;
+
+  assert (max_length <= SIZEOF (s));
+  for (length = 0; length <= max_length; length++)
+    {
+
+      do_well_formed_test (s, s, length);
+    }
+}
+
+int
+main (void)
+{
+#if CONFIG_UNICODE_SAFETY
+  /* This only passes if Unicode safety was compiled in, because most of the
+     sequences that it tests are ill-formed UTF-8.
+
+     Runtime increases exponentially with the argument: 4 runs in a fraction
+     of a second, 5 in a few seconds, 6 in half a minute. */
+  exhaustive_test (5);
+#endif
+
+  /* This only tests well-formed characters so it should always pass.
+
+     Runtime increases exponentially but much more slowly than with
+     exhaustive_test(). */
+  well_formed_test (10);
+
+  return 0;
+}
-- 
Ben Pfaff 
http://benpfaff.org
[Prev in Thread]
Current Thread
[Next in Thread]
[bug-libunistring] UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13
- Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring, Paolo Bonzini, 2010/11/13
  - Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13
    - Re: [bug-libunistring] UTF-8 backward iteration proposal for libunistring, Bruno Haible, 2010/11/13
- [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Bruno Haible, 2010/11/13
  - [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Ben Pfaff, 2010/11/13
    - [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Bruno Haible, 2010/11/13
    - [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring, Ben Pfaff <=
Prev by Date: [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Next by Date: [bug-libunistring] Collation on unassigned code points
Previous by thread: [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Next by thread: [bug-libunistring] Collation on unassigned code points
Index(es):
- Date
- Thread