bug-grep

Re: character ranges in regular expressions


From: Bruno Haible
Subject: Re: character ranges in regular expressions
Date: Thu, 23 Sep 2010 23:55:25 +0200
User-agent: KMail/1.9.9

Paolo,

> Bruno, ... Can you shed light on what __collseq_table_lookup is supposed 
> to mean?

It is a runtime lookup function into a table that maps Unicode characters to
uint32_t values. For a 'char' value, the most efficient way to implement
a mapping from 'char' to uint32_t is through an array: uint32_t[UCHAR_MAX+1].
For a 'wchar_t' value whose width is up to 21 bits, the data structure we
use in glibc (and also in gnulib / libunistring) is a 3-level lookup table.
See the file locale/programs/3level.h for details.
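
To make the two data structures concrete, here is a minimal sketch of both
lookups. It is not the code from 3level.h: the 7 + 7 + 7 bit split, the
pointer-based levels, and all names (char_lookup, table3_get, table3_set,
T3_DEFAULT) are assumptions chosen for illustration. The real layout may
differ, for instance because a table stored in a locale file cannot contain
raw pointers.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Case 1: 'char' keys. A flat array indexed by the unsigned char value. */
uint32_t char_lookup(const uint32_t table[UCHAR_MAX + 1], char c)
{
  return table[(unsigned char) c];
}

/* Case 2: keys up to 21 bits wide. Hypothetical split: 7 + 7 + 7 bits. */
#define T3_BITS    7u
#define T3_FANOUT  (1u << T3_BITS)
#define T3_MASK    (T3_FANOUT - 1u)
#define T3_DEFAULT UINT32_MAX   /* returned for unmapped characters */

typedef struct { uint32_t value[T3_FANOUT]; } leaf_t;
typedef struct { leaf_t *leaf[T3_FANOUT]; } mid_t;
typedef struct { mid_t *mid[T3_FANOUT]; } table3_t;

void table3_init(table3_t *t)
{
  for (uint32_t i = 0; i < T3_FANOUT; i++)
    t->mid[i] = NULL;
}

/* Lookup: three array accesses; absent blocks yield the default value. */
uint32_t table3_get(const table3_t *t, uint32_t wc)
{
  uint32_t i1 = wc >> (2u * T3_BITS);
  if (i1 >= T3_FANOUT)
    return T3_DEFAULT;          /* key wider than 21 bits */
  const mid_t *m = t->mid[i1];
  if (m == NULL)
    return T3_DEFAULT;
  const leaf_t *l = m->leaf[(wc >> T3_BITS) & T3_MASK];
  if (l == NULL)
    return T3_DEFAULT;
  return l->value[wc & T3_MASK];
}

/* Insert: allocates only the blocks that are actually needed.
   Returns 0 on success, -1 on allocation failure.
   Freeing the table is omitted for brevity. */
int table3_set(table3_t *t, uint32_t wc, uint32_t v)
{
  uint32_t i1 = wc >> (2u * T3_BITS);
  assert(i1 < T3_FANOUT);
  if (t->mid[i1] == NULL) {
    mid_t *m = malloc(sizeof *m);
    if (m == NULL)
      return -1;
    for (uint32_t i = 0; i < T3_FANOUT; i++)
      m->leaf[i] = NULL;
    t->mid[i1] = m;
  }
  mid_t *m = t->mid[i1];
  uint32_t i2 = (wc >> T3_BITS) & T3_MASK;
  if (m->leaf[i2] == NULL) {
    leaf_t *l = malloc(sizeof *l);
    if (l == NULL)
      return -1;
    for (uint32_t i = 0; i < T3_FANOUT; i++)
      l->value[i] = T3_DEFAULT;
    m->leaf[i2] = l;
  }
  m->leaf[i2]->value[wc & T3_MASK] = v;
  return 0;
}
```

The point of the three levels is that large unmapped regions of the 21-bit
key space cost only one null entry each, while a lookup still takes a fixed
number of array accesses.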

The _NL_COLLATE_COLLSEQWC field of the LC_COLLATE part of the locale, which
regcomp.c and regexec.c use, is encoded in this way. In
glibc/locale/programs/ld-collate.c this field is constructed from a table
called 'collate->wcseqorder'. This table is used for regular expression
matching and wildcard matching. It is derived from the LC_COLLATE portion of
the locale input file, but does not represent all of the information in it.
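
As an illustration of how such a table can matter for character ranges, here
is a small sketch. It assumes that a range like [a-z] matches a character when
its collation sequence value lies between the values of the two endpoints. The
names in_range and toy_collseq are hypothetical, and the toy ordering
(a A b B ... z Z) is invented for the example, not taken from any real locale.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a table lookup: maps a character to its sequence value. */
typedef uint32_t (*collseq_fn)(uint32_t wc);

/* Range test on sequence values rather than on character codes. */
bool in_range(collseq_fn collseq, uint32_t lo, uint32_t hi, uint32_t wc)
{
  uint32_t s = collseq(wc);
  return collseq(lo) <= s && s <= collseq(hi);
}

/* Toy ordering that interleaves cases: a A b B ... z Z, then the rest. */
uint32_t toy_collseq(uint32_t wc)
{
  if (wc >= 'a' && wc <= 'z')
    return 2u * (wc - 'a');
  if (wc >= 'A' && wc <= 'Z')
    return 2u * (wc - 'A') + 1u;
  return 1000u + wc;
}
```

Under this toy ordering, in_range(toy_collseq, 'a', 'z', 'B') is true while
in_range(toy_collseq, 'a', 'z', 'Z') is false, because 'Z' sorts after 'z'.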

Bruno


