Re: character ranges in regular expressions
From: Bruno Haible
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 23:52:38 +0200
User-agent: KMail/1.9.9
Paolo Bonzini wrote:
> > What is the correct result for 'grep' and for regex? (I assume it's the
> > same for both, since both are specified by POSIX.)
>
> Unfortunately POSIX only (implicitly) specifies that the two have to be
> consistent, but the exact result is unspecified.
Indeed: <http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html>
section 9.3.5.(7) defines the behaviour of range expressions only for the
"POSIX" locale.
> The sensible results are of course three: 51 omitting "a" (aAbB...zZ),
> 51 omitting "z" (AaBb...Zz), 26.
1) Is there agreement on what the result should be? Jim seems to prefer to
extrapolate the result of the "C" locale, i.e. 26. For other people, the
locale-dependent behaviour is useful, that is, 51 is desired. From around 2000,
I remember a mail from Ulrich Drepper where he essentially said "you have to
learn that in other locales range expressions work differently, use [[:alpha:]]
instead".
2) Is Ulrich aware that the subtle differences in the localedata/locales/*
files lead to bizarre behaviour of regexec() in the cs_CZ, pl_PL, etc. locales?
Test program:
================================ foo.c ========================================
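#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main ()
{
  regex_t r;
  int ret;
  int count;
  int c;
  const char *name;

  /* Take the locale from the environment (LC_ALL is set by the loop below).  */
  setlocale (LC_ALL, "");
  ret = regcomp (&r, "[A-Z]", 0);
  if (ret != 0) { fprintf (stderr, "regcomp failed\n"); return 1; }

  /* Count the printable ASCII characters that match the range.  */
  count = 0;
  for (c = 32; c < 127; c++)
    {
      char line[2] = { c, '\0' };
      ret = regexec (&r, line, 0, NULL, 0);
      count += (ret == 0);
    }
  regfree (&r);

  /* getenv may return NULL; passing NULL to %s is undefined behaviour.  */
  name = getenv ("LC_ALL");
  printf ("%-20s%-15s%d\n", name != NULL ? name : "(unset)",
          nl_langinfo (CODESET), count);
  return 0;
}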
===============================================================================
$ for l in `locale -a`; do LC_ALL=$l ./foo ; done
aa_DJ ISO-8859-1 26
aa_DJ.utf8 UTF-8 26
aa_ER UTF-8 26
address@hidden UTF-8 26
aa_ER.utf8 UTF-8 26
aa_ET UTF-8 26
aa_ET.utf8 UTF-8 26
af_ZA ISO-8859-1 26
af_ZA.utf8 UTF-8 26
am_ET UTF-8 26
am_ET.utf8 UTF-8 26
an_ES ISO-8859-15 26
an_ES.utf8 UTF-8 26
ar_AE ISO-8859-6 26
ar_AE.utf8 UTF-8 26
ar_BH ISO-8859-6 26
ar_BH.utf8 UTF-8 26
ar_DZ ISO-8859-6 26
ar_DZ.utf8 UTF-8 26
ar_EG ISO-8859-6 26
ar_EG.utf8 UTF-8 26
ar_IN UTF-8 26
ar_IN.utf8 UTF-8 26
ar_IQ ISO-8859-6 26
ar_IQ.utf8 UTF-8 26
ar_JO ISO-8859-6 26
ar_JO.utf8 UTF-8 26
ar_KW ISO-8859-6 26
ar_KW.utf8 UTF-8 26
ar_LB ISO-8859-6 26
ar_LB.utf8 UTF-8 26
ar_LY ISO-8859-6 26
ar_LY.utf8 UTF-8 26
ar_MA ISO-8859-6 26
ar_MA.utf8 UTF-8 26
ar_OM ISO-8859-6 26
ar_OM.utf8 UTF-8 26
ar_QA ISO-8859-6 26
ar_QA.utf8 UTF-8 26
ar_SA ISO-8859-6 51
ar_SA.utf8 UTF-8 51
ar_SD ISO-8859-6 26
ar_SD.utf8 UTF-8 26
ar_SY ISO-8859-6 26
ar_SY.utf8 UTF-8 26
ar_TN ISO-8859-6 26
ar_TN.utf8 UTF-8 26
ar_YE ISO-8859-6 26
ar_YE.utf8 UTF-8 26
as_IN.utf8 UTF-8 51
ast_ES ISO-8859-15 26
ast_ES.utf8 UTF-8 26
az_AZ.utf8 UTF-8 26
be_BY CP1251 26
address@hidden UTF-8 26
be_BY.utf8 UTF-8 26
ber_DZ UTF-8 26
ber_MA UTF-8 26
bg_BG CP1251 26
bg_BG.utf8 UTF-8 26
bn_BD UTF-8 26
bn_BD.utf8 UTF-8 26
bn_IN UTF-8 26
bn_IN.utf8 UTF-8 26
bo_CN UTF-8 26
bo_IN UTF-8 26
br_FR ISO-8859-1 26
address@hidden ISO-8859-15 26
br_FR.utf8 UTF-8 26
bs_BA ISO-8859-2 26
bs_BA.utf8 UTF-8 26
byn_ER UTF-8 26
byn_ER.utf8 UTF-8 26
C ANSI_X3.4-1968 26
ca_AD ISO-8859-15 26
ca_AD.utf8 UTF-8 26
ca_ES ISO-8859-1 26
address@hidden ISO-8859-15 26
ca_ES.utf8 UTF-8 26
ca_FR ISO-8859-15 26
ca_FR.utf8 UTF-8 26
ca_IT ISO-8859-15 26
ca_IT.utf8 UTF-8 26
crh_UA UTF-8 26
csb_PL UTF-8 26
cs_CZ ISO-8859-2 51
cs_CZ.utf8 UTF-8 51
cy_GB ISO-8859-14 26
cy_GB.utf8 UTF-8 26
da_DK ISO-8859-1 26
da_DK.utf8 UTF-8 26
de_AT ISO-8859-1 26
address@hidden ISO-8859-15 26
de_AT.utf8 UTF-8 26
de_BE ISO-8859-1 26
address@hidden ISO-8859-15 26
de_BE.utf8 UTF-8 26
de_CH ISO-8859-1 26
de_CH.utf8 UTF-8 26
de_DE ISO-8859-1 26
address@hidden ISO-8859-15 26
de_DE.utf8 UTF-8 26
de_LU ISO-8859-1 26
address@hidden ISO-8859-15 26
de_LU.utf8 UTF-8 26
dz_BT UTF-8 26
el_CY ISO-8859-7 26
el_CY.utf8 UTF-8 26
el_GR ISO-8859-7 26
el_GR.utf8 UTF-8 26
en_AU ISO-8859-1 26
en_AU.utf8 UTF-8 26
en_BE ISO-8859-1 51
address@hidden ISO-8859-15 51
en_BE.utf8 UTF-8 51
en_BW ISO-8859-1 26
en_BW.utf8 UTF-8 26
en_CA ISO-8859-1 26
en_CA.utf8 UTF-8 26
en_DK ISO-8859-1 26
en_DK.utf8 UTF-8 26
en_GB ISO-8859-1 26
en_GB.iso885915 ISO-8859-15 26
en_GB.utf8 UTF-8 26
en_HK ISO-8859-1 26
en_HK.utf8 UTF-8 26
en_IE ISO-8859-1 26
address@hidden ISO-8859-15 26
en_IE.utf8 UTF-8 26
en_IN UTF-8 26
en_IN.utf8 UTF-8 26
en_NG UTF-8 26
en_NZ ISO-8859-1 26
en_NZ.utf8 UTF-8 26
en_PH ISO-8859-1 26
en_PH.utf8 UTF-8 26
en_SG ISO-8859-1 26
en_SG.utf8 UTF-8 26
en_US ISO-8859-1 26
en_US.iso885915 ISO-8859-15 26
en_US.utf8 UTF-8 26
en_ZA ISO-8859-1 26
en_ZA.utf8 UTF-8 26
en_ZW ISO-8859-1 26
en_ZW.utf8 UTF-8 26
es_AR ISO-8859-1 26
es_AR.utf8 UTF-8 26
es_BO ISO-8859-1 26
es_BO.utf8 UTF-8 26
es_CL ISO-8859-1 26
es_CL.utf8 UTF-8 26
es_CO ISO-8859-1 26
es_CO.utf8 UTF-8 26
es_CR ISO-8859-1 26
es_CR.utf8 UTF-8 26
es_DO ISO-8859-1 26
es_DO.utf8 UTF-8 26
es_EC ISO-8859-1 26
es_EC.utf8 UTF-8 26
es_ES ISO-8859-1 26
address@hidden ISO-8859-15 26
es_ES.utf8 UTF-8 26
es_GT ISO-8859-1 26
es_GT.utf8 UTF-8 26
es_HN ISO-8859-1 26
es_HN.utf8 UTF-8 26
es_MX ISO-8859-1 26
es_MX.utf8 UTF-8 26
es_NI ISO-8859-1 26
es_NI.utf8 UTF-8 26
es_PA ISO-8859-1 26
es_PA.utf8 UTF-8 26
es_PE ISO-8859-1 26
es_PE.utf8 UTF-8 26
es_PR ISO-8859-1 26
es_PR.utf8 UTF-8 26
es_PY ISO-8859-1 26
es_PY.utf8 UTF-8 26
es_SV ISO-8859-1 26
es_SV.utf8 UTF-8 26
es_US ISO-8859-1 26
es_US.utf8 UTF-8 26
es_UY ISO-8859-1 26
es_UY.utf8 UTF-8 26
es_VE ISO-8859-1 26
es_VE.utf8 UTF-8 26
et_EE ISO-8859-1 39
et_EE.iso885915 ISO-8859-15 39
et_EE.utf8 UTF-8 39
eu_ES ISO-8859-1 26
address@hidden ISO-8859-15 26
eu_ES.utf8 UTF-8 26
fa_IR UTF-8 26
fa_IR.utf8 UTF-8 26
fi_FI ISO-8859-1 28
address@hidden ISO-8859-15 28
fi_FI.utf8 UTF-8 28
fil_PH UTF-8 26
fo_FO ISO-8859-1 26
fo_FO.utf8 UTF-8 26
fr_BE ISO-8859-1 26
address@hidden ISO-8859-15 26
fr_BE.utf8 UTF-8 26
fr_CA ISO-8859-1 26
fr_CA.utf8 UTF-8 26
fr_CH ISO-8859-1 26
fr_CH.utf8 UTF-8 26
fr_FR ISO-8859-1 26
address@hidden ISO-8859-15 26
fr_FR.utf8 UTF-8 26
fr_LU ISO-8859-1 26
address@hidden ISO-8859-15 26
fr_LU.utf8 UTF-8 26
fur_IT UTF-8 26
fy_DE UTF-8 26
fy_NL UTF-8 26
ga_IE ISO-8859-1 26
address@hidden ISO-8859-15 26
ga_IE.utf8 UTF-8 26
gd_GB ISO-8859-15 26
gd_GB.utf8 UTF-8 26
gez_ER UTF-8 26
address@hidden UTF-8 26
gez_ET UTF-8 26
address@hidden UTF-8 26
gl_ES ISO-8859-1 26
address@hidden ISO-8859-15 26
gl_ES.utf8 UTF-8 26
gu_IN UTF-8 26
gv_GB ISO-8859-1 26
gv_GB.utf8 UTF-8 26
ha_NG UTF-8 26
he_IL ISO-8859-8 26
he_IL.utf8 UTF-8 26
hi_IN UTF-8 26
hi_IN.utf8 UTF-8 26
hr_HR ISO-8859-2 51
hr_HR.utf8 UTF-8 51
hsb_DE ISO-8859-2 51
hsb_DE.utf8 UTF-8 51
hu_HU ISO-8859-2 26
hu_HU.utf8 UTF-8 26
hy_AM UTF-8 26
hy_AM.armscii8 ANSI_X3.4-1968 26
id_ID ISO-8859-1 26
id_ID.utf8 UTF-8 26
ig_NG UTF-8 26
ik_CA UTF-8 26
is_IS ISO-8859-1 51
is_IS.utf8 UTF-8 51
it_CH ISO-8859-1 26
it_CH.utf8 UTF-8 26
it_IT ISO-8859-1 26
address@hidden ISO-8859-15 26
it_IT.utf8 UTF-8 26
iu_CA UTF-8 26
iw_IL ISO-8859-8 26
iw_IL.utf8 UTF-8 26
ja_JP.eucjp EUC-JP 26
ja_JP.shiftjisx0213 ANSI_X3.4-1968 26
ja_JP.sjis SHIFT_JIS 26
ja_JP.utf8 UTF-8 26
ka_GE GEORGIAN-PS 26
ka_GE.utf8 UTF-8 26
kk_KZ PT154 26
kk_KZ.utf8 UTF-8 26
kl_GL ISO-8859-1 26
kl_GL.utf8 UTF-8 26
km_KH UTF-8 51
kn_IN UTF-8 26
ko_KR.euckr EUC-KR 26
ko_KR.utf8 UTF-8 26
ku_TR ISO-8859-9 26
ku_TR.utf8 UTF-8 26
kw_GB ISO-8859-1 26
kw_GB.utf8 UTF-8 26
ky_KG UTF-8 26
lg_UG ISO-8859-10 26
lg_UG.utf8 UTF-8 26
li_BE UTF-8 26
li_NL UTF-8 26
lo_LA UTF-8 51
lt_LT ISO-8859-13 51
lt_LT.utf8 UTF-8 51
lv_LV ISO-8859-13 51
lv_LV.utf8 UTF-8 51
mai_IN UTF-8 26
mg_MG ISO-8859-15 26
mg_MG.utf8 UTF-8 26
mi_NZ ISO-8859-13 26
mi_NZ.utf8 UTF-8 26
mk_MK ISO-8859-5 26
mk_MK.utf8 UTF-8 26
ml_IN UTF-8 26
ml_IN.utf8 UTF-8 26
mn_MN UTF-8 26
mn_MN.utf8 UTF-8 26
mr_IN UTF-8 26
mr_IN.utf8 UTF-8 26
ms_MY ISO-8859-1 26
ms_MY.utf8 UTF-8 26
mt_MT ISO-8859-3 26
mt_MT.utf8 UTF-8 26
nb_NO ISO-8859-1 26
nb_NO.utf8 UTF-8 26
nds_DE UTF-8 26
nds_NL UTF-8 26
ne_NP UTF-8 26
ne_NP.utf8 UTF-8 26
nl_BE ISO-8859-1 26
address@hidden ISO-8859-15 26
nl_BE.utf8 UTF-8 26
nl_NL ISO-8859-1 26
address@hidden ISO-8859-15 26
nl_NL.utf8 UTF-8 26
nn_NO ISO-8859-1 26
nn_NO.utf8 UTF-8 26
no_NO ISO-8859-1 26
no_NO.utf8 UTF-8 26
nr_ZA UTF-8 26
nso_ZA UTF-8 26
oc_FR ISO-8859-1 26
oc_FR.utf8 UTF-8 26
om_ET UTF-8 26
om_ET.utf8 UTF-8 26
om_KE ISO-8859-1 26
om_KE.utf8 UTF-8 26
or_IN UTF-8 51
pa_IN UTF-8 26
pa_IN.utf8 UTF-8 26
pap_AN UTF-8 26
pa_PK UTF-8 26
pl_PL ISO-8859-2 51
pl_PL.utf8 UTF-8 51
POSIX ANSI_X3.4-1968 26
pt_BR ISO-8859-1 26
pt_BR.utf8 UTF-8 26
pt_PT ISO-8859-1 26
address@hidden ISO-8859-15 26
pt_PT.utf8 UTF-8 26
ro_RO ISO-8859-2 26
ro_RO.utf8 UTF-8 26
ru_RU ISO-8859-5 26
ru_RU.koi8r KOI8-R 26
ru_RU.utf8 UTF-8 26
ru_UA KOI8-U 26
ru_UA.utf8 UTF-8 26
rw_RW UTF-8 26
sa_IN UTF-8 26
sc_IT UTF-8 26
se_NO UTF-8 26
se_NO.utf8 UTF-8 26
shs_CA UTF-8 26
sh_YU ISO-8859-2 51
sh_YU.utf8 UTF-8 51
sid_ET UTF-8 26
sid_ET.utf8 UTF-8 26
si_LK UTF-8 26
sk_SK ISO-8859-2 51
sk_SK.utf8 UTF-8 51
sl_SI ISO-8859-2 51
sl_SI.utf8 UTF-8 51
so_DJ ISO-8859-1 26
so_DJ.utf8 UTF-8 26
so_ET UTF-8 26
so_ET.utf8 UTF-8 26
so_KE ISO-8859-1 26
so_KE.utf8 UTF-8 26
so_SO ISO-8859-1 26
so_SO.utf8 UTF-8 26
sq_AL ISO-8859-1 26
sq_AL.utf8 UTF-8 26
sr_ME UTF-8 26
sr_RS UTF-8 26
address@hidden UTF-8 26
ss_ZA UTF-8 26
st_ZA ISO-8859-1 26
st_ZA.utf8 UTF-8 26
sv_FI ISO-8859-1 28
address@hidden ISO-8859-15 28
sv_FI.utf8 UTF-8 28
sv_SE ISO-8859-1 26
sv_SE.iso885915 ISO-8859-15 26
sv_SE.utf8 UTF-8 26
ta_IN UTF-8 26
ta_IN.utf8 UTF-8 26
te_IN UTF-8 26
te_IN.utf8 UTF-8 26
tg_TJ KOI8-T 26
tg_TJ.utf8 UTF-8 26
th_TH TIS-620 51
th_TH.utf8 UTF-8 51
ti_ER UTF-8 26
ti_ER.utf8 UTF-8 26
ti_ET UTF-8 26
ti_ET.utf8 UTF-8 26
tig_ER UTF-8 26
tig_ER.utf8 UTF-8 26
tk_TM UTF-8 26
tl_PH ISO-8859-1 26
tl_PH.utf8 UTF-8 26
tn_ZA UTF-8 26
tr_CY ISO-8859-9 51
tr_CY.utf8 UTF-8 51
tr_TR ISO-8859-9 51
tr_TR.utf8 UTF-8 51
ts_ZA UTF-8 26
address@hidden UTF-8 26
tt_RU.utf8 UTF-8 26
ug_CN UTF-8 26
uk_UA KOI8-U 26
uk_UA.utf8 UTF-8 26
ur_PK UTF-8 26
ur_PK.utf8 UTF-8 26
uz_UZ ISO-8859-1 26
address@hidden UTF-8 26
ve_ZA UTF-8 26
vi_VN UTF-8 26
vi_VN.tcvn TCVN5712-1 26
wa_BE ISO-8859-1 26
address@hidden ISO-8859-15 26
wa_BE.utf8 UTF-8 26
wo_SN UTF-8 26
xh_ZA ISO-8859-1 26
xh_ZA.utf8 UTF-8 26
yi_US CP1255 26
yi_US.utf8 UTF-8 26
yo_NG UTF-8 26
zh_CN GB2312 26
zh_CN.gb18030 GB18030 26
zh_CN.gbk GBK 26
zh_CN.utf8 UTF-8 26
zh_HK BIG5-HKSCS 26
zh_HK.utf8 UTF-8 26
zh_SG GB2312 26
zh_SG.gbk GBK 26
zh_SG.utf8 UTF-8 26
zh_TW BIG5 26
zh_TW.euctw EUC-TW 26
zh_TW.utf8 UTF-8 26
zu_ZA ISO-8859-1 26
zu_ZA.utf8 UTF-8 26
At least the results don't depend on the locale encoding (bugs of this kind
must be due to 'grep'). But it's not clear whether this behaviour was
intended.
Bruno