bug-grep

Re: character ranges in regular expressions


From: Bruno Haible
Subject: Re: character ranges in regular expressions
Date: Fri, 24 Sep 2010 23:52:38 +0200
User-agent: KMail/1.9.9

Paolo Bonzini wrote:
> > What is the correct result for 'grep' and for regex? (I assume it's the
> > same for both, since both are specified by POSIX.)
> 
> Unfortunately POSIX only (implicitly) specifies that the two have to be 
> consistent, but the exact result is unspecified.

Indeed: <http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html>
section 9.3.5.(7) defines the behaviour of range expressions only for the
"POSIX" locale.

> The sensible results are of course three: 51 omitting "a" (aAbB...zZ),
> 51 omitting "z" (AaBb...Zz), 26.

1) Is there agreement on what the result should be? Jim seems to prefer to
extrapolate the result of the "C" locale, i.e. 26. For other people, the
locale-dependent behaviour is useful, that is, 51 is desired. From around 2000,
I remember a mail from Ulrich Drepper where he essentially said "you have to
learn that in other locales range expressions work differently, use [[:alpha:]]
instead".

2) Is Ulrich aware that the subtle differences in the localedata/locales/*
files lead to bizarre behaviour of regexec() in the cs_CZ, pl_PL, etc. locales?
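
For comparison with the advice quoted in 1), here is a minimal sketch of a
helper that counts matches for an arbitrary bracket expression, so that
"[A-Z]" can be compared with "[[:upper:]]" or "[[:alpha:]]" in the same
locale. The name count_ascii_matches is made up for illustration. Character
classes are defined by LC_CTYPE rather than by collation order, so they should
not be affected by the ordering differences that ranges are subject to.

```c
#include <regex.h>
#include <stddef.h>

/* Count the printable ASCII characters (32..126) that match PATTERN,
   a POSIX basic regular expression, in the current locale.
   Return -1 if the pattern does not compile.
   (Hypothetical helper, for illustration only.)  */
int
count_ascii_matches (const char *pattern)
{
  regex_t r;
  int count = 0;
  int c;

  /* REG_NOSUB: only match/no-match is needed, no subexpression offsets.  */
  if (regcomp (&r, pattern, REG_NOSUB) != 0)
    return -1;

  for (c = 32; c < 127; c++)
    {
      char line[2] = { (char) c, '\0' };
      count += (regexec (&r, line, 0, NULL, 0) == 0);
    }

  regfree (&r);
  return count;
}
```

Without a prior setlocale call the program runs in the "C" locale, where
count_ascii_matches ("[A-Z]") and count_ascii_matches ("[[:upper:]]") both
give 26. Calling setlocale (LC_ALL, "") first makes the counts reflect the
locale selected by the environment.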

Test program:
================================ foo.c ========================================
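#include <langinfo.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

int main ()
{
  regex_t r;
  int ret;
  int count;
  int c;
  const char *name;

  /* Use the locale selected by the environment (LC_ALL, LANG, ...).  */
  setlocale (LC_ALL, "");

  ret = regcomp (&r, "[A-Z]", 0);
  if (ret != 0) { fprintf (stderr, "regcomp failed\n"); return 1; }

  /* Count the printable ASCII characters that match [A-Z].  */
  count = 0;
  for (c = 32; c < 127; c++)
    {
      char line[2] = { c, '\0' };
      ret = regexec (&r, line, 0, NULL, 0);
      count += (ret == 0);
    }
  regfree (&r);

  /* getenv may return NULL; passing NULL to %s is undefined.  */
  name = getenv ("LC_ALL");
  printf ("%-20s%-15s%d\n", name != NULL ? name : "(unset)",
          nl_langinfo (CODESET), count);
  return 0;
}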
===============================================================================
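
To see which of the outcomes Paolo lists a given locale actually produces (51
omitting "a", 51 omitting "z", or something else, such as the 39 reported for
et_EE below), it can help to collect the matched characters instead of only
counting them. A minimal sketch follows. list_ascii_matches is a hypothetical
helper name and is not part of foo.c.

```c
#include <regex.h>
#include <stddef.h>

/* Store in OUT (capacity OUTSIZE bytes) the printable ASCII characters
   (32..126) that match PATTERN in the current locale, as a NUL-terminated
   string.  Return the number of characters stored, or -1 if OUTSIZE is 0
   or the pattern does not compile.
   (Hypothetical helper, for illustration only.)  */
int
list_ascii_matches (const char *pattern, char *out, size_t outsize)
{
  regex_t r;
  size_t n = 0;
  int c;

  if (outsize == 0 || regcomp (&r, pattern, REG_NOSUB) != 0)
    return -1;

  for (c = 32; c < 127; c++)
    {
      char line[2] = { (char) c, '\0' };
      /* Keep one byte free for the terminating NUL.  */
      if (regexec (&r, line, 0, NULL, 0) == 0 && n + 1 < outsize)
        out[n++] = (char) c;
    }
  out[n] = '\0';

  regfree (&r);
  return (int) n;
}
```

Calling it with "[A-Z]" and a 128-byte buffer after setlocale (LC_ALL, ""),
then printing the buffer, shows the matched set for the current locale. In the
"C" locale the buffer holds exactly the 26 letters A through Z.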

$ for l in `locale -a`; do LC_ALL=$l ./foo ; done
aa_DJ               ISO-8859-1     26
aa_DJ.utf8          UTF-8          26
aa_ER               UTF-8          26
address@hidden         UTF-8          26
aa_ER.utf8          UTF-8          26
aa_ET               UTF-8          26
aa_ET.utf8          UTF-8          26
af_ZA               ISO-8859-1     26
af_ZA.utf8          UTF-8          26
am_ET               UTF-8          26
am_ET.utf8          UTF-8          26
an_ES               ISO-8859-15    26
an_ES.utf8          UTF-8          26
ar_AE               ISO-8859-6     26
ar_AE.utf8          UTF-8          26
ar_BH               ISO-8859-6     26
ar_BH.utf8          UTF-8          26
ar_DZ               ISO-8859-6     26
ar_DZ.utf8          UTF-8          26
ar_EG               ISO-8859-6     26
ar_EG.utf8          UTF-8          26
ar_IN               UTF-8          26
ar_IN.utf8          UTF-8          26
ar_IQ               ISO-8859-6     26
ar_IQ.utf8          UTF-8          26
ar_JO               ISO-8859-6     26
ar_JO.utf8          UTF-8          26
ar_KW               ISO-8859-6     26
ar_KW.utf8          UTF-8          26
ar_LB               ISO-8859-6     26
ar_LB.utf8          UTF-8          26
ar_LY               ISO-8859-6     26
ar_LY.utf8          UTF-8          26
ar_MA               ISO-8859-6     26
ar_MA.utf8          UTF-8          26
ar_OM               ISO-8859-6     26
ar_OM.utf8          UTF-8          26
ar_QA               ISO-8859-6     26
ar_QA.utf8          UTF-8          26
ar_SA               ISO-8859-6     51
ar_SA.utf8          UTF-8          51
ar_SD               ISO-8859-6     26
ar_SD.utf8          UTF-8          26
ar_SY               ISO-8859-6     26
ar_SY.utf8          UTF-8          26
ar_TN               ISO-8859-6     26
ar_TN.utf8          UTF-8          26
ar_YE               ISO-8859-6     26
ar_YE.utf8          UTF-8          26
as_IN.utf8          UTF-8          51
ast_ES              ISO-8859-15    26
ast_ES.utf8         UTF-8          26
az_AZ.utf8          UTF-8          26
be_BY               CP1251         26
address@hidden         UTF-8          26
be_BY.utf8          UTF-8          26
ber_DZ              UTF-8          26
ber_MA              UTF-8          26
bg_BG               CP1251         26
bg_BG.utf8          UTF-8          26
bn_BD               UTF-8          26
bn_BD.utf8          UTF-8          26
bn_IN               UTF-8          26
bn_IN.utf8          UTF-8          26
bo_CN               UTF-8          26
bo_IN               UTF-8          26
br_FR               ISO-8859-1     26
address@hidden          ISO-8859-15    26
br_FR.utf8          UTF-8          26
bs_BA               ISO-8859-2     26
bs_BA.utf8          UTF-8          26
byn_ER              UTF-8          26
byn_ER.utf8         UTF-8          26
C                   ANSI_X3.4-1968 26
ca_AD               ISO-8859-15    26
ca_AD.utf8          UTF-8          26
ca_ES               ISO-8859-1     26
address@hidden          ISO-8859-15    26
ca_ES.utf8          UTF-8          26
ca_FR               ISO-8859-15    26
ca_FR.utf8          UTF-8          26
ca_IT               ISO-8859-15    26
ca_IT.utf8          UTF-8          26
crh_UA              UTF-8          26
csb_PL              UTF-8          26
cs_CZ               ISO-8859-2     51
cs_CZ.utf8          UTF-8          51
cy_GB               ISO-8859-14    26
cy_GB.utf8          UTF-8          26
da_DK               ISO-8859-1     26
da_DK.utf8          UTF-8          26
de_AT               ISO-8859-1     26
address@hidden          ISO-8859-15    26
de_AT.utf8          UTF-8          26
de_BE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
de_BE.utf8          UTF-8          26
de_CH               ISO-8859-1     26
de_CH.utf8          UTF-8          26
de_DE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
de_DE.utf8          UTF-8          26
de_LU               ISO-8859-1     26
address@hidden          ISO-8859-15    26
de_LU.utf8          UTF-8          26
dz_BT               UTF-8          26
el_CY               ISO-8859-7     26
el_CY.utf8          UTF-8          26
el_GR               ISO-8859-7     26
el_GR.utf8          UTF-8          26
en_AU               ISO-8859-1     26
en_AU.utf8          UTF-8          26
en_BE               ISO-8859-1     51
address@hidden          ISO-8859-15    51
en_BE.utf8          UTF-8          51
en_BW               ISO-8859-1     26
en_BW.utf8          UTF-8          26
en_CA               ISO-8859-1     26
en_CA.utf8          UTF-8          26
en_DK               ISO-8859-1     26
en_DK.utf8          UTF-8          26
en_GB               ISO-8859-1     26
en_GB.iso885915     ISO-8859-15    26
en_GB.utf8          UTF-8          26
en_HK               ISO-8859-1     26
en_HK.utf8          UTF-8          26
en_IE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
en_IE.utf8          UTF-8          26
en_IN               UTF-8          26
en_IN.utf8          UTF-8          26
en_NG               UTF-8          26
en_NZ               ISO-8859-1     26
en_NZ.utf8          UTF-8          26
en_PH               ISO-8859-1     26
en_PH.utf8          UTF-8          26
en_SG               ISO-8859-1     26
en_SG.utf8          UTF-8          26
en_US               ISO-8859-1     26
en_US.iso885915     ISO-8859-15    26
en_US.utf8          UTF-8          26
en_ZA               ISO-8859-1     26
en_ZA.utf8          UTF-8          26
en_ZW               ISO-8859-1     26
en_ZW.utf8          UTF-8          26
es_AR               ISO-8859-1     26
es_AR.utf8          UTF-8          26
es_BO               ISO-8859-1     26
es_BO.utf8          UTF-8          26
es_CL               ISO-8859-1     26
es_CL.utf8          UTF-8          26
es_CO               ISO-8859-1     26
es_CO.utf8          UTF-8          26
es_CR               ISO-8859-1     26
es_CR.utf8          UTF-8          26
es_DO               ISO-8859-1     26
es_DO.utf8          UTF-8          26
es_EC               ISO-8859-1     26
es_EC.utf8          UTF-8          26
es_ES               ISO-8859-1     26
address@hidden          ISO-8859-15    26
es_ES.utf8          UTF-8          26
es_GT               ISO-8859-1     26
es_GT.utf8          UTF-8          26
es_HN               ISO-8859-1     26
es_HN.utf8          UTF-8          26
es_MX               ISO-8859-1     26
es_MX.utf8          UTF-8          26
es_NI               ISO-8859-1     26
es_NI.utf8          UTF-8          26
es_PA               ISO-8859-1     26
es_PA.utf8          UTF-8          26
es_PE               ISO-8859-1     26
es_PE.utf8          UTF-8          26
es_PR               ISO-8859-1     26
es_PR.utf8          UTF-8          26
es_PY               ISO-8859-1     26
es_PY.utf8          UTF-8          26
es_SV               ISO-8859-1     26
es_SV.utf8          UTF-8          26
es_US               ISO-8859-1     26
es_US.utf8          UTF-8          26
es_UY               ISO-8859-1     26
es_UY.utf8          UTF-8          26
es_VE               ISO-8859-1     26
es_VE.utf8          UTF-8          26
et_EE               ISO-8859-1     39
et_EE.iso885915     ISO-8859-15    39
et_EE.utf8          UTF-8          39
eu_ES               ISO-8859-1     26
address@hidden          ISO-8859-15    26
eu_ES.utf8          UTF-8          26
fa_IR               UTF-8          26
fa_IR.utf8          UTF-8          26
fi_FI               ISO-8859-1     28
address@hidden          ISO-8859-15    28
fi_FI.utf8          UTF-8          28
fil_PH              UTF-8          26
fo_FO               ISO-8859-1     26
fo_FO.utf8          UTF-8          26
fr_BE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
fr_BE.utf8          UTF-8          26
fr_CA               ISO-8859-1     26
fr_CA.utf8          UTF-8          26
fr_CH               ISO-8859-1     26
fr_CH.utf8          UTF-8          26
fr_FR               ISO-8859-1     26
address@hidden          ISO-8859-15    26
fr_FR.utf8          UTF-8          26
fr_LU               ISO-8859-1     26
address@hidden          ISO-8859-15    26
fr_LU.utf8          UTF-8          26
fur_IT              UTF-8          26
fy_DE               UTF-8          26
fy_NL               UTF-8          26
ga_IE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
ga_IE.utf8          UTF-8          26
gd_GB               ISO-8859-15    26
gd_GB.utf8          UTF-8          26
gez_ER              UTF-8          26
address@hidden      UTF-8          26
gez_ET              UTF-8          26
address@hidden      UTF-8          26
gl_ES               ISO-8859-1     26
address@hidden          ISO-8859-15    26
gl_ES.utf8          UTF-8          26
gu_IN               UTF-8          26
gv_GB               ISO-8859-1     26
gv_GB.utf8          UTF-8          26
ha_NG               UTF-8          26
he_IL               ISO-8859-8     26
he_IL.utf8          UTF-8          26
hi_IN               UTF-8          26
hi_IN.utf8          UTF-8          26
hr_HR               ISO-8859-2     51
hr_HR.utf8          UTF-8          51
hsb_DE              ISO-8859-2     51
hsb_DE.utf8         UTF-8          51
hu_HU               ISO-8859-2     26
hu_HU.utf8          UTF-8          26
hy_AM               UTF-8          26
hy_AM.armscii8      ANSI_X3.4-1968 26
id_ID               ISO-8859-1     26
id_ID.utf8          UTF-8          26
ig_NG               UTF-8          26
ik_CA               UTF-8          26
is_IS               ISO-8859-1     51
is_IS.utf8          UTF-8          51
it_CH               ISO-8859-1     26
it_CH.utf8          UTF-8          26
it_IT               ISO-8859-1     26
address@hidden          ISO-8859-15    26
it_IT.utf8          UTF-8          26
iu_CA               UTF-8          26
iw_IL               ISO-8859-8     26
iw_IL.utf8          UTF-8          26
ja_JP.eucjp         EUC-JP         26
ja_JP.shiftjisx0213 ANSI_X3.4-1968 26
ja_JP.sjis          SHIFT_JIS      26
ja_JP.utf8          UTF-8          26
ka_GE               GEORGIAN-PS    26
ka_GE.utf8          UTF-8          26
kk_KZ               PT154          26
kk_KZ.utf8          UTF-8          26
kl_GL               ISO-8859-1     26
kl_GL.utf8          UTF-8          26
km_KH               UTF-8          51
kn_IN               UTF-8          26
ko_KR.euckr         EUC-KR         26
ko_KR.utf8          UTF-8          26
ku_TR               ISO-8859-9     26
ku_TR.utf8          UTF-8          26
kw_GB               ISO-8859-1     26
kw_GB.utf8          UTF-8          26
ky_KG               UTF-8          26
lg_UG               ISO-8859-10    26
lg_UG.utf8          UTF-8          26
li_BE               UTF-8          26
li_NL               UTF-8          26
lo_LA               UTF-8          51
lt_LT               ISO-8859-13    51
lt_LT.utf8          UTF-8          51
lv_LV               ISO-8859-13    51
lv_LV.utf8          UTF-8          51
mai_IN              UTF-8          26
mg_MG               ISO-8859-15    26
mg_MG.utf8          UTF-8          26
mi_NZ               ISO-8859-13    26
mi_NZ.utf8          UTF-8          26
mk_MK               ISO-8859-5     26
mk_MK.utf8          UTF-8          26
ml_IN               UTF-8          26
ml_IN.utf8          UTF-8          26
mn_MN               UTF-8          26
mn_MN.utf8          UTF-8          26
mr_IN               UTF-8          26
mr_IN.utf8          UTF-8          26
ms_MY               ISO-8859-1     26
ms_MY.utf8          UTF-8          26
mt_MT               ISO-8859-3     26
mt_MT.utf8          UTF-8          26
nb_NO               ISO-8859-1     26
nb_NO.utf8          UTF-8          26
nds_DE              UTF-8          26
nds_NL              UTF-8          26
ne_NP               UTF-8          26
ne_NP.utf8          UTF-8          26
nl_BE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
nl_BE.utf8          UTF-8          26
nl_NL               ISO-8859-1     26
address@hidden          ISO-8859-15    26
nl_NL.utf8          UTF-8          26
nn_NO               ISO-8859-1     26
nn_NO.utf8          UTF-8          26
no_NO               ISO-8859-1     26
no_NO.utf8          UTF-8          26
nr_ZA               UTF-8          26
nso_ZA              UTF-8          26
oc_FR               ISO-8859-1     26
oc_FR.utf8          UTF-8          26
om_ET               UTF-8          26
om_ET.utf8          UTF-8          26
om_KE               ISO-8859-1     26
om_KE.utf8          UTF-8          26
or_IN               UTF-8          51
pa_IN               UTF-8          26
pa_IN.utf8          UTF-8          26
pap_AN              UTF-8          26
pa_PK               UTF-8          26
pl_PL               ISO-8859-2     51
pl_PL.utf8          UTF-8          51
POSIX               ANSI_X3.4-1968 26
pt_BR               ISO-8859-1     26
pt_BR.utf8          UTF-8          26
pt_PT               ISO-8859-1     26
address@hidden          ISO-8859-15    26
pt_PT.utf8          UTF-8          26
ro_RO               ISO-8859-2     26
ro_RO.utf8          UTF-8          26
ru_RU               ISO-8859-5     26
ru_RU.koi8r         KOI8-R         26
ru_RU.utf8          UTF-8          26
ru_UA               KOI8-U         26
ru_UA.utf8          UTF-8          26
rw_RW               UTF-8          26
sa_IN               UTF-8          26
sc_IT               UTF-8          26
se_NO               UTF-8          26
se_NO.utf8          UTF-8          26
shs_CA              UTF-8          26
sh_YU               ISO-8859-2     51
sh_YU.utf8          UTF-8          51
sid_ET              UTF-8          26
sid_ET.utf8         UTF-8          26
si_LK               UTF-8          26
sk_SK               ISO-8859-2     51
sk_SK.utf8          UTF-8          51
sl_SI               ISO-8859-2     51
sl_SI.utf8          UTF-8          51
so_DJ               ISO-8859-1     26
so_DJ.utf8          UTF-8          26
so_ET               UTF-8          26
so_ET.utf8          UTF-8          26
so_KE               ISO-8859-1     26
so_KE.utf8          UTF-8          26
so_SO               ISO-8859-1     26
so_SO.utf8          UTF-8          26
sq_AL               ISO-8859-1     26
sq_AL.utf8          UTF-8          26
sr_ME               UTF-8          26
sr_RS               UTF-8          26
address@hidden         UTF-8          26
ss_ZA               UTF-8          26
st_ZA               ISO-8859-1     26
st_ZA.utf8          UTF-8          26
sv_FI               ISO-8859-1     28
address@hidden          ISO-8859-15    28
sv_FI.utf8          UTF-8          28
sv_SE               ISO-8859-1     26
sv_SE.iso885915     ISO-8859-15    26
sv_SE.utf8          UTF-8          26
ta_IN               UTF-8          26
ta_IN.utf8          UTF-8          26
te_IN               UTF-8          26
te_IN.utf8          UTF-8          26
tg_TJ               KOI8-T         26
tg_TJ.utf8          UTF-8          26
th_TH               TIS-620        51
th_TH.utf8          UTF-8          51
ti_ER               UTF-8          26
ti_ER.utf8          UTF-8          26
ti_ET               UTF-8          26
ti_ET.utf8          UTF-8          26
tig_ER              UTF-8          26
tig_ER.utf8         UTF-8          26
tk_TM               UTF-8          26
tl_PH               ISO-8859-1     26
tl_PH.utf8          UTF-8          26
tn_ZA               UTF-8          26
tr_CY               ISO-8859-9     51
tr_CY.utf8          UTF-8          51
tr_TR               ISO-8859-9     51
tr_TR.utf8          UTF-8          51
ts_ZA               UTF-8          26
address@hidden UTF-8          26
tt_RU.utf8          UTF-8          26
ug_CN               UTF-8          26
uk_UA               KOI8-U         26
uk_UA.utf8          UTF-8          26
ur_PK               UTF-8          26
ur_PK.utf8          UTF-8          26
uz_UZ               ISO-8859-1     26
address@hidden      UTF-8          26
ve_ZA               UTF-8          26
vi_VN               UTF-8          26
vi_VN.tcvn          TCVN5712-1     26
wa_BE               ISO-8859-1     26
address@hidden          ISO-8859-15    26
wa_BE.utf8          UTF-8          26
wo_SN               UTF-8          26
xh_ZA               ISO-8859-1     26
xh_ZA.utf8          UTF-8          26
yi_US               CP1255         26
yi_US.utf8          UTF-8          26
yo_NG               UTF-8          26
zh_CN               GB2312         26
zh_CN.gb18030       GB18030        26
zh_CN.gbk           GBK            26
zh_CN.utf8          UTF-8          26
zh_HK               BIG5-HKSCS     26
zh_HK.utf8          UTF-8          26
zh_SG               GB2312         26
zh_SG.gbk           GBK            26
zh_SG.utf8          UTF-8          26
zh_TW               BIG5           26
zh_TW.euctw         EUC-TW         26
zh_TW.utf8          UTF-8          26
zu_ZA               ISO-8859-1     26
zu_ZA.utf8          UTF-8          26


At least the results don't depend on the locale encoding (so any
encoding-dependent bugs must be due to 'grep'). But it's not clear whether this
behaviour was intended.

Bruno


