Re: NSString lowercaseString

discuss-gnustep

From:	Sebastian Reitenbach
Subject:	Re: NSString lowercaseString
Date:	Wed, 01 Aug 2012 10:50:57 +0200
User-agent:	SOGoMail 1.3.17

On Wednesday, August 1, 2012 05:16 CEST, Eric Wasylishen 
<ewasylishen@gmail.com> wrote:

> Hi,
>
> A while ago I added code to NSString.m to use ICU for the -compare: and 
> -rangeOfString: methods, so they're done correctly with respect to Unicode 
> and locales, along with tests that verify the behaviour matches Cocoa for the 
> most part.
>
> The -lowercaseString/-uppercaseString methods should probably use 
> u_strFoldCase if ICU is available.
>
> I'm skimming through the NSString API looking for methods that we should use 
> ICU for and currently don't (or don't implement), and there are only a 
> handful:
>
> -decomposedString* and -precomposedString* methods
> -uppercase/lowercase/capitalized methods
> -stringByFoldingWithOptions:locale:
> -localizedStandardCompare:
> -rangeOfComposedCharacterSequenceAtIndex:
> -rangeOfComposedCharacterSequencesForRange:
> -initWithFormat:locale: and friends perhaps? Maybe what we have now is fine 
> though, I'm not too familiar with it.
>
> I'd be willing to do the case folding ones at some point, for a start. :-)

I "enhanced" my test program a bit, and compared output when running on Linux 
and OpenBSD:

#import <Foundation/Foundation.h>


int main(int argc, char *argv[]) {
NSLog(@"Lowercase: %@", [[NSString stringWithString:@"TöÖst"] lowercaseString]);

}

Running the test program on a Linux box (openSUSE 11.3) in an xterm, without my 
patch:
sre@sre:~> LC_CTYPE='de_DE.UTF-8' ./lowercase
2012-08-01 08:49:57.972 lowercase[16574] autorelease called without pool for 
object (0x72db28) of class GSCInlineString in thread <NSThread: 0x6b0cc8>
2012-08-01 08:49:57.974 lowercase[16574] autorelease called without pool for 
object (0x72dce8) of class GSCInlineString in thread <NSThread: 0x6b0cc8>
2012-08-01 08:49:57.974 lowercase[16574] Lowercase: tÃ¶Ãst
sre@sre:~> LC_CTYPE='en_EN.UTF-8' ./lowercase
2012-08-01 08:50:09.500 lowercase[16584] autorelease called without pool for 
object (0x72d538) of class GSCInlineString in thread <NSThread: 0x6b06d8>
2012-08-01 08:50:09.501 lowercase[16584] autorelease called without pool for 
object (0x72d6f8) of class GSCInlineString in thread <NSThread: 0x6b06d8>
2012-08-01 08:50:09.501 lowercase[16584] Lowercase: töÖst

Logged in from the same Linux box (xterm) to the OpenBSD host, I get (with and 
without my patch):
$ LC_CTYPE='de_DE.UTF-8' ./lowercase
2012-08-01 10:38:52.850 lowercase[5483] autorelease called without pool for 
object (0x20c403f88) of class GSUnicodeInlineString in thread <NSThread: 
0x20750be08>
2012-08-01 10:38:52.851 lowercase[5483] autorelease called without pool for 
object (0x209c1c5c8) of class GSUnicodeInlineString in thread <NSThread: 
0x20750be08>
2012-08-01 10:38:52.852 lowercase[5483] Lowercase: tööst
$ LC_CTYPE='en_EN.UTF-8' ./lowercase
2012-08-01 10:38:46.754 lowercase[32569] autorelease called without pool for 
object (0x20af26088) of class GSUnicodeInlineString in thread <NSThread: 
0x2028f9308>
2012-08-01 10:38:46.756 lowercase[32569] autorelease called without pool for 
object (0x20444f248) of class GSUnicodeInlineString in thread <NSThread: 
0x2028f9308>
2012-08-01 10:38:46.756 lowercase[32569] Lowercase: t��st

The weird thing on Linux is that the second Ö is not lowercased, while on 
OpenBSD it is, even though the Linux build is linked against icu4c.
Even weirder is the effect of LC_CTYPE: with de_DE.UTF-8 the output is readable 
on OpenBSD but garbled on Linux, and with en_EN.UTF-8 it is the other way around.

Sebastian


>
> Eric
>
> On Jul 31, 2012, at 3:40 PM, Stefan Bidi <stefanbidi@gmail.com> wrote:
>
> > On Tue, Jul 31, 2012 at 12:27 PM, Sebastian Reitenbach
> > <sebastia@l00-bugdead-prods.de> wrote:
> >>
> >> On Tuesday, July 31, 2012 19:06 CEST, David Chisnall <theraven@sucs.org> 
> >> wrote:
> >>
> >>> Are you using GNUstep with or without ICU?  When you say skipped, is it 
> >>> removed from the destination, or just passed through unmodified?  Is your 
> >>> locale set to something that recognises letters with umlauts?
> >>
> >> It's with ICU, and I run OGo with
> >> LC_CTYPE='de_DE.UTF-8'
> >> so, supposed to recognize Umlauts.
> >>
> >> I had some NSLog calls in the GSString lowercase code, and without my 
> >> patch tolower returns 0 for an Umlaut, so it's not really skipped; instead
> >> o->_contents.c[i] is set to 0 in the middle of a string :(
> >>
> >> My patch just checks whether tolower returned 0 and, if so, passes the 
> >> character it cannot handle through unchanged.
> >>
> >> following ICU is installed:
> >> $ pkg_info | grep icu4c
> >> icu4c-4.8.1.1       International Components for Unicode
> >
> > Just FYI, GNUstep doesn't use ICU in NSString (David added a GSICUString
> > class, but it isn't used very often).  I looked into it over a year
> > ago but decided against implementing something.  The reason was
> > because I didn't completely understand the code and at that point I
> > had already started working on CFString, which I could freely break
> > without anyone noticing.
> >
> > Stef
> >
> >>
> >> gnustep is from the latest releases, using libobjc from gcc 4.2.1, if that 
> >> matters.
> >>
> >> Sebastian
> >>
> >>
> >>>
> >>> David
> >>>
> >>> On 31 Jul 2012, at 18:02, Sebastian Reitenbach wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> With OGo, I convert a UTF-8 string to lowercase using -[NSString 
> >>>> lowercaseString].
> >>>>
> >>>> When there are Umlauts in the string, GNUstep just omits the 
> >>>> character.
> >>>> I have no idea whether this is actually right or wrong.
> >>>>
> >>>> With the attached patch below to GSString it does not omit the character 
> >>>> anymore.
> >>>>
> >>>>
> >>>> gcc -fgnu-runtime -fconstant-string-class=NSConstantString 
> >>>> -I/usr/local/include -L/usr/local/lib -l gnustep-base lowercase.m -o 
> >>>> lowercase
> >>>>
> >>>> cat lowercase.m
> >>>> #import <Foundation/Foundation.h>
> >>>>
> >>>>
> >>>> int main(int argc, char *argv[]) {
> >>>>       NSLog(@"Lowercase: %@", [[NSString stringWithString:@"Töst"] 
> >>>> lowercaseString]);
> >>>>
> >>>> }
> >>>>
> >>>>
> >>>>
> >>>> Does running the program above on a Mac output the ö or omit it from the 
> >>>> string?
> >>>>
> >>>> Does it change when running with LC_CTYPE="C" or LC_CTYPE='de_DE.UTF-8'?
> >>>>
> >>>> I don't have a Mac, so I cannot test it myself; maybe the approach used 
> >>>> by OGo is wrong as well.
> >>>> The Apple docs say nothing about skipped characters, 
> >>>> only that, e.g., a ß may change to SS when using uppercaseString.
> >>>> Since they mention the ß in the documentation, I'd expect 
> >>>> lowercaseString to handle Umlauts too, or is that just a plain wrong 
> >>>> assumption?
> >>>>
> >>>> if someone could hit me with a cluestick please ;)
> >>>>
> >>>> cheers,
> >>>> Sebastian
> >>>>
> >>>> the patch to not omit Umlauts.
> >>>> $OpenBSD$
> >>>> --- Source/GSString.m.orig  Tue Jul 31 18:31:36 2012
> >>>> +++ Source/GSString.m       Tue Jul 31 18:32:24 2012
> >>>> @@ -3699,6 +3700,8 @@ agree, create a new GSCInlineString otherwise.
> >>>>  while (i-- > 0)
> >>>>    {
> >>>>      o->_contents.c[i] = tolower(_contents.c[i]);
> >>>> +      if (o->_contents.c[i] == 0)
> >>>> +   o->_contents.c[i] = _contents.c[i];
> >>>>    }
> >>>>  o->_flags.wide = 0;
> >>>>  o->_flags.owned = 1;      // Ignored on dealloc, but means we own buffer
> >>>>
> >>>> _______________________________________________
> >>>> Discuss-gnustep mailing list
> >>>> Discuss-gnustep@gnu.org
> >>>> https://lists.gnu.org/mailman/listinfo/discuss-gnustep
> >>>
> >>> --
> >>> This email complies with ISO 3103
> >>>
> >>
> >
>
