Re: NSString lowercaseString
From: Sebastian Reitenbach
Subject: Re: NSString lowercaseString
Date: Wed, 01 Aug 2012 10:50:57 +0200
User-agent: SOGoMail 1.3.17
On Wednesday, August 1, 2012 05:16 CEST, Eric Wasylishen
<ewasylishen@gmail.com> wrote:
> Hi,
>
> A while ago I added code to NSString.m to use ICU for the -compare: and
> -rangeOfString: methods, so they're done correctly with respect to unicode
> and locales, as well as tests that verify the behaviour matches Cocoa for the
> most part.
>
> The -lowercaseString/-uppercaseString methods should probably use
> u_strFoldCase if ICU is available.
>
> I'm skimming through the NSString API looking for methods that we should use
> ICU for and currently don't (or don't implement), and there are only a
> handful:
>
> -decomposedString* and -precomposedString* methods
> -uppercase/lowercase/capitalized methods
> -stringByFoldingWithOptions:locale:
> -localizedStandardCompare:
> -rangeOfComposedCharacterSequenceAtIndex:
> -rangeOfComposedCharacterSequencesForRange:
> -initWithFormat:locale: and friends perhaps? Maybe what we have now is fine
> though, I'm not too familiar with it.
>
> I'd be willing to do the case folding ones at some point, for a start. :-)
I "enhanced" my test program a bit, and compared the output when running it on
Linux and OpenBSD:
#import <Foundation/Foundation.h>
int main(int argc, char *argv[]) {
NSLog(@"Lowercase: %@", [[NSString stringWithString:@"TöÖst"] lowercaseString]);
}
Running the test program on a Linux box in xterm (openSUSE 11.3) without my
patch:
sre@sre:~> LC_CTYPE='de_DE.UTF-8' ./lowercase
2012-08-01 08:49:57.972 lowercase[16574] autorelease called without pool for
object (0x72db28) of class GSCInlineString in thread <NSThread: 0x6b0cc8>
2012-08-01 08:49:57.974 lowercase[16574] autorelease called without pool for
object (0x72dce8) of class GSCInlineString in thread <NSThread: 0x6b0cc8>
2012-08-01 08:49:57.974 lowercase[16574] Lowercase: töÃst
sre@sre:~> LC_CTYPE='en_EN.UTF-8' ./lowercase
2012-08-01 08:50:09.500 lowercase[16584] autorelease called without pool for
object (0x72d538) of class GSCInlineString in thread <NSThread: 0x6b06d8>
2012-08-01 08:50:09.501 lowercase[16584] autorelease called without pool for
object (0x72d6f8) of class GSCInlineString in thread <NSThread: 0x6b06d8>
2012-08-01 08:50:09.501 lowercase[16584] Lowercase: töÖst
Logged in from the same Linux box, in xterm, to the OpenBSD host, I get (with
and without my patch):
$ LC_CTYPE='de_DE.UTF-8' ./lowercase
2012-08-01 10:38:52.850 lowercase[5483] autorelease called without pool for
object (0x20c403f88) of class GSUnicodeInlineString in thread <NSThread:
0x20750be08>
2012-08-01 10:38:52.851 lowercase[5483] autorelease called without pool for
object (0x209c1c5c8) of class GSUnicodeInlineString in thread <NSThread:
0x20750be08>
2012-08-01 10:38:52.852 lowercase[5483] Lowercase: tööst
$ LC_CTYPE='en_EN.UTF-8' ./lowercase
2012-08-01 10:38:46.754 lowercase[32569] autorelease called without pool for
object (0x20af26088) of class GSUnicodeInlineString in thread <NSThread:
0x2028f9308>
2012-08-01 10:38:46.756 lowercase[32569] autorelease called without pool for
object (0x20444f248) of class GSUnicodeInlineString in thread <NSThread:
0x2028f9308>
2012-08-01 10:38:46.756 lowercase[32569] Lowercase: t��st
The weird thing on Linux is that the second Ö is not lowercased, but on OpenBSD
it is. Also, on Linux it's linked against icu4c.
Even weirder is the effect of LC_CTYPE: with DE it works on OpenBSD but not on
Linux, and with EN it is the other way around?
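One possible explanation, which I have not verified against the GNUstep source: in UTF-8 the Ö is the two bytes 0xC3 0x96, and the C standard only defines tolower() for values representable as unsigned char (or EOF), so calling it byte by byte on a char buffer cannot lowercase it and may give locale-dependent garbage. The following is a minimal sketch in plain C of lowercasing at the code point level instead. The helper name utf8_lower_latin1 is made up for illustration, it only covers ASCII and the Latin-1 capitals, and it is not what GNUstep or ICU actually does:

```c
/* Hypothetical helper, NOT GNUstep or ICU code.
 * Lowercases, in place, the ASCII capitals and the Latin-1 Supplement
 * capitals U+00C0..U+00DE (except U+00D7, the multiplication sign) in a
 * NUL-terminated UTF-8 buffer. Every other byte passes through untouched. */
static void utf8_lower_latin1(char *s)
{
    /* Work on unsigned bytes so values >= 0x80 are never negative. */
    unsigned char *p = (unsigned char *)s;

    while (*p != 0) {
        if (*p >= 'A' && *p <= 'Z') {
            /* Plain ASCII capital: one byte, lowercase is 0x20 higher. */
            *p = (unsigned char)(*p + 0x20);
            p++;
        } else if (*p == 0xC3 && p[1] >= 0x80 && p[1] <= 0x9E && p[1] != 0x97) {
            /* Two-byte sequence C3 80..C3 9E encodes U+00C0..U+00DE.
             * The lowercase form is 0x20 higher in the second byte. */
            p[1] = (unsigned char)(p[1] + 0x20);
            p += 2;
        } else {
            /* Anything else (including ß, U+00DF) is left as it is. */
            p++;
        }
    }
}
```

Called on the bytes of "TöÖst" (T, C3 B6, C3 96, s, t) this should leave the buffer holding "tööst" regardless of LC_CTYPE, since it never goes through the locale-dependent tolower().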
Sebastian
>
> Eric
>
> On Jul 31, 2012, at 3:40 PM, Stefan Bidi <stefanbidi@gmail.com> wrote:
>
> > On Tue, Jul 31, 2012 at 12:27 PM, Sebastian Reitenbach
> > <sebastia@l00-bugdead-prods.de> wrote:
> >>
> >> On Tuesday, July 31, 2012 19:06 CEST, David Chisnall <theraven@sucs.org>
> >> wrote:
> >>
> >>> Are you using GNUstep with or without ICU? When you say skipped, is it
> >>> removed from the destination, or just passed through unmodified? Is your
> >>> locale set to something that recognises letters with umlauts?
> >>
> >> It's with ICU, and I run OGo with
> >> LC_CTYPE='de_DE.UTF-8'
> >> so, supposed to recognize Umlauts.
> >>
> >> I had some NSLog in GSString lowercase, and without my patch, it returns 0
> >> for an Umlaut, so it's not really skipped, but
> >> o->_contents.c[i] is set to 0 in the middle of a string :(
> >>
> >> My patch just checks whether tolower returned 0, and if so passes the
> >> character it cannot handle through unchanged.
> >>
> >> following ICU is installed:
> >> $ pkg_info | grep icu4c
> >> icu4c-4.8.1.1 International Components for Unicode
> >
> > Just FYI, GNUstep doesn't use ICU in NSString (David added a GSICUString
> > class, but it isn't used very often). I looked into it over a year
> > ago but decided against implementing something. The reason was
> > because I didn't completely understand the code and at that point I
> > had already started working on CFString, which I could freely break
> > without anyone noticing.
> >
> > Stef
> >
> >>
> >> gnustep is from the latest releases, using libobjc from gcc 4.2.1, if that
> >> matters.
> >>
> >> Sebastian
> >>
> >>
> >>>
> >>> David
> >>>
> >>> On 31 Jul 2012, at 18:02, Sebastian Reitenbach wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> with OGo, I convert a UTF-8 string to lowercase, using -[NSString
> >>>> lowercaseString]
> >>>>
> >>>> when there are Umlauts in the string, then GNUstep just omits the
> >>>> character.
> >>>> I've no idea, whether this is right or wrong actually.
> >>>>
> >>>> With the attached patch below to GSString it does not omit the character
> >>>> anymore.
> >>>>
> >>>>
> >>>> gcc -fgnu-runtime -fconstant-string-class=NSConstantString
> >>>> -I/usr/local/include -L/usr/local/lib -l gnustep-base lowercase.m -o
> >>>> lowercase
> >>>>
> >>>> cat lowercase.m
> >>>> #import <Foundation/Foundation.h>
> >>>>
> >>>>
> >>>> int main(int argc, char *argv[]) {
> >>>> NSLog(@"Lowercase: %@", [[NSString stringWithString:@"Töst"]
> >>>> lowercaseString]);
> >>>>
> >>>> }
> >>>>
> >>>>
> >>>>
> >>>> Does running the program above on a Mac output the ö, or omit it from
> >>>> the string?
> >>>>
> >>>> does it change when running with LC_CTYPE="C" or LC_CTYPE='de_DE.UTF-8' ?
> >>>>
> >>>> I don't have a Mac, so cannot test myself, maybe also the approach used
> >>>> by OGo could be wrong.
> >>>> At least when reading the Apple docs, there is nothing said about
> >>>> skipped characters,
> >>>> only that e.g. a ß may change to SS when using uppercaseString.
> >>>> Since they mention the ß in the documentation, I'd expect
> >>>> lowercaseString to handle Umlauts too, or is that just a plain wrong
> >>>> assumption?
> >>>>
> >>>> if someone could hit me with a cluestick please ;)
> >>>>
> >>>> cheers,
> >>>> Sebastian
> >>>>
> >>>> the patch to not omit Umlauts.
> >>>> $OpenBSD$
> >>>> --- Source/GSString.m.orig Tue Jul 31 18:31:36 2012
> >>>> +++ Source/GSString.m Tue Jul 31 18:32:24 2012
> >>>> @@ -3699,6 +3700,8 @@ agree, create a new GSCInlineString otherwise.
> >>>> while (i-- > 0)
> >>>> {
> >>>> o->_contents.c[i] = tolower(_contents.c[i]);
> >>>> + if (o->_contents.c[i] == 0)
> >>>> + o->_contents.c[i] = _contents.c[i];
> >>>> }
> >>>> o->_flags.wide = 0;
> >>>> o->_flags.owned = 1; // Ignored on dealloc, but means we own buffer
> >>>>
> >>>> _______________________________________________
> >>>> Discuss-gnustep mailing list
> >>>> Discuss-gnustep@gnu.org
> >>>> https://lists.gnu.org/mailman/listinfo/discuss-gnustep
> >>>
> >>> --
> >>> This email complies with ISO 3103
> >>>
> >>
> >
>