Re: NSString lowercaseString
From: Sebastian Reitenbach
Subject: Re: NSString lowercaseString
Date: Wed, 01 Aug 2012 13:21:05 +0200
User-agent: SOGoMail 1.3.17
On Wednesday, August 1, 2012 11:49 CEST, Ivan Vučica <ivucica@gmail.com> wrote:
> Which charset is your terminal configured to use on each operating system?
Sorry, I don't know how to find that out.
Sebastian
>
> On 1. 8. 2012., at 10:50, "Sebastian Reitenbach"
> <sebastia@l00-bugdead-prods.de> wrote:
>
> >
> > On Wednesday, August 1, 2012 05:16 CEST, Eric Wasylishen
> > <ewasylishen@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> A while ago I added code to NSString.m to use ICU for the -compare: and
> >> -rangeOfString: methods, so they're done correctly with respect to unicode
> >> and locales, as well as tests that verify the behaviour matches Cocoa for
> >> the most part.
> >>
> >> The -lowercaseString/-uppercaseString methods should probably use
> >> u_strFoldCase if ICU is available.
> >>
> >> I'm skimming through the NSString API looking for methods that we should
> >> use ICU for and currently don't (or don't implement), and there are only a
> >> handful:
> >>
> >> -decomposedString* and -precomposedString* methods
> >> -uppercase/lowercase/capitalized methods
> >> -stringByFoldingWithOptions:locale:
> >> -localizedStandardCompare:
> >> -rangeOfComposedCharacterSequenceAtIndex:
> >> -rangeOfComposedCharacterSequencesForRange:
> >> -initWithFormat:locale: and friends perhaps? Maybe what we have now is
> >> fine though, I'm not too familiar with it.
> >>
> >> I'd be willing to do the case folding ones at some point, for a start. :-)
> >
> > I "enhanced" my test program a bit, and compared output when running on
> > Linux and OpenBSD:
> >
> > #import <Foundation/Foundation.h>
> >
> >
> > int main(int argc, char *argv[]) {
> > NSLog(@"Lowercase: %@", [[NSString stringWithString:@"TöÖst"]
> > lowercaseString]);
> >
> > }
> >
> > running the test program on a Linux box in xterm (opensuse 11.3) without my
> > patch:
> > sre@sre:~> LC_CTYPE='de_DE.UTF-8' ./lowercase
> > 2012-08-01 08:49:57.972 lowercase[16574] autorelease called without pool
> > for object (0x72db28) of class GSCInlineString in thread <NSThread:
> > 0x6b0cc8>
> > 2012-08-01 08:49:57.974 lowercase[16574] autorelease called without pool
> > for object (0x72dce8) of class GSCInlineString in thread <NSThread:
> > 0x6b0cc8>
> > 2012-08-01 08:49:57.974 lowercase[16574] Lowercase: töÃst
> > sre@sre:~> LC_CTYPE='en_EN.UTF-8' ./lowercase
> > 2012-08-01 08:50:09.500 lowercase[16584] autorelease called without pool
> > for object (0x72d538) of class GSCInlineString in thread <NSThread:
> > 0x6b06d8>
> > 2012-08-01 08:50:09.501 lowercase[16584] autorelease called without pool
> > for object (0x72d6f8) of class GSCInlineString in thread <NSThread:
> > 0x6b06d8>
> > 2012-08-01 08:50:09.501 lowercase[16584] Lowercase: töÖst
> >
> > logged in from the same Linux box, xterm, to the OpenBSD host I get (with
> > and without my patch):
> > $ LC_CTYPE='de_DE.UTF-8' ./lowercase
> > 2012-08-01 10:38:52.850 lowercase[5483] autorelease called without pool for
> > object (0x20c403f88) of class GSUnicodeInlineString in thread <NSThread:
> > 0x20750be08>
> > 2012-08-01 10:38:52.851 lowercase[5483] autorelease called without pool for
> > object (0x209c1c5c8) of class GSUnicodeInlineString in thread <NSThread:
> > 0x20750be08>
> > 2012-08-01 10:38:52.852 lowercase[5483] Lowercase: tööst
> > $ LC_CTYPE='en_EN.UTF-8' ./lowercase
> > 2012-08-01 10:38:46.754 lowercase[32569] autorelease called without pool
> > for object (0x20af26088) of class GSUnicodeInlineString in thread
> > <NSThread: 0x2028f9308>
> > 2012-08-01 10:38:46.756 lowercase[32569] autorelease called without pool
> > for object (0x20444f248) of class GSUnicodeInlineString in thread
> > <NSThread: 0x2028f9308>
> > 2012-08-01 10:38:46.756 lowercase[32569] Lowercase: t��st
> >
> > The weird thing on Linux is that the second Ö is not lowercased, but on
> > OpenBSD it is. On Linux it is also linked against icu4c.
> > Even weirder is the effect of LC_CTYPE: with DE it works on OpenBSD but
> > not on Linux, and with EN it is the other way around.
> >
> > Sebastian
> >
> >
> >>
> >> Eric
> >>
> >> On Jul 31, 2012, at 3:40 PM, Stefan Bidi <stefanbidi@gmail.com> wrote:
> >>
> >>> On Tue, Jul 31, 2012 at 12:27 PM, Sebastian Reitenbach
> >>> <sebastia@l00-bugdead-prods.de> wrote:
> >>>>
> >>>> On Tuesday, July 31, 2012 19:06 CEST, David Chisnall <theraven@sucs.org>
> >>>> wrote:
> >>>>
> >>>>> Are you using GNUstep with or without ICU? When you say skipped, is it
> >>>>> removed from the destination, or just passed through unmodified? Is
> >>>>> your locale set to something that recognises letters with umlauts?
> >>>>
> >>>> It's with ICU, and I run OGo with
> >>>> LC_CTYPE='de_DE.UTF-8'
> >>>> so it is supposed to recognize umlauts.
> >>>>
> >>>> I put some NSLog calls into the GSString lowercase code. Without my
> >>>> patch, tolower returns 0 for an umlaut, so it's not really skipped;
> >>>> instead o->_contents.c[i] is set to 0 in the middle of a string :(
> >>>>
> >>>> My patch just checks whether tolower returned 0 and, if so, passes the
> >>>> character it cannot handle through unchanged.
> >>>>
> >>>> following ICU is installed:
> >>>> $ pkg_info | grep icu4c
> >>>> icu4c-4.8.1.1 International Components for Unicode
> >>>
> >>> Just FYI, GNUstep doesn't use ICU in NSString (David added a GSICUString
> >>> class, but it isn't used very often). I looked into it over a year
> >>> ago but decided against implementing something. The reason was
> >>> because I didn't completely understand the code and at that point I
> >>> had already started working on CFString, which I could freely break
> >>> without anyone noticing.
> >>>
> >>> Stef
> >>>
> >>>>
> >>>> gnustep is from the latest releases, using libobjc from gcc 4.2.1, if
> >>>> that matters.
> >>>>
> >>>> Sebastian
> >>>>
> >>>>
> >>>>>
> >>>>> David
> >>>>>
> >>>>> On 31 Jul 2012, at 18:02, Sebastian Reitenbach wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> with OGo, I convert a UTF-8 string to lowercase, using -[NSString
> >>>>>> lowercaseString].
> >>>>>>
> >>>>>> When there are umlauts in the string, GNUstep just omits the
> >>>>>> character.
> >>>>>> I have no idea whether this is actually right or wrong.
> >>>>>>
> >>>>>> With the attached patch below to GSString it does not omit the
> >>>>>> character anymore.
> >>>>>>
> >>>>>>
> >>>>>> gcc -fgnu-runtime -fconstant-string-class=NSConstantString
> >>>>>> -I/usr/local/include -L/usr/local/lib -l gnustep-base lowercase.m -o
> >>>>>> lowercase
> >>>>>>
> >>>>>> cat lowercase.m
> >>>>>> #import <Foundation/Foundation.h>
> >>>>>>
> >>>>>>
> >>>>>> int main(int argc, char *argv[]) {
> >>>>>> NSLog(@"Lowercase: %@", [[NSString stringWithString:@"Töst"]
> >>>>>> lowercaseString]);
> >>>>>>
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Does running the program above on a Mac output the ö or omit it from
> >>>>>> the string?
> >>>>>>
> >>>>>> Does it change when running with LC_CTYPE="C" or
> >>>>>> LC_CTYPE='de_DE.UTF-8'?
> >>>>>>
> >>>>>> I don't have a Mac, so I cannot test this myself; the approach
> >>>>>> used by OGo could also be wrong.
> >>>>>> At least the Apple docs say nothing about skipped characters,
> >>>>>> only that, e.g., a ß may change to SS when using uppercaseString.
> >>>>>> Since they mention the ß in the documentation, I'd expect
> >>>>>> lowercaseString to handle umlauts too, or is that just a plain
> >>>>>> wrong assumption?
> >>>>>>
> >>>>>> if someone could hit me with a cluestick please ;)
> >>>>>>
> >>>>>> cheers,
> >>>>>> Sebastian
> >>>>>>
> >>>>>> the patch to not omit Umlauts.
> >>>>>> $OpenBSD$
> >>>>>> --- Source/GSString.m.orig Tue Jul 31 18:31:36 2012
> >>>>>> +++ Source/GSString.m Tue Jul 31 18:32:24 2012
> >>>>>> @@ -3699,6 +3700,8 @@ agree, create a new GSCInlineString otherwise.
> >>>>>> while (i-- > 0)
> >>>>>> {
> >>>>>> o->_contents.c[i] = tolower(_contents.c[i]);
> >>>>>> + if (o->_contents.c[i] == 0)
> >>>>>> + o->_contents.c[i] = _contents.c[i];
> >>>>>> }
> >>>>>> o->_flags.wide = 0;
> >>>>>> o->_flags.owned = 1; // Ignored on dealloc, but means we own
> >>>>>> buffer
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Discuss-gnustep mailing list
> >>>>>> Discuss-gnustep@gnu.org
> >>>>>> https://lists.gnu.org/mailman/listinfo/discuss-gnustep
> >>>>>
> >>>>> --
> >>>>> This email complies with ISO 3103
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >
> >
> >
> >
> >
>