Re: NSString lowercaseString
From: Ivan Vučica
Subject: Re: NSString lowercaseString
Date: Wed, 1 Aug 2012 11:49:35 +0200

Which charset is your terminal configured to use on each operating system?
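
This matters because the literal in the test program is longer than five bytes once it is encoded. A minimal C sketch, assuming the source file is saved as UTF-8 (the byte values are the standard UTF-8 encodings of ö and Ö; utf8_code_points is a made-up helper name, not a GNUstep function), shows the gap between byte count and character count:

```c
#include <assert.h>
#include <stddef.h>

/* "TöÖst" spelled out as UTF-8 bytes, so the result does not depend on
 * how this source file itself is encoded. */
static const char sample[] = "T\xc3\xb6\xc3\x96st";

/* Count code points in a UTF-8 string by skipping continuation bytes
 * (those of the form 10xxxxxx). Hypothetical helper for illustration. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

Here sizeof(sample) - 1 is 7 while utf8_code_points(sample) is 5, so a terminal configured for a single-byte charset would interpret each of the seven bytes on its own.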
On 1. 8. 2012., at 10:50, "Sebastian Reitenbach"
<sebastia@l00-bugdead-prods.de> wrote:
>
> On Wednesday, August 1, 2012 05:16 CEST, Eric Wasylishen
> <ewasylishen@gmail.com> wrote:
>
>> Hi,
>>
>> A while ago I added code to NSString.m to use ICU for the -compare: and
>> -rangeOfString: methods, so they're done correctly with respect to unicode
>> and locales, as well as tests that verify the behaviour matches Cocoa for
>> the most part.
>>
>> The -lowercaseString/-uppercaseString methods should probably use
>> u_strFoldCase if ICU is available.
>>
>> I'm skimming through the NSString API looking for methods that we should use
>> ICU for and currently don't (or don't implement), and there are only a
>> handful:
>>
>> -decomposedString* and -precomposedString* methods
>> -uppercase/lowercase/capitalized methods
>> -stringByFoldingWithOptions:locale:
>> -localizedStandardCompare:
>> -rangeOfComposedCharacterSequenceAtIndex:
>> -rangeOfComposedCharacterSequencesForRange:
>> -initWithFormat:locale: and friends perhaps? Maybe what we have now is fine
>> though, I'm not too familiar with it.
>>
>> I'd be willing to do the case folding ones at some point, for a start. :-)
>
> I "enhanced" my test program a bit, and compared output when running on Linux
> and OpenBSD:
>
> #import <Foundation/Foundation.h>
>
>
> int main(int argc, char *argv[]) {
> NSLog(@"Lowercase: %@", [[NSString stringWithString:@"TöÖst"]
> lowercaseString]);
>
> }
>
> running the test program on a Linux box in xterm (opensuse 11.3) without my
> patch:
> sre@sre:~> LC_CTYPE='de_DE.UTF-8' ./lowercase
> 2012-08-01 08:49:57.972 lowercase[16574] autorelease called without pool for
> object (0x72db28) of class GSCInlineString in thread <NSThread: 0x6b0cc8>
> 2012-08-01 08:49:57.974 lowercase[16574] autorelease called without pool for
> object (0x72dce8) of class GSCInlineString in thread <NSThread: 0x6b0cc8>
> 2012-08-01 08:49:57.974 lowercase[16574] Lowercase: töÃst
> sre@sre:~> LC_CTYPE='en_EN.UTF-8' ./lowercase
> 2012-08-01 08:50:09.500 lowercase[16584] autorelease called without pool for
> object (0x72d538) of class GSCInlineString in thread <NSThread: 0x6b06d8>
> 2012-08-01 08:50:09.501 lowercase[16584] autorelease called without pool for
> object (0x72d6f8) of class GSCInlineString in thread <NSThread: 0x6b06d8>
> 2012-08-01 08:50:09.501 lowercase[16584] Lowercase: töÖst
>
> logged in from the same Linux box, xterm, to the OpenBSD host I get (with and
> without my patch):
> $ LC_CTYPE='de_DE.UTF-8' ./lowercase
> 2012-08-01 10:38:52.850 lowercase[5483] autorelease called without pool for
> object (0x20c403f88) of class GSUnicodeInlineString in thread <NSThread:
> 0x20750be08>
> 2012-08-01 10:38:52.851 lowercase[5483] autorelease called without pool for
> object (0x209c1c5c8) of class GSUnicodeInlineString in thread <NSThread:
> 0x20750be08>
> 2012-08-01 10:38:52.852 lowercase[5483] Lowercase: tööst
> $ LC_CTYPE='en_EN.UTF-8' ./lowercase
> 2012-08-01 10:38:46.754 lowercase[32569] autorelease called without pool for
> object (0x20af26088) of class GSUnicodeInlineString in thread <NSThread:
> 0x2028f9308>
> 2012-08-01 10:38:46.756 lowercase[32569] autorelease called without pool for
> object (0x20444f248) of class GSUnicodeInlineString in thread <NSThread:
> 0x2028f9308>
> 2012-08-01 10:38:46.756 lowercase[32569] Lowercase: t��st
>
> The weird thing on Linux is that the second umlaut (Ö) is not lowercased, but
> on OpenBSD it is. Also, on Linux it's linked against icu4c.
> Even weirder is the effect of LC_CTYPE: with DE it works on OpenBSD but not
> on Linux, and with EN it's the other way around?
>
> Sebastian
>
>
>>
>> Eric
>>
>> On Jul 31, 2012, at 3:40 PM, Stefan Bidi <stefanbidi@gmail.com> wrote:
>>
>>> On Tue, Jul 31, 2012 at 12:27 PM, Sebastian Reitenbach
>>> <sebastia@l00-bugdead-prods.de> wrote:
>>>>
>>>> On Tuesday, July 31, 2012 19:06 CEST, David Chisnall <theraven@sucs.org>
>>>> wrote:
>>>>
>>>>> Are you using GNUstep with or without ICU? When you say skipped, is it
>>>>> removed from the destination, or just passed through unmodified? Is your
>>>>> locale set to something that recognises letters with umlauts?
>>>>
>>>> It's with ICU, and I run OGo with
>>>> LC_CTYPE='de_DE.UTF-8'
>>>> so, supposed to recognize Umlauts.
>>>>
>>>> I had some NSLog calls in the GSString lowercase code, and without my
>>>> patch, tolower returns 0 for an umlaut. So the character isn't really
>>>> skipped; o->_contents.c[i] is set to 0 in the middle of the string :(
>>>>
>>>> My patch just checks whether tolower returned 0 and, if so, passes the
>>>> character it cannot handle through unchanged.
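
One thing worth checking here: in C, tolower() is only defined for arguments that are representable as unsigned char (or EOF). Where plain char is signed, a byte of a UTF-8 umlaut such as 0xC3 is negative when passed directly, which is undefined behaviour. Whether that is what produces the 0 on OpenBSD is an assumption, not something verified. A minimal sketch of a byte-wise lowercase that casts first and passes through anything it cannot handle (lower_bytes is a hypothetical name, not the GSString code):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Lowercase len bytes from src into dst, one byte at a time.
 * Hypothetical helper for illustration; not taken from GSString.m. */
static void lower_bytes(char *dst, const char *src, size_t len)
{
    size_t i;
    assert(dst != NULL && src != NULL);
    for (i = 0; i < len; i++) {
        /* Cast so bytes >= 0x80 reach tolower() as 128..255, never negative. */
        unsigned char in = (unsigned char)src[i];
        int out = tolower(in);
        /* Same guard as the patch: never write a 0 into the middle of the string. */
        dst[i] = (out == 0) ? (char)in : (char)out;
    }
}
```

In the default "C" locale this turns the UTF-8 bytes of "TÖst" into those of "tÖst": the two bytes of Ö pass through untouched. That is the most a byte-wise tolower() can do with UTF-8 text; actually lowercasing Ö needs a routine that works on whole characters.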
>>>>
>>>> following ICU is installed:
>>>> $ pkg_info | grep icu4c
>>>> icu4c-4.8.1.1 International Components for Unicode
>>>
>>> Just FYI, GNUstep doesn't use ICU in NSString (David added a GSICUString
>>> class, but it isn't used very often). I looked into it over a year
>>> ago but decided against implementing something. The reason was
>>> because I didn't completely understand the code and at that point I
>>> had already started working on CFString, which I could freely break
>>> without anyone noticing.
>>>
>>> Stef
>>>
>>>>
>>>> gnustep is from the latest releases, using libobjc from gcc 4.2.1, if that
>>>> matters.
>>>>
>>>> Sebastian
>>>>
>>>>
>>>>>
>>>>> David
>>>>>
>>>>> On 31 Jul 2012, at 18:02, Sebastian Reitenbach wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> With OGo, I convert a UTF-8 string to lowercase using -[NSString
>>>>>> lowercaseString].
>>>>>>
>>>>>> When there are umlauts in the string, GNUstep just omits the
>>>>>> character.
>>>>>> I have no idea whether this is actually right or wrong.
>>>>>>
>>>>>> With the attached patch below to GSString it does not omit the character
>>>>>> anymore.
>>>>>>
>>>>>>
>>>>>> gcc -fgnu-runtime -fconstant-string-class=NSConstantString
>>>>>> -I/usr/local/include -L/usr/local/lib -l gnustep-base lowercase.m -o
>>>>>> lowercase
>>>>>>
>>>>>> cat lowercase.m
>>>>>> #import <Foundation/Foundation.h>
>>>>>>
>>>>>>
>>>>>> int main(int argc, char *argv[]) {
>>>>>> NSLog(@"Lowercase: %@", [[NSString stringWithString:@"Töst"]
>>>>>> lowercaseString]);
>>>>>>
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> Does running the above program on a Mac output the ö or omit it from the
>>>>>> string?
>>>>>>
>>>>>> Does it change when running with LC_CTYPE="C" or LC_CTYPE='de_DE.UTF-8'?
>>>>>>
>>>>>> I don't have a Mac, so I cannot test this myself; the approach used
>>>>>> by OGo could also be wrong.
>>>>>> At least the Apple docs say nothing about skipped characters,
>>>>>> only that, e.g., a ß may change to SS when using uppercaseString.
>>>>>> Since they mention the ß in the documentation, I'd expect
>>>>>> lowercaseString to handle umlauts too, or is that just a plain wrong
>>>>>> assumption?
>>>>>>
>>>>>> if someone could hit me with a cluestick please ;)
>>>>>>
>>>>>> cheers,
>>>>>> Sebastian
>>>>>>
>>>>>> the patch to not omit Umlauts.
>>>>>> $OpenBSD$
>>>>>> --- Source/GSString.m.orig Tue Jul 31 18:31:36 2012
>>>>>> +++ Source/GSString.m Tue Jul 31 18:32:24 2012
>>>>>> @@ -3699,6 +3700,8 @@ agree, create a new GSCInlineString otherwise.
>>>>>> while (i-- > 0)
>>>>>> {
>>>>>> o->_contents.c[i] = tolower(_contents.c[i]);
>>>>>> + if (o->_contents.c[i] == 0)
>>>>>> + o->_contents.c[i] = _contents.c[i];
>>>>>> }
>>>>>> o->_flags.wide = 0;
>>>>>> o->_flags.owned = 1; // Ignored on dealloc, but means we own buffer
>>>>>>
>>>>>> _______________________________________________
>>>>>> Discuss-gnustep mailing list
>>>>>> Discuss-gnustep@gnu.org
>>>>>> https://lists.gnu.org/mailman/listinfo/discuss-gnustep
>>>>>
>>>>> --
>>>>> This email complies with ISO 3103
>>>>>
>>>>
>>>
>>
>