lynx-dev

Re: lynx-dev Japanese & spaces in forms-options menu


From: Leonid Pauzner
Subject: Re: lynx-dev Japanese & spaces in forms-options menu
Date: Tue, 22 Sep 1998 17:00:27 +0400 (MSD)

> Conclusion is that for Chinese and Japanese, interword space is
> meaningless.  Therefore introducing a space at a line wrap in the
> source means an added artifact that the author did not intend.  I
> would agree that the correct way to handle it would be to go on the
> basis of the character set of the document, but there are just too
> many unlabeled documents out there.  If someone has their display
> charset set for Japanese or Chinese, then I assume they plan to
> read Japanese or Chinese and don't want extra spaces thrown in.  If
> someone wants to read a language that requires interword space, they
> should set their display charset to that language.

This is generally right, but 7-bit US-ASCII is a special case
that should be recognized automatically.
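
A minimal sketch of such automatic recognition, not taken from the Lynx
sources: the function name looks_like_us_ascii() is hypothetical, and
treating ESC (0x1B) as "not plain ASCII" is an assumption made here because
ISO 2022 based encodings such as ISO-2022-JP are 7-bit but switch sets with
escape sequences.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a Lynx function): return 1 if the buffer
 * contains only 7-bit bytes and no ESC, 0 otherwise.
 * Assumption: ESC (0x1B) signals an ISO 2022 escape sequence, so a
 * buffer containing it is not treated as plain US-ASCII.
 */
int looks_like_us_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80 || buf[i] == 0x1B)
            return 0;
    }
    return 1;
}
```

A 7-bit encoding that uses only printable bytes for switching (HZ, for
example) would still pass this test, so it is only a first filter.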

> I made the patch, and it was incorporated in the "1998-08-29 (2.8.1dev.23)"
> CHANGES: "* don't replace '\n' with ' ' if Chinese or Japanese - HN".
> If someone has the interest and the programming skills, then a "better"
> alternative I suppose would be to test if Lynx is really getting a multi-
> byte stream or not, and only _not_ add the space if that's true.  Seems to
> add more complexity than necessary, but the "big two" do handle interword
> space correctly in mixed documents, and it's nice I admit.

> __Henry

The simplest way is to check whether the previous character is in the
range 20-7E.  But EUC-JP seems to use "ISO 2022 rules" to select whether a
20-7E byte is US-ASCII, depending on the SS2 and SS3 flags...
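
The previous-character check could be sketched roughly as below.  This is
an illustration only, not the patch mentioned above: the names
is_ascii_printable() and join_wrapped_lines() are hypothetical, and the
code looks at single bytes without decoding the charset.  One known
limitation of looking at a single byte: Shift_JIS trail bytes can fall in
40-7E, so a kanji may be mistaken for ASCII there.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if c is in the US-ASCII range 0x20-0x7E. */
int is_ascii_printable(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/*
 * Copy src to dst, treating each '\n' as a line wrap:
 *   - previous byte in 0x20-0x7E: emit a space in its place
 *   - otherwise (start of text or a non-ASCII byte): drop it
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the final '\0'.
 */
size_t join_wrapped_lines(const char *src, char *dst)
{
    size_t n = 0;
    unsigned char prev = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char) *src;

        if (c == '\n') {
            if (is_ascii_printable(prev))
                dst[n++] = ' ';
            continue;           /* prev is left unchanged */
        }
        dst[n++] = (char) c;
        prev = c;
    }
    dst[n] = '\0';
    return n;
}
```

With this, "foo\nbar" becomes "foo bar", while a wrap between two 8-bit
bytes is removed without adding a space.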

Quoted from IANA character sets list:



Name: JIS_Encoding
MIBenum: 16
Source: JIS X 0202-1991.  Uses ISO 2022 escape sequences to
        shift code sets as documented in JIS X 0202-1991.
Alias: csJISEncoding

Name: Shift_JIS  (preferred MIME name)
MIBenum: 17
Source: A Microsoft code that extends csHalfWidthKatakana to include
        kanji by adding a second byte when the value of the first
        byte is in the ranges 81-9F or E0-EF.
Alias: MS_Kanji
Alias: csShiftJIS

Name: Extended_UNIX_Code_Packed_Format_for_Japanese
MIBenum: 18
Source: Standardized by OSF, UNIX International, and UNIX Systems
        Laboratories Pacific.  Uses ISO 2022 rules to select
               code set 0: US-ASCII (a single 7-bit byte set)
               code set 1: JIS X0208-1990 (a double 8-bit byte set)
                           restricted to A0-FF in both bytes
               code set 2: Half Width Katakana (a single 7-bit byte set)
                           requiring SS2 as the character prefix
               code set 3: JIS X0212-1990 (a double 7-bit byte set)
                           restricted to A0-FF in both bytes
                           requiring SS3 as the character prefix
Alias: csEUCPkdFmtJapanese
Alias: EUC-JP  (preferred MIME name)

Name: Windows-31J
MIBenum: 2024
Source: Windows Japanese.  A further extension of csShiftJIS
        to include several OEM-specific kanji extensions.
        Like csShiftJIS, it adds a second byte when the value
        of the first byte is in the ranges 81-9F or E0-EF.
        PCL Symbol Set id: 19K
Alias: csWindows31J

Name: GB2312  (preferred MIME name)
MIBenum: 2025
Source: Chinese for People's Republic of China (PRC) mixed one byte,
        two byte set:
          20-7E = one byte ASCII
          A1-FE = two byte PRC Kanji
        See GB 2312-80
        PCL Symbol Set Id: 18C
Alias: csGB2312

Name: HZ-GB-2312
MIBenum: 2085
Source: RFC 1842, RFC 1843                              [RFC1842, RFC1843]

Name: Big5  (preferred MIME name)
MIBenum: 2026
Source: Chinese for Taiwan Multi-byte set.
        PCL Symbol Set Id: 18T
Alias: csBig5


[RFC1468]  Murai, J., Crispin, M., and E. van der Poel, "Japanese
           Character Encoding for Internet Messages", RFC 1468,
           Keio University, Panda Programming, June 1993.

[RFC1554]  Ohta, M., and K. Handa, "ISO-2022-JP-2: Multilingual
           Extension of ISO-2022-JP", RFC 1554, Tokyo Institute of
           Technology, ETL, December 1993.

[RFC1557]  Choi, U., Chon, K., and H. Park, "Korean Character Encoding
           for Internet Messages", RFC 1557, KAIST, Solvit Chosun Media,
           December 1993.

[RFC1815]  Ohta, M., "Character Sets ISO-10646 and ISO-10646-J-1",
           RFC 1815, Tokyo Institute of Technology, July 1995.

[RFC1842]  Wei, Y., J. Li, and Y. Jiang, "ASCII Printable
           Characters-Based Chinese Character Encoding for Internet
           Messages", RFC 1842, Harvard University, Rice University,
           University of Maryland, August 1995.

[RFC1843]  Lee, F., "HZ - A Data Format for Exchanging Files of
           Arbitrarily Mixed Chinese and ASCII Characters", RFC 1843,
           Stanford University, August 1995.
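
The byte ranges in the Shift_JIS and EUC-JP entries quoted above can be
turned into small classification helpers.  The sketch below is only an
illustration with hypothetical function names.  It assumes the usual
packed form of EUC-JP, where SS2 is the byte 0x8E, SS3 is the byte 0x8F,
and the bytes of code sets 1-3 lie in A1-FE (the listing says A0-FF), so
that every byte below 0x80 is US-ASCII.

```c
#include <stddef.h>

#define SS2 0x8E    /* single shift 2: prefix for half-width katakana */
#define SS3 0x8F    /* single shift 3: prefix for JIS X 0212 */

/* Shift_JIS: a second byte follows a first byte in 81-9F or E0-EF. */
int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF);
}

/*
 * EUC-JP: number of bytes in the character that starts with byte c.
 * Invalid lead bytes count as 1 so that a scan can resynchronize.
 */
size_t eucjp_char_len(unsigned char c)
{
    if (c == SS2)
        return 2;               /* code set 2: SS2 + one byte */
    if (c == SS3)
        return 3;               /* code set 3: SS3 + two bytes */
    if (c >= 0xA1 && c <= 0xFE)
        return 2;               /* code set 1: two 8-bit bytes */
    return 1;                   /* code set 0 (ASCII) or invalid */
}

/*
 * Walk an EUC-JP buffer from its start and return 1 if the last
 * character is a single US-ASCII byte, 0 otherwise (also for len 0).
 */
int eucjp_last_char_is_ascii(const unsigned char *buf, size_t len)
{
    size_t i = 0, last = 0;
    int seen = 0;

    while (i < len) {
        last = i;
        seen = 1;
        i += eucjp_char_len(buf[i]);
    }
    return seen && buf[last] < 0x80;
}
```

Under these assumptions a plain "is the previous byte in 20-7E" test
already gives the right answer for EUC-JP, whereas Shift_JIS needs a scan
from a known character boundary because its second bytes overlap ASCII.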



