
From: Glenn Morris
Subject: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Sat, 20 Jun 2015 19:34:01 -0400
User-agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)

I spent some time looking at some of these.
In no case could I see a clear path from the inputs to the outputs.

Eli Zaretskii wrote:

>   . characters.el:
>
>     . The modify-category-entry calls -- they basically can be derived
>       from Blocks.txt

I looked at it briefly. I can see that they are somewhat related, but
not precisely how. Eg:

Emacs: 2E80:312F and 3190:33FF are "line breakable".
Which means that "Hangul Compatibility Jamo" (3130..318F), which sits in
the gap between those two ranges, isn't. I have no idea why.

Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
Which means that "Yijing Hexagram Symbols" (4DC0..4DFF), again in the gap,
aren't. Again, I have no idea why.

I didn't look any further.
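
To make the comparison above concrete, here is a minimal Python sketch that
parses Blocks.txt-style lines and reports which blocks fall into the gaps
between a set of ranges. The sample text is only a small excerpt in the
Blocks.txt format, and the helper names (parse_blocks, blocks_in_gaps) are
hypothetical, not anything in Emacs or its admin scripts.

```python
# Small excerpt in the Blocks.txt format ("START..END; Block Name").
# The real file lists every block; this sample only covers the area discussed.
BLOCKS_SAMPLE = """\
2E80..2EFF; CJK Radicals Supplement
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
"""

# Ranges quoted in the message for the two Emacs categories.
LINE_BREAKABLE = [(0x2E80, 0x312F), (0x3190, 0x33FF)]
TWO_BYTE_HAN = [(0x3400, 0x4DBF), (0x4E00, 0x9FAF)]


def parse_blocks(text):
    """Return (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return blocks


def blocks_in_gaps(blocks, ranges):
    """Return names of blocks that lie inside the overall span of RANGES
    but overlap none of the individual ranges."""
    lowest = min(lo for lo, _ in ranges)
    highest = max(hi for _, hi in ranges)
    return [
        name
        for start, end, name in blocks
        if lowest <= start and end <= highest
        and not any(start <= hi and lo <= end for lo, hi in ranges)
    ]


if __name__ == "__main__":
    blocks = parse_blocks(BLOCKS_SAMPLE)
    print(blocks_in_gaps(blocks, LINE_BREAKABLE))
    print(blocks_in_gaps(blocks, TWO_BYTE_HAN))
```

Run against the full Blocks.txt, the same check would list every block that a
given category skips, which at least shows where the Emacs ranges and the
block boundaries disagree. It does not explain why.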

>     . The modify-syntax-entry and set-case-syntax calls can be derived
>       from the values of the 'general-category' property returned by
>       'get-char-code-property', perhaps augmented by 'paired-bracket'
>       and 'paired-type' properties

I didn't look at this yet.

>     . The set-case-syntax-pair calls (perhaps use the data in
>       CaseFolding.txt, or even the case mapping information in
>       UnicodeData.txt)

I didn't look at this yet.

>     . The setup of char-width-table -- I think the information is in
>       EastAsianWidth.txt, with background information described in
>       UAX#11 (http://www.unicode.org/reports/tr11/)

Looks somewhat promising, but could you be more specific?
There's nothing in that file that defines "zero width" characters, so I
don't see where Emacs's width 0 characters come from.

The width 2 characters look like they might be the "W" and "F" characters,
but just doing that gives a list that differs in many places from the list
Emacs uses.
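
As a rough illustration of "just doing that", the following Python sketch
collects contiguous ranges of characters whose East Asian Width is "W" or
"F". It uses the standard unicodedata module instead of parsing
EastAsianWidth.txt directly, so the result depends on the Unicode version
bundled with the Python interpreter, which may not match the data file in
the Emacs tree. The function name wide_ranges is hypothetical.

```python
import unicodedata


def wide_ranges(first=0, last=0x10FFFF):
    """Return contiguous (start, end) code point ranges in [first, last]
    whose East Asian Width property is W (wide) or F (fullwidth)."""
    ranges = []
    start = None
    for cp in range(first, last + 1):
        wide = unicodedata.east_asian_width(chr(cp)) in ("W", "F")
        if wide and start is None:
            start = cp                      # a new wide run begins
        elif not wide and start is not None:
            ranges.append((start, cp - 1))  # the current run just ended
            start = None
    if start is not None:                   # run extends to the end of the scan
        ranges.append((start, last))
    return ranges


if __name__ == "__main__":
    for lo, hi in wide_ranges(0x2E80, 0x33FF):
        print(f"{lo:04X}..{hi:04X}")
```

The printed ranges could then be diffed against the width 2 entries in
characters.el. Note that this covers only the width 2 half of the question;
the property carries no "zero width" value.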

>     . The setup of char-acronym-table: at least some of the data is in
>       NameAliases.txt and NameList.txt

Looks somewhat promising.
I can see how most of this comes from NameAliases.txt.
But there are many oddities:

Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
or EOL)?
0019 is EOM in the source but EM in Emacs.

0080 is PAD in the source but XXX in Emacs.
0081 is HOP in the source but XXX in Emacs.
008F is SS3 in the source but SS1 in Emacs.
0099 is SGC in the source but XXX in Emacs.

How does Emacs choose which entries to list? There are many more in the
source. Could it do any harm to add more?

Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.
Why does Emacs list two Khmer entries, and nothing else? There are loads
more of them.
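
A comparison like the one above can be sketched in Python as follows. The
sample is a short excerpt in the NameAliases.txt format ("CODE;ALIAS;TYPE"),
and the "reported" table holds only the Emacs values quoted in this message;
it is not extracted from characters.el. The names parse_abbreviations and
mismatches are hypothetical.

```python
# Excerpt in the NameAliases.txt format: "CODE;ALIAS;TYPE".
NAME_ALIASES_SAMPLE = """\
0009;CHARACTER TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
0019;END OF MEDIUM;control
0019;EOM;abbreviation
0080;PAD;abbreviation
0081;HOP;abbreviation
008F;SS3;abbreviation
0099;SGC;abbreviation
"""

# Emacs values as quoted in the message (None means "no entry").
EMACS_REPORTED = {
    0x09: None,
    0x19: "EM",
    0x80: "XXX",
    0x81: "XXX",
    0x8F: "SS1",
    0x99: "XXX",
}


def parse_abbreviations(text):
    """Map each code point to its list of 'abbreviation' aliases."""
    abbrevs = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code, alias, kind = line.split(";")
        if kind == "abbreviation":
            abbrevs.setdefault(int(code, 16), []).append(alias)
    return abbrevs


def mismatches(abbrevs, reported):
    """Return {code point: (reported value, source aliases)} for every code
    point whose reported value is not one of the source abbreviations."""
    return {
        cp: (reported[cp], names)
        for cp, names in sorted(abbrevs.items())
        if cp in reported and reported[cp] not in names
    }


if __name__ == "__main__":
    for cp, (emacs, source) in mismatches(
        parse_abbreviations(NAME_ALIASES_SAMPLE), EMACS_REPORTED
    ).items():
        print(f"{cp:04X}: Emacs={emacs!r} source={source}")
```

With the full NameAliases.txt and the real table as input, the same diff
would give a complete list of the oddities instead of a hand-picked few.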

>   . fontset.el:
>
>     . The setup of script-representative-chars

I don't see how. It seems to be "for some of, but not all, the entries
in char-script-table, choose a single character somewhere in the range."
There seems to be no pattern to how the character is chosen within the
range. Often the first one, but by no means always.

>   . mule-cmds.el:
>
>     . The setting of locale-language-names -- the data is available in
>       IANA's Language Subtag Registry
>       (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
>       and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
>       http://www.loc.gov/standards/iso639-2/php/English_list.php)

Again, I don't see how. Eg nowhere in those source files do I see Welsh
associated with iso-8859-14, and the comment in mule-cmds says that the
last part is "implementation dependent".

> P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> reminder to fetch all those reference files and regenerate their
> dependencies, before we prepare a release.

admin/FOR-RELEASE contains that kind of thing.
