bug-cvs

Re: does cvs use iso 8859 or unicode?


From: Mark D. Baushke
Subject: Re: does cvs use iso 8859 or unicode?
Date: Sun, 09 Mar 2003 22:58:23 -0800

Ronald Petty <ron.petty@unigeek.com> writes:

> I am working on a parser for cvs files like somefile,v.  However I read
> the man page for rcs file "man rcsfile" and it says they mainly use iso
> 8859/1 for what they consider visible characters and whitespace.  I was
> wondering if this is true or not.
> 
> Basically when I write this parser, in order to 
> 
>        sym       ::=  {digit}* idchar {idchar | digit}*
> 
>        idchar    ::=  any visible graphic character except special
> 
>        special   ::=  $ | , | . | : | ; | @
> 
> I need to be sure what "any visible graphic character" is.  If cvs only
> uses ascii this would be easy, but I doubt that is the case.

If you look in ccvs/src/rcs.c at the RCS_check_tag() function, you will
see that a tag must begin with a character for which isalpha() is true
and may not contain any character for which isgraph() is false.

According to my system, isgraph() tests for any character for which
ispunct(), isupper(), islower() or isdigit() is true. A
standard-conforming isgraph() tests for any character for which
isalnum() or ispunct() is true, or any character in the current
locale-defined "graph" class which is neither a space nor a character
for which iscntrl() is true.
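
As a rough illustration of the two rules just described (first character
alphabetic, no non-graphic characters), here is a minimal Python sketch.
The names check_tag and is_graph_c_locale are hypothetical and do not
come from the cvs source, and the sketch assumes the "C" locale, where
only ASCII letters are alphabetic and only 0x21-0x7E are graphic. Under
another locale the real isalpha() and isgraph() may accept more.

```python
import string

# Assumption: "C" locale, so only ASCII letters count as alphabetic.
ASCII_ALPHA = frozenset(string.ascii_letters.encode("ascii"))


def is_graph_c_locale(byte: int) -> bool:
    """Approximate isgraph() in the "C" locale: printable ASCII except space."""
    return 0x21 <= byte <= 0x7E


def check_tag(tag: bytes) -> bool:
    """Hypothetical check mirroring only the two rules described above."""
    # Rule 1: the tag must begin with an alphabetic character.
    if not tag or tag[0] not in ASCII_ALPHA:
        return False
    # Rule 2: the tag may not contain any non-graphic character.
    return all(is_graph_c_locale(b) for b in tag)
```

With these assumptions, check_tag(b"rel-1-0") is accepted, while
b"1rel" (leading digit) and b"rel 1" (embedded space) are rejected.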

If you look in ccvs/src/rcs.c at the do_symbols() function, you will see
that it will suck every non-whitespace character up to the ":" character
into the tag symbol, and then take whatever string of characters follows
the ":" up to the next whitespace or ';' character.
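
The scanning behaviour described above could be sketched in Python along
these lines. This is not a translation of do_symbols(); parse_symbols
and RCS_WHITESPACE are hypothetical names, and the sketch assumes the
input is the raw bytes that follow the "symbols" keyword.

```python
# Whitespace bytes: BS, HT, NL, VT, FF, CR, SP (assumed set for this sketch).
RCS_WHITESPACE = b"\x08\t\n\x0b\x0c\r "
COLON = ord(":")
SEMICOLON = ord(";")


def parse_symbols(value: bytes) -> list:
    """Return a list of (tag, revision) byte-string pairs."""
    pairs = []
    i, n = 0, len(value)
    while i < n:
        # Skip whitespace between entries.
        while i < n and value[i] in RCS_WHITESPACE:
            i += 1
        if i >= n or value[i] == SEMICOLON:
            break
        # The tag is every non-whitespace byte up to the ':'.
        start = i
        while i < n and value[i] not in RCS_WHITESPACE and value[i] != COLON:
            i += 1
        tag = value[start:i]
        if i >= n or value[i] != COLON:
            raise ValueError("expected ':' after tag %r" % tag)
        i += 1  # step over the ':'
        # The rest runs up to the next whitespace or ';'.
        start = i
        while i < n and value[i] not in RCS_WHITESPACE and value[i] != SEMICOLON:
            i += 1
        pairs.append((tag, value[start:i]))
    return pairs
```

Working on bytes rather than decoded text keeps the sketch neutral about
the character set, so a tag containing bytes above 0x7f passes through
unchanged.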

However, the semantics of rcs mean that a revision that starts with a
digit will end up being treated as a version number rather than a
symbolic tag name.
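
For a parser, that distinction might be approximated as below. The
function classify_rev_token is a hypothetical helper, and the rule it
applies (look only at the first byte) is a simplification of the
behaviour described above, not something taken from the rcs or cvs
source.

```python
def classify_rev_token(token: bytes) -> str:
    """Guess whether a revision argument is a number or a symbolic tag."""
    if not token:
        raise ValueError("empty revision token")
    # A leading ASCII digit means the token is treated as a revision number.
    if token[:1].isdigit():
        return "revision-number"
    return "symbolic-tag"
```

Note that a token such as b"1abc" is classified as a revision number
here even though the grammar quoted earlier would allow it as a sym.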

As to what is considered whitespace, the spacetab table in ccvs/src/rcs.c
provides that the following characters are whitespace:

  0x08 (BS, aka Control-H, aka BackSpace)
  0x09 (HT, aka Control-I, aka HorizontalTab)
  0x0a (NL, aka Control-J, aka NewLine)
  0x0b (VT, aka Control-K, aka VerticalTab)
  0x0c (NP, aka Control-L, aka NewPage, aka FormFeed)
  0x0d (CR, aka Control-M, aka CarriageReturn)
  0x20 (SP, aka SPace)

In summary, I think your parser should accept any eight-bit character
that is not listed under special or as whitespace when reading an RCS
sym, but you may wish to limit what you pass to any routine that writes
new entries to a more restricted set of characters, such as 8859-1. I do
not believe that cvs at present does anything at all with wide
characters or real UTF-8 or UTF-16 character sets.
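
One way to sketch this "lenient when reading, strict when writing" idea
in Python follows. All names are hypothetical. The strict ranges
(0x21-0x7e and 0xa0-0xff) are an assumption based on the visible
graphic codes that rcsfile(5) gives for ISO 8859/1, and requiring at
least one non-digit byte is my reading of the sym grammar quoted above.

```python
SPECIAL = frozenset(b"$,.:;@")
WHITESPACE = frozenset(b"\x08\t\n\x0b\x0c\r ")
DIGITS = frozenset(b"0123456789")


def is_readable_idchar(byte: int) -> bool:
    """Lenient rule: any eight-bit value that is not special or whitespace."""
    return byte not in SPECIAL and byte not in WHITESPACE


def is_writable_idchar(byte: int) -> bool:
    """Strict rule: additionally limited to assumed 8859-1 visible ranges."""
    visible = 0x21 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF
    return visible and is_readable_idchar(byte)


def _is_sym(sym: bytes, idchar_ok) -> bool:
    # Every byte must pass the rule, and at least one must be a non-digit.
    return (
        all(idchar_ok(b) for b in sym)
        and any(b not in DIGITS for b in sym)
    )


def can_read_sym(sym: bytes) -> bool:
    return _is_sym(sym, is_readable_idchar)


def can_write_sym(sym: bytes) -> bool:
    return _is_sym(sym, is_writable_idchar)
```

With this split, a sym containing a control byte such as b"\x01x" is
still accepted on input but refused on output, while b"caf\xe9" passes
both checks.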

I hope the above is useful to you.

        Good luck,
        -- Mark



