Re: EOL: unix/dos/mac

From: Stephen J. Turnbull
Subject: Re: EOL: unix/dos/mac
Date: Wed, 27 Mar 2013 03:12:11 +0900

Eli Zaretskii writes:
 > > From: "Stephen J. Turnbull" <address@hidden>

 > > Currently NLFs *are* displayed, if they don't match the default for
 > > the buffer.
 > No, they are displayed because nothing other than a single LF is
 > treated like NLF by the Emacs internals.

Emacs doesn't get to define NLF; it's a Unicode concept.  You'll get
in trouble if you get confused about that.  Those *are* NLFs, and in
the "CR in *-unix buffer" form they *are* displayed as "^M"s, while in
the "bare LF in *-dos buffer" form they *do* appear as stair-stepping
lines.  That does bother some users, including some who understand why
it happens.

 > > Because you have to fix pretty much everything
 > I'm probably missing something important, because things I think will
 > need fixing are nowhere near "pretty much everything".  How about
 > posting a long enough list of things to fix to convince me that
 > "pretty much everything" is close to the truth?

"Everything" is of course an exaggeration.  At a minimum, you need to
change delete and motion commands to handle the fact that EOL doesn't
have a constant width in characters.  Should users be able to move
*into* a CRLF in a -unix buffer?  How about a -dos buffer?  Should
forward-char-command move into or *over* a CRLF?  Does it matter what
the EOL convention is for that buffer?  What are we going to do for
the occasional user who wants the less usual behavior for some reason?
You need to decide what (insert "\015") means in a -dos buffer, and
you can be pretty sure that some users will be confused whichever you
choose.  Ditto (insert "\012") in a -mac buffer.  You may very well
want those to mean something different from the commands that
self-insert either or both of those characters.  Until now,
skip-chars-forward and regexps would find EOL if the string defining
the target contained "\n".  Is that going to continue to be true?  How
do you propose to find a bare LF -- are we going to make users use
octal or hex escapes, or do we define new string syntax?
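
To make the motion question concrete, here is a minimal sketch in Python
(not Emacs code).  The function name and the atomic_crlf switch are
hypothetical, invented only to show the two candidate behaviors for a
forward-char style command when a CRLF is kept in the text:

```python
def forward_char(text: str, pos: int, atomic_crlf: bool) -> int:
    """Return the position one step forward from pos.

    atomic_crlf is a hypothetical policy switch: when True, a CR LF pair
    is stepped over as one unit; when False, the position can land
    between the CR and the LF.
    """
    if pos >= len(text):
        return pos  # already at the end of the text
    if atomic_crlf and text.startswith("\r\n", pos):
        return pos + 2  # move *over* the two-character line break
    return pos + 1  # ordinary single-character step


text = "ab\r\ncd"
over = forward_char(text, 2, atomic_crlf=True)   # ends up after the LF
into = forward_char(text, 2, atomic_crlf=False)  # ends up between CR and LF
```

Every motion and deletion primitive would need an equivalent decision, and
the sketch does not say which choice is right for which kind of buffer.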

 > > Code will be massively uglified with tests for variable-length
 > > sequences instead of single characters
 > The code is already replete with that, ever since Emacs started using
 > a multi-byte representation for characters in buffers.  We have a set
 > of macros to fetch and examine multi-byte sequences, for that reason.
 > I see nothing hard or "ugly" here, sorry.

Ah, but this is a completely different story.  Those are C macros, and
they are not visible to Lisp programs, which know that a line break
is represented by a single character, U+000A.  That's no longer true
for NLF, which by definition is composed of one or more *characters*,
not code units.  It's *Lisp* code that has to deal with this.
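
The distinction can be illustrated outside Emacs.  The following is a
small Python sketch, offered only as an analogy: Python strings, like
Lisp strings, are indexed by character, so the multi-byte encoding is
hidden but a CRLF is not.  The helper name utf8_code_units is made up for
the example.

```python
def utf8_code_units(s: str) -> int:
    """Number of UTF-8 code units (bytes) needed to encode s."""
    return len(s.encode("utf-8"))


e_acute = "\u00e9"   # one character, stored as two UTF-8 code units
crlf = "\r\n"        # one line break, but two characters

# The encoding detail is invisible at the string level ...
e_acute_chars = len(e_acute)
e_acute_units = utf8_code_units(e_acute)

# ... whereas the two-character line break is fully visible to any code
# that counts, indexes, or steps through characters.
crlf_chars = len(crlf)
first_of_crlf = crlf[0]
```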

 > > Any code handling old-style hidden lines (with CR marking
 > > "invisible" lines) will have to be changed.
 > First, we want to deprecate and remove this feature anyway (there's
 > already an implemented alternative).  And second, we already handle
 > this today so that we don't display ^M there; the same method can be
 > used for the other NLFs.

Sorry, that breaks immediately.  That ^M is now an NLF, and you either
treat it that way and not as an invisibility marker, or the meaning of
the buffer changes when you switch that mode on and off in a very
delicate way.  I'm pretty sure it will corrupt the buffer unless you
mark preexisting ^Ms as NLFs or convert them to something else.  Which
is what I'm proposing, of course.

So you can fall back on deprecation.  Has the feature actually been
scheduled for deprecation and eventual removal?  If not, you're
looking at 5-10 years before it gets removed.

 > If the problem _is_ significant, we might as well solve it The
 > Right Way, instead of applying more and more band-aid.  Conversion
 > of NLFs to a single LF is a kludge,

Not to mention a close approximation to the right way to handle them
according to the Unicode standard under many circumstances.  (The
truly correct way to handle them is to substitute LINE SEPARATOR, as I
mentioned earlier.)

 > You cannot do such conversion efficiently if you need to discover
 > the EOL format for every line.

Of course you can.  You don't need to "discover" the EOL format; you
know that an EOL is anything that matches "\r\n\|\r\|\n\|\205" as you
move forward through the buffer.  It's only a tiny bit more expensive
than current conversion for -dos or -mac, and those are hardly
prohibitive, especially when compared to I/O itself.
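
As a rough sketch of that single forward pass, here is the same pattern
in Python rather than Emacs Lisp.  Python's re module writes alternation
as "|" instead of "\|", and "\x85" is the same character as octal "\205"
(U+0085, NEL).  The function name normalize_nlf is an assumption made for
the example, not an existing Emacs or Python API.

```python
import re

# "\r\n" has to come first so that a CRLF is matched as one line break
# rather than as a CR followed by a separate LF.
NLF = re.compile("\r\n|\r|\n|\x85")


def normalize_nlf(text: str) -> str:
    """Replace every NLF in text with a single LF in one pass."""
    return NLF.sub("\n", text)


mixed = "one\r\ntwo\rthree\nfour\x85five"
normalized = normalize_nlf(mixed)
```

No per-line discovery of the EOL format is involved; each match is
decided locally as the scan moves forward.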

 > What it adds doesn't seem so frightening to me, certainly less so
 > than, say, adding bidi support ;-)

Agreed, but irrelevant.  bidi is a new feature necessary to support
some languages currently used by millions of people, and the hairiness
is mandated by UAX #9 -- an alternative implementation is not going to
make conformance much easier.  What we're talking about here are
alternative implementations of a much smaller feature, NLF, and which
one is going to be more efficient and more natural for Emacs.

 > The internal representation is still exposed, so nothing's changed in
 > that department.

I know, and taking advantage of that exposure still falls in the class
of "Kids, these stunts are performed by trained professionals.  Don't
try this at home!"  Can you deny that?

 > > I think you're hearing monsters in the closet.
 > And I think _you_ are hearing them.

Well, yes, I am.  But I've worked with implementations of coding
systems in both XEmacs and Python, and I know that what I'm talking
about will work and be efficient, and buffers and strings will
continue to conform to the Emacs model.  I know that what you're
talking about will break some invariants for character motion and
editing at line end, and that worries me.  Proof?  You're right, I
have none.  By the same token, you don't either.  What worries me is
that while I can prove (or perhaps disprove) my point with a small set
of unit tests and benchmarks, you will have to hand that version of
Emacs to real users for a year or three to find out if anybody really
cares that the model broke.

 > Or maybe you will show me such a large list of things that will
 > become broken by keeping NLFs that I will change my mind.

I can't; I gave you my list already, and I grant that it's not all
that long and several of the potential problems can't be confirmed at
this point.  But if you decide to keep NLFs in the buffer rather than
conforming to the tried and true Emacs/Mule model of converting them
to a one-character representation, I predict you will find plenty of
breakage over the years, just as the \201 bug regressed multiple times
over something like a decade.

It's true that keeping NLFs in the buffer will bring Emacs's internal
representation into closer conformance with the Unicode Standard, but
both the benefits and the costs of that are unclear to me.  Sure, it
makes it conceptually straightforward to support Unicode handling of
NLF in regexps, but you can already do that by simply avoiding EOL
conversion when you need highly accurate Unicode conformance.  On the
other hand, when you are treating NLFs as NLFs, you will be breaking
the 40-year-old Emacs model of a linebreak marked by a single
character.  I don't know what trouble that will cause, but there's no
easy workaround for it that preserves those NLFs.
