lilypond-devel

## Re: LilyPond strings and \markup

From: Joe Neeman
Subject: Re: LilyPond strings and \markup
Date: Mon, 17 Aug 2009 20:47:40 +1000

On Mon, 2009-08-17 at 00:59 -0700, Mark Polesky wrote:
> Carl Sorensen wrote:
> > Because \[alphanum]+ is a STRING_IDENTIFIER.
>
> Not always:
>
> num = 1
> sym = #'symbol
>
> % error: syntax error, unexpected NUMBER_IDENTIFIER (\num)
> % \num
>
> % error: syntax error, unexpected SCM_IDENTIFIER (\sym)
> % \markup \sym
>
>
> > Remember, these are *parser* error messages.  They come from
> > interpreting the symbols in the input stream, not from
> > evaluating the variables resulting from parsing the input
> > stream.
> >
> > The error message happens when the parser sees the text, not
> > when the parser evaluates the text.
>
> If that were true, then wouldn't the NUMBER_ and SCM_IDENTIFIERs
> above all be STRING_IDENTIFIERs?

You're right, the error does depend on the type of the variable. More
on that below.

>
> Carl, you're being a good sport trying to help me understand
> all of this. But it's just very counterintuitive to me. And it
> continues to be. I'm repeatedly told to look at parser.yy and
> lexer.ll (or whatever they are). But most of that code is too
> unfamiliar for me to understand.

Those files are written in bison (parser.yy) and flex (lexer.ll). It's
probably worth your while to learn a little bit of those two languages.
I only know the basics myself, but it's enough to understand about 95%
of those two files.

> At first I thought there was such a thing as a LilyPond STRING, as
> if it were a datatype in the traditional sense. But then I find
> out that LilyPond STRINGs have different rules when they're in
> \markup, and glancing at the lexer/parser code, it looks like
> there's yet another set of rules for lyrics.

It's probably not a good idea to think of STRING (as it appears in the
parser/lexer) as being any particular data type. The STRING lexer token
is used for several purposes in lilypond, including for identifiers
(which can obviously be associated with many differently-typed objects).

> I'm trying to find accurate definitions for these basic concepts,
> as in my initial attempt:
>
> A valid LilyPond unquoted 'STRING':
> 1) must be entirely alphabetic, and
> 2) cannot be interpreted as a number, pitch, rest, or operator.

What do you mean by STRING? In
foo = bar
are both "foo" and "bar" strings? I would call foo an "identifier" and
bar an "rvalue" (the right-hand side of the assignment). FWIW, the
parser calls bar an "identifier_init", and its rules are defined on
line 568.

> And then there's the whole thing where # and \ seem to behave the
> same way, but only some of the time.

The handling of # is (in some sense) very simple. You just run the guile
parser on whatever follows the #, and return the result (as a SCM_TOKEN
lexer token, which will be reinterpreted by the parser as an
embedded_scm). Now here's something fairly general that might be
helpful: most of the tokens (declared on lines 255-440 of the parser)
have type SCM. For example, numbers and strings are both stored as SCM.
In other words, in
foo = "bar"
the quoted string "bar" is parsed as a string (lexer token STRING). It
is then converted into a SCM data type and the name foo is associated
with that chunk of scheme.
On the other hand, the input
foo2 = #"bar"
results in the guile interpreter being called on the string "bar". Guile
does its magic and returns a SCM data type. Because "bar" represents a
string (in guile's language, scheme), guile happens to return a chunk of
SCM representing a string. But lilypond has no idea what followed the #
mark; it could just as well have been
foo2 = #(do-some-funky-scheme-stuff-and-return "bar")
Anyway, now that both statements have been parsed and their values assigned
to identifiers, dereferencing foo is _exactly_ the same as dereferencing
foo2, since they both refer to a piece of SCM that represents the string
"bar". In particular, they both complain about unexpected
STRING_IDENTIFIER if used in the wrong place.
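
As a rough illustration of the point above, here is a toy model in Python (not the real flex/bison code). It assumes that the token kind returned for \name is chosen from the type of the value stored under that name, which is what the error messages quoted earlier suggest. The names `variables`, `assign`, `identifier_token` and `Symbol` are made up for this sketch; only the three token names come from the thread.

```python
# Toy model only: this is NOT LilyPond's lexer, just a sketch of the idea
# that the identifier token kind follows the type of the stored value.

class Symbol:
    """Hypothetical stand-in for a Scheme symbol such as #'symbol."""
    def __init__(self, name):
        self.name = name

variables = {}

def assign(name, value):
    # Models "name = value": by this point the value is plain Scheme data,
    # however it was originally written in the input.
    variables[name] = value

def identifier_token(name):
    # Models looking up \name and picking a token kind from the value's type.
    value = variables[name]
    if isinstance(value, str):
        return "STRING_IDENTIFIER"
    if isinstance(value, (int, float)):
        return "NUMBER_IDENTIFIER"
    # Everything else is treated as opaque Scheme data in this sketch.
    return "SCM_IDENTIFIER"

assign("foo", "bar")             # foo  = "bar"
assign("foo2", "bar")            # foo2 = #"bar" (guile also yields a string)
assign("num", 1)                 # num  = 1
assign("sym", Symbol("symbol"))  # sym  = #'symbol

for name in ("foo", "foo2", "num", "sym"):
    print(name, identifier_token(name))
```

In this model foo and foo2 are indistinguishable once assigned, while num and sym produce different token kinds, matching the three error messages seen in the examples.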

> The fact that there's more
> than one way to define something, and more than one way to refer
> to something, inevitably leads to confusion.

I suspect the confusion mostly arises from the fact that we aren't very
consistent on whether we require, say, a number or a piece of SCM that
represents a number. For example, line 1263 defines an override as
OVERRIDE simple_string property_path '=' embedded_scm
whereas line 1312 has
OVERRIDE context_prop_spec property_path '=' scalar
and so the first one requires # whereas the second one will accept
\override Grob #'prop = 3
I don't know the reason for the discrepancy, but I suspect it could be
changed without breaking anything.
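
The discrepancy can be pictured with a small Python sketch. It assumes, purely for illustration, that scalar accepts everything embedded_scm accepts plus bare numbers and strings. The token kinds "NUMBER" and "STRING" and the two helper functions are hypothetical and not taken from parser.yy.

```python
# Toy model: which right-hand-side token kinds each kind of rule accepts.
# Assumption (not checked against parser.yy): scalar is a superset of
# embedded_scm.
EMBEDDED_SCM = {"SCM_TOKEN"}                  # anything introduced by '#'
SCALAR = EMBEDDED_SCM | {"NUMBER", "STRING"}  # also a bare 3 or "text"

def accepts_embedded_scm(token_kind):
    # Models a rule ending in: '=' embedded_scm
    return token_kind in EMBEDDED_SCM

def accepts_scalar(token_kind):
    # Models a rule ending in: '=' scalar
    return token_kind in SCALAR

# In this model "= #3" arrives as SCM_TOKEN and "= 3" as NUMBER.
for kind in ("SCM_TOKEN", "NUMBER"):
    print(kind, accepts_embedded_scm(kind), accepts_scalar(kind))
```

Under that assumption, switching a rule from embedded_scm to scalar only widens what it accepts, which is why the change looks safe for existing input.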

> And then # is by turns required, optional, and invalid...
>
> % ex.1.   # required:
> \markup { left \hspace #1 right }

The markup parser only distinguishes between embedded_scm (i.e. something
with #), markup and lists. The \hspace command expects embedded_scm followed
by a markup. Perhaps markups could be relaxed to expect scalars rather than
embedded_scm.

> % ex.2.   # optional:
> one = #1

Right, because of the rule for identifier_init, which will match a wide
variety of things and convert them to SCM.

> % ex.3.   # invalid
> \markup #1

Because we're seeing an embedded_scm as the argument to \markup, rather
than an actual markup.

> To mix things up even further, quotes are optional in ex.3 and
> invalid in ex.1.

Because ex.1 expects embedded_scm. Note that the parser would accept
#"1" because it doesn't check what data type the SCM takes, but lilypond
will crash later on because it expects an SCM integer, not an SCM string.
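
The split between the two checks can be sketched in Python. This is a made-up model, not LilyPond code: `parse_hspace_arg` stands in for the grammar, which only sees that it received an embedded_scm, and `hspace` stands in for the markup command, which later needs a number.

```python
# Toy model of "accepted by the parser, rejected later".
# Both function names are hypothetical.

def parse_hspace_arg(token_kind, value):
    # The grammar step checks only the token kind, never the Scheme type.
    if token_kind != "SCM_TOKEN":
        raise SyntaxError("unexpected " + token_kind)
    return value

def hspace(amount):
    # The command itself is where the type finally matters.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise TypeError("hspace expects a number, got %r" % (amount,))
    return ("hspace", amount)

ok = hspace(parse_hspace_arg("SCM_TOKEN", 1))   # models \hspace #1

parsed = parse_hspace_arg("SCM_TOKEN", "1")     # models \hspace #"1": parses
try:
    hspace(parsed)                              # ...but fails when used
    failed_late = False
except TypeError:
    failed_late = True
```

Here the string passes the first function untouched and is rejected only by the second, mirroring a parse that succeeds followed by a failure further down the line.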

> Come on, you have to admit that this is terribly confusing! And as
> far as I can tell, none of this is satisfactorily documented. I'm
> trying to do something about it, but my goodness, I've already hit
> the wall on my first datatype.

I suspect that much of the confusion could be alleviated by modifying
the parser to allow scalar where embedded_scm is currently required.
That would remove most (all?) of the cases in which # is required. It's
possible, however, that this might introduce ambiguities in the parser.

Joe