Standardizing simple markup for translatable strings

bug-gettext

From:	Lasse Collin
Subject:	Standardizing simple markup for translatable strings
Date:	Wed, 24 Apr 2024 20:39:03 +0300

Background
----------

The --help text in many command line tools uses hardcoded line breaks.
These are simple in code but many translators have problems with them:
lines might become longer than 80 columns and alignment can be all over
the place. (In some cases even the original English text might be less
than ideal, for example, "wget --help" in GNU wget 1.24.5.)

For packages that care about keeping things polished even in translated
form, this creates a significant amount of extra work to get
translators to edit their strings. It's annoying for both the
translators and the package maintainers.

This has made me wish for an automatic word wrapping method that is
still simple enough in code. That is, it might not support all
languages properly but if the majority can work well then it would be a
huge improvement already. The remaining languages would still work with
manual word wrapping.

GNU argp has had word wrapping support for --help text for ages.
However, argp does quite a few things and sometimes having just the
word wrapping part could be convenient.


Feature wishes
--------------

The kind of word wrapping code I have played around with considers
only spaces as line-breaking opportunities (LBOs). Hyphens (or any
other characters) aren't LBOs because often enough hyphens are used in
command line tool output in a way where it's clearer if a line break
doesn't occur there.
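
To make the "only spaces are LBOs" rule concrete, here is a minimal
sketch in C. It is only an illustration: the function name wrap_spaces
is made up, the greedy fill strategy is an assumption, and the code
counts one byte as one column, so it is only correct for plain ASCII
text. Existing newlines in the input are not treated specially.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: wrap `text` so that lines are at most `width`
 * columns, breaking only at ASCII spaces. Hyphens and other characters
 * are never break points. A single word longer than `width` is placed
 * on a line of its own and overflows.
 *
 * Assumption: one byte == one column (plain ASCII only).
 *
 * Returns 0 on success, -1 if `out` is too small.
 */
static int wrap_spaces(const char *text, size_t width,
                       char *out, size_t out_size)
{
    size_t col = 0; /* columns used on the current output line */
    size_t n = 0;   /* bytes written to out so far */

    if (out_size == 0)
        return -1;

    while (*text != '\0') {
        size_t len;
        size_t sep;

        /* Skip the spaces between words. */
        text += strspn(text, " ");
        if (*text == '\0')
            break;

        /* The next word is everything up to the next space. */
        len = strcspn(text, " ");

        /* A separator (space or newline) is needed unless the word
           starts the very first line. Keep room for the final NUL. */
        sep = (col > 0) ? 1 : 0;
        if (n + sep + len + 1 > out_size)
            return -1;

        if (col > 0 && col + 1 + len > width) {
            /* The word doesn't fit: break the line at this space. */
            out[n++] = '\n';
            col = 0;
        } else if (col > 0) {
            out[n++] = ' ';
            ++col;
        }

        memcpy(out + n, text, len);
        n += len;
        col += len;
        text += len;
    }

    out[n] = '\0';
    return 0;
}
```

With width 7, "one two three" becomes the two lines "one two" and
"three". With width 6, "aa bb-cc dd" becomes three lines and "bb-cc"
is kept whole because the hyphen isn't an LBO.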

With some testing I got a feeling that some simple markup would be
valuable to get better results:

  - Non-breaking spaces allow avoiding line breaks at unwanted places.

  - Soft-hyphens help with languages like German and Finnish which have
    long compound words. For example, XZ Utils' German translation has
    "Filterkettennummer" which in one place is manually hyphenated for
    line wrapping reasons.

  - Zero-width spaces allow specifying LBOs in other cases, for
    example, if there is a regular hyphen, then adding a zero-width
    space after it would allow the line to wrap there.

This isn't a perfect method at all:

  - Right-to-left languages might not work.

  - Languages that don't use spaces would need zero-width spaces at
    regular intervals to insert LBOs.

However, manual line breaks would keep working just like before as long
as the line lengths stay below the automatic wrapping length. So
languages in the above categories could continue translations almost
like they did before.

While I'd like the wrapping code (and especially the markup syntax) to
be usable in multiple situations, the --help text is the main use case
for me. It typically consists of two columns, the option and the
description:

  -p, --pages=RANGE     Set which pages to process. RANGE can be
                        a number to process a single page, or two
                        numbers separated by a '-' (FIRST-LAST) to
                        specify the first and last pages to process.

Some comments I've heard suggest splitting it into separate strings:

    (1) "-p, --pages="
    (2) "RANGE"
    (3) "Set which ..."

This has pros and cons:

  + It's clear that (1) must not be translated as translators won't
    even get this string in any msgid.

  + There are no worries about obscure markup or alignment (counting
    spaces) to indicate where the string (2) ends and (3) begins.

  - The description needs a TRANSLATORS comment which includes the
    string from (1) so that translators can see the context. Thus,
    quite a few TRANSLATORS comments would be needed.

  - It's not *obvious* that RANGE will *always* get translated the
    same way in both strings.

  - From a programmer's point of view, the split method is slightly
    less convenient.

An alternative would be to specify all three parts in a single string
but have a separator between (2) and (3), for example, a tab '\t'
character or some other markup:

    "-p, --pages=RANGE\tSet which ..."

While something like this would avoid the downsides listed above and be
convenient for me as a programmer, it's not clear how well it would be
received by translators:

  - Is it too confusing to know that "-p, --pages=" must not be
    translated? (It doesn't seem to be a problem with the current
    hard-wrapped strings.)

  - Is the \t (or other markup) too confusing, and could it
    accidentally get replaced by a regular space or something else that
    isn't correct?
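
For reference, the single-string variant with a tab separator could be
consumed roughly like this. This is a sketch under assumptions, not a
finished design: format_entry, struct buf and the helpers are
hypothetical names, the description column and the width are passed in
by the caller, wrapping happens only at spaces, and one byte is
counted as one column (ASCII only).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical output buffer with a sticky error flag. */
struct buf { char *p; size_t size; size_t n; int err; };

/* Append len bytes and keep the buffer NUL-terminated. */
static void buf_put(struct buf *b, const char *s, size_t len)
{
    if (b->err || b->n + len + 1 > b->size) {
        b->err = 1;
        return;
    }
    memcpy(b->p + b->n, s, len);
    b->n += len;
    b->p[b->n] = '\0';
}

static void buf_pad(struct buf *b, size_t count)
{
    while (count-- > 0)
        buf_put(b, " ", 1);
}

/*
 * Hypothetical helper: format "OPTION\tDESCRIPTION" as two columns.
 * The description starts at column desc_col (0-based) and continuation
 * lines are indented to the same column (hanging indentation).
 * Returns 0 on success, -1 if `out` is too small.
 */
static int format_entry(const char *entry, size_t desc_col, size_t width,
                        char *out, size_t out_size)
{
    struct buf b = { out, out_size, 0, 0 };
    const char *tab = strchr(entry, '\t');
    size_t opt_len = (tab != NULL) ? (size_t)(tab - entry) : strlen(entry);
    const char *desc = (tab != NULL) ? tab + 1 : "";
    size_t col = opt_len;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    /* The option part is copied as is; it is never wrapped. */
    buf_put(&b, entry, opt_len);

    desc += strspn(desc, " ");
    if (*desc != '\0') {
        /* If the option part reaches the description column, the
           description starts on the next line instead. */
        if (col > 0 && col >= desc_col) {
            buf_put(&b, "\n", 1);
            col = 0;
        }
        buf_pad(&b, desc_col - col);
        col = desc_col;
    }

    while (*desc != '\0') {
        size_t len = strcspn(desc, " ");

        if (col > desc_col && col + 1 + len > width) {
            /* Wrap and indent the continuation line. */
            buf_put(&b, "\n", 1);
            buf_pad(&b, desc_col);
            col = desc_col;
        } else if (col > desc_col) {
            buf_put(&b, " ", 1);
            ++col;
        }

        buf_put(&b, desc, len);
        col += len;
        desc += len;
        desc += strspn(desc, " ");
    }

    buf_put(&b, "\n", 1);
    return b.err ? -1 : 0;
}
```

Because the split point is explicit, the translator never has to count
spaces: a longer translated description simply produces more
continuation lines at the same column.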


Markup syntax ideas
-------------------

One idea I tried uses '&' followed by a second character:

    &> (or &= or \t)   hanging indentation
    &_                 non-breaking space
    &-                 soft hyphen
    &<SPACE>           a line-break opportunity
    &1 ... &9          replacement string (like %s or %9$s in printf)
    &&                 "&"

It's simple but also completely new, so translators would have to learn
it. So while it could be nice from my point of view, I understand it
might not be loved by others.

Examples:

    "-p, --pages=RANGE&>Set which pages to process. RANGE can be "
    "a number to process a single page, or two numbers separated "
    "by a '-' (FIRST-LAST) to specify the first and last pages "
    "to process."

    "dict=NUM&>dictionary size (4KiB&_-&_1536MiB;&_8MiB)"

    "nice=NUM&>nice length of a match (2-273;&_64)"

I wrote C code to implement this idea a few years ago already. It needs
some final polish before it's truly ready though.
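
As an illustration only (this is not the C code mentioned above), the
&-based table could be decoded in a separate pass before wrapping. The
sketch below makes several assumptions that the table doesn't settle:
the function name decode_amp_markup is made up, the output is UTF-8,
the hanging indentation marker is emitted as '\t' for a later step to
interpret, and an unknown sequence or a missing replacement string is
treated as an error.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical decoder for the '&' markup:
 *
 *   &> or &=   ->  '\t' (hanging indentation marker for a later step)
 *   &_         ->  U+00A0 NO-BREAK SPACE
 *   &-         ->  U+00AD SOFT HYPHEN
 *   &<SPACE>   ->  U+200B ZERO WIDTH SPACE
 *   &1 ... &9  ->  args[0] ... args[8]
 *   &&         ->  "&"
 *
 * Returns 0 on success. Returns -1 if `out` is too small or if the
 * input has an unknown sequence, a trailing '&', or refers to a
 * replacement string that wasn't provided (all assumptions).
 */
static int decode_amp_markup(const char *in,
                             const char *const *args, size_t nargs,
                             char *out, size_t out_size)
{
    size_t n = 0;

    while (*in != '\0') {
        const char *piece = in; /* bytes to copy to the output */
        size_t len = 1;

        if (in[0] == '&') {
            switch (in[1]) {
            case '>':
            case '=': piece = "\t"; break;
            case '_': piece = "\xC2\xA0"; break;     /* U+00A0 */
            case '-': piece = "\xC2\xAD"; break;     /* U+00AD */
            case ' ': piece = "\xE2\x80\x8B"; break; /* U+200B */
            case '&': piece = "&"; break;
            default:
                if (in[1] >= '1' && in[1] <= '9'
                        && (size_t)(in[1] - '1') < nargs)
                    piece = args[in[1] - '1'];
                else
                    return -1;
                break;
            }
            len = strlen(piece);
            in += 2;
        } else {
            in += 1;
        }

        /* Keep room for the terminating NUL. */
        if (n + len + 1 > out_size)
            return -1;

        memcpy(out + n, piece, len);
        n += len;
    }

    if (n >= out_size)
        return -1;

    out[n] = '\0';
    return 0;
}
```

A wrapper would then treat U+00A0 as an ordinary non-space character,
U+200B as an invisible LBO, and U+00AD as an LBO that becomes a
visible hyphen only when the line is actually broken there.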

In case the above is too cryptic for translators or there is some other
reason to reject it, another idea could be to keep using printf's
standard "%s" or "%2$s" for replacement strings, and use HTML entities
for the special markup. The HTML entity idea I took from the Gettext
manual:

    HTML markup, however, is common enough that it's probably ok
    to use in translatable strings.  But please bear in mind that
    the GNU gettext tools don't verify that the translations are
    well-formed HTML.

Here is the list:

    \t                 hanging indentation [*]
    &nbsp;             non-breaking space
    &shy;              soft hyphen
    &ZeroWidthSpace;   a line-break opportunity
    &amp;              "&"
    %s ... %9$s        replacement string with printf syntax

    [*] Or would using &indent; or similar new non-standard entity
        be clearer than \t?

Only the specific HTML entities would be supported (the list could be
expanded slightly but only slightly). Other uses of & would pass
through as is, for example, "foo & bar" would be fine.

Examples:

    "dict=NUM\tdictionary size (4KiB&nbsp;-&nbsp;1536MiB;&nbsp;8MiB)"

    "nice=NUM\tnice length of a match (2-273;&nbsp;64)"

This needs only slightly more complex code but I haven't tried to
implement it yet. (The word "complex" has to be taken in context of
command line tools whose main purpose isn't text formatting.)
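
To give an idea of the amount of code, here is an untested-in-practice
sketch of the entity part only. The names decode_entities and
struct entity are hypothetical, UTF-8 output is assumed, and printf
formatting and \t handling are left to other steps. Only the four
entities from the list are recognized; any other '&' is copied through
unchanged, so "foo & bar" and even "&copy;" stay as they are.

```c
#include <stddef.h>
#include <string.h>

/* The supported entities and their UTF-8 replacements. */
struct entity { const char *name; const char *utf8; };

static const struct entity entities[] = {
    { "&nbsp;",           "\xC2\xA0"     }, /* U+00A0 */
    { "&shy;",            "\xC2\xAD"     }, /* U+00AD */
    { "&ZeroWidthSpace;", "\xE2\x80\x8B" }, /* U+200B */
    { "&amp;",            "&"            },
};

/*
 * Hypothetical helper: replace the supported entities in `in` and copy
 * everything else as is. Returns 0 on success, -1 if `out` is too
 * small.
 */
static int decode_entities(const char *in, char *out, size_t out_size)
{
    size_t n = 0;

    while (*in != '\0') {
        const char *piece = in; /* bytes to copy to the output */
        size_t len = 1;         /* number of bytes to copy */
        size_t skip = 1;        /* input bytes consumed */

        if (*in == '&') {
            size_t i;
            for (i = 0; i < sizeof(entities) / sizeof(entities[0]); ++i) {
                size_t name_len = strlen(entities[i].name);
                if (strncmp(in, entities[i].name, name_len) == 0) {
                    piece = entities[i].utf8;
                    len = strlen(piece);
                    skip = name_len;
                    break;
                }
            }
            /* No match: the '&' itself is copied as a normal byte. */
        }

        /* Keep room for the terminating NUL. */
        if (n + len + 1 > out_size)
            return -1;

        memcpy(out + n, piece, len);
        n += len;
        in += skip;
    }

    if (n >= out_size)
        return -1;

    out[n] = '\0';
    return 0;
}
```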

I'm open to hearing other ideas.


New flag
--------

Would it be useful to standardize a new flag for translatable strings
that support the extra markup? There is "c-format" for printf format
strings now.

  * With the first &-based format, perhaps "wrap-format" or such would
    work.

  * For the HTML entities method, maybe the pair "c-format, entities"
    could mark that the string uses both printf formatting and HTML
    entities. A combined tag like "c-format-entities" might perhaps
    work too.


Other considerations
--------------------

With manual line breaks, translators have been able to adjust the
starting column of the descriptions in --help text for all strings.
This makes sense if many options become a character or two longer in
translation and would otherwise require making the --help text longer
by inserting extra newlines.

Some translators use the freedom of manual line breaks to indent the
continuation lines by additional two spaces. This style can be seen in
original English strings in some tools too, for example, "cp --help".

These kinds of things could be made configurable on a per-translation
basis by having extra translatable strings to specify options for
wrapping.
Example:

    #. TRANSLATORS: The default starting column of the --help
    #. text descriptions is 23. This special msgid can be used
    #. to change it.
    msgid ":help-column:"
    msgstr "26"

However, at the moment I don't think this is worth it. (GNU argp allows
setting these kinds of options via the environment variable
ARGP_HELP_FMT.)
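
If such a special msgid were used anyway, the program side could stay
small. The sketch below assumes the caller passes in whatever
gettext(":help-column:") returned; the function name help_column and
the max_col limit are hypothetical. When no translation exists,
gettext() returns the msgid itself, which doesn't parse as a number,
so the default is used.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the translated ":help-column:" string.
 * Returns the column if the string is a plain decimal number in the
 * range [1, max_col], otherwise returns `fallback`.
 */
static int help_column(const char *translated, int fallback, int max_col)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(translated, &end, 10);

    /* Reject overflow, empty input, and trailing garbage. This also
       rejects the untranslated msgid ":help-column:". */
    if (errno != 0 || end == translated || *end != '\0')
        return fallback;

    /* Reject values that would make the output unusable. */
    if (value < 1 || value > max_col)
        return fallback;

    return (int)value;
}
```

A broken or malicious translation can then at worst select a column
within the allowed range.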


Closing words
-------------

The effort needed from both translators and upstream maintainers to fix
the manual word wrapping problems in XZ Utils translations has been
significant. It wouldn't surprise me if a few translators gave up
translating the package because of that.

In 2023 and early 2024, XZ Utils had a second maintainer who handled
the discussion with the translators. He recently disappeared without
prior warning.

I don't want to continue with the manual translation fixing efforts; I
want *most* languages to have translations that look polished without
detailed quality control by me. Automatic word wrapping with
some simple markup could be a solution. Standardizing the markup across
packages would be good and perhaps the code I will use can be used by
others as well (it would be under the 0BSD license).

Thanks!

-- 
Lasse Collin