diff -ru /scratch1/gettext-0.14.2/gettext-tools/doc/gettext.texi gettext-0.14.2/gettext-tools/doc/gettext.texi --- /scratch1/gettext-0.14.2/gettext-tools/doc/gettext.texi 2005-02-23 06:48:29.000000000 -0700 +++ gettext-0.14.2/gettext-tools/doc/gettext.texi 2005-03-25 10:24:38.533409123 -0700 @@ -3046,7 +3046,10 @@ what it thinks to be the old translation for the new modified entry. The slight alteration in the original string (the @code{msgid} string) should often be reflected in the translated string, and this requires -the intervention of the translator. For this reason, @code{msgmerge} +the intervention of the translator. @code{msgmerge} has 5 different +algorithms that attempt to find as close a match with existing +translations as possible, in the hopes of saving translators some +effort. For this reason, @code{msgmerge} might mark some entries as being fuzzy. @emindex moving by fuzzy entries diff -ru /scratch1/gettext-0.14.2/gettext-tools/doc/msgmerge.texi gettext-0.14.2/gettext-tools/doc/msgmerge.texi --- /scratch1/gettext-0.14.2/gettext-tools/doc/msgmerge.texi 2003-10-22 04:40:39.000000000 -0600 +++ gettext-0.14.2/gettext-tools/doc/msgmerge.texi 2005-03-25 10:07:53.739582530 -0700 @@ -125,7 +125,51 @@ @opindex -N@r{, @code{msgmerge} option} @opindex --no-fuzzy-matching@r{, @code{msgmerge} option} Do not use fuzzy matching when an exact match is not found. This may speed -up the operation considerably. +up the operation considerably. The fuzzy matching done uses the fstrcmp +algorithm, which compares the two msgid strings (one from the ref, the other from the +compendia or "def" PO files), and computes a value equal to: + +((number of chars in common) / (average length of the strings)) + +and selects the highest value found from all the msgid's in the +compendia and PO files. The corresponding msgstr is then used as +the result. When the maximum match value is below some threshold, +the matches are rejected, and the fuzzy match fails. 
Needless to say, +if the compendia or "def" PO are large, and/or the msgid string to be +matched is large, this algorithm will become computationally expensive +to perform. Because this algorithm can come up with matches that vary +from near exact, to irrelevant and perhaps even humorous, it is done last, +only if the other algorithms turn up no better alternatives. + +@item --no-fuzzy2-matching +@opindex --no-fuzzy2-matching@r{, @code{msgmerge} option} +Do not use the "object match" fuzzy matching algorithm. This algorithm uses +a hash table containing strings generated from the original msgids, which +have all words downcased and separated by a single space. Punctuation, common +keywords from GNU, Gnome, KDE, Mozilla, and OpenOffice, etc., are replaced with +a '' marker. Matches via this algorithm could replace the markers +using the data in the msgid in the reference file, thus reducing the amount +of work the translator has to do on the resulting msgstr. + +@item --no-fuzzy3-matching +@opindex --no-fuzzy3-matching@r{, @code{msgmerge} option} +Do not use the "word match" fuzzy matching algorithm, which works the same +as the fuzzy2 matching algorithm, except that the markers are left +out of the strings in the hash tables. + +@item --no-fuzzy4-matching +@opindex --no-fuzzy4-matching@r{, @code{msgmerge} option} +Do not use the "sentence match" fuzzy matching algorithm, which looks for +multi-sentence messages (sentences end or are separated by a period or +question mark) in a message, and splits up the message into sentences using +the same format as fuzzy3, and looks up each sentence individually. All matches +are concatenated into a new msgstr. + +@item --no-fuzzy5-matching +@opindex --no-fuzzy5-matching@r{, @code{msgmerge} option} +Do not use the "individual word match" fuzzy matching algorithm, which +splits the msgid into separate words, and looks up each word in the definitions +separately. 
All matched msgstr's are concatenated into a new msgstr. @end table @subsection Input file syntax diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/ChangeLog gettext-0.14.2/gettext-tools/src/ChangeLog --- /scratch1/gettext-0.14.2/gettext-tools/src/ChangeLog 2005-02-24 05:57:35.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/ChangeLog 2005-03-25 09:41:25.000000000 -0700 @@ -1,3 +1,39 @@ +2005-03-25 Steve Murphy + + * Makefile.am, added fuzzy.c, fuzzy.h to the build for msgmerge + * Makefile.in, added fuzzy.c, fuzzy.h to the build for msgmerge + * message.c (message_alloc). Added inits for msgid_TO and msgstr_TO. + * message.c (message_list_list_search) Added code to set weight of fuzzy messages to 1 instead + of 2, to allow fuzzy messages to be upgraded from fuzzy + to exact. + * message.h Added include for fuzzy.h + * message.h Added pointers to "translation_object", msgid_TO, and msgstr_TO, to the message_ty + struct. + * msgmerge.c Added fuzzy.h include. + * msgmerge.c Added new global booleans, use_fuzzy[2-5]_matching + * msgmerge.c Added new args 'no-fuzzy[2-5]-matching' to the "long_options" array. + * msgmerge.c (main) Added code to set option variables use_fuzzy[2-5]_matching. + * msgmerge.c (main) Added new options to usage output. + * msgmerge.c (message_merge) Added arg to function call, to allow the msgstr to be + generated based on matched msgstr, and the msgid of the 'ref' + message, for fuzzy algorithms 2 and 3. + * msgmerge.c (match_domain) Added extra args to pass in the 2 new hash tables. Added + the four new fuzzy matchers here. Changed the logic slightly + to allow the matchers to work when an empty msgstr exists + in the matched definitions. I also reformatted the + indentation here to make it easier for me to read. Sorry. + * msgmerge.c (merge) Create 2 new hash tables, and added some vars to collect stats. + One hash table contains canonicalized msgid's with markers. 
+ The other leaves out the markers, but includes broken out + sentences from existing msgid's as new, fuzzy msgid's. + Updated calls to match_domain to include the new message lists. + * write-po.c (message_print_obsolete) Added sultry comment and code to immediately + return from func, if message is fuzzy. No obsolete + fuzzy messages from msgmerge anymore! + * fuzzy.c New file -- functions for new fuzzy matching algorithms. + * fuzzy.h New file + + 2005-02-24 Bruno Haible * gettext-0.14.2 released. diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/Makefile.am gettext-0.14.2/gettext-tools/src/Makefile.am --- /scratch1/gettext-0.14.2/gettext-tools/src/Makefile.am 2005-02-07 04:40:19.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/Makefile.am 2005-03-08 12:10:47.000000000 -0700 @@ -134,7 +134,7 @@ msgfmt_SOURCES = msgfmt.c \ write-mo.c write-java.c write-csharp.c write-resources.c write-tcl.c \ write-qt.c plural-eval.c -msgmerge_SOURCES = msgmerge.c plural-count.c +msgmerge_SOURCES = msgmerge.c plural-count.c fuzzy.c msgunfmt_SOURCES = msgunfmt.c \ read-mo.c read-java.c read-csharp.c read-resources.c read-tcl.c xgettext_SOURCES = xgettext.c \ diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/Makefile.in gettext-0.14.2/gettext-tools/src/Makefile.in --- /scratch1/gettext-0.14.2/gettext-tools/src/Makefile.in 2005-02-24 06:11:35.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/Makefile.in 2005-03-08 12:16:50.000000000 -0700 @@ -189,7 +189,7 @@ ../intl/address@hidden@o \ libgettextsrc.la am_msgmerge_OBJECTS = msgmerge-msgmerge.$(OBJEXT) \ - msgmerge-plural-count.$(OBJEXT) + msgmerge-plural-count.$(OBJEXT) fuzzy.$(OBJEXT) msgmerge_OBJECTS = $(am_msgmerge_OBJECTS) msgmerge_DEPENDENCIES = libgettextsrc.la am_msgunfmt_OBJECTS = msgunfmt-msgunfmt.$(OBJEXT) \ @@ -563,7 +563,7 @@ write-mo.c write-java.c write-csharp.c write-resources.c write-tcl.c \ write-qt.c plural-eval.c -msgmerge_SOURCES = msgmerge.c plural-count.c +msgmerge_SOURCES = msgmerge.c plural-count.c 
fuzzy.c msgunfmt_SOURCES = msgunfmt.c \ read-mo.c read-java.c read-csharp.c read-resources.c read-tcl.c @@ -993,6 +993,9 @@ msgmerge-msgmerge.o: msgmerge.c $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(msgmerge_CFLAGS) $(CFLAGS) -c -o msgmerge-msgmerge.o `test -f 'msgmerge.c' || echo '$(srcdir)/'`msgmerge.c +fuzzy.o: fuzzy.c + $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(msgmerge_CFLAGS) $(CFLAGS) -c -o fuzzy.o `test -f 'fuzzy.c' || echo '$(srcdir)/'`fuzzy.c + msgmerge-msgmerge.obj: msgmerge.c $(CC) $(DEFS) $(DEFAULT_INCLUDES) $(INCLUDES) $(AM_CPPFLAGS) $(CPPFLAGS) $(msgmerge_CFLAGS) $(CFLAGS) -c -o msgmerge-msgmerge.obj `if test -f 'msgmerge.c'; then $(CYGPATH_W) 'msgmerge.c'; else $(CYGPATH_W) '$(srcdir)/msgmerge.c'; fi` diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/message.c gettext-0.14.2/gettext-tools/src/message.c --- /scratch1/gettext-0.14.2/gettext-tools/src/message.c 2005-01-18 04:32:25.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/message.c 2005-03-22 21:35:56.000000000 -0700 @@ -114,6 +114,8 @@ mp->do_wrap = undecided; mp->used = 0; mp->obsolete = false; + mp->msgid_TO = 0; + mp->msgstr_TO = 0; return mp; } @@ -531,6 +533,8 @@ if (mp) { int weight = (mp->msgstr_len == 1 && mp->msgstr[0] == '\0' ? 1 : 2); + if( mp->is_fuzzy ) /* an exact match will outweigh a fuzzy match */ + weight = 1; if (weight > best_weight) { best_mp = mp; diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/message.h gettext-0.14.2/gettext-tools/src/message.h --- /scratch1/gettext-0.14.2/gettext-tools/src/message.h 2005-01-18 04:32:25.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/message.h 2005-03-10 19:22:58.000000000 -0700 @@ -23,6 +23,7 @@ #include "str-list.h" #include "pos.h" #include "hash.h" +#include "fuzzy.h" #include @@ -140,6 +141,9 @@ /* Used for looking up the target message, in the msgcat program. 
*/ message_ty *tmp; + translation_object *msgid_TO; /* just so we don't have to keep generating these! */ + translation_object *msgstr_TO; + /* Used for combining alternative translations, in the msgcat program. */ int alternative_count; struct altstr diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/msgmerge.c gettext-0.14.2/gettext-tools/src/msgmerge.c --- /scratch1/gettext-0.14.2/gettext-tools/src/msgmerge.c 2005-02-08 04:16:57.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/msgmerge.c 2005-03-25 07:36:31.000000000 -0700 @@ -53,6 +53,7 @@ #include "backupfile.h" #include "copy-file.h" #include "gettext.h" +#include "fuzzy.h" #define _(str) gettext (str) @@ -74,6 +75,10 @@ /* Determines whether to use fuzzy matching. */ static bool use_fuzzy_matching = true; +static bool use_fuzzy2_matching = true; +static bool use_fuzzy3_matching = true; +static bool use_fuzzy4_matching = true; +static bool use_fuzzy5_matching = true; /* List of user-specified compendiums. */ static message_list_list_ty *compendiums; @@ -97,6 +102,10 @@ { "multi-domain", no_argument, NULL, 'm' }, { "no-escape", no_argument, NULL, 'e' }, { "no-fuzzy-matching", no_argument, NULL, 'N' }, + { "no-fuzzy2-matching", no_argument, NULL, CHAR_MAX + 7 }, + { "no-fuzzy3-matching", no_argument, NULL, CHAR_MAX + 8 }, + { "no-fuzzy4-matching", no_argument, NULL, CHAR_MAX + 9 }, + { "no-fuzzy5-matching", no_argument, NULL, CHAR_MAX + 10 }, { "no-location", no_argument, &line_comment, 0 }, { "no-wrap", no_argument, NULL, CHAR_MAX + 4 }, { "output-file", required_argument, NULL, 'o' }, @@ -284,6 +293,22 @@ message_print_syntax_stringtable (); break; + case CHAR_MAX + 7: /* --no-fuzzy2-matching */ + use_fuzzy2_matching = false; + break; + + case CHAR_MAX + 8: /* --no-fuzzy3-matching */ + use_fuzzy3_matching = false; + break; + + case CHAR_MAX + 9: /* --no-fuzzy4-matching */ + use_fuzzy4_matching = false; + break; + + case CHAR_MAX + 10: /* --no-fuzzy5-matching */ + use_fuzzy5_matching = false; + break; + 
default: usage (EXIT_FAILURE); break; @@ -496,6 +521,18 @@ -N, --no-fuzzy-matching do not use fuzzy matching\n")); printf ("\n"); printf (_("\ + --no-fuzzy2-matching do not use fuzzy matching method 2\n")); + printf ("\n"); + printf (_("\ + --no-fuzzy3-matching do not use fuzzy matching method 3\n")); + printf ("\n"); + printf (_("\ + --no-fuzzy4-matching do not use fuzzy matching method 4\n")); + printf ("\n"); + printf (_("\ + --no-fuzzy5-matching do not use fuzzy matching method 5\n")); + printf ("\n"); + printf (_("\ Input file syntax:\n")); printf (_("\ -P, --properties-input input files are in Java .properties syntax\n")); @@ -615,7 +652,7 @@ static message_ty * -message_merge (message_ty *def, message_ty *ref) +message_merge (message_ty *def, message_ty *ref, int match) { const char *msgstr; size_t msgstr_len; @@ -630,6 +667,7 @@ is usually empty, as it was generated by xgettext. If we currently process the header entry we have to merge the msgstr by using the Report-Msgid-Bugs-To and POT-Creation-Date fields from the reference. */ + if (ref->msgid[0] == '\0') { /* Oh, oh. The header entry and we have something to fill in. 
*/ @@ -828,8 +866,32 @@ } else { - msgstr = def->msgstr; - msgstr_len = def->msgstr_len; + if( match ) + { + translation_object *this_msgid; + translation_object *matched_msgid; + translation_object *matched_msgstr; + + if( ref->msgid_TO ) + this_msgid = ref->msgid_TO; + else + ref->msgid_TO = this_msgid = wordmatch_parse_msgid(ref->msgid); + if( def->msgid_TO ) + matched_msgid = def->msgid_TO; + else + def->msgid_TO = matched_msgid = wordmatch_parse_msgid(def->msgid); + if( def->msgstr_TO ) + matched_msgstr = def->msgstr_TO; + else + def->msgstr_TO = matched_msgstr = wordmatch_parse_msgid(def->msgstr); + msgstr = wordmatch_synth_msgstr(this_msgid, matched_msgid, matched_msgstr); + msgstr_len = strlen(msgstr)+1; + } + else + { + msgstr = def->msgstr; + msgstr_len = def->msgstr_len; + } } result = message_alloc (xstrdup (ref->msgid), ref->msgid_plural, @@ -905,201 +967,464 @@ static void match_domain (const char *fn1, const char *fn2, - message_list_list_ty *definitions, message_list_ty *refmlp, - message_list_ty *resultmlp, - struct statistics *stats, unsigned int *processed) + message_list_list_ty *definitions, + message_list_list_ty *definitions2, + message_list_list_ty *definitions3, + message_list_ty *refmlp, + message_list_ty *resultmlp, + struct statistics *stats, unsigned int *processed) { - message_ty *header_entry; - unsigned long int nplurals; - char *untranslated_plural_msgstr; - size_t j; - - header_entry = message_list_search (definitions->item[0], ""); - nplurals = get_plural_count (header_entry ? header_entry->msgstr : NULL); - untranslated_plural_msgstr = (char *) xmalloc (nplurals); - memset (untranslated_plural_msgstr, '\0', nplurals); - - for (j = 0; j < refmlp->nitems; j++, (*processed)++) + message_ty *header_entry; + unsigned long int nplurals; + char *untranslated_plural_msgstr; + size_t j; + + header_entry = message_list_search (definitions->item[0], ""); + nplurals = get_plural_count (header_entry ? 
header_entry->msgstr : NULL); + untranslated_plural_msgstr = (char *) xmalloc (nplurals); + memset (untranslated_plural_msgstr, '\0', nplurals); + + for (j = 0; j < refmlp->nitems; j++, (*processed)++) { - message_ty *refmsg; - message_ty *defmsg; - - /* Because merging can take a while we print something to signal - we are not dead. */ - if (!quiet && verbosity_level <= 1 && *processed % DOT_FREQUENCY == 0) - fputc ('.', stderr); - - refmsg = refmlp->item[j]; - - /* See if it is in the other file. */ - defmsg = message_list_list_search (definitions, refmsg->msgid); - if (defmsg) - { - /* Merge the reference with the definition: take the #. and - #: comments from the reference, take the # comments from - the definition, take the msgstr from the definition. Add - this merged entry to the output message list. */ - message_ty *mp = message_merge (defmsg, refmsg); - - message_list_append (resultmlp, mp); - - /* Remember that this message has been used, when we scan - later to see if anything was omitted. */ - defmsg->used = 1; - stats->merged++; - } - else if (refmsg->msgid[0] != '\0') - { - /* If the message was not defined at all, try to find a very - similar message, it could be a typo, or the suggestion may - help. */ - if (use_fuzzy_matching - && ((defmsg = - message_list_list_search_fuzzy (definitions, - refmsg->msgid)) != NULL)) - { - message_ty *mp; - - if (verbosity_level > 1) + message_ty *refmsg; + message_ty *defmsg; + + /* Because merging can take a while we print something to signal + we are not dead. */ + if (!quiet && verbosity_level <= 1 && *processed % DOT_FREQUENCY == 0) + fputc ('.', stderr); + + refmsg = refmlp->item[j]; + + /* See if it is in the other file. 
*/ + if (verbosity_level > 0) { - po_gram_error_at_line (&refmsg->pos, _("\ -this message is used but not defined...")); - po_gram_error_at_line (&defmsg->pos, _("\ -...but this definition is similar")); + printf("Looking for Exact Match for %s\n",refmsg->msgid); + fflush(stdout); } - - /* Merge the reference with the definition: take the #. and - #: comments from the reference, take the # comments from - the definition, take the msgstr from the definition. Add - this merged entry to the output message list. */ - mp = message_merge (defmsg, refmsg); - - mp->is_fuzzy = true; - - message_list_append (resultmlp, mp); - - /* Remember that this message has been used, when we scan - later to see if anything was omitted. */ - defmsg->used = 1; - stats->fuzzied++; - if (!quiet && verbosity_level <= 1) - /* Always print a dot if we handled a fuzzy match. */ - fputc ('.', stderr); - } - else - { - message_ty *mp; - bool is_untranslated; - const char *p; - const char *pend; - - if (verbosity_level > 1) - po_gram_error_at_line (&refmsg->pos, _("\ -this message is used but not defined in %s"), fn1); - - mp = message_copy (refmsg); - - if (mp->msgid_plural != NULL) + + defmsg = message_list_list_search (definitions, refmsg->msgid); + if (defmsg && defmsg->msgstr && defmsg->msgstr[0]) + /* I just added the non-empty test, or the only + way this alg would ever do fuzzy matching + is if the ref has a new message not in the + def. + */ { - /* Test if mp is untranslated. (It most likely is.) */ - is_untranslated = true; - for (p = mp->msgstr, pend = p + mp->msgstr_len; p < pend; p++) - if (*p != '\0') - { - is_untranslated = false; - break; - } - if (is_untranslated) - { - /* Change mp->msgstr_len consecutive empty strings into - nplurals consecutive empty strings. */ - if (nplurals > mp->msgstr_len) - mp->msgstr = untranslated_plural_msgstr; - mp->msgstr_len = nplurals; - } + /* Merge the reference with the definition: take the #. 
and + #: comments from the reference, take the # comments from + the definition, take the msgstr from the definition. Add + this merged entry to the output message list. */ + message_ty *mp = message_merge (defmsg, refmsg, 0); + + message_list_append (resultmlp, mp); + + /* Remember that this message has been used, when we scan + later to see if anything was omitted. */ + defmsg->used = 1; + stats->merged++; + } + else if (refmsg->msgid[0] != '\0') + { + translation_object *transobj; + char *search_string2; + char *search_string3; + + defmsg = 0; + + if( !refmsg->msgid_TO ) + { + transobj = wordmatch_parse_msgid(refmsg->msgid); + refmsg->msgid_TO = transobj; + } + else + transobj = refmsg->msgid_TO; + + search_string2 = wordmatch_msgid_detailed_matchstring(transobj); + search_string3 = wordmatch_msgid_word_matchstring(transobj); + + + /* If the message was not defined at all, try to find a very + similar message, it could be a typo, or the suggestion may + help. */ + + /* method 1: look for any messages that match this one, + based on the words of this message in sequence, minus + punctuation, case, keywords,and shortcut notation of all the + various software groups. This is a "canonical" representation. + All non-"word" items are represented with "" as a marker. + A single space separates the items. A match using this method + could yield a near-exact translation of the text, given that + the keywords are substituted correctly in the msgstr. 
+ */ + if( use_fuzzy2_matching && search_string2 && strlen(search_string2) ) + { + if (verbosity_level > 0) + { + printf("Looking for Obj+Word Match for %s\n",search_string2); + fflush(stdout); + } + + defmsg = message_list_list_search (definitions2, search_string2); + if( defmsg ) + { + message_ty *mp; + if (verbosity_level > 1) + { + po_gram_error_at_line (&refmsg->pos, _("\ +this message is used but not defined via exact match...")); + po_gram_error_at_line (&defmsg->pos, _("\ +...but this definition is similar via method 1")); + } + + /* Merge the reference with the definition: take the #. and + #: comments from the reference, take the # comments from + the definition, take the msgstr from the definition. Add + this merged entry to the output message list. */ + mp = message_merge (defmsg, refmsg, 1); + + mp->is_fuzzy = true; + + message_list_append (resultmlp, mp); + + /* Remember that this message has been used, when we scan + later to see if anything was omitted. */ + defmsg->used = 1; + stats->fuzzied++; + if (!quiet && verbosity_level <= 1) + /* Always print a dot if we handled a fuzzy match. */ + fputc ('1', stderr); + + } + } + + /* method 2: look for any messages that match this one, + based on the words of this message in sequence, minus + punctuation, case, keywords,and shortcut notation of all the + various software groups. This is a "canonical" representation. + All non-"word" items are simply ignored. A single space separates + the words. 
+ */ + + if( !defmsg && use_fuzzy3_matching && search_string3 && strlen(search_string3)) + { + if (verbosity_level > 0) + { + printf("Looking for Word Match for %s\n",search_string3); + fflush(stdout); + } + + defmsg = message_list_list_search (definitions3, search_string3); + if( defmsg ) + { + message_ty *mp; + if (verbosity_level > 1) + { + po_gram_error_at_line (&refmsg->pos, _("\ +this message is used but not defined via direct match or Method 1...")); + po_gram_error_at_line (&defmsg->pos, _("\ +...but this definition is similar via Method 2")); + } + + /* Merge the reference with the definition: take the #. and + #: comments from the reference, take the # comments from + the definition, take the msgstr from the definition. Add + this merged entry to the output message list. */ + mp = message_merge (defmsg, refmsg, 1); + + mp->is_fuzzy = true; + + message_list_append (resultmlp, mp); + + /* Remember that this message has been used, when we scan + later to see if anything was omitted. */ + defmsg->used = 1; + stats->fuzzied++; + if (!quiet && verbosity_level <= 1) + /* Always print a dot if we handled a fuzzy match. */ + fputc ('2', stderr); + } + } + + /* method 3: by sentence. When building method 2's hash table, + break up any paragraphs into sentences, and enter each sentence + into the table with its corresponding part of the msgstr. + Here we search the definitions3 table for each sentence in the + msgid. 
*/ + + if( !defmsg && use_fuzzy4_matching && wordmatch_count_sentences(transobj) > 1 ) + { + char buffer9[20000]; + char buffer7[20000]; + translation_object *nto = 0; + message_ty *defmsg2; + buffer9[0] = 0; + buffer7[0] = 0; + + while( wordmatch_get_next_sentence(transobj,&nto,buffer9) ) + { + if (verbosity_level > 0) + { + printf("Looking for Sentence Match for %s\n",buffer9); + fflush(stdout); + } + defmsg2 = message_list_list_search( definitions3, buffer9); + + if( defmsg2 ) + { + defmsg = defmsg2; + strcat(buffer7, defmsg2->msgstr); + strcat(buffer7, ". "); + } + } + if( defmsg ) + { + message_ty *mp; + /* hmmm. gotta set up a message_ty to handle the result... */ + message_ty *def2 = message_alloc( xstrdup(refmsg->msgid), + refmsg->msgid_plural, + xstrdup(buffer7), + strlen(buffer7), + &defmsg->pos); + mp = message_merge( def2, refmsg, 1); + mp->is_fuzzy = true; + message_list_append( resultmlp, mp); + /* Remember that this message has been used, when we scan + later to see if anything was omitted. */ + defmsg->used = 1; + stats->fuzzied++; + if (!quiet && verbosity_level <= 1) + /* Always print a dot if we handled a fuzzy match. */ + fputc ('3', stderr); + if (verbosity_level > 1) + { + po_gram_error_at_line (&refmsg->pos, _("\ +this message is used but not defined via direct match or Methods 1&2...")); + po_gram_error_at_line (&defmsg->pos, _("\ +...but this definition is similar via Method 3")); + } + } + } + + + /* method 4: by word. + Here we search the definitions3 table for each single word in the + msgid. If there are several words in the msgid that match, we form + a string of these words separated by a single space. The translator + will have to turn them into a valid translation. 
+ */ + + if( !defmsg && use_fuzzy5_matching && wordmatch_count_words(transobj) > 0 ) + { + char buffer9[20000]; + char buffer7[20000]; + translation_object *nto = 0; + message_ty *defmsg2; + buffer9[0] = 0; + buffer7[0] = 0; + + while( wordmatch_get_next_word(transobj,&nto,buffer9) ) + { + if (verbosity_level > 0) + { + printf("Looking for Single Word Match for %s\n", buffer9); + fflush(stdout); + } + defmsg2 = message_list_list_search( definitions3, buffer9); + + if( defmsg2 ) + { + defmsg = defmsg2; + strcat(buffer7, defmsg2->msgstr); + strcat(buffer7, " "); + } + } + if( defmsg ) + { + message_ty *mp; + /* hmmm. gotta set up a message_ty to handle the result... */ + message_ty *def2 = message_alloc( xstrdup(refmsg->msgid), + refmsg->msgid_plural, + xstrdup(buffer7), + strlen(buffer7), + &defmsg->pos); + mp = message_merge( def2, refmsg, 1); + mp->is_fuzzy = true; + message_list_append( resultmlp, mp); + /* Remember that this message has been used, when we scan + later to see if anything was omitted. */ + defmsg->used = 1; + stats->fuzzied++; + if (!quiet && verbosity_level <= 1) + /* Always print a dot if we handled a fuzzy match. 
*/ + fputc ('4', stderr); + if (verbosity_level > 1) + { + po_gram_error_at_line (&refmsg->pos, _("\ +this message is used but not defined via direct match or Methods 1&2&3...")); + po_gram_error_at_line (&defmsg->pos, _("\ +...but this definition is similar via Method 4")); + } + } + } + + if ( !defmsg ) + { + + if (verbosity_level > 0) + { + printf("Looking for Fuzzy Match (fstrcmp) for %s\n",refmsg->msgid); + fflush(stdout); + } + + if (use_fuzzy_matching + && ((defmsg = + message_list_list_search_fuzzy (definitions, + refmsg->msgid)) != NULL)) + { + message_ty *mp; + + if (verbosity_level > 1) + { + po_gram_error_at_line (&refmsg->pos, _("\ +this message is used but not defined via direct match, or Methods 1 or 2...")); + po_gram_error_at_line (&defmsg->pos, _("\ +...but this definition is similar via fuzzy matching")); + } + + /* Merge the reference with the definition: take the #. and + #: comments from the reference, take the # comments from + the definition, take the msgstr from the definition. Add + this merged entry to the output message list. */ + mp = message_merge (defmsg, refmsg, 0); + + mp->is_fuzzy = true; + + message_list_append (resultmlp, mp); + + /* Remember that this message has been used, when we scan + later to see if anything was omitted. */ + defmsg->used = 1; + stats->fuzzied++; + if (!quiet && verbosity_level <= 1) + /* Always print a dot if we handled a fuzzy match. */ + fputc ('.', stderr); + } + else + { + message_ty *mp; + bool is_untranslated; + const char *p; + const char *pend; + + if (verbosity_level > 1) + po_gram_error_at_line (&refmsg->pos, _("\ +this message is used but not defined in %s"), fn1); + + mp = message_copy (refmsg); + + if (mp->msgid_plural != NULL) + { + /* Test if mp is untranslated. (It most likely is.) 
*/ + is_untranslated = true; + for (p = mp->msgstr, pend = p + mp->msgstr_len; p < pend; p++) + if (*p != '\0') + { + is_untranslated = false; + break; + } + if (is_untranslated) + { + /* Change mp->msgstr_len consecutive empty strings into + nplurals consecutive empty strings. */ + if (nplurals > mp->msgstr_len) + mp->msgstr = untranslated_plural_msgstr; + mp->msgstr_len = nplurals; + } + } + + message_list_append (resultmlp, mp); + stats->missing++; + } + } } - - message_list_append (resultmlp, mp); - stats->missing++; - } - } } - - /* Now postprocess the problematic merges. This is needed because we - want the result to pass the "msgfmt -c -v" check. */ - { - /* message_merge sets mp->used to 1 or 2, depending on the problem. - Compute the bitwise OR of all these. */ - int problematic = 0; - - for (j = 0; j < resultmlp->nitems; j++) - problematic |= resultmlp->item[j]->used; - - if (problematic) - { - unsigned long int nplurals = 0; - - if (problematic & 1) - { - /* Need to know nplurals of the result domain. */ - message_ty *header_entry = message_list_search (resultmlp, ""); - - nplurals = get_plural_count (header_entry - ? header_entry->msgstr - : NULL); - } - - for (j = 0; j < resultmlp->nitems; j++) - { - message_ty *mp = resultmlp->item[j]; - - if ((mp->used & 1) && (nplurals > 0)) - { - /* ref->msgid_plural != NULL but def->msgid_plural == NULL. - Use a copy of def->msgstr for each possible plural form. */ - size_t new_msgstr_len; - char *new_msgstr; - char *p; - unsigned long i; - - if (verbosity_level > 1) - { - po_gram_error_at_line (&mp->pos, _("\ + + /* Now postprocess the problematic merges. This is needed because we + want the result to pass the "msgfmt -c -v" check. */ + { + /* message_merge sets mp->used to 1 or 2, depending on the problem. + Compute the bitwise OR of all these. 
*/ + int problematic = 0; + + for (j = 0; j < resultmlp->nitems; j++) + problematic |= resultmlp->item[j]->used; + + if (problematic) + { + unsigned long int nplurals = 0; + + if (problematic & 1) + { + /* Need to know nplurals of the result domain. */ + message_ty *header_entry = message_list_search (resultmlp, ""); + + nplurals = get_plural_count (header_entry + ? header_entry->msgstr + : NULL); + } + + for (j = 0; j < resultmlp->nitems; j++) + { + message_ty *mp = resultmlp->item[j]; + + if ((mp->used & 1) && (nplurals > 0)) + { + /* ref->msgid_plural != NULL but def->msgid_plural == NULL. + Use a copy of def->msgstr for each possible plural form. */ + size_t new_msgstr_len; + char *new_msgstr; + char *p; + unsigned long i; + + if (verbosity_level > 1) + { + po_gram_error_at_line (&mp->pos, _("\ this message should define plural forms")); - } - - new_msgstr_len = nplurals * mp->msgstr_len; - new_msgstr = (char *) xmalloc (new_msgstr_len); - for (i = 0, p = new_msgstr; i < nplurals; i++) - { - memcpy (p, mp->msgstr, mp->msgstr_len); - p += mp->msgstr_len; - } - mp->msgstr = new_msgstr; - mp->msgstr_len = new_msgstr_len; - mp->is_fuzzy = true; - } - - if ((mp->used & 2) && (mp->msgstr_len > strlen (mp->msgstr) + 1)) - { - /* ref->msgid_plural == NULL but def->msgid_plural != NULL. - Use only the first among the plural forms. */ - - if (verbosity_level > 1) - { - po_gram_error_at_line (&mp->pos, _("\ + } + + new_msgstr_len = nplurals * mp->msgstr_len; + new_msgstr = (char *) xmalloc (new_msgstr_len); + for (i = 0, p = new_msgstr; i < nplurals; i++) + { + memcpy (p, mp->msgstr, mp->msgstr_len); + p += mp->msgstr_len; + } + mp->msgstr = new_msgstr; + mp->msgstr_len = new_msgstr_len; + mp->is_fuzzy = true; + } + + if ((mp->used & 2) && (mp->msgstr_len > strlen (mp->msgstr) + 1)) + { + /* ref->msgid_plural == NULL but def->msgid_plural != NULL. + Use only the first among the plural forms. 
*/ + + if (verbosity_level > 1) + { + po_gram_error_at_line (&mp->pos, _("\ this message should not define plural forms")); - } - - mp->msgstr_len = strlen (mp->msgstr) + 1; - mp->is_fuzzy = true; - } - - /* Postprocessing of this message is done. */ - mp->used = 0; - } - } - } + } + + mp->msgstr_len = strlen (mp->msgstr) + 1; + mp->is_fuzzy = true; + } + + /* Postprocessing of this message is done. */ + mp->used = 0; + } + } + } } static msgdomain_list_ty * @@ -1108,10 +1433,17 @@ msgdomain_list_ty *def; msgdomain_list_ty *ref; size_t j, k; + int dupsin2 = 0; + int dupsin3 = 0; + int entries2 = 0; + int entries3 = 0; + int entries4 = 0; unsigned int processed; struct statistics stats; msgdomain_list_ty *result; message_list_list_ty *definitions; + message_list_list_ty *definitions2; + message_list_list_ty *definitions3; message_list_ty *empty_list; stats.merged = stats.fuzzied = stats.missing = stats.obsolete = 0; @@ -1123,11 +1455,14 @@ whose first element will be definitions for the current domain, and whose other elements come from the compendiums. */ definitions = message_list_list_alloc (); + definitions2 = message_list_list_alloc (); + definitions3 = message_list_list_alloc (); message_list_list_append (definitions, NULL); if (compendiums) message_list_list_append_list (definitions, compendiums); empty_list = message_list_alloc (false); + /* This is the references file, created by groping the sources with the xgettext program. 
*/ ref = read_po_file (fn2); @@ -1175,6 +1510,94 @@ def = iconv_msgdomain_list (def, "UTF-8", fn1); } + /* generate the other two message sets, copies of the original + definitions, and we are on our way */ + + if (verbosity_level > 0) + printf("=========== Generate the other two msgid sets...\n"); + + for(k=0; k< definitions->nitems; k++) + { + message_list_ty *mlpn2 = message_list_alloc (/* definitions->item[1]->use_hashtable*/1); /* Whenever you change the items in a list, you can no longer */ + message_list_ty *mlpn3 = message_list_alloc (/* definitions->item[1]->use_hashtable*/1); /* guarantee that those items are still unique */ + + message_list_ty *mlpd = definitions->item[k]; + + for (j = 0; mlpd && j < mlpd->nitems; j++) + { + char buffer9[20000]; + char buffer9x[20000]; + int msgid_sentences; + translation_object *nto; + translation_object *ntox; + translation_object *msgid_desc = wordmatch_parse_msgid(mlpd->item[j]->msgid); + char *newmsgid2 = wordmatch_msgid_detailed_matchstring(msgid_desc); + char *newmsgid3 = wordmatch_msgid_word_matchstring(msgid_desc); + message_ty *found2 = message_list_search(mlpn2,newmsgid2); + message_ty *found3 = message_list_search(mlpn3,newmsgid3); + message_ty *newmess2; + message_ty *newmess3; + + mlpd->item[j]->msgid_TO = msgid_desc; + + if( !found2 ) + { + newmess2 = message_copy(mlpd->item[j]); + newmess2->msgid_TO = msgid_desc; + newmess2->msgid = newmsgid2; + message_list_append(mlpn2, newmess2); + entries2++; + } + else + dupsin2++; + + if( !found3 ) + { + newmess3 = message_copy(mlpd->item[j]); + newmess3->msgid_TO = msgid_desc; + newmess3->msgid = newmsgid3; + message_list_append(mlpn3, newmess3); + entries3++; + } + else + dupsin3++; + + if( (msgid_sentences=wordmatch_count_sentences(msgid_desc)) > 1 ) + { + translation_object *msgstr_desc = wordmatch_parse_msgid(mlpd->item[j]->msgstr); + int msgstr_sentences = wordmatch_count_sentences(msgstr_desc); + if( msgid_sentences == msgstr_sentences ) + { + nto = 0; + ntox 
= 0; + while( wordmatch_get_next_sentence(msgid_desc,&nto,buffer9) ) + { + message_ty *found4 = message_list_search(mlpn3,buffer9); + wordmatch_get_next_sentence(msgstr_desc,&ntox,buffer9x); /* parallel */ + + if( !found4 ) + { + newmess3 = message_copy(mlpd->item[j]); + newmess3->msgid_TO = wordmatch_parse_msgid(buffer9); + newmess3->msgid = xstrdup(buffer9); + newmess3->msgstr_TO = wordmatch_parse_msgid(buffer9x); + newmess3->msgstr = xstrdup(buffer9x); + newmess3->is_fuzzy = true; /* a non-multisentence msgid might match; + make sure everybody knows this could be fuzzy */ + message_list_append(mlpn3, newmess3); + entries4++; + } + } + } + } + } + message_list_list_append (definitions2, mlpn2); + message_list_list_append (definitions3, mlpn3); + } + + if (verbosity_level > 0) + printf("=========== Done...Table 2 entries = %d, dups = %d; Table 3 entries = %d, dups = %d, sentences = %d; !!!\n", entries2, dupsin2, entries3, dupsin3, entries4 ); + result = msgdomain_list_alloc (false); processed = 0; @@ -1191,7 +1614,7 @@ if (definitions->item[0] == NULL) definitions->item[0] = empty_list; - match_domain (fn1, fn2, definitions, refmlp, resultmlp, + match_domain (fn1, fn2, definitions, definitions2, definitions3, refmlp, resultmlp, &stats, &processed); } else @@ -1213,7 +1636,7 @@ definitions->item[0] = defmlp; - match_domain (fn1, fn2, definitions, refmlp, resultmlp, + match_domain (fn1, fn2, definitions, definitions2, definitions3, refmlp, resultmlp, &stats, &processed); } } diff -ru /scratch1/gettext-0.14.2/gettext-tools/src/write-po.c gettext-0.14.2/gettext-tools/src/write-po.c --- /scratch1/gettext-0.14.2/gettext-tools/src/write-po.c 2005-01-13 05:07:01.000000000 -0700 +++ gettext-0.14.2/gettext-tools/src/write-po.c 2005-03-23 07:55:06.000000000 -0700 @@ -892,7 +892,16 @@ /* Print translator comment if available. */ message_print_comment (mp, fp); - + + /* I'm sorry, but saving fuzzies as obsolete? Why would we + want to create uniqueness errors? 
They were fuzzy for a + reason-- they were inexact. And they are now obsolete because + (I presume) either an exact or better fuzzy match was found. + So lose them. + */ + if (mp->is_fuzzy) + return; + /* Print flag information in special comment. */ if (mp->is_fuzzy) { diff -ru /scratch1/gettext-0.14.2/NEWS gettext-0.14.2/NEWS --- /scratch1/gettext-0.14.2/NEWS 2005-02-12 06:31:08.000000000 -0700 +++ gettext-0.14.2/NEWS 2005-03-25 08:56:27.000000000 -0700 @@ -1,3 +1,36 @@ +Version 0.14.2a - March 2005 + +* Added 4 new fuzzy matchers, that get run before the current (fstrcmp) + fuzzy matcher: A. match against a canonicalized sentence, with + words lowercased, popular menu key shortcut notation removed, + popular variables, punctuation, and other notation all replaced + with notation, words separated by a single space. + B. Same as above, only the notation is left out of the matching + string. C. If there is more than one sentence, the paragraph is + broken into sentences, and each matched individually as in B. + D. Each word is looked up individually, and the translation is + added to the msgstr. + If none of the preceding algorithms detect a match, then the fstrcmp + is run. All 4 new algorithms are hash-table based and run fairly + quickly. Their results should be more usable than the average fstrcmp + hit, but YMMV. They can be turned off by using the new options + --no-fuzzy2-matching + --no-fuzzy3-matching + --no-fuzzy4-matching + --no-fuzzy5-matching + which turn off algorithms A, B, C, and D, respectively. + +* Fixed bug whereby fuzzy matching was not done if the "def" (PO) file + has a null msgstr (""). Now, fuzzy matching will be performed as + expected. Because of this bug, fuzzy matches were rarely performed + under common circumstances. + +* Fixed bug whereby a fuzzy match could not be upgraded to an exact + match. 
Changing the algorithm to allow direct matches after a fuzzy + caused the old fuzzy match to be output as obsolete, causing checking + errors in msgfmt. Obsolete fuzzy matches are no longer output. + + Version 0.14.2 - February 2005 * Improved detection of the locale on MacOS X. @@ -26,7 +59,7 @@ * Security fixes. -Version 0.14 - January 2004 +Version 0.14 - January 2005 * Programming languages support: Only in gettext-0.14.2: NEWS~ diff -ru /scratch1/gettext-0.14.2/README gettext-0.14.2/README --- /scratch1/gettext-0.14.2/README 2002-01-07 10:51:27.000000000 -0700 +++ gettext-0.14.2/README 2005-03-25 09:05:20.000000000 -0700 @@ -122,10 +122,26 @@ locale.alias. In the misc/ subdirectory you find an example for an alias database file. -4. The msgmerge program performs fuzzy search in the message sets. It - might run a long time on slow systems. I saw this problem when running - it on my old i386DX25. The time can really be several minutes, - especially if you have long messages and/or a great number of - them. +4. The msgmerge program performs a fuzzy search in the message sets + that compares the two msgid strings (one from the ref, the other from the + compendia or "def" PO files), and computes a value equal to: + + ((number of chars in common) / (average length of the strings)) + + and selects the highest value found from all the msgids in the + compendia and PO files. The corresponding msgstr is then used as + the result. When the maximum match value is below some threshold, then + the matches are rejected, and the fuzzy match fails. Needless to say, + if the compendia or "def" PO are large, and/or the msgid string to be + matched is large, this algorithm will become computationally expensive + to perform. I saw this problem when running it on my old i386DX25. The + time can really be several minutes, especially if you have long messages + and/or a great number of them. 
If you have a faster implementation of the fstrcmp() function and want to share it with the rest of us, please contact me. + The -N or --no-fuzzy-matching option can be used to + prevent this algorithm from being performed. + +5. Four new hash-table-based algorithms have been added to do fuzzy matching + before the fstrcmp fuzzy matching. It is hoped that they will reduce the + run time by decreasing the number of calls to fstrcmp.
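
To illustrate the hash-table approach described in item 5, here is a minimal sketch of how a lookup key might be derived from a msgid, in the spirit of the "word match" algorithm: letters are lowercased, punctuation is dropped, and words are joined by single spaces. The name canonical_key is hypothetical and is not part of the patch; the patch uses its own wordmatch_* routines, whose keyword and marker handling is not reproduced here. The sketch also assumes the C locale, so only ASCII letters and digits survive.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the patch: build a lookup key from
   MSGID by lowercasing letters and digits, dropping every other
   character, and joining the resulting words with single spaces.
   OUT must have room for strlen (msgid) + 1 bytes.
   Returns the length of the key.  */
static size_t
canonical_key (const char *msgid, char *out)
{
  size_t n = 0;
  int pending_space = 0;
  const unsigned char *p;

  for (p = (const unsigned char *) msgid; *p != '\0'; p++)
    {
      if (isalnum (*p))
        {
          /* Emit one separator between words, never at the start.  */
          if (pending_space && n > 0)
            out[n++] = ' ';
          pending_space = 0;
          out[n++] = (char) tolower (*p);
        }
      else
        /* Whitespace and punctuation only mark a word boundary.  */
        pending_space = 1;
    }
  out[n] = '\0';
  return n;
}
```

With this sketch, "Save  File..." and "save file" both yield the key "save file", so a plain hash lookup on the key can find a candidate translation without any call to fstrcmp.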