Re: Grammar checking

On Sun, Apr 2, 2023, 11:05 PM Richard Stallman <rms@gnu.org> wrote:

[[[ To any NSA and FBI agents reading my email: please consider ]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

> > If the released (and free) LanguageTool _program_ gives adequate
> > results, we could make Emacs support working with that. But we should
> > take pains _not_ to support the kind of communication that that SaaSS
> > server offers.
> They may not make it easy, see this complaint on their forum:

Would you please spell out what it is
that they "may not make easy"?

> https://forum.languagetool.org/t/about-the-premium-version-of-languagetool/8469

I looked at that page, but lacking the context, I can't understand it
well enough to divine the point that your message hints at.

I may have been mistaken in my first reading. I read the message as saying that any process using the free service would receive an advertisement of how many corrections would be found by the premium service. I am assuming that, at the least, Emacs maintainers would want to filter that out by default. The forum message may only refer to using the web user interface for checking sample text, though.

If the former is true though, it could be difficult to ensure such advertising is always filtered. It really depends on the owners of that service, who can change over time.

> * The process for contributing "rules" to the free version is to go
> through the SaaSS's forum sites.
> https://community.languagetool.org/rule/list?lang=en shows 5919 rules
> for english, presumably in the basic version.

I found a more on-point reference addressing my concern, i.e. how will contributions replicating the rules implemented in the premium version be treated by the project developers:

https://forum.languagetool.org/t/free-lt-premium-for-contributors/8639

Since the exact nature of those premium rules is presumably not disclosed just by virtue of having a premium subscription, I can only guess this reverse engineering would happen by following a process like:

1) Take a large corpus of texts with known grammatical errors, e.g. https://www.cl.cam.ac.uk/research/nl/bea2019st/ or https://ai.googleblog.com/2021/08/the-c4200m-synthetic-dataset-for.html?m=1

2) Record the results produced by the free and premium versions on each test case

3) Formulate rules that specifically fix issues found by the premium version and not the free version.
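
As a rough illustration of steps 2 and 3, here is a minimal Python sketch of the comparison. It assumes the results of each version have already been recorded as a mapping from a test-case ID to a set of flagged (offset, length) spans. That data layout and the name premium_only are hypothetical, chosen only for this example, and do not reflect any LanguageTool API.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: an issue is the (offset, length) span
# that a checker flagged within one test case.
Issue = Tuple[int, int]
Results = Dict[str, Set[Issue]]


def premium_only(free: Results, premium: Results) -> Results:
    """Return, per test case, the issues flagged in the premium results
    that are absent from the free results."""
    gaps: Results = {}
    for case_id, premium_issues in premium.items():
        # A test case missing from the free results counts as "nothing flagged".
        missing = premium_issues - free.get(case_id, set())
        if missing:
            gaps[case_id] = missing
    return gaps


if __name__ == "__main__":
    free = {"s1": {(0, 4)}, "s2": set()}
    premium = {"s1": {(0, 4), (10, 3)}, "s2": {(5, 2)}, "s3": set()}
    print(premium_only(free, premium))
```

The resulting per-case gaps are what step 3 would try to cover with new rules.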

Perhaps the LanguageTool.org owners would consider this a violation of their service's terms and conditions as a justification for not accepting contributions of source code to the project.

OTOH, if an Emacs developer or user simply wants to systematically improve the free version of LanguageTool, the most obvious method for doing so would be:

1) Take a large corpus of texts with known grammatical errors, see above

2) Record the results produced by the free rule set

3) Formulate rules that specifically fix issues found, prioritizing issues by some measure of expected frequency in real text
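
A minimal Python sketch of the prioritization in step 3 follows, under stated assumptions: each corpus test case supplies its known errors as (span, error_type) pairs plus the set of spans the free rule set flagged, and the count of missed errors per type in the corpus stands in for "expected frequency in real text". The names and data layout are hypothetical, and exact span matching is a simplification.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]            # (offset, length) of an error in the text
KnownError = Tuple[Span, str]     # annotated span plus an error-type label
Case = Tuple[List[KnownError], Set[Span]]  # (known errors, spans flagged)


def rank_missed_error_types(cases: Iterable[Case]) -> List[Tuple[str, int]]:
    """Count known errors the free rule set failed to flag, grouped by
    error type, and return the types ordered from most to least frequent."""
    missed: Counter = Counter()
    for known_errors, detected in cases:
        for span, error_type in known_errors:
            # Simplification: an error counts as found only on an exact span match.
            if span not in detected:
                missed[error_type] += 1
    return missed.most_common()


if __name__ == "__main__":
    cases = [
        ([((0, 3), "agreement"), ((8, 2), "article")], {(8, 2)}),
        ([((4, 5), "agreement"), ((12, 1), "punctuation")], set()),
    ]
    for error_type, count in rank_missed_error_types(cases):
        print(error_type, count)
```

Error types at the top of the ranking would be the first candidates for new rules.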

Presumably the additional rules in the premium version have been added precisely according to some measure of their expected frequency, possibly by analysis of real-world text from users over the years the service has been available.

It would be surprising if any successful attempt to systematically improve the rules in LanguageTool did not overlap significantly with the rules found in the premium version, just due to the definition of "successful" in statistical terms and the assumption that the premium rule set is likewise "successful".

We could consider forking that code in a limited way: adding new rules.

In general, we should cooperate with upstream developers, but we don't
have to jump through hoops to do so.

I'm not personally very pure in the software I use, so I'm surprised at how much the issues I perceive seem to bother me. I've been an Emacs user since the 90s, and it would never have occurred to me that I would ever be concerned about contributing code to improve Emacs, whether directly to the Emacs project or indirectly through one of its dependencies. From what I see now, I would have such concerns if grammar checking support is added that depends on LanguageTool.

I suppose there's another, even more abstract concern with open source software that is developed specifically in conjunction with a SaaSS business, which is: To what extent does data from users of the SaaSS drive development, or even get incorporated in some (aggregated or statistical) form in the source code? For example, what if a grammar checker incorporated a "deep learning" system that had been trained on such data? In most cases, it would be impossible to reconstruct the training data set starting from the data specifying the trained model. But would it be acceptable for a GNU software project to depend on such software? I don't know the answer, but I think it's a real question when dealing with open source software from projects like LanguageTool. I don't know or allege that there's anything like that in LanguageTool, but neither can I be certain that there is not. I can't help but think this business model - maintaining an open source version as a loss leader for a proprietary or SaaSS version - is only going to continue growing, and hence the need to address it in the GNU coding manual section 8 or otherwise.

> Looking at the java code makes it appear there are
> many hard-coded rules, but I don't know if that is really the case.
> That is whether the code for the rules are some generic implementation
> of the rules coded in XML, or if the XML rule sets are being
> translated into java code at some point in the build process.

I can only guess at the context this is about, but it sounds like
you're suggesting that it may not be clear what form of the code is
the real source code. Do they not say? Does their source release
include the XML? Does it include Make rules to translate the XML into
Java?

I don't do a lot of Java coding, and it was a cursory examination. I did eventually find the XML rule sets linked to from https://dev.languagetool.org/languages, which is classified as "user documentation". It appears most rules in well-supported languages are in XML, with some coded in Java. Whether the coding in Java is for speed or to overcome limitations of the semantics of rules expressed in XML, I have no idea.

I'm going to leave my concerns at that. I've already spent too much time on this as it is. I just thought the last-minute hair-pulling discussion of tree-sitter grammar files, which frankly seem to have much less ethical baggage, should not be repeated after grammar checking support depending on LanguageTool is already implemented and adopted.

Lynn

From:	Lynn Winebarger
Subject:	Re: Grammar checking
Date:	Thu, 6 Apr 2023 08:29:15 -0400