emacs-devel

From: Yuan Fu
Subject: Re: treesitter local parser: huge slowdown and memory usage in a long file
Date: Fri, 19 Apr 2024 19:18:53 -0700

> 
> 
> > On Feb 18, 2024, at 9:53 PM, Yuan Fu <casouri@gmail.com> wrote:
> > 
> > 
> > 
> >> On Feb 17, 2024, at 7:37 PM, Dmitry Gutov <dmitry@gutov.dev> wrote:
> >> 
> >> On 13/02/2024 10:08, Yuan Fu wrote:
> >> 
> >>>> On 12/02/2024 06:16, Yuan Fu wrote:
> >>>>> Thanks, the culprit is the call to treesit-update-ranges in
> >>>>> treesit--pre-redisplay, where we don’t pass it any specific range, so it
> >>>>> updates the range for the whole buffer. Eli, is there any way to get a
> >>>>> rough estimate of the range that redisplay is refreshing? Do you think
> >>>>> something like this would work?
> >>>> 
> >>>> If we don't update the ranges outside of some interval surrounding the 
> >>>> window, what does that mean for correctness?
> >>> If the place of update and the embedded code currently in view belong to 
> >>> the same node in the host language, then when we update ranges for the 
> >>> current window-visible range, the whole node’s range is updated. So at 
> >>> least for this node, the range is correct.
> >>> If the place of update and the embedded code currently in view belong to 
> >>> different nodes in the host language, then when we update ranges for the 
> >>> current window-visible range, only the visible node’s range is updated.
> >> 
> >> Okay. What about positions after the visible part of the buffer? Can their 
> >> ranges be outdated? It's probably okay when the ranges are only used for 
> >> font-lock and syntax-ppss, but I wonder about possible other applications 
> >> (reindenting the whole buffer, for example).
> > 
> > It’s the same as positions before the visible part. For reindenting the
> > whole buffer, treesit-indent-region will update the range for the whole
> > buffer at the very beginning.
> > 
> >> 
> >>>> 
> >>>> Perhaps the mode has a syntax-propertize-function which behaves 
> >>>> differently (as it should) depending on the language at point. Or 
> >>>> different ranges have different syntax tables, something like that.
> >>>> 
> >>>> If the ranges, after some edit (perhaps a programmatic one, performed far
> >>>> from the visible area), are not kept up to date somewhere around the
> >>>> beginning of the buffer, do we not risk confusing the syntax-ppss parser,
> >>>> for example?
> >>> That can happen, yes.
> >>>> 
> >>>> Come to think of it, take treesit-indent: it only updates the ranges for 
> >>>> the current line. But the line's indentation usually depends on the 
> >>>> previous buffer positions, doesn't it?
> >>> The range passed to treesit-update-ranges acts as an intersecting range: we
> >>> capture nodes that intersect with the range and use them to update ranges.
> >>> If the line to be indented is in an embedded language block, the whole
> >>> block will be captured and its range will be given to the embedded
> >>> language parser.
> >>> We haven’t had any problems so far, mainly because most embedded code
> >>> blocks are local, and it’s rare for an edit far from the visible portion
> >>> to affect ranges in a way that the user expects to see reflected in the
> >>> currently visible range.
> >>> I don’t have any great idea for a better way to update ranges right now.
> >>> Let me think about that. In the meantime, I’ll push a temporary fix so V’s
> >>> original problem can be solved.
> >> 
> >> I was thinking (since considering the same problem in mmm-mode, actually)
> >> that it would make sense to either plug into syntax-propertize-function, or
> >> have a parallel data structure similarly tracking the outdated buffer
> >> regions, which would only update the part of the buffer which had been
> >> modified since last time.
> >> 
> >> Dealing with the "remainder" of the buffer might be trickier, but maybe some
> >> heuristic which would help detect the "no changes" case could be
> >> implemented.
> > 
> > Yeah, something similar to syntax-ppss or jit-lock. Or maybe it can be 
> > avoided, since the current on-demand range update has been working fine, 
> > until we added treesit--pre-redisplay for syntax-ppss.
> 
> This is actually a bit involved, because there could be multiple layers of
> parsers: the host language sets ranges for a local parser, and the local
> parser can set ranges for a further nested parser. E.g., we might have a
> markdown parser for parsing doc-comments, and inside the markdown there could
> be code blocks which require another level of nested parser.
> 
> This use case is a bit advanced, but we definitely need to support it in our
> design. And my brain is twisted by all the dependencies and ranges. If you
> guys have some ideas, they’ll be most welcome :-)
> 

I believe I’ve found a good way to solve this problem. I pushed the changes to 
master. 

Basically, I added a function, treesit-parser-changed-ranges, that directly
returns the ranges changed by the last reparse. This means we don’t need to use
notifiers to get those changed ranges anymore. Then in treesit--pre-redisplay,
we reparse the primary parser and get the changed ranges from it.
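
To illustrate the idea of asking the parser for the changed ranges right after a reparse, rather than waiting for a notifier callback, here is a rough sketch in Python. It is not the treesit API: ToyParser and its methods are hypothetical names, and the changed range is approximated by comparing the old and new text, whereas tree-sitter derives it by comparing the old and new syntax trees.

```python
class ToyParser:
    """Hypothetical stand-in for a parser; not the treesit API."""

    def __init__(self, text=""):
        self.text = text
        self.changed_ranges = []  # ranges changed by the last reparse

    def reparse(self, new_text):
        """Record and return the half-open ranges that differ from the old text.

        Assumption: a plain text comparison stands in for the syntax-tree
        comparison a real incremental parser would perform.
        """
        old = self.text
        limit = min(len(old), len(new_text))

        # Length of the common prefix of the old and new text.
        prefix = 0
        while prefix < limit and old[prefix] == new_text[prefix]:
            prefix += 1

        # Length of the common suffix, never overlapping the prefix.
        suffix = 0
        while (suffix < limit - prefix
               and old[len(old) - 1 - suffix]
               == new_text[len(new_text) - 1 - suffix]):
            suffix += 1

        self.text = new_text
        if old == new_text:
            self.changed_ranges = []
        else:
            # Expressed in the coordinates of the new text.
            self.changed_ranges = [(prefix, len(new_text) - suffix)]
        return self.changed_ranges
```

The caller gets the ranges as the return value of the reparse (or reads them from the parser afterwards), so no callback registration is involved.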

Once we have the changed ranges, we update the non-primary parsers’ ranges,
but only within the changed ranges. Originally we were updating those parsers’
ranges over the whole buffer, which led to the slowdown, and we then had to use
a workaround to avoid it. Now the workaround isn’t needed anymore.
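
The "only within the changed ranges" update can be sketched roughly as follows, again in Python and not as the actual Emacs implementation. All names here (intersects, update_embedded_ranges, make_finder) are hypothetical. Embedded blocks are modelled as {...} spans in a host string, and the old ranges are assumed to be already expressed in post-edit coordinates.

```python
import re


def intersects(a, b):
    """True if the half-open ranges a and b overlap."""
    return a[0] < b[1] and b[0] < a[1]


def update_embedded_ranges(old_ranges, changed_ranges, find_blocks):
    """Return new embedded ranges, re-querying only the changed ranges.

    find_blocks(start, end) is a hypothetical query that returns every
    embedded block intersecting [start, end), as whole blocks.
    """
    # Old ranges that no changed range touches are kept as they are.
    kept = [r for r in old_ranges
            if not any(intersects(r, c) for c in changed_ranges)]
    # Only the changed ranges are queried for embedded blocks.
    fresh = []
    for start, end in changed_ranges:
        fresh.extend(find_blocks(start, end))
    return sorted(set(kept) | set(fresh))


def make_finder(text):
    """Toy host 'query': embedded blocks are the {...} spans in text."""
    blocks = [(m.start(), m.end()) for m in re.finditer(r"\{[^{}]*\}", text)]

    def find_blocks(start, end):
        return [b for b in blocks if intersects(b, (start, end))]

    return find_blocks


# The middle block was just typed, so only positions 8..12 changed.
text = "a {x} b {yy} c {z}"
new_ranges = update_embedded_ranges(
    old_ranges=[(2, 5), (15, 18)],
    changed_ranges=[(8, 12)],
    find_blocks=make_finder(text),
)
```

With this shape, the cost of an update depends on how much changed and not on the size of the buffer, which is the point of the fix.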

I also removed some notifier functions and moved their work into
treesit--pre-redisplay.

Yuan

