Re: cc-mode fontification feels random

emacs-devel


From:	Ergus
Subject:	Re: cc-mode fontification feels random
Date:	Sat, 12 Jun 2021 17:04:02 +0200

On Sat, Jun 12, 2021 at 02:25:45PM +0300, Eli Zaretskii wrote:

Date: Sat, 12 Jun 2021 13:01:03 +0200
From: Ergus <spacibba@aol.com>
Cc: ofv@wanadoo.es, emacs-devel@gnu.org

If I understand something about our cc-mode functionalities (and many of
those functionalities we don't want to lose, like indentation and code
navigation), probably the "right" way to use tree-sitter (maybe Alan
wants to give a more precise technical description) is not only to
fontify but to use the tree information to add contextual information to
the text (something that I think cc-mode does), and then let font-lock
do the magic.

The tree-sitter tree is basically contextual information, and (for
example) if we have processed the whole buffer and we already have the
tree, then scrolling won't need to parse anything. Adding or removing
text is a localized modification, so with the previous tree we can
re-parse only the modified region. The choice may then be whether we
propertize the text of the whole buffer, just the visible region, or
"propertize on demand".

This will save us from the hard parsing in cc-mode to fontify "on the
fly".


I'm not sure I understand what you are suggesting.  Can you describe
your suggestion in terms of 'face' text properties and the 'fontified'
property, and explain how those should fit into the existing redisplay
mechanisms?

cc-mode has something similar to the tree-sitter properties: the
information we get in c-syntactic-context or c-langelem-sym.

I don't actually know where cc-mode stores this information now.

But right now it is set on the text only for the regions (the visible
ones) that are parsed on demand (that's why they impact commands like
scrolling). So there are two operations: 1) the parsing and then 2)
setting these properties on the text (or wherever they are stored).

On the other hand, when we want to get things like
c-defun-name-and-limits we also search on the fly, with functions like
c-declaration-limits-1 or c-go-list-backward that try to recognize or
find the contextual information.

With tree-sitter, on the other hand, suppose we have a buffer like:

int main()
{
        int i = 5;

        return 0;
}

The tree sitter parser returns a tree that may be represented like:

(translation_unit
 (function_definition type:
                      (primitive_type) declarator:
                      (function_declarator declarator: (identifier)
                                           parameters: (parameter_list))
                      body:
                      (compound_statement
                       (declaration type: (primitive_type)
                                    declarator:
                                    (init_declarator
                                     declarator: (identifier)
                                     value: (number_literal)))
                       (return_statement (number_literal)))))

This tree can be traversed, accessed and recalculated very fast; and
after a change it can be updated even faster, section by section, if we
know the rest hasn't changed.

Suppose we have a visible region that shows only the line "int i = 5;"
(because our screen is very small for this example).

As we know where that line starts in the buffer, we can find the
nearest node that extends over this region using functions like:

ts_node_first_child_for_byte
ts_node_descendant_for_byte_range
ts_node_named_descendant_for_byte_range
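
To make the lookup step concrete, here is a minimal sketch in C of what a call like ts_node_descendant_for_byte_range does conceptually: descend to the smallest node whose byte range covers the visible text. It does not use the tree-sitter library. The Node struct, descendant_for_range and the hand-written tree fragment are hypothetical stand-ins, and the byte offsets assume the sample buffer above is indented with eight spaces.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a syntax node; this is NOT the tree-sitter TSNode type. */
typedef struct Node {
    const char *type;             /* node kind, e.g. "declaration" */
    size_t start, end;            /* byte range [start, end) */
    const struct Node *children;  /* array of children, or NULL */
    size_t n_children;
} Node;

/* Return the smallest node under N whose range contains [start, end),
   or NULL if N itself does not contain that range. */
static const Node *descendant_for_range(const Node *n, size_t start, size_t end)
{
    assert(n != NULL);
    if (start < n->start || end > n->end)
        return NULL;
    for (size_t i = 0; i < n->n_children; i++) {
        const Node *hit = descendant_for_range(&n->children[i], start, end);
        if (hit)
            return hit;           /* a child covers the range: go deeper */
    }
    return n;                     /* no child covers it: N is the answer */
}

/* Hand-built fragment of the tree for the sample buffer
   (offsets are an assumption, see the text above). */
static const Node decl_kids[] = {
    { "primitive_type",  21, 24, NULL, 0 },
    { "init_declarator", 25, 30, NULL, 0 },
};
static const Node body_kids[] = {
    { "declaration",      21, 31, decl_kids, 2 },
    { "return_statement", 41, 50, NULL, 0 },
};
static const Node sample_body = { "compound_statement", 11, 52, body_kids, 2 };
```

With this data, descendant_for_range(&sample_body, 21, 31) (the line "int i = 5;") returns the declaration node, while a range spanning both statements, such as 21..50, falls back to the enclosing compound_statement.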

The design choice comes here.

1) We can traverse the "useful" subtree under those nodes to convert
that information into text properties directly (using
ts_tree_cursor_current_field_id).

But if I remember correctly that could have some implications for
redisplay, right? Even when we modify properties that are not visible
or that belong to an outer node.

2) We never convert the tree information into properties (as we know
them in the text now), but just use the ts_tree_cursor_* set of
functions to access the information and tell the display engine to
use some faces for it.

So on the Lisp side, instead of accessing information stored in the
properties, we just call a wrapper around the tree-sitter C functions.
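
A rough sketch of this second option, again in plain C without the tree-sitter library: the face for a buffer position is computed from the tree when it is asked for, and nothing is stored on the text. The Node struct, face_at and face_for_type are hypothetical names. The node-kind to face pairing is only an illustrative assumption, and the byte offsets assume the sample buffer is indented with eight spaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a syntax node; this is NOT the tree-sitter TSNode type. */
typedef struct Node {
    const char *type;             /* node kind, e.g. "identifier" */
    size_t start, end;            /* byte range [start, end) */
    const struct Node *children;  /* array of children, or NULL */
    size_t n_children;
} Node;

/* Illustrative mapping from node kind to face name (an assumption). */
static const char *face_for_type(const char *type)
{
    if (strcmp(type, "primitive_type") == 0) return "font-lock-type-face";
    if (strcmp(type, "identifier") == 0)     return "font-lock-variable-name-face";
    if (strcmp(type, "number_literal") == 0) return "font-lock-constant-face";
    return NULL;                  /* no face for this kind of node */
}

/* Face for the byte at POS, derived from the tree on demand.
   The deepest node covering POS decides; NULL means "no face". */
static const char *face_at(const Node *n, size_t pos)
{
    assert(n != NULL);
    if (pos < n->start || pos >= n->end)
        return NULL;
    for (size_t i = 0; i < n->n_children; i++) {
        const Node *c = &n->children[i];
        if (pos >= c->start && pos < c->end)
            return face_at(c, pos);
    }
    return face_for_type(n->type);
}

/* Hand-built subtree for the line "int i = 5;" of the sample buffer. */
static const Node init_kids[] = {
    { "identifier",     25, 26, NULL, 0 },
    { "number_literal", 29, 30, NULL, 0 },
};
static const Node decl_kids[] = {
    { "primitive_type",  21, 24, NULL, 0 },
    { "init_declarator", 25, 30, init_kids, 2 },
};
static const Node sample_decl = { "declaration", 21, 31, decl_kids, 2 };
```

Here face_at(&sample_decl, 21) gives the type face for "int", position 25 gives the variable-name face for "i", and position 27 (the "=") gives NULL because no node with a face covers it.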

----

The first approach is probably simpler to implement, but less optimal:
translating between C and Lisp types and adding properties on every
update is extra work on the Lisp side.

This may be optimized a bit using for example
ts_tree_get_changed_ranges.
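
As a sketch of that optimization: ts_tree_get_changed_ranges reports the ranges whose syntactic structure differs between an old and a new tree, so only the parts of those ranges that are actually visible would need new properties. The C code below is not tree-sitter code. Range and ranges_to_refontify are hypothetical names, with Range standing in for the library's range type, and the changed ranges are assumed to have been obtained already.

```c
#include <assert.h>
#include <stddef.h>

/* Toy byte range [start, end); a stand-in for the library's range type. */
typedef struct {
    size_t start, end;
} Range;

/* Clip each changed range to the visible window.  The non-empty
   intersections are written to OUT, which must have room for N
   entries; the return value is how many were written. */
static size_t ranges_to_refontify(const Range *changed, size_t n,
                                  Range visible, Range *out)
{
    assert(visible.start <= visible.end);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        size_t s = changed[i].start > visible.start ? changed[i].start
                                                    : visible.start;
        size_t e = changed[i].end < visible.end ? changed[i].end
                                                : visible.end;
        if (s < e)                /* skip ranges entirely off screen */
            out[k++] = (Range){ s, e };
    }
    return k;
}
```

For example, with a visible window of 21..50 and changed ranges 0..3, 25..30 and 45..60, only 25..30 and 45..50 would be re-propertized; the change at 0..3 is off screen and can wait.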

The second approach may require a bit more work, but will solve the
issue of indentation and code navigation for all the modes with a common
pattern and a single API, while the display engine could access all the
information directly, from C to C.

The key difference may be seen in (for example) a basic command like
up-list:

1) With the first approach it will search the buffer for text property
changes, use syntax-ppss, and so on.

2) With the second one it will just call ts_node_parent and go to
ts_node_start_byte.
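
The second variant can be sketched in a few lines of C. This is a toy model, not the tree-sitter API: the Node struct with an explicit parent pointer and ancestor_of_type are hypothetical, the parent link plays the role of ts_node_parent and the start field the role of ts_node_start_byte. The byte offsets assume the sample buffer is indented with eight spaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy node with a parent link; this is NOT the tree-sitter TSNode type. */
typedef struct Node {
    const char *type;             /* node kind */
    size_t start, end;            /* byte range [start, end) */
    const struct Node *parent;    /* NULL at the root */
} Node;

/* Walk up from N and return the nearest enclosing node of kind TYPE,
   or NULL if there is none.  No buffer text is scanned. */
static const Node *ancestor_of_type(const Node *n, const char *type)
{
    assert(n != NULL && type != NULL);
    for (const Node *p = n->parent; p != NULL; p = p->parent)
        if (strcmp(p->type, type) == 0)
            return p;
    return NULL;
}

/* Chain of ancestors of the identifier "i" in the sample buffer. */
static const Node fn    = { "function_definition", 0, 52, NULL };
static const Node body  = { "compound_statement", 11, 52, &fn };
static const Node decl  = { "declaration",        21, 31, &body };
static const Node init  = { "init_declarator",    25, 30, &decl };
static const Node ident = { "identifier",         25, 26, &init };
```

Starting from the identifier "i", ancestor_of_type(&ident, "compound_statement")->start is 11, the position of the opening brace, so the jump needs only pointer hops instead of a backward search through the text.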

> I don't
>really care if TS actually processes a much larger chunk of text, if
>it does that quickly enough, but processing the resulting faces will
>take time on the Emacs side, and that is better avoided.

But then we won't get all the contextual information we need for
indentation, code navigation or code folding, right?


Why not?

Translating that information as well may be a lot of work.

I see two approaches here:

1) add the tree-sitter properties/faces to the buffer text (fully or
partially on the visible regions)

2) use the tree-sitter information directly from the tree and add the
visible properties from there.

This second one will require a more complete API of tree-sitter
functions exposed to elisp, but in my opinion it is worth it in
accuracy, speed and simplicity (a single API to rule them all), and to
support many languages we don't currently handle, like Rust or the
fancy C++ > 11.


Why can't we have both?  The information you are talking about, which
is needed by Emacs features other than fontification, can be used by
those other Emacs features when needed.  You seem to be saying that
these two alternatives are mutually-exclusive, but you didn't explain
why.

They are not exclusive, but redundant. If we use the current
infrastructure then we will spend a lot of time translating properties
and contextual information, and keeping parts of them from becoming
outdated. Navigation and indentation will continue to be based on
properties that we need to set and update all the time to keep them
matching the tree one by one.

Basically we will be duplicating the information that is already in the
tree, creating many list objects, overloading the GC, and so on. So we
will potentially save only the parsing time.

The first one may work with a very primitive API to handle and iterate
over the tree-sitter tree. The second one will require using cursors,
finders and some other features from the tree-sitter API; that will
surely improve performance, but it replaces a lot of the work Lisp is
doing now.

The second approach will probably make the C developers happier than
the Lisp ones.
