bug#13949: 24.4.1; `fill-paragraph' should not always put the buffer as

bug-gnu-emacs

From:	Óscar Fuentes
Subject:	bug#13949: 24.4.1; `fill-paragraph' should not always put the buffer as modified
Date:	Mon, 28 Mar 2016 00:03:05 +0200
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/25.0.92 (gnu/linux)

Dmitry Gutov <dgutov@yandex.ru> writes:

> On 03/28/2016 12:05 AM, Óscar Fuentes wrote:
>
>> I guess that the extra bits of entropy (160 vs 128) were a "fuzzy-warm"
>> factor too in choosing SHA-1 over MD5. Git must avoid collisions
>> among potentially hundreds of millions of objects (repos of that size
>> already exist or will exist in the near future.)
>
> Are there fewer different texts we'd have to be able to discern?

As stated in my previous message, avoiding collisions between a pair of
objects is statistically an entirely different problem from avoiding
them within an arbitrarily large set. Here we are in the pair scenario.
IIUC, Lars' idea of using hashes on buffers to test for modification is
also the pair case.
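
To put rough numbers on the pair-versus-set distinction, here is a small
Python sketch. It assumes an idealized hash with uniformly distributed
output and uses the standard birthday-problem approximation; the function
names and the figure of 500 million objects are made up for illustration.

```python
import math


def pair_collision_probability(bits: int) -> float:
    """Chance that two specific, unrelated inputs share the same digest."""
    return 2.0 ** -bits


def set_collision_probability(n: int, bits: int) -> float:
    """Birthday-problem approximation: chance of any collision among n digests."""
    pairs = n * (n - 1) // 2
    # 1 - exp(-pairs / 2**bits), computed with expm1 to stay accurate for tiny values.
    return -math.expm1(-pairs / 2.0 ** bits)


if __name__ == "__main__":
    objects = 500_000_000  # assumed stand-in for "hundreds of millions of objects"
    for bits in (128, 160):
        print(bits,
              pair_collision_probability(bits),
              set_collision_probability(objects, bits))
```

The set case is many orders of magnitude more likely than the pair case,
yet under these assumptions it remains tiny even for a 128-bit hash.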

>> Each and every hash
>> must be different from all the others and hence avoid the Birthday
>> Problem. Anyway, 128 bit hashes still would be good enough for those
>> huge repos. fill-paragraph needs to discriminate only between 2 chunks
>> of data.
>
> I think you mean "2 chunks of data that must only be different in
> positioning and presence of newlines". Then yes, the odds of a
> collision must be slim. Still, I haven't seen (or performed) a
> sufficient analysis to evaluate them.

For naturally occurring modifications (as opposed to modifications
specially chosen to create collisions), inserting newlines or any other
string makes little difference to the hash algorithm.
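
As an informal illustration (not a proof), this Python sketch counts how
many bits of a SHA-1 digest change when a single space is replaced by a
newline. The sample text and the helper name `differing_bits` are
assumptions made for the example.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-1 digests of a and b differ."""
    da = int.from_bytes(hashlib.sha1(a).digest(), "big")
    db = int.from_bytes(hashlib.sha1(b).digest(), "big")
    return bin(da ^ db).count("1")


# Same words; only one space has become a newline, as filling might do.
original = b"The quick brown fox jumps over the lazy dog"
refilled = b"The quick brown fox\njumps over the lazy dog"

if __name__ == "__main__":
    print(differing_bits(original, refilled), "of 160 bits differ")
```

For a well-behaved hash, roughly half of the 160 bits should differ,
however small the edit.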

>>> b) Git has a global object index. It _can_ detect collisions, or at
>>> least that detection can be implemented.
>>
>> And what to do when a collision is detected?
>
> Abort the current operation? Wait 50ms and retry creating the commit?
> Not 100% sure how the file contents are indexed: e.g. whether mtime
> factors into its hash value, too.

This would not work, for several reasons (colliding commits exist
before they are merged or incorporated into a repo where they meet; file
and tree objects, whose content is identified by their SHA-1 hashes,
cannot be "retried"; etc.) Having a collision is something that Should
Not Happen in Git, and the designers chose a crypto hash precisely
because those algorithms are the best at avoiding collisions.

>> Back to the topic, your suggestion about comparing the pre- and post-
>> contents of the paragraph (and avoiding huge copies of the pre- contents
>> by restricting the copied area to the paragraph itself) does not work
>> when the file contains just one paragraph. Try visiting a big CSV dump
>> or log and press M-q. You can abort the operation with C-g, but if Emacs
>> starts to swap like crazy or exceeds the process memory limit and it is
>> killed...
>
> You can choose to skip the "did it change" check if the region to
> check is too long. If the dump was one huge line, we can be confident
> that it will be changed upon filling.

What about a file with lots of lines? If you intentionally press M-q on
such a file and see the modified indicator, you will either assume that
the file changed or use `diff-buffer-with-file' to check for
modifications, and possibly be greeted with a very long diff (possibly
longer than the original file) that will bring Emacs to its knees.

Using the hash approach puts the "too long" threshold at a much higher
level (or eliminates it altogether), requires no extra memory, and is
simpler to implement.
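
The logic of the hash approach can be sketched in a few lines. This is a
Python illustration of the idea only, not Emacs code: `textwrap.fill`
stands in for `fill-paragraph', SHA-1 is an arbitrary choice of hash, and
the helper names are hypothetical. It also does not model memory use in
a buffer; it only shows that a fixed-size digest taken before the
operation is enough to decide whether anything changed.

```python
import hashlib
import textwrap
from typing import Callable, Tuple


def digest(text: str) -> bytes:
    """Fixed-size fingerprint of the text (20 bytes for SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).digest()


def operation_changed_text(text: str,
                           operation: Callable[[str], str]) -> Tuple[str, bool]:
    """Run operation and report whether its result differs from the input.

    Only the digest of the original is consulted for the comparison.
    """
    before = digest(text)
    result = operation(text)
    return result, digest(result) != before


def fill(text: str) -> str:
    """Stand-in for a fill command: wrap the text at 70 columns."""
    return textwrap.fill(text, width=70)


if __name__ == "__main__":
    for sample in ("Already short.", "word " * 40):
        _, changed = operation_changed_text(sample, fill)
        print("modified" if changed else "unmodified")
```

In the Emacs case, the "unmodified" branch is where the buffer's modified
flag would be left as it was before the command ran.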

Dmitry, if your proposal about comparing the paragraphs is motivated
only by your fear of hash collisions, you are way off the mark there
:-)
