UTF-8 corruption bug with diff -y

From: Sjur Nørstebø Moshagen
Subject: UTF-8 corruption bug with diff -y
Date: Thu, 8 Nov 2018 08:47:57 +0000


Running diff -y on UTF-8 encoded text files with long lines risks producing 
corrupted output. With the default output width of 130 columns, a multibyte 
character that happens to straddle that limit is cut in the middle of its 
byte sequence when diff truncates the line, so the resulting diff output is 
malformed UTF-8.
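
To illustrate the failure mode (this is a minimal Python sketch, not diff's 
actual code), cutting a UTF-8 byte string at a fixed byte offset can leave a 
dangling lead byte, while backing up to a character boundary does not. The 
function names and the sample limit of 4 bytes are made up for the example; 
the sample string is taken from the test data below.

```python
def truncate_bytes(data: bytes, limit: int) -> bytes:
    """Naive cut at a byte offset, ignoring character boundaries."""
    return data[:limit]


def truncate_at_char_boundary(data: bytes, limit: int) -> bytes:
    """Cut at or before `limit` without splitting a UTF-8 sequence.

    Assumes `data` is valid UTF-8 to begin with.
    """
    end = min(limit, len(data))
    # Continuation bytes look like 0b10xxxxxx. If the first byte that would
    # be dropped is one of them, the cut falls inside a character, so back
    # up until it falls on the start of a character.
    while 0 < end < len(data) and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# In this sample, 'š' is encoded as two bytes at offsets 3 and 4.
sample = '"iešguđetlágan" A'.encode("utf-8")

naive = truncate_bytes(sample, 4)            # ends with half of 'š'
safe = truncate_at_char_boundary(sample, 4)  # drops 'š' entirely

print(is_valid_utf8(naive), is_valid_utf8(safe))
```

The boundary-aware variant gives up at most three bytes of width per line in 
exchange for output that is always well-formed.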

To reproduce:

diff -y Input-text-1.txt Input-text-2.txt

The bug can be worked around by setting the column width to an arbitrarily 
high number, as long as it is higher than the longest diff line produced:

diff -y -W 200 Input-text-1.txt Input-text-2.txt
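
A value for -W can also be estimated from the input instead of guessed. The 
Python sketch below is only a heuristic under stated assumptions: that each 
side of the side-by-side output needs room for the longest input line, that 
tabs expand to at most 8 columns, and that a fixed margin (16 here, chosen 
arbitrarily) covers the gutter between the two sides. The function name and 
its parameters are hypothetical, and line lengths are measured in bytes 
rather than characters to stay on the safe side.

```python
def suggest_width(lines_a, lines_b, tab_size=8, margin=16):
    """Estimate a -W value wide enough that no line gets truncated.

    Heuristic only: twice the longest tab-expanded line (in UTF-8 bytes)
    plus a margin for the gutter. Not derived from diff's source.
    """
    longest = 0
    for line in list(lines_a) + list(lines_b):
        expanded = line.rstrip("\n").expandtabs(tab_size)
        longest = max(longest, len(expanded.encode("utf-8")))
    return 2 * longest + margin


if __name__ == "__main__":
    import sys

    # Usage: python suggest_width.py Input-text-1.txt Input-text-2.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1], encoding="utf-8") as a, \
             open(sys.argv[2], encoding="utf-8") as b:
            print(suggest_width(a, b))
```

The printed number can then be passed to diff -y -W in place of a guessed 
value such as 200.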

The files Input-text-1.txt and Input-text-2.txt (UTF-8 encoded) are attached. 
The text (excluding --------) is also reproduced below in case the attachments 
are removed during e-mail transfer.

Sjur Moshagen

        "ja" CC
        "iešguhtet" Pron Indef Acc
        "iešguhtet" Pron Indef Attr
        "iešguhtet" Pron Indef Gen
        "lága" N Sem/Dummytag Ess
        "lága" N Sem/Dummytag Sg Loc South Err/Orth
        "lágan" A Sem/Hum Attr
        "lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc
        "lágan" A Sem/Hum Sg Gen Err/Orth-nom-gen
        "lágan" A Sem/Hum Sg Nom
        "láhka" N Sem/Rule Sg Loc South Err/Orth
        "borramuš" N Sem/Food Pl Nom
        "borramuš" N Sem/Food Sg Acc PxSg2
        "borramuš" N Sem/Food Sg Gen PxSg2
        "borrat" Ex/V TV Der/muš N Pl Nom

        "ja" CC
"<iešguđet lágan>"
        "iešguđetlágan" A Sem/Dummytag Attr Err/SpaceCmpágan
        "iešguđetlágan" A Sem/Dummytag Sg Acc Err/Orthacc Err/SpaceCmpágan
        "iešguđetlágan" A Sem/Dummytag Sg Gen Err/Orthgen Err/SpaceCmpágan
        "iešguđetlágan" A Sem/Dummytag Sg Nom Err/SpaceCmpágan
        "borramuš" N Sem/Food Pl Nom
        "borramuš" N Sem/Food Sg Acc PxSg2
        "borramuš" N Sem/Food Sg Gen PxSg2
        "borrat" Ex/V TV Der/muš N Pl Nom

Attachment: Input-text-1.txt
Description: Input-text-1.txt

Attachment: Input-text-2.txt
Description: Input-text-2.txt
