UTF-8 corruption bug with diff -y
From: Sjur Nørstebø Moshagen
Subject: UTF-8 corruption bug with diff -y
Date: Thu, 8 Nov 2018 08:47:57 +0000
Hello
Using diff -y on UTF-8 encoded text files with long lines risks producing
corrupted output when used with the default column width of 130 columns, if a
multibyte character happens to fall on the border of that limit. The diff
command will truncate the resulting diff output in the middle of the byte
sequence, producing malformed UTF-8 text.
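The mechanism can be illustrated with a minimal Python sketch. It assumes the truncation counts bytes rather than characters; it does not reproduce diff's actual column logic, and the helper names truncate_bytes and is_valid_utf8 are hypothetical, chosen only for this illustration.

```python
def truncate_bytes(line: str, width: int) -> bytes:
    """Cut the UTF-8 encoding of `line` after `width` bytes,
    ignoring character boundaries (assumed behaviour, for illustration)."""
    return line.encode("utf-8")[:width]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A line taken from Input-text-1.txt. The character "š" is encoded as
# two bytes and occupies byte offsets 3 and 4 of the encoded line.
sample = '"iešguhtet" Pron Indef Acc'

before_char = truncate_bytes(sample, 3)  # cut just before "š"
inside_char = truncate_bytes(sample, 4)  # cut between the two bytes of "š"
after_char = truncate_bytes(sample, 5)   # cut just after "š"

print(is_valid_utf8(before_char), is_valid_utf8(inside_char), is_valid_utf8(after_char))
```

Only the cut that lands inside the two-byte sequence yields malformed UTF-8, which matches the observation that the problem appears only when a multibyte character straddles the width limit.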
To reproduce:
diff -y Input-text-1.txt Input-text-2.txt
The bug can be circumvented by setting the column width to an arbitrarily high
number, as long as it is higher than the longest diff line produced:
diff -y -W 200 Input-text-1.txt Input-text-2.txt
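To check whether a given width avoids the problem, the output can be tested for malformed UTF-8 line by line. The following is a rough sketch, not part of diff itself: invalid_utf8_lines and check_diff_output are hypothetical helper names, and check_diff_output assumes a diff binary on the PATH that accepts the -y and -W options used above.

```python
import subprocess


def invalid_utf8_lines(output: bytes) -> list:
    """Return the 1-based numbers of the lines in `output`
    that do not decode as UTF-8."""
    bad = []
    for number, line in enumerate(output.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def check_diff_output(file1: str, file2: str, width=None) -> list:
    """Run `diff -y` (optionally with `-W width`) on two files and
    report which output lines are malformed UTF-8."""
    command = ["diff", "-y"]
    if width is not None:
        command += ["-W", str(width)]
    command += [file1, file2]
    # diff exits with status 1 when the files differ, so the exit
    # status is deliberately not treated as an error here.
    result = subprocess.run(command, capture_output=True)
    return invalid_utf8_lines(result.stdout)
```

For example, check_diff_output("Input-text-1.txt", "Input-text-2.txt") could be compared with check_diff_output("Input-text-1.txt", "Input-text-2.txt", width=200); an empty list means every output line decoded cleanly.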
The files Input-text-1.txt and Input-text-2.txt (UTF-8 encoded) are attached.
The text (excluding --------) is also reproduced below in case the attachments
are removed during e-mail transfer.
Regards,
Sjur Moshagen
Input-text-1.txt:
--------
"<ja>"
"ja" CC
"<iešguđet>"
"iešguhtet" Pron Indef Acc
"iešguhtet" Pron Indef Attr
"iešguhtet" Pron Indef Gen
"<lágan>"
"lága" N Sem/Dummytag Ess
"lága" N Sem/Dummytag Sg Loc South Err/Orth
"lágan" A Sem/Hum Attr
"lágan" A Sem/Hum Sg Acc Err/Orth-nom-acc
"lágan" A Sem/Hum Sg Gen Err/Orth-nom-gen
"lágan" A Sem/Hum Sg Nom
"láhka" N Sem/Rule Sg Loc South Err/Orth
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------
Input-text-2.txt:
--------
"<ja>"
"ja" CC
"<iešguđet lágan>"
"iešguđetlágan" A Sem/Dummytag Attr Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Acc Err/Orthacc Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Gen Err/Orthgen Err/SpaceCmpágan
"iešguđetlágan" A Sem/Dummytag Sg Nom Err/SpaceCmpágan
"<borramušat>"
"borramuš" N Sem/Food Pl Nom
"borramuš" N Sem/Food Sg Acc PxSg2
"borramuš" N Sem/Food Sg Gen PxSg2
"borrat" Ex/V TV Der/muš N Pl Nom
--------
Attachment: Input-text-1.txt
Attachment: Input-text-2.txt