Re: Discussion about file format for the future

From: Arrigo Marchiori
Subject: Re: Discussion about file format for the future
Date: Fri, 5 Jun 2020 23:44:27 +0200

Dear Patrik, All,

I will try to contribute to this interesting conversation.

On Fri, Jun 05, 2020 at 08:16:30AM -0400, Patrik Dufresne wrote:

> As mentioned by Robert, searching for metadata is complex because you need
> to scan multiple files to actually find the right value, instead of running
> a single query as we would with a database.
> Obviously performance-wise it's not great either, because we need to scan
> multiple files.
> The only thing I hate about that is the lack of visibility. As a compromise,
> maybe we can find the most common database and add a layer on top, using a
> command-line tool to search in this database? To let users be autonomous.
> SQLite is probably one of those very popular and simple databases.

If we were going to replace a lot of files with a single file (that is
what a SQLite database is in the end, right?), then we might somehow
introduce a "single point of failure" for the whole backup.

If I interpret correctly some experiences I had with apparent
rdiff-backup metadata corruption (I was backing up files with accented
letters, long paths on Windows, or onto faulty external hard drives),
the current format makes it possible for missing bits of information
to be reconstructed, or for single unrecoverable files to be
substituted with zero-byte stubs, leaving the rest of the backup safe
and recoverable. I wonder what would happen if the SQLite database got
corrupted. Would the data (such as the file list and/or file contents)
still be recoverable?
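To make the concern concrete: SQLite can at least report whether its
single file is internally consistent, via PRAGMA integrity_check. A
minimal sketch with Python's standard sqlite3 module (the table and
column names here are purely illustrative, not an actual rdiff-backup
schema):

```python
import sqlite3

# Build a throw-away database with one illustrative metadata table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (path TEXT PRIMARY KEY, size INTEGER)")
conn.execute("INSERT INTO metadata VALUES ('some/file.txt', 1024)")
conn.commit()

# PRAGMA integrity_check scans the whole database; it returns the
# single row ('ok',) when no corruption is detected.
result = conn.execute("PRAGMA integrity_check").fetchone()
print(result[0])  # → ok
```

The check can tell us *that* something is broken, but the question
above still stands: one damaged page in that single file may affect
many tables at once, whereas per-file metadata localizes the damage.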

I would also like to add another note to this conversation. Microsoft
Windows systems are subject to a limitation on the maximum length of
file paths. This means that files with "long-ish" paths may not be
accessible, or that their corresponding metadata may not be, because
some files inside the rdiff-backup-data directory seem to be named
after the backed-up files, with some codes appended, which makes their
paths even longer.
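A small sketch of the problem, assuming the classic Win32 limit of
MAX_PATH (260) characters, which applies unless an application opts
into long-path support (or uses the \\?\ prefix). The timestamp suffix
below is only an example of the "codes appended" pattern, not the
exact naming scheme:

```python
# Classic Win32 APIs reject full paths of MAX_PATH (260) characters
# or more unless long-path support is enabled.
MAX_PATH = 260

# A source path that fits, and the same path with an example
# increment-style suffix appended, which no longer fits.
original = "C:/backup/" + "deeply/nested/" * 16 + "report.doc"
increment = original + ".2020-06-05T23:44:27+02:00.diff.gz"

for path in (original, increment):
    status = "OK" if len(path) < MAX_PATH else "too long"
    print(f"{len(path):4d} chars: {status}")
```

So a source file that backs up fine may still produce metadata or
increment files whose paths exceed the limit.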

If the rdiff-backup-data directory is ever going to be redesigned,
then please consider making it filesystem-agnostic. This would not
only solve the above problem, but also allow other possibly useful use
cases, such as backing up a case-sensitive filesystem to a
case-insensitive one, or vice versa... reliably.

I am also replying to Robert's e-mail below.

> On Thu., Jun. 4, 2020, 11:03 p.m. Robert Nichols, <
> rnicholsNOSPAM@comcast.net> wrote:
> > On 6/4/20 11:43 AM, Patrik Dufresne wrote:
> > > But my two cents on the subject is: should we really keep this file base?
> > > For rdiffweb, scanning the metadata files is a nightmare when I just need
> > > a subset of the data to be displayed to the user. I always thought a
> > > database could be a better fit for the job. Something like a key store or
> > > similar.
> >
> > +1 from me
> >
> > The way rdiff-backup stores metadata is its worst feature, in my opinion.
> > Keeping the metadata in various text files makes analysis unnecessarily
> > complex and searches very inefficient. Inode data for hard-linked files
> > is replicated in the mirror_metadata file, except for the checksum, which
> > is stored just on the first entry for that inode, so you have to go
> > hunting for it, and make sure it is always in the right place when
> > that linking changes. That sort of thing just screams to be stored in
> > a database.

I personally never looked into the details of rdiff-backup, but I
often wished I could access all that data... easily.

Maybe this is what you are looking for as well? An alternative way to
access rdiff-backup data and meta-data, other than launching
rdiff-backup itself?

IMHO the best way of addressing this problem would not be to make an
"easy to parse" file format, but rather to develop an official, easy
to use API.  If rdiff-backup itself were importable from Python
scripts, and made its functions directly accessible from Python code,
then other tools (such as Patrik's rdiffweb, if I understood
correctly) would probably no longer care about how increments and
metadata are stored, because the API would abstract the details.
This, at least, is how I would imagine an ideal future development of
this software.
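To make the idea concrete, such an API might look roughly like the
sketch below. Every name in it (the Repository class, EntryVersion,
the versions() method) is purely hypothetical; nothing like it exists
in rdiff-backup today, as far as I know. The in-memory backend only
stands in for whatever storage format is eventually chosen:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator

# A hypothetical, storage-agnostic facade: callers see entries and
# versions, never mirror_metadata files or increment suffixes.

@dataclass
class EntryVersion:
    path: str
    timestamp: datetime
    size: int

class Repository:
    """Hypothetical read-only view of an rdiff-backup repository."""

    def __init__(self, entries: list[EntryVersion]):
        self._entries = entries  # a real backend would read from disk

    def versions(self, path: str) -> Iterator[EntryVersion]:
        """All backed-up versions of one path, newest first."""
        matches = (e for e in self._entries if e.path == path)
        return iter(sorted(matches, key=lambda e: e.timestamp, reverse=True))

# A tool like rdiffweb could then query the repository directly:
repo = Repository([
    EntryVersion("docs/report.doc", datetime(2020, 6, 1), 1024),
    EntryVersion("docs/report.doc", datetime(2020, 6, 5), 2048),
])
latest = next(repo.versions("docs/report.doc"))
print(latest.timestamp.date())  # → 2020-06-05
```

With such a facade, the on-disk format could even change between
versions without breaking the tools built on top of it.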

The structure of the metadata itself should rather be based on the
concepts of fault tolerance and independence from the filesystem, as I
suggested above.

I hope I understood the topic of this thread, and that I could explain
myself clearly enough.

Best regards,

