rdiff-backup-users



From: Mr. Clif
Subject: Re: cross-platform backup tool Same files from different source dir causes spurious diff files
Date: Wed, 9 Feb 2022 12:46:24 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.14.0

Howdy,

I've dug into this further and think I now know what's going on. FYI, this is more than an academic question for me: I have several VMs that I would like to take snapshots of, and this first one is by far the smallest.

These VMs are LXC containers, and when I started out a long time ago, I would just manually create the filesystems and use some CLI tools to install a fresh distro. Eventually the Linux kernel started supporting namespaces to improve security, and they were adopted by the virtualization ecosystems.

I'm not sure when it happened, because I only just noticed it. Maybe it was when I switched to letting Proxmox spin up the new VMs. Either way, the UIDs and GIDs in the filesystems for the unprivileged containers have all been shifted by adding 100000 to them. This is why rdiff-backup updated all that metadata.

This is not just a mapping in RAM; it's actually in the filesystem image on disk. There are several ways of dealing with this. Some tools will update the UID/GIDs for you when you reboot the VM. Other tools act like a layer, similar to a bind mount, that mostly duplicates a filesystem somewhere else and rewrites the UID/GIDs on the fly. Some utilities like rdiff-backup and rsync have some ability to rewrite or map the UID/GIDs as they copy. The last two approaches seem most attractive to me.

rsync has --usermap and --groupmap, and rdiff-backup has --user-mapping-file and --group-mapping-file. In the filesystem mount utility area there are shiftfs, idmapped mounts, and bindfs.
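Since the shift is a fixed offset, a mapping file for it could be generated mechanically. The Python sketch below assumes the colon-separated old_id:new_id line format that the man page describes under USERS AND GROUPS; the function names, the 65536-ID range, and the output file name are illustrative only. Whether rdiff-backup applies purely numeric mappings this way when user names are also present has not been verified here.

```python
def mapping_lines(shift=100000, count=65536):
    """Yield 'old_id:new_id' lines that map shift+n back to n.

    The line format is an assumption based on the rdiff-backup man page;
    shift and count are illustrative defaults for one unprivileged container.
    """
    for n in range(count):
        yield f"{shift + n}:{n}"


def write_mapping_file(path, shift=100000, count=65536):
    """Write the generated mapping lines to path, one per line."""
    with open(path, "w") as fh:
        for line in mapping_lines(shift, count):
            fh.write(line + "\n")
```

Calling write_mapping_file("uid-map.txt") would produce a file to try with --user-mapping-file; the same file could be tried with --group-mapping-file if the GIDs are shifted identically.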

Shiftfs is deprecated in favor of idmapped mounts, though some of my kernels don't have that yet. Bindfs is a FUSE-based solution and so might be slower; however, it might be the only one that is really workable for me at the moment, because it has the --uid-offset and --gid-offset options. By the way, you can put in negative offsets too, which is a good thing. :-)

It would be great if rdiff-backup would allow offsets like this or even better the ability to specify a range like
100000-165535:0-65535
Or you could just have the starting UID after the colon.
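To make the proposal concrete, here is a Python sketch of how such a range spec could be interpreted. This is a hypothetical syntax, not something rdiff-backup supports today, and parse_range_spec and map_id are made-up names.

```python
def parse_range_spec(spec):
    """Parse 'LOW-HIGH:START' or 'LOW-HIGH:START-END' (hypothetical syntax).

    Returns (low, high, start). If an END is given, it must describe a
    target range of the same length as the source range.
    """
    src, dst = spec.split(":")
    low, high = (int(part) for part in src.split("-"))
    dst_parts = dst.split("-")
    start = int(dst_parts[0])
    if len(dst_parts) == 2 and int(dst_parts[1]) - start != high - low:
        raise ValueError("source and target ranges differ in length")
    return low, high, start


def map_id(numeric_id, spec):
    """Shift an ID that falls inside the source range; leave others alone."""
    low, high, start = parse_range_spec(spec)
    if low <= numeric_id <= high:
        return numeric_id - low + start
    return numeric_id
```

With the spec above, map_id(101000, "100000-165535:0-65535") gives 1000, and an ID outside the range, such as 42, is returned unchanged.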

In the man page under USERS AND GROUPS, it says:
"If you specify both --preserve-numerical-ids and one of the mapping options, the behavior is undefined."

I think it would be better to allow both, with the user-mapping-file overriding the preserve-numerical-ids behavior when necessary, since in my use case I never want user name mapping.

What do you think? I appreciate the discussion, and everyone's help.

    Thanks,
    Clif


On 2/8/22 6:03 PM, Robert Nichols wrote:
On 2/8/22 6:44 PM, Mr. Clif wrote:
ok cool, good info,

I was just digging into it again, and the date I switched to the snapshot was recorded as Feb 1st. Here is a list of the mirror_metadata files leading up to that:

-rw------- 1 root root 2.7M Jan 21 05:25 mirror_metadata.2022-01-21T05:20:05-09:00.snapshot.gz
-rw------- 1 root root  632 Jan 23 05:25 mirror_metadata.2022-01-22T05:20:26-09:00.diff.gz
-rw------- 1 root root  790 Jan 24 05:26 mirror_metadata.2022-01-23T05:20:04-09:00.diff.gz
-rw------- 1 root root  783 Jan 25 05:24 mirror_metadata.2022-01-24T05:20:33-09:00.diff.gz
-rw------- 1 root root  778 Jan 26 05:29 mirror_metadata.2022-01-25T05:19:31-09:00.diff.gz
-rw------- 1 root root  731 Jan 27 05:25 mirror_metadata.2022-01-26T05:23:21-09:00.diff.gz
-rw------- 1 root root  723 Jan 28 05:27 mirror_metadata.2022-01-27T05:20:37-09:00.diff.gz
-rw------- 1 root root  786 Jan 29 05:29 mirror_metadata.2022-01-28T05:21:17-09:00.diff.gz
-rw------- 1 root root  772 Jan 30 05:26 mirror_metadata.2022-01-29T05:23:55-09:00.diff.gz
-rw------- 1 root root 2.7M Jan 30 05:26 mirror_metadata.2022-01-30T05:20:43-09:00.snapshot.gz
-rw------- 1 root root  725 Feb  1 05:26 mirror_metadata.2022-01-31T05:21:21-09:00.diff.gz
-rw------- 1 root root 2.6M Feb  3 15:33 mirror_metadata.2022-02-01T05:20:43-09:00.diff.gz
-rw------- 1 root root  613 Feb  4 05:16 mirror_metadata.2022-02-03T14:20:54-09:00.diff.gz
-rw------- 1 root root 1.7K Feb  5 05:17 mirror_metadata.2022-02-04T05:13:29-09:00.diff.gz
-rw------- 1 root root  852 Feb  6 05:55 mirror_metadata.2022-02-05T05:14:57-09:00.diff.gz
-rw------- 1 root root 1.7K Feb  7 06:36 mirror_metadata.2022-02-06T05:52:59-09:00.diff.gz
-rw------- 1 root root  73K Feb  8 05:39 mirror_metadata.2022-02-07T06:33:04-09:00.diff.gz
-rw------- 1 root root 2.7M Feb  8 05:39 mirror_metadata.2022-02-08T05:33:08-09:00.snapshot.gz

You will see that the mirror_metadata.2022-02-01T05:20:43-09:00.diff.gz with the modified date of Feb 3rd is about the same size as the previous snapshot file a couple of days before.

If you grep for the lines that match "^File", then I presume you get a good count of the number of files that changed, or at least were recorded for some reason. Here are those stats:

find increments -name "*2022-02-01*" -exec ls -lh {} \; | wc
   85287  767583 11064660
gzip -dc mirror_metadata.2022-01-30T05:20:43-09:00.snapshot.gz | egrep "^File " | wc
   89287  178574 4535737
gzip -dc mirror_metadata.2022-02-01T05:20:43-09:00.diff.gz | egrep "^File " | wc
   85288  170576 4374253

Notice how the number of increment files with that date in their names (the first wc output) is almost the same as the number of files listed in the 2022-02-01 diff.gz file (the last wc output).
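The same count can be taken without the shell pipeline. A minimal Python sketch, assuming only what the egrep above relies on, namely that each record in the metadata file starts with a line beginning "File " (count_file_records is an illustrative name):

```python
import gzip


def count_file_records(path):
    """Count lines starting with 'File ' in a gzip-compressed metadata file."""
    count = 0
    # errors="replace" keeps odd bytes in file names from aborting the scan.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith("File "):
                count += 1
    return count
```

Run against the snapshot.gz and diff.gz files named above, it should report the same numbers as the first column of the wc output.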

I also compared some of the entries in the snapshot file to the diff.gz file, and never found any differences. Of course I only checked a dozen or two.

I believe you are comparing the wrong files. Welcome to the confusing world of reverse diffs. Everything works backward. That 2.6MB mirror_metadata.2022-02-01T05:20:43-09:00.diff.gz has the differences that would be applied to a 2022-02-03T14:20:54-09:00 snapshot (i.e., the next _newer_ state) to construct a 2022-02-01 snapshot. The huge perceived change occurred between the 2022-02-01 backup and the 2022-02-03 backup.

I would first look at some of the entries in that mirror_metadata.2022-02-03T14:20:54-09:00.diff.gz file and see if some of the same filenames appear in the huge 2022-02-01 diff. Hopefully you can spot what metadata changed. If you can't find any matching names in the 2022-02-03 diff, try the 2022-02-04 diff. As a last resort, I can send you a rather large** awk script that you can use to work back from the nearest future snapshot (currently 2022-02-08) to reconstruct a 2022-02-03 snapshot. Then you should certainly be able to see what differences that 2022-02-01 diff is applying.

** A bit over 3KB, somewhat more than I care to spew out to a mailing list.
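The idea of working backward can be sketched in a few lines of Python. This is not the awk script mentioned above, only a simplified model with stated assumptions: each record is a "File <path>" line followed by indented attribute lines, a diff record replaces the whole record for that path, and a record containing "Type None" means the file did not exist in the older state. File name quoting is ignored, and all function names are illustrative.

```python
def parse_records(text):
    """Split metadata text into {path: [attribute lines]} (assumed format)."""
    records = {}
    current = None
    for line in text.splitlines():
        if line.startswith("File "):
            current = line[len("File "):]
            records[current] = []
        elif current is not None and line.strip():
            records[current].append(line.strip())
    return records


def apply_reverse_diff(newer, diff):
    """Rebuild the older state from the newer state plus one reverse diff."""
    older = dict(newer)
    for path, attrs in diff.items():
        if "Type None" in attrs:
            older.pop(path, None)   # file was absent in the older state
        else:
            older[path] = attrs     # diff carries the older record in full
    return older


def changed_fields(state_a, state_b):
    """For paths present in both states, return the attribute lines that differ."""
    changes = {}
    for path in state_a.keys() & state_b.keys():
        delta = set(state_a[path]) ^ set(state_b[path])
        if delta:
            changes[path] = delta
    return changes
```

Starting from the 2022-02-08 snapshot and applying each older diff in turn, newest first, would yield the 2022-02-03 state; comparing that with the state after also applying the 2022-02-01 diff should show which attribute lines changed.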




