Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames w

bug-findutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames w

From:	Dan Harkless
Subject:	Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc.
Date:	Thu, 24 Feb 2022 08:32:55 -0800
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.6.0

Howdy. I posted last August to the Cygwin list about some problems Iwas having with the updatedb script, including that it was taking morethan 24 hours to complete, and then colliding with the next cron run,and that I had no way to monitor progress (short of using SysinternalsProcess Explorer's very awkward GUI for this).

I'm finally getting around to sending in a patch (to bug-findutils andthe Cygwin list, to which I'm currently subscribed) to address theseissues, along with some others, a few of which represent small changesin behavior:

1. Changed the direct 'find' -> 'sort' pipelining of the file list toinstead go to a temporary file first (and then a second one, whensorting). This allows monitoring of progress with 'tail -f'.

2. By default, these .txt and .txt.sort files get deleted on exit (orfatal signal), but I've added a --keeptext option which can be set to'sort' or 'both' to preserve them, either for debugging purposes, or forrunning special-purpose text-munging/matching scripts on.

3. The script now does locking, outputting its PID to (by default)/var/locatedb.running_updatedb_pid (I was first just going to just callit locatedb.lock, but I decided to be more user-friendly). If that fileexists when another instance starts to run, the script will abort withan error.

4. Improved signal-handling, catching more fatal signals. I alsochanged SIGHUP to not be treated as fatal, in line with most programs.

5. /dev was not being excluded by default on Cygwin, since there wecan't recognize it by filesystem type. This was causing sometime-consuming looping problems for me, with falsely self-nested paths.

6. I tried a whole bunch of different quoting variations, but I couldn'tget --prunepaths to work with paths containing spaces (which are ofcourse very common on Windows). The PRUNEREGEX would always end upincorrectly splitting on the spaces. I tried to improve the regexp totreat '\ ' differently than ' ', but I couldn't get it to work (I'm moreexpert with Perl regexps than POSIX ones). In the end, I used anot-TOO-ugly kludge: in the first sed -e expression, I change '\ ' to'///', and then in the last one I change '///' back to ' ' (thebackslash isn't needed in the regexp). Of course '///' should neverappear in a path; I didn't use simply '//' because it's a relativelycommon artifact of path concatenation.

7. make_tempdir() wasn't being used in the current version of findutils,so I removed it.

8. Previously, the only protection the script did of the prior versionof the database was to write to locatedb.n, and then overwrite locatedbwith it at the end. This was a problem for me when trying to debug theupdatedb issues I was having, since, for instance, if you killed the'find', it'd still overwrite the old DB with a partial file list. Thescript now saves the previous version of the DB as locatedb.prev. Thisis also quite handy when a file you expect to be able to find goesmissing, in case it disappeared since the last updatedb run. (I didn'taddress the underlying problem of the script not aborting if 'find' did;the bizarre 4-way 'find' construction with file redirection from themiddle of the if {} is unlike anything I've seen in a shell script before.)

9. I also made some minor changes, including not outputting the fullscript path on errors, standardizing error formatting, putting quotesaround arguments in errors, and various cleanups and comment improvements.

10. The most significant of the minor changes I made was standardizingthe indentation, which was all over the place, making it difficult tounderstand some of the code while debugging. Because of that, inaddition to the attached 'diff -u' patch, here's the output of 'diff-uw' to make it easier to review my changes. My Linux systems use adifferent version of locate, thus I didn't test there, but I think Iremained platform-agnostic (and also didn't write any code that wouldn'twork under Bourne/POSIX shell). BTW, I assign my copyright to GNU (Ifilled out the official form some years ago when I was co-maintainer ofWget), and if you feel my self-credit in the author list comment isunwarranted, of course feel free to get rid of it.


Here's that 'diff -uw' output (again, 'diff -u' patch attached):
--- updatedb.orig    2022-02-05 09:37:55.000000000 -0800
+++ updatedb    2022-02-24 03:27:10.749175300 -0800
@@ -15,13 +15,20 @@
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <https://www.gnu.org/licenses/>.

-# csh original by James Woods; sh conversion by David MacKenzie.
+# csh original by James Woods; sh conversion by David MacKenzie;
+# cleanup and enhancements by Dan Harkless.

 #exec 2> /tmp/updatedb-trace.txt
 #set -x

+ourname=`basename $0` # don't verbosely report path to the script inerrors

+
+stderr() {
+    echo "$ourname: $*" >&2
+}
+
 version='
-updatedb (GNU findutils) 4.9.0
+updatedb (GNU findutils) 4.9.0+patches
 Copyright (C) 1994-2022 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later<https://gnu.org/licenses/gpl.html>

 This is free software: you are free to change and redistribute it.
@@ -47,11 +54,12 @@
 # (correctly) points to https://www.gnu.org/software/findutils/ instead
 # of the bug reporting page.
 usage="\
-Usage: $0 [--findoptions='-option1 -option2...']
+Usage: $ourname [--findoptions='-option1 -option2...']
        [--localpaths='dir1 dir2...'] [--netpaths='dir1 dir2...']
        [--prunepaths='dir1 dir2...'] [--prunefs='fs1 fs2...']
        [--output=dbfile] [--netuser=user] [--localuser=user]
-       [--dbformat] [--version] [--help]
+       [--dbformat=(LOCATE02|slocate)] [--keeptxt=(sort|both)]
+       [--version] [--help]

Please see also the documentation athttps://www.gnu.org/software/findutils/.

 Report (and track progress on fixing) bugs in the updatedb
@@ -61,8 +69,7 @@
 "
 changeto=/

-for arg
-do
+for arg; do
   # If we are unable to fork, the back-tick operator will
   # fail (and the shell will emit an error message).  When
   # this happens, we exit with error value 71 (EX_OSERR).
@@ -80,10 +87,11 @@
     --localuser) LOCALUSER="$val" ;;
     --changecwd)  changeto="$val" ;;
     --dbformat)   dbformat="$val" ;;
-    --version) fail=0; echo "$version" || fail=1; exit $fail ;;
-    --help)    fail=0; echo "$usage"   || fail=1; exit $fail ;;
-    *) echo "updatedb: invalid option $opt
-Try '$0 --help' for more information." >&2
+        --keeptxt)     keeptxt="$val" ;;
+        --version) fail=0; echo "$version" >&2 || fail=1; exit $fail ;;
+        --help)    fail=0; echo "$usage"   >&2 || fail=1; exit $fail ;;
+        *) stderr 'Invalid option "'$opt'".'
+           echo "          Try '$ourname --help' for more information." >&2
        exit 1 ;;
   esac
 done
@@ -100,13 +108,87 @@
         ;;
     *)
         # The "old" database format is no longer supported.

- echo "Unsupported locate database format ${dbformat}: Supportedformats are:" >&2

-        echo "LOCATE02, slocate" >&2
+        stderr 'Unsupported locate database format "'$dbformat'".'
+        echo '          Supported formats are "LOCATE02" or "slocate".' >&2
         exit 1
 esac

+# The database file to build (overridable via commandline orenvironment var.).

+: ${LOCATE_DB=/var/locatedb}
+LOCATE_DB_DIR=`dirname $LOCATE_DB`
+

+# Prevent overlapping with ourselves. Large filesystem collections caneasily+# take over 24 hours to complete, even on pretty speedy systems / harddrives.+# Ideally this would go in /var/run on systems that have that, but thisis OK.

+lockfile=$LOCATE_DB.running_updatedb_pid
+
+if [ -e $lockfile ]; then
+    stderr "Aborting since prior run's lockfile still exists:"
+    ls -lF $lockfile >&2
+    exit 1
+fi
+
+keeptxt=neither
+reported_lockfile_failure=0
+
+cleanup_on_exit_or_signal() {
+    rm -f $LOCATE_DB.n
+
+    if [ $reported_lockfile_failure -ne 1 ]; then
+        # We didn't already have a failure trying to initially create the

+ # lockfile, so we can assume the temporary .txt files are ours(not

+        # saved on a previous run with --keeptxt), and it's safe to
+        # (optionally) delete them.
+        if [ x"$keeptxt" = x"sort" ]; then
+            rm -f $LOCATE_DB.txt
+        elif [ x"$keeptxt" != x"both" ]; then
+            # TBD: Report undefined values of --keeptxt?
+            rm -f $LOCATE_DB.txt $LOCATE_DB.txt.sort
+        fi
+    fi
+
+    if ! rm -f $lockfile; then
+        report_lockfile_failure "remove"
+    fi
+}
+
+report_lockfile_failure() {
+    if [ $reported_lockfile_failure -ne 1 ]; then
+        echo -n "$ourname: Failed to $1 lockfile $lockfile" >&2
+
+        if [ -e $lockfile ]; then
+            echo ":" >&2
+            ls -lF $lockfile >&2
+        else
+            echo " in dir:" >&2
+            ls -dlF $LOCATE_DB_DIR >&2
+        fi
+
+        reported_lockfile_failure=1
+    fi
+}
+

+# Now that we've checked for a previous lockfile above, it's safe toinstall+# cleanup signal handler. We'll try to catch all potentially fatalsignals,

+# along with exit.  From CentOS 7's /usr/include/asm/signal.h:
+#[shell exit=0] SIGHUP=1        SIGINT=2        SIGQUIT=3 SIGILL=4
+#SIGTRAP=5      SIGABRT=6       SIGIOT=6        SIGBUS=7 SIGFPE=8
+#SIGKILL=9      SIGUSR1=10      SIGSEGV=11      SIGUSR2=12 SIGPIPE=13
+#SIGALRM=14     SIGTERM=15      SIGSTKFLT=16    SIGCHLD=17 SIGCONT=18
+#SIGSTOP=19     SIGTSTP=20      SIGTTIN=21      SIGTTOU=22 SIGURG=23
+#SIGXCPU=24     SIGXFSZ=25      SIGVTALRM=26    SIGPROF=27 SIGWINCH=28
+#SIGIO=29       SIGPOLL=SIGIO   SIGLOST=29      SIGPWR=30 SIGSYS=31
+trap cleanup_on_exit_or_signal 0 2 3 4 6 7 8 9 11 15 16 30 31
+

+# Now that we've installed the signal handler, it's safe to createlockfile.

+if ! echo $$ > $lockfile; then
+    report_lockfile_failure "write to"
+    exit 1
+fi

-if true

+# Don't use NUL as a path separator, now that we write to a temporarytext file+# (may want to make that controllable with a commandline option in thefuture).

+if false
 then
     sort="/usr/bin/sort -z"
     print_option="-print0"
@@ -123,7 +205,7 @@
     id | cut -d'(' -f 1 | cut -d'=' -f2
 }

-# figure out if su supports the -s option
+# Figure out if su supports the -s option.
 select_shell() {
     if su "$1" -s $SHELL -c false < /dev/null  ; then
     # No.
@@ -140,8 +222,7 @@
     fi
 }

-
-# You can set these in the environment, or use command-line options,
+# You can set these in the environment, or use command-line options
 # to override their defaults:

 # Any global options for find?
@@ -156,10 +237,13 @@
 # Network (NFS, AFS, RFS, etc.) directories to put in the database.
 : ${NETPATHS=}

-# Directories to not put in the database, which would otherwise be.

+# Default list of directories (overridable with options) to be omittedfrom the+# database. Note that /dev and /proc need to be specified"redundantly" here,

+# since on Cygwin, they can't be detected based on filesystem type.
 : ${PRUNEPATHS="
 /afs
 /amd
+/dev
 /proc
 /sfs
 /tmp
@@ -167,24 +251,26 @@
 /var/tmp
 "}

-# Trailing slashes result in regex items that are never matched, which
-# is not what the user will expect.   Therefore we now reject such
-# constructs.
+# Trailing slashes result in regex items that are never matched, which is
+# not what the user will expect.  Therefore we now reject such constructs.
+# TBD: Just remove any trailing slashes instead?
 for p in $PRUNEPATHS; do
     case "$p" in

- /*/) echo "$0: $p: pruned paths should not contain trailingslashes" >&2+ /*/) stderr "Prune path '$p' has a trailing slash, which isn'tallowed."

            exit 1
     esac
 done

-# The same, in the form of a regex that find can use.

+# Convert $PRUNEPATHS to a regex that find can use. Note that to allowpaths+# containing spaces, the first -e changes '\ ' to '///' ('//' isn'tused since

+# it's a semi-common artifact of path concatenation), and then the last -e
+# changes '///' back to ' ' (it doesn't need backslashing in the regex).
 test -z "$PRUNEREGEX" &&

- PRUNEREGEX=`echo $PRUNEPATHS|sed -e 's,^,\\$^,' -e 's,,$\\$\\\|\\$^,g' -e 's,$,$\\$,'`+ PRUNEREGEX=`echo $PRUNEPATHS | sed -e 's,\\\ ,///,g' -e 's,^,\\$^,'-e 's, ,$\\$\\\|\\$^,g' -e 's,$,$\\$,' -e 's,///, ,g'`


-# The database file to build.
-: ${LOCATE_DB=/var/locatedb}
-
-# Directory to hold intermediate files.

+# Directory for sort (& possibly other executables) to holdintermediate files.+# The script's own temporary files go in the same directory as thedatabase,

+# since they aren't always temporary (--keeptxt or left-behind lockfile).
 if test -z "$TMPDIR"; then
   if test -d /var/tmp; then
     : ${TMPDIR=/var/tmp}
@@ -217,42 +303,19 @@
 : ${find:=${BINDIR}/find}
 : ${frcode:=${LIBEXECDIR}/frcode}

-make_tempdir () {
-    # This implementation is adapted from the GNU Autoconf manual.
-    {
-        tmp=`
-    (umask 077 && mktemp -d "$TMPDIR/updatedbXXXXXX") 2>/dev/null
-    ` &&
-        test -n "$tmp" && test -d "$tmp"
-    } || {
-    # This method is less secure than mktemp -d, but it's a fallback.
-    #
-    # We use $$ as well as $RANDOM since $RANDOM may not be available.
-    # We also add a time-dependent suffix.  This is actually somewhat
-    # predictable, but then so is $$.  POSIX does not require date to
-    # support +%N.
-    ts=`date +%N%S || date +%S 2>/dev/null`
-        tmp="$TMPDIR"/updatedb"$$"-"${RANDOM:-}${ts}"
-        (umask 077 && mkdir "$tmp")
-    }
-    echo "$tmp"
-}
-
 checkbinary () {
     if test -x "$1" ; then
     : ok
     else
-      eval echo "updatedb needs to be able to execute $1, but cannot." >&2
+        stderr "We need to be able to execute $1, but cannot."
       exit 1
     fi
 }

-for binary in $find $frcode
-do
+for binary in $find $frcode; do
   checkbinary $binary
 done

-
 : ${PRUNEFS="
 9P
 NFS
@@ -283,10 +346,20 @@
 fi

 # Make and code the file list.
-# Sort case insensitively for users' convenience.

-rm -f $LOCATE_DB.n
-trap 'rm -f $LOCATE_DB.n; exit' HUP TERM
+if ! echo test > $LOCATE_DB.n; then
+    stderr "Failed to write to temporary database file $LOCATE_DB.n."
+    exit 1
+fi
+

+# We now write to a temporary text file instead of going direct over apipe, as+# the latter makes it very difficult to monitor progress and to debugfailures.

+if ! echo test > $LOCATE_DB.txt; then
+    stderr "Failed to write to text list of files $LOCATE_DB.txt."
+    exit 1
+fi
+
+failed_to_generate_locate_db=0

 if {
 cd "$changeto"
@@ -314,29 +387,43 @@
     exit $?
   else
     # : A4

- $find $NETPATHS $FINDOPTIONS $ -type d -regex "$PRUNEREGEX" -prune$ -o $print_option ||

+            $find $NETPATHS $FINDOPTIONS \( -type d -regex "$PRUNEREGEX" \
+              -prune \) -o $print_option ||
     exit $?
   fi
 fi
-} | $sort | $frcode $frcode_options > $LOCATE_DB.n
+} > $LOCATE_DB.txt
 then
-    : OK so far
-    true

+ # OK, find completed. Going through all the files is verytime-consuming

+    # on some systems, so (try to) save a copy of the previous DB in case
+    # something goes wrong at this point.
+    cp -fp $LOCATE_DB $LOCATE_DB.prev
+

+ # Now sort results, case-insensitively for user convenience, thengenerate

+    # the new DB.
+    if ! $sort -f $LOCATE_DB.txt > $LOCATE_DB.txt.sort; then
+        failed_return_value=$?
+        failed_to_generate_locate_db=1

+ elif ! $frcode $frcode_options < $LOCATE_DB.txt.sort >$LOCATE_DB.n; then

+        failed_return_value=$?
+        failed_to_generate_locate_db=1
+    fi
 else
-    rv=$?
-    echo "Failed to generate $LOCATE_DB.n" >&2
+    failed_to_generate_locate_db=1
+fi
+
+if [ $failed_to_generate_locate_db -eq 1 ]; then
+    stderr "Failed to generate new database temp file $LOCATE_DB.n."
     rm -f $LOCATE_DB.n
-    exit $rv
+    exit $failed_return_value
 fi

-# To avoid breaking locate while this script is running, put the
+# To avoid breaking locate while this script is running, we put the
 # results in a temp file, then rename it atomically.
 if test -s $LOCATE_DB.n; then
-  chmod 644 ${LOCATE_DB}.n
-  mv ${LOCATE_DB}.n $LOCATE_DB
+    chmod 644 $LOCATE_DB.n
+    mv -f $LOCATE_DB.n $LOCATE_DB
 else
-  echo "updatedb: new database would be empty" >&2
+    stderr "New database would be empty, so not creating it."
   rm -f $LOCATE_DB.n
 fi
-
-exit 0

Dan Harkless
http://harkless.org/dan/

updatedb.patch
Description: Text document

[Prev in Thread]

Current Thread

[Next in Thread]

Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc., Dan Harkless <=
- Re: Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc., Andreas Metzler, 2022/02/25
- Re: Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc., Bernhard Voelker, 2022/02/27
  - Re: Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc., Dan Harkless, 2022/02/27

Prev by Date: [bug #62043] The xargs -r, --no-run-if-empty Option Is Ignored When a Delimiter Is Passed
Next by Date: Re: Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc.
Previous by thread: [bug #59988] Documentation for the find command omits -printf %{A, T, C}z
Next by thread: Re: Patches to findutils 4.9.0-1's updatedb to do locking, allow filenames with spaces & progress monitoring, exclude /dev on Cygwin, etc.
Index(es):
- Date
- Thread