emacs-devel

Re: Rationale for split-string?


From: Stephen J. Turnbull
Subject: Re: Rationale for split-string?
Date: Tue, 20 May 2003 10:55:20 +0900
User-agent: Gnus/5.1001 (Gnus v5.10.1) XEmacs/21.5 (carrot, linux)

>>>>> "sjt" == Stephen J Turnbull <address@hidden> writes:

    sjt> OK.  That is satisfactory for XEmacs, and we'll implement
    sjt> that.

    sjt> Unless you say you prefer to do it yourself, I will also
    sjt> submit a patch against GNU Emacs CVS head, and audit the Lisp
    sjt> code in CVS head to make sure there are no surprises from
    sjt> callers with non-default SEPARATORS.

Enclosed are patches for lisp/subr.el and lispref/strings.texi to
implement the API for split-string discussed earlier.

Also enclosed is the result of an audit of uses of split-string in
Emacs CVS (as of about three weeks ago).  I didn't notice any cases
where the changed specification made existing code out-and-out
incorrect, so there are no further patches suggested.  However, I
think a lot of the uses with an explicit SEPARATORS are semantically
dubious without using the OMIT-NULLS flag (and most were semantically
dubious before the change to split-string, because it's at least
theoretically possible for a null string to arise in the interior of
the list).  Most other uses of split-string are dubious in that either
they depend heavily on undocumented implementation details of other
utilities (e.g., that the fields in /etc/mtab are separated by exactly
one space) or are not very robust to bogus input.  People who
understand the modules in question might want to take a closer look.
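
A minimal sketch of the two kinds of fragility described above, using
made-up sample strings (not taken from any audited file) and assuming
the split-string semantics of the enclosed patch.  The my/ names are
hypothetical:

```elisp
;; Sample data is hypothetical, for illustration only.
(defvar my/interior-null (split-string "a,,b" ","))
;; ("a" "" "b")   the interior null field is retained

(defvar my/nulls-omitted (split-string "a,,b" "," t))
;; ("a" "b")      OMIT-NULLS drops it

;; An mtab-like line with two spaces between the first two fields:
(defvar my/exact-space (split-string "/dev/hda1  / ext3" " "))
;; ("/dev/hda1" "" "/" "ext3")   splitting on exactly one space misparses

(defvar my/space-run (split-string "/dev/hda1  / ext3" " +"))
;; ("/dev/hda1" "/" "ext3")
```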

A few I couldn't tell at all without doing a much deeper analysis of
the code than I have time for right now:

./lisp/calendar/todo-mode.el:869:  needs checking
./lisp/eshell/em-pred.el:601:  needs checking
./lisp/mh-e/mh-utils.el:1606:  needs checking
./lisp/textmodes/reftex.el:934:  needs checking
./lisp/textmodes/reftex.el:2161:  needs checking

If you set default-directory to the root of the Emacs hierarchy, the
following functions are useful for jumping to a reference.  N.B. a few
of the references have changed since I started the audit.

(defun sjt/parse-grep-n2 ()
  "Parse `grep -n -#' output for filename and line number."
  (interactive)
  (beginning-of-line)
  (when (re-search-forward "^\\(\\S-+\\):\\([0-9]+\\):")
    (cons (match-string 1) (string-to-number (match-string 2)))))

(defun sjt/parse-grep-n-and-go ()
  "Jump to place specified by `grep -n' output."
  (interactive)
  (let* ((pair (sjt/parse-grep-n2))
         (file (car pair))
         (line (cdr pair)))
    (find-file file)
    (goto-line line)))
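
As a usage sketch, the regexp from sjt/parse-grep-n2 can be exercised
in a temporary buffer on one line of the audit output below.  The
wrapper name my/parse-sample is hypothetical, and the NOERROR argument
to re-search-forward is an addition so that a non-matching line yields
nil instead of signalling an error:

```elisp
(defun my/parse-sample (line)
  "Return (FILE . LINE-NUMBER) parsed from LINE, or nil if it does not match."
  (with-temp-buffer
    (insert line)
    (goto-char (point-min))
    ;; Same regexp as sjt/parse-grep-n2; group 1 is the file name,
    ;; group 2 the line number.
    (when (re-search-forward "^\\(\\S-+\\):\\([0-9]+\\):" nil t)
      (cons (match-string 1) (string-to-number (match-string 2))))))

(my/parse-sample "./lisp/apropos.el:267:  want OMIT-NULLS t")
;; ("./lisp/apropos.el" . 267)
```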


lisp/ChangeLog
2003-05-16  Stephen J. Turnbull  <address@hidden>

        * subr.el (split-string): Implement specification that splitting
        on explicit separators retains null fields.  Add new argument
        OMIT-NULLS.  Special-case (split-string "a string").

lispref/ChangeLog
2003-05-16  Stephen J. Turnbull  <address@hidden>

        * strings.texi (Creating Strings): Update split-string
        specification and examples.

Index: lisp/subr.el
===================================================================
RCS file: /cvsroot/emacs/emacs/lisp/subr.el,v
retrieving revision 1.350
diff -u -r1.350 subr.el
--- lisp/subr.el        24 Apr 2003 23:14:12 -0000      1.350
+++ lisp/subr.el        16 May 2003 10:03:58 -0000
@@ -1792,19 +1792,45 @@
        (buffer-substring-no-properties (match-beginning num)
                                        (match-end num)))))
 
-(defun split-string (string &optional separators)
-  "Splits STRING into substrings where there are matches for SEPARATORS.
-Each match for SEPARATORS is a splitting point.
-The substrings between the splitting points are made into a list
+(defconst split-string-default-separators "[ \f\t\n\r\v]+"
+  "The default value of separators for `split-string'.
+
+A regexp matching strings of whitespace.  May be locale-dependent
+\(as yet unimplemented).  Should not match non-breaking spaces.
+
+Warning: binding this to a different value and using it as default is
+likely to have undesired semantics.")
+
+;; The specification says that if both SEPARATORS and OMIT-NULLS are
+;; defaulted, OMIT-NULLS should be treated as t.  Simplifying the logical
+;; expression leads to the equivalent implementation that if SEPARATORS
+;; is defaulted, OMIT-NULLS is treated as t.
+(defun split-string (string &optional separators omit-nulls)
+  "Splits STRING into substrings bounded by matches for SEPARATORS.
+
+The beginning and end of STRING, and each match for SEPARATORS, are
+splitting points.  The substrings matching SEPARATORS are removed, and
+the substrings between the splitting points are collected as a list,
 which is returned.
-If SEPARATORS is absent, it defaults to \"[ \\f\\t\\n\\r\\v]+\".
 
-If there is match for SEPARATORS at the beginning of STRING, we do not
-include a null substring for that.  Likewise, if there is a match
-at the end of STRING, we don't include a null substring for that.
+If SEPARATORS is non-nil, it should be a regular expression matching text
+which separates, but is not part of, the substrings.  If nil it defaults to
+`split-string-default-separators', normally \"[ \\f\\t\\n\\r\\v]+\", and
+OMIT-NULLS is forced to t.
+
+If OMIT-NULLS is t, zero-length substrings are omitted from the list \(so
+that for the default value of SEPARATORS leading and trailing whitespace
+are effectively trimmed).  If nil, all zero-length substrings are retained,
+which correctly parses CSV format, for example.
+
+Note that the effect of `(split-string STRING)' is the same as
+`(split-string STRING split-string-default-separators t)'.  In the rare
+case that you wish to retain zero-length substrings when splitting on
+whitespace, use `(split-string STRING split-string-default-separators)'.
 
 Modifies the match data; use `save-match-data' if necessary."
-  (let ((rexp (or separators "[ \f\t\n\r\v]+"))
+  (let ((keep-nulls (not (if separators omit-nulls t)))
+       (rexp (or separators split-string-default-separators))
        (start 0)
        notfirst
        (list nil))
@@ -1813,16 +1839,14 @@
                                       (= start (match-beginning 0))
                                       (< start (length string)))
                                  (1+ start) start))
-               (< (match-beginning 0) (length string)))
+               (< start (length string)))
       (setq notfirst t)
-      (or (eq (match-beginning 0) 0)
-         (and (eq (match-beginning 0) (match-end 0))
-              (eq (match-beginning 0) start))
+      (if (or keep-nulls (< start (match-beginning 0)))
          (setq list
                (cons (substring string start (match-beginning 0))
                      list)))
       (setq start (match-end 0)))
-    (or (eq start (length string))
+    (if (or keep-nulls (< start (length string)))
        (setq list
              (cons (substring string start)
                    list)))
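
The comment above split-string says that defaulting SEPARATORS forces
OMIT-NULLS to t.  A small sketch of how the keep-nulls expression from
the patch evaluates for the four argument combinations; the helper
name my/keep-nulls-p is hypothetical and exists only for this
illustration:

```elisp
(defun my/keep-nulls-p (separators omit-nulls)
  "Mirror the keep-nulls binding in the patched `split-string'."
  (not (if separators omit-nulls t)))

(list (my/keep-nulls-p nil nil)   ; nil: default separators, nulls omitted
      (my/keep-nulls-p nil t)     ; nil: same, OMIT-NULLS is redundant
      (my/keep-nulls-p "," nil)   ; t:   explicit separators, nulls kept
      (my/keep-nulls-p "," t))    ; nil: explicit separators, nulls omitted
;; (nil nil t nil)
```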


Index: lispref/strings.texi
===================================================================
RCS file: /cvsroot/emacs/emacs/lispref/strings.texi,v
retrieving revision 1.23
diff -u -r1.23 strings.texi
--- lispref/strings.texi        4 Feb 2003 14:47:54 -0000       1.23
+++ lispref/strings.texi        16 May 2003 10:03:59 -0000
@@ -259,30 +259,46 @@
 Lists}.
 @end defun
 
-@defun split-string string separators
+@defun split-string string separators omit-nulls
 This function splits @var{string} into substrings at matches for the regular
 expression @var{separators}.  Each match for @var{separators} defines a
 splitting point; the substrings between the splitting points are made
-into a list, which is the value returned by @code{split-string}.
+into a list, which is the value returned by @code{split-string}.  If
+@var{omit-nulls} is @code{t}, null strings will be removed from the
+result list.  Otherwise, null strings are left in the result.
 If @var{separators} is @code{nil} (or omitted),
-the default is @code{"[ \f\t\n\r\v]+"}.
+the default is the value of @code{split-string-default-separators}.
 
-For example,
+@defvar split-string-default-separators
+The default value of @var{separators} for @code{split-string}, initially
+@code{"[ \f\t\n\r\v]+"}.
+
+As a special case, when @var{separators} is @code{nil} (or omitted),
+null strings are always omitted from the result.  Thus:
 
 @example
-(split-string "Soup is good food" "o")
-@result{} ("S" "up is g" "" "d f" "" "d")
-(split-string "Soup is good food" "o+")
-@result{} ("S" "up is g" "d f" "d")
+(split-string "  two words ")
+@result{} ("two" "words")
+@end example
+
+The result is not @samp{("" "two" "words" "")}, which would rarely be
+useful.  If you need such a result, use an explicit value for
+@var{separators}:
+
+@example
+(split-string "  two words " split-string-default-separators)
+@result{} ("" "two" "words" "")
 @end example
 
-When there is a match adjacent to the beginning or end of the string,
-this does not cause a null string to appear at the beginning or end
-of the list:
+More examples:
 
 @example
-(split-string "out to moo" "o+")
-@result{} ("ut t" " m")
+(split-string "Soup is good food" "o")
+@result{} ("S" "up is g" "" "d f" "" "d")
+(split-string "Soup is good food" "o" t)
+@result{} ("S" "up is g" "d f" "d")
+(split-string "Soup is good food" "o+")
+@result{} ("S" "up is g" "d f" "d")
 @end example
 
 Empty matches do count, when not adjacent to another match:

bash-2.05b$ find . -name '*.el' | xargs fgrep -2 -n split-string /dev/null
./lisp/apropos.el:267:  want OMIT-NULLS t
./lisp/calendar/todo-mode.el:869:  needs checking
./lisp/cvs-status.el:286:  new semantics preferred; no error checking
./lisp/diff-mode.el:1047:  OK, double default
./lisp/ediff-diff.el:1143:  OK
./lisp/emacs-lisp/authors.el:460:  double default, OK
./lisp/emacs-lisp/crm.el:419:  new semantics preferred; no error checking
./lisp/emacs-lisp/crm.el:605:  new semantics preferred; no error checking
./lisp/emacs-lisp/lisp-mnt.el:412:  want OMIT-NULLS t
./lisp/emacs-lisp/unsafep.el:111:  mentioned in comment, not used
./lisp/eshell/em-cmpl.el:403:  new semantics preferred; no error checking
./lisp/eshell/em-ls.el:257:  OK, double default
./lisp/eshell/em-pred.el:601:  needs checking
./lisp/eshell/esh-util.el:228:  want OMIT-NULLS t
./lisp/eshell/esh-util.el:449:  new semantics preferred; no error checking
./lisp/eshell/esh-var.el:568:  new semantics preferred; no error checking
./lisp/files.el:4254:  double default, OK
./lisp/filesets.el:1202:  new semantics preferred; no error checking
./lisp/gdb-ui.el:1001:  new semantics preferred; no error checking
./lisp/gnus/gnus-art.el:4645:  new semantics preferred; no error checking
./lisp/gnus/gnus-group.el:3798:  OK
./lisp/gnus/gnus.el:2679:  OK
./lisp/gnus/gnus.el:2681:  OK
./lisp/gnus/mailcap.el:367:  OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:502:  want OMIT-NULLS t
./lisp/gnus/mailcap.el:648:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mailcap.el:702:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mailcap.el:870:  OK, could use OMIT-NULLS t instead
./lisp/gnus/mailcap.el:940:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/message.el:4701:  want OMIT-NULLS t
./lisp/gnus/mm-decode.el:55:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:57:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:264:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mm-decode.el:363:  OK, double default
./lisp/gnus/mml.el:307:  new semantics preferred; no error checking (splitting MIME content type)
./lisp/gnus/mml.el:337:  ditto
./lisp/gnus/nnslashdot.el:364:  OK, double default
./lisp/gnus/nnslashdot.el:488:  OK, could use OMIT-NULLS t instead
./lisp/gnus/nnultimate.el:176:  OK, could use OMIT-NULLS t instead
./lisp/gnus/pop3.el:249:  want OMIT-NULLS t
./lisp/gnus/pop3.el:346:  want OMIT-NULLS t
./lisp/gnus/pop3.el:347:  want OMIT-NULLS t
./lisp/gnus/pop3.el:409:  want OMIT-NULLS t
./lisp/gnus/rfc2231.el:131:  new semantics preferred; no error checking (splitting encoded word into locale info)
./lisp/gud.el:1817:  OK
./lisp/gud.el:1847:  OK
./lisp/gud.el:2288:  OK, double default
./lisp/gud.el:2813:  OK
./lisp/hexl.el:635:  double default, OK
./lisp/hexl.el:652:  double default, OK
./lisp/ido.el:2502:  want OMIT-NULLS t
./lisp/ido.el:2868:  want OMIT-NULLS t
./lisp/info.el:387:  want OMIT-NULLS t
./lisp/info.el:390:  want OMIT-NULLS t
./lisp/mail/rfc2368.el:137:  OK
./lisp/mail/rfc2368.el:144:  new semantics preferred; no error checking
./lisp/mail/smtpmail.el:602:  want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:156:  want OMIT-NULLS t
./lisp/mh-e/mh-alias.el:289:  OK
./lisp/mh-e/mh-alias.el:469:  OK
./lisp/mh-e/mh-comp.el:374:  OK, double default
./lisp/mh-e/mh-e.el:2164:  OK, double default
./lisp/mh-e/mh-index.el:475:  OK, double default
./lisp/mh-e/mh-seq.el:966:  OK, double default
./lisp/mh-e/mh-utils.el:1606:  needs checking
./lisp/net/eudc-export.el:126:  OK
./lisp/net/eudc.el:161:  Emacs 21 compatible
./lisp/net/eudc.el:419:  want OMIT-NULLS t
./lisp/net/eudc.el:442:  check this
./lisp/net/eudc.el:833:  want OMIT-NULLS t
./lisp/net/eudcb-ldap.el:90:  OK
./lisp/net/ldap.el:415:  new semantics preferred; no error checking
./lisp/net/ldap.el:420:  OK
./lisp/net/tramp.el:5658:  check this
./lisp/net/tramp.el:6257:  tramp-split-string is not quite emacs compatible
./lisp/pcmpl-cvs.el:175:  new semantics preferred; no error checking
./lisp/pcmpl-gnu.el:127:  OK, double default
./lisp/pcmpl-linux.el:46:  double default, OK
./lisp/pcmpl-linux.el:88:  want OMIT-NULLS t
./lisp/pcmpl-linux.el:101:  want OMIT-NULLS t
./lisp/pcmpl-rpm.el:39:  OK, double default
./lisp/pcmpl-rpm.el:46:  OK, double default
./lisp/pcmpl-unix.el:89:  new semantics preferred; no error checking
./lisp/pcvs-util.el:227:  want OMIT-NULLS t
./lisp/pcvs-util.el:228:  want OMIT-NULLS t
./lisp/progmodes/ada-prj.el:590:  want OMIT-NULLS t
./lisp/progmodes/ada-xref.el:207:  new semantics preferred; no error checking
./lisp/progmodes/fortran.el:267:  want OMIT-NULLS t
./lisp/progmodes/idlw-shell.el:1734:  could use new split-string with OMIT-NULLS t
./lisp/progmodes/idlwave.el:3702:  prior XEmacs-compatible, could use new split-string
./lisp/progmodes/inf-lisp.el:285:  double default, OK
./lisp/progmodes/vhdl-mode.el:13030:  new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13171:  new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13698:  new semantics preferred; no error checking
./lisp/progmodes/vhdl-mode.el:13701:  new semantics preferred; no error checking
./lisp/textmodes/bibtex.el:2665:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:192:  Gone?
./lisp/textmodes/reftex-cite.el:373:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:383:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:445:  OK
./lisp/textmodes/reftex-cite.el:863:  new semantics preferred; no error checking
./lisp/textmodes/reftex-cite.el:961:  new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1552:  new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1685:  want OMIT-NULLS t
./lisp/textmodes/reftex-index.el:1734:  OK, double default
./lisp/textmodes/reftex-index.el:1748:  OK, double default
./lisp/textmodes/reftex-index.el:1755:  OK, double default
./lisp/textmodes/reftex-index.el:1762:  new semantics preferred; no error checking
./lisp/textmodes/reftex-index.el:1818:  new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:343:  new semantics preferred; no error checking
./lisp/textmodes/reftex-parse.el:482:  OK, mapconcat used
./lisp/textmodes/reftex-parse.el:990:  new semantics preferred; no error checking
./lisp/textmodes/reftex.el:934:  needs checking
./lisp/textmodes/reftex.el:1455:  OK, double default
./lisp/textmodes/reftex.el:1488:  OK, double default
./lisp/textmodes/reftex.el:1556:  OK, could use OMIT-NULLS t instead
./lisp/textmodes/reftex.el:2161:  needs checking (uses explicit re or explicit ws)
./lisp/vc-cvs.el:789:    new semantics preferred; requires rewrite to use
./lisp/xml.el:432:  OK
./lisp/xml.el:436:  OK
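
One way to read the "want OMIT-NULLS t" annotation: the caller passes
an explicit SEPARATORS and, under the new specification, would see
null strings where a separator begins or ends the input.  A sketch
with a hypothetical caller (not one of the audited ones) splitting a
colon-terminated path list, assuming the patched split-string:

```elisp
;; Hypothetical input with a trailing separator.
(defvar my/path "/usr/bin:/bin:")

(defvar my/with-null (split-string my/path ":"))
;; ("/usr/bin" "/bin" "")   the trailing null field is retained

(defvar my/without-null (split-string my/path ":" t))
;; ("/usr/bin" "/bin")      OMIT-NULLS t removes it
```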



-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.



