viewmail-info

[VM] [COMMIT XEMACS] Pass character count from coding systems to buffer


From: Aidan Kehoe
Subject: [VM] [COMMIT XEMACS] Pass character count from coding systems to buffer insertion code.
Date: Thu, 16 Jan 2014 17:10:35 +0000

APPROVE COMMIT

SUPERSEDES address@hidden

This doesn’t include character_tell() implementations for many of the coding
systems where such an implementation would be easy, e.g. iso-8859-1. Nor does
it include one for chain coding systems, e.g. those for non-Unix line endings.
I haven’t profiled it to the level I did for the first patch; I should do that
soon.

# HG changeset patch
# User Aidan Kehoe <address@hidden>
# Date 1389889672 0
# Node ID 65d65b52d608ca1f17365a96fc3cf710a3af625c
# Parent  4004c3266c09888a9935242a462beb3fb28e02a3
Pass character count from coding systems to buffer insertion code.

src/ChangeLog addition:

2014-01-16  Aidan Kehoe  <address@hidden>

        Pass character count information from the no-conversion and
        unicode coding systems to the buffer insertion code, making
        #'find-file on large buffers a little snappier (if
        ERROR_CHECK_TEXT is not defined).

        * file-coding.c:
        * file-coding.c (coding_character_tell): New.
        * file-coding.c (no_conversion_coding_stream_description): New.
        * file-coding.c (no_conversion_convert):
        Update characters_seen when decoding.
        * file-coding.c (no_conversion_character_tell): New.
        * file-coding.c (lstream_type_create_file_coding): Create the
        no_conversion type with data.
        * file-coding.c (coding_system_type_create):
        Make the character_tell method available here.
        * file-coding.h:
        * file-coding.h (struct coding_system_methods):
        Add a new character_tell() method, passing charcount information
        from the coding systems to the buffer code, avoiding duplicate
        bytecount-to-charcount work especially with large buffers.

        * fileio.c (Finsert_file_contents_internal):
        Update this to pass charcount information to
        buffer_insert_string_1(), if that is available from the lstream code.

        * insdel.c:
        * insdel.c (buffer_insert_string_1):
        Add a new CCLEN argument, giving the character count of the string
        to insert. It can be -1 to indicate that the function should work
        it out itself using bytecount_to_charcount(), as it used to.
        * insdel.c (buffer_insert_raw_string_1):
        * insdel.c (buffer_insert_lisp_string_1):
        * insdel.c (buffer_insert_ascstring_1):
        * insdel.c (buffer_insert_emacs_char_1):
        * insdel.c (buffer_insert_from_buffer_1):
        * insdel.c (buffer_replace_char):
        Update these functions to use the new calling convention.
        * insdel.h:
        * insdel.h (buffer_insert_string):
        Update this header to reflect the new buffer_insert_string_1()
        argument.

        * lstream.c (Lstream_character_tell): New.
        Return the number of characters *read* and seen by the consumer so
        far, taking into account the unget buffer, and buffered reading.

        * lstream.c (Lstream_unread):
        Update unget_character_count here as appropriate.
        * lstream.c (Lstream_rewind):
        Reset unget_character_count here too.

        * lstream.h:
        * lstream.h (struct lstream):
        Provide the character_tell method, add a new field,
        unget_character_count, giving the number of characters ever passed
        to Lstream_unread().
        Declare Lstream_character_tell().
        Make Lstream_ungetc(), which happens to be unused, an inline
        function rather than a macro, in the course of updating it to
        modify unget_character_count.

        * print.c (output_string):
        Use the new argument to buffer_insert_string_1().
        * tests.c:
        * tests.c (Ftest_character_tell):
        New test function.
        * tests.c (syms_of_tests):
        Make it available.
        * unicode.c:
        * unicode.c (struct unicode_coding_stream):
        * unicode.c (unicode_character_tell):
        New method.
        * unicode.c (unicode_convert):
        Update the character counter as appropriate.
        * unicode.c (coding_system_type_create_unicode):
        Make the character_tell method available.

diff -r 4004c3266c09 -r 65d65b52d608 src/ChangeLog
--- a/src/ChangeLog     Sun Dec 22 10:36:33 2013 +0000
+++ b/src/ChangeLog     Thu Jan 16 16:27:52 2014 +0000
@@ -1,3 +1,82 @@
+2014-01-16  Aidan Kehoe  <address@hidden>
+
+       Pass character count information from the no-conversion and
+       unicode coding systems to the buffer insertion code, making
+       #'find-file on large buffers a little snappier (if
+       ERROR_CHECK_TEXT is not defined).
+       
+       * file-coding.c:
+       * file-coding.c (coding_character_tell): New.
+       * file-coding.c (no_conversion_coding_stream_description): New.
+       * file-coding.c (no_conversion_convert):
+       Update characters_seen when decoding.
+       * file-coding.c (no_conversion_character_tell): New.
+       * file-coding.c (lstream_type_create_file_coding): Create the
+       no_conversion type with data.
+       * file-coding.c (coding_system_type_create):
+       Make the character_tell method available here.
+       * file-coding.h:
+       * file-coding.h (struct coding_system_methods):
+       Add a new character_tell() method, passing charcount information
+       from the coding systems to the buffer code, avoiding duplicate
+       bytecount-to-charcount work especially with large buffers.
+
+       * fileio.c (Finsert_file_contents_internal):
+       Update this to pass charcount information to
+       buffer_insert_string_1(), if that is available from the lstream code.
+       
+       * insdel.c:
+       * insdel.c (buffer_insert_string_1):
+       Add a new CCLEN argument, giving the character count of the string
+       to insert. It can be -1 to indicate that the function should work
+       it out itself using bytecount_to_charcount(), as it used to.
+       * insdel.c (buffer_insert_raw_string_1):
+       * insdel.c (buffer_insert_lisp_string_1):
+       * insdel.c (buffer_insert_ascstring_1):
+       * insdel.c (buffer_insert_emacs_char_1):
+       * insdel.c (buffer_insert_from_buffer_1):
+       * insdel.c (buffer_replace_char):
+       Update these functions to use the new calling convention.
+       * insdel.h:
+       * insdel.h (buffer_insert_string):
+       Update this header to reflect the new buffer_insert_string_1()
+       argument.
+
+       * lstream.c (Lstream_character_tell): New.
+       Return the number of characters *read* and seen by the consumer so
+       far, taking into account the unget buffer, and buffered reading.
+
+       * lstream.c (Lstream_unread):
+       Update unget_character_count here as appropriate.
+       * lstream.c (Lstream_rewind):
+       Reset unget_character_count here too.
+
+       * lstream.h:
+       * lstream.h (struct lstream):
+       Provide the character_tell method, add a new field,
+       unget_character_count, giving the number of characters ever passed
+       to Lstream_unread().
+       Declare Lstream_character_tell().
+       Make Lstream_ungetc(), which happens to be unused, an inline
+       function rather than a macro, in the course of updating it to
+       modify unget_character_count.
+
+       * print.c (output_string):
+       Use the new argument to buffer_insert_string_1().
+       * tests.c:
+       * tests.c (Ftest_character_tell):
+       New test function.
+       * tests.c (syms_of_tests):
+       Make it available.
+       * unicode.c:
+       * unicode.c (struct unicode_coding_stream):
+       * unicode.c (unicode_character_tell):
+       New method.
+       * unicode.c (unicode_convert):
+       Update the character counter as appropriate.
+       * unicode.c (coding_system_type_create_unicode):
+       Make the character_tell method available.
+
 2013-12-19  Aidan Kehoe  <address@hidden>
 
        * text.c:
diff -r 4004c3266c09 -r 65d65b52d608 src/file-coding.c
--- a/src/file-coding.c Sun Dec 22 10:36:33 2013 +0000
+++ b/src/file-coding.c Thu Jan 16 16:27:52 2014 +0000
@@ -1990,6 +1990,14 @@
   return Lstream_seekable_p (str->other_end);
 }
 
+static Charcount
+coding_character_tell (Lstream *stream)
+{
+  struct coding_stream *str = CODING_STREAM_DATA (stream);
+
+  return XCODESYSMETH_OR_GIVEN (str->codesys, character_tell, (str), -1);
+}
+
 static int
 coding_flusher (Lstream *stream)
 {
@@ -2823,7 +2831,32 @@
 
    #### Shouldn't we _call_ it that, then?  And while we're at it,
    separate it into "to_internal" and "to_external"? */
-DEFINE_CODING_SYSTEM_TYPE (no_conversion);
+
+
+struct no_conversion_coding_system
+{
+};
+
+struct no_conversion_coding_stream
+{
+  /* Number of characters seen when decoding. */
+  Charcount characters_seen;
+};
+
+static const struct memory_description no_conversion_coding_system_description[] = {
+  { XD_END }
+};
+
+static const struct memory_description no_conversion_coding_stream_description_1 [] = {
+  { XD_INT, offsetof (struct no_conversion_coding_stream, characters_seen) },
+  { XD_END }
+};
+
+const struct sized_memory_description no_conversion_coding_stream_description = {
+  sizeof (struct no_conversion_coding_stream), no_conversion_coding_stream_description_1
+};
+
+DEFINE_CODING_SYSTEM_TYPE_WITH_DATA (no_conversion);
 
 /* This is used when reading in "binary" files -- i.e. files that may
    contain all 256 possible byte values and that are not to be
@@ -2846,6 +2879,9 @@
          DECODE_ADD_BINARY_CHAR (c, dst);
        }
 
+      CODING_STREAM_TYPE_DATA (str, no_conversion)->characters_seen
+        += orign;
+
       if (str->eof)
        DECODE_OUTPUT_PARTIAL_CHAR (ch, dst);
     }
@@ -2904,6 +2940,12 @@
   return orign;
 }
 
+static Charcount
+no_conversion_character_tell (struct coding_stream *str)
+{
+  return CODING_STREAM_TYPE_DATA (str, no_conversion)->characters_seen;
+}
+
 DEFINE_DETECTOR (no_conversion);
 DEFINE_DETECTOR_CATEGORY (no_conversion, no_conversion);
 
@@ -4656,6 +4698,7 @@
   LSTREAM_HAS_METHOD (coding, writer);
   LSTREAM_HAS_METHOD (coding, rewinder);
   LSTREAM_HAS_METHOD (coding, seekable_p);
+  LSTREAM_HAS_METHOD (coding, character_tell);
   LSTREAM_HAS_METHOD (coding, marker);
   LSTREAM_HAS_METHOD (coding, flusher);
   LSTREAM_HAS_METHOD (coding, closer);
@@ -4697,9 +4740,10 @@
   dump_add_opaque_int (&coding_detector_count);
   dump_add_opaque_int (&coding_detector_category_count);
 
-  INITIALIZE_CODING_SYSTEM_TYPE (no_conversion,
-                                "no-conversion-coding-system-p");
+  INITIALIZE_CODING_SYSTEM_TYPE_WITH_DATA (no_conversion,
+                                           "no-conversion-coding-system-p");
   CODING_SYSTEM_HAS_METHOD (no_conversion, convert);
+  CODING_SYSTEM_HAS_METHOD (no_conversion, character_tell);
 
   INITIALIZE_DETECTOR (no_conversion);
   DETECTOR_HAS_METHOD (no_conversion, detect);
diff -r 4004c3266c09 -r 65d65b52d608 src/file-coding.h
--- a/src/file-coding.h Sun Dec 22 10:36:33 2013 +0000
+++ b/src/file-coding.h Thu Jan 16 16:27:52 2014 +0000
@@ -353,6 +353,9 @@
      a result of the stream being rewound.  Optional. */
   void (*rewind_coding_stream_method) (struct coding_stream *str);
 
+  /* Return the number of characters *decoded*. Optional. */
+  Charcount (*character_tell_method) (struct coding_stream *str);
+
   /* Finalize coding stream method: Clean up the type-specific data
      attached to the coding stream (i.e. in struct TYPE_coding_stream).
      Happens when the Lstream is deleted using Lstream_delete() or is
diff -r 4004c3266c09 -r 65d65b52d608 src/fileio.c
--- a/src/fileio.c      Sun Dec 22 10:36:33 2013 +0000
+++ b/src/fileio.c      Thu Jan 16 16:27:52 2014 +0000
@@ -3180,6 +3180,7 @@
     struct gcpro ngcpro1;
     Lisp_Object stream = make_filedesc_input_stream (fd, 0, total,
                                                     LSTR_ALLOW_QUIT);
+    Charcount last_tell = -1;
 
     NGCPRO1 (stream);
     Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
@@ -3187,6 +3188,7 @@
       (XLSTREAM (stream), get_coding_system_for_text_file (codesys, 1),
        CODING_DECODE, 0);
     Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+    last_tell = Lstream_character_tell (XLSTREAM (stream));
 
     record_unwind_protect (delete_stream_unwind, stream);
 
@@ -3196,7 +3198,7 @@
     while (1)
       {
        Bytecount this_len;
-       Charcount cc_inserted;
+       Charcount cc_inserted, this_tell = last_tell;
 
        QUIT;
        this_len = Lstream_read (XLSTREAM (stream), read_buf,
@@ -3209,12 +3211,17 @@
            break;
          }
 
-       cc_inserted = buffer_insert_raw_string_1 (buf, cur_point, read_buf,
-                                                 this_len,
-                                                 !NILP (visit)
-                                                 ? INSDEL_NO_LOCKING : 0);
+       cc_inserted
+          = buffer_insert_string_1 (buf, cur_point, read_buf, Qnil,
+                                    0, this_len, last_tell >= 0
+                                    ? (this_tell
+                                       = Lstream_character_tell (XLSTREAM
+                                                                 (stream)))
+                                    - last_tell : -1,
+                                    !NILP (visit) ? INSDEL_NO_LOCKING : 0);
        inserted  += cc_inserted;
        cur_point += cc_inserted;
+        last_tell = this_tell;
       }
     if (!NILP (used_codesys))
       {
diff -r 4004c3266c09 -r 65d65b52d608 src/insdel.c
--- a/src/insdel.c      Sun Dec 22 10:36:33 2013 +0000
+++ b/src/insdel.c      Thu Jan 16 16:27:52 2014 +0000
@@ -1039,14 +1039,15 @@
 #endif
 }
 
-/* Insert a string into BUF at Charbpos POS.  The string data comes
-   from one of two sources: constant, non-relocatable data (specified
-   in NONRELOC), or a Lisp string object (specified in RELOC), which
-   is relocatable and may have extent data that needs to be copied
-   into the buffer.  OFFSET and LENGTH specify the substring of the
-   data that is actually to be inserted.  As a special case, if POS
-   is -1, insert the string at point and move point to the end of the
-   string.
+/* Insert a string into BUF at Charbpos POS.  The string data comes from one
+   of two sources: constant, non-relocatable data (specified in NONRELOC),
+   or a Lisp string object (specified in RELOC), which is relocatable and
+   may have extent data that needs to be copied into the buffer.  OFFSET and
+   LENGTH specify the substring of the data that is actually to be inserted.
+   As a special case, if POS is -1, insert the string at point and move
+   point to the end of the string.  CCLEN is the character count of the data
+   to be inserted, and can be -1 to indicate that buffer_insert_string_1 ()
+   should work this out itself with bytecount_to_charcount().
 
    Normally, markers at the insertion point end up before the
    inserted string.  If INSDEL_BEFORE_MARKERS is set in flags, however,
@@ -1061,13 +1062,12 @@
 buffer_insert_string_1 (struct buffer *buf, Charbpos pos,
                        const Ibyte *nonreloc, Lisp_Object reloc,
                        Bytecount offset, Bytecount length,
-                       int flags)
+                        Charcount cclen, int flags)
 {
   /* This function can GC */
   struct gcpro gcpro1;
   Bytebpos bytepos;
   Bytecount length_in_buffer;
-  Charcount cclen;
   int move_point = 0;
   struct buffer *mbuf;
   Lisp_Object bufcons;
@@ -1118,14 +1118,27 @@
 
   bytepos = charbpos_to_bytebpos (buf, pos);
 
-  /* string may have been relocated up to this point */
-  if (STRINGP (reloc))
+  if (cclen < 0)
     {
-      cclen = string_offset_byte_to_char_len (reloc, offset, length);
-      nonreloc = XSTRING_DATA (reloc);
+      /* string may have been relocated up to this point */
+      if (STRINGP (reloc))
+        {
+          cclen = string_offset_byte_to_char_len (reloc, offset, length);
+          nonreloc = XSTRING_DATA (reloc);
+        }
+      else
+        cclen = bytecount_to_charcount (nonreloc + offset, length);
     }
   else
-    cclen = bytecount_to_charcount (nonreloc + offset, length);
+    {
+      text_checking_assert (cclen > 0 && cclen
+                            == (STRINGP (reloc) ?
+                                string_offset_byte_to_char_len (reloc, offset,
+                                                                length)
+                                : bytecount_to_charcount (nonreloc + offset,
+                                                          length)));
+    }
+
   /* &&#### Here we check if the text can't fit into the format of the buffer,
      and if so convert it to another format (either default or 32-bit-fixed,
      according to some flag; if no flag, use default). */
@@ -1286,7 +1299,7 @@
 {
   /* This function can GC */
   return buffer_insert_string_1 (buf, pos, nonreloc, Qnil, 0, length,
-                                flags);
+                                -1, flags);
 }
 
 Charcount
@@ -1295,8 +1308,7 @@
 {
   /* This function can GC */
   return buffer_insert_string_1 (buf, pos, 0, str, 0,
-                                XSTRING_LENGTH (str),
-                                flags);
+                                XSTRING_LENGTH (str), -1, flags);
 }
 
 /* Insert the null-terminated string S (in external format). */
@@ -1309,7 +1321,7 @@
   const CIbyte *translated = GETTEXT (s);
   ASSERT_ASCTEXT_ASCII (s);
   return buffer_insert_string_1 (buf, pos, (const Ibyte *) translated, Qnil,
-                                0, strlen (translated), flags);
+                                0, strlen (translated), -1, flags);
 }
 
 Charcount
@@ -1319,7 +1331,7 @@
   /* This function can GC */
   Ibyte str[MAX_ICHAR_LEN];
   Bytecount len = set_itext_ichar (str, ch);
-  return buffer_insert_string_1 (buf, pos, str, Qnil, 0, len, flags);
+  return buffer_insert_string_1 (buf, pos, str, Qnil, 0, len, -1, flags);
 }
 
 Charcount
@@ -1339,7 +1351,7 @@
   /* This function can GC */
   Lisp_Object str = make_string_from_buffer (buf2, pos2, length);
   return buffer_insert_string_1 (buf, pos, 0, str, 0,
-                                XSTRING_LENGTH (str), flags);
+                                XSTRING_LENGTH (str), -1, flags);
 }
 
 
@@ -1674,7 +1686,7 @@
        * backward so that it now equals the insertion point.
        */
       buffer_insert_string_1 (buf, (movepoint ? -1 : pos),
-                             newstr, Qnil, 0, newlen, 0);
+                             newstr, Qnil, 0, newlen, -1, 0);
     }
 }
 
diff -r 4004c3266c09 -r 65d65b52d608 src/insdel.h
--- a/src/insdel.h      Sun Dec 22 10:36:33 2013 +0000
+++ b/src/insdel.h      Thu Jan 16 16:27:52 2014 +0000
@@ -38,7 +38,7 @@
 Charcount buffer_insert_string_1 (struct buffer *buf, Charbpos pos,
                                  const Ibyte *nonreloc, Lisp_Object reloc,
                                  Bytecount offset, Bytecount length,
-                                 int flags);
+                                 Charcount clen, int flags);
 Charcount buffer_insert_raw_string_1 (struct buffer *buf, Charbpos pos,
                                      const Ibyte *nonreloc,
                                      Bytecount length, int flags);
@@ -58,7 +58,7 @@
    All of these can GC. */
 
 #define buffer_insert_string(buf, nonreloc, reloc, offset, length) \
-  buffer_insert_string_1 (buf, -1, nonreloc, reloc, offset, length, 0)
+  buffer_insert_string_1 (buf, -1, nonreloc, reloc, offset, length, -1, 0)
 #define buffer_insert_raw_string(buf, string, length) \
   buffer_insert_raw_string_1 (buf, -1, string, length, 0)
 #define buffer_insert_ascstring(buf, s) \
diff -r 4004c3266c09 -r 65d65b52d608 src/lstream.c
--- a/src/lstream.c     Sun Dec 22 10:36:33 2013 +0000
+++ b/src/lstream.c     Thu Jan 16 16:27:52 2014 +0000
@@ -735,6 +735,134 @@
   return Lstream_read_1 (lstr, data, size, 0);
 }
 
+Charcount
+Lstream_character_tell (Lstream *lstr)
+{
+  Charcount ctell = lstr->imp->character_tell ?
+    lstr->imp->character_tell (lstr) : -1;
+
+  if (ctell >= 0)
+    {
+      /* Our implementation's character tell code doesn't know about the
+         unget buffer, update its figure to reflect it. */
+      ctell += lstr->unget_character_count;
+
+      if (lstr->unget_buffer_ind > 0)
+        {
+          /* The character count should not include those characters
+             currently *in* the unget buffer, subtract that count.  */
+          Ibyte *ungot, *ungot_ptr;
+          Bytecount ii = lstr->unget_buffer_ind, impartial, sevenflen;
+
+          ungot_ptr = ungot
+            = alloca_ibytes (lstr->unget_buffer_ind) + MAX_ICHAR_LEN;
+
+          /* Make sure the string starts with a valid ibyteptr, otherwise
+             validate_ibyte_string_backward could run off the beginning. */
+          sevenflen = set_itext_ichar (ungot, (Ichar) 0x7f);
+          ungot_ptr += sevenflen;
+
+          /* Internal format data, but in reverse order. There's not
+             actually a need to alloca here, we could work out the character
+             count directly from the reversed bytes, but the alloca approach
+             is more robust to changes in our internal format, and the unget
+             buffer is not going to blow the stack. */
+          while (ii > 0)
+            {
+              *ungot_ptr++ = lstr->unget_buffer[--ii];
+            }
+
+          impartial
+            = validate_ibyte_string_backward (ungot, ungot_ptr - ungot);
+
+          /* Move past the character we added. */
+          impartial -= sevenflen;
+          INC_IBYTEPTR (ungot);
+
+          if (impartial > 0 && !valid_ibyteptr_p (ungot))
+            {
+              Ibyte *newstart = ungot, *limit = ungot + impartial;
+              /* Our consumer has the start of a partial character, we
+                 have the rest. */
+
+              while (!valid_ibyteptr_p (newstart) && newstart < limit)
+                {
+                  newstart++, impartial--;
+                }
+                  
+              /* Remove this character from the count, since the
+                 end-consumer hasn't seen the full character. */
+              ctell--;
+              ungot = newstart;
+            }
+          else if (valid_ibyteptr_p (ungot)
+                   && rep_bytes_by_first_byte (*ungot) > impartial)
+            {
+              /* Rest of a partial character has yet to be read, its first
+                 octet has probably been unread by Lstream_read_1(). We
+                 included it in the accounting in Lstream_unread(), adjust
+                 the figure here appropriately. */
+              ctell--;
+            }
+
+          /* bytecount_to_charcount will throw an assertion failure if we're
+             not at the start of a character. */
+          text_checking_assert (impartial == 0 || valid_ibyteptr_p (ungot));
+
+          /* The character length of this text is included in
+             unget_character_count; if the bytes are still in the unget
+             buffer, then our consumers haven't seen them, and so the
+             character tell figure shouldn't reflect them. Subtract it from
+             the total.  */
+          ctell -= bytecount_to_charcount (ungot, impartial);
+        }
+
+      if (lstr->in_buffer_ind < lstr->in_buffer_current)
+        {
+          Ibyte *inbuf = lstr->in_buffer + lstr->in_buffer_ind;
+          Bytecount partial = lstr->in_buffer_current - lstr->in_buffer_ind,
+            impartial;
+
+          if (!valid_ibyteptr_p (inbuf))
+            {
+              Ibyte *newstart = inbuf;
+              Ibyte *limit = lstr->in_buffer + lstr->in_buffer_current;
+              /* Our consumer has the start of a partial character, we
+                 have the rest. */
+
+              while (newstart < limit && !valid_ibyteptr_p (newstart))
+                {
+                  newstart++;
+                }
+                  
+              /* Remove this character from the count, since the
+                 end-consumer hasn't seen the full character. */
+              ctell--;
+              inbuf = newstart;
+              partial = limit - newstart;
+            }
+
+          if (valid_ibyteptr_p (inbuf)) 
+            {
+              /* There's at least one valid starting char in the string,
+                 validate_ibyte_string_backward won't run off the
+                 beginning. */
+              impartial = 
+                validate_ibyte_string_backward (inbuf, partial);
+            }
+          else
+            {
+              impartial = 0;
+            }
+
+          ctell -= bytecount_to_charcount (inbuf, impartial);
+        }
+
+      text_checking_assert (ctell >= 0);
+    }
+
+  return ctell;
+}
 
 /* Push back SIZE bytes of DATA onto the input queue.  The next call
    to Lstream_read() with the same size will read the same bytes back.
@@ -755,7 +883,12 @@
   /* Bytes have to go on in reverse order -- they are reversed
      again when read back. */
   while (size--)
-    lstr->unget_buffer[lstr->unget_buffer_ind++] = p[size];
+    {
+      lstr->unget_buffer[lstr->unget_buffer_ind++] = p[size];
+      /* If we see a valid first byte, that is the last octet in a
+         character, so increase the count of ungot characters. */
+      lstr->unget_character_count += valid_ibyteptr_p (p + size);
+    }
 }
 
 /* Rewind the stream to the beginning. */
@@ -768,6 +901,7 @@
   if (Lstream_flush (lstr) < 0)
     return -1;
   lstr->byte_count = 0;
+  lstr->unget_character_count = 0;
   return (lstr->imp->rewinder) (lstr);
 }
 
diff -r 4004c3266c09 -r 65d65b52d608 src/lstream.h
--- a/src/lstream.h     Sun Dec 22 10:36:33 2013 +0000
+++ b/src/lstream.h     Thu Jan 16 16:27:52 2014 +0000
@@ -181,6 +181,10 @@
      method.  If this method is not present, the result is determined
      by whether a rewind method is present. */
   int (*seekable_p) (Lstream *stream);
+
+  /* Return the number of complete characters read so far. Respects
+     buffering and unget. Returns -1 if unknown or not implemented. */
+  Charcount (*character_tell) (Lstream *stream);
   /* Perform any additional operations necessary to flush the
      data in this stream. */
   int (*flusher) (Lstream *stream);
@@ -250,8 +254,9 @@
      similarly has to push the data on backwards. */
   unsigned char *unget_buffer; /* holds characters pushed back onto input */
   Bytecount unget_buffer_size; /* allocated size of buffer */
-  Bytecount unget_buffer_ind; /* pointer to next buffer spot
-                                         to write a character */
+  Bytecount unget_buffer_ind; /* Next buffer spot to write a character */
+
+  Charcount unget_character_count; /* Count of complete characters ever ungot. */
 
   Bytecount byte_count;
   int flags;
@@ -297,8 +302,8 @@
 int Lstream_fputc (Lstream *lstr, int c);
 int Lstream_fgetc (Lstream *lstr);
 void Lstream_fungetc (Lstream *lstr, int c);
-Bytecount Lstream_read (Lstream *lstr, void *data,
-                                Bytecount size);
+Bytecount Lstream_read (Lstream *lstr, void *data, Bytecount size);
+Charcount Lstream_character_tell (Lstream *);
 int Lstream_write (Lstream *lstr, const void *data,
                   Bytecount size);
 int Lstream_was_blocked_p (Lstream *lstr);
@@ -353,19 +358,28 @@
    reverse order they were pushed back -- most recent first. (This is
    necessary for consistency -- if there are a number of bytes that
    have been unread and I read and unread a byte, it needs to be the
-   first to be read again.) This is a macro and so it is very
-   efficient.  The C argument is only evaluated once but the STREAM
-   argument is evaluated more than once.
- */
+   first to be read again.) */
 
-#define Lstream_ungetc(stream, c)                                      \
-/* Add to the end if it won't overflow buffer; otherwise call the      \
-   function equivalent */                                              \
-  ((stream)->unget_buffer_ind >= (stream)->unget_buffer_size ?         \
-   Lstream_fungetc (stream, c) :                                       \
-   (void) ((stream)->byte_count--,                                     \
-   ((stream)->unget_buffer[(stream)->unget_buffer_ind++] =             \
-    (unsigned char) (c))))
+DECLARE_INLINE_HEADER (
+void
+Lstream_ungetc (Lstream *lstr, int c)
+)
+{
+  /* Add to the end if it won't overflow buffer; otherwise call the
+     function equivalent */
+  if (lstr->unget_buffer_ind >= lstr->unget_buffer_size)
+    {
+      Lstream_fungetc (lstr, c);
+    }
+  else
+    {
+      lstr->byte_count--;
+      lstr->unget_buffer[lstr->unget_buffer_ind] = (unsigned char) (c);
+      lstr->unget_character_count
+        += valid_ibyteptr_p (lstr->unget_buffer + lstr->unget_buffer_ind);
+      lstr->unget_buffer_ind++;
+    }
+}
 
 #define Lstream_data(stream) ((void *) ((stream)->data))
 #define Lstream_byte_count(stream) ((stream)->byte_count)
diff -r 4004c3266c09 -r 65d65b52d608 src/print.c
--- a/src/print.c       Sun Dec 22 10:36:33 2013 +0000
+++ b/src/print.c       Thu Jan 16 16:27:52 2014 +0000
@@ -514,7 +514,7 @@
 
       buffer_insert_string_1 (XMARKER (function)->buffer,
                              spoint, nonreloc, reloc, offset, len,
-                             0);
+                             -1, 0);
       Fset_marker (function, make_fixnum (spoint + cclen),
                   Fmarker_buffer (function));
     }
diff -r 4004c3266c09 -r 65d65b52d608 src/tests.c
--- a/src/tests.c       Sun Dec 22 10:36:33 2013 +0000
+++ b/src/tests.c       Thu Jan 16 16:27:52 2014 +0000
@@ -558,6 +558,186 @@
   return conversion_result;
 }
 
+DEFUN ("test-character-tell", Ftest_character_tell, 0, 0, "", /*
+Return list of results of tests of the stream character offset code.
+For use by the automated test suite.  See tests/automated/c-tests.
+
+Each element is a list (DESCRIPTION, STATUS, REASON).
+DESCRIPTION is a string describing the test.
+STATUS is a symbol, either t (pass) or nil (fail).
+REASON is nil or a string describing the failure (not required).
+*/
+       ())
+{
+  Extbyte ext_unix[]= "\n\nfoo\nbar\n\nf\372b\343\340\nfoo\nbar\n";
+  /* Previous string in UTF-8. */
+  Extbyte ext_utf_8_unix[]
+    = "\n\nfoo\nbar\n\nf\303\272b\303\243\303\240\nfoo\nbar\n";
+  Charcount ext_utf_8_unix_char_len = 25;
+  Ibyte shortbuf[13], longbuf[512];
+  Lisp_Object stream =
+    make_fixed_buffer_input_stream (ext_unix, sizeof (ext_unix) - 1);
+  Lisp_Object result = Qnil, string = Qnil;
+  Charcount count;
+  Bytecount bytecount;
+  struct gcpro gcpro1, gcpro2, gcpro3;
+
+#define CHARACTER_TELL_ASSERT(assertion, description, failing_case) \
+  do                                                                \
+    {                                                               \
+    if (assertion)                                                  \
+      result = Fcons (list3 (build_cistring (description),          \
+                             Qt, Qnil), result);                    \
+    else                                                            \
+      result = Fcons (list3 (build_cistring (description),          \
+                             Qnil, build_ascstring (failing_case)), \
+                      result);                                      \
+    }                                                               \
+  while (0)
+
+  GCPRO3 (stream, result, string);
+
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+  stream = make_coding_input_stream
+    (XLSTREAM (stream), Ffind_coding_system (intern ("no-conversion-unix")),
+     CODING_DECODE, 0);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+
+  bytecount = Lstream_read (XLSTREAM (stream), longbuf, sizeof (longbuf));
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == sizeof (ext_unix) -1,
+                         "basic character tell, no-conversion-unix",
+                         "basic character tell failed");
+
+  string = build_extstring (ext_unix,
+                            Ffind_coding_system (intern
+                                                 ("no-conversion-unix")));
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == string_char_length (string),
+                         "repeat basic character tell, no-conversion-unix",
+                         "repeat basic character tell failed with string");
+
+  count = Lstream_character_tell (XLSTREAM (stream));
+
+  Lstream_unread (XLSTREAM (stream), "r\n", 2);
+
+  /* This should give the same result as before the unread. */
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == count, "checking post-unread character tell",
+                         "post-unread character tell failed");
+  bytecount += Lstream_read (XLSTREAM (stream), longbuf + bytecount,
+                             sizeof (longbuf) - bytecount);
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == count + 2,
+                         "checking post-unread+read character tell",
+                         "post-unread+read character tell failed");
+
+  /* This seems to be buggy for my purposes. */
+  /* Lstream_rewind (XLSTREAM (stream)); */
+  Lstream_close (XLSTREAM (stream));
+  Lstream_delete (XLSTREAM (stream));
+
+  stream = make_fixed_buffer_input_stream (ext_unix, sizeof (ext_unix) - 1);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+  Lstream_unset_character_mode (XLSTREAM (stream));
+  stream = make_coding_input_stream
+    (XLSTREAM (stream), Ffind_coding_system (intern ("no-conversion-unix")),
+     CODING_DECODE, 0);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+  Lstream_unset_character_mode (XLSTREAM (stream));
+
+  bytecount = Lstream_read (XLSTREAM (stream), shortbuf, sizeof (shortbuf));
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         /* This should be equal to sizeof (shortbuf) on
+                            non-mule. */
+                         == sizeof (shortbuf) - !(byte_ascii_p (0xff)),
+                         "character tell with short read, no-conversion-unix",
+                         "short read character tell failed");
+
+  Lstream_close (XLSTREAM (stream));
+  Lstream_delete (XLSTREAM (stream));
+
+  stream
+    = make_fixed_buffer_input_stream (ext_utf_8_unix,
+                                      sizeof (ext_utf_8_unix) - 1);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+  stream = make_coding_input_stream
+    (XLSTREAM (stream), Ffind_coding_system (intern ("utf-8-unix")),
+     CODING_DECODE, 0);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+
+  bytecount = Lstream_read (XLSTREAM (stream), longbuf, sizeof (longbuf));
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == ext_utf_8_unix_char_len,
+                         "utf-8 character tell, utf-8-unix",
+                         "utf-8 character tell failed");
+
+  string = build_extstring (ext_utf_8_unix,
+                            Ffind_coding_system (intern
+                                                 ("utf-8-unix")));
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == string_char_length (string),
+                         "repeat utf-8 character tell, utf-8-unix",
+                         "repeat utf-8 character tell failed with string");
+
+  count = Lstream_character_tell (XLSTREAM (stream));
+
+  Lstream_unread (XLSTREAM (stream), "r\n", 2);
+
+  /* This should give the same result as before the unread. */
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == count, "checking post-unread utf-8 tell",
+                         "post-unread utf-8 tell failed");
+  bytecount += Lstream_read (XLSTREAM (stream), longbuf + bytecount,
+                             sizeof (longbuf) - bytecount);
+
+  CHARACTER_TELL_ASSERT (Lstream_character_tell (XLSTREAM (stream))
+                         == count + 2,
+                         "checking post-unread+read utf-8 tell",
+                         "post-unread+read utf-8 tell failed");
+
+  /* This seems to be buggy for my purposes. */
+  /* Lstream_rewind (XLSTREAM (stream)); */
+  Lstream_close (XLSTREAM (stream));
+  Lstream_delete (XLSTREAM (stream));
+
+  stream = make_fixed_buffer_input_stream (ext_utf_8_unix, sizeof (ext_utf_8_unix) - 1);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+  Lstream_set_character_mode (XLSTREAM (stream));
+
+  stream = make_coding_input_stream
+    (XLSTREAM (stream), Ffind_coding_system (intern ("utf-8-unix")),
+     CODING_DECODE, 0);
+  Lstream_set_buffering (XLSTREAM (stream), LSTREAM_BLOCKN_BUFFERED, 65536);
+  Lstream_set_character_mode (XLSTREAM (stream));
+
+  bytecount = Lstream_read (XLSTREAM (stream), shortbuf, sizeof (shortbuf));
+
+  CHARACTER_TELL_ASSERT
+    (bytecount == (sizeof (shortbuf) - 1),
+     "utf-8 Lstream_read, character mode, checking partial char not read",
+     "partial char appears to have been read when it shouldn't");
+
+  CHARACTER_TELL_ASSERT
+    (Lstream_character_tell (XLSTREAM (stream))
+     /* This is shorter, because it's in the middle of a character. */
+     == sizeof (shortbuf) - 1,
+     "utf-8 tell with short read, character mode, utf-8-unix",
+     "utf-8 read character tell, character mode failed");
+
+  Lstream_close (XLSTREAM (stream));
+  Lstream_delete (XLSTREAM (stream));
+
+  UNGCPRO;
+  return result;
+}
+
 
 /* Hash Table testing */
 
@@ -724,6 +904,7 @@
   Vtest_function_list = Qnil;
 
   TESTS_DEFSUBR (Ftest_data_format_conversion);
+  TESTS_DEFSUBR (Ftest_character_tell);
   TESTS_DEFSUBR (Ftest_hash_tables);
   TESTS_DEFSUBR (Ftest_store_void_in_lisp);
   /* Add other test functions here with TESTS_DEFSUBR */
diff -r 4004c3266c09 -r 65d65b52d608 src/unicode.c
--- a/src/unicode.c     Sun Dec 22 10:36:33 2013 +0000
+++ b/src/unicode.c     Thu Jan 16 16:27:52 2014 +0000
@@ -1707,6 +1707,7 @@
   unsigned char counter;
   unsigned char indicated_length;
   int seen_char;
+  Charcount characters_seen;
   /* encode */
   Lisp_Object current_charset;
   int current_char_boundary;
@@ -1988,6 +1989,17 @@
                          write_error_characters_as_such);
 }
 
+static Charcount
+unicode_character_tell (struct coding_stream *str)
+{
+  if (CODING_STREAM_TYPE_DATA (str, unicode)->counter == 0)
+    {
+      return CODING_STREAM_TYPE_DATA (str, unicode)->characters_seen;
+    }
+
+  return -1;
+}
+
 static Bytecount
 unicode_convert (struct coding_stream *str, const UExtbyte *src,
                 unsigned_char_dynarr *dst, Bytecount n)
@@ -2006,6 +2018,7 @@
       unsigned char counter = data->counter;
       unsigned char indicated_length
         = data->indicated_length;
+      Charcount characters_seen = data->characters_seen;
 
       while (n--)
        {
@@ -2020,12 +2033,15 @@
                     {
                       /* ASCII. */
                       decode_unicode_char (c, dst, data, ignore_bom);
+                      characters_seen++;
                     }
                   else if (0 == (c & 0x40))
                     {
                       /* Highest bit set, second highest not--there's
                          something wrong. */
                       DECODE_ERROR_OCTET (c, dst, data, ignore_bom);
+                      /* This is a character in the buffer. */
+                      characters_seen++;
                     }
                   else if (0 == (c & 0x20))
                     {
@@ -2050,7 +2066,7 @@
                       /* We don't supports lengths longer than 4 in
                          external-format data. */
                       DECODE_ERROR_OCTET (c, dst, data, ignore_bom);
-
+                      characters_seen++;
                     }
                 }
               else
@@ -2061,15 +2077,20 @@
                       indicate_invalid_utf_8(indicated_length, 
                                              counter, 
                                              ch, dst, data, ignore_bom);
+                      /* These are characters our receiver will see, not
+                         actual characters we've seen in the input. */
+                      characters_seen += (indicated_length - counter);
                       if (c & 0x80)
                         {
                           DECODE_ERROR_OCTET (c, dst, data, ignore_bom);
+                          characters_seen++;
                         }
                       else
                         {
                           /* The character just read is ASCII. Treat it as
                              such.  */
                           decode_unicode_char (c, dst, data, ignore_bom);
+                          characters_seen++;
                         }
                       ch = 0;
                       counter = 0;
@@ -2092,10 +2113,12 @@
                                                      counter, 
                                                      ch, dst, data,
                                                      ignore_bom);
+                              characters_seen += (indicated_length - counter);
                             }
                           else
                             {
                               decode_unicode_char (ch, dst, data, ignore_bom);
+                              characters_seen++;
                             }
                           ch = 0;
                         }
@@ -2242,6 +2265,7 @@
               indicate_invalid_utf_8(indicated_length, 
                                      counter, ch, dst, data, 
                                      ignore_bom);
+              characters_seen += (indicated_length - counter);
               break;
 
             case UNICODE_UTF_16:
@@ -2295,6 +2319,7 @@
 
       data->counter = counter;
       data->indicated_length = indicated_length;
+      data->characters_seen = characters_seen;
     }
   else
     {
@@ -3177,6 +3202,8 @@
   CODING_SYSTEM_HAS_METHOD (unicode, putprop);
   CODING_SYSTEM_HAS_METHOD (unicode, getprop);
 
+  CODING_SYSTEM_HAS_METHOD (unicode, character_tell);
+
   INITIALIZE_DETECTOR (utf_8);
   DETECTOR_HAS_METHOD (utf_8, detect);
   INITIALIZE_DETECTOR_CATEGORY (utf_8, utf_8);
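
For readers following the unicode.c hunks, the counting rule they add can be sketched outside XEmacs roughly as below. This is a simplified illustration, not the code in the patch: struct utf8_counter, utf8_counter_feed() and utf8_counter_tell() are hypothetical names, nothing is actually decoded, and only the bookkeeping is modelled. Each complete character counts once, each octet that would be emitted as an error octet counts once, and the tell function returns -1 while the decoder is part-way through a multi-byte sequence, as unicode_character_tell() does when counter is non-zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the UTF-8 decoder state. */
struct utf8_counter
{
  unsigned char remaining;  /* continuation bytes still expected */
  unsigned char seen;       /* bytes consumed of the current sequence */
  long characters_seen;     /* characters the receiver will see */
};

/* Feed N bytes of external data, updating the character count only. */
static void
utf8_counter_feed (struct utf8_counter *st, const unsigned char *src,
                   size_t n)
{
  assert (st != NULL && (src != NULL || n == 0));

  while (n--)
    {
      unsigned char c = *src++;

      if (st->remaining > 0)
        {
          if ((c & 0xC0) == 0x80)
            {
              /* Valid continuation byte. */
              st->seen++;
              if (--st->remaining == 0)
                {
                  st->characters_seen++;
                  st->seen = 0;
                }
            }
          else
            {
              /* Interrupted sequence: every byte consumed so far counts
                 as one error character, and the interrupting byte counts
                 as one more (ASCII or error octet). */
              st->characters_seen += st->seen + 1;
              st->remaining = 0;
              st->seen = 0;
            }
        }
      else if (c < 0x80)
        st->characters_seen++;          /* ASCII */
      else if ((c & 0xE0) == 0xC0)
        { st->remaining = 1; st->seen = 1; }
      else if ((c & 0xF0) == 0xE0)
        { st->remaining = 2; st->seen = 1; }
      else if ((c & 0xF8) == 0xF0)
        { st->remaining = 3; st->seen = 1; }
      else
        st->characters_seen++;          /* stray or unsupported octet */
    }
}

/* Character offset, or -1 if we are in the middle of a character. */
static long
utf8_counter_tell (const struct utf8_counter *st)
{
  return st->remaining == 0 ? st->characters_seen : -1;
}
```

Fed the ext_utf_8_unix string from the test above, this sketch reports 25 once all the bytes have been consumed, and -1 after exactly 13 bytes, since the thirteenth byte is the first octet of a two-byte sequence.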


-- 
‘Liston operated so fast that he once accidentally amputated an assistant’s
fingers along with a patient’s leg, […] The patient and the assistant both
died of sepsis, and a spectator reportedly died of shock, resulting in the
only known procedure with a 300% mortality.’ (Atul Gawande, NEJM, 2012)


