Re: how to calculate the size of string in bytes?

From: tomas
Subject: Re: how to calculate the size of string in bytes?
Date: Tue, 18 Aug 2015 13:47:03 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Aug 18, 2015 at 03:43:44AM -0700, Sam Halliday wrote:
> On Tuesday, 18 August 2015 11:14:04 UTC+1, address@hidden  wrote:
> > On Tue, Aug 18, 2015 at 02:11:54AM -0700, Sam Halliday wrote:
> > > We used to have a 6 character hex number at the start of each message 
> > > that counted the number of multibyte characters, but we'd like to change 
> > > it to be the number of bytes in the message.
> > > 
> > > We're sending the string to `process-send-string' and `read'ing from the 
> > > associated network buffer. But when calculating the outgoing length of 
> > > the string that we want to send, we use `length' --- but we need this to 
> > > be `length-in-bytes' not the number of multibyte chars. Is there a built 
> > > in function to do this or am I going to have to iterate the string and 
> > > count the byte size of each character?
> > > 
> > > A quick test shows that
> > > 
> > >   (length (encode-coding-string "EURO" 'raw-text))
> > > 
> > > seems to give the correct result (1 for ASCII, 2 for Pound Sterling, 3 
> > > for Euro), but I am not 100% sure if this is correct.
> > 
> > Raw is, afaik, Emacs's internal coding system. You don't want traces of it
> > in the network :-)
> We're not sending the message using raw, we're using UTF-8. But I need to 
> calculate the length of the UTF-8 string IN BYTES as part of the payload 
> (each message begins with a 6-character hex encoding of the following 
> string's raw length).

Yes, I get that. The way I understand encode-coding-string is that you give
it the target encoding:

  (length (encode-coding-string foo 'raw-text))

would mean "transform this string to whatever Emacs uses as internal
encoding and measure its length in bytes", whereas what you want is,
AFAIU, "transform this string to UTF-8 and measure its length in bytes",
which would read as:

  (length (encode-coding-string foo 'utf-8))
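
To make that concrete, here is a minimal sketch of how the byte count and the
6-digit hex prefix could be computed. The helper names `my-utf8-byte-length'
and `my-length-header' are made up for illustration and are not part of Emacs
or of the code discussed in this thread. It also assumes the process's coding
system is utf-8 too; otherwise the count may not match what
`process-send-string' actually puts on the wire.

```elisp
;; Hypothetical helpers, for illustration only.

(defun my-utf8-byte-length (string)
  "Return the number of bytes STRING occupies when encoded as UTF-8."
  ;; `encode-coding-string' returns a unibyte string, so `length'
  ;; counts bytes here rather than characters.
  (length (encode-coding-string string 'utf-8)))

(defun my-length-header (string)
  "Return the UTF-8 byte length of STRING as a 6-digit hex string."
  (format "%06x" (my-utf8-byte-length string)))

;; "\u20ac" is the Euro sign: one character, three bytes in UTF-8.
(my-utf8-byte-length "\u20ac")   ; => 3
(my-length-header "\u20ac")      ; => "000003"
```

Encoding the payload once and taking its `length' keeps the count tied to the
exact bytes of the chosen wire encoding rather than to Emacs's internal
representation.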

> I'm using "raw" to calculate an approximation of the UTF-8 string's byte 
> length, but I am aware that it might not actually be true in the general case 
> :-/

Use utf-8 then?

> I don't think what you've suggested would actually change the semantics, but 
> it would allow us to use a different encoding on the wire than the encoding 
> of the string. We don't really need to worry about that at this stage, 
> because all our users are using UTF-8. We'll keep it in mind though.

But, but... isn't that a bug lurking? And it would be so easy to fix...
(that is unrelated to the above issue -- that I think you want utf-8
instead of raw)

-- tomás