Re: [bug-mailutils] iterating through large mailbox: memory consumption

From: Alain Magloire
Subject: Re: [bug-mailutils] iterating through large mailbox: memory consumption
Date: Fri, 4 Mar 2005 15:22:31 -0500 (EST)

> I am writing a program that reads all messages in a mailbox in
> sequence, say to convert the mailbox into a structured representation,
> or to compile on-the-fly statistics regarding headers, or similar
> things.
> One of the mailboxes contains spam sent to a domain.  The sample I
> have is a Unix mbox file with 45,000 messages.  Iterating through the
> messages with
>   mailbox_messages_count(mboxObj, &numMessages);
>   for (m = 1; m <= numMessages; ++m) {
>     mailbox_get_message(mboxObj, m, &msgObj);
>       /* ... some code ... */
>     message_destroy(&msgObj, message_get_owner(msgObj));
>   }
> consumes close to 300 MB of RAM on my machine.
> Is there a way to go through the messages one-by-one without using
> memory proportional to the total file size?  Maybe I'm doing something
> wrong.
> I looked at the supplied frm.c, and also ran frm on my big mailbox,
> but found that it, also, consumes lots of memory.

45 000 ...
We did expect a certain hit.  The problem is that when your mailbox
is scanned (scan()), a certain amount of information is saved:
(1) each message is allocated a structure that will hold the offsets of the 
(2) and a certain number of headers are saved/allocated for fast processing,
    so when you ask for the Subject/From/Date etc. the lib will not go back
    to the disk, which would be too slow.

For (1), we need this.
However, for (2) we could have reduced the number of headers cached, and in
doing so reduced the memory usage.
I've been playing with the Mailbox stuff, and the new implementation will let
you choose what to cache, but it's not ready.

Meanwhile, you could edit mailbox/mboxscan.c and get rid of some of the
cached headers:
#define FAST_H_BCC(mum,save_field,buf,n)
#define FAST_H_CC(mum,save_field,buf,n)
#define FAST_H_CONTENT_LANGUAGE(mum,save_field,buf,n)
#define FAST_H_CONTENT_TRANSFER_ENCODING(mum,save_field,buf,n)
#define FAST_H_CONTENT_TYPE(mum,save_field,buf,n)
#define FAST_H_DATE(mum,save_field,buf,n)
#define FAST_H_FROM(mum,save_field,buf,n)
#define FAST_H_IN_REPLY_TO(mum,save_field,buf,n)
#define FAST_H_MESSAGE_ID(mum,save_field,buf,n)
#define FAST_H_REFERENCE(mum,save_field,buf,n)
#define FAST_H_REPLY_TO(mum,save_field,buf,n)
#define FAST_H_SENDER(mum,save_field,buf,n)
#define FAST_H_SUBJECT(mum,save_field,buf,n)
#define FAST_H_TO(mum,save_field,buf,n)
#define FAST_H_X_UIDL(mum,save_field,buf,n)

Now multiply this by 45000 ... lots of memory.
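
To make the idea concrete, here is a minimal sketch of how a macro defined
with an empty body can switch off caching for one header, and how the
per-message cost multiplies.  This is not the mboxscan.c code: struct
msg_cache, CACHE_SUBJECT, CACHE_FROM, scan_one and estimate_total are all
hypothetical names made up for the illustration, and the assumption is that
the FAST_H_* macros above work along similar lines.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-message record; NOT the real mailutils structure. */
struct msg_cache {
    char *subject;
    char *from;
    size_t cached_bytes;    /* bytes held in memory for this message */
};

/* Copy a header value into the cache and account for its size. */
static char *cache_value(struct msg_cache *mc, const char *value)
{
    size_t len = strlen(value) + 1;     /* include the trailing NUL */
    char *copy = malloc(len);
    if (copy != NULL) {
        memcpy(copy, value, len);
        mc->cached_bytes += len;
    }
    return copy;
}

/* Enabled: the macro saves the header value. */
#define CACHE_SUBJECT(mc, value) ((mc)->subject = cache_value((mc), (value)))

/* Disabled: with an empty body, the call site expands to nothing. */
#define CACHE_FROM(mc, value)

/* "Scan" one message and return how many bytes ended up cached for it. */
size_t scan_one(const char *subject, const char *from)
{
    struct msg_cache mc = { NULL, NULL, 0 };
    size_t used;

    (void)from;                 /* unused while CACHE_FROM is disabled */
    CACHE_SUBJECT(&mc, subject);
    CACHE_FROM(&mc, from);      /* expands to an empty statement */

    used = mc.cached_bytes;
    free(mc.subject);
    free(mc.from);              /* still NULL here; free(NULL) is a no-op */
    return used;
}

/* Per-message cost times the number of messages in the mailbox. */
size_t estimate_total(size_t bytes_per_message, size_t message_count)
{
    return bytes_per_message * message_count;
}
```

Calling estimate_total(scan_one(subject, from), 45000) gives a rough figure
for the cached header bytes alone; it ignores malloc overhead and the
per-message structure from (1).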

> Thanks,
> Robby Villegas
> P.S.  Actually, the count alone, mailbox_messages_count(mboxObj,
> &numMessages), consumes this much memory.  Seeking to a message at the
> end with, say mailbox_get_message(mboxObj, 45000, &msgObj), without
> computing the count first (since I know that 45000 is valid here),
> also consumes the memory.

Skipping to a message is the same: before skipping we need to scan.
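
The reason is that a Unix mbox file carries no index: the only way to find
where message N starts is to read the file from the top and look for the
"From " separator lines.  A simplified sketch of such a scan follows.
scan_mbox is a hypothetical helper, not the mailutils scanner, and it assumes
every line beginning with "From " starts a message (real scanners are
stricter about this).

```c
#include <stddef.h>
#include <string.h>

/*
 * Walk an in-memory mbox buffer line by line and record the byte offset
 * of every line that begins with "From ".  Returns the number of such
 * lines found; at most max offsets are stored.
 */
size_t scan_mbox(const char *buf, size_t len, size_t *offsets, size_t max)
{
    size_t count = 0;
    size_t pos = 0;

    while (pos < len) {
        const char *nl;

        /* pos is always the start of a line here. */
        if (len - pos >= 5 && memcmp(buf + pos, "From ", 5) == 0) {
            if (count < max)
                offsets[count] = pos;
            count++;
        }

        /* Advance to the start of the next line, if there is one. */
        nl = memchr(buf + pos, '\n', len - pos);
        if (nl == NULL)
            break;
        pos = (size_t)(nl - buf) + 1;
    }
    return count;
}
```

Even in this stripped-down form, reaching the last message means touching
every byte before it, and one offset is kept per message, which is the part
of the cost described under (1).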
