Re: [Pan-devel] ancient DB schema

From: Calin A. Culianu
Subject: Re: [Pan-devel] ancient DB schema
Date: Tue, 8 Jun 2004 09:43:08 -0400 (EDT)

This looks reasonable, although for performance I would suggest further
normalizing the Articles table so that it doesn't contain the actual
subject text.

For the binaries newsgroups, I have been able to really increase 
performance and minimize disk space usage by doing the following:

If it's a multipart binary (as determined using the heuristics already in
Pan, namely the subject ends in [xx/yy] or (xx/yy) and the article is over
400 lines), then we can assume that all the subjects are the same and
differ only in the xx/yy part.  So why not truncate that part, put the
truncated subjects in a separate table, and save only the 'subject id',
the part number (xx) and the number of parts (yy) in the Articles table?
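
To make the split concrete, here is a rough sketch in Python.  It is not
Pan's actual heuristic code: the regex, the name split_multipart_subject
and the way the 400-line threshold is applied are my own assumptions,
chosen only to illustrate the idea.

```python
import re

# Trailing "[xx/yy]" or "(xx/yy)" part marker at the end of a subject.
# Assumption: this sketch does not insist that the brackets match.
_PART_RE = re.compile(r"\s*[\[(](\d+)/(\d+)[\])]\s*$")

MIN_LINES = 400  # "over 400 lines", per the heuristic described above


def split_multipart_subject(subject, line_count):
    """Return (base_subject, part, parts).

    For anything that does not look like a multipart binary, the subject
    is returned unchanged with part and parts set to None.
    """
    m = _PART_RE.search(subject)
    if m is None or line_count <= MIN_LINES:
        return subject, None, None
    # Drop the part marker; keep the numbers as integers.
    return subject[:m.start()], int(m.group(1)), int(m.group(2))
```

Every part of the same post then yields the same base subject, which is
what lets it be stored once and referenced by id.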

In fact, in a typical 1 million+ header group, there are usually only
about 1000-2000 unique subjects.  So you save a LOT of disk space by doing
this, and queries and sorting of the Articles table become much faster,
since less data has to be scanned overall.
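
As an illustration of why the row count in the subjects table stays small,
here is a minimal sketch using Python's sqlite3 module.  The table and
column names (subjects, articles, subject_id, part, parts) and the
add_article helper are hypothetical, not the schema from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subjects (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL UNIQUE          -- truncated subject, stored once
);
CREATE TABLE articles (
    id         INTEGER PRIMARY KEY,
    subject_id INTEGER NOT NULL REFERENCES subjects(id),
    part       INTEGER,                -- xx
    parts      INTEGER                 -- yy
);
""")


def add_article(conn, base_subject, part, parts):
    """Insert one article row, reusing the subject row if it already exists."""
    conn.execute("INSERT OR IGNORE INTO subjects (text) VALUES (?)",
                 (base_subject,))
    (subject_id,) = conn.execute(
        "SELECT id FROM subjects WHERE text = ?", (base_subject,)).fetchone()
    conn.execute(
        "INSERT INTO articles (subject_id, part, parts) VALUES (?, ?, ?)",
        (subject_id, part, parts))
    return subject_id


# A 57-part post: 57 article rows, but only one subject row.
for part in range(1, 58):
    add_article(conn, "foo.avi", part, 57)

n_subjects = conn.execute("SELECT COUNT(*) FROM subjects").fetchone()[0]
n_articles = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

The articles rows hold only three small integers besides the key, so
sorting and scanning them touches far less data than rows carrying the
full subject text.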

Anyway, the stuff I am working on now as far as DB changes isn't as
comprehensive as what you propose here.  As an initial first pass, I am
_only_ changing the bits of Pan that deal with article headers, and
putting only that stuff in the DB, as that's where we have really big
problems with memory consumption and that's where we benefit most from
using a DB.  This is the lazy man's approach: I don't want to change Pan
too much, I only want to tweak it to scale better.

I leave it to you guys to decide how to totally metamorphose Pan into
using a full-fledged DB backend and creating 'virtual groups' or whatever
it was you were discussing.


On Fri, 4 Jun 2004, K. Haley wrote:

> I'm attaching an old DB schema I came up with.  It is based on the one 
> posted by Charles a long time ago.  There are still some unanswered 
> questions as to where some of the info should go.  The biggest one is 
> whether or not articles in folders should be in their own table.   FYI I 
> chose to use integer primary keys for space and speed savings.
