[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [help-GIFT] Re: Clarification on inverted file

From: Wolfgang Mueller
Subject: Re: [help-GIFT] Re: Clarification on inverted file
Date: Mon, 20 Aug 2001 11:28:32 +0200

On Friday 17 August 2001 08:56, David Squire wrote:
> suryani lim wrote:
> > I am trying out GIFT at the moment, to see if I can use it and to test
> > your published algorithms on my database. I had some problems indexing
> > my images when I first used it. It's looking good now but I am unsure
> > which combination of algorithms are implemented in Classical IDF. You
> > have published several weight functions and pruning algorithms.
> >
> > For pruning, it seems that only sorted weight and lossy pruning is
> > implemented, with the pruning features set at 70%? Is that right?

OK. The 70% is a parameter which can be changed during runtime. It means the 
program will read the feature list for the 70% query features with the 
highest term frequency. Note that this term frequency is the effective term 
frequency calculated from the feedback given.

The pruning is to be found in

in the method
CQInvertedFile::init() the parameters coming from the config file and/or the 
interface will be analyzed.

in the method
...::keepScorePruning() they will be used (all these conditions in the 
for-loop). Henning implemented also pruning based on a maximum time, and 
cutting off the scoreboard at a certain point within query processing (this 
limits the number of hash accesses). In the current version it is untested if 
my linking it to the rest of the GIFT is correct. The feature-list based 
pruning is used a lot, though so you can rely on that.

Why only lossy pruning?
With our current collection sizes we can assume, that the seek time (seeking 
the location of the feature list on the hard disk) is more important than the 
actual reading time. So the pruning has to be based on entire feature lists. 

The other point is, that in the current feature set of Viper, for the 
overwhelming majority of the features the term frequency is either one or 
zero. Implementing more sophisticated methods that read only parts of feature 
lists thus would not make sense.

The weighting is to be found in
The function of interest is
So we use a Rocchio-inspired weighting formula for obtaining the 
tfrequencies. What we don't do is cutting off the negative ones, i.e. there 
*will* be features with negative term frequency. Please note that which 
weighting function is actually used is also decided during runtime. The 
weighting functions are encapsulated in classes that inherit from 
CWeightingFunction ("Strategy" pattern).

> Few people say that they are using inverted files, but quite a few
> techniques use the same idea, which is to index images by features, rather
> than vice versa. Smith and Chang's colour sets is a good example (see
> There are other examples in the literature; sometimes the term "hashing" is
> used.

If I may add to that, Smeulders et al. report in their recent overview 
article some work done in 1996 that was about indexing images in inverted 
files using salient points. Chad Carson told me, that he stored the blobs of 
BlobWorld in an inverted file. MARS is strongly inspired by text retrieval, 
but modifies the retrieval scheme, basing the weighting not on the document 
frequency but on the standard deviation of the term frequency.

After what I have seen, David's approach that was implemented in Viper is 
maybe the most radical approach to adapting image retrieval methods to text 
retrieval indexing, because it does not try to achieve semantically 
meaningful representation (e.g. segments) before indexing the image in an 
inverted file. So much so, that this inverted files are not even mentioned as 
a possible general purpose indexing structure in the said Smeulders et al. 
article. Please correct me, David, if I missed something.

More questions welcome...

Wolfgang Müller, 
assistant-doctorant ==  PhD student (2001), teaching assistant
Personal page: 
Maintainer, GNU Image Finding Tool (

reply via email to

[Prev in Thread] Current Thread [Next in Thread]