Re: lynx-dev proposal for LYK_SCRIPT and patch


From: Klaus Weide
Subject: Re: lynx-dev proposal for LYK_SCRIPT and patch
Date: Mon, 5 Jul 1999 11:12:19 -0500 (CDT)

On Sun, 4 Jul 1999, Scott Bigham wrote:

> On Sat, 3 Jul 1999, Klaus Weide wrote:
> 
> > 1. This is a bad hack based on a bad hack.
> 
> This is only tangential to the original subject (and I'm only tangential
> to SOURCE_CACHE nowadays...), but I think this merits a reply.
> 
> >    SOURCE_CACHE isn't a real cache in the HTTP sense [...]
> 
> It's not meant to be.  It piggybacks on the HText cache, and inherits

Yes, I know it isn't intended to be that.  It might still one day grow
into something resembling one, and I'd hate to see stuff added that
would make that possibility more remote.

Don't get me wrong, I actually find SOURCE_CACHE useful (sometimes) and
use it (sometimes).

I should not have called it a "bad hack".  Please accept my apologies
for that.  It's not bad for what it does.

I would also like to apologize to Eduardo, if he found my message too
unkind.  I appreciate his contribution, and would like to see more
people sending their private modifications to lynx-dev.  I do think,
however, that the code had a lot of problems; but those can be resolved,
or the effort taken in a different direction.

> its expiration and variant logic.  Having duplicates of this logic for
> two separate caches seems wasteful, especially since there's little if
> any reason for them to differ substantially.

It seems to me that in fact a lot of the *code implementing the logic*
had to be duplicated - the only resource saved by not thinking about a
modified, more appropriate logic may be developer brain cells. :)

If I understand correctly, the reason for SOURCE_CACHE's existence was
to provide a backup copy for those situations where Lynx had to reload
the document because some setting changed, not for any other cases of
reloading.  (Right?  At least that seems to be what it's doing.)
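
To make that concrete, the logic I have in mind is roughly this (a
sketch only, not the actual mainloop code; HTcan_reparse_document() is
my invented name for the availability check):

    /*
     * Sketch: on a settings change, re-render from the cached source
     * if a copy is available, otherwise fall back to a real reload.
     */
    if (setting_changed) {
        if (HTcan_reparse_document())   /* hypothetical availability check */
            HTreparse_document();       /* re-render from SOURCE_CACHE copy */
        else
            force_load = TRUE;          /* no cache copy: reload from net */
    }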

If that is the case, then it *is* wasteful to always make a copy for all
http: documents.  I need the cache copy only for a few of them.

I am not really complaining about this - I don't know how lynx could
predict which copies would be needed and which would not, either. :)
And I can turn SOURCE_CACHE off.

Btw., SOURCE_CACHE does already differ from the HText cache logic, in at
least two respects: (1) the source cache copy is kept when the HText
rendered copy is discarded because of changed settings (of course!),
and (2) it is selective as to MIME type and protocol.  It's not obvious
why these differences should be the only ones (or stay as they are).

> >    [...] and it is incredibly wasteful: it makes a cache copy for each
> >    and every URL
> 
> Only for documents currently in the HText cache; the cached source is
> discarded at the same time that the cached rendering is discarded.  Or
> am I misunderstanding your objection?

I was not completely sure whether it always keeps the cache copy only as
long as the HText doc is kept.  But yes, that seems to be the case
(except of course at the point where HText is reloaded because a setting
changed.)

But SOURCE_CACHE does make a copy for each and every (http[*]) URL
that is parsed as HTML.  It just doesn't keep it forever.

Regarding my expression 'incredibly wasteful' - well, that's a relative
evaluation.  It depends on what you compare it with (e.g. with the
intended purpose: how many of the cached copies are actually ever used?
or with possible alternatives: would installing a caching proxy serve me
better? are different implementations with lower resource consumption
possible?), on the actual usage pattern, and on the resource situation.
Everybody will have to come to their own conclusions (and I come to
different conclusions in different situations :)) in deciding whether
to use it.

My main point was that there are good reasons not to use it some of
the time, and that the availability of a cache file may not be a constant,
so it would be good not to depend on it for a logically quite unrelated
task/feature.

[*] Actually, what's tested is the protocol name for "physical" access,
which is different from the user-visible URL e.g. when using proxies.
So HTML documents with ftp: URLs accessed through a proxy would be
source-cached, but are not when no proxy is used.  That seems a bit
arbitrary.
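
For illustration, the kind of test I mean looks roughly like this (a
sketch from memory, so the names may not match the actual code;
physical_protocol() is my shorthand for extracting the access scheme
from the anchor's physical address):

    /*
     * Sketch of the selectivity test.  Note that the *physical*
     * protocol is examined, so a proxied ftp: document shows up
     * here as "http" and gets source-cached anyway.
     */
    if (LYCacheSource != SOURCE_CACHE_NONE      /* source caching on at all? */
        && !strcmp(physical_protocol(anchor), "http")   /* physical access */
        && !strcmp(HTAtom_name(format), "text/html")) {
        /* keep a copy of the incoming source while parsing */
    }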

> >    SOURCE_CACHE also messes around with the way Lynx usually does things
> >    in weird ways, of which I am still suspicious.
> 
> For instance?

What made me say that were these comments in mainloop():
                /*
                 * Urk.  I have no idea how to recover from a failure here.
                 * At a guess, I'll try reloading.  -dsb
                 */
        /*
         *  Trying to accomodate HTreparse_document() logic
         *  with mainloop events.  Working out of force_load cycle
         *  set all the necessary flags here, from case NORMAL
         *  (see also LYK_SOURCE, some stuff implemented directly there).
         */
                        /*
                         * These normally get cleaned up after getfile()
                         * returns; since we're not calling getfile(), we
                         * have to clean them up ourselves.  -dsb
                         */

Also, calling the loading function out of the lower half of mainloop,
rather than at the top through getfile(), is a significant change.  Are you
sure that, when control returns to the top of the loop after one of the
HTreparse_document() calls, the myriad variables are all handled right?
Or, more importantly for me, can I be sure?  I tried to follow the logic,
but perhaps not long enough, and could not easily see that.  It seems
it took quite a number of attempts to get it as right as it is now, so
it is quite likely there are still bugs lurking (witness the most recent
patch by Leonid for chartrans).

Also, if I had previously added some new variable in mainloop() for
some purpose, do I now have to duplicate its handling somewhere in
GridText?  (Maybe the answer would be easy to find, and I haven't
really checked in detail.  I just feel things have gotten a bit more
complicated, with more things to check.  No, not that mainloop() was
clean and easy before...)


> As for the proposal under discussion:  I agree that this is probably the
> wrong way to produce this functionality.

I wouldn't really go that far.  At least, the mess-with-the-cache-file
approach is something I would like to be able to play around with, as
long as it can be done with fewer changes to Lynx / with a cleaner
interface.

I don't want to say what should be 'the' way.  Does there have to be
only one?

> Your gettidy-esque lynxcgi:/
> mechanism looks interesting, though I can't test it with my current
> build.  Do relative links get handled correctly?  

Yes, that's the beauty of it.  The current document will appear to
have the URL <xhttp://example.com/some/page.html> (the way I have
set it up; variations are possible).  Relative links will point
to <xhttp://example.com/...>, absolute links to the server to
<http://example.com/...>; both work (with the obvious difference
that in the first form access automatically goes through the same
script).
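
For example, for two hypothetical links on that page:

    href="other.html"                     -> xhttp://example.com/some/other.html
                                             (resolved relative; goes through the script again)
    href="http://example.com/other.html"  -> http://example.com/other.html
                                             (absolute; bypasses the script)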

> How does it affect the
> cache?

The xhttp: version appears to Lynx as a different document, so it will
be cached separately.  Or not, if you wish; control via all sorts of HTTP
headers is possible (remember, it's lynx*CGI*!), including no-cache.
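
E.g. the script could emit headers like these before the document (a
minimal example; I haven't checked which of the usual cache-control
headers Lynx actually honors from lynxcgi):

    Content-Type: text/html
    Pragma: no-cache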

One could develop a system where parameters that control what the script
does are passed as part of the URL.  Like
<xhttp://example.com/some/page.html?myscriptaction=remove-DIV> and so
on.  In that case different invocations are of course cached as different
documents.

Source caching would not happen, unless the relevant lynx code is changed
(the protocol name that is tested would say "lynxcgi" instead of "http").

The script has to get the document from *somewhere*.  But it can do
nearly whatever it wants to achieve that, based on the URL: load it from
the net each time (using lynx, or wget etc.), better through a proxy
cache if you have one; or make a local copy with wget the first time a
URL is requested.

Or, what about this idea: modify Lynx to put the requested real (non-x)
document's SOURCE_CACHE file, if one is available, into an environment
variable.  (There's already an existing option to control which
environment variables are passed to lynxcgi.)  The script then has
complete liberty to use that file as input if available, and to get the
document from the net otherwise.  A disadvantage of this is that lynx
has to know about the naming convention (i.e. prefixing with 'x',
possibly some parameters).
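
As a sketch of such a script, written in C for concreteness (a shell
script would do just as well; the LYNX_SOURCE_CACHE variable name and
the fixed fallback URL are made up):

    /*
     * Sketch of a lynxcgi script: use the SOURCE_CACHE copy if Lynx
     * passed one in the (hypothetical) LYNX_SOURCE_CACHE variable,
     * else fetch from the net.  A real script would filter the
     * document here instead of copying it through unchanged, and
     * would derive the URL from the CGI environment.
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *cached = getenv("LYNX_SOURCE_CACHE");  /* made-up name */
        FILE *in;
        int c;

        printf("Content-Type: text/html\r\n\r\n");

        if (cached != NULL && (in = fopen(cached, "r")) != NULL) {
            while ((c = getc(in)) != EOF)   /* use the cache copy as input */
                putchar(c);
            fclose(in);
        } else {
            /* no cache copy available: get it from the net instead */
            system("lynx -source 'http://example.com/some/page.html'");
        }
        return 0;
    }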

A disadvantage with all this lynxcgi stuff is that it doesn't work
on VMS, DOS, Windows (except, I assume, with cygwin), and other
non-Unix-like systems.  (Unless someone implements lynxcgi for those.)

> If we really want this done in the code, then just off the top of my
> head, I'd say the most straightforward way to do it would be as an
> HTStream interposed between the source and rendering, downstream of the
> source cache if present.  In a sense, it would be akin to (reloading
> and) reparsing the document with inline image links enabled, or with
> comment parsing set to minimal.  The filtered source wouldn't actually
> live in the cache or on disk anywhere, but would simply be passed to the
> rendering mechanism; the cached source (if present) would be unchanged,
> and the cached HText would be replaced as though by a reparse.  

Very similar to the lynxcgi idea above, except that several processes
are involved and the loading is not done by the lynx process.

An HTStream that runs arbitrary commands as a filter is certainly
possible, at least for Unix.  Shells do it all the time, after all.

For other systems, portability would be a problem.

There could be a poor man's implementation for systems without fork etc.
(or something equivalent), along the lines of how HTCompressed works
in lieu of real streaming decompression.
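
To show what I mean, here is a rough sketch of that poor man's variant
(Unix-flavored, and not Lynx's actual HTStream interface; the function
name and calling convention are made up):

    /*
     * Run an external command as a filter over a temp copy of the
     * source, and copy its output to wherever the renderer reads
     * from.  Uses popen(), so Unix (or something equivalent) only.
     */
    #include <stdio.h>
    #include <string.h>

    static int filter_through(const char *filter_cmd, /* e.g. "tidy -quiet" */
                              const char *src_file,   /* temp copy of source */
                              FILE *out)              /* feeds the renderer */
    {
        char cmd[1024];
        FILE *fp;
        int c;

        if (strlen(filter_cmd) + strlen(src_file) + 4 > sizeof cmd)
            return -1;                      /* command line too long */
        sprintf(cmd, "%s < %s", filter_cmd, src_file);
        if ((fp = popen(cmd, "r")) == NULL)
            return -1;
        while ((c = getc(fp)) != EOF)       /* pass filtered output along */
            putc(c, out);
        return pclose(fp);
    }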

> At that
> level, the filtering mechanism shouldn't even care if source caching is
> enabled; if it's disabled, then filtering would mean reloading the
> document from the `net, but so would changing the comment parsing.
> 
>                                               -sbigham