lynx-dev

RFC: change coming to lynx-dev archives


From: Al Gilman
Subject: RFC: change coming to lynx-dev archives
Date: Mon, 11 Feb 2002 11:01:13 -0500

Our archives have been on the web now for several years thanks to the kind 
hospitality of the Flora community.

The size of the archive and the risk that system changes might break MHonArc 
and there would be nobody to fix it argue for a change in how we host the 
archives.

On the other hand, almost half the web traffic on the Flora site goes to the 
lynx-dev archives, and I don't know how much of that is for old material.

Further down, I quote a longer discussion of the issues and plans from Russell 
McOrmond.  For the moment, I want to focus on a few questions that are open or 
could be opened.

** Plan A:  stay at Flora and rehost to ForumDB.

A currently running instance of the proposed ForumDB flavor of archive, running 
with our messages, is visible at


http://www.flora.org/lynx-dev/forum/

The look and feel can be adjusted on a per-list basis if we feed PHP patches to 
Russell for integration.  Contributions in this form are very welcome.  Some 
form of support for threaded indexing or navigation is a known wish-list item.  
Code contributions sought.

* Question:  The current archive format hangs all posters' email addresses out 
in an easily harvested format.  The thinking so far has been that we should 
change our way of doing business to make it harder to harvest list-members' 
addresses from the web access to the archives.

Which of the following do you think we should adopt as list 
policy/infrastructure:

- obscure email addresses mildly in the HTML so that the HTML source text is 
not usable without some transformation as an email address.  Things like 
representing <address@hidden> as &lt;John.Doe  -(at&#41;&#x2D; 
your.home.domain>.  This will allow people transcribing edresses manually to 
get them right and address any individual who has posted.  It may prevent 
collection in today's price-performance climate for email address harvesters, 
but it does not make it really hard.

- remove individual email addresses from the public web pages entirely.  If 
someone wishes to communicate with an individual individually, they can 
subscribe and post requesting that the party get in touch with them.  And 
subscribers can address subscribers.
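
To make the first option concrete, here is a minimal Python sketch of mild obfuscation. The function name `obscure_address` and the exact " -(at)- " marker are assumptions chosen to mirror the example above; the real ForumDB viewer is PHP and may do this differently.

```python
import html


def obscure_address(address: str) -> str:
    """Return an HTML fragment that a person can read as an address but that
    does not contain a literal user@host string in the page source."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" present, so there is nothing to hide; just escape for HTML.
        return html.escape(address)
    # Numeric character references for "-", "(" and ")" so that a naive
    # pattern match on the source text does not see " -(at)- " either.
    at_marker = " &#x2D;&#40;at&#41;&#x2D; "
    return html.escape(local) + at_marker + html.escape(domain)


# Hypothetical address, matching the illustration in the text.
print(obscure_address("John.Doe@your.home.domain"))
```

A browser renders the result as "John.Doe -(at)- your.home.domain", which a person can retype, while the source text needs a transformation before it is usable. As noted above, this raises the cost of harvesting but does not make it really hard.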

Note that we are contemplating removing the public automatic availability of 
the mailbox format files.  That is just too great a harvesting invitation.  
These would be archived somehow and available by explicit human request for 
human release.  How that works is TBD.

If there is a readily implemented alternative that you think is much better 
than either of these, please explain it and how it is readily achievable.  Even 
the second one above takes finding the right code and adding it into the 
existing ForumDB PHP viewer.

** Plan B:

If someone out there thinks that sustaining the MHonArc form, or some other 
form of the archives is critical, can you volunteer to host and maintain the 
installation, or do you think you can recruit someone who will?  In other 
words, alternatives to Plan A as outlined above should come with resources.

Al

-- quoting from Russell:


  Here is what was discussed previously [on listelves], just to bring us to the
same page.

  I have hosted multiple archives over the years and have ended up with
problems, and I came up with solutions:

  - MHonArc is configurable, but once a message is 'stored' a
    configuration file change doesn't change how old messages are viewed.
    As times change, the look sometimes needs to change as well (e.g.
    ongoing changes of "address@hidden" to "russell AT flora DOT org",
    or other formats as the worms figure that pattern out).

  - Storing in HTML files speeds up Webserver access, but requires more
    disk space for the "look and feel", and the original RFC-822
    message needs to be stored anyway.

  - MHonArc requires an external search facility, when searching on a few
    fields (Subject, date, etc.) may be what is most commonly required.



  What I came up with was a system which stored messages in an SQL
database, and then a simple PHP viewer of the SQL data.  Information is
at:  http://www.flora.org/flora/forumdb/
  There is also a SourceForge project at:
    http://sourceforge.net/projects/forumdb/


  The importer is written in Perl and is very simple.  It separates the
email into header and body, and stores both parts in a table.  It then
extracts specific headers and stores them in a separate table for quick
searching on those fields.  The SQL database interface used is MySQL.
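
The import step described above can be sketched as follows. This is an illustration only, not the ForumDB importer: it uses Python and the built-in `sqlite3` module in place of Perl and MySQL so that it is self-contained, and the table names, column names, and the list of indexed headers are all assumptions.

```python
import sqlite3
from email import message_from_string

# Assumed set of headers worth indexing; the real importer's list is not
# given in this message.
INDEXED_HEADERS = ("Message-ID", "From", "Subject", "Date")


def create_tables(db):
    # One table holds the raw header and body, a second holds selected
    # headers for quick searching. Both layouts are hypothetical.
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "msgnum INTEGER PRIMARY KEY AUTOINCREMENT, header TEXT, body TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS headers ("
        "msgnum INTEGER, name TEXT, value TEXT)"
    )


def import_message(db, raw):
    """Store one RFC-822 message and return its message number."""
    # Header and body are separated by the first blank line.
    # This sketch assumes LF line endings.
    header, _, body = raw.partition("\n\n")
    cur = db.execute(
        "INSERT INTO messages (header, body) VALUES (?, ?)", (header, body)
    )
    msgnum = cur.lastrowid
    parsed = message_from_string(raw)
    for name in INDEXED_HEADERS:
        value = parsed.get(name)  # header lookup is case-insensitive
        if value is not None:
            db.execute(
                "INSERT INTO headers VALUES (?, ?, ?)",
                (msgnum, name, str(value)),
            )
    return msgnum


# Usage with an in-memory database and a made-up message.
db = sqlite3.connect(":memory:")
create_tables(db)
sample = (
    "From: someone@example.org\n"
    "Subject: Test\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "Hello\n"
)
print(import_message(db, sample))
```

Keeping the raw header and body alongside the extracted fields means the display can change later without re-importing, which is the problem with stored HTML noted earlier.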

  The viewer is written in simple PHP, compatible with PHP3 and PHP4.
There is a small number of functions which are expected to be inserted
into a PHP file with all the look-and-feel.


Conversion process
------------------


  Lynx-dev has a few directories of files:

  /html/  - this is the MHonArc archives starting October 1996 
  /mailbox/ - this is the mailbox format files.
  /lynx-dev/  - Archives prior to 1996.

In terms of size (in Bytes).

214952960       html
1024            index.html
47548416        lynx-dev
95623168        mailbox


  The files in /mailbox/ can be used to prime the new database.  We can,
in date order, import all of those messages.  Message #1 would be the
first message in the file
  http://www.flora.org/lynx-dev/mailbox/LynxDev_10.16-11.14.gz


  The files in /html/ can be converted to a script.  This script would
take the URL, look up the message-ID in a database table we create, and
then redirect to the right Message#.  We would need to write a script that
would scan the files in /html/ and create the URL-->MessageID table by
extracting the MessageID's out of each MHonArc file.
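
The lookup-and-redirect step might look roughly like the Python sketch below. It takes the two lookup tables as plain dictionaries; the function name, the `?msg=N` query format for the new viewer, and the fallback to the forum index are all assumptions, since the actual ForumDB URL scheme is not given here.

```python
FORUM_BASE = "http://www.flora.org/lynx-dev/forum/"


def redirect_target(old_path, url_to_msgid, msgid_to_number):
    """Map an old MHonArc path to a URL in the new viewer.

    url_to_msgid:    old path -> Message-ID (built by scanning /html/)
    msgid_to_number: Message-ID -> message number in the new database
    """
    msgid = url_to_msgid.get(old_path)
    number = msgid_to_number.get(msgid) if msgid is not None else None
    if number is None:
        # Unknown page: fall back to the forum index (an assumption).
        return FORUM_BASE
    # Hypothetical query format; the real viewer may address messages
    # differently.
    return f"{FORUM_BASE}?msg={number}"


# Usage with made-up table contents.
urls = {"html/month081997/msg00595.html": "<id-595@example.org>"}
numbers = {"<id-595@example.org>": 4711}
print(redirect_target("html/month081997/msg00595.html", urls, numbers))
```

Going through the Message-ID rather than mapping old URLs straight to message numbers keeps the redirect table valid even if the database is re-imported and the numbering changes.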

  MessageID's are actually in a comment in each MHonArc file.

<!--X-Message-Id: address@hidden -->

One way to extract these is:

find html/ -name "*.html" -print | xargs -n 10 grep '<!--X-Message-Id: ' /dev/null

and then pipe the results into something that can generate a DBM file to
look up the URL to get the MessageID.


EG:

html/month081997/msg00595.html:<!--X-Message-Id: address@hidden -->
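
As a rough sketch of the "something that can generate a lookup table" step, the Python below parses lines in the format shown above into a path-to-Message-ID mapping. The function name and the use of an in-memory dictionary are assumptions; the same pairs could be written to a DBM file with Python's standard `dbm` module.

```python
import re

# Matches "path:<!--X-Message-Id: ID -->" as printed by grep when it is
# given more than one file name.
LINE_RE = re.compile(
    r"^(?P<path>[^:]+):<!--X-Message-Id:\s*(?P<msgid>.*?)\s*-->"
)


def build_url_table(grep_lines):
    """Return a dict mapping each archive file path to its Message-ID."""
    table = {}
    for line in grep_lines:
        match = LINE_RE.match(line)
        if match:
            table[match.group("path")] = match.group("msgid")
    return table


# Usage with the example line from the text (the ID itself is redacted
# in the archive copy, so "address@hidden" is what appears here).
lines = ["html/month081997/msg00595.html:<!--X-Message-Id: address@hidden -->"]
print(build_url_table(lines))
```

Lines that do not match the pattern are skipped silently, so stray grep output does not corrupt the table.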






  The look-and-feel for the new archive would be determined by the PHP
code.  I suspect there are some people here that have some experience with
PHP who could update the software to give an appropriate look-and-feel to
this data.  I use this software myself all over FLORA, and one example to
look at is: http://www.flora.org/dmca/



  This leaves one remaining question:  What to do about the files at
http://www.flora.org/lynx-dev/lynx-dev/ ?  My recommendation is to just
note they are already indexed at Archive.org and simply delete them from
our website.

  http://web.archive.org/web/*/http://www.flora.org/lynx-dev/lynx-dev/*



The final outcome of all of this would be:



http://www.flora.org/lynx-dev/forum/  - New ForumDB view of messages.
             Simply a script with the data in an SQL database.


http://www.flora.org/lynx-dev/mailbox/ - The ongoing growing Mailbox
             files.  These can be kept in the short-term for backup, but
             aren't needed long-term as the SQL has this data.  We also
             want to remove these from view so that the Worms/etc can't
             collect data from these files.


http://www.flora.org/lynx-dev/html/ - Script which redirects to the right
             Message# in /forum/ based on the URL it is given.


http://www.flora.org/lynx-dev/lynx-dev/  - file not found, which
             simply shows people the information/links at
             http://www.flora.org/lynx-dev/


---
Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
See http://weblog.flora.org/ for announcements, activities, and opinions
"If we don't believe in freedom of expression for people we despise,
  we don't believe in it at all." -- Noam Chomsky



; To UNSUBSCRIBE: Send "unsubscribe lynx-dev" to address@hidden
