RFC: change coming to lynx-dev archives
From: Al Gilman
Subject: RFC: change coming to lynx-dev archives
Date: Mon, 11 Feb 2002 11:01:13 -0500
Our archives have been on the web now for several years thanks to the kind
hospitality of the Flora community.
The size of the archive, and the risk that a system change might break MHonArc
with nobody available to fix it, argue for a change in how we host the
archives.
On the other hand, almost half the web traffic on the Flora site is accesses of
the lynx-dev archives, and I don't know how much of that is to old stuff.
Further down, I quote a longer discussion of the issues and plans from Russell
McOrmond. For the moment, I want to focus on a few questions that are open or
could be opened.
** Plan A: stay at Flora and rehost to ForumDB.
A currently running instance of the proposed ForumDB flavor of archive, running
with our messages, is visible at
http://www.flora.org/lynx-dev/forum/
The look and feel can be adjusted on a per-list basis if we feed PHP patches to
Russell for integration. Contributions in this form are very welcome. Some
form of support for threaded indexing or navigation is a known wish-list item.
Code contributions sought.
* Question: The current archive format hangs all posters' email addresses out
in an easily harvested format. The thinking so far has been that we should
change our way of doing business to make it harder to harvest list-members'
addresses from the web access to the archives.
Which of the following do you think we should adopt as list
policy/infrastructure:
- obscure email addresses mildly in the HTML so that the HTML source text is
not usable without some transformation as an email address. Things like
representing <address@hidden> as <John.Doe -(at)-
your.home.domain>. This will allow people transcribing edresses manually to
get them right and address any individual who has posted. It may prevent
collection in today's price-performance climate for email address harvesters,
but it does not make it really hard.
- remove individual email addresses from the public web pages entirely. If
someone wishes to contact an individual privately, they can
subscribe and post requesting that the party get in touch with them. And
subscribers can address subscribers.
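To make the two options concrete, here is a minimal sketch in Python of what each transformation might look like. It is an illustration only, not the ForumDB code (which is PHP); the function names, the placeholder text, and the deliberately loose address pattern are all assumptions.

```python
import re

# Deliberately loose pattern, for illustration only; it is not a full
# RFC 822 address parser.
ADDRESS_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def obscure(text: str) -> str:
    """Option 1: rewrite user@host as 'user -(at)- host'."""
    return ADDRESS_RE.sub(r"\1 -(at)- \2", text)


def remove(text: str, placeholder: str = "[address removed]") -> str:
    """Option 2: drop the address entirely (placeholder text is hypothetical)."""
    return ADDRESS_RE.sub(placeholder, text)
```

Because the viewer would apply such a function when a page is rendered, the obscuring format could be changed later without touching the stored messages.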
Note that we are contemplating removing the public automatic availability of
the mailbox format files. That is just too great a harvesting invitation.
These would be archived somehow and available by explicit human request for
human release. How that works is TBD.
If there is a readily implemented alternative that you think is much better
than either of these, please explain it and how it is readily achievable. Even
the second one above takes finding the right code and adding it into the
existing ForumDB PHP viewer.
** Plan B:
If someone out there thinks that sustaining the MHonArc form, or some other
form of the archives is critical, can you volunteer to host and maintain the
installation, or do you think you can recruit someone who will? In other
words, alternatives to Plan A as outlined above should come with resources.
Al
-- quoting from Russell:
Here is what was discussed previously [on listelves], just to bring us to the
same page.
I have hosted multiple archives over the years and have ended up with
problems, and I came up with solutions:
- MHonArc is configurable, but once a message is 'stored' a
configuration file change doesn't change how old messages are viewed.
As times change, sometimes the look needs to change as well (e.g.
ongoing changes of "address@hidden" to "russell AT flora DOT org",
or other formats as the worms figure that pattern out).
- Storing in HTML files speeds up Webserver access, but requires more
disk space for the "look and feel", and the original RFC-822
message needs to be stored anyway.
- MHonArc requires an external search facility, when searching on some
fields (Subject, Date, etc.) may be more commonly required.
What I came up with was a system which stored messages in an SQL
database, and then a simple PHP viewer of the SQL data. Information is
at: http://www.flora.org/flora/forumdb/
There is also a SourceForge project at:
http://sourceforge.net/projects/forumdb/
The importer is written in Perl and is very simple. It separates the
email into header and body, stores both parts into a table. It then
extracts specific headers and stores them into a separate table for quick
searching on those fields. The SQL database interface used is MySQL.
The viewer is written in simple PHP, compatible with PHP3 and PHP4.
There is a small number of functions which are expected to be inserted
into a PHP file with all the look-and-feel.
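The import step described above (split header from body, store both, then copy selected headers into a second table) can be sketched as follows. This is not the ForumDB importer, which is Perl against MySQL; it is a Python illustration using the standard sqlite3 module as a stand-in, and the table names, column names, and list of indexed headers are assumptions.

```python
import sqlite3
from email import message_from_string

# Headers copied into the search table; this selection is an assumption.
INDEXED_HEADERS = ("Message-Id", "From", "Subject", "Date")


def create_tables(db):
    """Create the two hypothetical tables: raw messages and indexed headers."""
    db.execute("CREATE TABLE IF NOT EXISTS messages "
               "(num INTEGER PRIMARY KEY, header TEXT, body TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS headers "
               "(num INTEGER, name TEXT, value TEXT)")


def import_message(db, raw):
    """Store one RFC 822 message and return its message number."""
    # The header block ends at the first blank line.
    header, _, body = raw.partition("\n\n")
    cur = db.execute("INSERT INTO messages (header, body) VALUES (?, ?)",
                     (header, body))
    num = cur.lastrowid
    parsed = message_from_string(raw)
    for name in INDEXED_HEADERS:
        value = parsed.get(name)  # header lookup is case-insensitive
        if value is not None:
            db.execute("INSERT INTO headers VALUES (?, ?, ?)",
                       (num, name, str(value)))
    return num


db = sqlite3.connect(":memory:")
create_tables(db)
raw = ("Message-Id: <example-1@example.invalid>\n"
       "From: someone@example.invalid\n"
       "Subject: test\n"
       "\n"
       "Hello.\n")
num = import_message(db, raw)
```

Importing the mailbox files in date order, as proposed below, would then assign message numbers in the order the messages were originally posted.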
Conversion process
------------------
Lynx-dev has a few directories of files:
/html/ - this is the MHonArc archives starting October 1996
/mailbox/ - this is the mailbox format files.
/lynx-dev/ - Archives prior to 1996.
In terms of size (in Bytes).
214952960 html
1024 index.html
47548416 lynx-dev
95623168 mailbox
The files in /mailbox/ can be used to prime the new database. We can,
in date order, import all of those messages. Message #1 would be the
first message in the file
http://www.flora.org/lynx-dev/mailbox/LynxDev_10.16-11.14.gz
The files in /html/ can be converted to a script. This script would
take the URL, look up the message-ID in a database table we create, and
then redirect to the right Message#. We would need to write a script that
would scan the files in /html/ and create the URL-->Message-ID table by
extracting the Message-ID out of each MHonArc file.
The Message-ID is stored in a comment in each MHonArc file:
<!--X-Message-Id: address@hidden -->
One way to extract these is:
find html/ -name "*.html" -print | xargs -n 10 grep -H '<!--X-Message-Id: '
and then pipe the results into something that can generate a DBM file to
look up the URL to get the MessageID.
EG:
html/month081997/msg00595.html:<!--X-Message-Id: address@hidden -->
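As an alternative to the find/grep pipeline, the same scan could be done in one script. The sketch below is Python and builds the URL-->Message-ID table as an in-memory dictionary, which could then be written to a DBM file or a database table. The function name and the choice to key the table on the path relative to the html/ directory are assumptions.

```python
import re
from pathlib import Path

# Matches the MHonArc comment line and captures the Message-ID value.
MSGID_RE = re.compile(r"<!--X-Message-Id:\s*(.*?)\s*-->")


def build_url_table(root):
    """Map each page path (relative to root) to the Message-ID found in it."""
    table = {}
    for page in sorted(Path(root).rglob("*.html")):
        # latin-1 decodes any byte, so odd characters cannot abort the scan.
        with page.open(encoding="latin-1") as fh:
            for line in fh:
                match = MSGID_RE.search(line)
                if match:
                    key = page.relative_to(root).as_posix()
                    table[key] = match.group(1)
                    break  # one Message-ID comment per page is enough
    return table


# Example call, assuming the archive lives in ./html:
# table = build_url_table("html")
```

Pages without the comment, such as index pages, are simply left out of the table.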
The look-and-feel for the new archive would be determined by the PHP
code. I suspect there are some people here that have some experience with
PHP who could update the software to give an appropriate look-and-feel to
this data. I use this software myself all over FLORA, and one example to
look at is: http://www.flora.org/dmca/
This leaves one remaining question: What to do about the files at
http://www.flora.org/lynx-dev/lynx-dev/ ? My recommendation is to just
note they are already indexed at Archive.org and simply delete them from
our website.
http://web.archive.org/web/*/http://www.flora.org/lynx-dev/lynx-dev/*
The final outcome of all of this would be:
http://www.flora.org/lynx-dev/forum/ - New ForumDB view of messages.
Simply a script with the data in an SQL database.
http://www.flora.org/lynx-dev/mailbox/ - The ongoing growing Mailbox
files. These can be kept in the short-term for backup, but
aren't needed long-term as the SQL has this data. We also
want to remove these from view so that the Worms/etc can't
collect data from these files.
http://www.flora.org/lynx-dev/html/ - Script which redirects to the right
Message# in /forum/ based on the URL it is given.
http://www.flora.org/lynx-dev/lynx-dev/ - file not found, which
simply shows people the information/links at
http://www.flora.org/lynx-dev/
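The redirect behaviour planned for /html/ amounts to two lookups: old URL to Message-ID, then Message-ID to message number. A minimal Python sketch of that logic follows. The real script would presumably be PHP, and the "?msg=" query parameter is purely hypothetical, since the actual ForumDB URL scheme is not given here.

```python
FORUM_BASE = "http://www.flora.org/lynx-dev/forum/"
OLD_PREFIX = "/lynx-dev/html/"


def redirect_target(old_path, url_to_msgid, msgid_to_number):
    """Return the new forum URL for an old MHonArc path, or None if unknown.

    url_to_msgid:    table built by scanning the old /html/ files
    msgid_to_number: table produced when the mailbox files are imported
    """
    if not old_path.startswith(OLD_PREFIX):
        return None
    key = old_path[len(OLD_PREFIX):]
    msgid = url_to_msgid.get(key)
    number = msgid_to_number.get(msgid)
    if number is None:
        return None
    # The "msg" parameter name is an assumption, not the real ForumDB scheme.
    return f"{FORUM_BASE}?msg={number}"
```

A None result would correspond to the "file not found" case, where the visitor is pointed at the links on http://www.flora.org/lynx-dev/ instead.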
---
Russell McOrmond, Internet Consultant: <http://www.flora.ca/>
See http://weblog.flora.org/ for announcements, activities, and opinions
"If we don't believe in freedom of expression for people we despise,
we don't believe in it at all." -- Noam Chomsky
; To UNSUBSCRIBE: Send "unsubscribe lynx-dev" to address@hidden