[Classpathx-xml] Announce: MKSearch beta 1, done with GCJ

From: Phil Shaw
Subject: [Classpathx-xml] Announce: MKSearch beta 1, done with GCJ
Date: Thu, 3 Nov 2005 12:30:44 -0000

I have mostly been lurking on these lists over the past year and I 
have learned a lot from the posts, so thanks very much to the free 
Java contributors for your part in the MKSearch project.

A formal announcement follows, but I thought members of these 
lists (Mark Wielaard especially) may also be interested in some 
screen shots of our beta search engine running on Fedora Core 4 
with Tomcat 5 in Firefox.

Best regards,

Phil Shaw

MKSearch beta 1 release announcement

MKDoc Ltd. would like to announce the first beta release of 
MKSearch, under the GNU General Public Licence. Source and 
pre-compiled binary downloads are available from the project Web 

MKSearch is a metadata search engine that indexes structured 
metadata in Web documents, not free text in the document body. 
The data acquisition system:

* Conforms to the Dublin Core metadata in HTML 
recommendations [1]

* Supports other application profiles, such as the UK e-Government 
Metadata Standard [2]

* Indexes native RDF formats, including RSS 1.0

The MKSearch system has five major components:

1. A Web crawler based on JSpider [3]

    * Multi-threaded processing
    * Per-site throttle, user agent, depth and linking rules
    * Respects the robots.txt exclusion policy
    * Extensible plug-in based content handling

2. An HTML document validator and formatter based on JTidy [4]

    * Cleans up and corrects HTML syntax errors
    * Converts HTML to XHTML

3. A set of custom indexers based on the Simple API for XML (SAX)

    * Extracts metadata from HTML meta and link elements
    * Converts metadata to RDF triple statements
    * Configurable application profiles

4. An RDF storage and query system based on Sesame [5]

    * RDF/XML file-based storage
    * Database storage using PostgreSQL or MySQL
    * Sophisticated Sesame RDF Query Language (SeRQL) queries
    * Scope for more semantically rich queries with inferencing

5. A public query interface, provided through a standard servlet 

    * Simple, expandable query builder form
    * Configurable application profile-based presentation
    * Wildcard query handling
    * Phrase searches
    * Paged HTML results
    * Standing RSS results
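
As a rough illustration of the indexing step in component 3, the sketch below uses a SAX handler to read Dublin Core style meta and link elements from an XHTML page and emit one N-Triples statement per property. It is not the MKSearch indexer code: the class name MetaTripleSketch, the extract method and the example.org URIs are hypothetical, and it assumes each schema.PREFIX link is declared before the properties that use it. Literal escaping covers only backslashes and double quotes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: not part of the MKSearch code base.
public class MetaTripleSketch extends DefaultHandler {
    private final String pageUri;
    // Maps a declared prefix (e.g. "DC") to its schema URI.
    private final Map<String, String> schemas = new HashMap<>();
    private final List<String> triples = new ArrayList<>();

    public MetaTripleSketch(String pageUri) {
        this.pageUri = pageUri;
    }

    @Override
    public void startElement(String ns, String local, String qName, Attributes atts) {
        if (qName.equals("link")) {
            String rel = atts.getValue("rel");
            String href = atts.getValue("href");
            if (rel == null || href == null) return;
            if (rel.startsWith("schema.")) {
                // <link rel="schema.DC" href="..."/> declares a prefix.
                schemas.put(rel.substring("schema.".length()), href);
            } else {
                // <link rel="DC.relation" href="..."/> yields a resource object.
                String predicate = expand(rel);
                if (predicate != null) triples.add(statement(predicate, "<" + href + ">"));
            }
        } else if (qName.equals("meta")) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name == null || content == null) return;
            // <meta name="DC.title" content="..."/> yields a literal object.
            String predicate = expand(name);
            if (predicate != null) triples.add(statement(predicate, literal(content)));
        }
    }

    // Turns "DC.title" into a full property URI, or null if the prefix is unknown.
    private String expand(String qualified) {
        int dot = qualified.indexOf('.');
        if (dot < 0) return null;
        String base = schemas.get(qualified.substring(0, dot));
        return base == null ? null : base + qualified.substring(dot + 1);
    }

    private String statement(String predicate, String object) {
        return "<" + pageUri + "> <" + predicate + "> " + object + " .";
    }

    // Minimal escaping only: backslash and double quote.
    private static String literal(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static List<String> extract(String pageUri, String xhtml) throws Exception {
        MetaTripleSketch handler = new MetaTripleSketch(pageUri);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.triples;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
            + "<link rel=\"schema.DC\" href=\"http://purl.org/dc/elements/1.1/\"/>"
            + "<meta name=\"DC.title\" content=\"Example page\"/>"
            + "<meta name=\"keywords\" content=\"ignored\"/>"
            + "</head><body/></html>";
        for (String triple : extract("http://example.org/", page)) {
            System.out.println(triple);
        }
    }
}
```

Elements without a declared prefix, such as the keywords meta element above, are skipped. A real indexer would also need to handle non-ASCII characters and line breaks in literals, and the configurable application profiles mentioned in the list.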

The two main elements of the MKSearch system can be used 
independently. The data acquisition system can be used to gather 
large quantities of metadata from the Web and store it as RDF. The 
query system can be used to provide a typical search engine-style 
interface to existing RDF content.
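
To give a feel for what a query against such RDF content might look like, the sketch below builds a SeRQL query string that matches Dublin Core titles against a user-supplied wildcard term. This is an assumption about one possible query shape, not the query the MKSearch interface actually generates: the class name WildcardQuerySketch and the titleQuery method are hypothetical, and the sketch relies on the SeRQL LIKE operator accepting * as a wildcard and on backslash escaping in string literals.

```java
// Hypothetical sketch: not the MKSearch query builder.
public class WildcardQuerySketch {
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Builds a SeRQL SELECT query matching dc:title against a wildcard pattern.
    public static String titleQuery(String userTerm) {
        // Escape backslashes and quotes so the term cannot break out of the literal.
        String pattern = userTerm.trim()
            .replace("\\", "\\\\")
            .replace("\"", "\\\"");
        return "SELECT Page, Title\n"
            + "FROM {Page} dc:title {Title}\n"
            + "WHERE Title LIKE \"" + pattern + "\" IGNORE CASE\n"
            + "USING NAMESPACE dc = <" + DC + ">";
    }

    public static void main(String[] args) {
        System.out.println(titleQuery("meta*"));
    }
}
```

The resulting string would then be passed to whichever Sesame repository API the deployment uses; that call is left out here because its details depend on the Sesame version.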

The MKSearch beta 1 distribution includes sample configurations 
that crawl a Web site and create:

* A mirror of the site on the local file system in valid XHTML
* An RDF N-Triple record for each page on the local file system
* UK e-Government metadata in a Sesame file-based repository 

This distribution also includes a demonstration of the MKSearch 
query interface, in the form of a Web Application Archive (WAR) 
that can be deployed directly to an existing servlet container. The 
sample search content is from an index of the MKSearch project 
Web site on 2 November 2005. See the site documentation below:

System requirements and licence

MKSearch is written in the Java programming language and is 
designed to run on any platform that supports a Java environment 
equivalent to the Sun Java 2 language specification.

The system has been specifically designed, developed and tested 
to run on GNU/Linux systems using the GNU Compiler for Java 
(GCJ) [6] and the Apache Tomcat 5 servlet container, as available 
on Fedora Core 4 [7]. This means that MKSearch can be built and 
run on software systems that are entirely open source and free 
from proprietary licensing.

The system has also been tested extensively using the Sun Java 
SDK 1.5 on Microsoft Windows 2000. JUnit test suites for the 
MKSearch code base cover 99% of all code branches.

If you have any comments or questions about the MKSearch 
system, please join us on the project mailing list.

MKSearch (beta)

Free, open source metadata search engine with RDF storage and query.
