Re: [Chicken-users] Parsing HTML, best practice with Chicken

From: Ivan Raikov
Subject: Re: [Chicken-users] Parsing HTML, best practice with Chicken
Date: Mon, 29 Dec 2014 10:47:50 -0800

Hello Piotr,

   The neuromorpho egg is a scraper-like utility to fetch information from a public database with neuronal reconstructions.
You can look at the code for examples of page scraping with sxpath. In particular, take a look at the procedures
table->alist, extract-metadata, extract-pages-from-search-results. Obviously these are specific to the particular page
structure served by NeuroMorpho, but this might help.
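
As a rough illustration of the sxpath approach (this is not code from the neuromorpho egg), the sketch below pulls every href out of an SXML tree and keeps the ones that contain all of a set of substrings. The sample page, the URLs in it, and the names extract-hrefs and links-with are made up for the example; it assumes the sxpath egg is installed and uses CHICKEN 4 syntax like the quoted code.

```scheme
;; CHICKEN 4 syntax; under CHICKEN 5 the first line would be
;; (import sxpath srfi-1 srfi-13) instead.
(use sxpath srfi-1 srfi-13)

;; Hypothetical SXML, shaped like the output of html->sxml.
(define page
  '(*TOP*
    (html
     (body
      (p "Current issue: "
         (a (@ (href "http://example.org/doi/10.1002/issuetoc")) "contents"))
      (a (@ (href "http://example.org/about")) "about")))))

;; Select the text of the href attribute of every <a> element.
(define extract-hrefs (sxpath '(// a @ href *text*)))

;; Keep only the links that contain all of the given substrings
;; (case-insensitive).
(define (links-with page substrings)
  (filter (lambda (href)
            (every (lambda (s) (string-contains-ci href s)) substrings))
          (extract-hrefs page)))

(display (links-with page '("10.1002" "issuetoc")))
(newline)
```

Querying the tree this way avoids flattening the SXML and testing every atom, since the path expression only ever returns the attribute strings.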


On Sun, Dec 28, 2014 at 6:28 PM, mfv <address@hidden> wrote:

I am currently playing around with Chicken and the web. More precisely, I
want to build a web link collection and see how well it goes for me when
scraping web sites for links and content.

Which eggs would you recommend for that? What should I avoid doing?

So far, I have been fetching the site with http-client, converting the raw HTML
to SXML with html-parser, and trying to process the resulting list with
matchable/srfi-13. I am not sure how much good it will do to use regex on those
lists. Are there any packages like Python's BeautifulSoup for Chicken?

So far, I have had some trouble parsing the resulting SXML, both with
matchable and string-contains.



ps: ze code so far:

;; version 0.0.3

; high level HTTP client, HTML/SXML parsing library and regular expression
; library
(use http-client html-parser matchable srfi-1 srfi-13)  ; srfi-1: fold, filter

; grab a website
(define lnk "")   ; the URL was stripped from the archived message
(define raw (with-input-from-request lnk #f read-string))

;; convert site crawl data from html to sxml
(define sxml (html->sxml raw))

;; saving function
;; * display form is more suitable, for it evaluates all those \n and other
;; * special characters
;; * might be good to remove these things from regex
;; * processing, too.
(define (savedata somedata filename)
  (call-with-output-file filename
    (lambda (p)
      (let f ((ls somedata))
        (unless (null? ls)
          (display (car ls) p)   ; changed: write->display
          (newline p)
          (f (cdr ls)))))))

; check how much the output is parsable..
(savedata sxml "output.txt")

;; non-TCO
(define (flatten x)
    (cond ((null? x) '())
          ((not (pair? x)) (list x))
          (else (append (flatten (car x))
                        (flatten (cdr x))))))

(define sxmlflat (flatten sxml))

;; ***************
;; Multi-check procedure is needed to check whether STRING element has:
;;  journal-id: "10.1002"
;;  link string: "issuetoc"
;; function:
;;   takes list of strings and checks whether the element has them.
;;   AND operator.
;; ***************

;; --- member? returns #t if element x is in list lst.
;; --- ref:
;; ---
;; --- use: (member? "a" (list "a" 1)) --> #t
(define (member? x lst)
  (fold (lambda (e r)
          (or r (equal? e x)))
        #f lst))

;; --- string-contains/m returns #t if all strings of list lsstr are in
;; --- string str.
;; --- case insensitive string matching.
;; --- does not check if lsstr is empty. This would return #t.
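;; --- use: (string-contains/m "Somestring" '("10.1002" "issuetoc"))
(define (string-contains/m str lsstr)
  ;; `and` returns #f for non-strings and for failed matches; a one-armed
  ;; `if` would return an unspecified value there, which counts as true.
  (and (string? str)
       (not (member? #f (map (lambda (x) (string-contains-ci str x))
                             lsstr)))))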

(filter (lambda (x) (string-contains/m x '("10.1002" "http://" "toc")))
        sxmlflat)

;; Something is wrong with those bloody strings!
