Subject: Re: [Chicken-users] Parsing HTML, best practice with Chicken
Date: Mon, 29 Dec 2014 10:47:50 -0800
I am currently playing around with Chicken and the web. More precisely, I
want to build a web link collection and see how well scraping web sites for
links and content goes for me.
Which eggs would you recommend for that? What should I avoid doing?
So far, I have been fetching the site with http-client, converting the raw HTML
to SXML with html-parser, and trying to process the resulting list with
matchable/srfi-13. I am not sure how much good it will do to use regexes on
those lists. Are there any packages like Python's BeautifulSoup for Chicken?
So far, I have had some trouble parsing the resulting SXML, both with
matchable and string-contains.
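For reference, a minimal sketch of walking the SXML tree directly to pull out link targets, instead of string-matching on it. It assumes html-parser yields anchors shaped like (a (@ (href "...")) ...). The names collect-hrefs and sample are hypothetical, and the sample tree is made up for illustration; it only stands in for the output of html->sxml.

```scheme
;; Hypothetical helper (not part of any egg): collect the values of all
;; href attributes found anywhere in an SXML tree, in document order.
;; Assumes attributes look like (href "value") inside an (@ ...) list.
(define (collect-hrefs node)
  (cond ((not (pair? node)) '())            ; strings, symbols, '(): nothing here
        ((and (eq? (car node) 'href)        ; an (href "value") attribute
              (pair? (cdr node))
              (string? (cadr node)))
         (list (cadr node)))
        (else (append (collect-hrefs (car node))      ; descend into first item
                      (collect-hrefs (cdr node))))))  ; then the remaining items

;; Made-up sample, standing in for the result of html->sxml.
(define sample
  '(*TOP* (html (body (p "intro "
                         (a (@ (href "http://example.org/toc")) "TOC"))
                      (a (@ (class "x") (href "/issue/1")) "Issue 1")))))

(display (collect-hrefs sample))
(newline)
```

Because the walk keeps the tree structure until the last moment, only attribute values are returned, so text nodes that merely mention a URL are not picked up. The resulting list of strings could then be narrowed down with an ordinary substring test.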
ps: ze code so far:
;; version 0.0.3
; high level HTTP client, HTML/SXML parsing library and regular expression
(use http-client html-parser matchable srfi-1 srfi-13) ; srfi-1 provides fold and filter
; grab a website (lnk is the URL string; its definition is not shown here)
(define raw (with-input-from-request lnk #f read-string))
;; convert site crawl data from html to sxml
(define sxml (html->sxml raw))
;; saving function
;; * display form is more suitable, for it evaluates all those \n and other
;; * special characters
;; * might be good to remove these things from regex
;; * processing, too.
(define (savedata somedata filename)
  (call-with-output-file filename
    (lambda (p)
      (let f ((ls somedata))
        (unless (null? ls)
          (display (car ls) p) ; changed: display->write
          (f (cdr ls)))))))
; check how much the output is parsable..
(savedata sxml "output.txt")
(define (flatten x)
  (cond ((null? x) '())
        ((not (pair? x)) (list x))
        (else (append (flatten (car x))
                      (flatten (cdr x))))))
(define sxmlflat (flatten sxml))
;; Multi-check procedure is needed to check whether STRING element has:
;; journal-id: "10.1002"
;; link string: "issuetoc"
;; takes a list of strings and checks whether the element has all of them.
;; AND operator.
;; --- member? returns #t if element x is in list lst.
;; --- ref:
;; --- http://stackoverflow.com/questions/14668616/scheme-fold-map-and-filter-functions
;; --- use: (member? "a" (list "a" 1)) --> #t
(define (member? x lst)
  (fold (lambda (e r)
          (or r (equal? e x)))
        #f
        lst))
;; --- string-contains/m returns #t if all strings of list lsstr are in
;; --- string str.
;; --- case insensitive string matching.
;; --- does not check if lsstr is empty. This would return #t.
;; --- use: (string-contains/m "Somestring" '("10.1002" "issuetoc"))
(define (string-contains/m str lsstr)
  (if (string? str)
      (if (not (member? #f (map (lambda (x) (string-contains-ci str x))
                                lsstr)))
          #t
          #f)
      #f))
(filter (lambda (x) (string-contains/m x '("10.1002" "http://" "toc")))
        sxmlflat)
;; Something is wrong with those bloody strings!