Project 3: Build a Web Scraper/Bot | Racket Projects
Md Nasim Sheikh
Python is the usual choice for scraping, but Racket is a strong alternative: its HTML parsing libraries turn a page into nested lists that ordinary list functions can process.
Tools
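We need:
- net/url to fetch pages.
- html-parsing to parse HTML into S-expressions (HTML is trees!).
- sxml/sxpath to query the parsed tree.
net/url ships with Racket. html-parsing and sxml are separate packages, installable with raco pkg install html-parsing sxml.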
#lang racket
(require net/url)
(require html-parsing)
(require sxml/sxpath)
Fetching the Page
(define (fetch-page url)
  (call/input-url (string->url url)
                  get-pure-port
                  html->xexp))
This returns the page as an SXML-style xexp: a nested list of symbols and strings whose root is tagged *TOP*, with each element's attributes grouped under an @ node.
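To see the shape of that tree without touching the network, the parser can be run on a literal string. This is a minimal sketch: sample-html and tree are hypothetical names, and the markup is made up for illustration.

```racket
#lang racket
(require html-parsing)

;; Hypothetical inline document, used here instead of a live page.
(define sample-html
  "<html><body><h2>First</h2><p>Intro <a href=\"/x\">link</a></p></body></html>")

;; html->xexp accepts a string as well as an input port.
(define tree (html->xexp sample-html))

;; The root is tagged *TOP*; each element is a list headed by its tag symbol.
(displayln (first tree))

;; Print the whole tree to inspect the nesting and the @ attribute nodes.
(pretty-print tree)
```

Printing the tree for a real page is a quick way to find out which elements hold the data you want before writing a query.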
Extracting Data (XPath)
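We use sxpath to query the tree. It turns an XPath expression into a function that takes a tree and returns the list of matching nodes, much like CSS selectors but for lists.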
(define (get-headlines url)
  (define content (fetch-page url))
  (define extract-h2 (sxpath "//h2/text()"))
  (extract-h2 content))
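The query step can be tried on a small in-memory document. The sketch below assumes a made-up page and hypothetical names (tree, h2-texts, h2-texts/list, link-targets). It shows the string form used above, the equivalent list form that sxpath also accepts, and one way to reach an attribute value.

```racket
#lang racket
(require html-parsing sxml/sxpath)

;; Hypothetical sample page, parsed from a string so no network is needed.
(define tree
  (html->xexp
   (string-append
    "<html><body><h2>First</h2><h2>Second</h2>"
    "<a href=\"/about\">About</a></body></html>")))

;; String form: the same XPath text used in get-headlines.
(define h2-texts ((sxpath "//h2/text()") tree))

;; List form: the same query written as an S-expression path.
(define h2-texts/list ((sxpath '(// h2 *text*)) tree))

;; Attributes sit under an @ node, so a link target is reached via @ href.
(define link-targets ((sxpath '(// a @ href *text*)) tree))

(for-each displayln h2-texts)
(for-each displayln link-targets)
```

The list form can be easier to build programmatically, since the path is ordinary Racket data rather than a string.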
Running the Bot
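;; If the target page does not mark up its headlines as <h2>, the result is
;; an empty list; adjust the XPath in get-headlines to the page's markup.
(define news (get-headlines "https://news.ycombinator.com"))
(with-output-to-file "headlines.txt"
  (lambda ()
    (for ([item news])
      (displayln item)))
  #:exists 'replace)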
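For a plain list of strings, the standard library's display-lines-to-file is a shorter alternative to the explicit loop. This is a sketch under assumptions: sample-headlines stands in for the scraper's result, and a temporary file is used so that no real headlines.txt is overwritten.

```racket
#lang racket

;; Hypothetical headline list standing in for the scraper's result.
(define sample-headlines '("First headline" "Second headline"))

;; make-temporary-file creates the file, so 'replace is needed below.
(define out-path (make-temporary-file "headlines-~a.txt"))

;; Writes one item per line, overwriting the existing file.
(display-lines-to-file sample-headlines out-path #:exists 'replace)

;; Reading the file back returns the lines that were written.
(define saved (file->lines out-path))
(for-each displayln saved)
```

Reading the file back is a cheap sanity check while developing a bot, before pointing it at a real output path.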
Summary
Because HTML is nested (like Lisp code), Racket is naturally good at parsing it. You can build powerful extractors without regular expressions.
Quick Quiz
What format does Racket convert HTML into for easy parsing?
Written by
Md Nasim Sheikh
Software Developer at softexForge