
Project 3: Build a Web Scraper/Bot | Racket Projects

Md Nasim Sheikh

Python is the usual choice for scraping, but Racket handles the job well too: its HTML parsing libraries turn a page into a tree of lists that you can query directly.


Tools

We need:

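  • net/url to fetch pages.
  • html-parsing to parse HTML into S-expressions (HTML is a tree!).
  • sxml (its sxml/sxpath module) to query the parsed tree.

html-parsing and sxml are not part of the base distribution; install them with raco pkg install html-parsing sxml.

#lang racket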
(require net/url)
(require html-parsing)
(require sxml/sxpath)

Fetching the Page

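;; Fetch `url` and parse the response body into a tree of lists.
;; call/input-url opens the port with get-pure-port, hands it to
;; html->xexp, and closes it once parsing has finished.
(define (fetch-page url)
  (call/input-url (string->url url)
                  get-pure-port
                  html->xexp))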

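This returns the page as an SXML-style S-expression, which html-parsing calls an xexp: nested lists of symbols and strings, with the whole document wrapped in a *TOP* node and attributes kept in (@ ...) sublists. Note that this is not the same shape as the xexpr values used by Racket's xml library.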
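To see what the parsed tree looks like without touching the network, the sketch below feeds a small HTML string to html->xexp. The fragment is made up for illustration, and the example assumes the html-parsing package is installed.

```racket
#lang racket
(require html-parsing)

;; Illustrative input only: a made-up fragment, not a real page.
(define sample-html "<h2 class=\"title\">Hello</h2>")

;; html->xexp accepts a string or an input port and returns nested lists.
(define tree (html->xexp sample-html))

;; The result is wrapped in a *TOP* node; the class attribute ends up
;; inside an (@ ...) sublist next to the element's text.
(writeln tree)
```

Printing the tree for a small fragment like this is a quick way to work out which path to query before pointing the scraper at a real page.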

Extracting Data (XPath)

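We use sxpath to query the tree. It compiles an XPath expression into a function that takes the tree and returns a list of matching nodes, much as a CSS selector does for the DOM. The query here collects the text inside every <h2>. If the target page marks up its headlines with a different element (inspect its HTML to check), change the path accordingly; otherwise the result is an empty list.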

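(define (get-headlines url)
  (define content (fetch-page url))
  ;; Compile the XPath query once; the result is a function over the tree.
  (define extract-h2 (sxpath "//h2/text()"))
  ;; Returns a list of strings, one per text node found inside an <h2>.
  (extract-h2 content))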
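The same query can be tried offline on a hand-written tree. The sketch below uses SXPath's native list notation, which expresses the same query as the string "//h2/text()"; the page data is invented for illustration, and the example assumes the sxml package is installed.

```racket
#lang racket
(require sxml/sxpath)

;; Hand-written SXML standing in for a parsed page (illustrative data).
(define page
  '(*TOP*
    (html
     (body
      (h1 "Site name")
      (h2 "First headline")
      (p "Some text")
      (div (h2 "Second headline"))))))

;; sxpath turns a path into a function from a tree to a list of matches.
;; (// h2 *text*) means: any h2 at any depth, then its text children.
(define h2-texts (sxpath '(// h2 *text*)))

(define headlines (h2-texts page))
(for-each displayln headlines)
```

Because the query is an ordinary function, it can be defined once and applied to as many pages as needed.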

Running the Bot

(define news (get-headlines "https://news.ycombinator.com"))

(with-output-to-file "headlines.txt"
  (lambda ()
    (for ([item news])
      (displayln item)))
  #:exists 'replace)
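Scraped text nodes often carry stray whitespace or are empty, so it can help to tidy them before writing the file. The following is a minimal sketch of that idea; clean-headlines and save-headlines are hypothetical helpers, not part of the article's code, and the sample list stands in for the result of get-headlines.

```racket
#lang racket

;; Hypothetical helper (not from the article): trim each scraped string
;; and drop the ones that are empty after trimming.
(define (clean-headlines items)
  (filter non-empty-string? (map string-trim items)))

;; Hypothetical helper: write one headline per line, replacing any
;; existing file at `path`.
(define (save-headlines path items)
  (with-output-to-file path
    (lambda ()
      (for ([item (clean-headlines items)])
        (displayln item)))
    #:exists 'replace))

;; Illustrative input standing in for the result of get-headlines.
(define sample '("  First headline \n" "" "Second headline"))

;; Write to a temporary file so the example leaves headlines.txt alone.
(define out (make-temporary-file))
(save-headlines out sample)
(for-each displayln (file->lines out))
```

In the bot itself, the call would be (save-headlines "headlines.txt" news).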


Summary

Because HTML is nested (like Lisp code), Racket is naturally good at parsing it. You can build powerful extractors without regular expressions.

Quick Quiz

What format does Racket convert HTML into for easy parsing?

Written by Md Nasim Sheikh, Software Developer at softexForge