Project 3: Build a Web Scraper/Bot | Racket Projects

Python is the best-known language for web scraping, but Racket works well too: its libraries parse HTML straight into nested lists that you can query like any other data.

Tools

We need:

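net/url to fetch pages (included with Racket).
html-parsing to parse HTML into S-expressions (HTML is a tree!).
sxml/sxpath to query the parsed tree.

html-parsing and sxml are separate packages; install them with raco pkg install html-parsing sxml.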

#lang racket
(require net/url)
(require html-parsing)
(require sxml/sxpath)

Fetching the Page

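;; Fetch the page at url and parse the response body into a tree.
;; call/input-url opens the connection, passes the input port to
;; html->xexp, and closes the port once parsing is done.
(define (fetch-page url)
  (call/input-url (string->url url)
                  get-pure-port
                  html->xexp))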

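This returns the page as an SXML-style X-expression (an xexp): a nested list in which each element is a list that starts with its tag symbol, followed by its attributes and children, with text stored as strings. The whole document is wrapped in a *TOP* node.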
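To make the shape of that tree concrete, here is a minimal sketch that parses a small inline HTML string instead of a live page. The names sample-html and sample-tree are made up for illustration; the only library call is html->xexp, which accepts a string as well as an input port.

```racket
#lang racket
(require html-parsing)

;; A tiny inline document, so the example runs without network access.
(define sample-html
  "<html><body><h2>Hello</h2><p>World</p></body></html>")

;; html->xexp accepts either a string or an input port.
(define sample-tree (html->xexp sample-html))

;; Each element becomes a list headed by its tag symbol,
;; and the whole document is wrapped in *TOP*.
sample-tree
;; => '(*TOP* (html (body (h2 "Hello") (p "World"))))
```

Because the result is an ordinary list, you can inspect it with first, rest, and match like any other Racket data.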

Extracting Data (XPath)

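We use sxpath to query the tree. It takes an XPath-style path and returns a procedure that, applied to a tree, yields the list of matching nodes. It fills the role that CSS selectors play in other scraping tools, but it works on lists.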

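;; Return the text of every <h2> element on the page at url.
(define (get-headlines url)
  (define content (fetch-page url))
  ;; sxpath turns the path into a procedure that takes a tree.
  (define extract-h2 (sxpath "//h2/text()"))
  (extract-h2 content))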
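The query can be tried offline before pointing it at a real site. The sketch below is an illustration under assumptions, not part of the original project: the sample markup and the names sample-tree, h2-text/string, h2-text/list, and hrefs are invented. It shows the string path used above, the equivalent list form, and how attributes are reached through the @ node.

```racket
#lang racket
(require html-parsing sxml/sxpath)

;; Invented sample markup with two headings and one link.
(define sample-tree
  (html->xexp
   (string-append
    "<html><body>"
    "<h2>First</h2>"
    "<p><a href=\"/a\">link</a></p>"
    "<h2>Second</h2>"
    "</body></html>")))

;; String form, the same path that get-headlines uses.
(define h2-text/string ((sxpath "//h2/text()") sample-tree))

;; Equivalent list form of the same path.
(define h2-text/list ((sxpath '(// h2 *text*)) sample-tree))

;; Attributes are stored under an @ child, e.g. (a (@ (href "/a")) "link").
(define hrefs ((sxpath '(// a @ href *text*)) sample-tree))

h2-text/string ; => '("First" "Second")
hrefs          ; => '("/a")
```

The list form is convenient when a path has to be assembled programmatically, since it is just quoted data.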

Running the Bot

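;; get-headlines only matches <h2> elements. If the target page marks up
;; its headlines differently, news will be empty and the path needs adjusting.
(define news (get-headlines "https://news.ycombinator.com"))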

(with-output-to-file "headlines.txt"
  (lambda ()
    (for ([item news])
      (displayln item)))
  #:exists 'replace)
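Text nodes taken from real pages often carry stray whitespace or are empty, so it can help to tidy the list before writing it out. The following is a rough sketch of one way to do that; clean-headlines and save-lines are hypothetical helpers, and the raw list is made-up data standing in for scraped results.

```racket
#lang racket

;; Hypothetical helper: keep only strings, trim surrounding whitespace,
;; and drop entries that end up empty.
(define (clean-headlines items)
  (filter non-empty-string?
          (map string-trim (filter string? items))))

;; Hypothetical helper that mirrors the file-writing step above.
(define (save-lines path lines)
  (with-output-to-file path
    (lambda () (for-each displayln lines))
    #:exists 'replace))

;; Made-up data standing in for scraped text nodes.
(define raw '("  First story \n" "" "\n" "Second story"))
(define cleaned (clean-headlines raw))

;; Write to a temporary file so the sketch does not touch headlines.txt.
(define out-path (make-temporary-file "headlines-~a.txt"))
(save-lines out-path cleaned)

(file->lines out-path) ; => '("First story" "Second story")
```

In the bot itself, the same cleanup could be applied to news before the with-output-to-file call.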

Summary

Because HTML is nested (like Lisp code), it maps directly onto Racket lists, so parsing and querying it takes very little code. You can build powerful extractors without resorting to regular expressions.

Quick Quiz

What format does Racket convert HTML into for easy parsing?

Md Nasim Sheikh
