Beautiful Soup Alternatives for Go

Continuing the topic of extracting data from HTML.

Page content
  • For a direct Beautiful Soup analogue in Go, use soup.
  • For CSS selector support, consider goquery.
  • For XPath queries, use htmlquery.
  • For another Beautiful Soup-inspired option, look at Node.

If you’re looking for a Beautiful Soup equivalent in Go, several libraries offer similar HTML parsing and scraping functionality:


soup

  • soup is a Go library explicitly designed as an analogue to Python’s Beautiful Soup. Its API is intentionally similar, featuring functions like Find, FindAll, and HTMLParse, making it easy for developers familiar with Beautiful Soup to transition to Go.
  • It allows you to fetch web pages, parse HTML, and traverse the DOM to extract data, much like Beautiful Soup.
  • Example usage:
    resp, err := soup.Get("https://xkcd.com")
    if err != nil {
        os.Exit(1)
    }
    doc := soup.HTMLParse(resp)
    links := doc.Find("div", "id", "comicLinks").FindAll("a")
    for _, link := range links {
        fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
    }
    
  • Note: soup does not support CSS selectors or XPath; it relies on tag and attribute-based searching.

goquery

  • goquery is another popular Go library for HTML parsing, offering a jQuery-like syntax for DOM traversal and manipulation.
  • It supports CSS selectors, making it more flexible for complex queries compared to soup.
  • Example usage:
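    // resp is an *http.Response, e.g. the result of http.Get("https://xkcd.com")
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    doc.Find("div#comicLinks a").Each(func(i int, s *goquery.Selection) {
        fmt.Println(s.Text(), "| Link :", s.AttrOr("href", ""))
    })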
    

htmlquery

htmlquery is a Go library designed for parsing and extracting data from HTML documents using XPath expressions. It provides a straightforward API for traversing and querying the HTML tree structure, making it especially useful for web scraping and data extraction tasks.

Key Features

  • Allows querying HTML documents with XPath 1.0/2.0 expressions.
  • Supports loading HTML from strings, files, or URLs.
  • Offers functions to find single or multiple nodes, extract attributes, and evaluate XPath expressions.
  • Includes query caching (LRU-based) to improve performance by avoiding repeated compilation of XPath expressions.
  • Built on top of Go’s standard HTML parsing libraries and is compatible with other Go libraries like goquery.
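The query caching mentioned in the feature list can be pictured as a small LRU map from expression string to compiled query. The sketch below illustrates the idea with only the standard library; it is not htmlquery's actual implementation. The names `queryCache`, `compiled`, and `entry` are hypothetical, and the "compile" step is a stand-in that only counts how often it runs.

```go
package main

import (
	"container/list"
	"fmt"
)

// compiled stands in for a compiled XPath expression (hypothetical type).
type compiled struct{ expr string }

// entry is the value stored in each list element.
type entry struct {
	key string
	val *compiled
}

// queryCache is a fixed-capacity LRU cache keyed by expression string.
type queryCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // expression -> list element
	compiles int                      // how many times the compile step ran
}

func newQueryCache(capacity int) *queryCache {
	return &queryCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// get returns the cached query for expr, compiling it only on a miss.
func (c *queryCache) get(expr string) *compiled {
	if el, ok := c.items[expr]; ok {
		c.order.MoveToFront(el) // mark as recently used
		return el.Value.(*entry).val
	}
	c.compiles++ // stand-in for the expensive compile step
	val := &compiled{expr: expr}
	c.items[expr] = c.order.PushFront(&entry{key: expr, val: val})
	if c.order.Len() > c.capacity {
		// Evict the least recently used expression.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	return val
}

func main() {
	cache := newQueryCache(2)
	for _, expr := range []string{"//a", "//a", "//h1", "//a[@href]"} {
		cache.get(expr)
	}
	fmt.Println("compile calls:", cache.compiles)
}
```

Repeating the same expression in a loop then costs one compilation instead of one per call, which is the benefit the feature list describes.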

Basic Usage Examples

Load HTML from a string:

doc, err := htmlquery.Parse(strings.NewReader("..."))

Load HTML from a URL:

doc, err := htmlquery.LoadURL("http://example.com/")

Find all `<a>` elements:

list := htmlquery.Find(doc, "//a")

Find all `<a>` elements with an href attribute:

list := htmlquery.Find(doc, "//a[@href]")

Extract the text of the first `<h1>` element:

h1 := htmlquery.FindOne(doc, "//h1")
fmt.Println(htmlquery.InnerText(h1)) // Outputs the text inside <h1>

Extract all values of the href attribute from `<a>` elements:

list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
    fmt.Println(htmlquery.SelectAttr(n, "href"))
}

Typical Use Cases

  • Web scraping where XPath provides more precise or complex querying than CSS selectors.
  • Extracting structured data from HTML documents.
  • Navigating and manipulating HTML trees programmatically.

Installation

go get github.com/antchfx/htmlquery

Node

  • Node is a Go package inspired by Beautiful Soup, providing APIs for extracting data from HTML and XML documents.

Colly

Colly is a web scraping framework for Go that uses goquery internally for HTML parsing.

https://github.com/gocolly/colly

To install, run go get github.com/gocolly/colly/v2, or add colly to your go.mod file:

module github.com/x/y

go 1.14

require (
        github.com/gocolly/colly/v2 latest
)

Example:

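package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Start crawling from the entry page
	c.Visit("http://go-colly.org/")
}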
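In the example above, the href values found on a page may be relative, and the same link can appear many times. A crawler has to resolve each href against the page URL and avoid queuing duplicates; Colly takes care of this for you. The sketch below illustrates those two steps with only the standard library. It is not Colly's code: `resolveLink` and `frontier` are hypothetical names, and the `/docs/` and `intro/` paths are made up for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on page into an absolute URL.
func resolveLink(page, href string) (string, error) {
	base, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

// frontier remembers which URLs have already been queued.
type frontier struct{ seen map[string]bool }

// add records u and reports whether it was new.
func (f *frontier) add(u string) bool {
	if f.seen[u] {
		return false
	}
	f.seen[u] = true
	return true
}

func main() {
	f := &frontier{seen: make(map[string]bool)}
	page := "http://go-colly.org/docs/" // illustrative page URL
	hrefs := []string{"intro/", "/articles/", "intro/"}
	for _, href := range hrefs {
		abs, err := resolveLink(page, href)
		if err != nil {
			continue // skip hrefs that cannot be parsed
		}
		if f.add(abs) {
			fmt.Println("queue:", abs)
		}
	}
}
```

The second "intro/" resolves to the same absolute URL as the first, so it is skipped rather than queued again.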

Comparison Table

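| Library   | API Style           | Selector Support     | Inspiration    | Notes                |
|-----------|---------------------|----------------------|----------------|----------------------|
| soup      | Beautiful Soup-like | Tag & attribute only | Beautiful Soup | Simple, no CSS/XPath |
| goquery   | jQuery-like         | CSS selectors        | jQuery         | Flexible, popular    |
| htmlquery | XPath               | XPath                | lxml/XPath     | Advanced queries     |
| Node      | Beautiful Soup-like | Tag & attribute      | Beautiful Soup | Similar to soup      |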

Summary

  • For a direct Beautiful Soup analogue in Go, use soup.
  • For CSS selector support, consider goquery.
  • For XPath queries, use htmlquery.
  • For another Beautiful Soup-inspired option, look at Node.

All these libraries build on Go's HTML parser (golang.org/x/net/html, maintained by the Go team), which is robust and HTML5-compliant, so the main difference is in API style and selector capabilities.