Beautiful Soup Alternatives for Go
Continuing the topic of extracting data from HTML
- For a direct Beautiful Soup analogue in Go, use soup.
- For CSS selector support, consider goquery.
- For XPath queries, use htmlquery.
- For another Beautiful Soup-inspired option, look at Node.
If you’re looking for a Beautiful Soup equivalent in Go, several libraries offer similar HTML parsing and scraping functionality:
soup
- soup is a Go library explicitly designed as an analogue to Python’s Beautiful Soup. Its API is intentionally similar, featuring functions like `Find`, `FindAll`, and `HTMLParse`, making it easy for developers familiar with Beautiful Soup to transition to Go.
- It allows you to fetch web pages, parse HTML, and traverse the DOM to extract data, much like Beautiful Soup.
- Example usage:
resp, err := soup.Get("https://xkcd.com")
if err != nil {
	os.Exit(1)
}
doc := soup.HTMLParse(resp)
links := doc.Find("div", "id", "comicLinks").FindAll("a")
for _, link := range links {
	fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
}
- Note: soup does not support CSS selectors or XPath; it relies on tag and attribute-based searching.
goquery
- goquery is another popular Go library for HTML parsing, offering a jQuery-like syntax for DOM traversal and manipulation.
- It supports CSS selectors, making it more flexible for complex queries compared to soup.
- Example usage:
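// resp is an *http.Response, for example from http.Get("https://xkcd.com")
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	log.Fatal(err)
}
doc.Find("div#comicLinks a").Each(func(i int, s *goquery.Selection) {
	fmt.Println(s.Text(), "| Link :", s.AttrOr("href", ""))
})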
htmlquery Go Library
htmlquery is a Go library designed for parsing and extracting data from HTML documents using XPath expressions. It provides a straightforward API for traversing and querying the HTML tree structure, making it especially useful for web scraping and data extraction tasks.
Key Features
- Allows querying HTML documents with XPath 1.0/2.0 expressions.
- Supports loading HTML from strings, files, or URLs.
- Offers functions to find single or multiple nodes, extract attributes, and evaluate XPath expressions.
- Includes query caching (LRU-based) to improve performance by avoiding repeated compilation of XPath expressions.
- Built on the golang.org/x/net/html parser and compatible with other Go libraries such as goquery.
Basic Usage Examples
Load HTML from a string:
doc, err := htmlquery.Parse(strings.NewReader("..."))
Load HTML from a URL:
doc, err := htmlquery.LoadURL("http://example.com/")
Find all `<a>` elements:
list := htmlquery.Find(doc, "//a")
Find all `<a>` elements with an `href` attribute:
list := htmlquery.Find(doc, "//a[@href]")
Extract the text of the first `<h1>` element:
h1 := htmlquery.FindOne(doc, "//h1")
fmt.Println(htmlquery.InnerText(h1)) // Outputs the text inside the <h1>
Extract all values of the `href` attribute from `<a>` elements:
list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.SelectAttr(n, "href"))
}
Typical Use Cases
- Web scraping where XPath provides more precise or complex querying than CSS selectors.
- Extracting structured data from HTML documents.
- Navigating and manipulating HTML trees programmatically.
Installation
go get github.com/antchfx/htmlquery
Node
- Node is a Go package inspired by Beautiful Soup, providing APIs for extracting data from HTML and XML documents.
Colly
Colly is a web scraping framework for Go that uses goquery internally for HTML parsing.
https://github.com/gocolly/colly
To install, add colly to your go.mod file:
module github.com/x/y
go 1.14
require (
	github.com/gocolly/colly/v2 latest
)
Example:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()
	// Find and visit all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})
	// Log each request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
	c.Visit("http://go-colly.org/")
}
Comparison Table
Library | API Style | Selector Support | Inspiration | Notes |
---|---|---|---|---|
soup | Beautiful Soup-like | Tag & attribute only | Beautiful Soup | Simple, no CSS/XPath |
goquery | jQuery-like | CSS selectors | jQuery | Flexible, popular |
htmlquery | XPath | XPath | lxml/XPath | Advanced queries |
Node | Beautiful Soup-like | Tag & attribute | Beautiful Soup | Similar to soup |
Summary
- For a direct Beautiful Soup analogue in Go, use soup.
- For CSS selector support, consider goquery.
- For XPath queries, use htmlquery.
- For another Beautiful Soup-inspired option, look at Node.
All these libraries build on the golang.org/x/net/html parser (maintained by the Go team, though not part of the standard library), which is robust and HTML5-compliant, so the main differences are API style and selector capabilities.