Thursday, June 4, 2015

Day 14 - Comparison of HTML parsers for a webcrawler, today: scrape

Scrape (github.com/yhat/scrape) is a lightweight layer on top of godoc.org/golang.org/x/net/html. It makes it very easy to traverse the *html.Node tree and provides some convenience functions such as Attr and Text.
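
Scrape operates on the node tree produced by golang.org/x/net/html, so the first step is always to fetch and parse a page. A minimal sketch (using the Hacker News front page, which the full example linked at the end of this post crawls):

    package main

    import (
        "log"
        "net/http"

        "golang.org/x/net/html"
    )

    func main() {
        // fetch the page to be scraped
        resp, err := http.Get("https://news.ycombinator.com/")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // parse the body into the *html.Node tree that
        // all scrape functions operate on
        root, err := html.Parse(resp.Body)
        if err != nil {
            log.Fatal(err)
        }
        _ = root // root is the starting point for Find/FindAll below
    }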

After grokking the concept it's super easy to use and gets you fast results.

There are really only two things to understand:

1) The traversal functions Find and FindAll:

    // searches the node tree beginning at node article (*html.Node)
    // and returns the first node accepted by the matcher function
    titlenode, ok := scrape.Find(article, func(n *html.Node) bool {
        return n.DataAtom == atom.Td && scrape.Attr(n, "class") == "title"
    })
    if !ok {
        // ... do some error handling
    }
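
FindAll works like Find, but returns a slice of all matching nodes instead of just the first one. A short sketch, reusing the same matcher logic on the article node from above:

    // returns every matching node in the tree below article
    titlenodes := scrape.FindAll(article, func(n *html.Node) bool {
        return n.DataAtom == atom.Td && scrape.Attr(n, "class") == "title"
    })
    for _, tn := range titlenodes {
        // Text concatenates all the text content below a node
        fmt.Println(scrape.Text(tn))
    }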


2) The matcher function, which controls which nodes are returned:
A matcher function is passed as an argument to Find or FindAll. It takes a node as its input parameter and returns either true or false. If it returns true, the node n is included in the result of FindAll; in the case of Find, a true result from the matcher causes Find to stop and return the node n.


    // define a matcher that accepts <tr class="athing"> rows inside a <tbody>
    matcher := func(n *html.Node) bool {
        if n.DataAtom == atom.Tr && n.Parent != nil && n.Parent.DataAtom == atom.Tbody {
            return scrape.Attr(n, "class") == "athing"
        }
        return false
    }
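
Putting the two pieces together, here is a condensed sketch of how the Hacker News example works, continuing from the root node parsed in the first snippet. (The class names athing and title reflect Hacker News markup at the time of writing and may change.)

    // find all story rows on the page using the matcher defined above
    rows := scrape.FindAll(root, matcher)
    for _, row := range rows {
        // within each story row, look for the title cell (as in example 1)
        titlenode, ok := scrape.Find(row, func(n *html.Node) bool {
            return n.DataAtom == atom.Td && scrape.Attr(n, "class") == "title"
        })
        if !ok {
            continue
        }
        fmt.Println(scrape.Text(titlenode))
    }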



The full example code for this blog entry can be found at github.com/kimxilxyong/intogooglego/hackernewsCrawlerScrape

