Friday, June 5, 2015

Day 15 - Comparison of HTML parsers for a web crawler: today, go-html-transform

http://bitbucket.org/zaphar/go-html-transform looks very promising and powerful, as it is not only a scraping lib but can also manipulate the DOM.

I was really excited to get my hands on it: it should provide a simple tree walker (like scrape). So I wrote a main loop to get articles from Hacker News ("http://news.ycombinator.com/"):


    tree, err := h5.New(body)
    if err != nil {
        err = errors.New("Failed to html.Parse: " + err.Error())
        return
    }

    matched := []*html.Node{}
    // Get all articles
    tree.Walk(func(n *html.Node) {
        // check the node
        if n.DataAtom == atom.Tr && n.Parent != nil &&  
           n.Parent.DataAtom == atom.Tbody {

            for _, a := range n.Attr {
                if a.Key == "class" {
                    if a.Val == "athing" {
                        matched = append(matched, n)
                    }
                }
            }

        }
    })
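One caveat about the class check above: comparing a.Val to "athing" only matches when the attribute holds exactly that one class. Here is a minimal sketch of a more tolerant check, assuming class names are separated by whitespace. Attr is a hypothetical local stand-in for the Key/Val attribute type used above, so the snippet runs with the standard library alone; hasClass is my own helper name, not part of go-html-transform.

```go
package main

import (
	"fmt"
	"strings"
)

// Attr is a hypothetical stand-in for the Key/Val attribute type used in
// the walker above, declared locally so this sketch has no external imports.
type Attr struct {
	Key, Val string
}

// hasClass reports whether any "class" attribute in attrs lists name
// among its whitespace-separated class names.
func hasClass(attrs []Attr, name string) bool {
	for _, a := range attrs {
		if a.Key != "class" {
			continue
		}
		for _, c := range strings.Fields(a.Val) {
			if c == name {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(hasClass([]Attr{{"class", "athing"}}, "athing"))            // true
	fmt.Println(hasClass([]Attr{{"class", "athing submission"}}, "athing")) // true
	fmt.Println(hasClass([]Attr{{"id", "athing"}}, "athing"))               // false
}
```

Inside the walker, the inner attribute loop could then shrink to a single call to a helper like this one.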
 
 
OK, that's a little bit more code than using scrape, but I'm sure it will pay off later.
And I found out that it supports CSS3 selectors. Great! So what's that?

It's like a query language: you can request a class with ".", an id with "#", and so on (w3schools.com/cssref/css_selectors).
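To make the "." and "#" notation concrete, here is a rough sketch of how a compound selector such as "tr.athing" could be split into its parts and checked against an element. This is not how go-html-transform implements selectors. Element, Simple, parseSimple and matches are hypothetical names made up for this illustration, the id "row1" is invented, and the sketch only covers tag, class and id (no combinators, attributes or pseudo-classes).

```go
package main

import "fmt"

// Element is a hypothetical, simplified stand-in for a DOM node.
type Element struct {
	Tag     string
	ID      string
	Classes []string
}

// Simple holds the parts of one compound selector such as "tr.athing".
type Simple struct {
	Tag     string
	ID      string
	Classes []string
}

// parseSimple splits a selector into tag, id ("#") and class (".") parts.
func parseSimple(sel string) Simple {
	var s Simple
	var kind byte // 0 = tag name, '.' = class, '#' = id
	start := 0
	flush := func(end int) {
		tok := sel[start:end]
		if tok == "" {
			return
		}
		switch kind {
		case '.':
			s.Classes = append(s.Classes, tok)
		case '#':
			s.ID = tok
		default:
			s.Tag = tok
		}
	}
	for i := 0; i < len(sel); i++ {
		if sel[i] == '.' || sel[i] == '#' {
			flush(i) // close the token that ended just before this marker
			kind = sel[i]
			start = i + 1
		}
	}
	flush(len(sel))
	return s
}

// matches reports whether e satisfies every part of the selector s.
func matches(e Element, s Simple) bool {
	if s.Tag != "" && s.Tag != e.Tag {
		return false
	}
	if s.ID != "" && s.ID != e.ID {
		return false
	}
	for _, want := range s.Classes {
		found := false
		for _, have := range e.Classes {
			if have == want {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	// "row1" is an invented id, used only for this example.
	row := Element{Tag: "tr", ID: "row1", Classes: []string{"athing"}}
	for _, sel := range []string{"tr.athing", ".athing", "#row1", "td.athing"} {
		fmt.Printf("%-10s %v\n", sel, matches(row, parseSimple(sel)))
	}
}
```

With a selector engine, the whole hand-written walker condition from earlier collapses into the single string "tr.athing".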

Great, that is just what I want, and the lack of documentation doesn't scare me off.

What then threw me off was not only the lack of examples, but the fact that the project's migration from code.google.com to bitbucket left some dangling imports like:


    // The package follows the CSS3 Spec at: http://www.w3.org/TR/css3-selectors/
    package selector

    import (
        "go.marzhillstudios.com/pkg/go-html-transform/h5"


go.marzhillstudios.com seems to be dead: no response, TCP timeout. I do not want to base a project on a dead lib, even if it looks to be the best!
It's just too much effort to get it running while hoping it will not be abandoned. You are much better off using scrape for a simple-to-use interface, or goquery for a full-fledged, jQuery-compatible selector lib.
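If you still want to gauge how many files in a checkout depend on the dead host, something like the following sketch could help. It uses only the standard library's go/parser to read the import block of a source string and report paths under a given prefix. importsWithPrefix is a hypothetical helper name, and the sample source is made up for the demo except for the one import path quoted above.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// importsWithPrefix parses Go source text and returns every import path
// that starts with prefix.
func importsWithPrefix(src, prefix string) ([]string, error) {
	fset := token.NewFileSet()
	// ImportsOnly stops parsing after the import declarations.
	f, err := parser.ParseFile(fset, "sample.go", src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, imp := range f.Imports {
		// Path.Value is the quoted string literal as written in the source.
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		if strings.HasPrefix(path, prefix) {
			out = append(out, path)
		}
	}
	return out, nil
}

func main() {
	// Made-up sample file; only the second import path comes from the post.
	src := `package selector

import (
	"strings"

	"go.marzhillstudios.com/pkg/go-html-transform/h5"
)
`
	dead, err := importsWithPrefix(src, "go.marzhillstudios.com/")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range dead {
		fmt.Println("dangling import:", p)
	}
}
```

Running a check like this over every .go file in the repository would show how many imports need rewriting to point at the bitbucket location before the library builds again.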



