Thursday, June 18, 2015

Day 17 - Using reflection to write into an interface slice

How to use reflection to write into a slice of unknown items inside an unknown struct:

While working on Gorp with Indexes I had to solve the problem of how to write into a slice inside a struct passed as an interface{}. A new struct with the filled slice should be returned. Neither the struct nor the slice is known at compile time; I only have the field name of the slice, passed to me at runtime. After some serious head-scratching and reading through the reflection code, I luckily found a way to do it.


The basic method is:
  1. Convert the incoming interface i to a reflect.Type t
  2. Dereference until t is a struct type
  3. Create a new reflect.Value v from the type t
  4. From v we can now ask for the slice field by its name, getting back another reflect.Value s
  5. Now that we have the slice s as a reflect.Value, get the type of its elements
  6. From the slice element type, create a new instance of it (reflect.New)
  7. Write to the fields of this newitem using FieldByName (hardcoded in the example)
  8. Append the newitem to the slice - the append-and-set combination was the hard part for me to find out (see the snippet below)
  9. Return the reflect.Value v as an interface
  10. ???
  11. Profit - Eureka!!
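
The append-and-set in step 8 boils down to a single line, shown in context in the full program below: reflect.Append returns a new slice Value, and that new Value has to be written back into the struct field with Set, because Append alone leaves the field untouched.

    // Append returns a new slice Value; Set writes it back into the field
    s.Set(reflect.Append(s, newitem))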

Output:
Input Type main.Post:
Slice Type []*main.Comment:
Slice Elem Type main.Comment:
Comment 0, Body XYZ 0, PostId 0
Comment 1, Body XYZ 1, PostId 2
Comment 2, Body XYZ 2, PostId 4
Comment 3, Body XYZ 3, PostId 6
Comment 4, Body XYZ 4, PostId 8
Success: Process exited with return value 0.



// This is a demo to show how to convert from a normal struct
// to a reflection type and back to a struct without knowing
// the original one. Input is passed as an interface{} and the
// output will be an interface{}, too.
// Bonus points for writing into an embedded slice
// (= the embedded Comment struct slice in Post)

package main

import (
    "errors"
    "fmt"
    "os"
    "reflect"
)

type Post struct {
    Id       uint64
    Title    string
    Comments []*Comment
}

// holds a single comment bound to a post
type Comment struct {
    Id     uint64
    PostId uint64
    Body   string
}

func CreateAndFillSlice(i interface{}, sliceName string) (interface{}, error) {

    // Convert the interface i to a reflect.Type t 
    t := reflect.TypeOf(i)
    // Check if the input is a pointer and dereference it if yes
    if t.Kind() == reflect.Ptr {
        t = t.Elem()
    }

    // Check if the input is a struct
    if t.Kind() != reflect.Struct {
        return nil, errors.New("input param is not a struct")
    }
    fmt.Printf("Input Type %v:\n", t)

    // Create a new Value from the input type
    // this will be returned to the caller
    v := reflect.New(t).Elem()

    // Get the field named "sliceName" from the input struct, which should be a slice
    s := v.FieldByName(sliceName)
    if s.Kind() == reflect.Slice {

        st := s.Type()
        fmt.Printf("Slice Type %s:\n", st)

        // Get the type of a single slice element
        sliceType := st.Elem()
        // Pointer?
        if sliceType.Kind() == reflect.Ptr {
            // Then dereference it
            sliceType = sliceType.Elem()
        }
        fmt.Printf("Slice Elem Type %v:\n", sliceType)

        for i := 0; i < 5; i++ {
            // Create a new slice element
            newitem := reflect.New(sliceType)
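            // Note: reflect.New returns a pointer Value (*Comment here), which
            // matches the []*Comment element type of this slice. For a slice of
            // plain values ([]Comment) you would append newitem.Elem() instead.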
            // Set some field in it
            newitem.Elem().FieldByName("Body").SetString(fmt.Sprintf("XYZ %d", i))
            newitem.Elem().FieldByName("PostId").SetUint(uint64(i * 2))

            // This is the important part here - append and set
            // Append the newitem to the slice in "v" which will be the output
            s.Set(reflect.Append(s, newitem))
        }
    } else {
        return nil, fmt.Errorf("field %s is not a slice", sliceName)
    }

    // IMPORTANT
    // Cast back to the empty interface type
    // So the cast back to Post outside will work
    return v.Interface(), nil
}

func main() {
    p := Post{Id: 1, Title: "Title 1"}

    result, err := CreateAndFillSlice(p, "Comments")
    if err != nil {
        fmt.Println(err.Error())
        os.Exit(1)
    }
    // Cast the returned interface to a Post
    post := result.(Post)
    for i, c := range post.Comments {
        fmt.Printf("Comment %d, Body %s, PostId %d\n", i, c.Body, c.PostId)
    }
}




Saturday, June 6, 2015

Day 16 - Comparison of html parsers for a webcrawler, today GoQuery

github.com/goquery is the king of Go HTML parsing. After trying the two other
relevant libs, I can conclude:

  • GoQuery: All you want; you can use it as a Ferrari or as a heavy-load truck
  • Scrape: Small, light and neat: it's your bicycle, it always works
  • go-html-transform: For me it feels like a power plant with all its buttons labeled in Russian


This example, which scrapes posts from Hacker News, has been stripped down to
fit onto one page. All error handling and debug printouts have been removed,
so only the pure GoQuery logic remains.
The full source is available at github.com/hackernewsCrawlerGoQuery.


// ParseHtmlHackerNews parses posts out of the Hacker News HTML; the input
// HTML is an io.Reader and the recognized posts are returned in the psout
// slice of posts. Errors which affect only a single post are stored in
// that post's Err field.
func ParseHtmlHackerNews(body io.Reader,
    ps []*post.Post) (psout []*post.Post, err error) {
    // Create a goquery document to parse from an io.Reader
    doc, err := goquery.NewDocumentFromReader(body)
    // Find hackernews posts = elements with class "athing"
    thing := doc.Find(".athing")
    for iThing := range thing.Nodes {
        // Create a new post struct - if the crawling fails 
        // the post will have its Err set, but will be added  
        // to the outgoing (psout) slice nevertheless
        post := post.NewPost()
        ps = append(ps, &post)
        // use singlearticle as a selection of one single post
        singlearticle := thing.Eq(iThing)
        // Get the next element containing additional info for this post
        scorenode := singlearticle.Next()
        // Get the post title
        htmlpost := singlearticle.Find(".title a").First()
        post.Title = htmlpost.Text()
        // Get the post url
        post.Url, _ = htmlpost.Attr("href")
        // Get the post score
        scoretag := scorenode.Find(".subtext .score").First()
        post.SetScore(strings.Split(scoretag.Text(), " ")[0])
        // Get the post id
        postid, _ := scoretag.Attr("id")
        post.PostId = strings.Split(postid, "_")[1]
        // Get the username and postdate
        hrefs := scorenode.Find(".subtext a")
        for i := range hrefs.Nodes {
            href := hrefs.Eq(i)
            t, _ := href.Html()
            s, exists := href.Attr("href")
            if exists {
                if strings.HasPrefix(s, "user?id") {
                    post.User = t
                    continue
                }
                if strings.HasPrefix(s, "item?id") {
                    if strings.Contains(t, "ago") {
                        var postDate time.Time
                        postDate, err = GetDateFromCreatedAgo(t)
                        post.PostDate = postDate
                        post.Err = nil
                    }
                }
            }
        }
    }
    return ps, err
}
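
For completeness, here is a minimal sketch of how this parser might be driven. It assumes the post package from the full repository (the import path below is hypothetical) and that ParseHtmlHackerNews lives in the same package; error handling is again kept to a minimum:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/hackernewsCrawlerGoQuery/post" // hypothetical import path
)

func main() {
    // Fetch the Hacker News front page
    resp, err := http.Get("https://news.ycombinator.com/")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the posts out of the HTML body
    var ps []*post.Post
    ps, err = ParseHtmlHackerNews(resp.Body, ps)
    if err != nil {
        log.Fatal(err)
    }
    for _, p := range ps {
        fmt.Printf("%s (%s) by %s\n", p.Title, p.Url, p.User)
    }
}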

Friday, June 5, 2015

Day 15 - Comparison of html parsers for a webcrawler, today go-html-transform

http://bitbucket.org/zaphar/go-html-transform looks very promising and powerful, as it's not only a scraping lib but is also able to manipulate the DOM.

I was really excited to get my hands on it: it provides a simple traversing walker (like scrape). So I wrote a main loop to get articles from Hacker News (http://news.ycombinator.com/):


    tree, err := h5.New(body)
    if err != nil {
        err = errors.New("Failed to html.Parse: " + err.Error())
        return
    }

    matched := []*html.Node{}
    // Get all articles
    tree.Walk(func(n *html.Node) {
        // check the node
        if n.DataAtom == atom.Tr && n.Parent != nil &&  
           n.Parent.DataAtom == atom.Tbody {

            for _, a := range n.Attr {
                if a.Key == "class" {
                    if a.Val == "athing" {
                        matched = append(matched, n)
                    }
                }
            }

        }
    })
 
 
OK, that's a little bit more code than using scrape, but I'm sure it will pay off sometime later.
And I found out that it supports CSS3 selectors - great! So what's that?

It's like a query language: you can request a class with ".", an id with "#", and so on (.w3schools.com/cssref/css_selectors).
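
For illustration, here is what those selectors look like in practice. This sketch uses goquery, which speaks the same CSS3 selector syntax (the go-html-transform selector API is exactly what lacks documentation); the HTML snippet is made up:

    package main

    import (
        "fmt"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    func main() {
        html := `<table><tbody>
            <tr class="athing" id="item_1">
                <td class="title"><a href="item?id=1">A post title</a></td>
            </tr>
        </tbody></table>`
        doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))

        doc.Find(".athing")  // "." selects all elements with class "athing"
        doc.Find("#item_1")  // "#" selects the element with id "item_1"
        // selectors can be combined: anchors inside the title cell
        doc.Find("tr.athing td.title a").Each(func(_ int, s *goquery.Selection) {
            fmt.Println(s.Text()) // prints "A post title"
        })
    }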

Great, that is just what I want - the lack of documentation doesn't scare me off.

What then threw me off was not only the lack of examples, but the fact that the project's migration from code.google.com to bitbucket left some dangling imports like:


// The package follows the CSS3 Spec at: http://www.w3.org/TR/css3-selectors/
package selector

import (
    "go.marzhillstudios.com/pkg/go-html-transform/h5"


go.marzhillstudios.com seems to be dead: no response, TCP timeout. So I do not want to base a project on a dead lib, even if it looks to be the best!
It's just too much effort to get it running while hoping it will not be abandoned. You are much better off using scrape for a simple-to-use interface, or goquery for a fully-fledged jQuery-compatible selector lib.