[Golang] Extract Title, Image, and URL via goquery


Extract the title, image, and URL from a buy123 product page with goquery, then output the info as a reStructuredText image directive.

First, we examine the source of a buy123 product page. The product info is embedded as a JSON string in a script tag whose type is application/ld+json.

So we extract the JSON string with goquery's Find method, convert it to a struct with encoding/json, and finally use Go's text/template package to output the info in reStructuredText image format.
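To make the decoding step concrete before the full program, here is a minimal sketch that uses only the standard library. The sample payload is hypothetical: its name, image, and url values are copied from the output shown in this post, while the @context, @type, and description values are assumed placeholders rather than the page's real markup. The names productInfo and decodeProduct are also made up for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// productInfo holds the fields this post cares about. The JSON tags are
// optional here (encoding/json matches keys case-insensitively), but they
// make the mapping explicit.
type productInfo struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	Image       string `json:"image"`
	URL         string `json:"url"`
}

// sampleJSONLD is a hypothetical payload, not the real page content.
const sampleJSONLD = `{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "6LED多功能太陽能露營燈",
  "description": "placeholder description",
  "image": "//s3-buy123.cdn.hinet.net/images/item/GLFA9T7.png",
  "url": "https://direct.buy123.com.tw/site/item/64493/6LED%E5%A4%9A%E5%8A%9F%E8%83%BD%E5%A4%AA%E9%99%BD%E8%83%BD%E9%9C%B2%E7%87%9F%E7%87%88"
}`

// decodeProduct converts a JSON-LD string into a productInfo.
// Keys without a matching field (such as "@context") are ignored.
func decodeProduct(blob string) (productInfo, error) {
	var p productInfo
	err := json.Unmarshal([]byte(blob), &p)
	return p, err
}

func main() {
	p, err := decodeProduct(sampleJSONLD)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(p.Name)
	fmt.Println(p.Image)
	fmt.Println(p.URL)
}
```

Declaring only the needed fields keeps the struct small; anything else in the payload is silently skipped by json.Unmarshal.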

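package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"text/template"

	"github.com/PuerkitoBio/goquery"
)

// rstTmpl is the reStructuredText image directive to fill in.
const rstTmpl = `.. image:: {{.Image}}
  :alt: {{.Name}}
  :target: {{.Url}}
  :align: center`

// buy123ProductInfo holds the fields read from the page's JSON-LD data.
// encoding/json matches the lowercase JSON keys to these fields
// case-insensitively, so no struct tags are needed.
type buy123ProductInfo struct {
	Name        string
	Description string
	Image       string
	Url         string
}

// parseBuy123 fetches a buy123 product page and returns its info as a
// reStructuredText image directive. It panics on any error.
func parseBuy123(url string) string {
	// NewDocument fetches and parses the page. Newer goquery releases
	// deprecate it in favor of fetching with net/http and calling
	// goquery.NewDocumentFromReader.
	doc, err := goquery.NewDocument(url)
	if err != nil {
		panic(err)
	}

	// Take the first ld+json script only: Text() on several matches
	// would concatenate them into invalid JSON.
	jsonBlob := doc.Find("script[type=\"application/ld+json\"]").First().Text()

	i := buy123ProductInfo{}
	err = json.Unmarshal([]byte(jsonBlob), &i)
	if err != nil {
		panic(err)
	}

	tmpl, err := template.New("buy123").Parse(rstTmpl)
	if err != nil {
		panic(err)
	}
	// Render the template into a buffer so it can be returned as a string.
	var rst bytes.Buffer
	err = tmpl.Execute(&rst, i)
	if err != nil {
		panic(err)
	}

	return rst.String()
}

func main() {
	// The product page URL is passed as the first command-line argument.
	if len(os.Args) < 2 {
		fmt.Println("usage: pass a buy123 product page URL as the first argument")
		return
	}
	fmt.Println(parseBuy123(os.Args[1]))
}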
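The function above panics on any failure, which is fine for a one-off script. As a hedged alternative, the sketch below shows only the template step, returning an error instead of panicking. The names renderRST and product are hypothetical and not part of the original program; the field values used in main are taken from the output shown in this post.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

const rstTmpl = `.. image:: {{.Image}}
  :alt: {{.Name}}
  :target: {{.Url}}
  :align: center`

// product is a hypothetical stand-in for the struct decoded from the page.
type product struct {
	Name  string
	Image string
	Url   string
}

// rstTemplate is parsed once at start-up. template.Must panics only if the
// constant template text itself is malformed.
var rstTemplate = template.Must(template.New("rst").Parse(rstTmpl))

// renderRST executes the template into a buffer and returns the result
// as a string, reporting any execution error to the caller.
func renderRST(p product) (string, error) {
	var buf bytes.Buffer
	if err := rstTemplate.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderRST(product{
		Name:  "6LED多功能太陽能露營燈",
		Image: "//s3-buy123.cdn.hinet.net/images/item/GLFA9T7.png",
		Url:   "https://direct.buy123.com.tw/site/item/64493/6LED%E5%A4%9A%E5%8A%9F%E8%83%BD%E5%A4%AA%E9%99%BD%E8%83%BD%E9%9C%B2%E7%87%9F%E7%87%88",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

text/template does no HTML escaping, so the URLs and the non-ASCII product name pass through unchanged, which is what a reStructuredText target needs.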

Output:

.. image:: //s3-buy123.cdn.hinet.net/images/item/GLFA9T7.png
  :alt: 6LED多功能太陽能露營燈
  :target: https://direct.buy123.com.tw/site/item/64493/6LED%E5%A4%9A%E5%8A%9F%E8%83%BD%E5%A4%AA%E9%99%BD%E8%83%BD%E9%9C%B2%E7%87%9F%E7%87%88
  :align: center
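
Note that the image URL in this output is protocol-relative (it starts with //). Whether that works depends on where the reStructuredText ends up being rendered, so as an optional extra step one could resolve it against the page URL with net/url. This is a sketch, not part of the original program, and the helper name absoluteURL is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL resolves ref against base. A protocol-relative ref such as
// "//host/path" inherits only the scheme of base.
func absoluteURL(base, ref string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	r, err := url.Parse(ref)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	abs, err := absoluteURL(
		"https://direct.buy123.com.tw/",
		"//s3-buy123.cdn.hinet.net/images/item/GLFA9T7.png",
	)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	fmt.Println(abs)
}
```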

Tested on: Ubuntu Linux 15.10, Go 1.6.

