[Golang] Extract Title, Image, and URL via goquery
Extract the title, image, and URL from a buy123 product webpage via goquery, then output the info as a reStructuredText image directive.
First, examine the source of a buy123 product page: the product info is embedded as a JSON string in a script tag whose type is application/ld+json.
So we extract the JSON string via the goquery Find method, then unmarshal it into a struct. Finally, we use Go's text/template package to output the info in reStructuredText image format.
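package main

import (
	"bytes"
	"encoding/json"
	"text/template"

	"github.com/PuerkitoBio/goquery"
)

// rstTmpl renders the product info as a reStructuredText image directive.
const rstTmpl = `.. image:: {{.Image}}
   :alt: {{.Name}}
   :target: {{.Url}}
   :align: center`

// buy123ProductInfo holds the fields decoded from the JSON string.
// encoding/json matches JSON keys to field names case-insensitively.
type buy123ProductInfo struct {
	Name        string
	Description string
	Image       string
	Url         string
}

// parseBuy123 fetches the product page at url and returns the rendered
// reStructuredText. It panics on any error.
func parseBuy123(url string) string {
	doc, err := goquery.NewDocument(url)
	if err != nil {
		panic(err)
	}

	// Text() concatenates the text of every matched element, so this
	// assumes the page has exactly one application/ld+json script tag.
	jsonBlob := doc.Find("script[type=\"application/ld+json\"]").Text()

	i := buy123ProductInfo{}
	err = json.Unmarshal([]byte(jsonBlob), &i)
	if err != nil {
		panic(err)
	}

	tmpl, err := template.New("buy123").Parse(rstTmpl)
	if err != nil {
		panic(err)
	}

	// Render the template into a buffer so it can be returned as a string.
	var rst bytes.Buffer
	err = tmpl.Execute(&rst, i)
	if err != nil {
		panic(err)
	}
	return rst.String()
}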
Output:
.. image:: //s3-buy123.cdn.hinet.net/images/item/GLFA9T7.png
   :alt: 6LED多功能太陽能露營燈
   :target: https://direct.buy123.com.tw/site/item/64493/6LED%E5%A4%9A%E5%8A%9F%E8%83%BD%E5%A4%AA%E9%99%BD%E8%83%BD%E9%9C%B2%E7%87%9F%E7%87%88
   :align: center
Tested on: Ubuntu Linux 15.10, Go 1.6.
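The decode-and-render steps can also be tried without fetching the page. The following is a minimal sketch using only the standard library. The names productInfo, renderRST, and sampleJSON are hypothetical, and the JSON payload is an assumed shape whose values are copied from the output above; the real page may carry more or different keys. The option lines in the template are indented, as reStructuredText directives require, and errors are returned instead of triggering a panic.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// rstTmpl is a reStructuredText image directive with indented options.
const rstTmpl = `.. image:: {{.Image}}
   :alt: {{.Name}}
   :target: {{.Url}}
   :align: center`

// productInfo (hypothetical name) receives the decoded JSON fields.
// encoding/json matches keys such as "name" to Name case-insensitively.
type productInfo struct {
	Name        string
	Description string
	Image       string
	Url         string
}

// sampleJSON is an assumed payload, not the actual page content.
// Keys without a matching struct field ("@context", "@type") are ignored.
const sampleJSON = `{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "6LED多功能太陽能露營燈",
  "image": "//s3-buy123.cdn.hinet.net/images/item/GLFA9T7.png",
  "url": "https://direct.buy123.com.tw/site/item/64493/6LED%E5%A4%9A%E5%8A%9F%E8%83%BD%E5%A4%AA%E9%99%BD%E8%83%BD%E9%9C%B2%E7%87%9F%E7%87%88"
}`

// renderRST decodes jsonBlob and renders it with rstTmpl,
// returning an error instead of panicking.
func renderRST(jsonBlob string) (string, error) {
	var info productInfo
	if err := json.Unmarshal([]byte(jsonBlob), &info); err != nil {
		return "", fmt.Errorf("decode product info: %w", err)
	}
	tmpl, err := template.New("product").Parse(rstTmpl)
	if err != nil {
		return "", fmt.Errorf("parse template: %w", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, info); err != nil {
		return "", fmt.Errorf("execute template: %w", err)
	}
	return buf.String(), nil
}

func main() {
	rst, err := renderRST(sampleJSON)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rst)
}
```

Returning the error lets the caller decide how to handle a page whose script tag is missing or malformed, which would otherwise surface as a panic from json.Unmarshal on an empty string.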