Handling Non-UTF-8 HTML Web Pages with goquery


goquery handles only UTF-8 encoded web pages. The goquery wiki [1] describes a way to handle non-UTF-8 HTML pages when the character encoding (charset) of the page is known: use iconv to convert the content to UTF-8 before passing it to goquery. I rewrote the code from the wiki to make it more modular.

Install goquery and the Go iconv binding:

$ go get -u github.com/PuerkitoBio/goquery
$ go get -u github.com/djimenez/iconv-go

Source code

package main

import (
      "net/http"

      "github.com/PuerkitoBio/goquery"
      iconv "github.com/djimenez/iconv-go"
)

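// NewDocumentFromNonUtf8Url fetches url, converts the response body from
// charset to UTF-8 with iconv, and parses the result into a goquery document.
func NewDocumentFromNonUtf8Url(url, charset string) (doc *goquery.Document, err error) {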
      resp, err := http.Get(url)
      if err != nil {
              return
      }
      defer resp.Body.Close()

      utfBody, err := iconv.NewReader(resp.Body, charset, "utf-8")
      if err != nil {
              return
      }

      doc, err = goquery.NewDocumentFromReader(utfBody)
      return
}

Example: reading a Big5 web page

func main() {
      doc, err := NewDocumentFromNonUtf8Url("http://shenfang.com.tw/product.htm", "big5")
      if err != nil {
              panic(err)
      }

      // do something with the doc
}
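
The example above hard-codes "big5" because the charset is known in advance. When it is not, one possible approach is to read it from the Content-Type header of the HTTP response. The following is only a minimal sketch using the standard library: charsetFromContentType is a hypothetical helper of my own naming (it is not part of goquery or iconv-go), and it assumes the server actually declares the charset in that header. Many pages declare it only in a meta tag, which this sketch does not inspect.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContentType is a hypothetical helper (not part of goquery or
// iconv-go). It returns the lowercased charset parameter of a Content-Type
// header value, or fallback when the header is malformed or has no charset.
func charsetFromContentType(contentType, fallback string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fallback
	}
	if cs := strings.TrimSpace(params["charset"]); cs != "" {
		return strings.ToLower(cs)
	}
	return fallback
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=Big5", "utf-8")) // prints "big5"
	fmt.Println(charsetFromContentType("text/html", "utf-8"))               // prints "utf-8"
}
```

Inside a function such as NewDocumentFromNonUtf8Url, the header value would come from resp.Header.Get("Content-Type"), and the result could be passed to iconv.NewReader as the source encoding. Whether iconv accepts a given charset name depends on the iconv installation.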

Tested on: Ubuntu Linux 18.04, Go 1.10.1.


References:

[1] Tips and tricks · PuerkitoBio/goquery Wiki · GitHub