2018-03-17

Golang: HTML Template Examples (XML News Parsing + Concurrency Improvements)

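Template programming with Go in HTML.

This post walks through two examples, both of which render parameters into an HTML template. The material is fairly simple, so let's go straight to the code.

simple.go takes a small set of parameters and renders them into a custom web page.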

package main

import (
  "fmt"
  "html/template"
  "net/http"
)

func main() {
  http.HandleFunc("/", indexHandler)
  http.HandleFunc("/new/", newsAggHandler)
  http.ListenAndServe(":8080", nil)
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
  fmt.Fprintf(w, "<h1> Hi, there.</h1>") // plain response, no template
}

type NewsPara struct {
  Title string
  News  string
}

func newsAggHandler(w http.ResponseWriter, r *http.Request) {
  p := NewsPara{Title: "Wow", News: "news"}
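  t, err := template.ParseFiles("simple.html") // template body: {{ .Title}} , {{.News}}
  if err != nil {
    http.Error(w, err.Error(), http.StatusInternalServerError)
    return
  }
  // Print the result of Execute to the console (nil on success, an error if rendering fails).
  fmt.Println(t.Execute(w, p))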
}

The template for the page is very simple: simple.html

{{ .Title}} , {{.News}}

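Then run the program and open localhost:8080/new/ in a browser: the parameters are rendered into the page. (If a field name in the HTML template is misspelled, Execute reports an error.)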
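The behaviour can be checked without a web server. The following is a minimal sketch, not the code from simple.go: it parses the template from a string and renders into a buffer. The helper name render and the misspelled field .Titel are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

// NewsPara mirrors the struct passed to the template in simple.go.
type NewsPara struct {
	Title string
	News  string
}

// render (hypothetical helper) parses tmplText and executes it against p.
func render(tmplText string, p NewsPara) (string, error) {
	t, err := template.New("simple").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	p := NewsPara{Title: "Wow", News: "news"}

	out, err := render("{{ .Title}} , {{.News}}", p)
	fmt.Println(out, err) // Wow , news <nil>

	// .Titel is a deliberate misspelling: NewsPara has no such field,
	// so Execute returns an error instead of rendering.
	_, err = render("{{ .Titel}}", p)
	fmt.Println(err != nil) // true
}
```

Rendering into a bytes.Buffer first also avoids sending a half-written page to the client when Execute fails partway through.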

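Next, a slightly more complex template. (It fetches news from the Washington Post XML sitemap and displays it on our own page.)

The template is as follows: (the page shows the whole content of a map, so it needs a loop)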

<h1> {{ .Title }}</h1>

<table>
    <thead>
        <th>Title</th>
        <th>KeyWords</th>
    </thead>
    <tbody>
        {{ range $key, $value := .News }}
          <tr>
            <td><a href="{{$value.Location}}" target="_blank">{{ $key }}</a></td>
            <td>{{ $value.Keyword }}</td>
          </tr>
        {{ end }}

    </tbody>
</table>
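To see how range walks a map, here is a small sketch that renders a cut-down version of the table body into a string. The constant rowTmpl, the helper renderRows, and the example.com data are assumptions for illustration only. Note that the template package visits map keys of a basic ordered type in sorted order, so the output is deterministic.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
)

type NewsMap struct {
	Keyword  string
	Location string
}

type NewsPara struct {
	Title string
	News  map[string]NewsMap
}

// rowTmpl is a hypothetical, stripped-down stand-in for the <tbody> loop.
const rowTmpl = `{{ range $key, $value := .News }}{{ $key }}|{{ $value.Keyword }};{{ end }}`

// renderRows executes rowTmpl against p and returns the rendered text.
func renderRows(p NewsPara) (string, error) {
	t, err := template.New("rows").Parse(rowTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	p := NewsPara{Title: "Wow", News: map[string]NewsMap{
		"b-title": {Keyword: "courts", Location: "https://example.com/b"},
		"a-title": {Keyword: "politics", Location: "https://example.com/a"},
	}}
	out, err := renderRows(p)
	fmt.Println(out, err) // a-title|politics;b-title|courts; <nil>
}
```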

The structs are defined as follows: (only some of the child tags are picked out)

type Sitemapindex struct {
  Locations []string `xml:"sitemap>loc"`
}

type News struct {
  Titles []string `xml:"url>news>title"`
  Keywords []string `xml:"url>news>keywords"`
  Locations []string `xml:"url>loc"`
}

type NewsMap struct {
  Keyword string
  Location string
}

type NewsPara struct {
  Title string
  News  map[string]NewsMap
}

The code that extracts and displays the news comes next. The XML sitemap index it reads is structured roughly like this:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
  <loc>
  http://www.washingtonpost.com/news-politics-sitemap.xml
  </loc>
</sitemap>
<sitemap>
  <loc>
  http://www.washingtonpost.com/news-blogs-politics-sitemap.xml
  </loc>
</sitemap>
</sitemapindex>

Each individual news entry looks like this:

<url>
  <loc>
    https://www.washingtonpost.com/politics/courts_law/is-california-protecting-women-or-forcing-clinics-to-promote-abortion-supreme-court-to-decide/2018/03/16/05ab6db4-2627-11e8-874b-d517e912f125_story.html
  </loc>
  <changefreq>hourly</changefreq>
  <n:news>
    <n:publication>
      <n:name>Washington Post</n:name>
      <n:language>en</n:language>
    </n:publication>
    <n:publication_date>2018-03-17T00:30:17Z</n:publication_date>
    <n:title>
      Is California protecting women or forcing clinics to promote abortion? Supreme Court to decide.
    </n:title>
    <n:keywords>
      courts,informed choices,xavier becerra,supreme court,fake clinics,abortion
    </n:keywords>
  </n:news>
</url>
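To check how the a>b>c paths in the struct tags pick values out of this structure, here is a minimal sketch run against a made-up two-entry document. The root element name urlset, the namespace URI urn:example:news, and the example.com URLs are assumptions; only the News struct comes from the post. The tag must be written as xml:"..." with no space after the colon, otherwise encoding/xml does not see it.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

type News struct {
	Titles    []string `xml:"url>news>title"`
	Keywords  []string `xml:"url>news>keywords"`
	Locations []string `xml:"url>loc"`
}

// sample is a hypothetical document with the same shape as the entry above.
const sample = `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:n="urn:example:news">
  <url>
    <loc>https://example.com/a.html</loc>
    <n:news><n:title>First story</n:title><n:keywords>courts,law</n:keywords></n:news>
  </url>
  <url>
    <loc>https://example.com/b.html</loc>
    <n:news><n:title>Second story</n:title><n:keywords>politics</n:keywords></n:news>
  </url>
</urlset>`

// parseNews unmarshals one news sitemap into the parallel slices of News.
func parseNews(data []byte) (News, error) {
	var n News
	err := xml.Unmarshal(data, &n)
	return n, err
}

func main() {
	n, err := parseNews([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// The three slices line up by index: entry i is Titles[i], Keywords[i], Locations[i].
	for i := range n.Titles {
		fmt.Println(n.Titles[i], "|", n.Keywords[i], "|", n.Locations[i])
	}
}
```

Because the tags name no namespace, they match on the local element name, which is why news matches the prefixed n:news element.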

If you cannot access that site, you can change the code to read a local XML file instead.

func newsHandler(w http.ResponseWriter, r *http.Request) {

  var s Sitemapindex
  var n News

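  // Fetch the sitemap index XML.
  resp, err := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
  if err != nil {
    http.Error(w, err.Error(), http.StatusBadGateway) // report the failure instead of killing the server
    return
  }

  bytes, err := ioutil.ReadAll(resp.Body)
  resp.Body.Close()
  if err != nil {
    http.Error(w, err.Error(), http.StatusBadGateway)
    return
  }

  // Extract the individual news sitemap URLs into Sitemapindex.Locations.
  if err := xml.Unmarshal(bytes, &s); err != nil || len(s.Locations) == 0 {
    http.Error(w, "could not get the news XML locations", http.StatusBadGateway)
    return
  }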

  // For debugging
/* 	for _, url := range s.Locations {
    log.Println(url)
    fmt.Println("s", url)
  } */

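  // Parse each news sitemap.
  newsMap := make(map[string]NewsMap) // title -> keywords and link for each news item
  for _, Location := range s.Locations {
    resp, err := http.Get(Location)
    if err != nil {
      continue // skip a sitemap that cannot be fetched (resp would be nil)
    }
    bytes, _ := ioutil.ReadAll(resp.Body)
    resp.Body.Close()
    xml.Unmarshal(bytes, &n) // fill News with the entries of this sitemap

    // Store the entries in newsMap.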
    for idx := range n.Keywords {
      newsMap[n.Titles[idx]] = NewsMap{n.Keywords[idx], n.Locations[idx]}
    }

    // For debugging
/*  for kk, vv := range newsMap {
      log.Printf("title: %s\n", kk)
      log.Printf("News: keywords {%s} ", vv.Keyword)
    } */

  }

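  p := NewsPara{Title: "Wow", News: newsMap}
  t, err := template.ParseFiles("complex.html")
  if err != nil {
    http.Error(w, err.Error(), http.StatusInternalServerError)
    return
  }
  // Print the result of Execute to the console (nil on success, an error if rendering fails).
  fmt.Println(t.Execute(w, p))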
}
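For the local-file variant mentioned earlier, a minimal sketch could look like the following. The helper names loadSitemapIndex and writeSample and the temporary file are assumptions for illustration; os.ReadFile needs Go 1.16 or later (ioutil.ReadFile is the older equivalent).

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
	"path/filepath"
)

type Sitemapindex struct {
	Locations []string `xml:"sitemap>loc"`
}

// loadSitemapIndex (hypothetical helper) reads the index from a local file
// instead of calling http.Get.
func loadSitemapIndex(path string) (Sitemapindex, error) {
	var s Sitemapindex
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = xml.Unmarshal(data, &s)
	return s, err
}

// writeSample saves a small sitemap index to a temp directory and returns
// the file path, so the sketch runs without network access.
func writeSample() (string, error) {
	dir, err := os.MkdirTemp("", "sitemap")
	if err != nil {
		return "", err
	}
	path := filepath.Join(dir, "news-sitemap-index.xml")
	const doc = `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://www.washingtonpost.com/news-politics-sitemap.xml</loc></sitemap>
<sitemap><loc>http://www.washingtonpost.com/news-blogs-politics-sitemap.xml</loc></sitemap>
</sitemapindex>`
	return path, os.WriteFile(path, []byte(doc), 0o644)
}

func main() {
	path, err := writeSample()
	if err != nil {
		fmt.Println("write error:", err)
		return
	}
	defer os.RemoveAll(filepath.Dir(path)) // clean up the temp directory

	s, err := loadSitemapIndex(path)
	fmt.Println(len(s.Locations), err) // 2 <nil>
}
```

In the handler, the http.Get and ioutil.ReadAll pair would be swapped for a single file read; the xml.Unmarshal step stays the same.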

To make the table look nicer, you can add some formatting code:

<head>
 <script type="text/javascript" charset="utf8" src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
 <script type="text/javascript" charset="utf8" src="//cdn.datatables.net/1.10.16/js/jquery.dataTables.min.js"></script>
 <link rel="stylesheet" type="text/css" href="//cdn.datatables.net/1.10.16/css/jquery.dataTables.min.css">
</head>

<script>
  $(document).ready( function () {
    $('#mytable').DataTable();
} );
</script>

<table id="mytable" class="display"> ... </table>

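The result looks like this:

The page now displays properly, but it is slow. The sitemaps are downloaded one after another before anything is shown, so the CPU spends most of its time idle, waiting for I/O to finish (that is, for the remote server's response). The downloads can run concurrently instead: several goroutines fetch at the same time, and whichever returns first is written into the map first.

There are many ways to do this. For example, start N goroutines and hand them the work of fetching the XML news content (which goroutine picks up which job does not matter), then have the main goroutine keep reading from a chan (the chan can be given a buffer of size x to avoid blocking) and write each result into newsMap. Only the main goroutine writes.

For the general model, see the thread-pool section of another article of mine: "Golang: Here We Go(5.孰能生巧)"

Any number of goroutines may read XML news, but only one, main, writes the map (to avoid conflicting writes).

main and the other goroutines can coordinate through sync.WaitGroup's Add() and Done(). That is optional, though: reading and writing the same blocking chan already provides the coordination and synchronization.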
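The chan-only variant can be sketched as follows. This is an illustration under assumptions, not the code used later in this post: fetch is a stand-in that simulates the download instead of calling http.Get, and the result type is made up.

```go
package main

import "fmt"

// result is a hypothetical stand-in for one parsed news sitemap.
type result struct {
	title    string
	keywords string
}

// fetch simulates downloading and parsing one sitemap; no network is used.
func fetch(url string) result {
	return result{title: "title of " + url, keywords: "kw"}
}

// collect starts one goroutine per URL and gathers the results on the
// calling goroutine, which is the only writer of the map.
func collect(urls []string) map[string]string {
	ch := make(chan result) // unbuffered: each send blocks until it is received
	for _, u := range urls {
		go func(u string) { ch <- fetch(u) }(u)
	}

	newsMap := make(map[string]string)
	for range urls { // exactly one receive per goroutine started
		r := <-ch
		newsMap[r.title] = r.keywords
	}
	return newsMap
}

func main() {
	m := collect([]string{"a.xml", "b.xml", "c.xml"})
	fmt.Println(len(m)) // 3
}
```

Counting the receives replaces the WaitGroup: the loop ends only after every goroutine has delivered its result, and no close is needed. This relies on every goroutine sending exactly once, even when its fetch fails.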

Here is another approach, using WaitGroup: (a queue plus a WaitGroup, so that the main goroutine waits for the other goroutines to finish their work)


var wg sync.WaitGroup

func newsHandler(w http.ResponseWriter, r *http.Request) {

  var s Sitemapindex

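  // Fetch the sitemap index XML.
  resp, err := http.Get("https://www.washingtonpost.com/news-sitemap-index.xml")
  if err != nil {
    http.Error(w, err.Error(), http.StatusBadGateway) // report the failure instead of killing the server
    return
  }

  bytes, err := ioutil.ReadAll(resp.Body)
  resp.Body.Close()
  if err != nil {
    http.Error(w, err.Error(), http.StatusBadGateway)
    return
  }

  // Extract the individual news sitemap URLs into Sitemapindex.Locations.
  if err := xml.Unmarshal(bytes, &s); err != nil || len(s.Locations) == 0 {
    http.Error(w, "could not get the news XML locations", http.StatusBadGateway)
    return
  }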

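  // One buffer slot per worker, so every send completes even though
  // the channel is only drained after wg.Wait().
  queue := make(chan News, len(s.Locations))

  // Fetching and parsing is split out into workers and runs concurrently.
  newsMap := make(map[string]NewsMap) // title -> keywords and link for each news item
  log.Println(len(s.Locations)) // 22 at the time of writing
  for _, Location := range s.Locations {
    wg.Add(1)
    go worker(queue, Location) // each worker sends the News it parsed into the queue
  }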

  wg.Wait()
  close(queue)

  // Read each News from the chan and write it into the map.
  for elem := range queue {
    /*if len(elem.Keywords)==0 {
      log.Fatal("get the news xml content error") // some links return nothing
    }*/
    for idx := range elem.Keywords {
      newsMap[elem.Titles[idx]] = NewsMap{elem.Keywords[idx], elem.Locations[idx]}
    }
  }

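  p := NewsPara{Title: "Wow", News: newsMap}
  t, err := template.ParseFiles("complex.html")
  if err != nil {
    http.Error(w, err.Error(), http.StatusInternalServerError)
    return
  }
  // Print the result of Execute to the console (nil on success, an error if rendering fails).
  fmt.Println(t.Execute(w, p))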
}

func worker(newsChan chan News, Location string) {
  defer wg.Done()
  var n News
  resp, err := http.Get(Location)
  if err != nil {
    return // nothing to send for a sitemap that cannot be fetched
  }
  bytes, _ := ioutil.ReadAll(resp.Body)
  resp.Body.Close()
  xml.Unmarshal(bytes, &n) // fill News with the entries of this sitemap

  newsChan <- n
}

This is much faster:

Can it be faster still? Probably, by starting more goroutines, for example:

queue := make(chan News, 50)
urlChan := make(chan string, 15)
for i := 0; i < 50; i++ {
  wg.Add(1)
  go worker(queue, urlChan)
}

// Sending the URLs may be a bit slow, so that is moved out as well.
// There are plenty of goroutines, so just keep handing out URLs for them to grab.
for _, Location := range s.Locations {
  //urlChan<- Location
  go dispathURL(urlChan, Location)
}

But the synchronization gets fiddly here. If it is handled badly, some goroutines may block on urlChan forever (because it has not been closed yet).
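One common way to keep this under control is to give each channel exactly one closer. The sketch below is an assumption-laden illustration, not the worker and dispathURL code from this post: fetch simulates the download, and fetchAll and item are made-up names. The dispatcher closes urlChan after the last URL, and a separate goroutine closes the results channel once every worker has returned.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch is a hypothetical stand-in for http.Get plus xml.Unmarshal.
func fetch(url string) string { return "news from " + url }

// fetchAll runs a fixed pool of workers over urls and returns url -> body.
func fetchAll(urls []string, workers int) map[string]string {
	type item struct{ url, body string }
	urlChan := make(chan string)
	results := make(chan item)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urlChan { // ends once urlChan is closed and drained
				results <- item{u, fetch(u)}
			}
		}()
	}

	// Dispatcher: hand out every URL, then close urlChan so idle workers exit.
	go func() {
		for _, u := range urls {
			urlChan <- u
		}
		close(urlChan)
	}()

	// Closer: after all workers have returned, no more results can arrive.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make(map[string]string)
	for it := range results { // only this goroutine writes the map
		out[it.url] = it.body
	}
	return out
}

func main() {
	m := fetchAll([]string{"a.xml", "b.xml", "c.xml", "d.xml"}, 2)
	fmt.Println(len(m), m["a.xml"]) // 4 news from a.xml
}
```

The WaitGroup is local to fetchAll here, so concurrent calls (for example, two HTTP requests at once) do not share a counter.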

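As for concurrency: either understand it thoroughly before writing it, or expect trouble. Knowing only a little plus the syntax makes it very easy to write a buggy concurrent program.

Merlin 2018.3 Golang HTML templates (extending this example, you could actually build an RSS reader)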

Author: Merlin

Published: 2018-03-17, 07:14:23

Last updated: 2018-04-12, 14:55:17

Original link: http://www.merlinblog.site/posts/690caed4/

License: "Attribution-NonCommercial-ShareAlike 4.0". Please keep the original link and author when reposting.