1. Preface
In the previous article, From Zero to One: Writing a Robot in Golang, we already built a small robot of our own.
This article explains how my robot, Samaritan, implements its movie-searching skill.
This technique is shared for learning purposes only; please respect film and television copyrights.
2. Clarify requirements and prepare
When we want to download a movie, we:
- Enter movie name
- Find related pages
- Find a download resource hyperlink
- Copy link address for final download
And for the robot, the steps become:
- Identify the user's input
- Find resource links and format them
- Output the formatted results
Do steps 1 and 3 look familiar? In fact, this is the dialogue process implemented in the previous article, except that we no longer let the robot reply "freestyle": we tell it exactly what to reply.
So all we need to do is teach the robot how to search the Web for information and which information we need. The best way is to teach by example: let the robot learn and imitate every action we perform throughout the process.
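The three robot steps above can be sketched as one small pipeline. This is a minimal sketch, not Samaritan's actual code: `findResources` and the message format are hypothetical stand-ins for the real crawlers described below.

```go
package main

import (
	"fmt"
	"strings"
)

// findResources is a hypothetical stand-in for the real site crawlers.
func findResources(movie string) []string {
	return []string{"magnet:?xt=demo-" + movie}
}

// handleMessage walks the three steps: read input, find resources, format output.
func handleMessage(input string) string {
	movie := strings.TrimSpace(input) // step 1: identify the user's input
	links := findResources(movie)     // step 2: find resource links
	// step 3: format and output the results
	return fmt.Sprintf("Results for *%s*:\n%s", movie, strings.Join(links, "\n"))
}

func main() {
	fmt.Println(handleMessage(" Star Wars 7 "))
}
```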
3. Get and parse resources
Take the movie "Star Wars 7" (星球大战7) as an example, with Dragon Tribe (lbldy.com) as the resource site; the goal is to find an available download link.
In the following steps, we record the browser operations that lead us to the final link.
3.1 Search for "Star Wars 7"
Search page Display
For the robot, this means sending a request to http://www.lbldy.com/search/星球大战7 and receiving the page in response:

```go
movie := "星球大战7"
resp, _ := http.Get("http://www.lbldy.com/search/" + movie)
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
```

Error handling is omitted here. The value of `body` is the source of the page we just saw in the browser, which can also be seen by inspecting elements in the browser:
3.2 Find the first result link
Right-click and copy the link address: http://www.lbldy.com/movie/64115.html
The only variable part is the number 64115, which also appears in the page's source code:
```html
<div class="postlist" id="post-64115">
```
A bold guess: we only need to extract the number from id="post-64115". A regular expression is the simplest tool for this:
```go
re, _ := regexp.Compile("<div class=\"postlist\" id=\"post-(.*?)\">")
firstId := re.FindSubmatch(body) // find the first match
```
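To make the extraction concrete, here is a self-contained sketch of the same regular expression applied to a fabricated fragment of the search page (the `extractID` helper is illustrative, not part of Samaritan):

```go
package main

import (
	"fmt"
	"regexp"
)

var idRe = regexp.MustCompile(`<div class="postlist" id="post-(.*?)">`)

// extractID pulls the post number out of the search page's HTML.
func extractID(body []byte) string {
	m := idRe.FindSubmatch(body) // m[0] is the whole match, m[1] the captured number
	if m == nil {
		return ""
	}
	return string(m[1])
}

func main() {
	// A fabricated fragment of the search-result page.
	body := []byte(`<div class="postlist" id="post-64115"><a href="/movie/64115.html">Star Wars 7</a></div>`)
	fmt.Println(extractID(body)) // prints "64115"
}
```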
3.3 Go to the resource download page
At this point the browser displays the following content:
Download page
Review elements:
You can see the download addresses; the next step is to have the robot extract all the relevant links.
Since we found the movie ID in the previous step, we let the robot access this page as well:
```go
resp, _ = http.Get("http://www.lbldy.com/movie/" + id + ".html")
defer resp.Body.Close()
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	return
}
```
Although regular expressions could still be used to find the download links, the goquery library is better suited to handling more complex HTML pages.
```go
doc.Find("p").Each(func(i int, selection *goquery.Selection) {
	name := selection.Find("a").Text()
	link, _ := selection.Find("a").Attr("href")
	if strings.HasPrefix(link, "ed2k") || strings.HasPrefix(link, "magnet") || strings.HasPrefix(link, "thunder") {
		m := Media{
			Name: name,
			Link: link,
		}
		ms = append(ms, m)
	}
})
```
By parsing the HTML tags with goquery, we obtain the full list of download results.
3.4 Copy the download link
The robot then returns the results it found to the user through a channel:
```go
if len(ms) == 0 {
	results <- fmt.Sprintf("No results for *%s* from LBL", movie)
	return
} else {
	ret := "Results from LBL:\n\n"
	for i, m := range ms {
		ret += fmt.Sprintf("*%s*\n```%s```\n\n", m.Name, m.Link)
		// when the result list is long, we split it
		if i%4 == 0 && i < len(ms)-1 && i > 0 {
			results <- ret
			ret = fmt.Sprintf("*LBL Part %d*\n\n", i/4+1)
		}
	}
	results <- ret
}
```
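The splitting condition `i%4 == 0 && i > 0` sends partial messages so a long result list does not produce one oversized reply. The same batching idea can be sketched standalone; the `batch` helper below is illustrative, not part of Samaritan:

```go
package main

import "fmt"

// batch splits items into chunks of at most n, mirroring the i%4 splitting above.
func batch(items []string, n int) [][]string {
	var out [][]string
	for len(items) > n {
		out = append(out, items[:n])
		items = items[n:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	links := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i"}
	for i, part := range batch(links, 4) {
		fmt.Printf("Part %d: %v\n", i+1, part)
	}
}
```

Running this prints three parts: two full chunks of four links and a final chunk with the remaining one.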
At this point we get a response from the robot:
LBL Partial results
4. Get resources from more sites
We often search several resource sites for the same resource. When searching for movies, Samaritan also queries the subtitle group site (zimuzu.tv) in addition to Dragon Tribe.
The search process on the subtitle group site is similar to Dragon Tribe's, except that it requires logging in. So the robot must log in first and carry a cookie when requesting the resource pages:
```go
// zmz.tv needs to log in before downloading
var zmzClient http.Client

func loginZMZ() {
	gCookieJar, _ := cookiejar.New(nil)
	zmzURL := "http://www.zimuzu.tv/User/Login/ajaxLogin"
	zmzClient = http.Client{
		Jar: gCookieJar,
	}
	zmzClient.PostForm(zmzURL, url.Values{"account": {"username"}, "password": {"password"}, "remember": {"0"}})
}
```
After logging in via the cookiejar, zmzClient carries the user's cookie on subsequent visits, allowing it to access pages that require login.
Fetching the same movie's resources from the subtitle group gives:
ZMZ Partial results
5. Return results faster
When we have several resource sites A, B, C, ..., the code is likely to look like this:

```go
func DownloadMovie() {
	retA := getResourceFromA()
	retB := getResourceFromB()
	retC := getResourceFromC()
	...
	return retA + retB + retC
}
```
Ideally, we want to get the resources in parallel and return them to the user as soon as we have the results.
With Golang's CSP concurrency model and goroutines, writing a concurrent version is not difficult:
```go
func DownloadMovie(results chan<- string) {
	var wg sync.WaitGroup
	wg.Add(3)
	go func() {
		defer wg.Done()
		results <- getResourceFromA()
	}()
	go func() {
		defer wg.Done()
		results <- getResourceFromB()
	}()
	go func() {
		defer wg.Done()
		results <- getResourceFromC()
	}()
	wg.Wait()
	close(results)
}
```
And the caller simply keeps receiving from the channel:
```go
func() {
	results := make(chan string)
	go DownloadMovie(results)
	for {
		msg, ok := <-results // retrieve a result from the channel
		if !ok {
			return
		}
		reply(msg)
	}
}
```
This allows the user to receive a response as soon as the first result is ready. This is the subtlety of pairing goroutines with channels.
6. Summary
In the previous article we built a small robot that can chat; this article explains one of a robot's common skills: fetching resources (a crawler).
Starting from our existing knowledge, we analyzed and clarified the goal: accept user input, find resources, and output them to the user.
Then, taking movie resources as an example, we let the robot imitate the user's operations step by step until it finds the resource links.
Not satisfied with that, we proposed two improvements: on the functional side, getting resources from more sites; on the non-functional side, returning results faster through Golang's concurrency features.
Source Reference
Have fun!