1. Preface
In the previous article, From Zero to One: Writing a Robot in Golang, we already built a small robot of our own.
This article explains how my robot, Samaritan, implements its movie-searching skill.
This technique is shared for learning purposes only; please respect film and television copyrights.
2. Clarify requirements and prepare
When we want to download a movie, we:
- Enter movie name
- Find related pages
- Find a download resource hyperlink
- Copy link address for final download
And for the robot, the steps become:
- Identify the user's input
- Find resource links and format them
- Output the formatted results
Do steps 1 and 3 look familiar? In fact, this is the dialogue process implemented in the previous article, except that we no longer let the robot reply "freestyle": we tell it exactly what to reply.
So all we need to do is teach the robot how to search the Web for information and which information we need. The best way is to teach by example: let the robot learn and imitate every action we perform throughout the process.
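The three robot steps above can be sketched as one small pipeline. This is a minimal sketch, not Samaritan's actual code: `findResources` and the message format are hypothetical stand-ins for the real crawlers described below.

```go
package main

import (
	"fmt"
	"strings"
)

// findResources is a hypothetical stand-in for the real site crawlers.
func findResources(movie string) []string {
	return []string{"magnet:?xt=demo-" + movie}
}

// handleMessage walks the three steps: read input, find resources, format output.
func handleMessage(input string) string {
	movie := strings.TrimSpace(input) // step 1: identify the user's input
	links := findResources(movie)     // step 2: find resource links
	// step 3: format and output the results
	return fmt.Sprintf("Results for *%s*:\n%s", movie, strings.Join(links, "\n"))
}

func main() {
	fmt.Println(handleMessage(" Star Wars 7 "))
}
```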
3. Get and parse resources
Take the movie "Star Wars 7" (星球大战7) as an example, with Dragon Tribe (lbldy.com) as the resource site; the goal is to find an available download link.
In the following steps, we record the browser operations that lead us to the final link.
3.1 Search for "Star Wars 7"
Search page Display
For the robot, this means sending a request to http://www.lbldy.com/search/星球大战7 and receiving the page in response:

```go
movie := "星球大战7"
resp, _ := http.Get("http://www.lbldy.com/search/" + movie)
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
```

Error handling is omitted here. The value of `body` is the source of the page we just saw in the browser, which can also be seen by inspecting elements in the browser:
3.2 Find the first result link
Right-click and copy the link address: http://www.lbldy.com/movie/64115.html
The only variable part is the number 64115, which also appears in the page's source code:
```html
<div class="postlist" id="post-64115">
```
A bold guess: we only need to extract the number from id="post-64115". A regular expression is the simplest tool for this:
```go
re, _ := regexp.Compile("<div class=\"postlist\" id=\"post-(.*?)\">")
firstId := re.FindSubmatch(body) // find the first match
```
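To make the extraction concrete, here is a self-contained sketch of the same regular expression applied to a fabricated fragment of the search page (the `extractID` helper is illustrative, not part of Samaritan):

```go
package main

import (
	"fmt"
	"regexp"
)

var idRe = regexp.MustCompile(`<div class="postlist" id="post-(.*?)">`)

// extractID pulls the post number out of the search page's HTML.
func extractID(body []byte) string {
	m := idRe.FindSubmatch(body) // m[0] is the whole match, m[1] the captured number
	if m == nil {
		return ""
	}
	return string(m[1])
}

func main() {
	// A fabricated fragment of the search-result page.
	body := []byte(`<div class="postlist" id="post-64115"><a href="/movie/64115.html">Star Wars 7</a></div>`)
	fmt.Println(extractID(body)) // prints "64115"
}
```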
3.3 Go to the resource download page
At this point the browser displays the following content:
Download page
Review elements:
You can see the download addresses; the next step is to have the robot extract all the relevant links.
Since we found the movie ID in the previous step, we let the robot access this page as well:
```go
resp, _ = http.Get("http://www.lbldy.com/movie/" + id + ".html")
defer resp.Body.Close()
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	return
}
```
Although regular expressions could still be used to find the download links, the goquery library is better suited to handling more complex HTML pages.
```go
doc.Find("p").Each(func(i int, selection *goquery.Selection) {
	name := selection.Find("a").Text()
	link, _ := selection.Find("a").Attr("href")
	if strings.HasPrefix(link, "ed2k") || strings.HasPrefix(link, "magnet") || strings.HasPrefix(link, "thunder") {
		m := Media{
			Name: name,
			Link: link,
		}
		ms = append(ms, m)
	}
})
```
By parsing the HTML tags with goquery, we obtain the full list of download results.
3.4 Copy the download link
The robot then returns the results it found to the user through a channel:
```go
if len(ms) == 0 {
	results <- fmt.Sprintf("No results for *%s* from LBL", movie)
	return
} else {
	ret := "Results from LBL:\n\n"
	for i, m := range ms {
		ret += fmt.Sprintf("*%s*\n```%s```\n\n", m.Name, m.Link)
		// when the result list is long, we split it
		if i%4 == 0 && i < len(ms)-1 && i > 0 {
			results <- ret
			ret = fmt.Sprintf("*LBL Part %d*\n\n", i/4+1)
		}
	}
	results <- ret
}
```
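The splitting condition `i%4 == 0 && i > 0` sends partial messages so a long result list does not produce one oversized reply. The same batching idea can be sketched standalone; the `batch` helper below is illustrative, not part of Samaritan:

```go
package main

import "fmt"

// batch splits items into chunks of at most n, mirroring the i%4 splitting above.
func batch(items []string, n int) [][]string {
	var out [][]string
	for len(items) > n {
		out = append(out, items[:n])
		items = items[n:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	links := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i"}
	for i, part := range batch(links, 4) {
		fmt.Printf("Part %d: %v\n", i+1, part)
	}
}
```

Running this prints three parts: two full chunks of four links and a final chunk with the remaining one.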
At this point we get a response from the robot:
LBL Partial results
4. Get resources from more sites
We often search several resource sites for the same resource. When searching for movies, Samaritan also queries the subtitle group site (zimuzu.tv) in addition to Dragon Tribe.
The search process on the subtitle group site is similar to Dragon Tribe's, except that it requires logging in. So the robot must log in first and carry a cookie when requesting the resource pages:
```go
// zmz.tv needs to log in before downloading
var zmzClient http.Client

func loginZMZ() {
	gCookieJar, _ := cookiejar.New(nil)
	zmzURL := "http://www.zimuzu.tv/User/Login/ajaxLogin"
	zmzClient = http.Client{
		Jar: gCookieJar,
	}
	zmzClient.PostForm(zmzURL, url.Values{"account": {"username"}, "password": {"password"}, "remember": {"0"}})
}
```
After logging in via the cookiejar, zmzClient carries the user's cookie on subsequent visits, allowing it to access pages that require login.
Fetching the same movie's resources from the subtitle group gives:
ZMZ Partial results
5. Return results faster
When we have several resource sites A, B, C, ..., the code is likely to look like this:

```go
func DownloadMovie() {
	retA := getResourceFromA()
	retB := getResourceFromB()
	retC := getResourceFromC()
	...
	return retA + retB + retC
}
```
Ideally, we want to get the resources in parallel and return them to the user as soon as we have the results.
With Golang's CSP concurrency model and goroutines, writing a concurrent version is not difficult:
```go
func DownloadMovie(results chan<- string) {
	var wg sync.WaitGroup
	wg.Add(3)
	go func() {
		defer wg.Done()
		results <- getResourceFromA()
	}()
	go func() {
		defer wg.Done()
		results <- getResourceFromB()
	}()
	go func() {
		defer wg.Done()
		results <- getResourceFromC()
	}()
	wg.Wait()
	close(results)
}
```
And the caller simply keeps receiving from the channel:
```go
func() {
	results := make(chan string)
	go DownloadMovie(results)
	for {
		msg, ok := <-results // retrieve a result from the channel
		if !ok {
			return
		}
		reply(msg)
	}
}
```
This allows the user to receive a response as soon as the first result is ready. This is the subtlety of pairing goroutines with channels.
6. Summary
In the previous article we built a small robot that can chat; this article explains one of a robot's common skills: fetching resources (a crawler).
Starting from our existing knowledge, we analyzed and clarified the goal: accept user input, find resources, and output them to the user.
Then, taking movie resources as an example, we let the robot imitate the user's operations step by step until it finds the resource links.
Not satisfied with that, we proposed two improvements: on the functional side, getting resources from more sites; on the non-functional side, returning results faster through Golang's concurrency features.
Source Reference
Have fun!