Teaching your robot to find resource links, step by step

Source: Internet
Author: User

1. Preface

After the previous article, "From Zero to One: Writing Robots in Golang", we can already write a small robot of our own.

This article explains how my own robot, Samaritan, implements its movie-finding skill.

This technique is shared for learning and exchange only; please respect film and television copyright.

2. Clarify the requirements and prepare

When we want to download a movie:

    1. Enter movie name
    2. Find related pages
    3. Find a download resource hyperlink
    4. Copy link address for final download

For the robot, the process becomes:

    1. Identify the user's input
    2. Find the resource links and format them
    3. Output the formatted results

Do the 1st and 3rd steps look familiar? They are the dialogue process implemented in the previous article, except that we no longer let the robot "play freely" but tell it what to reply.

So all we need to do is teach the robot how to search the Web and which information we need. The best way is to teach by example: let the robot learn and imitate every action we perform throughout the process.

3. Get and Parse resources

We take the movie "Star Wars 7" as the example and Dragon Tribe (www.lbldy.com) as the resource site. The goal is to find the available download links.

The steps below follow what we do in the browser to reach the final link.

3.1 Search for "Star Wars 7"


Search page Display

For the robot, this means requesting http://www.lbldy.com/search/星球大战7 and reading the page that comes back:

    movie := "星球大战7"
    resp, _ := http.Get("http://www.lbldy.com/search/" + movie)
    defer resp.Body.Close()
    body, _ := ioutil.ReadAll(resp.Body)

Error handling is omitted here. The value of body is the source of the page we just saw in the browser, which can also be viewed with the browser's element inspector.
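If the omitted error handling is wanted, a minimal sketch might look like the following. The helper names searchURL and fetch are hypothetical (they do not appear in the original code), and the use of url.PathEscape to encode the movie name is an assumption about how the request should be built. A local test server stands in for the real site so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// searchURL builds the search address for a movie name.
// url.PathEscape percent-encodes spaces and non-ASCII characters.
func searchURL(base, movie string) string {
	return base + url.PathEscape(movie)
}

// fetch downloads a page and returns its body, reporting errors
// instead of discarding them.
func fetch(pageURL string) ([]byte, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s for %s", resp.Status, pageURL)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// A local stand-in for the resource site; it only echoes the path.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "you searched for %s", r.URL.Path)
	}))
	defer srv.Close()

	body, err := fetch(searchURL(srv.URL+"/search/", "星球大战7"))
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

Checking the status code as well as the error keeps an error page from being parsed as if it were a search result.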


3.2 Find the first result link

Right-click and copy the link address: http://www.lbldy.com/movie/64115.html
The only variable part is the number 64115, which also appears in the source of the page:

    <div class="postlist" id="post-64115">

A bold guess: we only need to extract the number from id="post-64115", which is fairly simple with a regular expression:

    re, _ := regexp.Compile("<div class=\"postlist\" id=\"post-(.*?)\">")
    firstId := re.FindSubmatch(body) // first match; nil when nothing matches
    if firstId == nil {
        return
    }
    id := string(firstId[1]) // the captured number, e.g. "64115"
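The extraction step can be wrapped in a small function that is easy to try on a sample page. This is only a sketch: the function name firstPostID is hypothetical, and the pattern uses \d+ instead of .*? as a small tightening that is assumed here, not taken from the original code.

```go
package main

import (
	"fmt"
	"regexp"
)

// postIDPattern captures the digits in id="post-NNNN". Using \d+ rather
// than .*? is an assumption of this sketch, not the article's pattern.
var postIDPattern = regexp.MustCompile(`<div class="postlist" id="post-(\d+)">`)

// firstPostID returns the ID of the first search result, and false when
// the page contains no result at all.
func firstPostID(body []byte) (string, bool) {
	m := postIDPattern.FindSubmatch(body)
	if m == nil {
		return "", false
	}
	// m[0] is the whole match, m[1] is the capture group.
	return string(m[1]), true
}

func main() {
	page := []byte(`<div class="postlist" id="post-64115"><a href="/movie/64115.html">...</a></div>`)
	if id, ok := firstPostID(page); ok {
		fmt.Println("http://www.lbldy.com/movie/" + id + ".html")
	}
}
```

Returning a boolean alongside the ID lets the caller reply "no results" instead of failing when the search page is empty.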

3.3 Go to the resource download page

At this point the browser displays:


Download page

Inspecting the elements:


The download addresses are already visible, so the next step is to have the robot extract all the relevant links.
In the previous step we found the movie ID, so the robot can request the same page:

    resp, _ = http.Get("http://www.lbldy.com/movie/" + id + ".html")
    defer resp.Body.Close()
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return
    }

Although regular expressions could still find the download links, here we use the goquery library, which copes better with more complex HTML pages.

    // keep every <p> whose link is an ed2k, magnet, or thunder address
    doc.Find("p").Each(func(i int, selection *goquery.Selection) {
        name := selection.Find("a").Text()
        link, _ := selection.Find("a").Attr("href")
        if strings.HasPrefix(link, "ed2k") || strings.HasPrefix(link, "magnet") || strings.HasPrefix(link, "thunder") {
            m := Media{
                Name: name,
                Link: link,
            }
            ms = append(ms, m)
        }
    })

By parsing the HTML tags with goquery, we obtain the list of all download results.
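The prefix check inside the callback can be pulled out into a helper that is testable without goquery or network access. The sketch below uses only the standard library. The Media fields follow the code above, while isDownloadLink, filterMedia, and the sample links are hypothetical names and data introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Media mirrors the struct used in the article: a display name and a link.
type Media struct {
	Name string
	Link string
}

// downloadSchemes lists the link prefixes the article treats as downloads.
var downloadSchemes = []string{"ed2k", "magnet", "thunder"}

// isDownloadLink reports whether link starts with one of the schemes.
func isDownloadLink(link string) bool {
	for _, s := range downloadSchemes {
		if strings.HasPrefix(link, s) {
			return true
		}
	}
	return false
}

// filterMedia keeps only the entries whose link is a download link.
func filterMedia(all []Media) []Media {
	var ms []Media
	for _, m := range all {
		if isDownloadLink(m.Link) {
			ms = append(ms, m)
		}
	}
	return ms
}

func main() {
	// Made-up candidates, as they might come out of the parsed page.
	candidates := []Media{
		{Name: "HD 1080p", Link: "magnet:?xt=urn:btih:example"},
		{Name: "About us", Link: "http://www.lbldy.com/about"},
		{Name: "HD 720p", Link: "ed2k://|file|example|"},
	}
	for _, m := range filterMedia(candidates) {
		fmt.Println(m.Name, m.Link)
	}
}
```

Keeping the schemes in one slice means supporting another link type is a one-line change.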

3.4 Copy the download link

The robot returns the results it found to the user through a channel:

    if len(ms) == 0 {
        results <- fmt.Sprintf("No results for *%s* from LBL", movie)
        return
    } else {
        ret := "Results from LBL:\n\n"
        for i, m := range ms {
            ret += fmt.Sprintf("*%s*\n```%s```\n\n", m.Name, m.Link)
            // when the results are too large, we split them
            if i%4 == 0 && i < len(ms)-1 && i > 0 {
                results <- ret
                ret = fmt.Sprintf("*LBL Part %d*\n\n", i/4+1)
            }
        }
        results <- ret
    }

We can get a response from the robot at this point:


LBL Partial results
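The splitting of long result lists can also be written as a pure function that returns the messages instead of sending them, which makes the chunking easy to check. This is a simplified variant under stated assumptions: every message holds at most perMsg entries (slightly different from the i%4 check above), the link is printed on its own line without the code-span markup, and the name formatResults is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Media mirrors the struct used in the article.
type Media struct {
	Name string
	Link string
}

// formatResults renders ms as chat messages with at most perMsg entries
// each. Simplified variant: all messages share the same size limit.
func formatResults(site string, ms []Media, perMsg int) []string {
	if len(ms) == 0 {
		return []string{fmt.Sprintf("No results from %s", site)}
	}
	if perMsg < 1 {
		perMsg = 1 // guard against a non-positive chunk size
	}
	var msgs []string
	for start := 0; start < len(ms); start += perMsg {
		end := start + perMsg
		if end > len(ms) {
			end = len(ms)
		}
		var b strings.Builder
		if start == 0 {
			fmt.Fprintf(&b, "Results from %s:\n\n", site)
		} else {
			fmt.Fprintf(&b, "*%s Part %d*\n\n", site, start/perMsg+1)
		}
		for _, m := range ms[start:end] {
			fmt.Fprintf(&b, "*%s*\n%s\n\n", m.Name, m.Link)
		}
		msgs = append(msgs, b.String())
	}
	return msgs
}

func main() {
	// Made-up entries for illustration.
	ms := []Media{
		{Name: "1080p", Link: "magnet:?xt=urn:btih:example-1"},
		{Name: "720p", Link: "magnet:?xt=urn:btih:example-2"},
		{Name: "480p", Link: "ed2k://|file|example-3|"},
	}
	for _, msg := range formatResults("LBL", ms, 2) {
		fmt.Print(msg)
	}
}
```

Each returned string can then be sent on the results channel in order.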

4. Get resources from more sites

We often search several resource sites for the same resource. When searching for movies, Samaritan also fetches results from the subtitle group site (ZMZ, zimuzu.tv) in addition to Dragon Tribe.

The search process for the subtitle group is similar to Dragon Tribe's, except that it requires logging in. The robot therefore has to log in first and carry the cookie with it before it can fetch the resources:

    // zmz.tv needs to login before downloading
    var zmzClient http.Client

    func loginZMZ() {
        gCookieJar, _ := cookiejar.New(nil)
        zmzURL := "http://www.zimuzu.tv/User/Login/ajaxLogin"
        zmzClient = http.Client{
            Jar: gCookieJar,
        }
        zmzClient.PostForm(zmzURL, url.Values{"account": {"username"}, "password": {"password"}, "remember": {"0"}})
    }

After logging in with cookiejar, zmzClient carries the user's cookie on subsequent requests, so it can access pages that require login.
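To see the cookie being carried from the login request to a later one, the behaviour can be tried against a local stand-in server. Everything about the server below is an assumption made for illustration: the /login and /resource paths, the "session" cookie, and the helper names are hypothetical and say nothing about how the real site works.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
)

// login posts the credentials and returns a client whose cookie jar
// holds whatever cookies the server set in its response.
func login(loginURL, account, password string) (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	client := &http.Client{Jar: jar}
	resp, err := client.PostForm(loginURL, url.Values{
		"account":  {account},
		"password": {password},
	})
	if err != nil {
		return nil, err
	}
	resp.Body.Close() // the body is not needed, only the cookies
	return client, nil
}

// newFakeSite is a stand-in server (not the real site): /login sets a
// session cookie and /resource only answers when that cookie is present.
func newFakeSite() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "ok", Path: "/"})
	})
	mux.HandleFunc("/resource", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err != nil || c.Value != "ok" {
			http.Error(w, "login required", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "resource list")
	})
	return httptest.NewServer(mux)
}

// fetchAfterLogin logs in to the fake site and reads the protected page.
func fetchAfterLogin() (string, error) {
	srv := newFakeSite()
	defer srv.Close()
	client, err := login(srv.URL+"/login", "username", "password")
	if err != nil {
		return "", err
	}
	resp, err := client.Get(srv.URL + "/resource")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchAfterLogin()
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println(body)
}
```

The second request never mentions the cookie explicitly; the jar attached to the client adds it automatically.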
The same movie, with resources fetched from the subtitle group:


ZMZ Partial results

5. Return results faster

When we have several resource sites A, B, and C, the code we write is likely to be:

    func DownloadMovie() string {
        retA := getResourceFromA()
        retB := getResourceFromB()
        retC := getResourceFromC()
        ...
        return retA + retB + retC
    }

Ideally, we want to fetch the resources in parallel and return each result to the user as soon as it is available.
With Golang's CSP concurrency model and goroutines, a concurrent version is not difficult to write:

    func DownloadMovie(results chan<- string) {
        var wg sync.WaitGroup
        wg.Add(3)
        go func() {
            defer wg.Done()
            results <- getResourceFromA()
        }()
        go func() {
            defer wg.Done()
            results <- getResourceFromB()
        }()
        go func() {
            defer wg.Done()
            results <- getResourceFromC()
        }()
        wg.Wait()
        close(results)
    }

The caller simply keeps receiving from the channel:

    func() {
        results := make(chan string)
        go DownloadMovie(results)
        for {
            msg, ok := <-results // retrieve a result from the channel
            if !ok {
                return
            }
            reply(msg)
        }
    }

This lets the user receive a response as soon as the first site returns. That is the beauty of pairing goroutines with channels.
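The same pattern generalizes to any number of sites. The sketch below is one possible shape, not the author's code: fanIn is a hypothetical helper, and the three inline functions are placeholders standing in for getResourceFromA, getResourceFromB, and getResourceFromC.

```go
package main

import (
	"fmt"
	"sync"
)

// fanIn runs every fetcher in its own goroutine and returns a channel
// that yields each result as soon as it is ready. The channel is closed
// once all fetchers have finished.
func fanIn(fetchers ...func() string) <-chan string {
	results := make(chan string)
	var wg sync.WaitGroup
	wg.Add(len(fetchers))
	for _, fetch := range fetchers {
		go func(f func() string) {
			defer wg.Done()
			results <- f()
		}(fetch)
	}
	// Close the channel in a separate goroutine so fanIn returns at once.
	go func() {
		wg.Wait()
		close(results)
	}()
	return results
}

func main() {
	// Placeholder fetchers; real ones would query the resource sites.
	results := fanIn(
		func() string { return "results from A" },
		func() string { return "results from B" },
		func() string { return "results from C" },
	)
	for msg := range results { // the loop ends when the channel is closed
		fmt.Println(msg)
	}
}
```

Ranging over the channel replaces the explicit ok check, and the order of the printed lines depends on which fetcher finishes first.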

6. Summary

In the previous article we built a small robot that can talk; this article explains one of a robot's common skills: crawling resources (a crawler).
Starting from what we already knew, we analyzed the task and clarified the goal: accept user input, find resources, and output them to the user.
Then, taking movie resources as the example, we let the robot simulate the user's actions step by step until it found the resource links.
We did not stop there and proposed two optimizations. On the functional side, we fetch resources from more sites; on the non-functional side, we return results faster using Golang's concurrency features.

Source Reference
Have fun!
