Golang Native Crawler: A Simple Crawler Implementation with No Third-Party Packages, and Easy-to-Understand Principles (Part 1)

Source: Internet
Author: User

On the road of exploring technology, you should build your own wheels. Even if there are plenty of ready-made choices on the market, trying it yourself is necessary. The first attempt will inevitably run into many problems, but don't you think that solving them is very fulfilling? That is what brings greater progress and deeper understanding.

If you have never written a crawler and are interested in how this simple one is implemented, read on.

In fact, implementing a crawler with Golang is a very simple matter. This article shares the simplest way to implement one; the standard library packages used are as follows:

import (
    "fmt"
    "io"
    "io/ioutil"
    "net/http"
    "os"
    "regexp"
    "strconv"
    "strings"
    "time"
)

If you can look at just these packages and already work out what to do, that's impressive.

In order to let the program keep running, we first need a source page, and then continuously crawl and record new links. There are many ways to record them: store them in a database, cache them in Redis, or save them in a text file. The simplest is probably the database, but it depends on your own technical bias. I am going to store the crawled links in a text file.
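Since the article's own storage code appears further down inside fistStart, here is just a minimal, hypothetical sketch of my own (using the os and io packages from the import list above) of what appending one link to ./data/url.txt looks like:

// appendLink appends one crawled URL to the text file, creating the file if needed.
// Illustrative sketch only; the original code does the same thing inline.
func appendLink(link string) error {
    f, err := os.OpenFile("./data/url.txt", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0666)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.WriteString(f, link+"\n")
    return err
}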

First of all, be clear about your crawling goal. I wanted to crawl all Golang-related questions and articles, but after going back and forth over many sites, none of them felt suitable as a source URL. Then I had a flash of inspiration: just Baidu it.

So I used that as the source URL: a Baidu search for "Golang implementation".

With the source URL in hand, the rest should go smoothly. First of all, we need a regular expression to extract the links.

var (
    regHref = `((ht|f)tps?)://[w]{0,3}.baidu.com/link\?[a-zA-Z=0-9-\s]*`
)

Because this regular expression may be reused later, it can be stored in a global variable.
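For example (an illustrative sketch of my own; the sample HTML fragment is made up), the pattern can be compiled once with regexp.MustCompile and reused to pull links out of page source:

// findLinks extracts every Baidu redirect link from a chunk of page source.
// Illustrative only; fistStart below does the same thing inline.
func findLinks(pageSource string) []string {
    reg := regexp.MustCompile(regHref)
    return reg.FindAllString(pageSource, -1)
}

// findLinks(`<a href="http://www.baidu.com/link?url=abc123">result</a>`)
// returns []string{"http://www.baidu.com/link?url=abc123"}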

If a crawler does not limit how often it crawls, your network will certainly not be able to bear it, and if the computer is poorly configured it may even hang. So we need to write a timer, and Golang already provides a timer package:

func Timer() {
    t := time.NewTimer(time.Second * 1)
    <-t.C
    fmt.Print("\n\n\nExecuting crawl\n\n")
    Timer()
}

Why wrap this in a Timer() function? So it can call itself and be invoked wherever needed, of course (insert "manual funny face" here).
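As an aside (my own note, not from the original article), the same once-per-second pacing can also be written without recursion, for example with a time.Ticker. A minimal sketch, where crawl is a placeholder for whatever work each pass does:

// runEverySecond fires the crawl once per second until the program exits.
// Alternative pacing sketch; the original article uses the recursive Timer above.
func runEverySecond(crawl func()) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    for range ticker.C {
        fmt.Print("\n\n\nExecuting crawl\n\n")
        crawl()
    }
}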

We have two cases: the first crawl and every later crawl do different things. How do we tell them apart? Since our links are stored in a txt file, we only need to check whether that txt file is empty. If it is empty, we treat this as the first run of the program and fetch the source URL; otherwise we follow the links already in the file.

The code is as follows:

func main() {
    if checkFile("./data/", "url.txt").Size() == 0 {
        fistStart()
        main()
    } else {
        Timer()
    }
}
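Note that checkFile is a helper that is never shown in this article; judging from the call, it apparently opens (or creates) ./data/url.txt and returns its os.FileInfo so that main can check Size(). A minimal sketch of what such a helper might look like (my assumption, not the original code):

// checkFile is a hypothetical reconstruction: make sure the directory and file
// exist, then return the file's os.FileInfo so the caller can inspect Size().
func checkFile(dir, name string) os.FileInfo {
    os.MkdirAll(dir, 0755)
    f, err := os.OpenFile(dir+name, os.O_CREATE|os.O_RDONLY, 0666)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    info, err := f.Stat()
    if err != nil {
        panic(err)
    }
    return info
}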

So let's take a look at the fistStart() function; the code is explained afterwards:

func fistStart() {
    var num int
    url := "http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=39042058_20_oem_dg&wd=golang%E5%AE%9E%E7%8E%B0&oq=golang%2520%25E5%2588%25A0%25E9%2599%25A4%25E6%2595%25B0%25E7%25BB%2584&rsv_pq=d9be28ec0002df1b&rsv_t=8017GWpSLPhDmKilZQ1StC04EVpUAeLEP90NIm%2Bk5pRh5R9o57NHMO8Gaxm1TtSOo%2FvtJj%2B98%2Fsc&rqlang=cn&rsv_enter=1&inputT=3474&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&rsv_sug2=0&rsv_sug4=4230"
    resp, _ := http.Get(url)
    defer resp.Body.Close()
    body, _ := ioutil.ReadAll(resp.Body)
    reg := regexp.MustCompile(`((ht|f)tps?)://[w]{0,3}.baidu.com/link\?[a-zA-Z=0-9-\s]*`)
    f, _ := os.OpenFile("./data/url.txt", os.O_CREATE|os.O_APPEND|os.O_RDWR, 0666)
    defer f.Close()
    for _, d := range reg.FindAllString(string(body), -1) {
        ff, _ := os.OpenFile("./data/url.txt", os.O_RDWR, 0666)
        file, _ := ioutil.ReadAll(ff)
        dd := strings.Split(d, "")
        dddd := ""
        for _, ddd := range dd {
            if ddd == "?" {
                ddd = "\\?"
            }
            dddd += ddd
        }
        if checkRegexp(string(file), dddd, 0).(string) == "" {
            io.WriteString(f, d+"\n")
            fmt.Print("\nCollected address: " + d + "\n")
            num++
        }
        // fmt.Print(string(file))
        ff.Close()
    }
    fmt.Print("\nAddresses found on the first crawl: " + strconv.Itoa(len(reg.FindAllString(string(body), -1))) + "\n")
    fmt.Print("\nNew addresses stored: " + strconv.Itoa(num))
    fmt.Print("\n\nFirst storage succeeded!\n")
}

Sorry, I am not in the habit of writing comments.

Put simply, it starts a GET request, gets back data of type []byte, and after converting that to a string you have the source code of the web page.

Let's break it down (skip this part if you already understand the principle):

url := "http://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=39042058_20_oem_dg&wd=golang%E5%AE%9E%E7%8E%B0&oq=golang%2520%25E5%2588%25A0%25E9%2599%25A4%25E6%2595%25B0%25E7%25BB%2584&rsv_pq=d9be28ec0002df1b&rsv_t=8017GWpSLPhDmKilZQ1StC04EVpUAeLEP90NIm%2Bk5pRh5R9o57NHMO8Gaxm1TtSOo%2FvtJj%2B98%2Fsc&rqlang=cn&rsv_enter=1&inputT=3474&rsv_sug3=16&rsv_sug1=11&rsv_sug7=100&rsv_sug2=0&rsv_sug4=4230"  resp, _ := http.Get(url)  defer resp.Body.Close()  body, _ := ioutil.ReadAll(resp.Body)  reg := regexp.MustCompile(`((ht|f)tps?)://[w]{0,3}.baidu.com/link\?[a-zA-z=0-9-\s]*`)  f, _ := os.OpenFile("./data/url.txt", os.O_CREATE|os.O_APPEND|os.O_RDWR, 0666)  defer f.Close()  

This part mainly issues a GET network request, converts the []byte data from the response into string data, and then runs the regular expression over it to get an array of matching links (I won't go into more detail here; if you are not familiar with HTTP requests, you can look them up separately).
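One thing worth pointing out (my addition, not part of the original code): the snippet above ignores every error. A minimal sketch of the same GET-and-read step with the errors checked could look like this:

// fetchPage performs the GET request and returns the page source as a string,
// propagating errors instead of discarding them.
func fetchPage(url string) (string, error) {
    resp, err := http.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}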

for _, d := range reg.FindAllString(string(body), -1) {
    ff, _ := os.OpenFile("./data/url.txt", os.O_RDWR, 0666)
    file, _ := ioutil.ReadAll(ff)
    dd := strings.Split(d, "")
    dddd := ""
    for _, ddd := range dd {
        if ddd == "?" {
            ddd = "\\?"
        }
        dddd += ddd
    }
    if checkRegexp(string(file), dddd, 0).(string) == "" {
        io.WriteString(f, d+"\n")
        fmt.Print("\nCollected address: " + d + "\n")
        num++
    }
    // fmt.Print(string(file))
    ff.Close()
}

Looping over that array, we first escape the special character in each link (the ? becomes \?), then use the checkRegexp function to deduplicate, so the same link is not recorded multiple times and resources are not wasted, and finally write the link into the txt file.
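As a side note (my addition, not from the original article), the standard library's regexp.QuoteMeta escapes every regular-expression metacharacter, not just the ?, so the hand-rolled inner loop could be replaced with something like:

    // Escape all regexp metacharacters in the link before using it as a pattern.
    pattern := regexp.QuoteMeta(d) // e.g. http://www\.baidu\.com/link\?url=abc123
    if checkRegexp(string(file), pattern, 0).(string) == "" {
        // ...record the link as before
    }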

The checkRegexp function:

func checkRegexp(cont string, reg string, style int) (result interface{}) {
    check := regexp.MustCompile(reg)
    switch style {
    case 0:
        result = check.FindString(cont)
    case 1:
        result = check.FindAllString(cont, -1)
    default:
        result = check.FindAll([]byte(cont), -1)
    }
    return
}
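For reference (my own example, not from the original article), this is how the different style values behave; note that the caller has to type-assert the interface{} result, just as fistStart does with .(string):

// Hypothetical usage of checkRegexp, e.g. inside main():
content := "http://www.baidu.com/link?url=aaa http://www.baidu.com/link?url=bbb"
first := checkRegexp(content, `link\?url=[a-z]+`, 0).(string)   // "link?url=aaa"
all := checkRegexp(content, `link\?url=[a-z]+`, 1).([]string)   // both matches
fmt.Println(first, len(all))                                    // link?url=aaa 2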

At this point, the first run of the program is complete, and the crawled links have been successfully recorded.

The next article will cover how to filter out the useless content from the pages behind these recorded links. If you have any doubts about the code above, or find a big bug, you can add me on QQ (625366394) and we can solve it together.

First article address: https://blog.csdn.net/superwebmaster/article/details/80319502

Attached code example download address: https://download.csdn.net/download/superwebmaster/10415730
