A hands-on web crawler in Golang


Objective

Crawling was originally Python's strong suit. I studied Scrapy early on and wrote a few small, simple crawlers, but then I got interested in Golang and decided to write a crawler for practice. As a Golang newbie I am bound to make mistakes, so corrections are welcome.

General idea

    • Since many pages today are dynamic, I use WebDriver to drive Chrome (or another browser) to finish rendering the page before scraping the data. (I started with PhantomJS, but it is no longer maintained and not very efficient.)
    • Crawlers generally run on Linux systems, so I use Chrome's headless mode.
    • The scraped data is written to a CSV file and then sent out by email.

Shortcomings

    • Because pages must be rendered, the crawl is much slower; even with image loading disabled, speed is not ideal.
    • Because I am just starting out, I did not add concurrency, for fear of blowing up memory.
    • The data is not written to a database, and a CSV file is, after all, not a final solution.

Required libraries

    • github.com/tebeka/selenium
      • A Golang port of Selenium; it implements most of the functionality.
    • gopkg.in/gomail.v2
      • A mail-sending library; it has not been updated in a long time, but it is good enough.

Downloading dependencies

    • I originally intended to manage dependencies with dep, but that tool has quite a few pitfalls that I never figured out, so I gave up on it for now.
    • Download the dependencies via go get:
go get github.com/tebeka/selenium
go get gopkg.in/gomail.v2
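
As an aside, the now-standard alternative to dep is Go modules (available since Go 1.11). A minimal sketch, using crawler-demo as a made-up module name:

go mod init crawler-demo
go get github.com/tebeka/selenium
go get gopkg.in/gomail.v2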

Code implementation

    • Launch chromedriver and control the Chrome browser from code.
// StartChrome launches Chrome in headless mode.
// service and webDriver are package-level, so they can be cleaned up in main;
// the chrome package is github.com/tebeka/selenium/chrome.
func StartChrome() {
	opts := []selenium.ServiceOption{}
	caps := selenium.Capabilities{
		"browserName": "chrome",
	}
	// Disable image loading to speed up rendering.
	imagCaps := map[string]interface{}{
		"profile.managed_default_content_settings.images": 2,
	}
	chromeCaps := chrome.Capabilities{
		Prefs: imagCaps,
		Path:  "",
		Args: []string{
			"--headless", // put Chrome into headless mode
			"--no-sandbox",
			"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7", // fake the User-Agent to dodge anti-crawling checks
		},
	}
	caps.AddChrome(chromeCaps)
	// Start chromedriver; the port number can be customized.
	var err error
	service, err = selenium.NewChromeDriverService("/opt/google/chrome/chromedriver", 9515, opts...)
	if err != nil {
		log.Printf("Error starting the ChromeDriver server: %v", err)
	}
	// Connect a WebDriver session to the Chrome instance.
	webDriver, err = selenium.NewRemote(caps, fmt.Sprintf("http://localhost:%d/wd/hub", 9515))
	if err != nil {
		panic(err)
	}
	// A pit left by the target site: without this cookie, Linux clients are served
	// the mobile page. Every site's strategy differs, so handle each case separately.
	webDriver.AddCookie(&selenium.Cookie{
		Name:  "defaultJumpDomain",
		Value: "www",
	})
	// Navigate to the target site.
	err = webDriver.Get(urlBeijing)
	if err != nil {
		panic(fmt.Sprintf("Failed to load page: %s\n", err))
	}
	log.Println(webDriver.Title())
}

With the code above, we can launch Chrome from code and navigate to the target site, ready for the data extraction that follows.

    • Initialize the CSV file that will store the data
// SetupWriter initializes the CSV file.
func SetupWriter() {
	// The reference layout string is fixed in Go; it is said to be the language's
	// birth time, a bit of Google humor...
	dateTime = time.Now().Format("2006-01-02 15:04:05")
	os.Mkdir("data", os.ModePerm)
	// csvFile is package-level so it can be closed in main.
	var err error
	csvFile, err = os.Create(fmt.Sprintf("data/%s.csv", dateTime))
	if err != nil {
		panic(err)
	}
	// Write a UTF-8 BOM so spreadsheet tools decode the file correctly.
	csvFile.WriteString("\xEF\xBB\xBF")
	writer = csv.NewWriter(csvFile)
	writer.Write([]string{"Model", "Mileage", "First registered", "Price", "Location", "Store"})
}

Data fetching

This part is the core business logic. Each site must be scraped in its own way, but the idea is always the same: locate elements by XPath, CSS selector, class name, tag name, and so on, and read out their content. The selenium API implements most of the operations you need; reading the selenium source shows that the core API consists of WebDriver and WebElement. Below I walk through how I scrape the Beijing used-car data; for other sites, the same process applies with adjustments.
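
Since those two types are the backbone of everything that follows, here is a minimal sketch of them in use. The function name and the selectors are placeholders of my own, not taken from the target site:

// sketchSelectors demonstrates the core tebeka/selenium calls used in this article.
func sketchSelectors(wd selenium.WebDriver) {
	// WebDriver locates elements; the selenium.By* constants cover XPath,
	// CSS selectors, class names, tag names, and more.
	elem, err := wd.FindElement(selenium.ByXPATH, "//*[@id=\"content\"]")
	if err != nil {
		log.Println(err)
		return
	}
	text, _ := elem.Text()                                      // WebElement exposes the element's visible text
	items, _ := elem.FindElements(selenium.ByClassName, "item") // children can be located relative to a parent
	log.Println(text, len(items))
	if link, err := wd.FindElement(selenium.ByCSSSelector, "a.next"); err == nil {
		href, _ := link.GetAttribute("href") // ...and any HTML attribute
		log.Println(href)
		link.Click() // WebElement also supports interactions such as clicking
	}
}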

    • Open the 二手车之家 (che168) used-car site in Safari and grab the link to the Beijing used-car home page:
const urlBeijing = "https://www.che168.com/beijing/list/#pvareaid=104646"
    • Right-click the page and choose "Inspect Element" to enter developer mode, where you can see all the data for this section:
<ul class="fn-clear certification-list" id="viewlist_ul">

Point the mouse at the element, right-click, and choose 拷贝 (Copy) > Copy XPath to get the element's XPath:

//*[@id="viewlist_ul"]

Then, in code:

listContainer, err := webDriver.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")

we obtain the WebElement object for that block of HTML. It is not hard to see that this is the parent container of all the data; to get at the specific data we need to locate each child element. In developer mode you can see that

each listing item in the container carries the class carinfo. Since there are multiple of them, use

lists, err := listContainer.FindElements(selenium.ByClassName, "carinfo")

This returns the collection of all child elements; to extract the data from each one, we iterate over the collection:

for i := 0; i < len(lists); i++ {
	var urlElem selenium.WebElement
	// On the first page the listing items start at li[13]; on later pages at li[1].
	if pageIndex == 1 {
		urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+13))
	} else {
		urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+1))
	}
	if err != nil {
		break
	}
	// Some of the data lives on a secondary page, so follow the link.
	url, err := urlElem.GetAttribute("href")
	if err != nil {
		break
	}
	webDriver.Get(url)
	title, _ := webDriver.Title()
	log.Printf("Current page title: %s\n", title)
	// Get the vehicle model.
	modelElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/h2")
	var model string
	if err != nil {
		log.Println(err)
		model = "N/A"
	} else {
		model, _ = modelElem.Text()
	}
	log.Printf("model=[%s]\n", model)
	...
	// Write the row to the CSV file.
	writer.Write([]string{model, miles, date, price, position, store})
	writer.Flush()
	webDriver.Back() // go back to the listing page and repeat for the next item
}

The full source code is below. I am a beginner, so please go easy on me ~~

// StartCrawler begins crawling the data.
func StartCrawler() {
	log.Println("Start crawling at", time.Now().Format("2006-01-02 15:04:05"))
	pageIndex := 0
	for {
		listContainer, err := webDriver.FindElement(selenium.ByXPATH, "//*[@id=\"viewlist_ul\"]")
		if err != nil {
			panic(err)
		}
		lists, err := listContainer.FindElements(selenium.ByClassName, "carinfo")
		if err != nil {
			panic(err)
		}
		log.Println("Number of items:", len(lists))
		pageIndex++
		log.Printf("Fetching page %d...\n", pageIndex)
		for i := 0; i < len(lists); i++ {
			var urlElem selenium.WebElement
			if pageIndex == 1 {
				urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+13))
			} else {
				urlElem, err = webDriver.FindElement(selenium.ByXPATH, fmt.Sprintf("//*[@id='viewlist_ul']/li[%d]/a", i+1))
			}
			if err != nil {
				break
			}
			url, err := urlElem.GetAttribute("href")
			if err != nil {
				break
			}
			webDriver.Get(url)
			title, _ := webDriver.Title()
			log.Printf("Current page title: %s\n", title)

			// Vehicle model.
			modelElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/h2")
			var model string
			if err != nil {
				log.Println(err)
				model = "N/A"
			} else {
				model, _ = modelElem.Text()
			}
			log.Printf("model=[%s]\n", model)

			// Price.
			priceElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[2]/div/ins")
			var price string
			if err != nil {
				log.Println(err)
				price = "N/A"
			} else {
				price, _ = priceElem.Text()
				price = fmt.Sprintf("%s万", price) // append the 万 (10,000 CNY) unit
			}
			log.Printf("price=[%s]\n", price)

			// Mileage; some detail pages use a slightly different layout, so fall back to a second XPath.
			milesElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[1]/span")
			var miles string
			if err != nil {
				log.Println(err)
				milesElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[1]/span")
				if err != nil {
					log.Println(err)
					miles = "N/A"
				} else {
					miles, _ = milesElem.Text()
				}
			} else {
				miles, _ = milesElem.Text()
			}
			log.Printf("miles=[%s]\n", miles)

			// First registration date, with the same fallback.
			timeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[2]/span")
			var date string
			if err != nil {
				log.Println(err)
				timeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[2]/span")
				if err != nil {
					log.Println(err)
					date = "N/A"
				} else {
					date, _ = timeElem.Text()
				}
			} else {
				date, _ = timeElem.Text()
			}
			log.Printf("time=[%s]\n", date)

			// Location, with the same fallback.
			positionElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[4]/ul/li[4]/span")
			var position string
			if err != nil {
				log.Println(err)
				positionElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[3]/ul/li[4]/span")
				if err != nil {
					log.Println(err)
					position = "N/A"
				} else {
					position, _ = positionElem.Text()
				}
			} else {
				position, _ = positionElem.Text()
			}
			log.Printf("position=[%s]\n", position)

			// Store name, stripping the site's label prefixes.
			storeElem, err := webDriver.FindElement(selenium.ByXPATH, "/html/body/div[5]/div[2]/div[1]/div/div/div")
			var store string
			if err != nil {
				log.Println(err)
				store = "N/A"
			} else {
				store, _ = storeElem.Text()
				store = strings.Replace(store, "商家|", "", -1) // strip the "merchant |" prefix
				if strings.Contains(store, "金牌店铺") {        // remove the "gold store" badge
					store = strings.Replace(store, "金牌店铺", "", -1)
				}
			}
			log.Printf("store=[%s]\n", store)

			writer.Write([]string{model, miles, date, price, position, store})
			writer.Flush()
			webDriver.Back() // back to the listing page for the next item
		}
		log.Printf("Page %d done, moving on to the next page...\n", pageIndex)
		nextButton, err := webDriver.FindElement(selenium.ByClassName, "page-item-next")
		if err != nil {
			log.Println("All data crawled!")
			break
		}
		nextButton.Click()
	}
	log.Println("Crawling finished at", time.Now().Format("2006-01-02 15:04:05"))
	sendResult(dateTime)
}
    • Send mail

The complete code is below; it is fairly simple, so I won't go through it line by line.

func sendResult(fileName string) {
	email := gomail.NewMessage()
	email.SetAddressHeader("From", "re**ng@163.com", "张**")
	email.SetHeader("To", email.FormatAddress("li**yang@163.com", "李**"))
	email.SetHeader("Cc", email.FormatAddress("zhang**tao@163.net", "张**"))
	email.SetHeader("Subject", "二手车之家 - Beijing - used-car listings")
	email.SetBody("text/plain;charset=UTF-8", "This week's crawled used-car data, please check!\n")
	email.Attach(fmt.Sprintf("data/%s.csv", fileName))
	dialer := &gomail.Dialer{
		Host:     "smtp.163.com",
		Port:     25,
		Username: ${your_email},    // replace with your own email address
		Password: ${smtp_password}, // your SMTP server authorization password
		SSL:      false,
	}
	if err := dialer.DialAndSend(email); err != nil {
		log.Println("Failed to send mail! err: ", err)
		return
	}
	log.Println("Mail sent successfully!")
}
    • Finally, remember to release resources
defer service.Stop()   // stop chromedriver
defer webDriver.Quit() // close the browser
defer csvFile.Close()  // close the file stream
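
To tie the pieces together, here is a minimal sketch of a main function; the article never shows its own main, so the wiring below is my assumption about the call order:

func main() {
	StartChrome()          // start chromedriver + headless Chrome and open the listing page
	defer service.Stop()   // stop chromedriver
	defer webDriver.Quit() // close the browser
	SetupWriter()          // create data/<timestamp>.csv and write the header row
	defer csvFile.Close()  // close the file stream
	StartCrawler()         // crawl every page; sendResult mails the CSV at the end
}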

Summary

    • I am new to Golang and took up this crawler project purely as practice; the code is rough and hardly engineered at all. I hope reading it is not too painful.
    • Since there were almost no other Golang crawler projects to learn from, much of this is my own exploration; I hope it can help others too.
    • Finally, a recommendation: Pholcus, a crawler framework written by an expert; it is powerful and currently one of the more complete frameworks.
