Crawling Lagou Job Information with Scrapy


Many sites use a technique called AJAX (asynchronous loading). When you open such a page, it first shows you part of the content and then loads the rest gradually, i.e. it loads locally. That is why on many sites the URL in the browser's address bar never changes, yet the data on the page keeps updating. This gets in the way of crawling the data directly, and we have to work out the real target address before we can fetch the information successfully.

Today we will crawl exactly this kind of site. The target URL is: https://www.lagou.com/zhaopin/

One: the target URL

With the target address above, and using the setup introduced in the previous article, we can easily build the crawler skeleton.

My spider file code:

# -*- coding: utf-8 -*-
import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com/zhaopin/"]
    start_urls = ['https://www.lagou.com/zhaopin/']

    def parse(self, response):
        # save the downloaded page so we can inspect it locally
        file = open("lagou.html", "w")
        file.write(response.body)
        file.close()
        print response.body

Then open the lagou.html file. The page looks a bit rough without its styling, but that does not matter; you can already see some of the information.

The position information here is the same as what the page shows, so it is easy to crawl. Yes, the home page could in fact be crawled the same way as before, but that is not the data we are after; we want to capture the position listings under specific filter conditions.

Here we first open the developer tools.

When we select a filter condition, the earlier address no longer yields the information, and the URL in the address bar changes to: https://www.lagou.com/jobs/list_?px=new&city=%E6%9D%AD%E5%B7%9E&district=%E8%A5%BF%E6%B9%96%E5%8C%BA#filterBox. After that, however, choosing other conditions does not change the URL at all.

So it is natural to suspect that the data is fetched by a network request issued from JavaScript via AJAX.

Under the Network panel we try to narrow things down by typing "json" into the filter box.

Two resources look particularly promising, and one of them even has "position" right in its name. We right-click it and choose "Open link in new tab" to take a look.

The content returned there matches the content on the page, so we can conclude that the URL we need is:
http://www.lagou.com/jobs/positionAjax.json. To it we can append parameters such as:

gj=fresh graduate&xl=junior college&jd=growth stage&hy=mobile internet&px=new&city=Shanghai

(In the real request the values are the URL-encoded Chinese strings, just like %E6%9D%AD%E5%B7%9E in the address above.) By modifying these parameters we can retrieve different sets of job information.
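As an illustration (not taken from the original post), such a query string can be rebuilt with a few lines of Python; the values used here are the ones this article works with later (Hangzhou, West Lake District), and %E6%9D%AD%E5%B7%9E is simply the URL encoding of 杭州:

# -*- coding: utf-8 -*-
# a minimal sketch of how the query string in the address bar is formed
from urllib import urlencode   # Python 2; in Python 3: from urllib.parse import urlencode

params = {
    'px': 'new',
    'city': u'杭州'.encode('utf-8'),       # Hangzhou
    'district': u'西湖区'.encode('utf-8'),  # West Lake District
}
# the encoded values match what the browser shows in the address bar
print 'https://www.lagou.com/jobs/positionAjax.json?' + urlencode(params)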

Note: the structure here is relatively simple. Some URLs are more complicated than this one and often carry parameters such as id= whose meaning is not obvious. The possible values of such an id may come from another file, which you then have to hunt down, or they may be buried somewhere in the page's source code.

There is also the case where something like time= shows up; this is a timestamp, and you have to construct it with the time functions. In short, each case needs its own analysis.

import time
time.time()   # current time as seconds since the epoch
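For example, if a request carried a millisecond timestamp parameter (the parameter name t and the URL below are only illustrations, not something taken from Lagou), it could be built like this:

import time

t = int(time.time() * 1000)                      # millisecond timestamp
url = 'https://example.com/data.json?t=%d' % t   # hypothetical endpoint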
Two: writing the spider

1. Crawling the first page

Let's look at the returned JSON data structure:
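A simplified sketch of the layout, inferred from the fields the parsing code below reads (the real response contains many more fields, and the numbers are only example values):

# not the full response -- only the parts this article uses
{
    "content": {
        "pageSize": 15,                  # positions per page
        "positionResult": {
            "totalCount": 450,           # total number of matching positions
            "result": [                  # one dict per position on this page
                {
                    "city": "...",
                    "companyFullName": "...",
                    "companySize": "...",
                    "positionName": "...",
                    "secondType": "...",
                    "salary": "...",
                },
            ],
        },
    },
}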

We write the code that parses the JSON data following this hierarchy.

First, import the json module:

import json

Spider file code:

# -*- coding: utf-8 -*-
import json

import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com/zhaopin/"]
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=new&city=%E6%9D%AD%E5%B7%9E&district=%E8%A5%BF%E6%B9%96%E5%8C%BA&needAddtionalResult=false']

    def parse(self, response):
        # print response.body
        jdict = json.loads(response.body)
        jcontent = jdict["content"]
        jposresult = jcontent["positionResult"]
        jresult = jposresult["result"]
        for each in jresult:
            print each['city']
            print each['companyFullName']
            print each['companySize']
            print each['positionName']
            print each['secondType']
            print each['salary']
            print ''

Run it and take a look at the output:
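Assuming the project skeleton was created with scrapy startproject as in the previous article (that part is not shown here), the spider is started from the project directory with the command scrapy crawl position, where position is the name defined in the spider.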

2. Crawling more pages

Now that we can crawl the first page of data, let's look at the details of the request:

The browser's developer tools show that this is a POST request whose parameters are submitted as form data. Here we will simulate that request.

Override the spider's start_requests method and use FormRequest to issue the POST requests; by changing the xrange bounds we can download any range of pages we want. The code is as follows:

# -*- coding: utf-8 -*-
import json

import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com/zhaopin/"]
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json?px=new&city=%E6%9D%AD%E5%B7%9E&district=%E8%A5%BF%E6%B9%96%E5%8C%BA&needAddtionalResult=false']
    city = u'杭州'        # Hangzhou
    district = u'西湖区'  # West Lake District
    url = 'https://www.lagou.com/jobs/positionAjax.json'

    def start_requests(self):
        for num in xrange(1, 5):
            form_data = {'pn': str(num), 'city': self.city, 'district': self.district}
            # note: this headers dict is defined but not passed to the request below
            headers = {'Host': 'www.jycinema.com',
                       'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8',
                       'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
            yield scrapy.FormRequest(self.url, formdata=form_data, callback=self.parse)
        # requests = []
        # for num in xrange(1, 5):
        #     requests.append(scrapy.FormRequest(self.url, method='post',
        #                                        formdata={'pn': str(num), 'city': self.city, 'district': self.district},
        #                                        callback=self.parse))
        # return requests

    def parse(self, response):
        # print response.body
        jdict = json.loads(response.body)
        jcontent = jdict["content"]
        jposresult = jcontent["positionResult"]
        jresult = jposresult["result"]
        for each in jresult:
            print each['city']
            print each['companyFullName']
            print each['companySize']
            print each['positionName']
            print each['secondType']
            print each['salary']
            print ''

Running the program, we successfully crawl all the job information on pages 1-4.

No screenshot of the data is provided here, because the data changes constantly; if you test it yourself, it will certainly differ from mine.

3. Automatic paging

# -*- coding: utf-8 -*-
import json

import scrapy


class PositionSpider(scrapy.Spider):
    name = "position"
    # allowed_domains = ["lagou.com/zhaopin/"]
    start_urls = ['https://www.lagou.com/jobs/positionAjax.json']
    totalPageCount = 0
    curpage = 1
    city = u'杭州'        # Hangzhou
    district = u'西湖区'  # West Lake District
    url = 'https://www.lagou.com/jobs/positionAjax.json'
    # set download delay
    # download_delay = 10

    def start_requests(self):
        # the fixed-range version from the previous section, kept as a comment:
        # for num in xrange(1, 5):
        #     yield scrapy.FormRequest(self.url,
        #                              formdata={'pn': str(num), 'city': self.city, 'district': self.district},
        #                              callback=self.parse)
        return [scrapy.FormRequest(self.url,
                                   formdata={'pn': str(self.curpage),
                                             'city': self.city,
                                             'district': self.district},
                                   callback=self.parse)]

    def parse(self, response):
        # print response.body
        # print response.body.decode('utf-8')
        print str(self.curpage) + " page"
        jdict = json.loads(response.body)
        jcontent = jdict['content']
        jposresult = jcontent["positionResult"]
        pageSize = jcontent["pageSize"]
        jresult = jposresult["result"]
        self.totalPageCount = jposresult['totalCount'] / pageSize + 1
        for each in jresult:
            print each['city']
            print each['companyFullName']
            print each['companySize']
            print each['positionName']
            print each['secondType']
            print each['salary']
            print ''
        # keep requesting the next page until we run out of pages
        if self.curpage <= self.totalPageCount:
            self.curpage += 1
            yield scrapy.http.FormRequest(self.url,
                                          formdata={'pn': str(self.curpage),
                                                    'city': self.city,
                                                    'district': self.district},
                                          callback=self.parse)

Finally, if you want to save the data, refer to the previous article.
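One simple option (this uses Scrapy's standard feed export, and is not necessarily what the referenced article does) is to yield dictionaries from parse instead of printing them:

    # a sketch of a parse() that yields items instead of printing them
    def parse(self, response):
        jdict = json.loads(response.body)
        jresult = jdict["content"]["positionResult"]["result"]
        for each in jresult:
            yield {
                'city': each['city'],
                'company': each['companyFullName'],
                'position': each['positionName'],
                'salary': each['salary'],
            }

Running scrapy crawl position -o positions.json then writes everything the spider yields to positions.json.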

There are also anti-crawler countermeasures to think about, for example using a user agent pool. In the Scrapy shell you can inspect the user agent that a request is sending.

When the shell has loaded, you get a local response variable holding the response data, as well as a request variable. Typing response.body prints the response body, and request.headers shows the request headers.
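A typical session (using the listing page from earlier) looks like this:

scrapy shell "https://www.lagou.com/zhaopin/"
>>> request.headers        # the request headers, including the User-Agent that was sent
>>> response.body[:200]    # the first part of the returned page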

Anti-crawler strategies (a sketch of the first three follows below):
- set DOWNLOAD_DELAY
- disable cookies
- use a user agent pool
- use an IP (proxy) pool
- distributed crawling
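A minimal sketch of the first three items, assuming a project named lagou (the project name and middleware path are assumptions; adjust them to your own project):

# settings.py
DOWNLOAD_DELAY = 2          # wait between requests
COOKIES_ENABLED = False     # do not keep cookies across requests
DOWNLOADER_MIDDLEWARES = {
    'lagou.middlewares.RandomUserAgentMiddleware': 400,
}

# middlewares.py
import random

class RandomUserAgentMiddleware(object):
    """Pick a random User-Agent for every outgoing request (example strings only)."""
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Version/10.0 Safari/537.36',
    ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)

An IP pool works the same way through a proxy middleware that sets request.meta['proxy'], and distributed crawling is usually built on something like scrapy-redis.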

The source code for this project has been uploaded to GitHub.
