Using Scrapy to crawl company information data

Requirement analysis: the starting address to crawl is http://www.jobui.com/cmp, and the target is the information on each company's details page. First we need to get the list of all companies and have the program page through automatically by extracting the link to the next page; then we get the URL of each company's details page, send a request to it, and extract the data we want from the details page.

First, create the crawler project from the command line:

scrapy startproject ZhiYouJi
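For reference, this command generates a project skeleton roughly like the following (the exact files may vary slightly with the Scrapy version):

ZhiYouJi/
    scrapy.cfg            # deploy configuration
    ZhiYouJi/
        __init__.py
        items.py          # item definitions (edited below)
        middlewares.py
        pipelines.py      # item pipelines (edited below)
        settings.py       # project settings
        spiders/          # spider files go here
            __init__.py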

Next, write the project's items to define the fields to save. Find a company with a relatively complete information page; the data to save is the information circled in red in the figure. Then edit the items.py file:

import scrapy


class ZhiyoujiItem(scrapy.Item):
    # Company name
    name = scrapy.Field()
    # Number of views
    views = scrapy.Field()
    # Company nature
    type = scrapy.Field()
    # Company size
    size = scrapy.Field()
    # Industry
    industry = scrapy.Field()
    # Company abbreviation
    abbreviation = scrapy.Field()
    # Company info
    info = scrapy.Field()
    # Praise degree
    praise = scrapy.Field()
    # Salary range
    salary_range = scrapy.Field()
    # Company products
    products = scrapy.Field()
    # Financing situation
    financing_situation = scrapy.Field()
    # Company rank
    rank = scrapy.Field()
    # Company address
    address = scrapy.Field()
    # Company website
    website = scrapy.Field()
    # Company contacts
    contacts = scrapy.Field()
    # QQ
    qq = scrapy.Field()

After writing the items file, we can create the crawler file. Here I use CrawlSpider. Before creating the crawler file from the command line, we need to cd into the ZhiYouJi folder and then run:

scrapy genspider -t crawl zhiyouji jobui.com

After creating the crawler file, we open the project in PyCharm.

With the preparation done, the next question is how to write the crawler and how to obtain the data.

Open the zhiyouji.py file under the spiders directory, and let's start with some analysis.

First we need to set our starting URL, so modify start_urls in the file:

start_urls = ['http://www.jobui.com/cmp']

The site's data is paginated, so first we need to get the URL of the next page.

Through analysis, we find that the next-page URLs follow the pattern /cmp?n=page number#listInter, so we can extract the next-page links with a regular expression.

# Rule to get the URL of the next page
Rule(LinkExtractor(allow=r'/cmp\?n=\d+#listInter'), follow=True),

With the next-page URLs handled, the next step is to get the details-page URLs. By inspecting the pages, we find that the detail links follow the pattern /company/number/, so we can match the details-page URL with a regular expression as well.

# Rule to get the details-page URL
Rule(LinkExtractor(allow=r'/company/\d+/$'), callback='parse_item', follow=False),
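Putting the two rules together, a minimal sketch of the spider could look like the following; the module path ZhiYouJi.items and the class and spider names are assumptions based on the genspider command used above:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ZhiYouJi.items import ZhiyoujiItem


class ZhiyoujiSpider(CrawlSpider):
    name = 'zhiyouji'
    allowed_domains = ['jobui.com']
    start_urls = ['http://www.jobui.com/cmp']

    rules = (
        # Follow pagination links but do not parse them
        Rule(LinkExtractor(allow=r'/cmp\?n=\d+#listInter'), follow=True),
        # Send each company details page to parse_item
        Rule(LinkExtractor(allow=r'/company/\d+/$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = ZhiyoujiItem()
        # ... field extraction, shown in full below ...
        yield item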

The second rule sets a callback function, parse_item, and that is where we extract the data we want. The code is actually quite readable: it uses XPath to pull out each piece of data. If you are unfamiliar with XPath, there are plenty of tutorials online. The code is below:

def parse_item(self, response):
    # Instantiate the item object
    item = ZhiyoujiItem()

    # Use XPath to extract the data
    # Company name
    item['name'] = response.xpath('//*[@id="companyH1"]/a/text()').extract_first()
    # Number of views (strip the trailing Chinese character '人')
    item['views'] = response.xpath('//div[@class="grade cfix sbox"]/div[1]/text()').extract_first().split(u'人')[0].strip()

    """Some company details pages have no picture,
    so the structure of the page is slightly different"""
    # Company nature
    try:
        item['type'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[0]
    except:
        item['type'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[0]

    # Company size
    try:
        item['size'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[1]
    except:
        item['size'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[1]

    # Industry
    item['industry'] = response.xpath('//dd[@class="comInd"]/a[1]/text()').extract_first()
    # Company abbreviation
    item['abbreviation'] = response.xpath('//dl[@class="j-edit hasVist dlLi mb10"]/dd[3]/text()').extract_first()
    # Company info
    item['info'] = ''.join(response.xpath('//*[@id="textShowMore"]/text()').extract())
    # Praise degree
    item['praise'] = response.xpath('//div[@class="swf-contA"]/div/h3/text()').extract_first()
    # Salary range
    item['salary_range'] = response.xpath('//div[@class="swf-contB"]/div/h3/text()').extract_first()
    # Company products
    item['products'] = response.xpath('//div[@class="mb5"]/a/text()').extract()

    # Financing situation
    data_list = []
    node_list = response.xpath('//div[5]/ul/li')
    for node in node_list:
        temp = {}
        # Financing date
        temp['date'] = node.xpath('./span[1]/text()').extract_first()
        # Financing status
        temp['status'] = node.xpath('./h3/text()').extract_first()
        # Financing amount
        temp['sum'] = node.xpath('./span[2]/text()').extract_first()
        # Investors
        temp['investors'] = node.xpath('./span[3]/text()').extract_first()
        data_list.append(temp)
    item['financing_situation'] = data_list

    # Company rank
    data_list = []
    node_list = response.xpath('//div[@class="fs18 honor-box"]/div')
    for node in node_list:
        temp = {}
        key = node.xpath('./a/text()').extract_first()
        temp[key] = int(node.xpath('./span[2]/text()').extract_first())
        data_list.append(temp)
    item['rank'] = data_list

    # Company address
    item['address'] = response.xpath('//dl[@class="dlLi fs16"]/dd[1]/text()').extract_first()
    # Company website
    item['website'] = response.xpath('//dl[@class="dlLi fs16"]/dd[2]/a/text()').extract_first()
    # Company contacts
    item['contacts'] = response.xpath('//div[@class="j-shower1 dn"]/dd/text()').extract_first()
    # QQ number
    item['qq'] = response.xpath('//dd[@class="cfix"]/span/text()').extract_first()

    # for k, v in item.items():
    #     print(k, v)
    # print('****************************************')
    yield item

A point to note in the code above: some company details pages have a picture in the company profile section and some do not, so an XPath written for one layout may fail to match anything on pages with the other layout. Both cases have to be handled:

"" "
   Some company details page has no picture
   so the structure of the page is somewhat different" "
        # Company Nature
Try:
  item[' type ' = Response.xpath ('//div[@class = "Cfix fs16"]/dl/dd[1]/text ()). Extract_first (). Split ('/') [0]
except:
  item[' type '] = Response.xpath ('//*[ @id = "Cmp-intro"]/div/div/dl/dd[1]/text ()). Extract_first (). Split ('/') [0]


# company Size

try:
   item[' size '] = Response.xpath ('//div[@class = "Cfix fs16"]/dl/dd[1]/text () "). Extract_first (). Split ('/') [1]
except:
  item[' size '] = Response.xpath ('//*[@id = ' Cmp-intro ']/div/div/dl/dd[1]/text () '). Extract_first (). Split ('/') [1]
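As a side note (not from the original code), one way to avoid repeating these try/except pairs is a small helper, here hypothetically named first_by_xpath, that tries several XPath expressions in order and returns the first match:

def first_by_xpath(response, *queries):
    """Return the first non-None extract_first() result among the queries."""
    for query in queries:
        value = response.xpath(query).extract_first()
        if value is not None:
            return value
    return None

# Possible usage inside parse_item:
# raw = first_by_xpath(
#     response,
#     '//div[@class="cfix fs16"]/dl/dd[1]/text()',
#     '//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()',
# )
# if raw:
#     parts = raw.split('/')
#     item['type'] = parts[0]
#     item['size'] = parts[1] if len(parts) > 1 else None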

Everywhere else it is just ordinary XPath extraction of the data we want.

Once the data is extracted, we save it as JSON. Open the pipeline file pipelines.py:

import json


class ZhiyoujiPipeline(object):

    def open_spider(self, spider):
        self.file = open('zhiyouji.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        data = json.dumps(dict(item), ensure_ascii=False, indent=2)
        self.file.write(data)
        return item

    def close_spider(self, spider):
        self.file.close()
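One thing to be aware of: writing the result of json.dumps() for every item back to back produces one JSON object glued to the next, which most parsers will not read as a single document. A common alternative, sketched here rather than taken from the original post, is to write one JSON object per line (JSON Lines):

import json


class ZhiyoujiPipeline(object):

    def open_spider(self, spider):
        self.file = open('zhiyouji.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # One JSON object per line; the whole file is easy to stream-parse later
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()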

To have the pipeline save the data, enable it in the settings file:

ITEM_PIPELINES = {
    # Priority value; any integer from 0 to 1000 works, 300 is the conventional default
    'ZhiYouJi.pipelines.ZhiyoujiPipeline': 300,
}

Then run the crawler with the command:

scrapy crawl zhiyouji
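As an aside, for a quick run Scrapy's built-in feed export can also dump items to a JSON file without a custom pipeline, for example:

scrapy crawl zhiyouji -o zhiyouji.json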

Because there was too much data, I stopped the crawl partway through. The resulting JSON file contains part of the saved data.

I later converted this project into a scrapy-redis distributed crawler, which will be covered in the next post.
