Requirement analysis: we use Scrapy to crawl company information from http://www.jobui.com/cmp. What we want is the data on each company's detail page, so the crawler first needs the list of all companies, must page through that list automatically by extracting the link to the next page, collect the URL of each company's detail page, send requests to those detail-page URLs, and finally extract the desired data from the detail pages.
First, create the crawler project from the command line:

scrapy startproject ZhiYouJi
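For reference, startproject generates the standard Scrapy project layout, roughly the following (the inner package name follows the project name; middlewares.py only appears in newer Scrapy versions):

ZhiYouJi/
    scrapy.cfg
    ZhiYouJi/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py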
Next, write the project's items and define the fields to be saved. The figure below shows a company page with relatively complete information; the fields to save are the ones circled in red. Now edit the items.py file:
import scrapy


class ZhiyoujiItem(scrapy.Item):
    # company name
    name = scrapy.Field()
    # page views
    views = scrapy.Field()
    # company type (nature of the company)
    type = scrapy.Field()
    # company size
    size = scrapy.Field()
    # industry
    industry = scrapy.Field()
    # company abbreviation
    abbreviation = scrapy.Field()
    # company profile
    info = scrapy.Field()
    # praise rating
    praise = scrapy.Field()
    # salary range
    salary_range = scrapy.Field()
    # company products
    products = scrapy.Field()
    # financing situation
    financing_situation = scrapy.Field()
    # company ranking
    rank = scrapy.Field()
    # company address
    address = scrapy.Field()
    # company website
    website = scrapy.Field()
    # company contact
    contacts = scrapy.Field()
    # QQ
    qq = scrapy.Field()
After writing the items file, we can create the crawler file. Here I use CrawlSpider. Before running the command, cd into the ZhiYouJi folder, then run:

scrapy genspider -t crawl zhiyouji "jobui.com"
After creating the crawler file, open the project in PyCharm.
With the preparation done, how do we write the crawler and extract the data?
Open the zhiyouji.py file under the spiders directory and start by analyzing the site.
First, set the starting URL by modifying start_urls in the file:

start_urls = ['http://www.jobui.com/cmp']
The site's data is paginated, so we need to get the URL of the next page.
Analysis shows that the next-page URLs follow the pattern /cmp?n=<page number>#listInter, so we can extract the next-page links with a regular expression.
# get the URL of the next page
Rule(LinkExtractor(allow=r'/cmp\?n=\d+#listInter'), follow=True),
With the next-page URLs handled, the next step is the detail-page URLs. Inspection shows that detail-page links follow the pattern /company/<number>/, so a regular expression can match them as well.
# get the detail page URL
Rule(LinkExtractor(allow=r'/company/\d+/$'), callback='parse_item', follow=False),
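Putting the pieces together, the spider file looks roughly like this. This is only a sketch: the class and module names follow the project created above, and the boilerplate generated by genspider may differ slightly depending on your Scrapy version:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ZhiYouJi.items import ZhiyoujiItem


class ZhiyoujiSpider(CrawlSpider):
    name = 'zhiyouji'
    allowed_domains = ['jobui.com']
    start_urls = ['http://www.jobui.com/cmp']

    rules = (
        # follow pagination links such as /cmp?n=2#listInter
        Rule(LinkExtractor(allow=r'/cmp\?n=\d+#listInter'), follow=True),
        # hand detail pages such as /company/12345/ to parse_item
        Rule(LinkExtractor(allow=r'/company/\d+/$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # extraction logic is shown in the next code block
        item = ZhiyoujiItem()
        yield item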
The detail-page rule sets parse_item as the callback, and that function is where we extract the data we want. The code is actually quite straightforward: it uses XPath to pull out each piece of data. If you are unfamiliar with XPath, there are plenty of tutorials online. The code follows:
def parse_item(self, response):
    # instantiate the item object
    item = ZhiyoujiItem()

    # use XPath to extract the data
    # company name
    item['name'] = response.xpath('//*[@id="companyH1"]/a/text()').extract_first()
    # page views: keep only the number before the Chinese character "人" (person)
    item['views'] = response.xpath('//div[@class="grade cfix sbox"]/div[1]/text()').extract_first().split(u'人')[0].strip()

    """
    Some company detail pages have no picture,
    so the structure of the page is slightly different
    """
    # company type
    try:
        item['type'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[0]
    except:
        item['type'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[0]
    # company size
    try:
        item['size'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[1]
    except:
        item['size'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[1]

    # industry
    item['industry'] = response.xpath('//dd[@class="comInd"]/a[1]/text()').extract_first()
    # company abbreviation
    item['abbreviation'] = response.xpath('//dl[@class="j-edit hasVist dlli mb10"]/dd[3]/text()').extract_first()
    # company info
    item['info'] = ''.join(response.xpath('//*[@id="textShowMore"]/text()').extract())
    # praise rating
    item['praise'] = response.xpath('//div[@class="swf-contA"]/div/h3/text()').extract_first()
    # salary range
    item['salary_range'] = response.xpath('//div[@class="swf-contB"]/div/h3/text()').extract_first()
    # company products
    item['products'] = response.xpath('//div[@class="mb5"]/a/text()').extract()

    # financing situation
    data_list = []
    node_list = response.xpath('//div[5]/ul/li')
    for node in node_list:
        temp = {}
        # financing date
        temp['date'] = node.xpath('./span[1]/text()').extract_first()
        # financing status
        temp['status'] = node.xpath('./h3/text()').extract_first()
        # financing amount
        temp['sum'] = node.xpath('./span[2]/text()').extract_first()
        # investors
        temp['investors'] = node.xpath('./span[3]/text()').extract_first()
        data_list.append(temp)
    item['financing_situation'] = data_list

    # company ranking
    data_list = []
    node_list = response.xpath('//div[@class="fs18 honor-box"]/div')
    for node in node_list:
        temp = {}
        key = node.xpath('./a/text()').extract_first()
        temp[key] = int(node.xpath('./span[2]/text()').extract_first())
        data_list.append(temp)
    item['rank'] = data_list

    # company address
    item['address'] = response.xpath('//dl[@class="dlli fs16"]/dd[1]/text()').extract_first()
    # company website
    item['website'] = response.xpath('//dl[@class="dlli fs16"]/dd[2]/a/text()').extract_first()
    # contact
    item['contacts'] = response.xpath('//div[@class="j-shower1 dn"]/dd/text()').extract_first()
    # QQ number
    item['qq'] = response.xpath('//dd[@class="cfix"]/span/text()').extract_first()

    # for k, v in item.items():
    #     print(k, v)
    # print('****************************************')

    yield item
One thing to note in the code above: some company detail pages have a picture in the profile section and some do not, so an XPath written for one layout may fail to match anything on the other, and the extraction has to handle both cases. That is what the try/except blocks do:

"""
Some company detail pages have no picture,
so the structure of the page is slightly different
"""
# company type
try:
    item['type'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[0]
except:
    item['type'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[0]
# company size
try:
    item['size'] = response.xpath('//div[@class="cfix fs16"]/dl/dd[1]/text()').extract_first().split('/')[1]
except:
    item['size'] = response.xpath('//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()').extract_first().split('/')[1]
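As an aside, the same fallback could also be written without try/except by joining the two expressions with the XPath union operator |, so that extract_first() returns whichever layout matches. This is only a sketch of an alternative, not part of the original project code:

# hypothetical alternative: try both page layouts in a single XPath expression
item['type'] = response.xpath(
    '//div[@class="cfix fs16"]/dl/dd[1]/text() | '
    '//*[@id="cmp-intro"]/div/div/dl/dd[1]/text()'
).extract_first().split('/')[0]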
The rest of the function is ordinary XPath extraction of the remaining fields.
With the data extracted, the next step is saving it as JSON. Open the pipeline file pipelines.py:
import json


class ZhiyoujiPipeline(object):

    def open_spider(self, spider):
        self.file = open('zhiyouji.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        data = json.dumps(dict(item), ensure_ascii=False, indent=2)
        self.file.write(data)
        return item

    def close_spider(self, spider):
        self.file.close()
To have Scrapy save data through this pipeline, enable it in the settings file:

ITEM_PIPELINES = {
    # the number is the pipeline priority; 300 is the conventional value
    'ZhiYouJi.pipelines.ZhiyoujiPipeline': 300,
}
Then run the crawler with the command:

scrapy crawl zhiyouji
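As an aside, for a quick JSON dump Scrapy's built-in feed export can also write items directly, without the custom pipeline:

scrapy crawl zhiyouji -o zhiyouji.json

Here the pipeline approach is kept because it gives full control over how the file is written.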
Because there was so much data, I stopped the crawl partway through; the JSON file below shows part of the saved data.
I later converted this project into a scrapy-redis distributed crawler, which will be covered in the next post.