This article explains how to store the content extracted from crawled HTML pages in JSON or CSV format.
1 JSON format storage
After choosing the site to crawl, we use what we learned previously (Beautiful Soup, XPath, and so on) to extract the content we want.
1.1 Getting data
First, use urllib to request the page https://www.lagou.com/zhaopin/Python/?labelWords=label
Get the HTML content with the following code:
from urllib import request

try:
    url = 'https://www.lagou.com/zhaopin/Python/?labelWords=label'
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    req = request.Request(url, headers=header)
    response = request.urlopen(req).read().decode('utf-8')
except request.URLError as e:
    if hasattr(e, 'reason'):
        print(e.reason)
    elif hasattr(e, 'code'):
        print(e.code)
Using the code above to get the HTML content, the next step is to parse the HTML to extract the content we need.
Open the Lagou page and press F12 to open Firefox's developer tools. You can see that everything we want to extract (job title, salary, hiring company, and so on) sits inside a div with class list_item_top.
The next step is to extract that div's content with Beautiful Soup, as described earlier, and pull out the pieces we need. With the developer tools we can identify the tags to target; the code is as follows:
# Create the soup instance
soup = BeautifulSoup(response, 'lxml')
# Get the div tags with class='list_item_top'
divlist = soup.find_all('div', class_='list_item_top')
# Define an empty list
content = []
# Loop over the divs and pull out the fields we need
for item in divlist:
    # Job title
    job_name = item.find('h3').string
    # Job detail page
    link = item.find('a', class_='position_link').get('href')
    # Hiring company
    company = item.find('div', class_='company_name').find('a').string
    # Salary
    salary = item.find('span', class_='money').string
    print(job_name, company, salary, link)
    content.append({'job': job_name, 'company': company, 'salary': salary, 'link': link})
All of this is obtained with Beautiful Soup methods; if anything is unclear, refer back to the earlier article on that tool. The output looks like this:
Python 开发工程师 还呗-智能信贷领先者 10k-15k https://www.lagou.com/jobs/2538412.html
Python开发工程师 天玑科技 10K-20K https://www.lagou.com/jobs/3608088.html
Python 兜乐科技 6k-12k https://www.lagou.com/jobs/4015725.html
Python 妙计旅行 8k-16k https://www.lagou.com/jobs/3828627.html
Python工程师 洋钱罐 25k-35k https://www.lagou.com/jobs/3852092.html
Python软件开发工程师 深信服科技集团 15k-20k https://www.lagou.com/jobs/4009780.html
Python开发 问卷网@爱调研 15k-25k https://www.lagou.com/jobs/3899604.html
Python Veeva 25k-35k https://www.lagou.com/jobs/3554732.html
python工程师 多麦 10k-20k https://www.lagou.com/jobs/3917781.html
python工程师 北蚁 8k-12k https://www.lagou.com/jobs/3082699.html
python研发工程师 数美 15k-30k https://www.lagou.com/jobs/3684787.html
python开发工程师 紫川软件 12k-19k https://www.lagou.com/jobs/3911802.html
python开发工程师 老虎证券 20k-40k https://www.lagou.com/jobs/3447959.html
Python开发 印孚瑟斯 10k-20k https://www.lagou.com/jobs/3762196.html
Python工程师 江苏亿科达 10k-20k https://www.lagou.com/jobs/3796922.html
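Since XPath was mentioned earlier as an alternative, here is a rough equivalent of the same extraction using lxml. This is only a sketch: the XPath selectors are assumptions derived from the class names above and may need adjusting to the real page markup.

from lxml import etree

# `response` is the decoded HTML string fetched earlier
html = etree.HTML(response)
content = []
for div in html.xpath('//div[contains(@class, "list_item_top")]'):
    job_name = div.xpath('.//h3/text()')
    link = div.xpath('.//a[contains(@class, "position_link")]/@href')
    company = div.xpath('.//div[contains(@class, "company_name")]//a/text()')
    salary = div.xpath('.//span[contains(@class, "money")]/text()')
    # xpath() returns lists; take the first match if present
    job_name, company, salary, link = [x[0].strip() if x else '' for x in (job_name, company, salary, link)]
    print(job_name, company, salary, link)
    content.append({'job': job_name, 'company': company, 'salary': salary, 'link': link})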
Now that we have the data, the next step is to store it.
1.2 Data storage (JSON)
Python encodes and decodes JSON data through the json module. Encoding converts a Python object to JSON via the module's dump and dumps functions; decoding converts JSON back to a Python object via load and loads.
Encoding
dump serializes a Python object as a JSON-formatted stream and writes it to a file object; its signature is:
json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
dumps serializes obj to a JSON-formatted str:
json.dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)
Decoding
load deserializes a JSON document read from a file object into a Python object:
json.load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
loads deserializes a JSON-formatted str into a Python object:
json.loads(s, *, encoding=None, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)
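A quick round trip shows both directions in action. This is a minimal, self-contained sketch unrelated to the crawler data:

import json

data = {'job': 'Python 开发工程师', 'salary': '10k-15k'}

# Encoding: Python dict -> JSON string
s = json.dumps(data, ensure_ascii=False)
print(s)           # {"job": "Python 开发工程师", "salary": "10k-15k"}
print(type(s))     # <class 'str'>

# Decoding: JSON string -> Python dict
restored = json.loads(s)
print(restored == data)   # True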
With the JSON operations covered, the Lagou data obtained earlier can now be stored as JSON, as shown in the following code:
with open('lagou.json', 'w') as fp:
    # indent sets the indentation; if given, the JSON data is stored with that indentation
    # if not set, the most compact representation is used
    json.dump(content, fp=fp, indent=4)
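One detail worth noting for the Lagou data: the job titles and company names are Chinese, and json.dump escapes non-ASCII characters to \uXXXX sequences by default (ensure_ascii=True). Here is a variant of the snippet above that keeps the file human-readable, assuming the same content list:

# Same as above, but keep Chinese characters readable instead of \uXXXX escapes;
# open the file with an explicit UTF-8 encoding to match.
with open('lagou.json', 'w', encoding='utf-8') as fp:
    json.dump(content, fp=fp, indent=4, ensure_ascii=False)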
That is all it takes to store the data in JSON format. The complete code is as follows:
# -*- coding: utf-8 -*-
import json
from bs4 import BeautifulSoup
from urllib import request

try:
    url = 'https://www.lagou.com/zhaopin/Python/?labelWords=label'
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    req = request.Request(url, headers=header)
    response = request.urlopen(req).read().decode('utf-8')
except request.URLError as e:
    if hasattr(e, 'reason'):
        print(e.reason)
    elif hasattr(e, 'code'):
        print(e.code)

# Create the soup instance
soup = BeautifulSoup(response, 'lxml')
# Get the div tags with class='list_item_top'
divlist = soup.find_all('div', class_='list_item_top')
# Define an empty list
content = []
# Loop over the divs and pull out the fields we need
for item in divlist:
    # Job title
    job_name = item.find('h3').string
    # Job detail page
    link = item.find('a', class_='position_link').get('href')
    # Hiring company
    company = item.find('div', class_='company_name').find('a').string
    # Salary
    salary = item.find('span', class_='money').string
    print(job_name, company, salary, link)
    content.append({'job': job_name, 'company': company, 'salary': salary, 'link': link})

with open('lagou.json', 'w') as fp:
    # indent sets the indentation; if given, the JSON data is stored with that indentation
    # if not set, the most compact representation is used
    json.dump(content, fp=fp, indent=4)
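As a quick sanity check, the stored file can be read straight back with json.load; a minimal sketch assuming lagou.json was produced by the script above:

import json

with open('lagou.json', 'r') as fp:
    jobs = json.load(fp)

# jobs is once again a list of dicts, e.g. jobs[0]['job'] and jobs[0]['salary']
print(len(jobs))
print(jobs[0])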
2 CSV format storage
The CSV (Comma-Separated Values) format is the most common import and export format for spreadsheets and databases.
Python's csv module implements classes to read and write tabular data in CSV format. It lets programmers say "write this data in the format preferred by Excel" or "read data from this file that was generated by Excel" without knowing the precise details of the CSV format Excel uses. Programmers can also describe the CSV formats understood by other applications, or define their own special-purpose CSV formats.
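The last sentence refers to CSV dialects: the csv module lets you register a named dialect and reuse it when reading or writing. A minimal sketch; the dialect name 'pipes', the '|' delimiter, and the file name are arbitrary choices for illustration:

import csv

# Register a custom dialect that separates fields with '|'
csv.register_dialect('pipes', delimiter='|')

with open('pipes.csv', 'w', newline='') as f:
    writer = csv.writer(f, dialect='pipes')
    writer.writerow(['id', 'name'])
    writer.writerow([1, 'xiaoming'])
# pipes.csv now contains:
# id|name
# 1|xiaoming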
Write data to a CSV file
# -*- coding: utf-8 -*-
import csv

# Define the header row
header = ['id', 'name']
# Two data rows
d1 = [1, "xiaoming"]
d2 = [2, "lucy"]

# Open the CSV file; newline='' prevents the blank line that would otherwise appear between rows
with open('test.csv', 'w', newline='') as f:
    # Create a writer object
    writer = csv.writer(f)
    # Write the rows
    writer.writerow(header)
    writer.writerow(d1)
    writer.writerow(d2)
The resulting CSV file contents are as follows:
id,name
1,xiaoming
2,lucy
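For many rows, the writer object also provides writerows, which takes an iterable of rows and writes them in one call; a minimal sketch equivalent to the example above:

import csv

header = ['id', 'name']
rows = [[1, "xiaoming"], [2, "lucy"]]

with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    # Write all data rows at once instead of looping over writerow
    writer.writerows(rows)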
Write a dictionary to a CSV file
import csv

with open('names.csv', 'w', newline='') as csvfile:
    # Define the field names, i.e. the header
    fieldnames = ['first_name', 'last_name']
    # Pass fieldnames to DictWriter, which is the class used to write dictionaries
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    # Call writeheader to write the header row
    writer.writeheader()
    # Write the dictionary data
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})
    writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'})
    writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})
The resulting CSV file contains:
first_name,last_name
Baked,Beans
Lovely,Spam
Wonderful,Spam
Read CSV file
# -*- coding: utf-8 -*-
import csv

with open('xingming.csv', 'r') as f:
    # Create a reader object
    reader = csv.reader(f)
    # reader is an iterable, so the rows can be fetched with a for loop
    for row in reader:
        print(row)
The results are as follows:
['id', 'name']
['1', 'xiaoming']
['2', 'lucy']
Read in a CSV file as a dictionary
import csv

with open('names.csv', 'r') as f:
    # Create a dictionary reader object
    reader = csv.DictReader(f)
    # Print the field names from the first row
    print(reader.fieldnames)
    # Loop over and print the dictionary contents
    for row in reader:
        print(row['first_name'], row['last_name'])
Output Result:
['first_name', 'last_name']
Baked Beans
Lovely Spam
Wonderful Spam
So, for the crawled Lagou data, the code to store it in a CSV file looks like this:
# -*- coding: utf-8 -*-
import csv
from bs4 import BeautifulSoup
from urllib import request

try:
    url = 'https://www.lagou.com/zhaopin/Python/?labelWords=label'
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
    req = request.Request(url, headers=header)
    response = request.urlopen(req).read().decode('utf-8')
except request.URLError as e:
    if hasattr(e, 'reason'):
        print(e.reason)
    elif hasattr(e, 'code'):
        print(e.code)

# Create the soup instance
soup = BeautifulSoup(response, 'lxml')
# Get the div tags with class='list_item_top'
divlist = soup.find_all('div', class_='list_item_top')
# Define an empty list
content = []
# Loop over the divs and pull out the fields we need
for item in divlist:
    # Job title
    job_name = item.find('h3').string
    # Job detail page
    link = item.find('a', class_='position_link').get('href')
    # Hiring company
    company = item.find('div', class_='company_name').find('a').string
    # Salary
    salary = item.find('span', class_='money').string
    # print(job_name, company, salary, link)
    content.append({'job': job_name, 'company': company, 'salary': salary, 'link': link})

with open('lagou.csv', 'a', newline='') as f:
    # Define the header
    fieldnames = ['job', 'company', 'salary', 'link']
    # Dictionaries are written with the DictWriter class
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    # Write the header row
    writer.writeheader()
    # Loop over the extracted content and write each row to the CSV file
    for row in content:
        writer.writerow(row)
The resulting lagou.csv file holds the scraped data, one posting per row, with job, company, salary, and link columns.
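Mirroring the JSON sanity check, the stored CSV can be read back with DictReader; a minimal sketch assuming lagou.csv was produced by the script above:

import csv

with open('lagou.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row['job'], row['company'], row['salary'], row['link'])

One thing to be aware of: the script above opens lagou.csv in append mode ('a') and calls writeheader on every run, so running it repeatedly will append duplicate header rows; open the file with 'w' instead if you want a fresh file each time.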
Hey! It turns out that today's example is actually better suited to CSV storage anyway. A new job-hunting skill; have you got it?