Crawler Introduction
Crawler: You can think of the Internet as a large web, and a crawler as a spider moving across that web; whatever resource you want from the web, you can crawl it down.
In short, a crawler is an automated program that requests websites and extracts data from them.
The basic workflow of a crawler:
- Initiate a request: send a request to the target site through an HTTP library; the request can carry additional information such as headers. Then wait for the server to respond.
- Get the response content: if the server responds normally, you get a response whose content is the page you requested; it may be HTML, a JSON string, binary data, or another type.
- Parse the content: HTML can be parsed with regular expressions or a web page parsing library; JSON can be converted directly into a Python object; binary data can be saved or processed further.
- Save the data: the data can be saved in many ways, such as writing it out as text, storing it in a database, or saving it as a file in a specific format.
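The four steps above can be sketched end to end. In this sketch the network fetch of step 1 is stubbed with a hardcoded page so the example runs on its own; a real crawler would obtain the HTML from an HTTP library instead.

```python
import json
import re

# Step 1 (initiate a request) is stubbed with a hardcoded page so the
# sketch needs no network access; a real crawler would fetch this HTML.
html = '<html><body><a href="/news/1.html">First</a><a href="/news/2.html">Second</a></body></html>'

# Step 2: "receive" the response content (here, the stub above).
# Step 3: parse the content; a regular expression pulls out every link.
links = re.findall(r'href="([^"]+)"', html)

# Step 4: save the extracted data, e.g. as a JSON text file.
with open('links.json', 'w') as f:
    json.dump(links, f)

print(links)  # ['/news/1.html', '/news/2.html']
```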
The request and response process:
(1) The browser sends a message to the server that hosts the URL; this process is called the HTTP Request.
(2) After receiving the browser's message, the server processes it according to its content, then sends a message back to the browser; this process is called the HTTP Response.
(3) When the browser receives the response from the server, it processes and displays the information.
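To make the exchange concrete, here is a hand-built HTTP request and response as plain strings; real traffic on the wire has the same line-based shape, with a status line, headers, a blank line, and then the body.

```python
# A hand-built HTTP request message, as the browser would send it.
request_message = (
    'GET /index.html HTTP/1.1\r\n'
    'Host: www.example.com\r\n'
    'User-Agent: demo-crawler\r\n'
    '\r\n'
)

# A hand-built HTTP response message, as the server would return it.
response_message = (
    'HTTP/1.1 200 OK\r\n'
    'Content-Type: text/html\r\n'
    '\r\n'
    '<html>hello</html>'
)

# The status line carries the code the browser reacts to, and
# everything after the blank line is the body it renders.
status_line, rest = response_message.split('\r\n', 1)
headers_part, body = rest.split('\r\n\r\n', 1)
print(status_line)  # HTTP/1.1 200 OK
print(body)         # <html>hello</html>
```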
The Request:
- Request method: mainly GET and POST, plus HEAD, PUT, DELETE, OPTIONS, and so on.
- Request URL: URL stands for uniform resource locator; a web page, a picture, a video, and so on can each be uniquely identified by a URL.
- Request headers: the header information carried with the request, such as User-Agent, Host, cookies, etc.
- Request body: additional data carried with the request, such as the form data when a form is submitted.
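The parts of a request can be seen by building one without sending it. The standard-library `urllib.request.Request` is used here purely for illustration (the rest of this article uses the `requests` library); the URL and header values are made up.

```python
import urllib.parse
import urllib.request

# Encode some form data for the request body.
body = urllib.parse.urlencode({'username': 'user'}).encode()

# Build (but do not send) a request, to show where each part lives.
req = urllib.request.Request(
    url='http://www.example.com/login',       # the request URL
    data=body,                                # the request body (form data)
    headers={'User-Agent': 'demo-crawler'},   # the request headers
)

print(req.get_method())  # POST (implied because a body is present)
print(req.full_url)
print(req.data)
print(req.headers)
```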
The Response:
- Response status: there are many status codes, such as 200 for success, 301 for a redirect, 404 for page not found, 502 for a server error, etc.
- Response headers: such as the content type, content length, server information, cookie settings, and so on.
- Response body: the most important part; it contains the content of the requested resource, such as the page HTML, a picture, binary data, etc.
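The standard library already knows the status codes listed above, so their official phrases can be looked up directly:

```python
from http import HTTPStatus

# Look up the standard phrase for each status code mentioned above.
# Note that 301 is a permanent redirect and 502 is "Bad Gateway".
for code in (200, 301, 404, 502):
    status = HTTPStatus(code)
    print(code, status.phrase)
```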
Simple example:
```python
import requests

response = requests.get('http://www.baidu.com')
print(response.text)         # get the response body
print(response.headers)      # get the response headers
print(response.status_code)  # get the status code
```
What kind of data can be crawled?
- Web page text, such as HTML documents, JSON-formatted text, and so on.
- Pictures: what you get is a binary file, which you save in an image format.
- Video: likewise a binary file, saved in a video format.
- Anything else, as long as it can be obtained with a request.
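The distinction above matters when saving: text content is decoded and written in text mode, while pictures and video are raw bytes and must be written in binary mode. The stand-in values below play the role of a response's `.text` and `.content`.

```python
# Stand-ins for what response.text and response.content would hold.
page_text = '<html>hello</html>'    # decoded text (e.g. an HTML document)
image_bytes = b'\x89PNG\r\n\x1a\n'  # raw bytes (e.g. a PNG header)

# Text is written in text mode.
with open('page.html', 'w') as f:
    f.write(page_text)

# Binary data must be written in binary mode ('wb').
with open('image.png', 'wb') as f:
    f.write(image_bytes)

# The bytes round-trip unchanged.
print(open('image.png', 'rb').read()[:4])
```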
Data processing:
- Direct processing
- JSON parsing
- Regular expressions
- BeautifulSoup
- pyquery
- XPath
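Three of these approaches can be shown side by side on small sample inputs. JSON converts directly into Python objects, a regular expression works on the raw HTML string, and a structural parser walks the HTML as a tree (the standard-library `HTMLParser` stands in here for BeautifulSoup, which does the same job with far less code on your side).

```python
import json
import re
from html.parser import HTMLParser

sample_html = '<div id="i1"><a href="/item/1">First</a></div>'
sample_json = '{"title": "First", "id": 1}'

# JSON text converts directly into a Python object.
obj = json.loads(sample_json)
print(obj['title'])  # First

# A regular expression pulls the link out of the raw HTML string.
print(re.search(r'href="([^"]+)"', sample_html).group(1))  # /item/1

# A small HTMLParser subclass walks the same HTML structurally.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

parser = LinkParser()
parser.feed(sample_html)
print(parser.links)  # ['/item/1']
```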
How to save the data:
- Text: plain text, JSON, XML, etc.
- Relational databases such as MySQL, Oracle, etc.
- Non-relational databases such as MongoDB and Redis, which store data as key-value pairs.
- Binary files: pictures, videos, audio, and so on, saved directly in the desired format.
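The first two options can be sketched with the standard library alone: a JSON text file for plain-text storage, and `sqlite3` standing in for a relational database like MySQL or Oracle, since it ships with Python and needs no server. The table and field names are made up for the example.

```python
import json
import sqlite3

# Some crawled records to save (made-up sample data).
rows = [{'title': 'First', 'href': '/item/1'},
        {'title': 'Second', 'href': '/item/2'}]

# Plain-text storage: one JSON file.
with open('items.json', 'w') as f:
    json.dump(rows, f)

# Relational storage: sqlite3 stands in for MySQL/Oracle here.
conn = sqlite3.connect('items.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (title TEXT, href TEXT)')
conn.executemany('INSERT INTO items VALUES (?, ?)',
                 [(r['title'], r['href']) for r in rows])
conn.commit()
print(conn.execute('SELECT COUNT(*) FROM items').fetchone()[0])
conn.close()
```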
Small example:
Crawl the href of each a tag and its picture from the https://www.autohome.com.cn/news/ page and save the pictures locally.
```python
import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(url='https://www.autohome.com.cn/news/')
response.encoding = response.apparent_encoding  # fix garbled characters
soup = BeautifulSoup(response.text, features='html.parser')
target = soup.find(id='auto-channel-lazyload-article')
li_list = target.find_all('li')
for i in li_list:
    # every i is itself a soup object, so find() can be used on it again
    a = i.find('a')
    # if no a tag is found, a.attrs would raise an error, so check first
    if a:
        a_href = a.attrs.get('href')
        a_href = 'http:' + a_href
        print(a_href)
        txt = a.find('h3').text
        print(txt)
        img = a.find('img').attrs.get('src')
        img = 'http:' + img
        print(img)
        img_response = requests.get(url=img)
        filename = str(uuid.uuid4()) + '.jpg'
        with open(filename, 'wb') as f:
            f.write(img_response.content)
```
Simple summary:
```python
response = requests.get('url')
response.text
response.content
response.encoding
response.encoding = response.apparent_encoding
response.status_code
```

```python
soup = BeautifulSoup(response.text, features='html.parser')
v1 = soup.find('div')        # the first matching element
soup.find(id='i1')
soup.find('div', id='i1')
v2 = soup.find_all('div')    # all matching elements, as a list
obj = v1
obj = v2[0]                  # index into the list to get each object
obj.text
obj.attrs                    # the element's attributes
```
Requests Module Introduction
1. How the methods relate to each other
```python
requests.get()
requests.post()
requests.put()
requests.delete()
...
```

These methods all essentially call the `requests.request()` method; for example, `get()` is defined as:

```python
def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query
        string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """
    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)
```
2. Common parameters:
```python
requests.request()
# - method: how to submit (the HTTP method)
# - url:    the address to submit to
# - params: parameters passed on the URL (GET style), e.g.
requests.request(
    method='GET',
    url='http://www.baidu.com',
    params={'username': 'user', 'password': 'pwd'},
)
# -> http://www.baidu.com?username=user&password=pwd

# - data: data passed in the request body
requests.request(
    method='POST',
    url='http://www.baidu.com',
    data={'username': 'user', 'password': 'pwd'},
)

# - json: data passed in the request body, serialized and sent as a
#   whole, i.e. json sends '{"username": "user", "password": "pwd"}'
requests.request(
    method='POST',
    url='http://www.baidu.com',
    json={'username': 'user', 'password': 'pwd'},
)

# - headers: the request headers
requests.request(
    method='POST',
    url='http://www.baidu.com',
    json={'username': 'user', 'password': 'pwd'},
    headers={
        'referer': 'https://dig.chouti.com/',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.132 Safari/537.36',
    },
)

# - cookies: cookies, which are usually sent via the headers
```
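The way `params` turns into a query string can be reproduced offline with the standard-library `urlencode`, which performs the same transformation `requests` applies before appending the parameters to the URL:

```python
from urllib.parse import urlencode

# The same params dict as above, encoded into a query string.
params = {'username': 'user', 'password': 'pwd'}
query = urlencode(params)
url = 'http://www.baidu.com?' + query
print(url)  # http://www.baidu.com?username=user&password=pwd
```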