Basic steps
The purpose of this lesson is to crawl information from a site using the Requests module together with the BeautifulSoup module.
Crawling a site involves two main steps:
1. First, we need to understand the exchange mechanism between the server and the local client, so that we can choose the right way to obtain the correct information.
2. Second, we need to understand some methods for extracting information from real Web pages, and build some intuition for them.
The exchange mechanism between the server and the client
Let us first explain the exchange mechanism between the server and the local client, starting with a piece of common sense about how browsing works. When we browse the Web normally, clicking through pages in the browser, we are actually sending a request to the site's server; we call this a Request. When the server receives the request, it sends us a reply, which we call a Response. This behavior of one request followed by one response is the HTTP protocol: the way our client and the server hold a conversation.
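To make this concrete, here is a minimal sketch of one request/response round trip using the requests library (the URL is only a placeholder, not a page from this lesson):

import requests

# Send one HTTP request to the server (the URL here is a placeholder)
response = requests.get('https://www.example.com')

# The server's reply comes back as a Response object
print(response.status_code)   # e.g. 200 if the request succeeded
print(response.text[:100])    # the first characters of the returned page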
Request
When requesting something from a server, a request can actually use different methods. In the era of HTTP 1.0 there were only three methods: GET, POST, and HEAD.
Then HTTP 1.1, the protocol in general use today, added PUT, OPTIONS, CONNECT, TRACE, and DELETE, bringing the total to eight methods.
We will not explain each of them in detail here. You may wonder why a request has so many methods; in fact, we do not need to understand that much. We only need to know the two most common methods used when making requests to a server: GET and POST.
In a nutshell, 99% of pages can be accessed with these two methods. The two do behave differently and play different roles: clicking on a page or a button is typically a GET, while an action like posting a tweet is a POST. We will not go into the finer details for now.
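As a rough sketch, the two methods look like this with the requests library (the URL and the form fields are placeholders, not a real API):

import requests

# A GET request: fetch a page, like clicking a link in the browser
page = requests.get('https://www.example.com/page_one.html')

# A POST request: submit data to the server, like publishing a post
result = requests.post('https://www.example.com/submit',
                       data={'message': 'hello'})  # placeholder form fields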
Crawling pages with a crawler is really just imitating these two methods, and for our purposes the GET method alone lets us crawl more than 90% of sites, so in this part we will focus on how GET requests are used. The simplest GET request contains this information: the method GET at the leftmost position; the requested path, such as page_one.html, which we know is a Web page; the protocol in use, HTTP 1.1; and the Host, the site's main URL. The host combined with the path behind it forms the full link of the page you want to request. That is the minimum information a request needs to send.

Of course, a request can also send more information to the server: who you are, where you come from, what state you are in, and even which browser you are using. All of this can be sent through the request, and the server will give you different results depending on what you send. For example, in daily life the same page does not look the same in a mobile browser and in a PC browser. This is because when different browsers request the page, the server identifies what kind of client is asking and then decides what form of the page to send back. This is one simple request presenting different results depending on the extra information attached.
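For instance, here is a hedged sketch of attaching extra information to a GET request through headers; the User-Agent string below is just an example of a mobile browser identifier, and the URL is a placeholder:

import requests

# Pretend to be a mobile browser by sending a different User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
                  'AppleWebKit/604.1.38 (KHTML, like Gecko) Mobile/15A372'
}
response = requests.get('https://www.example.com', headers=headers)

# The server may now return the mobile version of the page
print(response.text[:200])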
Response
A Response is the information the website sends back to us. Simply put, what we crawled in the previous lesson were local Web pages; in a real network environment, those same pages arrive only after we make a request to the site, and the server then sends us the requested page in the form of a response. With it we get some basic information such as the status code: if it is 200, the request succeeded; if the request failed, the server should return something like 403 or 404. If the request succeeds, what follows is the page content, with its elements loading one after another. This is the basis of this section: request the site, get the page back, parse and analyze it, and crawl the data we want.
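A minimal sketch of checking the status code before parsing (the URL is a placeholder):

import requests

response = requests.get('https://www.example.com')

if response.status_code == 200:
    # The request succeeded; response.text holds the page to parse
    print('OK, received', len(response.text), 'characters')
else:
    # e.g. 403 (forbidden) or 404 (not found)
    print('Request failed with status', response.status_code)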
That was a brief description of the principle behind requests and responses. Now let us open a browser and see how these abstract concepts look in a real network environment. First open a Web page, right-click and choose Inspect (or press F12) to enter developer mode. We can see that the source code of the page has been loaded. To monitor this behavior, click Network in the options bar, then refresh the page so that it is reloaded.
At this point we can see that everything loaded by the Web page is listed here.
Now click the first entry, the page itself. Under Headers, the request and response information are all recorded by this monitor. Click Request Headers, and we can see how the Web request was made along with other specific details, such as the cookie carried by this page.
The User-Agent in the request identifies the browser (the "agent") we are using, and Host is the address of the site.
Next, look at Response in the options bar. The main information in the response is the page itself: the page source loaded in the Response tab is identical to the source we saw when we inspected the page.
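We can inspect the same things from code. A small sketch, with a placeholder URL; requests exposes the headers it actually sent on response.request:

import requests

response = requests.get('https://www.example.com')

# The headers our request carried (compare with Request Headers in the browser)
print(response.request.headers)

# The start of the page source (compare with the Response tab)
print(response.text[:200])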
Crawling information from a Web page
That was a demonstration of the request and response interaction in the browser. Now let us use this kind of interaction with the server to crawl the data we want. Before writing the crawling code, we first look at the page we want to crawl and filter out the elements we need. The elements we need are the pictures, the picture titles, and the number of hotels.
1. Use requests to request the Web page content from the server
2. Use BeautifulSoup to parse the page
3. Describe the location of each element we want to crawl
First of all we need the titles. Using the method from the previous lesson, copy the CSS selector of one title and remove the overly specific parts of the path (anyone who has learned HTML should understand this) so that it finds all the titles: to locate an element, we look for the features that characterize it uniquely.
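As a sketch of the idea: the selector the browser copies usually pins down one single element with parts like :nth-child(...), and removing those parts leaves a selector matching every title. The final selector is the one used in the complete code at the end; the :nth-child example is only illustrative:

from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.cn/TravelersChoice-Landmarks'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# The browser's "Copy selector" might give something overly specific like
#   div:nth-child(3) > div.winnerlayer > div.winnername > div.mainName.extra > a
# Removing the :nth-child(...) part makes it match all the titles:
titles = soup.select('div.winnerlayer > div.winnername > div.mainName.extra > a')
print(len(titles))   # now all titles, not just one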
When looking for the pictures, we can filter on the image's height or width attribute in the selector, so that we pick out exactly the pictures we need.
Finally, we need the number of hotels.
4. Organize and filter the required information
But after running the output we find that the image addresses we get are wrong. This is because the site has taken anti-crawler measures and uses JS code to control the pictures; how to get the correct pictures will be explained later.
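While the full workaround comes later, a common first debugging step (a sketch, not this lesson's own solution) is to print all attributes of the matched img tags; when JS controls the loading, the real address often hides in an attribute other than src:

from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.cn/TravelersChoice-Landmarks'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
imgs = soup.select("img[width='472px']")

# Inspect every attribute of the first few matched images to see
# where the real address might be stored
for img in imgs[:3]:
    print(img.attrs)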
Finally, here is the code demonstration.
from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.cn/TravelersChoice-Landmarks'

# Step 1: request the page content from the server
wb_data = requests.get(url)

# Step 2: parse the page
soup = BeautifulSoup(wb_data.text, 'lxml')

# Step 3: describe the location of each element we want
titles = soup.select('div.winnerlayer > div.winnername > div.mainName.extra > a')
imgs = soup.select("img[width='472px']")
hotelsnum = soup.select('div.lodging > div.lodgbdy > ul > li > a')

# Step 4: organize and filter the required information
for title, img, hotelnum in zip(titles, imgs, hotelsnum):
    data = {
        'title': title.get_text(),
        'img': img.get('src'),
        'hotelnum': hotelnum.get_text(),
    }
    print(data)
That is all for today's crawler lesson. The key points are to first understand the exchange mechanism between the server and the local client, and then to practice crawling information from a Web page.
Crawler Lesson 3: Analyzing Web pages on the Internet