Crawling Dynamic Web Pages with Scrapy

Source: Internet
Author: User

A dynamic web page can be one of several things:

1) It requires user interaction, such as a common login operation;

2) Its content is generated dynamically by JS/Ajax: for example, the HTML contains <div id="test"></div>, and JS turns it into <div id="test"><span>aaa</span></div>;

3) Entering a keyword and clicking search does not change the browser's URL.

This article uses no external tools; instead, it demonstrates how to parse a dynamic web page by observing the network traffic.

Environment: Windows 10, Python 2.7, Scrapy 1.4.0, Chrome browser, Firefox browser

1. Check whether the page is dynamic

Take the Washington Post as an example and search for the keyword "French"; the results are as follows:

The URL of the results page is https://www.washingtonpost.com/newssearch/?datefilter=All%20Since%202005&query=French&sort=Relevance&utm_term=.3570cb8c6dcf

Press F12 to open the developer console. Under the Elements tab, locate the <section> tag with id "main-content" that holds the search-result list you want to scrape.

Then switch to the Network panel's Doc tab, reload the page, click the first file under Name, and search the response on the right for the element with id "main-content": it contains no data.

This indicates that the content is loaded dynamically.

2. Find the request URL issued by the JS action

Under the JS tab, look for the file that holds the real data: click each file under Name and check the Preview pane on the right. The file whose preview actually contains the data is the page we really need to crawl.

A small trick: dynamically loaded data is usually delivered in JSON format, so typing "json" into the filter box often finds the right file more quickly. This does not work on every site, though; sometimes you have to search the JS or XHR tabs by hand.

Copy the link address of that file, which gives a very long URL:

https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json?count=20&datefilter=displaydatetime:%5b*+to+now%2fday%2b1day%5d&facets.fields=%7b!ex%3dinclude%7dcontenttype,%7b!ex%3dinclude%7dname&highlight.fields=headline,body&highlight.on=true&highlight.snippets=1&query=french&sort=&callback=angular.callbacks._0

Open this address in the browser and you will find it returns a JSON file. The URL is too lengthy, so we can delete some parameters as appropriate; the full parameter list can be read under Headers.

We choose to keep the three parameters count, datefilter, and query. Note that after deleting parameters, the page must still return the same JSON data as the original URL. The streamlined URL is:

https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json?count=20&datefilter=displaydatetime:[*+to+now%2fday%2b1day]&query=french

Open the URL in Firefox (chosen because Firefox displays JSON data in a friendly viewer) to get the following page:

3. Extracting JSON data

Based on the structure of the JSON file above, we can use the json.loads function to extract the data we want:
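As a minimal sketch: assuming the response has a shape like the sample below (the key names "results", "documents", "headline", and "contenturl" are placeholders that must be checked against the actual JSON seen in the browser), extraction is json.loads plus ordinary dict/list indexing:

```python
import json

# Hypothetical sample mimicking the shape of search.json; verify the real
# key names against the response in the browser before relying on them.
sample = """
{
  "results": {
    "documents": [
      {"headline": "France in the news", "contenturl": "https://wapo.st/a1"},
      {"headline": "French politics",    "contenturl": "https://wapo.st/a2"}
    ]
  }
}
"""

data = json.loads(sample)
for doc in data["results"]["documents"]:
    print("%s -> %s" % (doc["headline"], doc["contenturl"]))
```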

4. Paging mechanism

Page through the results on the site and observe how the URL parameters change:

First page:

Second page:

The third page:

Each page's URL differs only in the startat parameter. To parse the next page, simply increase the value of startat by 20 each time and append the parameter to the URL; for example, the second page is:

https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json?count=20&datefilter=displaydatetime:[*+to+now%2fday%2b1day]&query=french&startat=20

Finally, compute the startat value of the last page from the total number of news results.
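That arithmetic can be sketched as follows, assuming a page size of 20 and a made-up total of 125 hits (the real total comes from the metadata in the JSON response):

```python
BASE = ("https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/"
        "search.json?count=20&query=french")

def page_urls(total, per_page=20):
    # One URL per page: startat = 0, 20, 40, ... up to the last page.
    return [BASE + "&startat=%d" % start
            for start in range(0, total, per_page)]

urls = page_urls(125)   # 125 hits -> 7 pages
print(urls[1])          # the second page carries startat=20
```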

5. Parsing dynamic pages that submit parameters as a form

On some sites, the browser URL does not change after you enter a query keyword; in that case the request parameters are probably submitted as a form.

Take Apple Daily as an example: search for the keyword "France", and there is no "q="-style parameter anywhere in the URL.

Reload the URL and note that it lands on a separate search page:

Analyzing the search request shows that the parameters were submitted as a form:

Populate a FormRequest according to the parameters of the <input> elements:

The returned response is ordinary HTML and can be parsed with XPath as usual:

