Scrapy Crawl 2 (get post URL)

Last Update:2016-04-29 Source: Internet

Author: User

Tags xpath


1. The goal is to crawl the investment records (including the investment method) from Globebill. The content to crawl is as follows:

2. Check the URL. You will find:

When you click through to the next page, the link in the address bar does not change. This tells you that the data on the page is loaded with a POST request.

The differences between GET and POST:

GET passes its arguments explicitly in the URL; POST passes them implicitly in the request body.

GET URLs are subject to a length limit in practice; POST has no such limit.

GET is less secure than POST.

However, I also came across an article that discusses these points further.
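To make the first difference concrete, here is a minimal sketch using only Python's standard library. The endpoint and parameter values are placeholders chosen for illustration, not taken from the site, and no request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only
base_url = "http://example.com/list"
params = {"pageNo": 2, "pageSize": 10}
query = urlencode(params)

# GET: the parameters are visible in the URL itself
get_req = Request(base_url + "?" + query)

# POST: the URL stays the same; the parameters travel in the request body
post_req = Request(base_url, data=query.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

This is why the address bar does not change when paging with POST: only the body differs between requests.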

3. Find the data with F12 (browser developer tools)

If you simply load one page, wait for the response, and then look in the Network tab, no POST data shows up. Instead, open F12 first, click through to the next page, and watch what changes; then you can find the POST request and its data.

4. Capture the request with Fiddler

With this tool, only the three parameters of the POST can be captured, like this:

pageNo=1&pageSize=10&loanId=a53a4759bf89454dbc5756ca0e12f482

The three parameters obtained are pageNo, pageSize, and loanId.
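As a small illustration, the captured body can be split into its fields and re-encoded for another page with the standard library. This is only a sketch; the parameter spelling follows the names listed above.

```python
from urllib.parse import parse_qs, urlencode

# POST body as captured by Fiddler
body = "pageNo=1&pageSize=10&loanId=a53a4759bf89454dbc5756ca0e12f482"

# parse_qs returns a list of values for each key; keep the first of each
fields = {key: values[0] for key, values in parse_qs(body).items()}
print(fields)

# Change the page number and re-encode to get the body for the next page
fields["pageNo"] = "2"
next_body = urlencode(fields)
print(next_body)
```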

Fiddler does not show the complete link here, so the URL cannot be pieced together from this alone. Using step 3, you can find the full URL, such as this:

If you look further down, you will find the three parameters that were passed.

The link for the home page can be found in the same way.

5. Start parsing the page source for crawling.

XPath is used to get the elements of the page. The idea is to get the titles on the first page, then go into the following pages and get the four subheadings and the content beneath them. However, XPath did not produce the result I wanted; it returned all of the titles, all of the subheadings, and all of the content together. I am probably just not very good with XPath yet... /(ㄒoㄒ)/~~

Problems encountered

(1) When going into the second-level page, the number of pages of investment records differs from one loan to another, so I did not know what upper bound to write in the for loop (pretty amateurish, I know!). It looks like this:

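    # Requires: import re; from scrapy import Request
    for url in urls:
        # Strip the path prefix so that only the loan ID remains
        rea = re.compile('/loan/show-loan-detial-loanid-')
        url = rea.sub('', url)
        # print(url)
        # The upper bound of 10 pages is a guess (see below)
        for pageIndex in range(1, 10):
            link = "http://www.rqbao.com/loan/ajaxInvestCommonList?pageNo=" + str(pageIndex) + "&pageSize=10&loanId=" + url
            # print(link)
            yield Request(link, callback=self.parsetable)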

The range of pageNo is not known in advance.
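One possible way around the unknown page count is to keep requesting pages until one comes back empty. The sketch below assumes that a page past the end contains no record rows, which the article does not confirm for this site. `fetch_rows` is a hypothetical stand-in for whatever downloads and parses a single page, and the fake data exists only to make the sketch runnable.

```python
from typing import Callable, Iterator, List


def iter_all_rows(fetch_rows: Callable[[int], List[str]],
                  max_pages: int = 1000) -> Iterator[str]:
    """Yield rows page by page, stopping at the first empty page."""
    for page_no in range(1, max_pages + 1):
        rows = fetch_rows(page_no)
        if not rows:
            # Assumed end-of-data signal: a page with no rows
            break
        yield from rows


# Fake three-page data set standing in for the real responses
fake_pages = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page_no: int) -> List[str]:
    return fake_pages.get(page_no, [])


all_rows = list(iter_all_rows(fake_fetch))
print(all_rows)
```

In a Scrapy spider the same idea is usually expressed inside the callback: parse the current page, and yield a request for the next page only when the current one contained rows.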

(2) When using XPath to get all of the content directly, I wrote this:

    items = selector1.xpath('//tr[@class="investrecording"]/td').extract()

This obtains the investment time, the investment amount, and the investment method, but the investing user is missing. Looking carefully at the page source, the investing user is also inside a <td> tag, but its text is wrapped in a nested <font> tag. In the end I extracted it with a regular expression instead.
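The nested-tag problem can be reproduced without a regular expression. The sketch below uses the standard library's ElementTree on a made-up, well-formed row modeled on the structure described above; the cell values and their order are invented. With Scrapy selectors, evaluating the standard XPath expression `string(.)` on each `<td>` should give a comparable result.

```python
import xml.etree.ElementTree as ET

# Hypothetical row, modeled on the structure described above
html = """
<table>
  <tr class="investrecording">
    <td><font color="red">user***01</font></td>
    <td>2016-04-20 10:15</td>
    <td>1000</td>
    <td>manual</td>
  </tr>
</table>
"""

root = ET.fromstring(html)
cells = root.findall(".//tr[@class='investrecording']/td")

# td.text is None when the text sits inside a child element such as <font>;
# itertext() walks the cell and all of its descendants instead
values = ["".join(td.itertext()).strip() for td in cells]
print(values)
```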

(3) A problem occurred when saving the data

The program kept raising an error saying that the file did not exist, although I checked and it was there. Then I changed the way the path was written:

    # At first I wrote it like this: D:\python test\rqbao\rqb.txt
    with open('D:\\python test\\rqbao\\rqb.txt', 'a') as f:

After that, another error appeared, which turned out to be an encoding problem. Encoding the text at write time fixed it, and the file could then be written normally.
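A minimal sketch of both fixes in Python 3, where the encoding is passed to `open` rather than calling `encode` by hand as the article describes. A temporary folder stands in for the `D:\python test\rqbao` directory so that the example runs anywhere, and the sample record is made up.

```python
import os
import tempfile

# A raw string keeps backslashes literal, so the '\r' in '\rqbao' is not
# read as a carriage return; doubling every backslash works as well
windows_path = r"D:\python test\rqbao\rqb.txt"
assert windows_path == "D:\\python test\\rqbao\\rqb.txt"

# Temporary folder used in place of the Windows path above
out_path = os.path.join(tempfile.mkdtemp(), "rqb.txt")

record = "投资用户: user***01, 1000"  # made-up sample with non-ASCII text

# An explicit encoding avoids errors when the platform default cannot
# represent the characters being written
with open(out_path, "a", encoding="utf-8") as f:
    f.write(record + "\n")

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

The unescaped form of the path is a likely cause of the "file does not exist" error, since Python reads `\r` in an ordinary string literal as a carriage return.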

6. The code is as follows:


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
