Scrapy Crawl 2 (get post URL)

Source: Internet
Author: User
Tags: xpath

1. The goal is to crawl the investment-method data from Globebill. The content to crawl is as follows:

2. Look at the URL and you will notice:





When you click through to the next page, the link in the address bar does not change. This tells you that the data on the page is loaded with a POST request.

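A quick note on the differences between GET and POST:

GET passes its arguments explicitly in the URL; POST passes them implicitly in the request body.

In practice, browsers and servers limit the length of a GET URL; POST data is not subject to that limit.

GET is considered less secure than POST, because its parameters are visible in the URL.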
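
To make the first difference concrete, here is a minimal sketch using only the Python standard library. The endpoint and parameter names are hypothetical and nothing is sent over the network; the two request objects are only built and inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only.
BASE_URL = "http://example.com/list"
params = {"pageNo": 2, "pageSize": 10}
query = urlencode(params)

# GET: the parameters are part of the URL, so they show in the address bar.
get_req = Request(BASE_URL + "?" + query)

# POST: the URL stays the same; the parameters travel in the request body.
post_req = Request(BASE_URL, data=query.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

This is why paging through a POST-driven list leaves the address bar unchanged: only the request body differs from page to page.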

However, I also came across another article on this topic.

3. Finding the data with F12

Simply loading one page, waiting for the response, and then looking in the Network tab shows no POST data. Instead, open the developer tools with F12 first, then click the next page and watch what changes; the POST request that carries the data then appears.

4. Capture the request with Fiddler

With this tool, you can capture only the three parameters of the POST request, like this:

pageno=1&pagesize=10&loanid=a53a4759bf89454dbc5756ca0e12f482

The three parameters obtained are: pageno, pageSize, and loanId.
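
As a small sketch, the captured body can be split into its three parameters with the standard library. The string is the one captured above; the variable names are only illustrative.

```python
from urllib.parse import parse_qsl

# Request body exactly as captured in Fiddler.
captured_body = "pageno=1&pagesize=10&loanid=a53a4759bf89454dbc5756ca0e12f482"

# parse_qsl splits on "&" and "=" and returns (name, value) pairs in order.
form = dict(parse_qsl(captured_body))

for name, value in form.items():
    print(name, "=", value)
```

In a Scrapy spider, a dictionary like this could be passed as the `formdata` argument of `scrapy.FormRequest` to send the same POST, assuming the endpoint URL has been found separately.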

Fiddler shows only the parameters here, not the complete link, so there is nothing to splice them onto. Using the approach from step 3, you can find the full URL, such as this:


If you scroll further down, you will find the three parameters that were passed.


The link for the home page can be found in the same way.



5. Start parsing the page source for crawling.

XPath is used to get the page elements. The idea is to get the titles on the first page, then go into each next-level page to get the four subheadings and the content under them. However, XPath did not give the result I wanted, namely all of the titles, all of the subheadings, and all of the content. Perhaps I am just not very good at XPath yet .... /(ㄒoㄒ)/~~

Problems encountered

(1) Entering the second-level page: the number of pages of investment records differs from one second-level page to another, so I did not know what upper bound to write in the for loop (I really am a beginner!!!). It looks like this:

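# Requires: import re; from scrapy import Request
for url in urls:
    # Strip the detail-page prefix so that only the loan ID remains.
    rea = re.compile('/loan/show-loan-detial-loanid-')
    url = rea.sub('', url)
    # print url
    for pageIndex in range(1, 10):  # the upper bound of 10 is a guess
        link = "http://www.rqbao.com/loan/ajaxInvestCommonList?pageNo=" + str(pageIndex) + "&pagesize=10&loanid=" + url
        # print link
        yield Request(link, callback=self.parsetable)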
The valid range of pageNo is not known in advance.
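
One possible way around the unknown range is to keep requesting pages until one comes back empty. The sketch below is not the original spider. It assumes, without having verified it against the site, that a page number past the last page returns no rows. `fetch_rows`, `fake_fetch`, and `max_pages` are hypothetical names introduced for illustration, and the network is replaced by a stand-in function.

```python
import re
from urllib.parse import urlencode

AJAX_URL = "http://www.rqbao.com/loan/ajaxInvestCommonList"
DETAIL_PREFIX = re.compile(r"/loan/show-loan-detial-loanid-")


def build_link(detail_path, page_no, page_size=10):
    """Build the AJAX link for one page of investment records."""
    loan_id = DETAIL_PREFIX.sub("", detail_path)
    query = urlencode({"pageNo": page_no, "pagesize": page_size, "loanid": loan_id})
    return AJAX_URL + "?" + query


def iter_records(detail_path, fetch_rows, max_pages=1000):
    """Yield rows page by page until a page comes back empty.

    fetch_rows is a hypothetical callable: link -> list of rows.
    max_pages is a safety cap, not a value taken from the site.
    """
    for page_no in range(1, max_pages + 1):
        rows = fetch_rows(build_link(detail_path, page_no))
        if not rows:
            break  # assumed: an empty page means there are no more records
        yield from rows


# Stand-in for the network: pretend the loan has two pages of records.
def fake_fetch(link):
    pages = {"pageNo=1&": ["r1", "r2"], "pageNo=2&": ["r3"]}
    return next((rows for key, rows in pages.items() if key in link), [])


records = list(iter_records("/loan/show-loan-detial-loanid-abc123", fake_fetch))
print(records)
```

In a Scrapy spider the same idea would usually live in the callback: if the response contains rows, yield them and then yield a request for the next page number; if not, stop.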

(2) When using XPath to get all of the content directly, I wrote this:

items = selector1.xpath('//tr[@class="investrecording"]/td').extract()
This returns the investment time, investment amount, and investment method, but the investing user is missing. A closer look at the page source shows that the user is also inside a <td> tag, but wrapped in a nested <font> tag. In the end I handled it with a regular expression instead.
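
An alternative to the regular expression is to take the full text of each cell, including text inside nested tags. The sketch below uses the standard library's ElementTree on a small hand-written fragment. The markup and sample values are hypothetical, modeled only on the description above, and real HTML is often not well-formed enough for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modeled on the description: the investor name sits
# inside a <font> tag nested in the first <td>.
html = """
<table>
  <tr class="investrecording">
    <td><font color="red">user***01</font></td>
    <td>2016-01-01 10:00</td>
    <td>1000</td>
    <td>manual</td>
  </tr>
</table>
""".strip()

root = ET.fromstring(html)
rows = []
for tr in root.findall(".//tr[@class='investrecording']"):
    # itertext() walks the cell and all nested tags, so text inside <font>
    # is collected as well.
    rows.append(["".join(td.itertext()).strip() for td in tr.findall("td")])

print(rows)
```

With a Scrapy selector, the comparable approach would be to select each `td` and then evaluate `xpath('string(.)')` on it, which returns the concatenated text of the cell and its descendants.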

(3) A problem occurred when saving the data

I kept getting an error saying that the file did not exist, even though I had checked that it was there. Then I changed the way the path was written:

# At first I wrote it like this: D:\python test\rqbao\rqb.txt
with open('D:\\python test\\rqbao\\rqb.txt', 'a') as f:
After that another error appeared, this time reporting an encoding problem. Encoding the text when writing it fixed this, and the file was written normally.
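
Both storage problems can be sketched in a few lines. This example assumes Python 3, where `open()` accepts an `encoding` argument; the original code looks like Python 2, where the text was encoded by hand before writing. The directory, file contents, and variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical output location; a temporary directory is used so the
# sketch runs anywhere. On Windows, a raw string such as
# r"D:\python test\rqbao\rqb.txt" avoids doubling the backslashes.
out_dir = os.path.join(tempfile.mkdtemp(), "rqbao")
os.makedirs(out_dir, exist_ok=True)  # open() does not create missing folders
out_path = os.path.join(out_dir, "rqb.txt")

record = "投资人 1000元"  # sample non-ASCII text, not real data

# Passing encoding= makes the file object encode the text on write.
with open(out_path, "a", encoding="utf-8") as f:
    f.write(record + "\n")

with open(out_path, encoding="utf-8") as f:
    content = f.read()

print(content)
```

Stating the encoding explicitly avoids depending on the platform default, which is a common cause of encoding errors when writing Chinese text on Windows.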

6. The code is as follows:






