No Need to Write Code Yourself
With the Chrome plug-in Web Scraper you can easily crawl web page data without writing any code: everything is done with the mouse, pointing and clicking at whatever you want to grab, and you never have to deal with the usual crawler headaches such as login, CAPTCHAs, or asynchronous loading.
Web Scraper Plugin
The official Web Scraper site introduces it like this:
Web Scraper Extension (free!)
Using our extension you can create a plan (sitemap) for how a web site should be traversed and what should be extracted. Using these sitemaps the Web Scraper will navigate the site accordingly and extract all data. Scraped data can later be exported as CSV.
Let's take a look at the data I crawled with Web Scraper:
1. Wheel Brother's Zhihu followers
Wheel Brother has more than 540,000 followers; I grabbed only the first 20 pages, 400 records (20 followers per page).
Setting the data fields
Web Scraper crawl process and key points:
After installing the Web Scraper plugin, a crawl takes three steps:
1. Create a new sitemap (the crawl project)
2. Point and click to select the page content to crawl
3. Start the crawl and download the data as CSV
The most critical of these is the second step, which has two key points:
- First select the block element: every record we want on the page repeats the same structure, so check the Multiple option
- Then, inside each data block, select the individual data fields to extract (the columns in Excel); a Python sketch of this same two-step logic follows below
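For anyone who does want to see the code, the same two-step logic (repeating blocks first, then fields inside each block) maps directly onto CSS selectors. Here is a minimal Python sketch, assuming the requests and beautifulsoup4 packages; the class names are borrowed from the sitemap at the end of this article, and Zhihu may block anonymous requests or have changed its markup since, so treat them as assumptions:

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL and class names, taken from the sitemap at the end of
# this article; Zhihu may require login or have changed its markup.
url = "https://www.zhihu.com/people/excited-vczh/followers?page=1"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

rows = []
# Point 1: select the repeating block element ("Multiple" in Web Scraper)
for item in soup.select("div.List-item"):
    # Point 2: select the data fields inside each block (the Excel columns)
    name = item.select_one("div.UserItem-title a.UserLink-link")
    desc = item.select_one("div.RichText")
    rows.append({
        "name": name.get_text(strip=True) if name else "",
        "desc": desc.get_text(strip=True) if desc else "",
    })

print(rows)
```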
The key to crawling large amounts of data is mastering pagination, which falls into three cases:
1. URL-parameter paging (the most structured): the URL itself carries a page parameter, such as:
https://www.zhihu.com/people/excited-vczh/followers?page=2
When creating the sitemap, you can write the paging range directly into the Start URL, like this:
https://www.zhihu.com/people/excited-vczh/followers?page=[1-27388]
2. Scroll-to-load, or clicking a "Load more" button to load additional page data
3. Clicking page-number links (including a "Next" link)
Note that cases 2 and 3 are both forms of asynchronous loading, and most of them can be converted into case 1 and handled the same way. Otherwise the paging is hard to control, and you generally have to fall back on the Link or Element Click selector types to page through. One common conversion is sketched below.
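Converting case 2 or 3 into case 1 usually means opening the browser's developer tools (Network tab), watching which request fires when the page loads more data, and then calling that request directly with its page or offset parameter. A minimal Python sketch of the idea; the endpoint, parameters, and response shape below are illustrative assumptions, not any site's documented API:

```python
import requests

def fetch_all(api_url, total=400, page_size=20):
    """Page through a JSON endpoint found in the browser's Network tab."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"
    records = []
    for offset in range(0, total, page_size):
        resp = session.get(api_url, params={"offset": offset, "limit": page_size})
        resp.raise_for_status()
        records.extend(resp.json()["data"])  # assumed response shape
    return records

# Usage (hypothetical endpoint -- substitute whatever request
# "Load more" actually triggers on your target site):
# rows = fetch_all("https://example.com/api/followers")
```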
Illustrated Web Scraper operation steps:
Step 1: Create a sitemap
Step 2: Select the block data elements
Step 3: Select the text of each captured field
Step 4: Run the crawl
Web Scraper usage experience:
1) Apart from regular URL-parameter paging, the other paging methods are hard to control; pagination markup varies from site to site, so the steps differ each time.
2) Because it grabs values exactly as they are displayed on the page, the crawled data is often poorly structured and needs cleaning with Excel functions. For example, the publish times of articles in Pinterest's 7-day popular list come in several different formats (see the normalization sketch after this list).
3) A little web-development background makes you much faster here: code is king. With a bit of Python in particular, selecting page data is easy to do and to understand, and problems are easier to spot as you work.
4) Compared with data collectors such as Octoparse and LocoySpider, Web Scraper needs no software download, is free, requires no registration, and still gives you a bit of code-level control. Web Scraper does also offer a paid cloud crawler.
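On point 2, the usual fix is a normalization pass after the crawl. Here is a minimal Python sketch for unifying publish times that appear in several formats; the input formats handled below are assumptions about what such a column might contain, not the actual formats on that site:

```python
import re
from datetime import datetime, timedelta

def normalize_time(raw, now=None):
    """Coerce a scraped display time into YYYY-MM-DD (assumed input formats)."""
    now = now or datetime.now()
    raw = raw.strip()
    m = re.match(r"(\d+) hours? ago", raw)             # e.g. "3 hours ago"
    if m:
        return (now - timedelta(hours=int(m.group(1)))).strftime("%Y-%m-%d")
    if raw == "yesterday":
        return (now - timedelta(days=1)).strftime("%Y-%m-%d")
    for fmt in ("%Y-%m-%d", "%m-%d", "%Y.%m.%d"):      # plain date variants
        try:
            parsed = datetime.strptime(raw, fmt)
            if fmt == "%m-%d":                         # year omitted: assume this year
                parsed = parsed.replace(year=now.year)
            return parsed.strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw                                         # leave unknown formats untouched

print(normalize_time("3 hours ago"))   # today's date
print(normalize_time("2018-01-07"))    # 2018-01-07
```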
Web Scraper can also import sitemaps. Import the JSON below and you can crawl the first 20 pages of Wheel Brother's followers:
{"StartURL":"Https://www.zhihu.com/people/excited-vczh/followers?page=[1-20]","Selectors": [{"Parentselectors": ["_root"],"Type":"Selectorelement","Multiple":True"id":"Items","Selector":"Div. List-item ","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Name","Selector":"Div. Useritem-title A.userlink-link ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Desc","Selector":"Div. RichText ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Answers","Selector":"Span. Contentitem-statusitem:nth-of-type (1) ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Articles"," selector":"span." Contentitem-statusitem:nth-of-type (2) "," regex ":""," delay ":""},{" parentselectors ": [" Items "]," type ":" Selectortext "," multiple ":false," id ":" fans "," selector ":" Span. Contentitem-statusitem:nth-of-type (3) "," regex ":""," delay ":""}]," _id ":" Zh_vczh "}
If you run into problems while learning, or want more learning resources, you are welcome to join the learning exchange group 626062078 and learn Python together with us.