Python: 10 minutes to a web crawler without writing code

Source: Internet
Author: User

Say goodbye to writing code yourself

With the Chrome extension Web Scraper you can easily crawl web page data without writing any code: everything is done with the mouse, you point at whatever you want to crawl, and you don't have to deal with the usual crawler headaches such as login, CAPTCHAs, or asynchronously loaded content.


Web Scraper Plugin

Introduction from the Web Scraper official website:

Web Scraper Extension (free!)
Using our extension you can create a plan (sitemap) for how a web site should be traversed and what should be extracted. Using these sitemaps the Web Scraper will navigate the site accordingly and extract all data. Scraped data can later be exported as CSV.

Let's take a look at the data I crawled with Web Scraper:

1. Wheel Brother's (vczh) Zhihu followers

Wheel Brother has more than 540,000 followers; I grabbed only the first 20 pages, 400 records (20 followers per page).


(Screenshot: setting the data fields)

Web Scraper crawl process and key points:

After installing the Web Scraper plugin, a crawl takes three steps:
1. Create a new sitemap (create a crawl project)
2. Select the page content to crawl: point, point, point with the mouse
3. Start the crawl and download the CSV data

The most critical is the second step, which has two key points (sketched in the code below):

    1. First select the block element: each record on the page repeats the same structure, so check "Multiple"
    2. Then select the required data fields inside the block (the columns in Excel)
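If you know a little Python, the "block first, then fields" idea maps directly onto nested CSS selectors. Here is a minimal BeautifulSoup sketch on made-up HTML (the class names are invented for illustration and are not Zhihu's real markup):

# A minimal sketch of the "block first, then fields" selection pattern,
# using BeautifulSoup on hypothetical HTML. Web Scraper does the same
# thing internally with the selectors you click together.
from bs4 import BeautifulSoup

html = """
<div class="list">
  <div class="list-item"><a class="user-link">Alice</a><span class="fans">1200</span></div>
  <div class="list-item"><a class="user-link">Bob</a><span class="fans">87</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Select the repeating block element (this is what "Multiple" means in Web Scraper).
for item in soup.select("div.list-item"):
    # 2. Pick the individual fields inside each block (one Excel column each).
    name = item.select_one("a.user-link").get_text(strip=True)
    fans = item.select_one("span.fans").get_text(strip=True)
    print(name, fans)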

The key to crawling large amounts of data is mastering pagination control.
Pagination falls into three cases:

    1. URL parameter paging (the most structured)
      The URL carries a page parameter, for example:

      https://www.zhihu.com/people/excited-vczh/followers?page=2

      When you create the sitemap, you can put the paging range directly into the Start URL, written like this (see the Python sketch after this list):

      https://www.zhihu.com/people/excited-vczh/followers?page=[1-27388]
    2. Scroll loading, or clicking "Load more" to load page data

    3. Clicking pagination number tabs (including a "Next" tab)
      Note that cases 2 and 3 are both forms of asynchronous loading, and most of them can be converted into case 1 and handled that way.
      Paging in these cases is hard to control precisely; a Link or Element Click selector is generally used for the paging operation.
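If you do want to write a little code, case 1 is just a loop over the page parameter, which is exactly what the [1-20] range notation expands to. A rough sketch with the requests library; note that the real Zhihu pages require login cookies and anti-crawler handling, so this is schematic, not a working Zhihu scraper:

# Loop over the page parameter, just as Web Scraper expands [1-20].
import time
import requests

BASE = "https://www.zhihu.com/people/excited-vczh/followers?page={}"

for page in range(1, 21):  # the equivalent of the [1-20] range notation
    resp = requests.get(BASE.format(page), timeout=10)
    print(page, resp.status_code, len(resp.text))
    time.sleep(1)  # be polite; Web Scraper has a Delay field for the same purpose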

Illustrated Web Scraper operation steps:
Step 1: Create a sitemap
Step 2: Select the block data element
Step 3: Select the text of the fields to capture
Step 4: Crawl

Web Scraper usage experience:

1) Apart from regular URL-parameter paging, the other paging methods are hard to control, and because different sites use different pagination labels, the operation differs from site to site.

2) Because it crawls the values exactly as they are displayed on the page, the crawled data is not well structured and needs post-processing with Excel functions.
For example, in example 7 (popular Pinterest articles), the publish times came in several different formats.
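As an alternative to Excel functions, a few lines of pandas can normalize such a column. The column name and sample strings below are invented for illustration:

# Normalize a column of publish times that arrives in several formats.
import pandas as pd

df = pd.DataFrame({"published": ["2017-08-01", "08/02/2017", "Aug 3, 2017"]})

# Applying pd.to_datetime per value lets each string be parsed with its
# own inferred format, so mixed formats in one column are fine.
df["published"] = df["published"].apply(pd.to_datetime)
print(df)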

3) With even a little web-coding background you will move much faster; code is king.
In particular, a little Python background makes it easy to pick out page data selectors, understand what is happening, and diagnose problems during operation.

4) Compared with data collectors such as Octoparse (八爪鱼) and LocoySpider (火车头), Web Scraper needs no software download, is free, and requires no registration, though it does involve a little code-like operation. Of course, Web Scraper also has a paid cloud crawler.

Web Scraper can also import sitemaps. Import the JSON below and you can crawl the first 20 pages of Wheel Brother's followers:

{"StartURL":"Https://www.zhihu.com/people/excited-vczh/followers?page=[1-20]","Selectors": [{"Parentselectors": ["_root"],"Type":"Selectorelement","Multiple":True"id":"Items","Selector":"Div. List-item ","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Name","Selector":"Div. Useritem-title A.userlink-link ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Desc","Selector":"Div. RichText ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Answers","Selector":"Span. Contentitem-statusitem:nth-of-type (1) ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Articles"," selector":"span." Contentitem-statusitem:nth-of-type (2) "," regex ":""," delay ":""},{" parentselectors ": [" Items "]," type ":" Selectortext "," multiple ":false," id ":" fans "," selector ":" Span. Contentitem-statusitem:nth-of-type (3) "," regex ":""," delay ":""}]," _id ":" Zh_vczh "}  

 

If you run into problems while learning, or want learning resources, you are welcome to join the learning exchange group 626062078. Let's learn Python together!
