No Need to Write Code Yourself
With the Chrome plug-in Web Scraper you can easily crawl web page data without writing any code: everything is done with the mouse, pointing and clicking at whatever you want to grab, and you never have to deal with the usual crawler headaches such as login, CAPTCHAs, or asynchronous loading.
Web Scraper Plugin
The official Web Scraper site introduces it like this:
Web Scraper Extension (free!)
Using our extension you can create a plan (sitemap) for how a web site should be traversed and what should be extracted. Using these sitemaps the Web Scraper will navigate the site accordingly and extract all data. Scraped data can later be exported as CSV.
Let's take a look at the data I crawled with Web Scraper:
1. Wheel Brother's Zhihu followers
Wheel Brother has more than 540,000 followers; I grabbed only the first 20 pages, 400 records (20 followers per page).
Setting the data fields
Web Scraper crawl process and key points:
After installing the Web Scraper plugin, a crawl takes three steps:
1. Create a new sitemap (the crawl project)
2. Point and click to select the page content to crawl
3. Start the crawl and download the data as CSV
The most critical of these is the second step, which has two key points:
- First select the block element: every record we want on the page repeats the same structure, so check the Multiple option
- Then, inside each data block, select the individual data fields to extract (the columns in Excel); a Python sketch of this same two-step logic follows below
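For anyone who does want to see the code, the same two-step logic (repeating blocks first, then fields inside each block) maps directly onto CSS selectors. Here is a minimal Python sketch, assuming the requests and beautifulsoup4 packages; the class names are borrowed from the sitemap at the end of this article, and Zhihu may block anonymous requests or have changed its markup since, so treat them as assumptions:

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL and class names, taken from the sitemap at the end of
# this article; Zhihu may require login or have changed its markup.
url = "https://www.zhihu.com/people/excited-vczh/followers?page=1"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
soup = BeautifulSoup(html, "html.parser")

rows = []
# Point 1: select the repeating block element ("Multiple" in Web Scraper)
for item in soup.select("div.List-item"):
    # Point 2: select the data fields inside each block (the Excel columns)
    name = item.select_one("div.UserItem-title a.UserLink-link")
    desc = item.select_one("div.RichText")
    rows.append({
        "name": name.get_text(strip=True) if name else "",
        "desc": desc.get_text(strip=True) if desc else "",
    })

print(rows)
```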
The key to crawling large amounts of data is mastering pagination, which falls into three cases:
1. URL-parameter paging (the most structured): the URL itself carries a page parameter, such as:
https://www.zhihu.com/people/excited-vczh/followers?page=2
When creating the sitemap, you can write the paging range directly into the Start URL, like this:
https://www.zhihu.com/people/excited-vczh/followers?page=[1-27388]
2. Scroll-to-load, or clicking a "Load more" button to load additional page data
3. Clicking page-number links (including a "Next" link)
Note that cases 2 and 3 are both forms of asynchronous loading, and most of them can be converted into case 1 and handled the same way. Otherwise the paging is hard to control, and you generally have to fall back on the Link or Element Click selector types to page through. One common conversion is sketched below.
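Converting case 2 or 3 into case 1 usually means opening the browser's developer tools (Network tab), watching which request fires when the page loads more data, and then calling that request directly with its page or offset parameter. A minimal Python sketch of the idea; the endpoint, parameters, and response shape below are illustrative assumptions, not any site's documented API:

```python
import requests

def fetch_all(api_url, total=400, page_size=20):
    """Page through a JSON endpoint found in the browser's Network tab."""
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"
    records = []
    for offset in range(0, total, page_size):
        resp = session.get(api_url, params={"offset": offset, "limit": page_size})
        resp.raise_for_status()
        records.extend(resp.json()["data"])  # assumed response shape
    return records

# Usage (hypothetical endpoint -- substitute whatever request
# "Load more" actually triggers on your target site):
# rows = fetch_all("https://example.com/api/followers")
```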
Illustrated Web Scraper operation steps:
Step 1: Create a sitemap
Step 2: Select the block data elements
Step 3: Select the text of each captured field
Step 4: Run the crawl
Web Scraper usage experience:
1) Apart from regular URL-parameter paging, the other paging methods are hard to control; pagination markup varies from site to site, so the steps differ each time.
2) Because it grabs values exactly as they are displayed on the page, the crawled data is often poorly structured and needs cleaning with Excel functions. For example, the publish times of articles in Pinterest's 7-day popular list come in several different formats (see the normalization sketch after this list).
3) A little web-development background makes you much faster here: code is king. With a bit of Python in particular, selecting page data is easy to do and to understand, and problems are easier to spot as you work.
4) Compared with data collectors such as Octoparse and LocoySpider, Web Scraper needs no software download, is free, requires no registration, and still gives you a bit of code-level control. Web Scraper does also offer a paid cloud crawler.
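On point 2, the usual fix is a normalization pass after the crawl. Here is a minimal Python sketch for unifying publish times that appear in several formats; the input formats handled below are assumptions about what such a column might contain, not the actual formats on that site:

```python
import re
from datetime import datetime, timedelta

def normalize_time(raw, now=None):
    """Coerce a scraped display time into YYYY-MM-DD (assumed input formats)."""
    now = now or datetime.now()
    raw = raw.strip()
    m = re.match(r"(\d+) hours? ago", raw)             # e.g. "3 hours ago"
    if m:
        return (now - timedelta(hours=int(m.group(1)))).strftime("%Y-%m-%d")
    if raw == "yesterday":
        return (now - timedelta(days=1)).strftime("%Y-%m-%d")
    for fmt in ("%Y-%m-%d", "%m-%d", "%Y.%m.%d"):      # plain date variants
        try:
            parsed = datetime.strptime(raw, fmt)
            if fmt == "%m-%d":                         # year omitted: assume this year
                parsed = parsed.replace(year=now.year)
            return parsed.strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw                                         # leave unknown formats untouched

print(normalize_time("3 hours ago"))   # today's date
print(normalize_time("2018-01-07"))    # 2018-01-07
```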
Web Scraper can also import sitemaps. Import the JSON below and you can crawl the first 20 pages of Wheel Brother's followers:
{"StartURL":"Https://www.zhihu.com/people/excited-vczh/followers?page=[1-20]","Selectors": [{"Parentselectors": ["_root"],"Type":"Selectorelement","Multiple":True"id":"Items","Selector":"Div. List-item ","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Name","Selector":"Div. Useritem-title A.userlink-link ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Desc","Selector":"Div. RichText ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Answers","Selector":"Span. Contentitem-statusitem:nth-of-type (1) ","Regex":"","Delay":""},{"Parentselectors": ["Items"],"Type":"Selectortext","Multiple":False"id":"Articles"," selector":"span." Contentitem-statusitem:nth-of-type (2) "," regex ":""," delay ":""},{" parentselectors ": [" Items "]," type ":" Selectortext "," multiple ":false," id ":" fans "," selector ":" Span. Contentitem-statusitem:nth-of-type (3) "," regex ":""," delay ":""}]," _id ":" Zh_vczh "}
If you run into problems while learning, or want more learning resources, you are welcome to join the learning exchange group 626062078 and learn Python together with us.