Tags: Web Scraper, Chrome plugin, web page data crawling
With the Chrome plugin Web Scraper you can crawl web page data easily without writing any code: you simply click with the mouse on whatever you want to crawl, and you do not have to deal with complications such as crawler login, CAPTCHAs, or asynchronous loading.
Web Scraper Plugin
The official Web Scraper website introduces it as follows:
Web Scraper Extension (free!)
Using our extension you can create a plan (sitemap) how a web site should be traversed and what should be extracted. Using these sitemaps the Web Scraper will navigate the site accordingly and extract all data. Scraped data later can be exported as CSV.
Let's take a look at the data I crawled with Web Scraper:
1. Wheel Brother's followers
Wheel Brother has more than 540,000 followers; I only grabbed the first 20 pages, 400 records in total.
Setting data fields
2. Pinterest 7th Popular Data
Run crawlers to get data
Exporting data
Web Scraper crawl process and key points:
After installing the Web Scraper plugin, a crawl takes three steps:
1. Create a new sitemap (that is, a crawl project)
2. Select the page content to crawl by pointing and clicking
3. Start the crawl and download the CSV data
The most critical of these is the second step, which has two key points:
- First select the block element. Each record on the page sits in a block that repeats, so tick "multiple" for it.
- Then pick the required data fields inside each block (these become the columns in Excel).
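The "block first, then fields" idea can be sketched in Python as a small builder for a sitemap. This is only an illustration: the key names follow the sitemap shown later in this article, while the function names, the example.com URL, and the CSS selectors are hypothetical.

```python
import json


def text_field(field_id, css, parent="items"):
    """Child selector: one text value extracted inside each block."""
    return {
        "parentSelectors": [parent],
        "type": "SelectorText",
        "multiple": False,
        "id": field_id,
        "selector": css,
        "regex": "",
        "delay": "",
    }


def build_sitemap(sitemap_id, start_url, block_css, fields):
    """Build a sitemap dict: one repeating block plus its text fields."""
    # The block element repeats on the page, so "multiple" is True.
    block = {
        "parentSelectors": ["_root"],
        "type": "SelectorElement",
        "multiple": True,
        "id": "items",
        "selector": block_css,
        "delay": "",
    }
    children = [text_field(fid, css) for fid, css in fields.items()]
    return {"_id": sitemap_id, "startUrl": start_url, "selectors": [block] + children}


# Hypothetical page: each record is a div.item with a title and a summary.
sitemap = build_sitemap(
    "example_list",
    "https://example.com/list?page=[1-3]",
    "div.item",
    {"title": "a.title", "summary": "p.summary"},
)
print(json.dumps(sitemap, indent=2))
```

Every field selector names the block ("items") as its parent, which is what makes each block one row and each field one column in the exported data.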
The key to crawling large amounts of data is controlling pagination.
Pagination comes in three kinds:
URL parameter paging (the most regular kind)
The URL carries a page parameter, for example:
https://www.zhihu.com/people/excited-vczh/followers?page=2
When you create a sitemap, you can put the page range directly into the Start URL, written like this:
https://www.zhihu.com/people/excited-vczh/followers?page=[1-27388]
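To make the range notation concrete, here is a minimal Python sketch of how a `[first-last]` range in a start URL corresponds to a list of page URLs. It is not the extension's own code; the function name `expand_start_url` is hypothetical, and only the simple numeric form used above is handled.

```python
import re

# Matches a simple numeric range such as [1-20].
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_start_url(url):
    """Expand the first [first-last] range in url into one URL per page."""
    match = RANGE.search(url)
    if match is None:
        return [url]  # no range: the URL is used as-is
    first, last = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(first, last + 1)]


urls = expand_start_url(
    "https://www.zhihu.com/people/excited-vczh/followers?page=[1-20]"
)
print(len(urls))
print(urls[0])
print(urls[-1])
```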
Scroll loading, or clicking "load more" to load more page data
Clicking a page-number tab (including a "next" tab)
Note that the second and third kinds can both be classed as asynchronous loading, and most of them can be converted into the first kind and handled that way.
Paging of this kind is hard to control. Link or Element Click selectors are generally used for it.
Web Scraper operation steps, illustrated:
Step 1: Create a sitemap
Step 2: Select the block data element
Step 3: Select the field text to capture
Step 4: Crawl
Web Scraper usage experience:
1) Apart from regular URL paging, the other paging methods are hard to control; sites implement their paging tabs differently, so the operation differs from site to site.
2) Because it crawls the values exactly as the page displays them, the crawled data is not well structured and needs cleaning up with Excel functions.
For example, in the Pinterest 7th popular data, the article publish time appears in several different formats.
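As an alternative to Excel functions, mixed time formats can be normalized with a few lines of Python. The sketch below is only illustrative: the three formats in `FORMATS` are hypothetical, since the article does not list the formats actually encountered, and the function name is an assumption.

```python
from datetime import datetime

# Hypothetical input formats; replace them with the ones in your own CSV.
FORMATS = ["%Y-%m-%d %H:%M", "%Y.%m.%d %H:%M", "%Y/%m/%d"]


def normalize_time(raw):
    """Return raw as 'YYYY-MM-DD HH:MM', or None if no known format matches."""
    text = raw.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M")
        except ValueError:
            continue  # try the next candidate format
    return None


for value in ["2017.03.05 14:30", " 2017/03/05 ", "yesterday"]:
    print(repr(value), "->", normalize_time(value))
```

Values that match none of the formats come back as None, so they can be reviewed by hand rather than silently guessed.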
3) With a little web coding background you get up to speed very quickly; code is king.
In particular, with some Python background, selecting page data is easy to do and to understand, and problems in the operation are easy to spot.
4) Compared with data collectors such as Octoparse and LocoySpider, Web Scraper needs no software download, is free, requires no registration, and still allows a little code-level operation. Of course, Web Scraper also offers a paid cloud crawler.
Web Scraper can also import sitemaps. Importing the following code crawls the first 20 pages of Wheel Brother's followers:
{"startUrl": "https://www.zhihu.com/people/excited-vczh/followers?page=[1-20]",
 "selectors": [
  {"parentSelectors": ["_root"], "type": "SelectorElement", "multiple": true, "id": "items", "selector": "div.List-item", "delay": ""},
  {"parentSelectors": ["items"], "type": "SelectorText", "multiple": false, "id": "name", "selector": "div.UserItem-title a.UserLink-link", "regex": "", "delay": ""},
  {"parentSelectors": ["items"], "type": "SelectorText", "multiple": false, "id": "desc", "selector": "div.RichText", "regex": "", "delay": ""},
  {"parentSelectors": ["items"], "type": "SelectorText", "multiple": false, "id": "answers", "selector": "span.ContentItem-statusItem:nth-of-type(1)", "regex": "", "delay": ""},
  {"parentSelectors": ["items"], "type": "SelectorText", "multiple": false, "id": "articles", "selector": "span.ContentItem-statusItem:nth-of-type(2)", "regex": "", "delay": ""},
  {"parentSelectors": ["items"], "type": "SelectorText", "multiple": false, "id": "fans", "selector": "span.ContentItem-statusItem:nth-of-type(3)", "regex": "", "delay": ""}
 ],
 "_id": "zh_vczh"}
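Sitemap text copied from a web page can pick up stray quotes or spaces, so it may be worth checking it before pasting it into the import dialog. The following Python sketch assumes only the structure visible in the sitemap above; `check_sitemap` is a hypothetical helper, not part of Web Scraper.

```python
import json


def check_sitemap(text):
    """Return a list of problems found in a sitemap JSON string (empty if none)."""
    try:
        sitemap = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(sitemap, dict):
        return ["top level is not a JSON object"]
    problems = [f"missing key: {key}" for key in ("_id", "selectors") if key not in sitemap]
    selectors = sitemap.get("selectors", [])
    # Every parent must be "_root" or the id of another selector.
    known = {"_root"} | {s.get("id") for s in selectors}
    for s in selectors:
        for parent in s.get("parentSelectors", []):
            if parent not in known:
                problems.append(f"selector {s.get('id')!r} has unknown parent {parent!r}")
    return problems


good = (
    '{"_id": "demo", "startUrl": "https://example.com/?page=[1-2]", "selectors": ['
    '{"id": "items", "parentSelectors": ["_root"]}, '
    '{"id": "name", "parentSelectors": ["items"]}]}'
)
print(check_sitemap(good))
print(check_sitemap('{"_id": "demo" "selectors": []}'))
```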
PS: Web Scraper tutorials
Video tutorials on the official website:
http://webscraper.io/tutorials
@ Chen Dahin wrote up the detailed steps in an answer and also recorded a video tutorial.
The source question is "How to learn crawler technology from zero?" In that article, @ Chen Dahin gives a comparative analysis of Excel crawlers, Web Scraper, and code crawlers.
Reposted from SUN's BLOG, which focuses on Internet knowledge and sharing the spirit of the Internet.
Original article: "Easily crawl web data in 10 minutes with the Chrome plugin Web Scraper"
Original Address: http://whosmall.com/?post=473