Is web crawling easy? I didn't understand until I had crawled 100 billion pages: crawling is really not easy!


Web crawling may seem like an easy thing to do these days, but that view is misleading. There are plenty of open source libraries and frameworks, visual crawlers, and data extraction tools, and pulling data from a website can feel like a breeze. However, once you start scraping the web at scale, things quickly get tricky.


What makes crawling at scale different?

Unlike a standard web crawling application, extracting e-commerce product data at scale poses a unique set of challenges that make web crawling far more difficult.

In essence, these challenges come down to two things: speed and data quality.

Challenge 1: Sloppy and constantly changing website formats

This may be obvious, and it is hardly the most glamorous challenge, but sloppy and constantly changing website formats are by far the biggest obstacle you will face when extracting data at scale. Not necessarily because of the complexity of the task itself, but because of the time and resources you will have to devote to it.

There is no easy solution.

Unfortunately, there is no silver bullet that solves these problems completely. Much of the time it is simply a matter of scaling up and devoting more resources to your project. For instance, one such project has 18 full-time crawler engineers and 3 dedicated QA engineers to ensure that customers always get a reliable flow of data.
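To give a concrete flavor of that maintenance work, here is a minimal sketch of one common mitigation: trying several selectors in priority order, so a single layout change degrades gracefully instead of silently breaking the crawl. The library (BeautifulSoup), the selectors, and the "price" field are all assumptions for illustration, not anything this article prescribes.

```python
# A minimal sketch of fallback-based extraction, assuming BeautifulSoup.
# The selectors and the "price" field are hypothetical -- real sites need
# their own lists, kept up to date as layouts change.
from bs4 import BeautifulSoup

# Selectors ordered from most to least preferred; when a redesign breaks
# the first one, the crawler degrades gracefully instead of failing.
PRICE_SELECTORS = [".product-price", "span.price", "[itemprop='price']"]

def extract_price(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None  # signal a likely format change so QA can investigate
```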

Challenge 2: A scalable architecture

The second challenge you will face is building a crawling infrastructure that scales with the volume of daily requests without degrading performance.

A simple serial web crawler is quickly overwhelmed at this scale of product data extraction. A serial crawler typically loops through requests one at a time, and each request takes 2-3 seconds to complete.
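For comparison, here is a minimal sketch of concurrent fetching with asyncio and aiohttp (both choices are assumptions; any pooled HTTP client works). At 2-3 seconds per request, a serial loop tops out around 30,000-40,000 pages per day, while running requests concurrently multiplies throughput roughly by the number of in-flight requests.

```python
# A minimal sketch of concurrent crawling with asyncio and aiohttp
# (assumptions -- any HTTP client with a connection pool works similarly).
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls: list[str], concurrency: int = 50) -> list[str]:
    # A semaphore caps in-flight requests so we don't overload the target.
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session: aiohttp.ClientSession, url: str) -> str:
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded_fetch(session, u) for u in urls))

# Usage: asyncio.run(crawl(["https://example.com/p1", "https://example.com/p2"]))
```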

Challenge 3: Maintaining throughput performance

Crawling at scale can be compared to Formula One, where the goal is to strip all unnecessary weight from the car and squeeze the last bit of horsepower out of the engine in the name of speed. The same is true for web crawling at scale.

When extracting data in large volumes, you will always be looking for ways to minimize the request cycle time and maximize crawler performance within your existing hardware resources. It is all about shaving a few microseconds off each request.

For this reason, your team needs a deep understanding of the web crawling framework, proxy management, and the hardware in use, so that all of them can be tuned for optimal performance. You also need to focus on:

Proxies

Unfortunately, relying on proxy services alone is not enough to get around the anti-bot strategies of large e-commerce sites. More and more websites use sophisticated anti-bot strategies that monitor your crawler's behavior to detect whether it is a real human visitor.

These anti-bot strategies not only make crawling e-commerce sites more and more difficult; overcoming them, if done badly, can also be a serious drag on crawler performance.
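As a baseline, here is a minimal sketch of per-request proxy rotation using the requests library (an assumption; the proxy URLs are placeholders). As noted above, rotation alone will not defeat sophisticated anti-bot systems, but it is usually the first layer.

```python
# A minimal sketch of per-request proxy rotation, assuming the `requests`
# library. The proxy URLs are placeholders -- in practice they come from
# a managed pool, with unhealthy proxies rotated out automatically.
import itertools
import requests

PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch_via_proxy(url: str) -> requests.Response:
    proxy = next(proxy_cycle)
    # requests accepts a scheme -> proxy URL mapping per request.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```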

A large part of these anti-bot countermeasures use JavaScript to determine whether a request comes from a crawler or a human (JavaScript engine checks, font enumeration, WebGL and canvas fingerprinting, and so on).
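One common countermeasure, offered here as an assumption rather than anything this article prescribes, is to render pages in a real browser engine so that those JavaScript checks actually execute. A minimal sketch with Playwright:

```python
# A minimal sketch of fetching a JavaScript-heavy page with Playwright
# (an assumption -- any headless browser automation tool works similarly).
# Running a real browser lets the site's JavaScript checks execute, at a
# significant cost in throughput compared to plain HTTP requests.
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # HTML after JavaScript has run
        browser.close()
        return html
```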

    • Site changes: structural changes on the target site are the main cause of crawler failures, so a dedicated monitoring system watches for them. The tool checks the target site frequently to make sure nothing has changed since the last crawl, and sends a notification when a change is found (a minimal sketch of such a check follows below).
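Here is a minimal sketch of what such a check might look like; the sample URL, selector list, and alerting hook are all hypothetical placeholders.

```python
# A minimal sketch of a site-change check: fetch a known sample page and
# verify that the selectors the crawler depends on still match something.
# The URL, selectors, and alerting hook are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

SAMPLE_URL = "https://shop.example.com/product/123"
REQUIRED_SELECTORS = [".product-title", ".product-price", ".stock-status"]

def check_for_site_changes() -> list[str]:
    html = requests.get(SAMPLE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    missing = [s for s in REQUIRED_SELECTORS if soup.select_one(s) is None]
    if missing:
        # In production this would notify the crawl-engineering team.
        print(f"ALERT: selectors no longer match {SAMPLE_URL}: {missing}")
    return missing
```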

We will discuss the details of automatic quality assurance in a later article.

Summary

As you can see, extracting product data at scale presents its own set of unique challenges. Hopefully this article has made you more aware of those challenges and inspired you to tackle them.

However, this is only the first part of the series, so if you are interested, you can sign up for our mailing list and we will let you know as soon as the next article is published.
