Is web crawling easy? I didn't understand until I had crawled 100 billion pages: crawling is really not easy!


Web crawling may seem like an easy thing to do these days, but that view is misleading. There are plenty of open source libraries and frameworks, visual crawlers, and data extraction tools, and pulling data from a website can feel like a breeze. However, once you start scraping the web at scale, things quickly get tricky.


What makes crawling at scale different?

Unlike a standard web crawling application, extracting e-commerce product data at scale poses a unique set of challenges that make web crawling far more difficult.

In essence, these challenges come down to two things: speed and data quality.

Challenge 1: Sloppy and constantly changing website formats

This may be obvious, and it is hardly the most glamorous challenge, but sloppy and constantly changing website formats are by far the biggest obstacle you will face when extracting data at scale. Not necessarily because of the complexity of the task itself, but because of the time and resources you will have to devote to it.

There is no easy solution.

Unfortunately, there is no silver bullet that solves these problems completely. Much of the time it is simply a matter of scaling up and devoting more resources to your project. For instance, one such project has 18 full-time crawler engineers and 3 dedicated QA engineers to ensure that customers always get a reliable flow of data.
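To give a concrete flavor of that maintenance work, here is a minimal sketch of one common mitigation: trying several selectors in priority order, so a single layout change degrades gracefully instead of silently breaking the crawl. The library (BeautifulSoup), the selectors, and the "price" field are all assumptions for illustration, not anything this article prescribes.

```python
# A minimal sketch of fallback-based extraction, assuming BeautifulSoup.
# The selectors and the "price" field are hypothetical -- real sites need
# their own lists, kept up to date as layouts change.
from bs4 import BeautifulSoup

# Selectors ordered from most to least preferred; when a redesign breaks
# the first one, the crawler degrades gracefully instead of failing.
PRICE_SELECTORS = [".product-price", "span.price", "[itemprop='price']"]

def extract_price(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(strip=True)
    return None  # signal a likely format change so QA can investigate
```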

Challenge 2: A scalable architecture

The second challenge you will face is building a crawling infrastructure that scales with the volume of daily requests without degrading performance.

A simple serial web crawler is quickly overwhelmed at this scale of product data extraction. A serial crawler typically loops through requests one at a time, and each request takes 2-3 seconds to complete.
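For comparison, here is a minimal sketch of concurrent fetching with asyncio and aiohttp (both choices are assumptions; any pooled HTTP client works). At 2-3 seconds per request, a serial loop tops out around 30,000-40,000 pages per day, while running requests concurrently multiplies throughput roughly by the number of in-flight requests.

```python
# A minimal sketch of concurrent crawling with asyncio and aiohttp
# (assumptions -- any HTTP client with a connection pool works similarly).
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def crawl(urls: list[str], concurrency: int = 50) -> list[str]:
    # A semaphore caps in-flight requests so we don't overload the target.
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session: aiohttp.ClientSession, url: str) -> str:
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded_fetch(session, u) for u in urls))

# Usage: asyncio.run(crawl(["https://example.com/p1", "https://example.com/p2"]))
```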

Challenge 3: Maintaining throughput performance

Crawling at scale can be compared to Formula One, where the goal is to strip all unnecessary weight from the car and squeeze the last bit of horsepower out of the engine in the name of speed. The same is true for web crawling at scale.

When extracting data in large volumes, you will always be looking for ways to minimize the request cycle time and maximize crawler performance within your existing hardware resources. It is all about shaving a few microseconds off each request.

For this reason, your team needs a deep understanding of the web crawling framework, proxy management, and the hardware in use, so that all of them can be tuned for optimal performance. You also need to focus on:

Proxies

Unfortunately, relying on proxy services alone is not enough to get around the anti-bot strategies of large e-commerce sites. More and more websites use sophisticated anti-bot strategies that monitor your crawler's behavior to detect whether it is a real human visitor.

These anti-bot strategies not only make crawling e-commerce sites more and more difficult; overcoming them, if done badly, can also be a serious drag on crawler performance.
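As a baseline, here is a minimal sketch of per-request proxy rotation using the requests library (an assumption; the proxy URLs are placeholders). As noted above, rotation alone will not defeat sophisticated anti-bot systems, but it is usually the first layer.

```python
# A minimal sketch of per-request proxy rotation, assuming the `requests`
# library. The proxy URLs are placeholders -- in practice they come from
# a managed pool, with unhealthy proxies rotated out automatically.
import itertools
import requests

PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def fetch_via_proxy(url: str) -> requests.Response:
    proxy = next(proxy_cycle)
    # requests accepts a scheme -> proxy URL mapping per request.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```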

A large part of these anti-bot countermeasures use JavaScript to determine whether a request comes from a crawler or a human (JavaScript engine checks, font enumeration, WebGL and canvas fingerprinting, and so on).
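One common countermeasure, offered here as an assumption rather than anything this article prescribes, is to render pages in a real browser engine so that those JavaScript checks actually execute. A minimal sketch with Playwright:

```python
# A minimal sketch of fetching a JavaScript-heavy page with Playwright
# (an assumption -- any headless browser automation tool works similarly).
# Running a real browser lets the site's JavaScript checks execute, at a
# significant cost in throughput compared to plain HTTP requests.
from playwright.sync_api import sync_playwright

def fetch_rendered(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # HTML after JavaScript has run
        browser.close()
        return html
```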

    • Site changes: structural changes on the target site are the main cause of crawler failures, so a dedicated monitoring system watches for them. The tool checks the target site frequently to make sure nothing has changed since the last crawl, and sends a notification when a change is found (a minimal sketch of such a check follows below).
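Here is a minimal sketch of what such a check might look like; the sample URL, selector list, and alerting hook are all hypothetical placeholders.

```python
# A minimal sketch of a site-change check: fetch a known sample page and
# verify that the selectors the crawler depends on still match something.
# The URL, selectors, and alerting hook are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

SAMPLE_URL = "https://shop.example.com/product/123"
REQUIRED_SELECTORS = [".product-title", ".product-price", ".stock-status"]

def check_for_site_changes() -> list[str]:
    html = requests.get(SAMPLE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    missing = [s for s in REQUIRED_SELECTORS if soup.select_one(s) is None]
    if missing:
        # In production this would notify the crawl-engineering team.
        print(f"ALERT: selectors no longer match {SAMPLE_URL}: {missing}")
    return missing
```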

We will discuss the details of automatic quality assurance in a later article.

Summary

As you can see, extracting product data at scale presents its own set of unique challenges. Hopefully this article has made you more aware of those challenges and inspired you to tackle them.

However, this is only the first part of the series, so if you are interested, you can sign up for our mailing list and we will let you know as soon as the next article is published.
