What are the advantages and disadvantages of writing a web crawler in various languages?

Source: Internet
Author: User
It seems that many people are using Python, but I have also seen PHP, Java, C++, and so on. I have some skills in all of the languages above. Which language should I use to develop a crawler?

Reply content:

Thank you!
I have written crawlers and body-extraction programs in both PHP and Python.
First, let's talk about the advantages of PHP:
1. The language is relatively simple. PHP is a very forgiving language, so it is easy to write code that focuses on what you want to do rather than on syntax rules.
2. The functional modules are complete. There are two parts:
1. Webpage download: curl and other extension libraries;
2. Document parsing: DOM, XPath, Tidy, and various transcoding tools. The original poster's needs may differ from mine: my crawler has to extract the article body, which requires complicated text processing, so the various convenient text-processing tools were a great help.
In short, it is easy to get started.

Disadvantages:
1. Poor concurrency: at the time PHP had no thread or process support, so to achieve concurrency you had to fall back on an I/O multiplexing model; in PHP that means the select model, which is troublesome to implement. Perhaps because of my own skill level, my program often ran into problems that caused pages to be missed.

Let's talk about Python:
Advantages:
1. Various crawler frameworks make downloading webpages convenient and efficient;
2. Mature, stable threading and multiprocessing models. Crawling is a typical multi-task scenario: requesting a page involves long latency, and most of the time is spent waiting. Multiple threads or processes make the program far more efficient and improve the download and parsing throughput of the whole system (see the sketch after this list).
3. GAE support: GAE had only just launched when I was writing my crawler, and at the time it supported only Python. A crawler running on GAE is almost free of charge; at peak I had nearly a thousand application instances working.
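
Since most of a crawler's time is spent waiting on the network, even a small thread pool makes a big difference. Below is a minimal sketch of that idea; it assumes the third-party requests library is installed, and the URL list is a made-up placeholder rather than anything from the original answer.

# Minimal multithreaded fetcher sketch.
# Assumes the third-party "requests" library; URLS is a hypothetical placeholder.
from concurrent.futures import ThreadPoolExecutor

import requests

URLS = ["https://example.com/page/%d" % i for i in range(20)]

def fetch(url):
    # Each request spends most of its time waiting on the network,
    # so several threads can overlap their waits.
    resp = requests.get(url, timeout=10)
    return url, resp.status_code, len(resp.content)

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, status, size in pool.map(fetch, URLS):
        print(url, status, size)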

Disadvantages:
1. Poor adaptability to non-standard HTML: for example, if a page mixes the GB18030 and UTF-8 character sets, handling it in Python is not as simple as in PHP; you have to write a lot of the detection logic yourself. Of course, this only matters because I needed to extract the body text.
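
A minimal sketch of one way to cope, assuming the raw page bytes have already been downloaded. The candidate encodings are an assumption, and a page that truly mixes two charsets would still need to be split and decoded segment by segment, which is exactly the manual judgment complained about above.

# Sketch: decode a page whose declared encoding cannot be trusted.
# The candidate encodings are an assumption; adjust them to the sites you crawl.
def decode_page(raw_bytes):
    for encoding in ("utf-8", "gb18030", "big5"):
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing the undecodable bytes.
    return raw_bytes.decode("utf-8", errors="replace")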

I also tried Java and C++ at the time, but they felt cumbersome compared with a scripting language, so I gave them up.

In short, for developing a small-scale crawler, a scripting language has the advantage in every respect. If you want to develop a complex crawler system, Java may be a better option, and C++ is more suitable for writing individual modules. For a crawler system, download and parsing are only the two basic functions; a really good system also includes task scheduling, monitoring, storage, page-data storage and update logic, and deduplication. Crawling is a bandwidth-consuming application, and a good design saves a lot of bandwidth and server resources; the gap between a good design and a bad one is large.

Update: 2016-02-12
curl http://www.topit.me/ | grep -oP "http:[^>]*?(jpg|gif)" | xargs wget

This uses a regular expression to pull out every URL that starts with http and ends with jpg or gif, then hands the list to wget to download.

Original answer:
I once wrote an image crawler in shell and added it to crontab.
It was basically curl | grep | wget.
Could it be any simpler? I disagree with @kenth; he has seen too few crawlers.

First, it depends on the purpose

If it is a single site for a single purpose, write the crawler in the language you are used to; the time it would take to learn another language is enough to rewrite it twice.
If there are around 100 websites, building a framework and managing your crawlers matters more than how any single one is written.
Both of the above involve writing templates "by hand" (of course, with some small plug-ins and other auxiliary tools). The advantage of hand-written templates is that when there are not many sites, they are fast and flexible. For this scenario and purpose, pick the language you are used to; the language with the richest libraries for page parsing and HTTP requests is best, for example Python or Java.
Note that the only reason for this choice is that the start-up cost outweighs the writing cost.
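
To make the hand-written template approach above concrete, here is a minimal sketch; it assumes the requests and lxml libraries, and the site names and XPath expressions are hypothetical examples, not anything from the original answer.

# Sketch of hand-written per-site extraction templates.
# The site keys and XPath expressions are hypothetical examples.
import requests
from lxml import html

TEMPLATES = {
    "news.example.com": {
        "title": "//h1/text()",
        "body": "//div[@id='article']//p/text()",
    },
    "blog.example.org": {
        "title": "//title/text()",
        "body": "//div[@class='post']//text()",
    },
}

def extract(url):
    host = url.split("/")[2]              # naive hostname extraction
    template = TEMPLATES[host]            # one small template per site
    tree = html.fromstring(requests.get(url, timeout=10).content)
    return {field: tree.xpath(xpath) for field, xpath in template.items()}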

When you face on the order of 1,000 sites, you may need to write a template generator, such as Kimono: "Turn websites into structured APIs from your browser in seconds".
If you face more than a million sites of the same type, you may need to perform automatic template mining.
At this stage, algorithms matter more, so the convenience of writing code determines your choice. Of course, once the algorithm is stable, it becomes the problem described below.

When you face billions of web pages every day, and a full recomputation takes a week, every page needs its title, main image, publication time, page blocks, and page value extracted. It is impossible for anyone to write a "script" for each site and configure templates by hand. Large amounts of word segmentation, machine learning, scoring, follow-link quality prediction, and filtering are involved, and they consume an enormous amount of computation.
At this stage, computing speed is very important, unless you can convince the boss to give you thousands more machines. Against this requirement, it is worth rewriting all the basic components, and you should choose a language with high execution speed.

Note that flexibility, or extraction accuracy, decreases gradually from the top of this list to the bottom. The PM will not ask you to extract every field accurately from tens of billions of pages.


Finally, let's talk about the fetching problem itself. Scheduling and fetching are necessary for every crawler, but there is not much to say about them; crawlers of different scales naturally do things differently. Such a system generally has a clear purpose and few dependencies, and you do not need to modify it constantly. Architecturally it can often be an independent component, which may even be written in a different language from the downstream components.
With data-center bandwidth, downlink traffic is essentially never the limit. As long as you and the target websites are willing, crawl speed is not the bottleneck. What matters more is calculating the load you put on the target sites, filtering duplicate pages, and identifying high-quality links, and that in turn creates computing pressure. For a crawler, what counts is not single-machine execution speed but development efficiency and the convenience of the tooling; the simpler the language, the better. As @kenth said, development efficiency is very important: because crawler code has to be adapted to each website, a flexible scripting language like Python is especially suitable for this task.
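
As a small illustration of the duplicate-filtering point above, here is a minimal in-memory sketch; a real system would use a Bloom filter or an external store, and the normalization rules below are a simplified assumption.

# Sketch of URL de-duplication before a fetch is scheduled.
# The normalization rules are deliberately simplified assumptions.
import hashlib
from urllib.parse import urlsplit, urlunsplit

seen = set()

def normalize(url):
    parts = urlsplit(url)
    # Drop the fragment and lowercase the host so trivially different
    # spellings of the same URL collapse to one key.
    return urlunsplit((parts.scheme, parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

def should_fetch(url):
    key = hashlib.md5(normalize(url).encode("utf-8")).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True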
At the same time, Python has powerful crawler libraries such as Scrapy. I have written crawlers in Java, and the language is heavy: any change to the data model forces a lot of changes throughout the code, which is a bit troublesome.
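
For reference, a minimal Scrapy spider looks roughly like this; the domain and CSS selectors are hypothetical, so treat it as a sketch rather than the answerer's actual code.

# Minimal Scrapy spider sketch; the domain and selectors are hypothetical.
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # Follow every article link found on the listing page.
        for href in response.css("a.article::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }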
However, in some projects the underlying tool just crawls web pages and wraps a business layer on top; for that, Java is quite comfortable to use. It depends on your specific business.

In one earlier crawling task, the target pages were all dynamically rendered (what kind of blog does that?), so we had to find a headless browser. We chose PhantomJS, so JavaScript was the language used.

Has nobody mentioned Node.js? With cheerio it is easy and convenient, naturally asynchronous, and crawls quickly.

Python has the richest set of crawler modules.

Has anyone implemented one in R?
