Which of PHP, Python, and Node.js is suitable for crawling?

1. page parsing capability
2. database operation capability (mysql)
3. crawling efficiency
4. amount of code
Please also describe the class libraries or frameworks needed for the language you recommend. Thank you.
For example: python + MySQLdb + urllib2 + re
PS: Actually, I don't like Python very much (maybe because of the Windows platform: it needs all kinds of character-encoding handling, and its multithreading seems quite bad).

Reply content: It mainly depends on how you define "crawler".

1. For targeted crawling of a handful of pages with only simple page parsing, where crawling efficiency is not a core requirement, there is little difference between the languages.
Of course, if the page structure is complex, the regular expressions become very complicated; especially once you have tried class libraries/crawler libraries with XPath support, you will find that the regex approach, while easy to start with, is poor in both scalability and maintainability. In that case, I recommend using an existing crawler library that provides XPath and multi-thread support.
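For instance, a minimal sketch of that XPath-based approach in Python (assuming the requests and lxml packages; the URL and class names are placeholders, not anything from the answer above):

```python
# Targeted crawling with an XPath-capable parser: requests fetches the page,
# lxml evaluates XPath expressions. URL and selectors are illustrative only.
import requests
from lxml import html

resp = requests.get("https://example.com/articles", timeout=10)
resp.raise_for_status()
tree = html.fromstring(resp.content)

# XPath keeps the extraction readable where an equivalent regex would get messy.
for item in tree.xpath('//div[@class="article"]'):
    title = item.xpath('string(.//h2)').strip()
    links = item.xpath('.//a/@href')
    print(title, links[0] if links else None)
```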

2. For targeted crawling where the main task is parsing content generated dynamically by JS.
Here the page content is generated dynamically by JS/Ajax, so the usual request-the-page-then-parse approach is useless; you need a JS engine like the one in Firefox or Chrome to execute the page's JS dynamically.
In this case, consider CasperJS + PhantomJS or SlimerJS + PhantomJS; of course, you can also consider Selenium.
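As a rough illustration of the Selenium option (a sketch only; it assumes Chrome plus a matching chromedriver, and the URL is a placeholder):

```python
# Rendering a JS/Ajax-driven page with Selenium so the usual fetch-then-parse
# approach works on the resulting DOM. Assumes Chrome + chromedriver installed.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")         # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/js-rendered-page")
    rendered = driver.page_source           # DOM after the page's JS has run
    print(len(rendered))
finally:
    driver.quit()
```

For pages that load content lazily, you would normally add an explicit wait before reading page_source.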

3. If the crawler involves large-scale site crawling, then efficiency, scalability, and maintainability all have to be considered.
Large-scale crawling raises many issues: multi-threaded concurrency, I/O mechanisms, distributed crawling, message passing, deduplication, task scheduling, and so on, and at this point the choice of language and framework really matters.
PHP has poor support for multithreading and asynchronous I/O and is not recommended.
Node.js: can crawl some vertical sites, but its support for distributed crawling and message passing is weak; decide based on your own situation.
Python: strongly recommended, with better support for all of the issues above. In particular, the Scrapy framework deserves to be the first choice. It has many advantages: XPath support, good performance thanks to Twisted, and good debugging tools.
In this scenario, if you still need to parse JS-generated content, CasperJS is no longer a good fit; you would have to build on a JS engine such as Chrome's V8 yourself.
As for C and C++, although their performance is good, they are not recommended, especially when you consider the cost and other factors. For most companies it makes more sense to build on existing open-source frameworks than to reinvent the wheel: writing a simple crawler is easy, but writing a complete one is hard.
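To make the Scrapy recommendation concrete, here is a minimal spider sketch against the public practice site quotes.toscrape.com (an illustrative target, not one mentioned above); it can be run with `scrapy runspider quotes_spider.py -o quotes.json`:

```python
# A minimal Scrapy spider: built-in request scheduling, CSS/XPath selectors,
# and item output, all on top of Twisted's asynchronous networking.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until the site runs out of pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```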

I set up a site, http://lewuxian.com, that aggregates public-account-style content. It is based on Scrapy and also involves message queues; for details you can refer to its task scheduling and distribution architecture.

Let me talk a bit about my own experience. I can't speak for PHP; I have used Python and Node.js.

Simple targeted crawling:
Python + urllib2 + RegExp + bs4
or
Node.js + co, any DOM framework or HTML parser + Request + RegExp, which is also very handy.
For me the two options are roughly equivalent, but since I am mainly a JS person these days, I pick the Node platform more often.
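A quick sketch of the Python recipe above, translated to Python 3 (urllib.request stands in for urllib2; the URL and the link pattern are placeholders):

```python
# Simple targeted crawl: fetch one page, let BeautifulSoup handle the structure,
# and keep regular expressions only for small matching jobs.
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

page = urlopen("https://example.com/list", timeout=10).read()
soup = BeautifulSoup(page, "html.parser")

for a in soup.find_all("a", href=re.compile(r"^/post/")):
    print(a.get_text(strip=True), a["href"])
```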

Site-wide crawling:
Python + Scrapy
If the DIY spiders in the two recipes above are millet plus rifles, Scrapy is heavy artillery, and it is remarkably pleasant to use: custom crawl rules, HTTP error handling, XPath, RPC, the pipeline mechanism, and so on are all built in. Moreover, because Scrapy is implemented on top of Twisted, its efficiency is very good too. The only drawback is that installation is troublesome with many dependencies; on my relatively fresh OS X install there was no way to get Scrapy installed directly with pip.

In addition, if the spider uses XPath and you install an XPath plug-in in Chrome, the parsing paths become clear at a glance and development efficiency is extremely high.

PHP and JS were not born for this; Python has fairly mature frameworks, but I have not used them enough to comment; Node.js I can speak to, because I know it and I do my data capture with Node.

I suspect quite a few people, like me, develop and deploy on Linux servers. Node.js has an outstanding advantage here: deployment is easy and there are almost no cross-platform problems. Python, by contrast... it can easily drive you up the wall.
For page parsing I use cheerio, which is fully compatible with jQuery syntax; if you are familiar with front-end work it is very comfortable, and you no longer have to wrestle with annoying regular expressions.
For the database there is the mysql module, which covers everything you need.
As for crawling efficiency, I have not done a serious stress test, but in my experience the bottleneck moves to bandwidth once you add a few more "threads". Strictly speaking it is not multithreading but asynchronous I/O: with bandwidth saturated (several hundred concurrent requests, around 10 MB/s), CPU usage sits at about 50%, and that is on the lowest-spec Linode host. Besides, I usually cap the concurrency and the crawl interval, which does not cost much performance.
Finally, the amount of code. The biggest headache with asynchronous programming is falling into callback hell; if you write a simple concurrent queue suited to your situation, it is not much more trouble than synchronous code.

Now, to answer the four points one by one:
1. page parsing capability
Basically everyone uses a language's third-party packages to parse web pages; implementing an HTML parser from scratch is prohibitive in both difficulty and time. For complex web pages, or for requests that are generated by heavy JavaScript, you can drive a browser environment, and Python is absolutely up to that job.

2. database operation capability (mysql)
Python has both official and third-party connector libraries for database operations. Besides, data captured by a crawler is often a better fit for a NoSQL database.
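For the MySQL side specifically, a minimal sketch (assuming the pymysql driver and an existing `pages` table; MySQLdb from the question exposes essentially the same DB-API interface):

```python
# Writing a crawled record into MySQL via the standard Python DB-API.
# Connection details and table layout are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="crawler", password="secret",
                       database="spider", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pages (url, title) VALUES (%s, %s)",
            ("https://example.com/post/1", "Example title"),
        )
    conn.commit()
finally:
    conn.close()
```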

3. crawling efficiency
It is true that scripting languages are not fast at raw computation, but compared with a site's anti-crawler throttling and with network I/O, the speed of the language itself is negligible; what matters is the developer's skill. If you make good use of the time spent waiting on network requests to do other work (multithreading, multiprocessing, or coroutines), efficiency is not a problem in any of these languages.
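For example, overlapping the network waits with a thread pool (a sketch assuming requests; the URLs are placeholders):

```python
# I/O-bound crawling with threads: while one request waits on the network,
# other threads keep fetching, so wall-clock time drops sharply.
from concurrent.futures import ThreadPoolExecutor

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 21)]  # placeholders

def fetch(url):
    return url, requests.get(url, timeout=10).status_code

with ThreadPoolExecutor(max_workers=10) as pool:
    for url, status in pool.map(fetch, urls):
        print(status, url)
```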

4. amount of code
Python has the advantage here. Python code is famously concise; as long as the developer's skill is up to it, Python can read almost like pseudocode, with very little code.

As for the class libraries or frameworks needed for the recommended language (the question's example was python + MySQLdb + urllib2 + re):
Python: requests + MongoDB + BeautifulSoup
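A minimal sketch of that stack wired together (assuming the pymongo driver and a local MongoDB; the URL and selector are placeholders):

```python
# requests fetches, BeautifulSoup parses, MongoDB stores the scraped documents.
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017/")["spider"]["articles"]

resp = requests.get("https://example.com/articles", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

docs = [{"title": h2.get_text(strip=True)} for h2 in soup.find_all("h2")]
if docs:
    collection.insert_many(docs)   # schemaless storage suits scraped data
```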

PS: Actually, I don't like Python very much (maybe because of the Windows platform: it needs all kinds of character-encoding handling, and its multithreading seems quite bad).
Because of the GIL, Python multithreading cannot take advantage of multiple cores, so you can switch to a multi-process solution. But a crawler spends most of its time waiting on network I/O anyway, so you can simply use coroutines to increase crawling speed.
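A coroutine-based sketch of that idea (assuming the aiohttp package; URLs are placeholders):

```python
# asyncio + aiohttp: a single thread interleaves many in-flight requests,
# so the GIL is irrelevant for this I/O-bound workload.
import asyncio

import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return url, resp.status

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

for url, status in asyncio.run(main()):
    print(status, url)
```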


In addition, I recently summarized some Python crawler programming experience in my column; if you are interested, feel free to read it and point out mistakes.
Column address: Workshop.

For writing data to the database, Node's asynchronous model means you neither wait for synchronous I/O to complete nor deal with multi-threaded locks. Node 5.x already supports ES6, so you can use Promises to untangle multiple nested callbacks.

As for using PHP for data capture and analysis, forget it. I have written capture scripts in PHP, Node.js, and Python; let me go over them briefly.

First, PHP. Advantages: the frameworks for fetching and parsing HTML can be used out of the box, and the language is easy to pick up. Disadvantages: first, speed and efficiency are a real problem. Once, while downloading movie posters with an unoptimized crontab job running on a schedule, so many PHP processes piled up that the machine ran out of memory. Second, the syntax is verbose: too many keywords and symbols, not concise enough, which gives the impression of a language that was not carefully designed and makes writing it a chore.

Node.js. Its advantage is efficiency, efficiency, and efficiency. Because the networking is asynchronous, it is roughly as powerful as hundreds of processes running concurrently, yet memory and CPU usage stay tiny. If you do not do complex processing on the captured data, the system bottleneck is basically the bandwidth and the I/O speed of databases such as MySQL. The flip side of this advantage is, of course, the disadvantage: asynchronous networking means callbacks. If your logic has to be linear, say you must finish capturing the previous page and extract its data before you can fetch the next one, or there are even several layers of dependencies, you end up with terrifying nested callbacks, and the code structure and logic turn into a mess. You can, of course, use flow-control tools such as Step to work around this.

Finally, Python. If you have no extreme efficiency requirements, I recommend Python! First, its syntax is very concise, so the same statement costs far fewer keystrokes. Second, Python is very good at data handling, such as packing function arguments, list comprehensions, and matrix processing.

I am also working on a Python data capture and processing toolkit, which I am still polishing; stars are welcome: yangjiePro/cutout - GitHub.

Python has Scrapy, a framework built for crawling. I have used PHP's curl to grab numbers from a mobile-phone verification-code platform, and to crawl caoliu pages and download the images automatically.
Well, I like Caojia. I am still learning Python; I personally think Python is really powerful, and I will definitely look at Node.js later.
Oh, and PHP does not support multithreading, so you can only make do with multiple servers or extensions; let's not go there...
Forget it, I recommend Python; its multithreading will feel very nice.
I have used Python to write crawlers for eight major music websites, so that is what I would suggest.

I have used PHP, Python, JS, and Node.js.
PHP is fine for writing a crawler. I wrote one and ran it from the PHP command line. With curl_multi and 50-way concurrency, it could capture roughly 600,000 pages a day, depending on network speed (I was on a campus network, so it was fast), and the data was extracted with regular expressions.
curl is a mature library; it handles exceptions, HTTP headers, POST, and the rest well, and it is also convenient to drive MySQL from PHP to store the results.
However, the multi-threaded side of curl (curl_multi) is hard for beginners, especially since the official PHP documentation for curl_multi is quite vague.

One of the biggest advantages of writing crawlers in Python is how beginner-friendly it is. Libraries such as Requests do essentially what curl does, but for simple crawling they are much easier to use, and with beginner-friendly libraries like Beautiful Soup on top, Python really is well suited to crawling.
However, encoding can be a headache for beginners, and I think PHP handles it better. In fact, if the team did not require otherwise, I would write all my crawlers in PHP.

JavaScript in the browser is like a virtual machine inside a virtual machine, so let's not even talk about performance.
  1. It runs in a sandbox, so operating on databases or local files is troublesome; there is no native interface. (I have never written a crawler this way, so I have not looked into other workarounds.)
  2. Parsing the DOM tree is inefficient and uses quite a lot of memory.
  3. Cross-origin restrictions are another headache, even though Chrome can switch them off with --disable-web-security.
  4. In short, writing crawlers in in-browser JavaScript is one headache after another.
I have never seen anyone crawl this way.

I have never used Node.js.

1. There is basically no difference in page-parsing capability; they all support regular expressions, though Python's beginner-friendly libraries make it noticeably more convenient.
2. For database operations, PHP supports MySQL natively; Python needs an extra library such as MySQLdb, but that is no trouble.
3. For crawling efficiency, all of them support multithreading, and I felt no difference; the bottleneck is basically the network. I have never run a rigorous test, though, since I do not have the habit of writing the same thing in several languages, but my impression is that PHP was faster?
4. There is basically no difference in the amount of code. They all come to a few dozen lines; add exception handling and it is around a hundred; if you also need to record some exceptions for later re-crawling and handle them, it is a few hundred. No real difference.
However, not counting libraries, Python's is clearly the shortest.
As for performance, crawlers and language performance are basically unrelated, so you do not need to think about it. When my crawl was pulling nearly 30 Mbps, the PHP command-line crawler used less than 3-5% CPU and about 15-20 MiB of memory (on an old Core 2 Duo P8700, with 50 crawler threads, each thread running 10 regular expressions, 1 JSON parse, 2 database insert operations (insert-if-not-exists over millions of rows), and roughly 40 exception checks). Only the network was the bottleneck.
If you do not have a gigabit uplink, you really do not need to worry about performance; just pick whichever one you are most familiar with.
I crawled on the order of gigabytes of data over the past few days.
