Python Instant Web Crawler: Project Launch Instructions


As an old programmer with a love of programming, I really could not resist the pull any longer: Python is simply too hot, and it keeps stirring my interest.

I am wary of Python. My own system is built on Drupal, in PHP, and when that language was upgraded, a great deal from the old version was overturned; I had to spend a lot of time and effort porting and upgrading, and there are still landmines buried somewhere in the code. I doubt Python can entirely avoid this problem (there are many such voices, for example the claim that Python 3 "destroyed" Python).

Nevertheless, I have launched this Python instant web crawler project. I have written crawler-related programs in C++, Java, and JavaScript for more than ten years; for raw performance nothing beats C++, and with its mature standards you can be very confident in your system: as long as it is fully tested, it runs as expected. In the GooSeeker project we have been working steadily in one direction, "harvesting data", letting users (not only professional data collectors) experience the thrill of harvesting Internet data; one important implication of "harvesting" is large volume. Now I am starting the "instant web crawler" to cover the scenarios that "harvesting" does not. The way I see it:

    • At the system level: "instant" stands for rapid deployment of a data application system.
    • At the data-flow level: "instant" means the interval from acquiring data to using it is momentary; each data object can be processed on its own, without waiting for a whole batch to be stored in a database and then read back out (see the sketch after this list).
    • A further implication of "instant" is that the web crawler is an embeddable module that integrates with the whole information processing system.


A lot of programmers are playing with Python crawlers, so I have drawn up a plan: build a more modular software component that tackles the most labor-intensive problem, content extraction. (Some people estimate that in the big-data and data-analysis pipeline, data preparation accounts for 80% of the workload; by extension, in network data capture 80% of the workload is writing extraction rules for the varied page structures of different websites.) A taste of that rule-writing burden is sketched below.
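As an illustration (a hypothetical example with made-up page layouts, not project code), extraction rules must be rewritten for every site; here two imaginary sites need entirely different XPath rules for the same logical fields:

    # Hand-written rules differ per site; this is the 80% that eats the effort.
    from lxml import html

    page_a = html.fromstring('<div class="post"><h1>Title</h1>'
                             '<span class="author">Alice</span></div>')
    page_b = html.fromstring('<article><header>Title</header>'
                             '<p id="by">Bob</p></article>')

    # Site A's layout:
    print(page_a.xpath('//div[@class="post"]/h1/text()')[0],
          page_a.xpath('//span[@class="author"]/text()')[0])
    # Site B needs different XPaths for the same fields:
    print(page_b.xpath('//article/header/text()')[0],
          page_b.xpath('//p[@id="by"]/text()')[0])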

I think of it as a small machine: the input is a raw web page, and the output is the structured content extracted from it. The machine has one replaceable part, the instruction block that turns the input into structured output, which we call the "extractor", so that people no longer agonize over debugging regular expressions or XPath.
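Here is a minimal sketch of that machine, with hypothetical names rather than the project's actual API: a generic shell whose replaceable "extractor" part is just a rule table, so retargeting the machine at a new site means swapping rules, not rewriting code.

    from lxml import html

    class Extractor:
        """The replaceable part: an instruction block mapping a raw page
        to structured fields (field name -> XPath expression)."""
        def __init__(self, rules):
            self.rules = rules

        def extract(self, raw_html):
            page = html.fromstring(raw_html)
            return {field: page.xpath(xp) for field, xp in self.rules.items()}

    # Swapping the rule table retargets the same machine at another site,
    # leaving fetching and downstream processing untouched.
    news = Extractor({"title": "//h1/text()",
                      "body": '//div[@class="content"]//p/text()'})
    print(news.extract('<h1>Hello</h1><div class="content"><p>World</p></div>'))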

This is an open project. Two years ago we launched an instant web crawler project on mobile phones; because it was developed for a business customer, it was inconvenient to open it up. In this project we will open the same ideas and methods, implemented in today's hottest language, Python, and I hope you will participate. In the course of implementation, we will open up all materials and results, including the pitfalls we encounter.
