Python Instant Web Crawler: Project Launch Notes

Source: Internet
Author: User


As an old programmer who loves programming, I really could not resist the impulse. Python is simply too hot, and it keeps tempting me.


I have been wary of Python. I used to build on the Drupal system in PHP, and when the language was upgraded it overturned a lot of things from the old version. I had to spend a lot of time and effort porting and upgrading, and there are still landmines buried somewhere in the code. I don't think Python can avoid this problem either (many people say similar things, for example that Python 3 is destroying Python).


However, I started this Python instant web crawler project anyway. I have written crawler-related programs in C++, Java, and JavaScript for more than 10 years. If you are after high performance, nothing beats C++. It also has a sound standards system, which makes you confident in your system: as long as it is fully tested, it will run as expected. In the Gooseeker project, we have kept working in one direction, "harvesting data", so that the vast majority of users (not only professional data collection users) can experience the thrill of harvesting Internet data. One important meaning of "harvesting" is large quantities. Now I am starting the "instant web crawler" to cover the scenarios that "harvesting" does not cover. As I see it:

    • At the system level: "instant" stands for rapid deployment of data application systems.

    • At the data flow level: "instant" means that the path from acquiring data to using it is immediate. A single data object can be processed on its own, without waiting for a whole batch to be stored in a database and then read back out.

    • "Instant" has another implication: the web crawler is an embedded module that integrates with the entire information processing system.
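To make the data flow point concrete, here is a minimal sketch of processing each data object as soon as it arrives. It is not the project's code. The names `crawl`, `fake_fetch`, and `handle` are hypothetical, and `fake_fetch` stands in for a real HTTP request so that the sketch runs offline.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def crawl(urls: Iterable[str], fetch: Callable[[str], str]) -> Iterator[Dict[str, str]]:
    """Yield one data object per page as soon as that page has been fetched."""
    for url in urls:
        yield {"url": url, "body": fetch(url)}


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request (assumption, offline only).
    return f"<html>content of {url}</html>"


handled: List[str] = []


def handle(item: Dict[str, str]) -> None:
    # The surrounding application consumes the object directly;
    # nothing is written to a database and read back first.
    handled.append(item["url"])


for item in crawl(["http://example.com/a", "http://example.com/b"], fake_fetch):
    handle(item)  # each object is used immediately, one at a time
```

Because `crawl` is a generator, the consuming application drives it, which is one simple way to embed a crawler as a module inside a larger information processing system.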

[Figure: http://www.gooseeker.com/doc/data/attachment/forum/201605/06/120228qojqc66gj6ar3qv3.png]

A number of programmers are playing with Python crawlers, and I have drawn up a plan: build a more modular software component that addresses the most energy-consuming problem, content extraction. (It has been estimated that across the whole big data and data analysis chain, data preparation accounts for 80% of the workload. Extending that, we might say that 80% of the workload in web data crawling goes into writing crawl rules for the different data structures of different websites.)

I think of it as a small machine (see the figure above): the input is the raw web page, and the output is the extracted structured content. The small machine also has a replaceable part: an instruction block that controls how the input is converted into the output structure, which we call the "extractor". With it, nobody has to be bothered by debugging regular expressions or XPath any more.
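The small machine and its replaceable part can be sketched roughly as follows. This is an illustration under my own assumptions, not the project's actual interface. The names `small_machine` and `make_xpath_extractor` and the rule-table format are hypothetical. It uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and requires well-formed markup, so real-world HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# An "extractor" is modelled here as any callable that maps a parsed page
# to a dict of structured fields. This interface is an assumption.
Extractor = Callable[[ET.Element], Dict[str, str]]


def make_xpath_extractor(rules: Dict[str, str]) -> Extractor:
    """Build an extractor from a {field_name: path} rule table (hypothetical format)."""
    def extract(root: ET.Element) -> Dict[str, str]:
        result = {}
        for field, path in rules.items():
            node = root.find(path)  # first matching element, or None
            result[field] = (node.text or "").strip() if node is not None else ""
        return result
    return extract


def small_machine(page: str, extractor: Extractor) -> Dict[str, str]:
    """Input: raw page text. Output: structured content."""
    root = ET.fromstring(page)  # stdlib parser; the page must be well-formed
    return extractor(root)


PAGE = """<html><body>
<h1 class="title">Example product</h1>
<span class="price">19.90</span>
</body></html>"""

# The replaceable part: swap in a different rule table for a different site.
product_extractor = make_xpath_extractor({
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
})

print(small_machine(PAGE, product_extractor))
```

The machine itself never changes. Supporting a new website means supplying a new extractor, so the site-specific rules stay outside the crawler code.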

This is an open project. Two years ago we launched an instant web crawler project for mobile phones, but because it was developed for a business group, it was not convenient to open it up. The same ideas and methods will be opened up in this project, implemented in Python, currently the hottest language. I hope you will join in. In the course of implementation, we will make public all the materials and results, as well as the pitfalls we run into.


This article comes from the "Fullerhua blog". Please do not reprint it.
