Design and custom implementation of a pseudo-distributed web crawler framework


"Project Vision" system based on the direction of the intelligent crawler to the data from the original passive finishing to the future of the transition of the active offensive, will give "crawler" self-cognition ability, to actively seek "offensive" target. Instead of artificial complex and monotonous repetitive work. Can achieve the role of independently in a certain direction in the field of artificial intelligence.

"Project Progress" project phase of the basic implementation framework, data processing and simple crawl task implementation.

"Project description" in order to better understand the excellent framework of the implementation principle, the project as far as possible to block the excellent open-source third-party jar package implementation, custom implementation and then go to merit and eat, form a contrast.

"Design principles" follow the new technology according to form, and constantly accumulate the ability to harness the various open source technologies under different business scenarios.

"Project Planning"

The first phase builds the overall project framework and accumulates the necessary foundational knowledge!

The second phase fully implements the crawling operations required across the network domain.

The third phase uses the framework to obtain a specified volume of data in a specific domain, with the "crawler" automatically batch-processing large amounts of data: mining useful data from it, or applying various recognition technologies (pattern recognition, image recognition, speech recognition, etc.) to achieve automatic data collection and efficient collation! (This phase may be deferred, skipping ahead to phases four and five.)

The fourth phase upgrades the platform framework, using existing mature data-processing and data-visualization technology to productionize the corresponding functional points of the platform (productionization here means using mature frameworks to adapt quickly to production needs). Mature candidates include Hadoop, Spark, Redis, the R language, and other frameworks and technologies that excel in their respective areas. (Note: prefer open-source frameworks wherever possible, since one purpose is to study the source code of excellent framework implementations.)

The fifth phase upgrades the platform's languages. Because Python performs well in the data-processing direction, some of the platform's functions should be reimplemented in whichever language suits them best. (Note: in real operation the platform is bound to run into performance problems in many areas; keep reflecting on how to solve them, combining the strengths of different languages with the differences in business logic and choosing different processing approaches accordingly.)

The sixth phase is mobile development, an optional direction ...

The seventh phase applies machine-learning techniques to the automation problems in the data, turning what was once the passive collation of historical data into the automatic cleaning of future data, and brings mature recognition technology into the platform's development.

"Project design Advice"

As training, implement functionality yourself starting from the JDK, or from even more basic code, so that you understand more of the underlying implementation details and, in the shortest possible time, reach fluent use of roughly 80% of the JDK. Java is just a tool, one means of communicating with the computer! A function you implement yourself will certainly not be optimal; along the way you must learn the more appropriate ways of communicating with the computer. Different languages have different strengths; explore them in continuous practice and bring the advantages of each into play!

If Java is an automatic sedan, C is a manual jeep. What are data structures, then? They are how the gearbox works. You can drive an automatic from A to B without knowing how the gearbox works, and not necessarily any slower than someone who does understand it. Writing programs is like driving: experience counts for a lot, but if you do not know how things work underneath, you will only ever be able to drive; you will neither repair cars nor build them. If neither of those interests you, being able to use data structures is enough. But if you hope to go somewhat higher in the field of programming in this lifetime, data structures are a topic you cannot get around.

Java does a great deal for you: its rich container classes, plus garbage collection, make programming feel easy. But have you ever wondered where those container classes come from and what their existence means? Take the most superficial example, ArrayList: consider what a great convenience its existence is, an array with random access that grows its capacity automatically. C has nothing like it; you have to implement it yourself. But how, exactly? If that kind of question interests you, you must study data structures. Even the object-oriented programming paradigm itself is a data-structure problem: how do we encapsulate data together with the operations on it so as to create something like a class or prototype?
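As a concrete illustration of the ArrayList point above, here is a minimal sketch, in the spirit of the exercise, of a growable array in Java. It is a toy for comparison, not the JDK's actual implementation: random access is a single array lookup, and the backing array is reallocated when capacity runs out.

    // Minimal growable-array sketch: random access plus automatic capacity
    // growth, the two conveniences of ArrayList discussed above.
    public class GrowableArray<E> {
        private Object[] elements = new Object[10]; // initial capacity
        private int size = 0;

        public void add(E e) {
            if (size == elements.length) {
                // Out of room: allocate ~1.5x the space and copy everything over.
                Object[] bigger = new Object[elements.length + (elements.length >> 1)];
                System.arraycopy(elements, 0, bigger, 0, size);
                elements = bigger;
            }
            elements[size++] = e;
        }

        @SuppressWarnings("unchecked")
        public E get(int index) { // random access: one array lookup, O(1)
            if (index < 0 || index >= size) {
                throw new IndexOutOfBoundsException("index: " + index + ", size: " + size);
            }
            return (E) elements[index];
        }

        public int size() {
            return size;
        }
    }

The copy-on-grow step is exactly the kind of detail that stays invisible until you implement it yourself: appending is O(1) only in the amortized sense, because every resize costs a full copy.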
In addition, and importantly, data structures are the cornerstone of all kinds of practical algorithms, so studying them is a way of building up one's internal strength. Throughout the design process, keep learning from mistakes. A famous person once said: I learn from the mistakes of others, rather than always studying how others succeeded. In other words, we should love bugs; debugging them is how we grow and surpass ourselves!

"Architecture and Core Module implementation"

    1. The crawler uses a heartbeat mechanism, which makes task status easy to monitor and reduces the burden on the task scheduler (see the first sketch after this list).
    2. After the crawler obtains resources, it transmits them to the processing module in a unified format, imitating Hadoop's MapReduce, so that machine resources are used more effectively.
    3. The data storage technology is chosen dynamically according to the characteristics of each task's data, leaving room for the framework's extensibility.
    4. A wrapper around Map implements a PO container (singleton pattern) that stores the system's unique key-value parameters for convenient access (see the second sketch after this list).
    5. Connection pooling is used for the data source and for thread management, for the rational use of resources.
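To make items 1 and 4 concrete, here are two minimal sketches. Neither is the project's actual code; the class names and the reporter callback in the first are illustrative assumptions.

The first sketch shows the heartbeat idea: each crawler worker reports its own status on a fixed schedule, so the scheduler can listen passively instead of polling every worker.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;

    // Heartbeat sketch: a worker pushes "workerId:status" to the scheduler
    // every few seconds. The Consumer stands in for whatever transport the
    // framework really uses (HTTP, socket, message queue, ...).
    public class CrawlerHeartbeat {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        private final String workerId;
        private final Consumer<String> reporter; // hypothetical transport hook
        private volatile String status = "IDLE";

        public CrawlerHeartbeat(String workerId, Consumer<String> reporter) {
            this.workerId = workerId;
            this.reporter = reporter;
        }

        public void setStatus(String status) {
            this.status = status;
        }

        public void start() {
            // Report immediately, then every 5 seconds.
            scheduler.scheduleAtFixedRate(
                    () -> reporter.accept(workerId + ":" + status),
                    0, 5, TimeUnit.SECONDS);
        }

        public void stop() {
            scheduler.shutdownNow();
        }
    }

The second sketch is one plausible shape for the Map-backed PO container of item 4: an eagerly initialized singleton holding the system's key-value parameters in a thread-safe map.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Singleton PO container sketch: one shared, thread-safe place for the
    // system's unique key-value parameters.
    public final class ParamContainer {
        private static final ParamContainer INSTANCE = new ParamContainer();
        private final Map<String, Object> params = new ConcurrentHashMap<>();

        private ParamContainer() {
            // private constructor: no instantiation from outside
        }

        public static ParamContainer getInstance() {
            return INSTANCE;
        }

        public void put(String key, Object value) {
            params.put(key, value);
        }

        public Object get(String key) {
            return params.get(key);
        }
    }

Usage would look like ParamContainer.getInstance().put("crawl.interval", 5), with any module in the system able to read the same value back through getInstance().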

"Schematic diagram of Distributed crawler architecture"

"Project Address" https://github.com/chenkai1100/SpiderFrame/

Criticism and corrections from passing experts are welcome.
