Source: video collection website
1 Preface
This chapter provides an overview of the SRS (software requirements specification).
Open source is destined to belong to all mankind, and it always stays loyal to its best contributors. It carries the passion of each of us and showcases our talents.
Open source also gives openwebant a special significance. Its success would also represent the success of open source in China. It has become a bond that brings together China's aspiring young people and ambassadors of goodwill, so that together we can show the wisdom and strength of Chinese talent.
With all our strength, we shout: "Unity is power, and dedication is virtue!"
1.1 Purpose
This section covers the following:
- The purpose of this SRS;
- The intended readers of this SRS.
This document describes in detail the application scenarios, functional features, system structure, project rationale, and user-group characteristics of openwebant.
This document is intended for system designers, system developers, and system testers.
1.2 Scope
High-level design of the software is usually considered to require substantial resources (possibly 10%-20% of the overall product development cost). This section should do two things:
- Identify the software product to be produced by name, for example: ××× database system, report generation program, and so on;
- Explain what the software product will do and, if necessary, what it will not do.
It should also describe the application of the software. The description should:
- Describe all related benefits, objectives, and final goals as accurately as possible;
- Be consistent with similar statements in any higher-level specification (for example, the system requirements specification), if one exists.
Openwebant is an open-source, web-based system for collecting unstructured text information, and it works as tirelessly as an ant.
Openwebant can be used for website information collection, full-text retrieval on the site, and structured processing of unstructured text information.
The system is developed in Java and is designed around both B/S (browser/server) and C/S (client/server) architectures. Currently, the B/S architecture is implemented.
Because the information collection system must support different inputs and outputs, which are device-dependent and cannot be coded in a uniform way, openwebant must provide application interfaces for these inputs and outputs and must be able to manage the corresponding external drivers.
The system involves a variety of technologies, including lexical search, multithreading, XML, syntax analysis, network graph data structures, plug-ins, transmission protocols, databases, and template languages. We sincerely hope that experts in these areas will join the development of the system or provide technical support.
The project team releases the system as open source under a general open-source license and provides downloads and paid technical support on the official website, www.java51.com.
Openwebant is a system used for information collection. The collected content is limited to unstructured text information. Information sources can be:
- Web System
- File System
- Database System
- Other unstructured text data sources
The publishing target is a system or receiving device that provides unstructured text data storage. It can be of the following types:
- File System
- Database System
- Other unstructured text data storage systems or receiving devices
In openwebant, collecting information is called input, and publishing information is called output. To remain transparent to the various device-dependent terminal systems and to support all of them, the system provides unified input and output interfaces. The system drives the installed terminals according to the scheduling policy.
Each terminal must provide a policy for communication between itself and the collection system. This policy includes:
A) the communication protocol
B) the collection rules or publishing driver
Different terminals use different communication protocols. Web systems use protocols such as HTTP, HTTPS, and FTP. For file systems, the operating system provides standard read/write interfaces. For database systems, JDBC can be used to read and write data.
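The unified input and output interfaces described above could be sketched in Java roughly as follows. This is only an illustration: the names `InputDriver`, `OutputDriver`, `FileInputDriver`, and `MemoryOutputDriver` are hypothetical and are not taken from the openwebant code base. Only the file-system input case is shown, and an in-memory output stands in for a database or file publisher.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical unified input interface: reads raw text from a source address. */
interface InputDriver {
    /** Protocol this driver handles, for example "file" or "http". */
    String protocol();

    String read(String address) throws IOException;
}

/** Hypothetical unified output interface: publishes text to a target. */
interface OutputDriver {
    String protocol();

    void publish(String target, String text) throws IOException;
}

/** File-system input, relying on the standard read interface of the platform. */
final class FileInputDriver implements InputDriver {
    @Override public String protocol() { return "file"; }

    @Override public String read(String address) throws IOException {
        return new String(Files.readAllBytes(Paths.get(address)), StandardCharsets.UTF_8);
    }
}

/** In-memory output, standing in for a database or file publisher. */
final class MemoryOutputDriver implements OutputDriver {
    final List<String> published = new ArrayList<>();

    @Override public String protocol() { return "memory"; }

    @Override public void publish(String target, String text) {
        published.add(target + ":" + text);
    }
}
```

With interfaces of this kind, the collection engine only needs to know the protocol name of a terminal; an HTTP or JDBC driver would implement the same two methods.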
Openwebant consists of multiple independent components. According to different responsibilities, it is divided into the following modules:
- Input Driver Management Module
- Output Driver Management Module
- Input driver management module parameter settings
- Output driver management module parameter settings
- Collection rule syntax
- Collection rule parsing module
- Collection Rule Management Module
- Collection Engine scheduling management module
- Collection Engine
- Pipeline construction module
- Template Management Module
- Template parsing module
- HTTP input Driver Module
- File System input Driver Module
- JDBC publishing Driver Module
- HTTP driver parameter settings
- JDBC driver parameter settings
- Collection rule parameter settings
- Template parameter settings
The system ships with implementations of the common inputs and outputs. For example, the inputs include FTP, HTTP, and file; the outputs include database and file.
1.3 Definitions, acronyms, and abbreviations
This section must provide definitions of all terms, acronyms, and abbreviations needed to interpret the SRS properly. This information can be provided in an appendix to the SRS or by reference to other documents.
- Openwebant: an open unstructured text information collection system.
- Input Source: the storage device or software system that provides information.
- Output target: the storage device or software system that receives information.
- FTP: File Transfer Protocol.
- HTTP: Hypertext Transfer Protocol, the protocol used to transfer web pages.
- File: file system.
1.4 References
This section shall:
- List all documents referenced in the SRS, such as approved plans and task statements, approvals from higher authorities, and contracts;
- List other references, such as other published documents and the major documents of the project. Each document must be identified by title, index number or document number, date of publication, and publishing organization;
- State where each referenced document can be obtained. This information can be provided by reference to an appendix or to other documents.
2 Project Overview
This chapter describes the general factors that affect the product and its requirements. It does not state specific requirements; it only makes those requirements easier to understand.
In today's information age, collecting and re-processing information can create value. Information sharing keeps broadening our horizons, but in this vast sea of information we often lack a tool for discovering what we care about and organizing it effectively.
Openwebant was created for this purpose. The ambition is large but one person's strength is limited, so we need to bring many talented people together to build openwebant. The more people take part, the more collective intelligence there is, the more powerful openwebant becomes, and the more it can benefit everyone.
I am also eager to get to know more people of insight through openwebant and to explore a broader world together.
2.1 Product description
This section describes the product in relation to other related products or projects.
- If the product is independent and entirely self-contained, this should be stated here;
- If the product defined by the SRS is a component of a larger system or project, this section should:
  - Summarize the functions of each component of the larger system or project and describe their interfaces;
  - Identify the main external interfaces of the software product. The interfaces do not need to be described in detail here; the detailed description belongs in other sections of the SRS;
  - Describe the computer hardware and peripheral devices used. This is only an overview.
In this section, a block diagram is a helpful way to show the main components, interconnections, and external interfaces of a larger system or project.
This section should neither prescribe a design solution nor state design constraints on the solution. Instead, it should give the reasons for the design constraints described later in the specific requirements chapter.
Openwebant is divided into three subsystems: the input subsystem, the filter engine, and the output subsystem, as follows:
The collection engine reads data through the input subsystem, extracts results according to the filtering rules, and publishes the data to the storage device or receiving system through the output subsystem.
Input subsystem:
The input subsystem manages the various input interfaces and provides extension interfaces for input, such as FTP interfaces, HTTP interfaces, and file interfaces. It also lets other modules look up and reference input interfaces.
Filter engine:
The filter engine applies the filtering rules to the input data and submits the matching results either to the next filter or to the output subsystem, which publishes them to the storage device or receiving system.
Output subsystem:
The output subsystem manages the various output interfaces and provides extension interfaces for output, such as DB interfaces, external system input interfaces, and file interfaces. It also lets other modules look up and reference output interfaces.
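The lookup and reference role of the input and output subsystems can be illustrated with a small registry. This is a minimal sketch under assumptions: the class name `DriverRegistry` and its methods are hypothetical, and drivers are keyed by a protocol name such as "http" or "file", which the source does not specify.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry that a driver management module might keep. */
final class DriverRegistry<D> {
    private final Map<String, D> drivers = new ConcurrentHashMap<>();

    /** Registers a driver under a protocol name; returns false if the name is already taken. */
    boolean register(String protocol, D driver) {
        return drivers.putIfAbsent(key(protocol), driver) == null;
    }

    /** Looks a driver up by protocol name so that other modules can reference it. */
    Optional<D> find(String protocol) {
        return Optional.ofNullable(drivers.get(key(protocol)));
    }

    /** Resolves the driver for an address such as "ftp://host/file" from its scheme. */
    Optional<D> forAddress(String address) {
        int i = address.indexOf("://");
        return i < 0 ? Optional.empty() : find(address.substring(0, i));
    }

    // Protocol names are compared case-insensitively.
    private static String key(String protocol) {
        return protocol.toLowerCase(Locale.ROOT);
    }
}
```

One registry instance could hold input drivers and another output drivers, so that extending the system amounts to registering one more driver.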
System Structure:
The input subsystem reads data from an external system or device through the corresponding communication protocol and submits the data to the filter engine. The filter engine builds a filtering pipeline based on the filtering rules and submits the data to the filters. Each filter determines whether a piece of input data is a value or an external system address; if it is an address, the filter reads the data through the input subsystem. Finally, the filter result is submitted to the output subsystem and published to the storage device or receiving system.
A filtering pipeline is a channel through which information flows. The filter engine organizes the filters into a tree according to their input relationships, as follows:
The figure shows five pipelines:
1 = 1-1 = 1-1-1;
1 = 1-1 = 1-1-2;
1 = 1-2 = 1-2-1;
1 = 1-2 = 1-2-2;
1 = 1-2 = 1-2-3
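The relationship between the filter tree and the pipelines listed above can be sketched as follows: each pipeline corresponds to one root-to-leaf path. The class `FilterNode` and the helper `figureTree()` are hypothetical, and the tree shape is reconstructed from the five listed pipelines because the figure itself is not available.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter node; the children consume this filter's output. */
final class FilterNode {
    final String id;
    final List<FilterNode> children = new ArrayList<>();

    FilterNode(String id) { this.id = id; }

    FilterNode add(FilterNode child) {
        children.add(child);
        return this;
    }

    /** Returns every root-to-leaf path; each path is one pipeline. */
    List<String> pipelines() {
        List<String> out = new ArrayList<>();
        collect(this, "", out);
        return out;
    }

    private static void collect(FilterNode node, String prefix, List<String> out) {
        String path = prefix.isEmpty() ? node.id : prefix + " = " + node.id;
        if (node.children.isEmpty()) {
            out.add(path); // a leaf closes one pipeline
            return;
        }
        for (FilterNode child : node.children) {
            collect(child, path, out);
        }
    }

    /** Tree shape assumed from the five pipelines listed in the text. */
    static FilterNode figureTree() {
        return new FilterNode("1")
            .add(new FilterNode("1-1")
                .add(new FilterNode("1-1-1"))
                .add(new FilterNode("1-1-2")))
            .add(new FilterNode("1-2")
                .add(new FilterNode("1-2-1"))
                .add(new FilterNode("1-2-2"))
                .add(new FilterNode("1-2-3")));
    }
}
```

Calling `FilterNode.figureTree().pipelines()` yields five paths, from `1 = 1-1 = 1-1-1` through `1 = 1-2 = 1-2-3`, in the same order as the list above.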
Collection rules are defined in an agreed syntax. The description language can be XML, and the rule logic can be expressed with regular expressions, a custom scripting language, or a combination of the two.
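As an illustration of an XML rule whose logic is a regular expression, the sketch below parses a small rule document and applies its pattern to some text. The element and attribute names (`rule`, `filter`, `input`, `id`, `pattern`) and the class `RuleSketch` are invented for this example; the actual openwebant rule syntax is not given here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class RuleSketch {
    /** Hypothetical rule document; the pattern attribute decodes to <h1>(.*?)</h1>. */
    static final String RULE_XML =
        "<rule input=\"http://example.com/list\">"
      + "<filter id=\"title\" pattern=\"&lt;h1&gt;(.*?)&lt;/h1&gt;\"/>"
      + "</rule>";

    /** Applies every filter pattern in the rule to the text and returns group 1 of each match. */
    static List<String> apply(String ruleXml, String text) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(ruleXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("invalid rule document", e);
        }
        NodeList filters = doc.getDocumentElement().getElementsByTagName("filter");
        List<String> results = new ArrayList<>();
        for (int i = 0; i < filters.getLength(); i++) {
            Element filter = (Element) filters.item(i);
            Pattern pattern = Pattern.compile(filter.getAttribute("pattern"), Pattern.DOTALL);
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                results.add(m.group(1));
            }
        }
        return results;
    }
}
```

Applied to the text `<h1>A</h1><p>x</p><h1>B</h1>`, `RuleSketch.apply(RuleSketch.RULE_XML, ...)` returns the two captured titles `A` and `B`.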
Openwebant needs to check the validity of the collection logic: it must check whether inputs and outputs form a loop, and whether any output result is left unpublished.
Cases:
1) Input and output form a loop (figure: Outerr1.gif)
2) An output result is not published (figure: Outerr2.gif)
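The two checks can be sketched as graph checks over the filters. This is one plausible reading, not the project's implementation: filters are nodes, an edge means "output of one filter is input of another", a loop is a cycle found by depth-first search, and an unpublished result is a filter that has no consumer and no output driver. The class `RuleValidator` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class RuleValidator {
    // filter id -> ids of the filters that consume its output
    private final Map<String, List<String>> consumers = new TreeMap<>();
    // ids of the filters whose result is sent to an output driver
    private final Set<String> published = new HashSet<>();

    void addFilter(String id) {
        consumers.computeIfAbsent(id, k -> new ArrayList<>());
    }

    void connect(String from, String to) {
        addFilter(from);
        addFilter(to);
        consumers.get(from).add(to);
    }

    void publish(String id) {
        addFilter(id);
        published.add(id);
    }

    /** Case 1: true if some filter's output eventually feeds back into itself. */
    boolean hasLoop() {
        Set<String> done = new HashSet<>();
        Set<String> onPath = new HashSet<>();
        for (String id : consumers.keySet()) {
            if (visit(id, done, onPath)) return true;
        }
        return false;
    }

    private boolean visit(String id, Set<String> done, Set<String> onPath) {
        if (onPath.contains(id)) return true; // back edge: loop found
        if (!done.add(id)) return false;      // already explored, no loop through it
        onPath.add(id);
        for (String next : consumers.get(id)) {
            if (visit(next, done, onPath)) return true;
        }
        onPath.remove(id);
        return false;
    }

    /** Case 2: filters whose result is neither consumed nor published, in id order. */
    List<String> unpublishedResults() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : consumers.entrySet()) {
            if (e.getValue().isEmpty() && !published.contains(e.getKey())) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```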
The collection memory stores the filter results and serves as the input for the next filter.
During the collection process, values are associated in the following forms:
A) Parent-Child Association
Filter A has multiple matching results A1, A2, and A3. These values are used as the input sources for a second round of matching by filter B, which generates the matching results (A1B1, A1B2), (A2B1, A2B2), and (A3B1, A3B2). When the system publishes, it must preserve the associations [A1, (A1B1, A1B2)], [A2, (A2B1, A2B2)], and [A3, (A3B1, A3B2)].
Figure:
B) Sibling association
Data A, B, and C are filtered from the data source. The subscript on a letter distinguishes the multiple values matched by one filter. B and C can only be associated in the form [B1-1, C1-1], [B1-2, C1-2], [B2-1, C2-1], [B2-2, C2-2]. We call this sequential association.
Figure:
The collection memory must be able to represent the above data relationships in its storage data structure, and must support indexing and upward lookup by filter subscript.
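One possible data structure for the collection memory is a tree of result nodes, sketched below. The class `ResultNode` and its methods are hypothetical: parent-child association is kept through a parent reference, upward lookup walks that reference, and sequential association pairs the i-th match of one sibling filter with the i-th match of another.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node of the collection memory: one matched value plus its origin. */
final class ResultNode {
    final String filterId;   // filter that produced the value
    final int index;         // subscript of the value among that filter's matches under this parent
    final String value;
    final ResultNode parent; // null for a top-level match
    final List<ResultNode> children = new ArrayList<>();

    ResultNode(String filterId, int index, String value, ResultNode parent) {
        this.filterId = filterId;
        this.index = index;
        this.value = value;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this); // parent-child association
        }
    }

    /** Children produced by one filter, in match order. */
    List<ResultNode> childrenOf(String childFilterId) {
        List<ResultNode> out = new ArrayList<>();
        for (ResultNode c : children) {
            if (c.filterId.equals(childFilterId)) out.add(c);
        }
        return out;
    }

    /** Searches upward for the nearest ancestor produced by the given filter, or null. */
    ResultNode ancestor(String ancestorFilterId) {
        for (ResultNode n = parent; n != null; n = n.parent) {
            if (n.filterId.equals(ancestorFilterId)) return n;
        }
        return null;
    }

    /** Sequential association: pairs the i-th match of one sibling filter with the i-th of another. */
    List<String[]> pairSiblings(String left, String right) {
        List<ResultNode> l = childrenOf(left);
        List<ResultNode> r = childrenOf(right);
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < Math.min(l.size(), r.size()); i++) {
            pairs.add(new String[] { l.get(i).value, r.get(i).value });
        }
        return pairs;
    }
}
```

For a node A1 with children B1-1, B1-2, C1-1, and C1-2, `pairSiblings("B", "C")` produces the pairs [B1-1, C1-1] and [B1-2, C1-2], which matches the sequential association described above.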
2.2 Product features
This section provides a summary of the functions the software will perform. For example, for an accounting program, the SRS can use this section to describe customer account maintenance, customer financial statements, and invoice production, without going into the many details each function requires. Sometimes, if a higher-level specification exists, the feature summary can be taken directly from it; that higher-level specification assigns particular functions to the software product. For clarity, note:
- One way to organize the functions is as a list that the customer, or anyone reading the document for the first time, can understand;
- It is also helpful to use a block diagram to show the different functions and the relationships between them. Remember, however, that such a diagram is not a product design requirement; it is only an effective explanatory tool.
This section does not need to state specific requirements; it only provides the reasons behind some of the requirements described in the later requirements chapter of the SRS.
The system uses an XML file to describe its collection behavior. The description covers the input targets, the matching rules, the publishing targets, and any additional parameters of the input and publishing targets. The system provides a user-friendly wizard page for configuring this file.
The system has an open architecture with reserved programming interfaces for input and output. Users can easily collect content from websites, files, and databases and publish it to other types of systems.
The system provides context-aware storage of collected data, and full-text retrieval and re-processing of the collection results can be implemented through the publishing interface.
The system can run multiple information collection tasks at the same time, and each task can use multiple threads. System overhead can be limited by restricting the number of tasks, the number of threads, and the collection interval. The collection range can be restricted by collection depth and by URL identifier.
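The limits on threads, collection interval, depth, and URL range can be illustrated with a single collection task. This is a minimal sketch under assumptions: the class `CollectionTask` is hypothetical, the URL restriction is modeled as a simple prefix, and page fetching is replaced by an injected `fetchLinks` function so that the example runs without network access.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class CollectionTask {
    private final int threads;           // worker threads for this task
    private final int maxDepth;          // 0 collects only the start address
    private final long intervalMillis;   // pause between two submissions
    private final String urlPrefix;      // only addresses with this prefix are followed
    private final Function<String, List<String>> fetchLinks; // stand-in for a real input driver

    CollectionTask(int threads, int maxDepth, long intervalMillis, String urlPrefix,
                   Function<String, List<String>> fetchLinks) {
        this.threads = threads;
        this.maxDepth = maxDepth;
        this.intervalMillis = intervalMillis;
        this.urlPrefix = urlPrefix;
        this.fetchLinks = fetchLinks;
    }

    /** Collects level by level and returns the addresses that were actually collected. */
    List<String> run(String start) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> seen = new HashSet<>();
        seen.add(start);
        List<String> visited = new ArrayList<>();
        List<String> level = new ArrayList<>();
        level.add(start);
        try {
            for (int depth = 0; depth <= maxDepth && !level.isEmpty(); depth++) {
                List<Future<List<String>>> pending = new ArrayList<>();
                for (String url : level) {
                    Callable<List<String>> job = () -> fetchLinks.apply(url);
                    pending.add(pool.submit(job));
                    Thread.sleep(intervalMillis); // collection interval
                }
                visited.addAll(level);
                List<String> next = new ArrayList<>();
                for (Future<List<String>> f : pending) {
                    for (String link : f.get()) {
                        // keep links inside the allowed range that were not seen before
                        if (link.startsWith(urlPrefix) && seen.add(link)) next.add(link);
                    }
                }
                level = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("collection interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("collection failed", e);
        } finally {
            pool.shutdown();
        }
        return visited;
    }
}
```

Limiting the number of concurrent tasks is not shown; it could be handled in the same way by running the tasks themselves on a bounded pool.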
The system automatically associates collection results from different depths and ranges, and automatically replaces remote addresses with local addresses. It can also automatically filter format control characters out of the data.
The system supports user login, resumable collection, conditional publishing, duplicate filtering, template modification, content replacement, and other functions. To some extent, it can identify special target addresses and collect data page by page.
2.3 User characteristics
This section describes the general characteristics of the product's end users that affect the specific requirements.
Many people interact with the system during the operation and maintenance phases of the software life cycle, including users, operators, maintenance personnel, and system personnel. Characteristics of these people, such as educational level, experience, and technical expertise, are important constraints on the system's operating environment.
If most users of the system are occasional users, the system must include prompts on how to complete basic functions, rather than assuming that users have learned these details in earlier sessions or by reading the user's guide.
This section should not be used to state specific requirements or to impose particular design constraints. It should give the reasons for certain specific requirements or design constraints described in the specific requirements chapter of the SRS.
The main end users of the system are website administrators. Collection may fail on websites that hide their target addresses. In particular, some websites scramble their page content, which makes it harder to locate keywords accurately. To address these problems, the system provides examples that users can imitate and matching wildcards that allow irregular parts of a web page to be ignored. The official website also provides forums where users can share and learn from each other's experience.
The system reserves programming interfaces for input and output. Some users can extend these interfaces so that the system can be used in more scenarios. For these users, the system provides detailed interface descriptions and sample extension code. We also provide a development kit and describe the usage and function of every class and method.
Another type of user falls into the merchant category. These users work only on writing rules, and they exchange or sell their rules online. They care more about searching websites and about content quality. They fall into two types. The first are hunters, who can discover many kinds of information and meet the content needs of various website administrators; they earn their money through volume, although for rare content the price can be as high as gold. The second are hackers, who are proficient in Web technology, quick-witted, and persistent; what they hold are top-quality items that are hard to find, and the price is correspondingly high.
Since openwebant is open-source software, many people will analyze and reuse its components in order to extend and improve it. They use not only the system interfaces but also the internal code, so they need to refer to the various technical documents of openwebant. Therefore, in addition to the user manual, the other development documents are also essential.
Collaboration relationship:
To be completed.
The Java51 website (www.java51.com) follows the progress of the openwebant project, provides an online platform for collaborative development and communication, shares and compiles the technical documents of openwebant with project members, and hosts discussion of Java-related technical research. You are welcome to join us!