Website collector system design (implemented)

1. Function description

    1. Keyword extraction: keywords can be extracted from the keyword library and from forums, or entered manually and added to the keyword library.
    2. URL acquisition: the system can call the Google API to retrieve a URL list from the Internet based on keywords; URLs can also be added manually or selected from the URL library. The current list can be saved to the URL library, and duplicate URLs are filtered out when saving.
    3. Collection task settings: the system collects automatically according to the collection task. The settings include the URLs to collect, the filtering rules, and logon information.
    4. Automatic task settings: the system starts at a scheduled time. The settings include the start time and the collection task to run.
    5. The system can automatically generate a collection task from the Google API search results.
    6. Manual cropping: unwanted content can be deleted by hand.
    7. Duplicate filtering: the system searches the current task set for the cropped content and discards identical results (a hash-based sketch follows section 2 below).
    8. Cascade extraction: the system extracts links from the cropped content, including images, nearby pages, Flash pages, and associated web pages, and generates temporary sub-tasks for them.
    9. Redirection recognition: some websites redirect to another site after a click; the system automatically replaces the target address in the task.
    10. Simulated logon: for websites that require logon to view information, the system can simulate a user logon.
    11. Field extraction: the system can automatically extract the title, body, author, source, keywords, and body pages according to rules.
    12. Captured content can be modified manually.
    13. Manual association: captured content can be associated with related content by hand. The related content can come from sub-tasks, or be found by searching the current task and the release database.
    14. Manual save: captured content is saved to the database by hand; the system automatically checks the integrity of the associations.
    15. Capture rule description file (RDF.xml): the crawling rules and the website structure are described in XML. The system provides a label language for elements such as the title, author, source, page number, and hyperlink, and automatically extracts and publishes content according to the description file and the webpage structure.

2. Target: the ultimate goal is automatic topic search, fully automatic crawling, and fully automatic publishing. The system will no longer rely on the Google API but will use its own engine to search for topic content, and will provide a content subscription service.
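Items 2 and 7 both discard duplicates, and section 4 below says the collector keeps hash codes in its own database for uniqueness checks. The document shows no code for this, so here is a minimal sketch in Java; the class and method names are hypothetical, and a HashSet stands in for the collector's database.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the duplicate filter in items 2 and 7: each cropped
// result is hashed, and a result whose hash was already seen in this task set
// is discarded. The real system stores the hashes in its own database
// (section 4); a HashSet stands in for that store here.
public class DuplicateFilter {
    private final Set<String> seenHashes = new HashSet<>();

    /** Returns true if the content is new and was accepted, false if it is a duplicate. */
    public boolean accept(String content) {
        String hash = hashOf(content);
        return seenHashes.add(hash); // add() returns false when the hash was already present
    }

    private static String hashOf(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}

The same check would apply when saving a URL list to the URL library in item 2, with the URL string as the hashed content.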
3. Overall strategy
    • Task rules are defined in XML format. A task is an independent unit: the filters in a task can share data with each other, and the output of one filter can be used as the input of the next.
    • The system automatically adjusts the execution order of the filters based on their input dependencies (see the sketch after this list).
    • Database operation descriptions are provided so that captured results are published to the database automatically.
    • The input source is generally a URL; the system automatically fetches the corresponding page for that URL as the task input.
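How the execution order could be derived from the input dependencies is sketched below, following the tree layout described in section 5 (A feeds B and D, B feeds C). TreeConstruct is named in the document; the field and method shapes here are assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TreeConstruct: each filter names its input in <in>,
// either a URL (making it a root) or another filter's id (making it a child
// of that filter). For the example in section 5 -- A feeds B and D, B feeds C --
// this builds A with children {B, D} and B with child {C}: one branch runs
// sequentially, siblings run in parallel.
class FilterNode {
    final String id;
    final List<FilterNode> children = new ArrayList<>();
    FilterNode(String id) { this.id = id; }
}

public class TreeConstruct {
    /** inputs maps filter id -> the filter id it reads from, or null when the input is a URL. */
    public static List<FilterNode> build(Map<String, String> inputs) {
        Map<String, FilterNode> nodes = new HashMap<>();
        inputs.keySet().forEach(id -> nodes.put(id, new FilterNode(id)));
        List<FilterNode> roots = new ArrayList<>();
        inputs.forEach((id, parent) -> {
            if (parent == null) {
                roots.add(nodes.get(id));               // input is a URL: top of a tree
            } else {
                nodes.get(parent).children.add(nodes.get(id)); // input is a filter: become its child
            }
        });
        return roots;
    }
}

FilterRunner (section 5) then walks these trees: nodes along one branch run in sequence, and when a node writes its output, all of its children are started in parallel.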
4. Module structure

The arrows in the figure indicate the data flow between modules. Tasks are executed concurrently in multiple threads. The filter tree is built from the input relationships of the filters, and filters on the same level of the tree are executed concurrently in multiple threads. The website collector saves hash codes to its own database for later uniqueness checks.

5. Use cases

Configuration file parsing: at system startup, ConfigParser parses the RDF.xml configuration file. The results are stored in TaskEntity, FilterEntity, and DatabaseEntity.
TaskEntity labels: <task> <ID></ID> <atuotime></atuotime> ... </task>
FilterEntity labels: <filter> <in></in> <Keys> <key> ... </key> </Keys> </filter>
DatabaseEntity labels: <database> ... <SQL></SQL> ... </database>
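A minimal sketch of what ConfigParser might do at startup, using the JDK DOM API; the entity fields are simplified, and the lookup is naive (it takes the first matching descendant and assumes consistent tag case, which the examples in this document do not always have).

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Minimal sketch of ConfigParser: reads RDF.xml with the JDK DOM API and
// fills simplified TaskEntity/FilterEntity records. Field names are assumed.
public class ConfigParser {
    public static TaskEntity parse(File rdfXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(rdfXml);
        Element task = doc.getDocumentElement();      // the <task> root node
        TaskEntity entity = new TaskEntity();
        entity.id = text(task, "ID");
        entity.autoTime = text(task, "atuotime");     // tag name as spelled in RDF.xml
        NodeList filters = task.getElementsByTagName("filter");
        for (int i = 0; i < filters.getLength(); i++) {
            Element f = (Element) filters.item(i);
            FilterEntity fe = new FilterEntity();
            fe.id = text(f, "ID");
            fe.in = text(f, "in");
            entity.filters.add(fe);
        }
        return entity;
    }

    // Naive lookup: returns the text of the first matching descendant element.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent().trim() : null;
    }
}

class TaskEntity { String id; String autoTime; List<FilterEntity> filters = new ArrayList<>(); }
class FilterEntity { String id; String in; }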
Filter engine: FilterRunner schedules the filters in multiple threads. When a FilterThread writes its output to the cache, FilterRunner starts that filter's children. FilterRunner acts as the facade of TaskBoundary and exposes OutBufferEntity. The filters are defined in RDF.xml; TreeConstruct arranges them into FilterTreeEntity as a tree based on the input relationships between the filters. For example, take filters A, B, C, and D, where A is the input of B, B is the input of C, and A is the input of D. Their tree is A with children B and D, and B with child C. The structure means: the parent is its children's input, and siblings are independent of each other. From the thread perspective, nodes along one branch are sequential and nodes on the same level are parallel. FilterRunner is therefore responsible for starting filters: when A outputs, it starts B and D; when B outputs, it starts C.

Task engine: TaskRunner starts a task in multiple threads. TaskThread acts as the facade of OutBufferEntity. When FilterBoundary writes into OutBufferEntity, TaskThread checks whether the filter outputs required by DatabaseTask are complete; if so, DatabaseTask is started.

7. Typical sequence

When the system starts for the first time, it parses the configuration file, constructs the filter tree, and starts the task controller. The task controller starts the filter controllers of the task-processing threads based on the existing tasks. A filter's output triggers the filter controller to start the filter at the next node, and triggers the task controller to check whether the values required for the database operation are complete; if they are, the database task is started.

8. Application interface

The system adopts an open structure and uses the facade and overall-controller design patterns to expose or conceal the corresponding interfaces, which makes the system easy to extend without losing security or robustness. The system provides the configuration file (RDF.xml) and two application interfaces:
    • FilterThread: filter implementation
    • DatabaseTask: database operation implementation
Users can extend these two interfaces to implement their own business logic and declare the extensions in RDF.xml; the controllers schedule and manage them uniformly. A hedged sketch of such an extension follows.
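The document names FilterThread and DatabaseTask as the extension points but does not show their signatures, so the shapes below are assumptions; a user-defined filter would plug into the scheduling described above.

// The document names FilterThread as an extension point but does not show
// its signature, so this shape is assumed. A custom filter receives its
// <in> value, runs its business logic, and hands the result back so the
// controllers can schedule it like any built-in filter.
public abstract class FilterThread implements Runnable {
    protected String input;          // value of <in>: a URL or another filter's key value
    public void setInput(String input) { this.input = input; }

    @Override
    public void run() {
        String output = process(input);
        publish(output);             // hands the result to the output cache
    }

    /** Business logic supplied by the user-defined subclass. */
    protected abstract String process(String input);

    protected void publish(String output) {
        // In the real system this writes to OutBufferEntity and triggers
        // FilterRunner to start the child filters.
    }
}

// Example user extension: strips HTML tags from the page handed to it.
class PlainTextFilter extends FilterThread {
    @Override
    protected String process(String input) {
        return input == null ? "" : input.replaceAll("<[^>]+>", " ").trim();
    }
}

A DatabaseTask extension would presumably look similar: the controller hands it the completed filter outputs, and its implementation executes the <SQL> statements declared in RDF.xml.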
9. RDF design

<task> <ID></ID> <atuotime></atuotime> ... </task>
The root node of a task. ID is the unique ID of the task; atuotime is the start time of the task (a value of -1 means the task is not started automatically).

<page> <count></count> <Start></Start> <End></End> </page>
Page-number definition: count is the number of pages to crawl, Start marks the start page, and End marks the end page.

<Filters> <filter> <ID></ID> <time></time> <in></in> <Keys> <key> <ID></ID> ... </key> </Keys> </filter> </Filters>
A filter. ID is the unique ID of the filter; time is how many times the filter runs. All keywords are defined inside the filter, and multiple keywords in one filter are extracted from the same context. in is the input of the filter: either the keyword value of another filter or a URL.

<database> <URL></URL> <driver></driver> <username></username> <userpassword></userpassword> ... </database>
Database connection definition for publishing the results.

RDF.xml elements (flattened from the schema diagram): Task, Page, Filters, Database, ID, Sqls, Count, Start, End, Time, In, SQL, Keys, URL, Driver, Username, Userpassword, Key, Filter, Atuotime, Hashcode.

Example of RDF.xml:

<task>
    <ID>t1</ID>
    <atuotime>10</atuotime>
    <Filters>
        <filter>
            <ID>F1</ID>
            <in>http://www.sina.com.cn</in>
            <page>
                <count>10</count>
                <Start>akdfa;fakf;A</Start>
                <End>22;kjk;ja</End>
            </page>
            <Keys>
                <key>
                    <ID>URL</ID>
                    <Start>dfasf</Start>
                    <End>adkfkajf</End>
                </key>
                ...
            </Keys>
        </filter>
        <filter>
            ...
            <in>f1.url</in>
            <Keys>
                <key>
                    <ID>content</ID>
                    <Start>dfasf</Start>
                    <End>adkfkajf</End>
                </key>
            </Keys>
        </filter>
    </Filters>
</task>
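The <Start> and <End> values inside a <key> are marker strings, but the document does not spell out the matching rule. One plausible reading, sketched here, is that the extracted value is the text between the first occurrence of the start marker and the next occurrence of the end marker.

// Hedged sketch of applying a <key> with <Start>/<End> markers: the value is
// the text between the first occurrence of the start marker and the next
// occurrence of the end marker. This is one plausible reading of the rule
// format; the document does not define the matching rule precisely.
public class KeyExtractor {
    /** Returns the text between start and end, or null if either marker is missing. */
    public static String extract(String page, String start, String end) {
        int from = page.indexOf(start);
        if (from < 0) return null;
        from += start.length();
        int to = page.indexOf(end, from);
        if (to < 0) return null;
        return page.substring(from, to);
    }

    public static void main(String[] args) {
        String page = "<h1>Title here</h1><p>body</p>";
        // e.g. a key with <Start><h1></Start> and <End></h1></End>
        System.out.println(extract(page, "<h1>", "</h1>")); // prints: Title here
    }
}

Under this reading, filter F1 above would cut its "URL" key value out of the fetched page, and the second filter would then receive that value through its <in>f1.url</in> input.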
10. Special note: the system does not yet obtain website seeds through a search engine; this will be implemented in a future version.

Source: Visual mining website collector, http://www.caijiqi.net
