Periodic web page capture scheduling file

If the DataScraper software is to be used for periodic web page capture and information extraction, you need to configure a periodic web page capture scheduling file for it. This is an XML file stored in DataScraper's home directory ($HOME); the file name is crontab.xml. When DataScraper starts, it parses the periodic scheduling parameters if this file is found. If the auto parameter is set, multiple DataScraper windows are started automatically, each corresponding to an auto-type periodic web page capture session; in addition, every periodic web page capture session, whether of auto type or not, can be started manually.

You can write periodic web page capture scheduling files as needed. The following describes the structure and essentials of the file.

    <?xml version="1.0" encoding="UTF-8"?>
    <crontab>
        <thread name="project_low">
            <parameter>
                <auto>true</auto>
                <start>10</start>
                <period>10800</period>
                <waitonload>false</waitonload>
                <minidle>2</minidle>
                <maxidle>10</maxidle>
            </parameter>
            <step name="renewclue">
                <theme>project_list_design.www.sxsoft.com</theme>
            </step>
            <step name="crawl">
                <theme>project_list_design.www.sxsoft.com</theme>
                <loadtimeout>3600</loadtimeout>
                <lazycycle>3</lazycycle>
                <updateclue>false</updateclue>
                <dupratio>80</dupratio>
                <depth>-1</depth>
                <width>-1</width>
                <renew>false</renew>
                <period>0</period>
                <scroll1_wratio>2</scroll1_wratio>
                <scrollmorepages>10</scrollmorepages>
                <allowplugin>false</allowplugin>
                <allowimage>false</allowimage>
                <allowjavascript>false</allowjavascript>
            </step>
            <step name="crawl">
                <theme>project_design.www.sxsoft.com</theme>
                <updateclue>false</updateclue>
                <dupratio>80</dupratio>
                <depth>-1</depth>
                <width>-1</width>
                <renew>false</renew>
                <period>0</period>
                <resumepageload>true</resumepageload>
                <resumemaxcount>3</resumemaxcount>
            </step>
            <step name="uploadresult">
                <theme>project_design.www.sx.com</theme>
                <slicesearchloc>http://www.metaseeker.cn/projectsearch/</slicesearchloc>
                <account>Username</account>
                <password>Thepassword</password>
            </step>
            <step name="indexharvest">
                <theme>project_design.www.sxsoft.com</theme>
                <slicesearchloc>http://www.metaseeker.cn/projectsearch/</slicesearchloc>
                <account>Username</account>
                <password>Thepassword</password>
            </step>
        </thread>
    </crontab>

Where:

    • The crontab tag encloses the entire XML document body; an XML file can contain only one crontab element.
    • A block enclosed by a thread tag represents one periodic web page capture session. An XML document can contain multiple thread blocks, each corresponding to an independent DataScraper window. Each session should have a name, given by the name attribute of thread; when periodic web page capture is started manually from DataScraper, this name specifies which session to start.

The body of a periodic web page capture session is divided into two parts: the parameter section comes first, followed by multiple information extraction steps, each represented by a step block. There is only one parameter section per session, and it contains the parameters listed below.
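For orientation, the sketch below (modeled on the full example above, with element content elided) shows just this layout: one thread session containing a single parameter block followed by several step blocks.

    <crontab>
        <thread name="project_low">
            <parameter>
                <!-- scheduling parameters for this session -->
            </parameter>
            <step name="renewclue">
                <!-- parameters of the first step -->
            </step>
            <step name="crawl">
                <!-- parameters of the second step -->
            </step>
            <!-- further step blocks as needed -->
        </thread>
        <!-- further thread blocks, one per session, may follow -->
    </crontab>

An annotated example of a complete parameter block appears after the following list.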

    • auto: true or false, indicating whether the periodic web page capture session is started automatically.
    • start: a number in seconds, indicating the delay before the periodic web page capture session starts. To make effective use of CPU capacity, the delays of different sessions should differ, so that they do not all start at the same time and cause congestion.
    • period: a number in seconds, indicating the interval before the next scheduling round. Note: this is not a precise scheduling cycle but the time to pause after a batch of web pages has been captured; because the time spent on the previous batch is hard to predict (for example, due to network conditions), the actual length of a cycle cannot be determined in advance.
    • waitonload: true or false, indicating whether to extract content only after the target page has loaded completely. false means not to wait: extraction starts as soon as the content to be extracted has been loaded, which can noticeably improve information extraction performance. However, some page content is generated dynamically by JavaScript, and in such cases it may be safer to wait until the page has loaded completely. If you are unsure which value to choose, set it to false first and switch to true if any content is missing. For non-periodic information extraction the behavior is always equivalent to true.
    • minidle: an optional parameter, a positive integer in seconds, indicating the minimum time to wait after a web page has been captured. Together with the next parameter, it lets DataScraper wait a random interval between the two values, so as not to put excessive traffic pressure on the target website.
    • maxidle: also an optional parameter, a positive integer in seconds; it should be greater than minidle.
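Putting these together, the parameter block from the example above can be annotated as follows; the values are only illustrative.

    <parameter>
        <auto>true</auto>              <!-- start this session automatically -->
        <start>10</start>              <!-- wait 10 seconds before the first run -->
        <period>10800</period>         <!-- pause 3 hours after each batch finishes -->
        <waitonload>false</waitonload> <!-- do not wait for pages to load completely -->
        <minidle>2</minidle>           <!-- pause at least 2 seconds between pages... -->
        <maxidle>10</maxidle>          <!-- ...and at most 10 seconds, chosen at random -->
    </parameter>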

Note: if the configuration file contains Chinese characters, the first line of the file must not be omitted, particularly when the file is exchanged across platforms; otherwise the Chinese content cannot be parsed correctly.
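That first line is the XML declaration shown at the top of the example, which specifies the character encoding:

    <?xml version="1.0" encoding="UTF-8"?>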

A periodic web page capture session can contain many steps. The V4 online version has four predefined step types, and the Enterprise Edition can be extended with custom step types according to customer needs. The four steps are as follows:

  • renewclue: resets the state of all clues of a topic to start, so that DataScraper will capture the web pages those clues point to again. In theory the clues of any topic can be reset to start, but in practice, resetting clues that should not be re-crawled only wastes information extraction time. For example, on an online auction website the product information pages are extracted and each clue is set to extracted afterwards; once the auction ends there is no need to extract that product page again, so resetting it to start would only delay the extraction of new product information. Instead, this step should be used to reset the clues pointing to the product list pages. In another situation, product pages stay valid for a long time, for example on an online shopping website whose prices change over time; for a price comparison service such pages need to be extracted periodically, and in principle this step could be used to reset those clues, but the server-side scheduled periodic information extraction method described below should be used instead. Moreover, the number of clues belonging to one topic may be large, and the server limits the number of clues reset at a time to 10,000, so for such topics it is clearly no longer appropriate to reset the status with the renewclue step. In the example above, the reset topic is project_list_design.www.sxsoft.com, which in fact has only one clue: the page that lists all the latest outsourced projects and serves as the crawler's entry point.
  • crawl: commands the web crawler to capture the web pages of a topic and extract their content, i.e. it runs the information extraction workflow in DataScraper. It is similar to manually starting information extraction for a topic from the topic list on the DataScraper interface, with some differences: this step can determine how many clues to extract without user input through the interface, and its configuration parameters can constrain the depth and breadth of extraction. It has the following parameters (an annotated sketch of a renewclue/crawl pair appears after this list):
    • theme: topic name
    • loadtimeout: the maximum time to wait for the target web page to load, in seconds. If the required content has not been loaded into the embedded browser within this time, capturing of that page is abandoned. Note: a large value is normally used when capturing clues that require turning pages, because it is a pity to abandon a paging session halfway due to a timeout. When capturing single pages of content, there is no need to set this parameter; keep the default of 60 seconds, because a long timeout lets a slow target website drag down overall capture efficiency.
    • lazycycle: valid only for active mode and active extension mode, in seconds. The default value (when this parameter is not configured) is 5 seconds, which is very conservative and slows down capturing; in many cases it can be set to 1 to 4 seconds to increase speed. The choice depends on the download speed and stability of the target website, and the value should be increased if content is loaded asynchronously. Since speed and stability are hard to judge directly, in practice you check whether the captured results meet expectations and adjust accordingly. For detailed usage, see the web page capture modes.
    • updateclue: true or false. While page content is being extracted, clues for other topics may also be extracted; for example, while the current topic is project_list_design.www.sxsoft.com, clues for the topic project_design.www.sxsoft.com are extracted from the outsourced project list page. If the MetaSeeker system finds that a newly extracted clue points to a page that was extracted before, this parameter determines how to treat that clue: true resets the clue's state to start, false leaves its original state unchanged.
    • dupratio: a percentage numerator, an integer from 0 to 100, used when extracting paged content. For example, the outsourced project list page shows many projects spread over multiple pages, with the newest projects on the first pages; once page turning reaches content that was extracted before, the information extraction process should stop. This parameter states what proportion of the clues on a page must be found to be duplicates before page turning stops.
    • depth: a positive integer or -1, indicating the maximum number of pages to turn; -1 means page turning is not limited by this parameter. It is used together with dupratio. Note: the meaning of depth here is not the same as for a common web crawler.
    • width: a positive integer or -1, indicating the maximum number of clues of this topic to capture in one round of the periodic web page capture session; -1 means no limit, that is, every clue in the start state is extracted in this round. Note: the meaning of width here is not the same as for a common web crawler.
    • renew: true or false, indicating whether to enable server-side scheduled periodic information extraction. Note: this parameter is ignored by the online MetaSeeker.
    • period: a positive integer, indicating the cycle of server-side scheduled periodic information extraction. Note: this parameter is ignored by the online MetaSeeker.
    • resumepageload: (added in v4.10.0) true or false. If many DataScraper threads capture web pages at the same time, downloading a target page may time out because of momentary network congestion. This parameter indicates whether such a page should be downloaded again; combined with the resumemaxcount parameter, it can effectively improve web page capture reliability. Note: not every page can be downloaded again. For example, an AJAX website that refreshes page content asynchronously by sending HTTP messages cannot be effectively re-downloaded or rolled back. Another example is a page downloaded with an HTTP POST message: it cannot be effectively re-downloaded because Firefox pops up an alert box during the re-download, DataScraper simply closes that box, and the POST message is never resent.
    • resumemaxcount: (added in v4.10.0) a positive integer, indicating the maximum number of download attempts.
    • scroll1_wratio: (added in v4.11.1) scrolling latency, an integer greater than zero; with a value of n, the delay after each screen scroll is 1/n second. See the next parameter, scrollmorepages.
    • scrollmorepages: (added in v4.11.1) the number of additional screen scrolls. The default value is 0, meaning no scrolling. Scrolling is used to capture data on AJAX web pages, where data is not downloaded from the server until the user scrolls the browser window and brings it into view. For more information, see how to capture AJAX website data through automatic scrolling.
    • allowplugin: (Enterprise Edition only) whether to allow loading content that requires a special plug-in to interpret, such as video and Flash, true or false. For more information, see how to block plug-ins and images when capturing web pages.
    • allowimage: (Enterprise Edition only) whether to allow image loading, true or false. For more information, see how to block plug-ins and images when capturing web pages.
    • allowjavascript: (Enterprise Edition only) whether to allow loading and interpreting JavaScript, true or false. For more information, see how to block JavaScript when capturing a web page.
  • uploadresult: compresses the information extraction result files (XML files) into a zip package and uploads it to the SliceSearch server. Note: this step should only be used when a SliceSearch server is deployed; otherwise the periodic information extraction session will be interrupted. Its parameters are:
    • theme: topic name
    • slicesearchloc: the complete URL address of the SliceSearch server, ending with a slash.
    • account: SliceSearch administrator account
    • password: SliceSearch administrator password
  • indexharvest: asks the SliceSearch server to build an index over the information extraction results of the specified topic. After this command is issued, indexing does not necessarily start immediately; the indexing request is queued. Note: this step should only be used when a SliceSearch server is deployed; otherwise the periodic information extraction session will be interrupted. Its parameters are:
    • theme: topic name
    • slicesearchloc: the complete URL address of the SliceSearch server, ending with a slash.
    • account: SliceSearch administrator account
    • password: SliceSearch administrator password
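As a sketch of how these steps fit together, the fragment below pairs a renewclue step with a crawl step for the same topic, following the full example at the top of this page; the values and comments are only illustrative.

    <step name="renewclue">
        <theme>project_list_design.www.sxsoft.com</theme>  <!-- reset this topic's clues to start -->
    </step>
    <step name="crawl">
        <theme>project_list_design.www.sxsoft.com</theme>
        <loadtimeout>3600</loadtimeout>    <!-- generous timeout because this topic turns pages -->
        <lazycycle>3</lazycycle>           <!-- wait 3 seconds for lazily loaded content -->
        <updateclue>false</updateclue>     <!-- leave previously extracted clues unchanged -->
        <dupratio>80</dupratio>            <!-- stop turning pages when 80% of the clues on a page are duplicates -->
        <depth>-1</depth>                  <!-- no limit on the number of pages turned -->
        <width>-1</width>                  <!-- extract every clue currently in the start state -->
        <renew>false</renew>               <!-- no server-side scheduled periodic extraction -->
        <period>0</period>
    </step>

The uploadresult and indexharvest steps take only theme, slicesearchloc, account, and password, exactly as shown in the full example above.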
