A preemptive multithreaded Web spider

The Win32 API supports preemptive multithreading, which is very useful for building an MFC network spider. The Spider project is a sample program that shows how to use preemptive multithreading to gather information on the Internet with a Web spider/robot.

The project produces a program that acts as a spider, checking a Web site for broken URL links. Link verification is performed only on links specified by href attributes. The program displays a continuously updated URL list in a CListView to reflect the status of each hyperlink. The project can serve as a template for gathering and indexing information and saving it to database files that can later be queried.

On the Web, search engines use programs known as robots (also called crawlers, spiders, worms, wanderers, and walkers) to gather information. A robot automatically collects and indexes information from the Web and then stores it in a database. (Note: a robot retrieves a page and then uses the links on that page as the starting points for new URLs to index.) Users can then query these databases to find the information they need.

With preemptive multithreading, you can index a URL-based Web page, then start a new thread to follow each new URL link and index that new starting point. The project uses an MDI document class together with a custom MDI child frame: an edit view is displayed when a Web page is downloaded, and a list view is displayed when URL connections are being checked. The project also uses the CObArray, CInternetSession, CHttpConnection, CHttpFile, and CWinThread MFC classes. CWinThread is used to create the threads, rather than the asynchronous mode of CInternetSession, which is a holdover from WinSock on the 16-bit Windows platform.

The Spider project uses simple worker threads to check URL links or download Web pages. The CSpiderThread class is derived from CWinThread, so each CSpiderThread object can use CWinThread's MESSAGE_MAP() functions. By declaring DECLARE_MESSAGE_MAP() in the CSpiderThread class, the user interface remains responsive to user input: you can check URL links on one Web server while downloading or opening a Web page from another. The user interface stops responding to user input only when the number of threads exceeds MAXIMUM_WAIT_OBJECTS, which is defined as 64. In the constructor of each CSpiderThread object, we supply the ThreadProc function and the thread parameter structure that will be passed to that ThreadProc function:

CSpiderThread* pThread;
pThread = NULL;
pThread = new CSpiderThread(CSpiderThread::ThreadFunc, pThreadParams); // create a new CSpiderThread object

In the CSpiderThread constructor, we set the CWinThread* m_pThread pointer in the thread parameter structure so that it points to the correct instance of this thread:
pThreadParams->m_pThread = this;
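
The article shows CSpiderThread only in fragments; a minimal sketch of a class declaration that would fit those fragments follows. ThreadFunc and ThreadRun appear in the article's code, but the exact signatures and everything else here are assumptions (THREADPARAMS is defined further below).

// Sketch only: a possible CSpiderThread declaration; details are assumptions.
#include <afxwin.h>     // CWinThread
#include <afxinet.h>    // MFC WinInet classes

typedef struct tagTHREADPARAMS THREADPARAMS;   // defined further below

class CSpiderThread : public CWinThread
{
public:
    // Stores the worker procedure and its parameter block; as shown above,
    // the constructor also sets pThreadParams->m_pThread = this.
    CSpiderThread(AFX_THREADPROC pfnThreadProc, LPVOID pParam);

    void ThreadRun(THREADPARAMS* pParams);     // checks a URL link or downloads a page
    static UINT ThreadFunc(LPVOID pParam);     // worker thread entry point (shown below)

protected:
    DECLARE_MESSAGE_MAP()
};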

The CSpiderThread ThreadProc function:

// Simple worker thread function
UINT CSpiderThread::ThreadFunc(LPVOID pParam)
{
    THREADPARAMS* lpThreadParams = (THREADPARAMS*)pParam;
    CSpiderThread* lpThread = (CSpiderThread*)lpThreadParams->m_pThread;

    lpThread->ThreadRun(lpThreadParams);

    // Use SendMessage instead of PostMessage to keep the current thread count in sync.
    // If the number of threads exceeds MAXIMUM_WAIT_OBJECTS (64), the program
    // stops responding to user input.
    ::SendMessage(lpThreadParams->m_hwndPolicyProgress,
        WM_USER_THREAD_DONE, 0, (LPARAM)lpThreadParams);
    // Delete lpThreadParams and decrement the total thread count

    return 0;
}

This structure is passed to the CSpiderThread ThreadProc function:
typedef struct tagTHREADPARAMS
{
    HWND          m_hwndPolicyProgress;
    HWND          m_hwndPolicyView;
    CWinThread*   m_pThread;
    CString       m_pszURL;
    CString       m_Contents;
    CString       m_strServerName;
    CString       m_strObject;
    CString       m_checkURLName;
    CString       m_string;
    DWORD         m_dwServiceType;
    DWORD         m_threadID;
    DWORD         m_Status;
    URLSTATUS     m_pStatus;
    INTERNET_PORT m_nPort;
    int           m_Type;
    BOOL          m_RootLinks;
} THREADPARAMS;
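
For orientation, a parameter block might be prepared along these lines before a thread is created; the helper function, the placeholder handles, and the default values chosen here are assumptions, not code from the project.

// Sketch only: preparing a THREADPARAMS block before starting a worker thread.
// The window handles and URL passed in are placeholders.
THREADPARAMS* PrepareCheckParams(HWND hwndProgress, HWND hwndView, LPCTSTR pszURL)
{
    THREADPARAMS* pThreadParams = new THREADPARAMS;
    pThreadParams->m_hwndPolicyProgress = hwndProgress;               // tracks thread completion (WM_USER_THREAD_DONE)
    pThreadParams->m_hwndPolicyView     = hwndView;                   // list view that shows URL status (WM_USER_CHECK_DONE)
    pThreadParams->m_pszURL             = pszURL;                     // starting URL to check
    pThreadParams->m_nPort              = INTERNET_DEFAULT_HTTP_PORT; // port 80
    pThreadParams->m_dwServiceType      = AFX_INET_SERVICE_HTTP;
    pThreadParams->m_RootLinks          = TRUE;                       // follow links found on this page
    return pThreadParams;
}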

After the CSpiderThread object is created, we call the CreateThread function to start the new thread object executing.

if (!pThread->CreateThread())   // start the CWinThread object executing
{
    AfxMessageBox("Cannot start new thread");
    delete pThread;
    pThread = NULL;
    delete pThreadParams;
    return FALSE;
}
Once the new thread is running, we use the ::SendMessage function to send a message to the CDocument's CListView; the message carries the URL link status structure.
if (pThreadParams->m_hwndPolicyView != NULL)
    ::SendMessage(pThreadParams->m_hwndPolicyView, WM_USER_CHECK_DONE, 0, (LPARAM)&pThreadParams->m_pStatus);

URL status structure:

typedef struct tagURLSTATUS
{
    CString m_URL;
    CString m_URLPage;
    CString m_StatusString;
    CString m_LastModified;
    CString m_ContentType;
    CString m_ContentLength;
    DWORD   m_Status;
} URLSTATUS, *PURLSTATUS;
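
The list view's handler for WM_USER_CHECK_DONE is not shown in the article; a minimal sketch of what such a handler might look like, assuming a report-style CListView-derived class (the class name, handler name, and column layout are assumptions):

// Sketch only: a possible WM_USER_CHECK_DONE handler in a CListView-derived class.
LRESULT CSpiderListView::OnCheckDone(WPARAM /*wParam*/, LPARAM lParam)
{
    PURLSTATUS pStatus = (PURLSTATUS)lParam;   // sent by the worker thread above
    CListCtrl& list = GetListCtrl();

    int nItem = list.InsertItem(list.GetItemCount(), pStatus->m_URL);
    list.SetItemText(nItem, 1, pStatus->m_StatusString);
    list.SetItemText(nItem, 2, pStatus->m_ContentType);
    list.SetItemText(nItem, 3, pStatus->m_LastModified);
    return 0;
}

// In the view's message map:
// ON_MESSAGE(WM_USER_CHECK_DONE, OnCheckDone)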

Each new thread creates a CMyInternetSession object (a class derived from CInternetSession) and sets EnableStatusCallback to TRUE, so we can check the status of all InternetSession callbacks. The dwContext ID used for the callbacks is set to the thread ID.

BOOL CInetThread::InitServer()
{
    try
    {
        m_pSession = new CMyInternetSession(AgentName, m_nThreadID);
        int nTimeout = 30; // Very important! Too small and the server times out;
                           // too large and the thread can hang.

        /*
        The timeout value for a network connection request, in milliseconds.
        If the connection request takes longer than this timeout, the request is canceled.
        The default timeout is infinite.
        */
        m_pSession->SetOption(INTERNET_OPTION_CONNECT_TIMEOUT, 1000 * nTimeout);

        /* The delay between connection retries, in milliseconds. */
        m_pSession->SetOption(INTERNET_OPTION_CONNECT_BACKOFF, 1000);

        /* The number of retries for a network connection request. If a connection
           attempt still fails after the specified number of retries, the request
           is canceled. The default is 5. */
        m_pSession->SetOption(INTERNET_OPTION_CONNECT_RETRIES, 1);
        m_pSession->EnableStatusCallback(TRUE);
    }
    catch (CInternetException* pEx)
    {
        // Catch errors from WinInet
        // pEx->ReportError();
        m_pSession = NULL;
        pEx->Delete();
        return FALSE;
    }

    return TRUE;
}
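
The CMyInternetSession class itself is not reproduced in the article; a minimal sketch, assuming it only overrides OnStatusCallback and uses the thread ID passed in as the context ID (the TRACE-based reporting is an assumption):

// Sketch only: a CInternetSession-derived class that receives status callbacks.
class CMyInternetSession : public CInternetSession
{
public:
    CMyInternetSession(LPCTSTR pszAgentName, DWORD dwContext)
        : CInternetSession(pszAgentName, dwContext) {}

    virtual void OnStatusCallback(DWORD dwContext, DWORD dwInternetStatus,
                                  LPVOID /*lpvStatusInformation*/,
                                  DWORD /*dwStatusInformationLength*/)
    {
        // dwContext is the thread ID supplied in the constructor, so the
        // status report can be matched to the thread that issued the request.
        TRACE(_T("Thread %lu: InternetSession status %lu\n"),
              dwContext, dwInternetStatus);
    }
};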

The key to using the MFC WinInet classes in a single-threaded or multithreaded program is to wrap every MFC WinInet call in try/catch blocks. Because Internet connections are sometimes unstable, and the Web page you request may not exist, these calls can throw a CInternetException.


try
{
    // Some MFC WinInet class function
}
catch (CInternetException* pEx)
{
    // Catch errors from WinInet
    // pEx->ReportError();
    pEx->Delete();
    return FALSE;
}
The maximum number of threads is set to 64, but you can set it to any value from 1 to 100. Setting it too high causes some link checks to fail, meaning you will have to re-check those URLs. Continuous, rapid HTTP requests against a /cgi-bin/ directory can bring a server down, and this spider can issue four HTTP requests per second, or 240 per minute, which can also overload a server. Be considerate when checking any server: every server logs the requesting agent's IP address for each Web file requested, and you may receive email from an annoyed Web server administrator.
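
The project's actual throttling code is not shown; as one possible approach (not the article's), a launcher could cap the number of active threads and pace requests with a simple delay:

// Sketch only: capping active threads and pacing requests. The counter, the
// constants, and the Sleep-based pacing are assumptions, not project code.
static LONG         g_nActiveThreads = 0;   // incremented here, decremented by the WM_USER_THREAD_DONE handler
static const LONG   MAX_THREADS      = 64;  // keep at or below MAXIMUM_WAIT_OBJECTS
static const DWORD  REQUEST_DELAY_MS = 250; // pause between requests; larger values are gentler on servers

void WaitForThreadSlot()
{
    while (g_nActiveThreads >= MAX_THREADS)
        ::Sleep(100);                       // wait for a worker thread to finish
    ::Sleep(REQUEST_DELAY_MS);              // avoid flooding any one server
    ::InterlockedIncrement(&g_nActiveThreads);
    // ...create the THREADPARAMS block and start the next CSpiderThread here...
}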


You can create a robots.txt file for directories that you do not want indexed. This mechanism is usually used to protect /cgi-bin/ directories, because retrieving CGI scripts consumes extra server resources, and when the spider checks URL links it can easily request too many documents too quickly. The Spider program adheres to the Robots Exclusion Standard, an agreement among robot developers that lets WWW sites limit what robots may request from their URLs. By honoring this access-restriction standard, a robot will not retrieve any documents that the Web server wants excluded. Before checking a root URL, the program checks for a robots.txt file in the main directory; if the spider finds one, it abandons the search. The program also checks the META tags on every Web page: if it finds a META tag with name="Robots" and content="noindex,nofollow", it does not index the URLs on that page.
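
The exclusion checks themselves are not reproduced in the article; a minimal sketch of the META-tag side, assuming the downloaded page is already held in a CString (the function name is hypothetical, and a real spider would also parse robots.txt):

// Sketch only: a crude check of the META robots tag in a downloaded page.
BOOL PageAllowsIndexing(const CString& strContents)
{
    CString strPage(strContents);
    strPage.MakeLower();                       // META tags are case-insensitive

    int nMeta = strPage.Find(_T("name=\"robots\""));
    if (nMeta == -1)
        return TRUE;                           // no robots META tag: indexing allowed

    // Look for "noindex" or "nofollow" after the tag's name attribute
    CString strRest = strPage.Mid(nMeta);
    if (strRest.Find(_T("noindex")) != -1 || strRest.Find(_T("nofollow")) != -1)
        return FALSE;                          // page asks not to be indexed/followed
    return TRUE;
}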

Built with:
Windows 95
MFC / VC++ 5.0
Wininet.h dated 9/25/97
Wininet.lib dated 9/16/97
Wininet.dll dated 9/18/97
