Web Source Code Filtering

Source: Internet
Author: User

This example uses MIME filter technology to filter web page source code. The text below is taken from an article on HTML code filtering technology.

To implement HTML code filtering, you must register one or more MIME filters (pluggable MIME filters). A MIME filter is a COM object that must implement the IInternetProtocolSink and IInternetProtocol interfaces. Before implementing the MIME filter object, let's look at how the MIME filter and the web processor (the transaction handler in urlmon.dll, which itself implements IInternetProtocol and IInternetProtocolSink) call each other, as described in the Pluggable Protocols overview:

1. The web processor calls the IInternetProtocolRoot::Start method of the MIME filter (IInternetProtocol is derived from IInternetProtocolRoot).
2. The web processor successively calls the IInternetProtocolSink::ReportProgress and IInternetProtocolSink::ReportData methods of the MIME filter.
3. The MIME filter calls the IInternetProtocol::Read method of the web processor.
4. The MIME filter calls the IInternetProtocolSink::ReportData method of the web processor.
5. The web processor calls the IInternetProtocol::Read method of the MIME filter.

To implement a MIME filter, therefore, the following methods are the important ones.

1. IInternetProtocolRoot::Start method

    HRESULT Start([in] LPCWSTR szUrl, [in] IInternetProtocolSink *pOIProtSink, [in] IInternetBindInfo *pOIBindInfo, [in] DWORD grfPI, [in] DWORD dwReserved);

For a MIME filter object, szUrl carries the MIME type (for a name space handler object, this parameter is the URL to be downloaded or parsed). If you need the URL, you can get it through the pOIBindInfo interface. For example:

    LPOLESTR pwzUrl;
    ULONG uElFetched;
    pOIBindInfo->GetBindString(BINDSTRING_URL, &pwzUrl, 1, &uElFetched);

pOIProtSink is the IInternetProtocolSink interface provided by urlmon.dll. It is called later during processing, so it must be saved. grfPI is a set of enumeration flags and must contain PI_FILTER_MODE, which indicates that the object runs in filter mode. dwReserved is a pointer to a PROTOCOLFILTERDATA structure.
The pProtocol member of this structure is the IInternetProtocol interface provided by urlmon.dll. It is also called later during processing, so it should be saved as well. In fact, this interface can also be obtained by calling QueryInterface on the pOIProtSink parameter. Similarly, the pProtocolSink member of PROTOCOLFILTERDATA and pOIProtSink point to the same interface. In the Start method, then, all we need to do is save the IInternetProtocolSink and IInternetProtocol interfaces provided by urlmon.dll.

2. IInternetProtocolSink::ReportProgress method

    HRESULT ReportProgress([in] ULONG ulStatusCode, [in] LPCWSTR szStatusText);

For a MIME filter, ulStatusCode is generally BINDSTATUS_CACHEFILENAMEAVAILABLE, in which case szStatusText is the path of the temporary cache file. Some web pages are not written to the cache, however, so szStatusText may be an empty string.

3. IInternetProtocolSink::ReportData method

    HRESULT ReportData([in] DWORD grfBSCF, [in] ULONG ulProgress, [in] ULONG ulProgressMax);

IE calls the MIME filter's ReportData method while a file is downloading or after the download has finished. ulProgressMax is the total amount of data in the file, and ulProgress is the amount downloaded so far. In theory, ulProgress should equal ulProgressMax once the whole file has been downloaded (in practice, when the page is not very large, the whole file may already be downloaded even though ulProgress does not equal ulProgressMax). Another parameter that reflects the state of the download is grfBSCF. The web processor sometimes calls ReportData more than once. ReportData is a suitable place to filter or modify the page content: here you can call Read to save the page content into your own buffer or stream and process it as needed (paying attention to the character encoding). Finally, do not forget to call the web processor's IInternetProtocolSink::ReportData method to report the download status to it.
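As a rough illustration of how the ReportData arguments described above might be interpreted, the following portable sketch decides whether a download looks finished. It is not taken from the original sample. The helper name isDownloadComplete is hypothetical, and the BSCF constants are redefined locally (using the values from the BSCF enumeration in urlmon.h) so that the snippet builds without the Windows SDK. It assumes the caller sets BSCF_LASTDATANOTIFICATION on the final notification, and it falls back to the progress counters only when that flag is absent.

```cpp
#include <cassert>
#include <cstdint>

// BSCF flag values as defined in urlmon.h, redefined here only so that the
// sketch compiles without the Windows SDK.
constexpr std::uint32_t kBscfFirstDataNotification        = 0x01;
constexpr std::uint32_t kBscfIntermediateDataNotification = 0x02;
constexpr std::uint32_t kBscfLastDataNotification         = 0x04;

// Hypothetical helper: returns true when a ReportData call appears to be the
// final one. The flag is checked first because ulProgress may never reach
// ulProgressMax even though all data has arrived.
bool isDownloadComplete(std::uint32_t grfBSCF,
                        std::uint32_t ulProgress,
                        std::uint32_t ulProgressMax) {
    if (grfBSCF & kBscfLastDataNotification) {
        return true;
    }
    // Fallback: a known, non-zero total that has been reached.
    return ulProgressMax != 0 && ulProgress >= ulProgressMax;
}

// Small self-check that exercises the helper; call it from main() if desired.
void checkExamples() {
    assert(!isDownloadComplete(kBscfIntermediateDataNotification, 512, 4096));
    assert(isDownloadComplete(kBscfLastDataNotification, 3000, 4096));
}
```

In a real filter the same test would be applied to the grfBSCF, ulProgress, and ulProgressMax arguments of ReportData before deciding to process the buffered page.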
After the web processor receives this notification, it calls the IInternetProtocol::Read method of the MIME filter. At that point you can hand the modified data to the web processor. The following code shows how ReportData calls the web processor's Read to save the data in advance:

    CString ts("");
    char p[1024];
    HRESULT hr;
    ULONG readTotal;
    do {
        memset(p, 0, sizeof(p));
        hr = urlmonProtocol->Read(p, sizeof(p) - 1, &readTotal);
        CString pTemp(p);   // location A: keep the data whatever hr is
        ts = ts + pTemp;
    } while ((hr != S_FALSE) && (hr != INET_E_DOWNLOAD_FAILURE) && (hr != INET_E_DATA_NOT_AVAILABLE));

Generally, Read returns only S_OK or S_FALSE when data is obtained successfully. S_OK means that more data is available and S_FALSE means that all data has been read, so it might seem sufficient to loop while hr == S_OK. Why, then, is the data at location A not appended only if (hr == S_OK || hr == S_FALSE)? In some cases Read returns another value even though some data was read successfully, with the amount given by readTotal. If that part of the data is dropped, the web page cannot be parsed correctly.

If the page was not written to the cache, create a temporary file:

    if (cacheFileName == "") {
        TCHAR fname[512];
        CreateUrlCacheEntry(OLE2T(url), ts.GetLength(), _T("htm"), fname, 0);
        CFile hFile;
        hFile.Open(fname, CFile::modeCreate | CFile::modeWrite);
        hFile.Write(ts, ts.GetLength());
        ReportProgress(BINDSTATUS_CACHEFILENAMEAVAILABLE, T2W(fname));
    }

Modify the web page code:

    ts.Replace(_T("Baidu"), _T("kilobytes"));

Prepare the data for the browser:

    totalSize = ts.GetLength();
    CreateStreamOnHGlobal(0, TRUE, &dataStream);
    const char *pts = ts.GetBuffer(ts.GetLength());
    ULONG cbWritten;
    dataStream->Write(pts, ts.GetLength(), &cbWritten);
    ts.ReleaseBuffer();
    pts = NULL;
    ULARGE_INTEGER dummy;
    LARGE_INTEGER zero;
    zero.QuadPart = 0;
    dataStream->Seek(zero, STREAM_SEEK_SET, &dummy);   // rewind to the start

4. IInternetProtocol::Read method

The web processor calls this method to obtain the data that the browser will parse. In ReportData we already cached all the data in a stream, so here we only need to return the data in the stream to the web processor. The following code shows a simple implementation of Read:

    dataStream->Read(pv, cb, pcbRead);
    written += *pcbRead;
    if (written == totalSize) {
        return S_FALSE;
    } else {
        return S_OK;
    }

Important: S_FALSE must be returned once all the data has been read, otherwise Read may be called in an infinite loop.

Once these methods are handled, the job is essentially done. The remaining methods are very simple; you can refer to the example.
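The buffer-then-serve pattern described in this article can be tried out without COM. The sketch below is a portable simulation under stated assumptions, not the article's implementation: ReadStatus, MockUpstream, BufferingFilter, and drain are all hypothetical names, with ReadStatus::MoreData standing in for S_OK and ReadStatus::Done for S_FALSE. It shows the two points made in the text: bytes returned by a read are kept whatever status accompanies them, and the filter's own read reports Done once everything has been handed out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Stand-ins for S_OK (MoreData) and S_FALSE (Done); not the Windows values.
enum class ReadStatus { MoreData, Done };

// Mock of the upstream protocol. Its final chunk is delivered together with
// Done, so a caller that ignores bytes on a non-MoreData status loses data.
class MockUpstream {
public:
    explicit MockUpstream(std::string data) : data_(std::move(data)) {}
    ReadStatus read(char* buf, std::size_t cap, std::size_t* got) {
        assert(buf != nullptr && got != nullptr);
        const std::size_t n = std::min(cap, data_.size() - pos_);
        std::memcpy(buf, data_.data() + pos_, n);
        pos_ += n;
        *got = n;
        return pos_ == data_.size() ? ReadStatus::Done : ReadStatus::MoreData;
    }
private:
    std::string data_;
    std::size_t pos_ = 0;
};

class BufferingFilter {
public:
    // Plays the role of ReportData: drain upstream, modify, rewind.
    void onData(MockUpstream& up, const std::string& from, const std::string& to) {
        char chunk[16];
        std::size_t got = 0;
        ReadStatus st;
        do {
            st = up.read(chunk, sizeof(chunk), &got);
            buffer_.append(chunk, got);  // keep the bytes whatever the status
        } while (st == ReadStatus::MoreData);
        replaceAll(buffer_, from, to);
        served_ = 0;
    }
    // Plays the role of the filter's Read: serve the buffer, then report Done.
    ReadStatus read(char* out, std::size_t cap, std::size_t* got) {
        assert(out != nullptr && got != nullptr);
        const std::size_t n = std::min(cap, buffer_.size() - served_);
        std::memcpy(out, buffer_.data() + served_, n);
        served_ += n;
        *got = n;
        return served_ == buffer_.size() ? ReadStatus::Done : ReadStatus::MoreData;
    }
private:
    static void replaceAll(std::string& s, const std::string& from, const std::string& to) {
        if (from.empty()) return;
        for (std::size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) {
            s.replace(pos, from.size(), to);
        }
    }
    std::string buffer_;
    std::size_t served_ = 0;
};

// Caller side: keeps reading until Done, as the web processor would.
std::string drain(BufferingFilter& f) {
    std::string out;
    char buf[8];
    std::size_t got = 0;
    ReadStatus st;
    do {
        st = f.read(buf, sizeof(buf), &got);
        out.append(buf, got);
    } while (st == ReadStatus::MoreData);
    return out;
}
```

To use it, construct a MockUpstream with some HTML text, call onData on a BufferingFilter with the strings to replace, and pass the filter to drain. If read never returned Done, the loop in drain would not terminate, which mirrors the infinite-loop warning above.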

 
