Introduction to Static Page Generation

Source: Internet
Author: User
Tags: comments, file size, html tags, md5, regular expression, servervariables


There are only two main steps to generating an HTML file:



First, get the contents of the HTML file to be generated.
Second, save those contents as an HTML file.



My main point here is the first step: how to get the contents of the HTML file to be generated.



There are currently several ways to get the HTML content:



1. The HTML content is written directly in the script as concatenated strings. With this method it is hard to preview the generated page, the layout cannot be designed visually, and changing the HTML template is complicated. Many people work this way, but I find it the most inconvenient method.



Str= "str=str& "


2. Make a separate HTML template page and mark the dynamic content with specific characters (for example, some people mark the page title as $title$). Load the template's contents with ADODB.Stream or Scripting.FileSystemObject, then replace the markers with the dynamic content, for example: Replace(loaded template contents, "$title$", rs("title")).
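A minimal sketch of this template approach (the template file name and the $title$/$content$ markers are examples, not from any particular project):

Dim fso, tpl, html
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set tpl = fso.OpenTextFile(Server.MapPath("template.htm"), 1)   ' 1 = ForReading
html = tpl.ReadAll
tpl.Close
html = Replace(html, "$title$", rs("title"))       ' rs: an open recordset holding the dynamic content
html = Replace(html, "$content$", rs("content"))
Set tpl = Nothing
Set fso = Nothing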



3. Use XMLHTTP or ServerXMLHTTP to obtain the HTML content that the dynamic page outputs:



An example I have used to generate an HTML file:
'-----------------Shanghai (Xiao-qi)
' weburl is the address of the dynamic page to fetch
' GetHTTPPage(weburl) is a function that gets the contents of a dynamic page
weburl = "http://" & Request.ServerVariables("SERVER_NAME") & "/contact.asp?id=" & rs("id")   ' specify the dynamic page address
body = GetHTTPPage(weburl)   ' fetch the contents of the dynamic page with the function
'-----------------Shanghai (Xiao-qi)



The biggest advantage of this method is that no effort is spent writing a static template page; the existing dynamic page is simply turned into a static HTML page. Generation speed, however, is not especially fast.

My usual method of generating HTML is the third: use XMLHTTP to get the HTML content output by the dynamic page, then save it as an HTML file with ADODB.Stream or Scripting.FileSystemObject.
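Putting the two steps together, a minimal end-to-end sketch (the GetBody and BytesToBstr helpers are defined later in this article; the output file name and the GB2312 encoding are assumptions):

Dim fso, f, html
html = BytesToBstr(GetBody(weburl), "GB2312")   ' step 1: fetch and decode the dynamic page
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set f = fso.CreateTextFile(Server.MapPath("contact_" & rs("id") & ".htm"), True)   ' True = overwrite
f.Write html                                    ' step 2: save it as a static file
f.Close
Set f = Nothing
Set fso = Nothing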

The second step is writing the file:



In ASP there are two common ways to generate the file: with ADODB.Stream and with Scripting.FileSystemObject:



1. Generating the file with Scripting.FileSystemObject:



'-----------------Shanghai (Xiao-qi)
Set fso = CreateObject("Scripting.FileSystemObject")
file = Server.MapPath("path and file name to generate.htm")
Set txt = fso.OpenTextFile(file, 8, True)   ' 8 = ForAppending; create the file if it does not exist
data1 = "file contents"                     ' written with the WriteLine method (adds a line break)
txt.WriteLine data1
data2 = "file contents"                     ' written with the Write method (no line break)
txt.Write data2
txt.Close
Set txt = Nothing
Set fso = Nothing
'-----------------Shanghai (Xiao-qi)



2. Generating the file with ADODB.Stream:






'-----------------Shanghai (Xiao-qi)
Dim objAdoStream
Set objAdoStream = Server.CreateObject("ADODB.Stream")
objAdoStream.Type = 2              ' 2 = adTypeText, since we are writing a string
objAdoStream.Charset = "gb2312"    ' encoding of the generated file
objAdoStream.Open
objAdoStream.WriteText "file content"
objAdoStream.SaveToFile Server.MapPath("path and file name to generate.htm"), 2   ' 2 = adSaveCreateOverWrite
objAdoStream.Close
Set objAdoStream = Nothing
'-----------------Shanghai (Xiao-qi)



The principle of collection:



The main steps of the collection process are as follows:



First, get the content of the page being collected.
Second, extract the needed data from the collected code.



First, getting the content of the page being collected



These are the methods commonly used in ASP to obtain the content of the page to be collected:



1. Using the ServerXMLHTTP component:



Function GetBody(weburl)
'-----------------Shanghai (Xiao-qi)
    ' Create the object
    Dim objXmlHttp
    Set objXmlHttp = Server.CreateObject("Msxml2.ServerXMLHTTP")
    ' Request the file asynchronously, then wait until the response is complete
    objXmlHttp.open "GET", weburl, True
    objXmlHttp.send
    While objXmlHttp.readyState <> 4
        objXmlHttp.waitForResponse 1000
    Wend
    ' Get the result
    GetBody = objXmlHttp.responseBody
    ' Release the object
    Set objXmlHttp = Nothing
'-----------------Shanghai (Xiao-qi)
End Function



Call method: GetBody(URL of the file)



2. Using the XMLHTTP component:



Function GetBody(weburl)
'-----------------Shanghai (Xiao-qi)
    ' Create the object
    Dim retrieval
    Set retrieval = CreateObject("Microsoft.XMLHTTP")
    With retrieval
        .open "GET", weburl, False, "", ""
        .send
        GetBody = .responseBody
    End With
    ' Release the object
    Set retrieval = Nothing
'-----------------Shanghai (Xiao-qi)
End Function



Call method: GetBody(URL of the file)



The data obtained this way is a raw byte stream whose encoding must be converted before it can be used:



Function BytesToBstr(body, cset)
'-----------------Shanghai (Xiao-qi)
    Dim objstream
    Set objstream = Server.CreateObject("ADODB.Stream")
    objstream.Type = 1            ' 1 = adTypeBinary: accept the raw bytes
    objstream.Mode = 3            ' 3 = adModeReadWrite
    objstream.Open
    objstream.Write body
    objstream.Position = 0
    objstream.Type = 2            ' 2 = adTypeText: reinterpret the bytes as text
    objstream.Charset = cset
    BytesToBstr = objstream.ReadText
    objstream.Close
    Set objstream = Nothing
'-----------------Shanghai (Xiao-qi)
End Function



Call method: BytesToBstr(data to be converted, encoding). The encoding is commonly GB2312 or UTF-8.
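For example, to fetch and decode a page in one line (the URL and the GB2312 encoding are examples):

Dim html
html = BytesToBstr(GetBody("http://www.xiaoqi.net/index.asp"), "GB2312")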



Second, extracting the needed data from the collected code



1. Using VBScript's built-in Mid function to cut out the required data



Function Body(wstr, start, over)
'-----------------Shanghai (Xiao-qi)
    Dim istart, iover
    ' Position just past the unique start tag of the data to be extracted
    istart = InStr(wstr, start) + Len(start)
    ' Position of the matching unique end tag
    iover = InStr(istart, wstr, over)
    ' Return the text between the two tags
    Body = Mid(wstr, istart, iover - istart)
'-----------------Shanghai (Xiao-qi)
End Function



Call method: Body(content of the collected page, start tag, end tag)



2. Using regular expressions to extract the required data



Function Body(wstr, start, over)
'-----------------Shanghai (Xiao-qi)
    Dim Xiaoqi, Matches, Match
    Set Xiaoqi = New RegExp                 ' Create the regular expression object
    Xiaoqi.IgnoreCase = True                ' Ignore case
    Xiaoqi.Global = True                    ' Search the whole text
    Xiaoqi.Pattern = start & ".+?" & over   ' Regular expression: start tag, shortest possible middle, end tag
    Set Matches = Xiaoqi.Execute(wstr)      ' Run the search
    Set Xiaoqi = Nothing
    Body = ""
    For Each Match In Matches
        Body = Body & Match.Value           ' Concatenate each match
    Next
'-----------------Shanghai (Xiao-qi)
End Function



Call method: Body(content of the collected page, start tag, end tag)



A detailed walkthrough of a collection program:



1. Get the paging addresses of the site's list pages.
At present, most dynamic websites have regular paging addresses, for example:
Dynamic pages:
First page: index.asp?page=1
Second page: index.asp?page=2
Third page: index.asp?page=3
.....



Static pages:
First page: page_1.htm
Second page: page_2.htm
Third page: page_3.htm
.....



To get the paging addresses of the site's list pages, replace the changing part of each page address with a variable, for example: page_<%=page%>.htm
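For example, a loop over the list pages might look like this (the page count and URL pattern are assumptions):

Dim page, weburl, html
For page = 1 To 10                                ' assumed number of list pages
    weburl = "http://www.xiaoqi.net/index.asp?page=" & page
    html = BytesToBstr(GetBody(weburl), "GB2312") ' fetch this list page
    ' ... extract the content-page URLs from html, as shown in step 3
Next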



2. Get the content of each list page of the collected website.



3. Extract the URL addresses of the content pages from the list-page code.
In most list pages the links to content pages also follow fixed rules, for example:
<a href="url1">Link 1</a> <br>
<a href="url2">Link 2</a> <br>
<a href="url3">Link 3</a> <br>



You can gather the URL links with the following code:



'-----------------Shanghai (Xiao-qi)
Dim Xiaoqi, Matches, Match, Url
Set Xiaoqi = New RegExp
Xiaoqi.IgnoreCase = True
Xiaoqi.Global = True
Xiaoqi.Pattern = """.+?"""                 ' match every double-quoted string
Set Matches = Xiaoqi.Execute(strPageList)  ' strPageList: the list-page content obtained in step 2
Set Xiaoqi = Nothing
Url = ""
For Each Match In Matches
    Url = Url & Match.Value
Next
'-----------------Shanghai (Xiao-qi)
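Note that this pattern simply grabs every double-quoted string on the page, not only the links. A slightly tighter variant (my own adjustment, not part of the original code) anchors on the href attribute and keeps only the link targets:

Dim Xiaoqi, Matches, Match, Url
Set Xiaoqi = New RegExp
Xiaoqi.IgnoreCase = True
Xiaoqi.Global = True
Xiaoqi.Pattern = "href=""(.+?)"""          ' capture only the quoted href value
Set Matches = Xiaoqi.Execute(strPageList)  ' strPageList: the list-page HTML
Url = ""
For Each Match In Matches
    Url = Url & Match.SubMatches(0) & vbCrLf
Next
Set Xiaoqi = Nothing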



4. Get the content of each content page and, using the "extraction tags", cut the required data out of the collected page content.



Because the pages are generated dynamically, most content pages share the same HTML tags, and we can extract the needed parts based on these regular tags. For example:



Every page has a title between <title> and </title>; the Mid-based interception function written above can fetch the value between <title> and </title>, or a regular expression can be used.
Example: Body("<title>page title</title>", "<title>", "</title>")



At present there are many methods to prevent collection. First, let me introduce the common anti-collection strategies, their drawbacks, and the collectors' countermeasures:



First, count how many of the site's pages a single IP requests within a given period, and deny that IP access if it is clearly faster than a person could browse. (A minimal sketch of this check follows the notes below.)



Disadvantages:
1. This method only works for dynamic pages (asp\jsp\php ...); with static pages the server cannot count how many times a given IP has requested them.
2. It seriously interferes with search engine spiders, which crawl quickly and with multiple threads; the same rule that blocks collectors will also block spiders from indexing the site's files.
Collector's countermeasure: slow the collection speed down, or give up.
Recommendation: build an IP library of search engine spiders so that only spiders may browse the site quickly. Collecting such a library is not easy either, since a spider does not necessarily use only one fixed IP address.
Comment: this method is fairly effective against collection, but it will affect search engine indexing.
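A minimal sketch of such a check (the 100-requests-per-minute threshold and the Application-based counter are assumptions; a real implementation would also need to clean up stale entries):

Dim ip, elapsed
ip = Request.ServerVariables("REMOTE_ADDR")
Application.Lock
If IsEmpty(Application("hits_" & ip)) Then
    Application("hits_" & ip) = 0
    Application("first_" & ip) = Now()
End If
elapsed = DateDiff("s", Application("first_" & ip), Now())
If elapsed > 60 Then                       ' start a new one-minute window
    Application("hits_" & ip) = 0
    Application("first_" & ip) = Now()
End If
Application("hits_" & ip) = Application("hits_" & ip) + 1
If Application("hits_" & ip) > 100 Then    ' faster than any person browses
    Application.Unlock
    Response.Status = "403 Forbidden"
    Response.End
End If
Application.Unlock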



Second, use JavaScript to encrypt content pages



Disadvantages: this method works for static pages, but it seriously affects search engine indexing; what the search engine receives is also the encrypted content.
Collector's countermeasure: none needed; a collector who wants the content can simply grab the decrypting JS script as well.
Recommendation: no good improvement to suggest at present.
Comment: webmasters who rely on search engines for traffic are advised not to use this method.



Third, replace specific tags in content pages with "the specific tag + hidden copyright text"



Disadvantages: few; it only adds slightly to the page file size, but it is easy for collectors to counter.
Collector's countermeasure: delete the hidden copyright text along with the collected content, or replace it with your own copyright.
Recommendation: no good improvement to suggest at present.
Comment: I feel this has little practical value; even inserting random hidden text adds nothing.



Fourth, only allow browsing after the user has logged in



Disadvantages: this method seriously interferes with search engine spiders' indexing.
Collector's countermeasure: a countermeasure article has already been published; see "How an ASP thief program can use XMLHTTP to submit forms and send cookies or sessions".
Recommendation: no good improvement to suggest at present.
Comment: webmasters who rely on search engines for traffic are advised not to use this method; still, it does have some effect against ordinary collection programs.



Fifth, use JavaScript or VBScript to do the pagination



Disadvantages: affects search engine indexing.
Collector's countermeasure: analyze the JavaScript or VBScript, work out its paging rules, and write a paging-collection page for that specific site.
Recommendation: no good improvement to suggest at present.
Comment: anyone who knows the scripting language can find the paging rules.



Sixth, only allow pages to be viewed when reached through links on the site itself, for example by checking Request.ServerVariables("HTTP_REFERER"). (A minimal sketch of this check follows the notes below.)



Disadvantages: affects search engine indexing.
Collector's countermeasure: I am not sure whether the referring page can be faked; at present I have no corresponding countermeasure.
Recommendation: no good improvement to suggest at present.
Comment: webmasters who rely on search engines for traffic are advised not to use this method; still, it does have some effect against ordinary collection programs.
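A minimal sketch of this check, placed at the top of each content page (the simple InStr match is an assumption; HTTP_REFERER can be empty or forged by the client):

Dim referer
referer = Request.ServerVariables("HTTP_REFERER")
If InStr(referer, Request.ServerVariables("SERVER_NAME")) = 0 Then
    Response.Status = "403 Forbidden"
    Response.End      ' the request did not arrive via a link on this site
End If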



As can be seen from the above, the commonly used anti-collection methods either severely affect search engine indexing or do not actually prevent collection well. Is there, then, a method that effectively prevents collection without affecting search engine indexing? Please read on!



From the collection principles described earlier you can see that most collection programs work by analyzing rules: paging file-name rules and page code rules.



First, defeating collection based on paging file-name rules



Most collectors rely on analyzing paging file-name rules to collect many pages in bulk. If others cannot work out the file-name rules of your paging files, they cannot collect your site in bulk across multiple pages.



Implementation method:



I think a good approach is to encrypt the paging file name with MD5. Here, some people will say: if you use MD5 to encrypt the paging file name, others can simulate your encryption rule and still derive your paging file names.



I want to point out that when we encrypt the paging file name, we should not encrypt only the changing part of the file name.



If i represents the page number, then we should not encrypt it like this: page_name = MD5(i) & ".htm"



It is best to append one or more letters to the page number being encrypted, for example: page_name = MD5(i & "any one or several letters") & ".htm"



Because MD5 cannot be decrypted, others only see that the page names are MD5 results; they cannot tell which letters you appended after the page number, unless they brute-force the MD5, which is not very realistic.
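A sketch of the idea (it assumes an MD5() helper function is available to the ASP page, such as the widely circulated md5.asp include; the appended letters "xq" are just an example salt):

Dim i, page_name
For i = 1 To 10                          ' assumed number of paging files
    page_name = MD5(i & "xq") & ".htm"   ' "xq": secret letters appended before hashing
    ' ... generate the static paging file under this unguessable name
Next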



Second, defeating collection based on page code rules



If our content pages have no code rules, others cannot extract the content they need from our code. So, to prevent collection at this step, the code must be made irregular.



Implementation method:



Randomize the tags that the other side needs for extraction:



1. Make several page templates in which the important HTML tags differ, and pick a template at random when rendering the page content: some pages use a CSS+div layout, others a table layout (see the sketch below). This method is more work, since one content page needs several template pages, but preventing collection is itself a tedious business; making one extra template is worthwhile if it hinders collection, and for many people it is worth it.
2. If the above is too much trouble, simply randomizing the important HTML tags on each page also works.

The more web templates you make and the more random the HTML code is, the more trouble the other side has analyzing your content code, and the harder it is for them to write a collection strategy for your site. At that point most people will give up, because collectors are lazy; that is why they collect other people's site data in the first place. Besides, most people run collection programs developed by others; those who develop their own collector are a minority.
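For example, the template could be picked at random each time a static page is generated (the template file names here are hypothetical):

Dim templates, tplFile
templates = Array("tpl_table.htm", "tpl_div.htm", "tpl_list.htm")   ' hypothetical templates with different tag structures
Randomize
tplFile = templates(Int(Rnd * (UBound(templates) + 1)))
' ... load tplFile and fill in the dynamic content as in the template method above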



Here are a few more simple ideas:



1. Render content that matters to collectors but not to search engines with client-side script.
2. Split one page's worth of data across N pages, which also raises the difficulty of collection.
3. Link content more deeply. Most current collection programs only collect the first three levels of a site, so content linked at a deeper level can avoid being collected. However, this may make browsing less convenient for visitors. For example:



Most sites are: home ---- content index pagination ---- content page
If changed to:
home ---- content index pagination ---- content-page entry ---- content page
Note: the content-page entry is best given code that automatically jumps to the content page, such as:



<meta http-equiv="refresh" content="6;url=content page URL (http://www.xiaoqi.net)">



In fact, as long as the first anti-collection step (encrypting the paging file-name rule) is done well, the effect is already good. Still, I recommend using both methods at the same time to raise the difficulty of collection further, so that would-be collectors give up in the face of it.



