Similarities:
A. Both need to fetch the page's source code directly in order to work;
B. Both crawl the site's content many times within a given unit of time;
C. Seen over time, both operate from changing IP addresses;
D. Neither has the patience to crack the encryption (or verification) on your pages: for example, content encrypted via JS files, content that requires entering a verification code, or content that requires logging in.
Differences:
Search engine crawlers discard a page's scripts, styles, and HTML tags, then perform syntactic analysis on the remaining text. A collector, by contrast, usually captures the data it needs based on the features of HTML tags: when writing a collection rule, the collector fills in start and end markers for the target content in order to locate it, or writes a regular expression tailored to a specific page to filter out the desired content. Both the start/end markers and the regular expressions involve HTML tags, i.e. analysis of the page structure.
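A minimal Python sketch of the marker-based extraction just described; the sample HTML and both markers are hypothetical:

```python
import re

# Sample page and start/end markers (both hypothetical): the collector
# wants everything between the opening post div and its closing tag.
html = '<div class="post"><h1>Title</h1><p>Body text.</p></div><div class="footer">...</div>'
start_mark = '<div class="post">'
end_mark = '</div>'

# Build a non-greedy pattern from the two markers and extract the content.
pattern = re.escape(start_mark) + r"(.*?)" + re.escape(end_mark)
match = re.search(pattern, html, re.DOTALL)
content = match.group(1) if match else ""
print(content)  # → <h1>Title</h1><p>Body text.</p>
```

Real collection tools work the same way: the rule author pastes the two markers (or a full regular expression) into the tool, which then applies them to every fetched page.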
Below, we propose some anti-collection methods.
1. Restrict the number of visits per IP address per unit of time
Analysis: no normal person can access the same website five times in one second; anything that does must be a program, and the programs with that habit are search engine crawlers and unwelcome collectors.
Disadvantages: a one-size-fits-all measure that also blocks search engines from indexing the site.
Applicable websites: websites that do not rely heavily on search engines.
What the collector will do: reduce the number of visits per unit of time, at the cost of collection efficiency.
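The per-IP limit can be sketched as a sliding-window counter; the window size and limit below echo the "five times in one second" rule of thumb and are otherwise arbitrary assumptions:

```python
import time
from collections import defaultdict, deque

WINDOW = 1.0   # seconds; assumed window size
MAX_HITS = 5   # requests allowed per window per IP; assumed limit

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return True if this request is within the per-IP rate limit."""
    now = time.monotonic() if now is None else now
    q = hits[ip]
    # Drop timestamps that have fallen outside the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= MAX_HITS:
        return False  # over the limit: reject (or serve a captcha)
    q.append(now)
    return True
```

In a real deployment this check would run in web-server middleware, with the counters kept in shared storage rather than in-process memory.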
2. Block IP addresses
Analysis: use a back-end counter to record each visitor's IP address and access frequency, then manually review the access records and block suspicious IPs.
Disadvantages: keeps the webmaster busy, since every suspicious IP must be reviewed by hand.
Applicable websites: all websites, provided the webmaster can tell which robots belong to Google, Baidu, and the like.
What the collector will do: fight a guerrilla war! Rotate through IP proxies, collecting a little at a time, although using proxies reduces the collector's efficiency and network speed.
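The back-end counting step can be sketched as follows, assuming an access log whose first field is the visitor's IP; the threshold is an arbitrary assumption:

```python
from collections import Counter

THRESHOLD = 100  # hits before an IP is flagged; assumed value

def suspicious_ips(log_lines, threshold=THRESHOLD):
    """Count hits per IP (first field of each log line) and flag heavy hitters."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n >= threshold}
```

The flagged set is what the webmaster would then review by hand before adding entries to a firewall or server deny list.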
3. Use JS to encrypt webpage content
Note: I have never used this method myself, but I have seen it elsewhere.
Analysis: no analysis needed; it kills search engine crawlers and collectors alike.
Applicable websites: websites that hate both search engines and collectors.
What the collector will do: if you are that hostile, they simply will not bother with you.
4. Hide the website's copyright notice, or random junk text, in the webpage, with the styles for that text defined in CSS files
Analysis: this does not prevent collection, but the collected content will be littered with your site's copyright notice or junk text, because the collector does not fetch your CSS files at the same time; without the styles, the hidden text is displayed.
Applicable websites: all websites.
What the collector will do: the copyright text is easy to deal with, just replace it; against random junk text there is no real countermeasure, only diligent cleanup.
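The junk-text trick can be sketched as a template step that wraps random strings in a CSS class; the class name "noise" is hypothetical, and the site's stylesheet would hide it with display:none:

```python
import random
import string

def add_noise(paragraphs, seed=None):
    """Append a hidden junk-text span (class 'noise') to each paragraph."""
    rng = random.Random(seed)
    out = []
    for p in paragraphs:
        junk = "".join(rng.choices(string.ascii_lowercase, k=8))
        out.append(p + '<span class="noise">' + junk + "</span>")
    return "\n".join(out)
```

A visitor with the CSS loaded never sees the spans; a collector that skips the CSS file reproduces them as visible garbage.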
5. Require users to log in to access content
Analysis: search engine crawlers will not build a login routine for every website of this type. However, collectors can reportedly design a simulated user-login form submission for a specific target website.
Applicable websites: websites that hate search engines and want to block most collectors.
What the collector will do: build a module that simulates the behavior of submitting the user-login form.
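The collector-side module can be sketched with the standard library alone; the login URL and form field names below are made up, since a real site's form has to be inspected first:

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_login_request(url, username, password):
    """Build a POST request that mimics submitting a login form."""
    data = urlencode({"username": username, "password": password}).encode()
    return Request(url, data=data, method="POST")

req = build_login_request("https://example.com/login", "alice", "secret")
```

In practice the collector would send this request through an opener that keeps cookies (e.g. http.cookiejar plus urllib.request.build_opener) and reuse the logged-in session for every subsequent page fetch.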
6. Paginate content with scripts (hide the page links)
Analysis: as above, search engine crawlers will not analyze every website's hidden paging, which hurts search engine indexing. When writing collection rules, however, the collector has to analyze the target page's code anyway, and anyone with a little scripting knowledge can work out the real link address of each page.
Applicable websites: websites that are not highly reliant on search engines, and whose would-be collectors do not understand scripts.
What the collector will do: exactly what it would do anyway; it has to analyze your page code regardless, and working out your paging script along the way does not take much extra time.
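That analysis step can be sketched as a regex over the paging links; the gotoPage() pattern and the URL template are assumptions about one particular target site:

```python
import re

# Hypothetical paging markup using a script call instead of real links.
html = '<a href="javascript:gotoPage(2)">2</a> <a href="javascript:gotoPage(3)">3</a>'

# Pull out the page numbers, then rebuild the real URLs the script would load.
page_numbers = re.findall(r"gotoPage\((\d+)\)", html)
urls = ["https://example.com/list?page=" + n for n in page_numbers]
```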