Getting past anti-crawler measures starts with the HTTP request header information, so the User-Agent needs to be set dynamically.
# -*- coding: utf-8 -*-
"""
Created on April 21, 2017
User agent
@author: DZM
@param encryption level ID: N: no security encryption, I: weak security encryption, U: strong security encryption
@param rendering engine: Gecko, WebKit, KHTML, Presto, Trident, Tasman, and others
@see: http://www.cnblogs.com/junrong624/p/5533655.html
Anti-crawler prevention, starting with the HTTP request header, user-agent in
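The idea of setting the User-Agent dynamically can be sketched as follows. This is a minimal illustration rather than the original author's code: the names `USER_AGENT_POOL` and `build_request` and the URL `http://example.com/` are assumptions made for the example, and the strings in the pool are simply ones quoted elsewhere on this page.

```python
import random
import urllib.request

# Hypothetical pool of User-Agent strings (taken from examples on this page).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
]


def build_request(url, pool=USER_AGENT_POOL):
    """Return a Request whose User-Agent is picked at random from the pool."""
    headers = {"User-Agent": random.choice(pool)}
    return urllib.request.Request(url, headers=headers)


# Building the request does not touch the network; passing it to
# urllib.request.urlopen(req) would actually send it.
req = build_request("http://example.com/")
print(req.get_header("User-agent"))
```

Picking a new string for every request, instead of once per program run, is what makes the setting "dynamic".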
=1ts=0ys=0cs=0lb=1sb=0pb=4mr=1"
data = urllib.request.urlopen(url).read().decode("utf-8")
data2 = json.loads(data)  # restore the string to its original data type
print(data2['data'][0])
ip = str(data2['data'][0]['IP'])
dkou = str(data2['data'][0]['Port'])
zh_ip = ip + ':' + dkou
print(zh_ip)
proxy = urllib.request.ProxyHandler({"https": zh_ip})  # format the IP; note the first parameter: the request target may be HTTP or HTTPS, corresponding s
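As a hedged sketch of the step the snippet above is heading toward, a `ProxyHandler` is normally turned into an opener before it can be used. The helper name `make_proxy_opener` and the address `127.0.0.1:8080` are placeholders invented for this example, not values from the original article.

```python
import urllib.request


def make_proxy_opener(host, port, scheme="https"):
    """Build an opener that routes `scheme` requests through host:port.

    The dictionary key must match the scheme of the target URL:
    use "http" for http:// targets and "https" for https:// targets.
    """
    proxy_address = f"{host}:{port}"
    handler = urllib.request.ProxyHandler({scheme: proxy_address})
    return urllib.request.build_opener(handler), proxy_address


# Placeholder address; no connection is made until opener.open() is called.
opener, address = make_proxy_opener("127.0.0.1", 8080)
print(address)
# opener.open("https://example.com/", timeout=10) would go through the proxy.
```

Returning the opener instead of installing it globally with `urllib.request.install_opener` keeps the proxy choice local, which is convenient when a crawler rotates through several proxies.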
sound_id of the sound module in the album on the page ...
The code is as follows:
import random
import requests
from bs4 import BeautifulSoup
import json
from lxml import etree
import pymongo

clients = pymongo.MongoClient("localhost", 27017)
db = clients["Ximalaya"]
collection_1 = db["album"]
collection_2 = db["detail"]
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "
Are you curious about the User-Agent string that identifies a browser, and why every browser's string contains the word Mozilla?
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
Mozilla/5.0 (Linux; U; Android 4.1.2; zh-tw; GT-I9300 Build/JZO54K) AppleWebKit/534.30 (KHTML
"""
Created on September 25, 2017
@author: kearney
"""
import random

def get_useragents():
    useragents = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
        "Mozilla
,}
user_agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1
additional protection against security vulnerabilities. It allows JavaScript, Java, and other executable content to run only on trusted domains that the user chooses.
Fifth: Download Statusbar, with a daily average of 2.2 million active users. With it, you can view and manage downloads in the status bar without an intrusive download window. The fully customizable interface hides automatically when not in use, so it does not disturb the user.
The fourth and third are
In the beginning there was NCSA Mosaic, and Mosaic called itself NCSA_Mosaic/2.0 (Windows 3.1), and Mosaic displayed pictures along with text, and there was much rejoicing.
And behold, then came a new web browser known as "Mozilla", being short for "Mosaic Killer," but Mosaic was not amused, so the public name was changed to Netscape, and Netscape called itself Mozilla/1.0 (Win3.1), and there was more rejoicing. And Netscape supported frames, and frames
I recommend my IIS log analyzer!
What is User-Agent?
User-Agent: records which browser a request comes from.
User-Agent analysis website: http://www.useragentstring.com/
You can learn about a visitor by parsing the User-Agent.
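To make "parsing the User-Agent" concrete, here is a rough sketch that pulls the `product/version` tokens and the first parenthesized platform comment out of a string. It is deliberately naive and is not how any particular analyzer works; the function name `parse_user_agent` and both regular expressions are assumptions made for illustration.

```python
import re

# product/version pairs such as "Chrome/27.0.1453.94"
PRODUCT_RE = re.compile(r"([A-Za-z][\w.]*)/([\w.]+)")
# parenthesized comments such as "(Windows NT 6.1; WOW64)"
COMMENT_RE = re.compile(r"\(([^)]*)\)")


def parse_user_agent(ua):
    """Split a User-Agent string into product tokens and platform details."""
    products = dict(PRODUCT_RE.findall(ua))
    comments = COMMENT_RE.findall(ua)
    # By convention the first comment usually describes the platform.
    platform = [p.strip() for p in comments[0].split(";")] if comments else []
    return {"products": products, "platform": platform}


ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36")
info = parse_user_agent(ua)
print(info["products"])
print(info["platform"])
```

Because almost every browser claims to be `Mozilla` and many claim `Safari`, a real analyzer has to check the more specific tokens (here `Chrome`) before the generic ones.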
One day of the server's IIS log is extracted for analysis.
Search robots
Search engines like Google and Baidu all have automatic crawler programs that continuously crawl web page information on the Internet. To create their search in
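A rough sketch of how one day's log could be scanned for search-engine crawlers is shown below. It assumes the log uses the W3C extended format, where a `#Fields:` line names the columns and spaces inside the user agent are written as `+`. The log lines are made-up samples, and `BOT_MARKERS`, `count_user_agents`, and `is_bot` are hypothetical names, not part of the analyzer mentioned above.

```python
from collections import Counter

# Hypothetical substrings used to recognize crawler user agents.
BOT_MARKERS = ("googlebot", "baiduspider", "bingbot", "sogou")


def count_user_agents(lines):
    """Count requests per user agent in W3C extended log lines."""
    fields = []
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split(" ")))
        # Turning "+" back into spaces is lossy if the agent had a literal "+".
        ua = row.get("cs(User-Agent)", "-").replace("+", " ")
        counts[ua] += 1
    return counts


def is_bot(ua):
    """Return True if the user agent contains a known crawler marker."""
    lowered = ua.lower()
    return any(marker in lowered for marker in BOT_MARKERS)


# Made-up sample lines for illustration only.
sample = [
    "#Fields: date time cs-method cs-uri-stem c-ip cs(User-Agent) sc-status",
    "2017-04-21 00:00:01 GET /index.html 203.0.113.5 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200",
    "2017-04-21 00:00:02 GET /index.html 203.0.113.9 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+Chrome/27.0.1453.94 200",
    "2017-04-21 00:00:03 GET /about.html 203.0.113.5 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) 200",
]
counts = count_user_agents(sample)
bot_hits = sum(n for ua, n in counts.items() if is_bot(ua))
print(bot_hits, "of", sum(counts.values()), "requests came from crawlers")
```

Matching on the user agent alone only identifies crawlers that announce themselves; a client that spoofs a browser string would be counted as an ordinary visitor.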
1. Firefox
Gecko is the Firefox rendering engine. Gecko was originally developed as part of the general-purpose Mozilla browser, and the first browser to use the Gecko engine was Netscape 6.
We can detect it through the user agent, using the following JS code:
var ua = navigator.userAgent;
console.log(ua);
In Firefox on Windows, this prints the following:
Mozilla/5.0 (W
Compile Firefox yourself, with the Thunderbird method added
Last year, in the blog post "Compile Mozilla Firefox and Thunderbird by yourself", I described how to compile Mozilla's two open-source programs, Firefox and Thunderbird, on the Windows platform. Since the end of last year, however, Mozilla has simplified compilation by integrating all the tools for compiling Firefox and Thunderbird into a unified tool
still in progress
}
// User agents available for random use
private $agents = array(
    'Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; .NET CLR 3.0.20.6.2152; .NET CLR 3.5.30729; .NET CLR 1.1.4322; CBA; InfoPath.2; SE 2.X MetaSr 1.0; AskTB5.6; SE 2.X MetaSr 1.0)',
    'ia_archiver (+http://www.
Google Chrome has everyone excited, but only professional users will notice the "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13" User-Agent string that Chrome sends when it accesses a web page. It looks like gibberish, so what exactly does it represent? Let's take a look.
In the earliest days there was a browser called NCSA Mosaic, which identified itself as NCSA_Mosaic/2.0 (Windows 3.1), wh
use these proxy IPs to crawl websites, websites can in turn use the same proxy IPs to impose restrictions: by crawling these IPs and saving them on the server, a site can block crawlers that crawl through proxy IPs.
Get to the point.
OK, now for the real thing: write a crawler that accesses a site through a proxy IP.
First, get the proxy IPs to crawl with.
def get_proxy_ip():
    headers = {
        'Host': 'www.xicidaili.com',
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
path"""
# ------------------------------ Document processing ------------------------------ #
# Write document
def write(path, text):
    with open(path, 'a', encoding='utf-8') as f:
        f.writelines(text)
        f.write('\n')

# Clear document
def truncatefile(path):
    with open(path, 'w', encoding='utf-8') as f:
        f.truncate()

# Read document
def read(path):
    with open(path, 'r', encoding='utf-8') as f:
        txt = []
        for s in f.readlines():
            tx