Python crawler: saving captured web pages
Select the car theme on the desktop-wallpaper website:
The following two prints are enabled during debugging.
#print tag
#print attrs

#!/usr/bin/env python
import re
import urllib2
import HTMLParser

base = "http://desk.zol.com.cn"
path = '/home/mk/cars/'
star = ''

def get_url(html):
    parser = parse(False)
    request = urllib2.Request(htm
import re -- regular expressions. Frequently used symbols: the dot, question mark, asterisk, and parentheses.

. : matches any character except the line break \n; the dot can be read as a placeholder, one dot matching one character.
* : matches the previous character 0 or more times.
? : matches the previous character 0 or 1 times.
.* : greedy match (consumes as much of the data as possible).
.*? : non-greedy match (consumes as little as possible while still matching).
() : captures the data inside the parentheses as a group.
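The greedy versus non-greedy distinction above is easy to see on a small example (the HTML fragment is invented for illustration):

```python
import re

html = '<img src="a.jpg"><img src="b.jpg">'

# Greedy: .* runs to the LAST closing quote, swallowing both tags.
greedy = re.findall(r'src="(.*)"', html)
# Non-greedy: .*? stops at the first closing quote after each src=.
lazy = re.findall(r'src="(.*?)"', html)

print(greedy)  # ['a.jpg"><img src="b.jpg']
print(lazy)    # ['a.jpg', 'b.jpg']
```

For extracting attribute values from markup, the non-greedy form is almost always the one you want.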
1. Install Beautiful Soup

http://www.crummy.com/software/BeautifulSoup/bs4/download/4.4/

After extracting, go to the root directory and run from the console:

python setup.py install

Result:

Processing dependencies for beautifulsoup4==4.4.0
Finished processing dependencies for beautifulsoup4==4.4.0

Then continue in the console:

pip install beautifulsoup4

Create a new test file test_soup.py:

from bs4 import BeautifulSoup

Run from the console:

python test_soup.py

If no error occurs, the installation succeeded.
>>> files = {'file': open('Report.xls', 'rb')}
>>> r = requests.post(url, files=files)
>>> r.text
{ "files": { "file": "..." }, ... }

You can also explicitly set the file name:

>>> url = 'http://httpbin.org/post'
>>> files = {'file': ('Report.xls', open('Report.xls', 'rb'))}
>>> r = requests.post(url, files=files)
>>> r.text
{ "files": { "file": "..." }, ... }

If you want, you can also send a string as a file:

>>> url = 'http://httpbin.org/post'
#coding:utf-8
import urllib2
import re
import threading

# image download
def loadimg(addr, x, y, artname):
    data = urllib2.urlopen(addr).read()
    f = open(artname.decode("utf-8") + str(y) + '.jpg', 'wb')
    f.write(data)
    f.close()

# parse a specific post page to get the image link addresses,
# then download them with loadimg; artname is the post name
def getimglink(html, x, artname):
    relink = '" alt=".*.jpg"/>'
    cinfo = re.findall(relink, html)
    y = 0
    for lin in cinfo:
        imgaddr = '
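The loadimg/getimglink pair above downloads each image on its own thread. A minimal Python 3 sketch of the same pattern, with the network fetch stubbed out so it runs offline (fake_fetch and save_img are illustrative names, not from the post):

```python
import threading

def fake_fetch(addr):
    # stand-in for urllib2.urlopen(addr).read()
    return b"JPEG-bytes-from-" + addr.encode()

results = {}
lock = threading.Lock()

def save_img(addr, y):
    data = fake_fetch(addr)
    with lock:                      # protect the shared dict
        results[y] = data

threads = []
for y, addr in enumerate(["http://example.com/a.jpg",
                          "http://example.com/b.jpg"]):
    t = threading.Thread(target=save_img, args=(addr, y))
    t.start()
    threads.append(t)
for t in threads:
    t.join()

print(sorted(results))  # [0, 1]
```

Joining every thread before reading `results` is what makes the final state deterministic; the original post relies on the same start/join structure.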
Today I crawled some pictures using requests and BeautifulSoup, which was very fulfilling. The comments may be wrong; corrections are welcome.

import requests
from bs4 import BeautifulSoup

circle = requests.get('http://travel.quanjing.com/tag/12975/%E9%A9%AC%E5%B0%94%E4%BB%A3%E5%A4%AB')
# put the acquired picture addresses into count in turn
count = []
# put the acquired page content into BeautifulSoup
soup = BeautifulSoup(circle.text, 'lxml')
# according to the Google SelectGadget plugin, get the HTML tags, such as get:
['url']))
    return newsdetails

12. Use a for loop to generate multiple page links
13. Batch crawl the text of every news page
14. Use pandas to organize the data

Python for Data Analysis
Originated from R
Table-like format
Provides an efficient, easy-to-use DataFrame that lets users quickly manipulate and analyze data
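A minimal sketch of the table-like DataFrame the bullets describe (the column names and numbers are invented for illustration):

```python
import pandas as pd

# a table-like, R-style data frame
df = pd.DataFrame({
    "title": ["post A", "post B"],
    "clicks": [120, 80],
})

# quick manipulation: filter rows, then add a derived column
popular = df[df["clicks"] > 100]
df["kiloclicks"] = df["clicks"] / 1000

print(popular.shape)  # (1, 2)
```

Filtering with a boolean mask and adding derived columns are the two operations a crawler's post-processing step uses most.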
Save the data to a database

Keep fighting here; the first web crawler i
(2) requests.get(url, params=None, **kwargs)
- url: the URL of the page to get
- params: additional parameters appended to the URL, as a dictionary or byte stream; optional
- **kwargs: 12 parameters controlling the access

(3) Properties of the Response object:
- r.status_code: the return status of the HTTP request; 200 indicates a successful connection, 404 indicates failure
- r.text: the HTTP response content as a string, i.e. the page content at the URL
- r.encoding: the encoding of the response content, guessed from the HTTP h
standard format (encoded), and then passed to the Request object as the data parameter. Examples are as follows:

2.1.3 headers: a dictionary type. The header dictionary can be passed directly to the Request as a parameter, or each key and value can be added by calling the add_header() method. The User-Agent header, which identifies the browser, is often spoofed, because some HTTP services only allow requests that appear to come from common browsers rather tha
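The two header-passing styles described above can be sketched with Python 3's urllib.request (the successor of urllib2); no request is actually sent, and the URL is just a placeholder:

```python
import urllib.request

url = "http://httpbin.org/get"

# pass the whole header dict at construction time...
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})

# ...or add key/value pairs one at a time
req.add_header("Referer", "http://example.com/")

# urllib normalizes header names with .capitalize()
print(req.get_header("User-agent"))  # Mozilla/5.0
```

Setting a browser-like User-Agent this way is exactly the spoofing technique the paragraph mentions.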
Python's first Web Crawler
Recently I wanted to get started with Python. The way to get started with a language is to write a demo, and a Python demo has to be a crawler. This first small crawler is a little simple, so please don't flame me.
1. Create a project

scrapy startproject tutorial

2. Define the item

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

After the parsed data is saved to the item list, it is passed on to the pipeline.

3. Write the first crawler (spider), saved as dmoz_spider.py in the tutorial/spiders directory; the crawler is started by its file name.

import scrapy
Python crawler with the Scrapy framework: manually recognizing login captchas, inverted-text verification codes, and alphanumeric verification codes
Currently, Zhihu uses a verification code with inverted text in a clickable image:
You need to click the inverted text in the figure to log in.
Having nothing to do over the weekend, I wrote a web crawler. First, its function: this is a small program, mainly used to crawl articles, blog posts, and so on from pages. First find the articles you want to crawl, for example Han's Sina blog; go to his article directory and write down the directory link, such as http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html. Each article there has a link, so all we need to do now is
Version: Python 2.7.5. Python 3 differs substantially, so please find another tutorial for it.
So-called web crawling means reading the network resource at a specified URL address out of the network stream and saving it locally. It is similar to using a program to simulate a browser: send the URL to the server as the content of an HTTP request, then read the server's response resource.
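The read-then-save idea above, sketched with Python 3's urllib.request; a data: URL stands in for a real web address so the example needs no network (the file name is invented for illustration):

```python
import urllib.request
import tempfile, os

# a data: URL carries its own content, so urlopen needs no network
url = "data:text/plain;charset=utf-8,hello%20crawler"

data = urllib.request.urlopen(url).read()   # read the response bytes

path = os.path.join(tempfile.gettempdir(), "page.txt")
with open(path, "wb") as f:                 # save to a local file
    f.write(data)

print(data)  # b'hello crawler'
```

Swapping the data: URL for an http:// one gives the real crawler; the read/write structure is unchanged.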
In Python, w
address of the entire page that contains the pictures, and the return value is a list.

import re
import urllib

def gethtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getimg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist

html = gethtml("http://tieba.baidu.com/p/2460150866")
print getimg(html)

Third, save the pictures locally. In contrast to the previous step, the core is to use urllib.urlretrieve
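The getimg step above can be exercised against a literal HTML fragment instead of a live Tieba page (the fragment below is made up; the pattern mirrors the post's reg):

```python
import re

html = ('<img src="http://imgsrc.baidu.com/1.jpg" pic_ext="jpeg">'
        '<img src="http://imgsrc.baidu.com/2.jpg" pic_ext="jpeg">')

# non-greedy capture of everything between src=" and the .jpg suffix
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre, html)

print(imglist)
# ['http://imgsrc.baidu.com/1.jpg', 'http://imgsrc.baidu.com/2.jpg']
```

Anchoring on the trailing `pic_ext` attribute is what restricts the match to the post's content images rather than every image on the page.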
Description: download pictures; the regular expression is for tests only.
Test URL: an Iron Man post introducing Mark's armor.
The following code downloads all the pictures on the first page to the program's root directory.

#!/usr/bin/env python
#!-*-coding:utf-8-*-
import urllib, urllib2
import re

# return the page source code
def gethtml(url):
    html = urllib2.urlopen(url)
    srccode = html.read()
    return srccode

def getimg(srcco
characteristics of the model's personal homepage, which visually looks like this:
Analyzing the source, we find that the model's image addresses can be obtained:
html = urlopen(personurl)
bs = BeautifulSoup(html.read().decode('gbk'), "html.parser")
contents = bs.find("div", {"class": "mm-aixiu-content"})
imgs = contents.findAll("img", {"src": re.compile(r'//img\.alicdn\.com/.*\.jpg')})
So we can get the pictures from the model's personal homepage.
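If BeautifulSoup is unavailable, the same filtering can be sketched with only the standard library's html.parser; the class name and image URLs below mirror the snippet above, and ImgCollector is an illustrative name:

```python
import re
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.imgs = []

    def handle_starttag(self, tag, attrs):
        # keep only <img> tags whose src matches the Alicdn .jpg pattern
        if tag == "img":
            src = dict(attrs).get("src", "")
            if re.match(r"//img\.alicdn\.com/.*\.jpg", src):
                self.imgs.append(src)

p = ImgCollector()
p.feed('<div class="mm-aixiu-content">'
       '<img src="//img.alicdn.com/a.jpg">'
       '<img src="//other.com/b.jpg"></div>')
print(p.imgs)  # ['//img.alicdn.com/a.jpg']
```

BeautifulSoup's `findAll` with an attribute regex does the same tag-plus-attribute filtering in one call.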
#coding=utf-8
import urllib
import urllib2

# URL address
url = 'https://www.baidu.com/s'
# parameters
values = {'ie': 'UTF-8', 'wd': 'test'}
# encapsulate the parameters
data = urllib.urlencode(values)

# assemble the full URL
# req = urllib2.Request(url, data)
url = url + '?' + data

# access the full URL
# response = urllib2.urlopen(req)
response = urllib2.urlopen(url)
html = response.read()
print html

Running it again shows that HTTPS gets redirected, so HTTP must be used instead.

#coding=utf-8
import urllib
import u
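The same parameter encapsulation in Python 3, where urlencode has moved into urllib.parse (a sketch only; no request is sent):

```python
from urllib.parse import urlencode

url = "https://www.baidu.com/s"
values = {"ie": "UTF-8", "wd": "test"}

data = urlencode(values)          # 'ie=UTF-8&wd=test'
full_url = url + "?" + data

print(full_url)  # https://www.baidu.com/s?ie=UTF-8&wd=test
```

urlencode also percent-escapes unsafe characters, so it is safer than concatenating key=value pairs by hand.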