First I had to decide which joke site to crawl. I simply searched Baidu for jokes and picked the first result, http://www.jokeji.cn/. Once I opened it, the page structure looked fairly complex to me, since I am still a beginner; the first step is to study how the site organizes its resources, find the patterns, and then crawl the resources in a targeted way. On this site I found a classification box in the left sidebar holding a joke encyclopedia, which seems to cover essentially all of the site's text jokes. Under it there are many sub-categories, and each sub-category leads to a paginated list of all the jokes in that category; every list page links to pages that each contain several jokes, and those joke pages are what we ultimately crawl and store.
Crawl idea: fetch http://www.jokeji.cn/ to get its HTML page, parse that HTML to extract the URL of every category in the classification box, then traverse the category URLs. For each category, fetch its HTML page and parse it to get the list of joke pages, traverse the joke-page URLs to fetch the final HTML pages that contain the jokes, extract the jokes, and store them in a MySQL database.
For the crawl I used the Google Chrome browser: pressing F12 on the site shows the page source, and from there I analyzed the structure of the joke-encyclopedia classification box so that Python regular expressions can pull out the information we want.
The general approach to filtering content is to pick an element just above the target content that is easy to identify uniquely. Here I chose the <div class="joketype l_left"></div> element: this div contains the whole classification block, and a second round of filtering then extracts all of the category URLs from it. Looking at the category sub-pages, I found that their URLs follow a rule: each category starts at /listXX_1.htm, the category page has a "last page" button whose URL is exactly the final page, and the number in front of .htm simply increments, so it is easy to crawl every list page of jokes. A short sketch of this two-step filtering follows; extracting the jokes themselves works the same way, so I will not repeat it and will go straight to the full code afterwards.
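To make the idea concrete, here is a minimal standalone sketch of the two-step regex filtering (the HTML snippet and category names are made up for illustration; the patterns mirror the ones used in spider.py below but are not guaranteed to match the live site):

import re

# Illustrative markup only: a cut-down version of the classification box described above.
html = '''
<div class="joketype l_left">
  <a href="/list29_1.htm">category one</a>
  <a href="/list7_1.htm">category two</a>
</div>
'''

# Step 1: isolate the classification box; step 2: pull every category href out of it.
box = re.findall(r'<div class="joketype l_left">([\s\S]*?)</div>', html)
category_urls = re.findall(r'<a href="([\s\S]*?)">', str(box))
print(category_urls)  # ['/list29_1.htm', '/list7_1.htm']

# On a category page the number in front of .htm just increments (list29_1.htm,
# list29_2.htm, ...), and the "last page" link gives the upper bound; see
# __analysis_joke in spider.py below for how that number is read.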
The MySQL database storage module code; the file name is mysql.py.
import pymysql

def insert(joke):
    # get a connection
    conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='python')
    cur = conn.cursor()
    # SQL statement; %s is the placeholder (always %s, whatever the data type), which prevents SQL injection
    sql = 'insert into joke(joke) values(%s)'
    # params = ('eric', 'wuhan')          # parameters for inserting a single row
    # li = [('a1', 'b1'), ('a2', 'b2')]   # parameters for a batch insert
    # recount = cur.execute(sql, params)
    recount = cur.executemany(sql, joke)  # batch insert
    conn.commit()                         # commit; one commit is enough even for many statements
    cur.close()
    conn.close()

def get_one():
    # get a connection
    conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456', db='python')
    cur = conn.cursor()
    sql1 = 'select number from number'
    recount = cur.execute(sql1)
    number = cur.fetchone()[0] + 1        # id of the next unread joke
    sql2 = 'select joke from joke where id=%s'
    recount = cur.execute(sql2, number)
    joke = cur.fetchone()[0]
    sql3 = 'update number set number=%s where number=%s'
    params = (number, number - 1)
    recount = cur.execute(sql3, params)   # advance the bookmark
    conn.commit()
    cur.close()
    conn.close()
    return joke                           # hand the joke back to the caller
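A quick sketch of how these two functions are meant to be used (it assumes the database and tables described in the note below already exist; the joke strings are placeholders):

import mysql

# store a batch of jokes: insert() hands the whole sequence to executemany()
mysql.insert(('first joke text', 'second joke text'))

# read the next unread joke and advance the bookmark kept in the number table
print(mysql.get_one())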
Note: the pymysql module needs to be installed. Python installations now generally ship with the pip package manager, so if the module is missing just run pip install pymysql on the command line. You also need to create a database named python in MySQL containing a table joke with two columns: an id column, which I set as an auto-increment int primary key, and a joke column of type text that stores the joke. The insert function takes a tuple of strings, each string representing one joke, and inserts every joke in the tuple into the database. The get_one function fetches a single joke; so that it never reads duplicate jokes, I added a separate table that records how far the reading has progressed. That table is named number, it has a single field, number, and I inserted one record with the value 0.
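For reference, a one-off setup sketch that creates the database and tables described above (connection parameters mirror mysql.py; run it once and adjust names or types to taste):

import pymysql

# one-off setup: create the python database, the joke table and the number bookmark table
conn = pymysql.connect(host='127.0.0.1', user='root', passwd='123456')
cur = conn.cursor()
cur.execute('CREATE DATABASE IF NOT EXISTS python')
cur.execute('USE python')
# id: auto-increment primary key; joke: the joke text itself
cur.execute('CREATE TABLE IF NOT EXISTS joke ('
            'id INT AUTO_INCREMENT PRIMARY KEY, '
            'joke TEXT)')
# single-row bookmark used by get_one(); start it at 0
cur.execute('CREATE TABLE IF NOT EXISTS number (number INT)')
cur.execute('INSERT INTO number (number) VALUES (0)')
conn.commit()
cur.close()
conn.close()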
The crawler code; the file name is spider.py.
import urllib
from urllib import request
import re
import chardet
import mysql
import time

'''
Find the joke list at http://www.jokeji.cn/list29_1.htm, fetch the first joke page
http://www.jokeji.cn/jokehtml/bxnn/2018080417455037.htm, traverse all the joke pages on it,
then fetch the next joke list and repeat until this category has been traversed completely,
then move on to the next category.
'''

class Spider():
    url = 'http://www.jokeji.cn/jokehtml/bxnn/2018073023294432.htm'
    header = {
        'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Mobile Safari/537.36',
        'accept-encoding': '',
        'Referer': 'http://www.jokeji.cn/list29_1.htm'
    }
    classifys = []  # list that stores the joke category URLs

    # fetch the HTML text
    def __fetch_content(self):
        cookie_html = request.Request(Spider.url, headers=Spider.header)
        cookie_html = request.urlopen(cookie_html)
        htmls = cookie_html.read()
        # use chardet to detect the encoding of the HTML
        encoding = chardet.detect(htmls)
        encoding = encoding['encoding']
        # without ignore there are always character conversion errors, which means there are
        # illegal characters; ignore must be added to skip them
        htmls = htmls.decode(encoding, 'ignore')
        time.sleep(0.1)
        return htmls

    # get the joke category URLs
    def __analysis_classify(self):
        Spider.url = 'http://www.jokeji.cn/'
        Spider.header['Referer'] = 'http://www.jokeji.cn/'
        htmls = self.__fetch_content()
        root_pattern = r'<div class="joketype l_left">([\s\S]*?)</div>'
        classify_pattern = r'<a href="([\s\S]*?)">'
        root_html = re.findall(root_pattern, htmls)
        Spider.classifys = re.findall(classify_pattern, str(root_html))

    # get all the jokes on the current page and return them
    def __analysis(self, htmls):
        anchors = []
        root_pattern = r'<span id="text110">([\s\S]*?)</span>'
        juck_pattern = r'<P>\d、([\s\S]*?)</P>'
        root_html = re.findall(root_pattern, htmls)
        for html in root_html:
            jucks = re.findall(juck_pattern, html)
            # replace line breaks
            c = re.compile(r'<[b|B][r|R]\s*/*>')
            for i in jucks:
                i = c.sub('\n', i)
                anchors.append(i)
                # i = i.replace('<BR>', '\n')
            return anchors
        return anchors

    # crawl the original jokes (yuanchuangxiaohua) category
    def __analysis_yuanchuangxiaohua(self, url):
        url = 'http://www.jokeji.cn' + url
        pass

    # crawl the jokes under each category
    def __analysis_joke(self):
        Spider.header['Referer'] = 'http://www.jokeji.cn/'
        root_pattern = r'<div class="list_title">([\s\S]*?)</div>'
        page_pattern = r'<a href="([\s\S]*?)"\s*target="_blank"\s*>'
        for classify in Spider.classifys:
            try:
                if '/yuanchuangxiaohua' in classify:
                    self.__analysis_yuanchuangxiaohua(classify)
                classify = 'http://www.jokeji.cn' + classify
                Spider.url = classify
                htmls = self.__fetch_content()
                # record the maximum number of list pages in this category,
                # taken from the "last page" (尾页) link
                max_number = int(re.findall(r'[\s\S]*?(\d*?)\.htm">尾页</a>', htmls)[0])
                # start walking through every list page; each list page links to many
                # joke pages, and the jokes themselves are on the joke pages
            except BaseException:
                continue
            for bigpage_number in range(1, max_number + 1):
                try:
                    bigpage_url = classify
                    temp = re.compile(r'(\d*)\.htm')
                    # change the URL: the list pages of a category are regular, the number
                    # in front of .htm just runs from 1 up to the maximum page number
                    bigpage_url = temp.sub(str(bigpage_number) + '.htm', bigpage_url)
                    # replace the URL
                    Spider.url = bigpage_url
                    htmls = self.__fetch_content()
                    root_html = re.findall(root_pattern, htmls)
                    # get the URLs of the joke pages inside this list page
                    pages = re.findall(page_pattern, root_html[0])
                    # traverse all the joke page URLs and get the jokes
                except BaseException:
                    continue
                for page in pages:
                    try:
                        Spider.url = 'http://www.jokeji.cn' + page
                        htmls = self.__fetch_content()
                        tuples = self.__analysis(htmls)
                        # insert into the database
                        mysql.insert(tuples)
                    except BaseException:
                        continue

    def go(self):
        self.__analysis_classify()
        self.__analysis_joke()

spider = Spider()
spider.go()
Because the joke pages under the original jokes (yuanchuangxiaohua) category have a different format, that part is not implemented, which is why the corresponding method is only a placeholder. I am a beginner and this is my first blog post; I downloaded Open Live Writer for blogging but found it hard to use, so writing directly on the web page turned out better after all.