Python 2.7: Crawling Jianshu Collection Data with Multiple Processes (1)

Source: Internet
Author: User
Tags: xpath

After a few months of learning Python I wanted some hands-on practice, so I started crawling data from Jianshu, fixing problems as they came up. At first I was unfamiliar with the site structure: the collections page has three navigation tabs, recommended, hot, and city. It turned out that recommended and hot are just different sort orders of the same collections, so their URLs overlap; likewise, each collection's detail page has three tabs (latest comments, latest additions, hot) that repeat articles as well. I adjusted for this: the code below returns the URLs of all collections as tuples, so the next step can open each collection page and parse out more data. Note: the follower count shown on the collection list page is rounded and differs from the value shown after opening the collection; for example, a collection with 10,175 followers is displayed as "10.07k" on the list page, so the exact value will be retrieved from the detail page in the next part.
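The rounded "k"-suffix counts mentioned above can be normalized into integers before inserting them into the database. A minimal sketch (the `parse_count` helper and the exact display format are assumptions based on the note above):

```python
def parse_count(text):
    """Convert a Jianshu-style follower count such as '10.07k' to an int.

    round() avoids float truncation surprises (e.g. 10.07 * 1000 may be
    represented as 10069.999...).
    """
    text = text.strip().lower()
    if text.endswith('k'):
        return int(round(float(text[:-1]) * 1000))
    return int(text)

print(parse_count('10.07k'))  # 10070
print(parse_count('175'))     # 175
```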

Multi-process crawling of Jianshu collection data, written to a MySQL database
    • Crawl every collection URL, collection name, number of included articles, and number of followers from the hot and city pages at http://www.jianshu.com/recommendations/collections
    • Define one function per step:
      • build the list of asynchronously loaded URLs for the city and hot categories
      • fetch and parse a URL
      • extract the data and return it for the database
      • write the data into the database
      • run everything with a process pool
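The first step can be sketched on its own: the collections page loads content asynchronously, so the crawler enumerates the paginated request URLs directly. The page counts here are assumptions (the hot-page count in the source was garbled; ~880 hot collections at roughly 20 per page suggests about 44 pages):

```python
BASE = 'http://www.jianshu.com/recommendations/collections?page=%s&order_by=%s'

def build_urls(city_pages=2, hot_pages=44):
    # Enumerate the asynchronously loaded pages for each category;
    # page counts are assumptions, not confirmed by the source.
    urls = [BASE % (i, 'city') for i in range(1, city_pages + 1)]
    urls += [BASE % (j, 'hot') for j in range(1, hot_pages + 1)]
    return urls

urls = build_urls()
print(len(urls))  # 46
```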
Build table
# create the MySQL table
CREATE TABLE catetable (
    cate_name VARCHAR(255),
    cate_url  VARCHAR(255),
    total_num INT,
    focus     INT,
    KEY cate_name (cate_name),
    KEY cate_url (cate_url)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
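The same schema can be tried out without a MySQL server using Python's built-in sqlite3 module; this is only a self-contained illustration (SQLite has no ENGINE/CHARSET clauses and uses CREATE INDEX instead of the KEY shorthand), and the sample row values are made up:

```python
import sqlite3

# In-memory database just to exercise the schema
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('''CREATE TABLE catetable (
    cate_name TEXT, cate_url TEXT, total_num INTEGER, focus INTEGER)''')
cur.execute('CREATE INDEX idx_cate_name ON catetable (cate_name)')
cur.execute('CREATE INDEX idx_cate_url ON catetable (cate_url)')

# Insert one made-up row the same way the crawler does, with placeholders
cur.execute('INSERT INTO catetable VALUES (?, ?, ?, ?)',
            (u'Programming', 'http://www.jianshu.com/c/example', 1200, 10070))
conn.commit()
print(cur.execute('SELECT COUNT(*) FROM catetable').fetchone()[0])  # 1
```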
Python code
# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import requests
from lxml import etree
import MySQLdb
from multiprocessing import Pool

'''
Build the URL list for every collection page.
http://www.jianshu.com/recommendations/collections loads its content
asynchronously; inspect the hot and city navigation requests and
construct the URL list from them.
'''
def get_cateurls():
    urls = []
    for i in range(1, 3):
        cityurl = 'http://www.jianshu.com/recommendations/collections?page=%s&order_by=city' % i
        urls.append(cityurl)
    # upper bound garbled in the source; ~880 hot collections at ~20 per page suggests 44 pages
    for j in range(1, 45):
        hoturl = 'http://www.jianshu.com/recommendations/collections?page=%s&order_by=hot' % j
        urls.append(hoturl)
    return urls

'''Fetch and parse a page'''
def get_response(url):
    html = requests.get(url).content
    selector = etree.HTML(html)
    return selector

'''
Get the collection data: name, URL, number of included articles and number of
followers, then transpose the four column lists into rows with zip().
The literal strings stripped below are English translations of the site's labels.
'''
def get_catedata(url):
    selector = get_response(url)
    cate_url = map(lambda x: 'http://www.jianshu.com' + x,
                   selector.xpath('//div[@id="list-container"]//div[contains(@class,"count")]/a/@href'))
    cate_name = selector.xpath('//div/h4/a/text()')
    total_num = map(lambda x: int(x.strip('article').strip()),
                    selector.xpath('//div[@id="list-container"]//div[contains(@class,"count")]/a/text()'))
    focus1 = selector.xpath('//div[@id="list-container"]//div[contains(@class,"count")]/text()')
    focus = []
    for i in focus1:
        focus_num = i.split('•')[1].rstrip('people following')
        if 'K' in focus_num:
            focus_num = int(float(focus_num[:-1]) * 1000)
        else:
            focus_num = int(focus_num)
        # print i, focus_num
        focus.append(focus_num)
    data = zip(cate_name, cate_url, total_num, focus)
    return data

'''Write one page of data to the database'''
def insert_into_mysql(url):
    try:
        conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='your password',
                               db='local_db', port=3306, charset='utf8')
        with conn:
            cursor = conn.cursor()
            print u'loading page %s' % url
            data = get_catedata(url)
            for i in data:
                # print i[0], i[1], i[2], i[3]
                cursor.execute('insert into catetable (cate_name, cate_url, total_num, focus) '
                               'values (%s, %s, %s, %s)', (i[0], i[1], i[2], i[3]))
            conn.commit()
            sql = 'select * from catetable'
            count = cursor.execute(sql)
            print u'%s rows stored in total' % count
    except MySQLdb.Error, e:
        print e

'''Read all collection URLs back from the database for the next crawling step'''
def get_allcate_urls_from_mysql():
    try:
        conn = MySQLdb.connect(host='127.0.0.1', user='root', passwd='your password',
                               db='local_db', port=3306, charset='utf8')
        with conn:
            cursor = conn.cursor()
            sql = 'select cate_url from catetable'
            count = cursor.execute(sql)
            print u'%s rows stored' % count
            print u'fetching the collection URLs'
            all_cate_urls = cursor.fetchall()
            return all_cate_urls
    except MySQLdb.Error, e:
        # the source also printed `url` here, but no such variable exists in this scope
        print e

'''Run the crawl with a process pool'''
def get_allcate_urls():
    urls = get_cateurls()
    pool = Pool(processes=4)
    pool.map(insert_into_mysql, urls)
    allcate_urls = get_allcate_urls_from_mysql()
    return allcate_urls

'''First collect the data for all collections; the next part crawls each collection page'''
if __name__ == '__main__':
    allcate_urls = get_allcate_urls()
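The zip() call in get_catedata is the key transposition step: four parallel column lists produced by the XPath queries become one list of per-row tuples that can be fed straight into the INSERT loop. A minimal illustration with made-up values:

```python
# Four parallel column lists, as the XPath queries would produce (values made up)
names = ['a', 'b']
urls = ['http://www.jianshu.com/c/1', 'http://www.jianshu.com/c/2']
totals = [10, 20]
focus = [100, 200]

# zip() transposes columns into rows; list() matters under Python 3,
# where zip() returns an iterator (Python 2 returns a list directly)
rows = list(zip(names, urls, totals, focus))
print(rows[0])  # ('a', 'http://www.jianshu.com/c/1', 10, 100)
```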

  

Query the data table:

The query returns 914 collections in total: 34 city collections and 880 hot collections.

