Python: A large-scale (34k) Zhihu user crawler

Source: Internet
Author: User
Tags: xpath

I learned Python recently and completed most of the exercises in the Python workbook at https://github.com/Show-Me-the-Code/python (my exercise code is on GitHub, and you are welcome to grab it). After seeing @salamer's Python crawler project and liking it quite a bit, I spent 4 days writing a large-scale crawler for Zhihu user information. Because of my personal network conditions it crawled for 12 hours and collected 34k user profiles (in theory it could crawl the whole site, which would take longer and is best run on a server). I then organized the data into intuitive charts (shown at the end of the article).


Well, let's talk about the main technical points:


(1) Use Python's requests module to fetch the HTML pages. Note that you need to supply your own cookies so that the requests look more like they come from a browser.

(2) Use XPath (via lxml) to extract the key information we need from the HTML (name, occupation, place of residence, followers, etc.).

(3) Use Redis as a queue to handle concurrency and large amounts of data (it can also be distributed).

(4) Use BFS (breadth-first search) so that the program keeps expanding outward and continuously discovers new users.

(5) Store the data in a NoSQL database: MongoDB (efficient, lightweight, and supports concurrency).

(6) Use Python's process pool module to increase crawl speed.

(7) Use the csv, pandas, and matplotlib modules for data processing (still needs improvement).


Next, let's analyze each part in detail:


(i) Acquiring the data


We mainly use Python's requests library to fetch the HTML pages. In addition, the cookie in the request header carries our login information, so press F12 in your browser, copy your own cookies, and add them to the program.

Zhihu has a lot of spam ("water army") accounts, so we use a strategy to crawl higher-quality user information: we only grab each user's followers, which relatively effectively filters out spam and throwaway accounts.


# The cookies must be obtained from your own browser
self.header["user-agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0"
self.cookies = {
    "q_c1": "...",
    "l_cap_id": "...",
    "cap_id": "...",
    "d_c0": "...",
    "_za": "...",
    "login": "...",
    "__utma": "...",
    "__utmz": "...",
    "_xsrf": "...",
    "_zap": "...",
    "__utmb": "...",
    "__utmc": "...",
    "l_n_c": "...",
    "z_c0": "...",
    "s-q": "...",
    "s-i": "...",
    "sid": "...",
    "s-t": "...",
    "__utmv": "...",
    "__utmt": "...",
    # The values above were the author's own session cookies and are omitted here;
    # fill them in from your own logged-in browser session.
}
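With the header and cookies in place, fetching a user's profile page with requests looks roughly like this (a minimal sketch; the URL pattern and the method name fetch_user_page are illustrative assumptions, not the project's exact code):

import requests

def fetch_user_page(self, user_token):
    # user_token is the identifier that appears in the profile URL
    url = "https://www.zhihu.com/people/" + user_token      # assumed URL pattern
    response = requests.get(url, headers=self.header, cookies=self.cookies, timeout=10)
    if response.status_code == 200:
        return response.text
    return None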

We use XPath to extract the information we care about from the HTML. Below is a small example of how XPath is used; for the full details, please Baidu :)

def get_xpath_source(self, source):
    if source:
        return source[0]
    else:
        return ''

from lxml import html  # XPath support comes from lxml's html module

tree = html.fromstring(html_text)
self.user_name = self.get_xpath_source(tree.xpath("//a[@class='name']/text()"))
self.user_location = self.get_xpath_source(tree.xpath("//span[@class='location item']/@title"))
self.user_gender = self.get_xpath_source(tree.xpath("//span[@class='item gender']/i/@class"))

(ii) Search and storage


The URL queue used for the search can become very large, so we use Redis to store it. Not only is no data lost when the program exits (rerunning the program resumes the previous search), it also supports concurrency and distributed horizontal scaling.
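A minimal sketch of how such a Redis-backed queue is used with redis-py (the list name 'red_to_spider' matches the search code below; the connection parameters and the seed token are assumptions):

import redis

# One shared connection; the BFS workers pop from this list and push newly found users onto it.
red = redis.StrictRedis(host='localhost', port=6379, db=0)

red.lpush('red_to_spider', 'seed-user-token')   # enqueue a starting user (example token)
next_user = red.rpop('red_to_spider')           # dequeue the next user to crawl; None when empty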

The core of the crawler uses BFS (breadth-first search) to keep expanding; if that is unclear, I am afraid you will need to study the algorithm. Storage is available in two ways: direct output to the console, or storing the data in the MongoDB non-relational database.


# Core module: BFS (breadth-first) search
def bfs_search(option):
    global red
    while True:
        temp = red.rpop('red_to_spider')
        if not temp:               # rpop returns None once the queue is empty
            print('empty')
            break
        result = spider(temp, option)
        result.get_user_data()
    return "ok"
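The snippet above only shows the consuming side of the BFS; the expansion step, pushing each newly discovered user back onto the same Redis list, presumably happens inside get_user_data(). A rough sketch of that step (the helper name, the set name, and the deduplication via a Redis set are assumptions, not the project's exact code):

def push_new_users(follower_tokens):
    # Enqueue every follower found on the current profile page for a later visit,
    # using a Redis set to avoid pushing the same user twice.
    for token in follower_tokens:
        if red.sadd('red_seen_users', token):   # sadd returns 1 only for unseen members
            red.lpush('red_to_spider', token)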

def store_data_to_mongo(self):
    new_profile = Zhihu_user_profile(
        user_name=self.user_name,
        user_be_agreed=self.user_be_agreed,
        user_be_thanked=self.user_be_thanked,
        user_followees=self.user_followees,
        user_followers=self.user_followers,
        user_education_school=self.user_education_school,
        user_education_subject=self.user_education_subject,
        user_employment=self.user_employment,
        user_employment_extra=self.user_employment_extra,
        user_location=self.user_location,
        user_gender=self.user_gender,
        user_info=self.user_info,
        user_intro=self.user_intro,
        user_url=self.url
    )
    new_profile.save()
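The Zhihu_user_profile model is not shown in the post; judging from the keyword arguments and the .save() call, it is presumably a MongoEngine Document. A minimal sketch of what such a model could look like (the field types are assumptions):

from mongoengine import Document, StringField, IntField, connect

connect('zhihu')  # a connection is needed before .save() will work

class Zhihu_user_profile(Document):
    user_name = StringField()
    user_be_agreed = IntField()
    user_be_thanked = IntField()
    user_followees = IntField()
    user_followers = IntField()
    user_education_school = StringField()
    user_education_subject = StringField()
    user_employment = StringField()
    user_employment_extra = StringField()
    user_location = StringField()
    user_gender = StringField()
    user_info = StringField()
    user_intro = StringField()
    user_url = StringField()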

(iii) Increasing efficiency with multiple processes

Because of the GIL (Global Interpreter Lock), multithreading in Python cannot achieve true parallelism, so we use the process pool provided by Python's multiprocessing module. A few things to keep in mind:

In actual tests, when storing the data to MongoDB, multiprocessing did not improve efficiency and was even slower than a single process. My analysis of the reason: the computational part takes very little time and the main bottleneck is disk I/O, i.e., writing to the database. Only one process can write at any given moment, so multiprocessing just adds the overhead of a lot of locking, which leads to the result above.

With direct console output, however, it is much faster. This also shows that multiprocessing does not necessarily improve speed; choose the appropriate model for your situation.

# Using multiple processes. Note: in actual tests there was no obvious speed-up, because the
# bottleneck is the I/O of the database writes; with direct output the speed-up is significant.
from multiprocessing import Pool

res = []
process_pool = Pool(4)
for i in range(4):
    res.append(process_pool.apply_async(bfs_search, (option,)))
process_pool.close()
process_pool.join()
for num in res:
    print(":::", num.get())
print('work had done!')

(iv) Data analysis

Here we use the csv and pandas modules for data analysis; for how to use them, please Google. Below are some charts from my own analysis:
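As a rough illustration of how a chart like the city distribution below could be produced with pandas and matplotlib (a minimal sketch; the CSV file name and column name are assumptions, not taken from the project):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('zhihu_users.csv')                        # assumed CSV export of the crawled profiles
top_cities = df['user_location'].value_counts().head(20)   # number of users per city, top 20
top_cities.plot(kind='bar', figsize=(12, 6), title='Zhihu users by city')
plt.tight_layout()
plt.show()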


Distribution of Zhihu users by city:

First-tier cities top the ranking, especially Beijing. The United States also does quite well.


Distribution of Zhihu users by industry:

Sure enough, programmers are the largest group on Zhihu.


Knowledge of user School distribution:

The schools of the five tigers in Qing bei and east China are mostly, and it seems that the students are of high quality.


Distribution of Zhihu users by occupation:


Lots of big shots: so many founders and CEOs, and of course the big predators: product managers....


Well, that's it for the showcase. Students interested in this project can go to my GitHub to check it out; all of the source code is there.

The data-analysis part is not very professional; I hope more people will help improve this project. My next step is to learn distributed crawling and turn this into a distributed crawler. I hope it is helpful to everyone ~


