Using a multi-threaded crawler to extract email addresses and mobile phone numbers from Baidu Tieba posts

Source: Internet
Author: User
Tags: mathematical functions, MySQL, host
This crawler crawls the content of posts in Baidu Tieba (Baidu's forum), parses each post, and extracts any mobile phone numbers and email addresses it finds. The main workflow is explained in detail in the code comments.

Test environment:

The code was tested on Windows 7 64-bit with Python 2.7 64-bit (with the MySQLdb extension installed), and on CentOS 6.5 with Python 2.7 (also with the MySQLdb extension).

Environment Preparation:

As the saying goes, to do a good job you must first sharpen your tools. As you can see, my environment is Windows 7 + PyCharm, and the Python environment is Python 2.7 64-bit. This is a development setup well suited to beginners. Next, I recommend installing easy_install; as the name suggests, it is an installer used to install extension packages. For example, Python cannot operate a MySQL database out of the box: we have to install the MySQLdb package so that Python can talk to MySQL, and with easy_install a single command is enough to install the MySQLdb extension quickly, much like Composer in PHP, yum in CentOS, or apt-get in Ubuntu.

The related tools can be found on GitHub at Cw1997/python-tools. To install easy_install, just run the corresponding .py script from the Python command line and wait a moment; it will automatically add itself to the Windows environment variables. If typing easy_install on the Windows command line produces output, the installation was successful.

Details of the environment selection:

As for hardware, faster is of course better, with at least 8 GB of memory to start, because the crawler itself needs to store and analyze a large amount of intermediate data. This is especially true for a multi-threaded crawler: when crawling a paginated list plus detail pages, and using a queue to distribute crawl tasks over a large amount of data, memory usage adds up quickly. Sometimes the data we crawl is JSON, and if it is stored in a NoSQL database such as MongoDB, that also consumes memory.

A wired network connection is recommended, because some low-end wireless routers and ordinary consumer wireless network cards suffer intermittent disconnections or packet loss under sustained heavy traffic.

As for the operating system and Python, definitely choose 64-bit. A 32-bit operating system cannot use large amounts of memory, and with a 32-bit Python you may not notice a problem when crawling on a small scale; but once the data grows, for example a list, queue, or dictionary holding a lot of data, Python will report a memory overflow error when its memory usage exceeds 2 GB.

If you plan to store the data in MySQL, it is recommended to use MySQL 5.7.8 or later, because those versions support a native JSON data type, which lets you drop MongoDB altogether.

Python now has a 3.x branch, so why is Python 2.7 used here? I chose 2.7 because the core Python book I bought a long while ago is the second edition and still uses 2.7 for its samples, and a large number of online tutorials are likewise written for 2.7. In some respects 2.7 and 3.x differ quite a bit, and if we never learned 2.7, small syntactic differences could cause misunderstandings or leave us unable to follow demo code. There are also some dependency packages that are only compatible with 2.7. My advice: if you are preparing to learn Python and then work at a company with no legacy code to maintain, you can consider going straight to 3.x; but if your time is limited and there is no experienced mentor to guide you systematically, so that you can only rely on scattered blog posts, then learn 2.7 first and move to 3.x afterwards. Once you know 2.7, 3.x is quick to pick up.

Knowledge points involved in a multi-threaded crawler:

In fact, for any software project, if we want to know what knowledge is needed to write it, we can look at which packages are imported in the project's main entry file.

Now look at our project. For someone who has just started with Python, there may be packages here that they have hardly ever used, so in this section I will briefly explain what these packages do, which knowledge points they involve, and what the key search terms are. This article does not have the space to start from the basics, so learn to use a search engine and study these topics by their keywords. Below is a walkthrough of these knowledge points.

HTTP protocol:

What our crawler does when it fetches data is essentially send HTTP requests non-stop, receive HTTP responses, and store them on our computer. Understanding the HTTP protocol helps us precisely control parameters that can speed up crawling, such as keep-alive.
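As a minimal sketch (Python 2.7; the URL is hypothetical and this is not the crawler's actual code), a single fetch with urllib2, sending an explicit Connection: keep-alive header, might look like this:

```python
# -*- coding: utf-8 -*-
import urllib2

# Hypothetical target URL; replace with the real Tieba page to fetch.
url = "http://tieba.baidu.com/p/123456"

# Ask the server to keep the TCP connection open for follow-up requests.
request = urllib2.Request(url, headers={"Connection": "keep-alive",
                                        "User-Agent": "Mozilla/5.0"})
response = urllib2.urlopen(request, timeout=10)
html = response.read()                      # raw bytes of the response body
print "fetched %d bytes" % len(html)
```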

Threading module (multithreading):

The programs we normally write are single-threaded: the code we write runs in the main thread, and the main thread runs inside the Python process.

Multithreading in Python is implemented through a module named threading. Python used to have a lower-level thread module, but threading gives us finer control over threads, so we now use threading for multithreaded programming.

To put it simply, writing a multithreaded program with the threading module means first defining a class of your own that inherits from threading.Thread, and writing each thread's work code in the class's run method. If the thread needs to do some initialization when it is created, put that code in its __init__ method, which plays the same role as a constructor in PHP or Java.
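A minimal sketch of that pattern (Python 2.7; the class name and attribute are made up for illustration):

```python
# -*- coding: utf-8 -*-
import threading

class CrawlerThread(threading.Thread):
    """Hypothetical worker thread; the name and attributes are illustrative only."""

    def __init__(self, thread_id):
        # Initialization work goes here, like a constructor in PHP or Java.
        threading.Thread.__init__(self)
        self.thread_id = thread_id

    def run(self):
        # The thread's actual work code lives in run(); it executes
        # only after start() is called on the instance.
        print "thread %d is working" % self.thread_id
```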

One more point worth adding here is the concept of thread safety. Normally, in a single-threaded program only one thread at a time operates on a resource (a file, a variable), so there is no possibility of conflict. In a multithreaded program, however, two threads may operate on the same resource at the same moment and corrupt it, so we need a mechanism to prevent the damage caused by such conflicts, usually locks of one kind or another. For example, MySQL's InnoDB table engine has row-level locks, and file operations have read locks; these are handled for us at a lower level. So in practice we only need to know which operations, or which programs, already deal with thread-safety issues, and then we can use them in multithreaded code. A program that takes care of thread safety is usually called a "thread-safe version"; PHP, for instance, ships a TS (thread-safe) build. The Queue module we are about to discuss is a thread-safe queue data structure, so we can use it in multithreaded programming with confidence.
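For cases where we do have to share a plain variable between threads, the threading module provides a lock. A small sketch (the shared counter here is hypothetical, purely to show the idea):

```python
# -*- coding: utf-8 -*-
import threading

counter = 0
counter_lock = threading.Lock()        # protects the shared counter

def add_one():
    global counter
    # Only one thread at a time may enter this block.
    with counter_lock:
        counter += 1

threads = [threading.Thread(target=add_one) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print counter                          # always 10; the lock prevents lost updates
```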

Finally, let's talk about the key concept of thread blocking. Once we have studied the threading module in detail, we know how to create a thread and start it with the start method. But if we create a thread and call start, we may find that the whole program appears to end immediately. What is going on? The code in the main thread is only responsible for starting the child threads; the child threads' work code is written in a method of a class and is not executed by the main thread itself. So once the main thread has started the child threads, its own job is finished and it exits. And once the main thread exits, the Python process ends, leaving the other threads with no memory space to keep running. What we want is for the main thread to wait until every child thread has finished before exiting. So which method of the thread object can hold up the main thread? time.sleep? That can technically work, but how long should the main thread sleep? We do not know in advance how long a task will take, so that approach is out. What we should search for instead is how a child thread can "hold up" the main thread. "Hold up" sounds crude; the professional term is "block". Searching for "Python child thread blocking the main thread" should lead you to a method called join(). Yes, join() is the method that blocks the main thread on a child thread: while that child thread has not finished executing, the main thread stops at the line containing join(); by calling join() on every child thread, the code after those calls runs only after all the threads have finished.
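Putting start() and join() together in a self-contained sketch (Python 2.7; the Worker class and its one-second sleep stand in for real crawl work):

```python
# -*- coding: utf-8 -*-
import threading
import time

class Worker(threading.Thread):
    def run(self):
        time.sleep(1)                     # stand-in for real crawl work
        print "%s finished" % self.name

threads = [Worker() for _ in range(10)]
for t in threads:
    t.start()                             # child threads begin working
for t in threads:
    t.join()                              # main thread blocks here until t ends

print "all child threads finished; the main thread can exit now"
```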

Queue module (queues):

Suppose we have a scenario like this: we need to crawl someone's blog, and we know the blog has two kinds of pages, a list.php page that shows links to all of the blog's articles, and a view.php page that shows the full content of a single article.

If we want to grab every article on this blog, the single-threaded approach is: use a regular expression to extract the href attribute of every a tag on the list.php page and store them in an array named article_list (in Python this is not called an array but a list), then use a for loop to iterate over article_list, fetch each article's content with whatever page-fetching functions we have, and save it to the database.
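A sketch of that single-threaded idea (Python 2.7; the list.php URL, the regex, and the save-to-database step are simplified placeholders, not the article's real code):

```python
# -*- coding: utf-8 -*-
import re
import urllib2

list_url = "http://example.com/list.php"           # hypothetical listing page
list_html = urllib2.urlopen(list_url, timeout=10).read()

# Grab the href attribute of every <a> tag; real pages may need a stricter pattern.
article_list = re.findall(r'<a[^>]+href="([^"]+view\.php[^"]*)"', list_html)

for article_url in article_list:
    article_html = urllib2.urlopen(article_url, timeout=10).read()
    # ... extract the article body and save it to the database here ...
    print "fetched", article_url
```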

If we want to write a multithreaded crawler to do this job, assuming the program uses 10 threads, we have to find a way to split the article_list we crawled earlier into 10 equal parts and assign each part to one of the child threads.

But there is a problem: if the length of article_list is not a multiple of 10, that is, the number of articles is not an integer multiple of 10, the last thread will be assigned fewer tasks than the others and will finish sooner.

If we are merely crawling a blog of thousand-word posts, this does not seem to matter. But if one of the tasks (not necessarily page crawling; it might be a mathematical calculation, or something time-consuming such as graphics rendering) runs for a long time, this becomes a huge waste of resources and time. The whole point of multithreading is to make full use of all computing resources and the available time, so we need a more scientific and reasonable way to distribute tasks.

There is another situation to consider: when the number of articles is very large, we want not only to crawl the content quickly, but also to see what has been crawled as soon as possible. This requirement comes up often in CMS content-aggregation sites.

For example, suppose the target blog we want to crawl has tens of millions of articles. A blog this size usually paginates, so if we follow the traditional approach above and crawl every page of list.php first, it would take at least several hours, possibly days. If the boss wants the crawled content displayed on our CMS aggregation site as soon as possible, we have to crawl list.php and, at the same time, throw the URLs already found into an article_list array, while another thread pulls URLs out of article_list, goes to the corresponding addresses, and extracts the post content with regular expressions. How do we implement this?

We need to start two kinds of threads at the same time: one kind dedicated to fetching the URLs in list.php and putting them into the article_list array, and the other kind dedicated to taking URLs out of article_list and fetching the corresponding blog content from the matching view.php page.

But remember the concept of thread safety mentioned earlier? The first kind of thread writes data into article_list, while the other kind reads data from article_list and deletes what has already been read. The list in Python, however, is not a thread-safe data structure, so doing this can cause unpredictable errors. Instead we can use a more convenient, thread-safe data structure: the queue mentioned in this section's heading.

Queue also has a join() method, which serves the same purpose as the threading join() in the previous section, except that a Queue's join() blocks as long as there are unfinished tasks in the queue, and only when every task has been processed does the code after join() continue. In this crawler I used this method to block the main thread instead of blocking it directly through threading's join. The benefit is that we do not have to write a busy loop to check whether there are still unfinished tasks in the task queue, which makes the program run more efficiently and the code more elegant.
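A minimal producer/consumer sketch with Queue (Python 2.7; the URLs and worker logic are placeholders, not the crawler's real code):

```python
# -*- coding: utf-8 -*-
import threading
from Queue import Queue                # note: lowercase "queue" in Python 3.x

task_queue = Queue()                   # thread-safe; no manual locking needed

def list_worker():
    # Producer: would normally parse list.php pages; here we fake 30 URLs.
    for i in range(30):
        task_queue.put("http://example.com/view.php?id=%d" % i)

def detail_worker():
    # Consumer: take URLs off the queue and (pretend to) crawl them.
    while True:
        url = task_queue.get()
        print "crawling", url
        task_queue.task_done()         # tell the queue this task is finished

producer = threading.Thread(target=list_worker)
producer.start()

for _ in range(5):                     # 5 consumer threads
    t = threading.Thread(target=detail_worker)
    t.daemon = True                    # let the process exit once the queue drains
    t.start()

producer.join()                        # wait until all URLs have been queued
task_queue.join()                      # then block until every task_done() call
print "all queued tasks processed"
```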

One more detail: in Python 2.7 the queue module is named Queue, while in Python 3.x it has been renamed to queue; the difference is just the case of the first letter. If you copy code from the Internet, keep this small difference in mind.
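A common compatibility idiom (shown here as a small sketch) handles both names:

```python
# Import the thread-safe queue class under either Python 2.7 or 3.x.
try:
    from Queue import Queue            # Python 2.7
except ImportError:
    from queue import Queue            # Python 3.x
```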

Getopt module:

If you have learned C, you should be familiar with this kind of module: it reads arguments from the command line. For example, we usually operate a MySQL database from the command line by typing mysql -h127.0.0.1 -uroot -p, where the "-h127.0.0.1 -uroot -p" part after mysql is the argument section that can be parsed.

When we write a crawler, some parameters need to be entered manually by the user, such as the MySQL host IP, user name, and password. To make the program friendlier and more general, some configuration items should not be hard-coded but passed in dynamically when we run it; the getopt module lets us implement exactly that.
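A sketch of parsing those MySQL options with getopt (the option letters and the crawler.py name are illustrative assumptions, not the crawler's actual interface):

```python
# -*- coding: utf-8 -*-
import getopt
import sys

# Hypothetical usage: python crawler.py -h 127.0.0.1 -u root -p secret
opts, args = getopt.getopt(sys.argv[1:], "h:u:p:")

mysql_host = mysql_user = mysql_password = None
for opt, value in opts:
    if opt == "-h":
        mysql_host = value
    elif opt == "-u":
        mysql_user = value
    elif opt == "-p":
        mysql_password = value

print "connecting to %s as %s" % (mysql_host, mysql_user)
```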

Hashlib (hash):

A hash is essentially a set of mathematical algorithms. Its characteristic is that given an input, it outputs another result which, although short, can be regarded as practically unique. For example, MD5 and SHA-1, which we have all heard of, are hashing algorithms. After a series of mathematical operations they can turn a file or a piece of text into a short string of digits and letters, less than 100 characters long.

The hashlib module in Python wraps these mathematical functions for us; we simply call it to perform the hashing.
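For instance, computing an MD5 digest of a string (a minimal sketch; the input value is made up):

```python
# -*- coding: utf-8 -*-
import hashlib

text = "13800138000"                       # hypothetical value to hash
digest = hashlib.md5(text).hexdigest()     # 32-character hex string
print digest
```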

Why does my crawler need this package? Because some interface requests require a check code to be sent along, so the server can verify that the request data has not been tampered with or lost. These check codes are generally produced by hash algorithms, so we need this module to generate them.

Json:

Much of the time, the data we crawl is not HTML but JSON. JSON is essentially a string of key-value pairs, and if we need to extract a particular field from it, we use the json module to convert the JSON string into a Python dict, which is easier to work with.
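A small sketch (the JSON payload here is made up):

```python
# -*- coding: utf-8 -*-
import json

raw = '{"author": "user123", "content": "contact me at test@example.com"}'
data = json.loads(raw)            # JSON string -> Python dict
print data["content"]             # access fields by key
```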

Re (regular expression):

Sometimes we crawl a page but only need content in a specific format. An email address, for example, is usually some letters or digits, followed by an @ symbol, followed by a domain name. To describe such a format in a way a computer understands, we use a regular expression, and let the computer automatically match text in that format out of a long string.
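A sketch of the kind of patterns this crawler relies on (the regexes are simplified illustrations; real email addresses and Chinese mobile numbers may need stricter patterns):

```python
# -*- coding: utf-8 -*-
import re

html = "contact: test@example.com or call 13800138000"   # hypothetical post text

emails = re.findall(r'[\w.\-]+@[\w\-]+(?:\.[\w\-]+)+', html)
phones = re.findall(r'1[3-9]\d{9}', html)    # rough pattern for CN mobile numbers

print emails, phones
```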

Sys:

This module mainly handles system-level matters; in this crawler I use it to solve an output-encoding problem.
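The article does not show which call it uses; one common (if hacky) Python 2.7 pattern for fixing output encoding looks like this, offered here only as an assumption about what "solving the encoding problem" might involve:

```python
# -*- coding: utf-8 -*-
import sys

# Python 2.7 workaround: site.py removes setdefaultencoding at startup,
# so it has to be restored with reload() before it can be called.
reload(sys)
sys.setdefaultencoding("utf-8")
```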

Time:

Anyone with a little English can guess that this module handles time. In this crawler I use it to get the current timestamp: at the end of the main thread, subtracting the timestamp at which the program started from the current timestamp gives the program's running time.
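In sketch form:

```python
# -*- coding: utf-8 -*-
import time

start_time = time.time()          # timestamp when the program starts
# ... crawl work happens here ...
elapsed = time.time() - start_time
print "finished in %.2f seconds" % elapsed
```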

As a reference point: with 50 threads, fetching 100 list pages (30 posts per page, i.e. roughly 3,000 posts), crawling each post's content, and extracting the phone numbers and email addresses took 330 seconds in total.

Urllib and Urllib2:

These two modules handle HTTP requests and URL formatting. The core code of the crawler's HTTP request section uses these modules.

MySQLdb:

This is a third-party module used to operate MySQL databases from Python.

One detail deserves attention here: the MySQLdb module is not thread-safe, meaning we cannot share the same MySQL connection handle across multiple threads. As you can see in my code, I pass a new MySQL connection handle into each thread's constructor, so each child thread only ever uses its own independent MySQL connection handle.
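A sketch of that arrangement (the table, column, and credentials are invented for illustration and are not the article's actual schema):

```python
# -*- coding: utf-8 -*-
import threading
import MySQLdb

class SaveThread(threading.Thread):
    """Each worker receives its own connection handle instead of sharing one."""

    def __init__(self, conn):
        threading.Thread.__init__(self)
        self.conn = conn                       # private to this thread

    def run(self):
        cursor = self.conn.cursor()
        # Hypothetical table/column; real code would insert the extracted data.
        cursor.execute("INSERT INTO contacts (phone) VALUES (%s)", ("13800138000",))
        self.conn.commit()
        cursor.close()

for _ in range(5):
    # A fresh connection per thread avoids MySQLdb's thread-safety limitation.
    conn = MySQLdb.connect(host="127.0.0.1", user="root",
                           passwd="secret", db="crawler", charset="utf8")
    SaveThread(conn).start()
```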

Cmd_color_printers:

This is also a third-party module; the relevant code can be found online. It is mainly used to print colored strings to the command line. For example, when the crawler hits a bug, printing the message in red makes it much more conspicuous, and that is where this module comes in.

Error handling for automated crawlers:

If you use the crawler in an environment where the network quality is not very good, you will find that it sometimes throws exceptions, because I did not write exception handling for every case.

Generally speaking, if we want to write a highly automated crawler, we need to anticipate all of the exceptions the crawler might run into and handle them.

For example, when an exception occurs we should put the task that was being worked on back into the task queue, otherwise we would lose that piece of information. This is also one of the tricky parts of writing a crawler.
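A sketch of that re-queue-on-failure idea (assuming a task_queue like the one in the Queue section above; the parsing step is stubbed out):

```python
# -*- coding: utf-8 -*-
import urllib2

def crawl_one(task_queue):
    url = task_queue.get()
    try:
        html = urllib2.urlopen(url, timeout=10).read()
        # ... parse html and extract phone numbers / emails here ...
    except (urllib2.URLError, IOError):
        # Network hiccup: put the task back so another attempt can pick it up.
        task_queue.put(url)
    finally:
        task_queue.task_done()
```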
