1. Development environment
Operating system: Windows 10
Python version: Python 3.5.2
MySQL: 5.5.53
2. Modules used
If a module is not installed, use pip to install it: pip install <module name> (replace <module name> with the module you need, e.g. pymysql).
3. Analyzing the links (site: https://www.cnblogs.com/)
Here we briefly analyze the home page section.
Looking at the page URLs, the only thing that varies is the trailing page number, so the link can be written with that number as a parameter; a simple loop can then visit the content of every page.
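As a sketch of that loop, assuming the cnblogs home list follows the pagination pattern `https://www.cnblogs.com/sitehome/p/<n>` (check the actual pattern in your browser first; the `User-Agent` header is added because plain scripts are often blocked):

```python
import urllib.request

BASE = "https://www.cnblogs.com/sitehome/p/{}"  # assumed pagination pattern


def build_url(page):
    """Build the URL for one page of the blog list."""
    return BASE.format(page)


def fetch_page(page):
    """Download the HTML of one list page as a str."""
    req = urllib.request.Request(
        build_url(page),
        headers={"User-Agent": "Mozilla/5.0"},  # look like a browser
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


# visiting every page is then a simple loop:
# for page in range(1, 201):
#     html = fetch_page(page)
```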
4. Analyzing the page content
What we need from the page is each blogger's post entry, for example:
More precisely, we need to extract each post's title, summary, publication time, and link.
With the page open, press F12 to inspect the elements.
Click the arrow icon, then hover over the content on the page to locate the element we want; the corresponding HTML appears highlighted in the code panel below.
Right-click that element, choose Copy element, and paste the snippet into a text document; save the following part of the code:
This block contains all the information for one post; next we use regular expressions to extract the content we need.
5. Regular expressions
title = re.compile('<a class="titlelnk.*?>(.*?)</a>', re.S)
title1 = re.findall(title, html)
Here html is the complete source code of the page; these two lines collect every blog title on the page into the title1 list.
In the pattern, <a class="titlelnk.*?> matches every <a> tag whose class is titlelnk, and (.*?) is the capture group for the content we extract.
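A small self-contained sketch of this extraction (the HTML below is a simplified stand-in for the real cnblogs markup, which may differ; the link pattern is an assumption built the same way as the title pattern):

```python
import re

# simplified stand-in for a fragment of the real cnblogs list page
html = (
    '<a class="titlelnk" href="https://www.cnblogs.com/a/1.html">First post</a>'
    '<a class="titlelnk" href="https://www.cnblogs.com/b/2.html">Second post</a>'
)

title = re.compile('<a class="titlelnk.*?>(.*?)</a>', re.S)
title1 = re.findall(title, html)   # every post title on the page
# → ['First post', 'Second post']

link = re.compile('<a class="titlelnk" href="(.*?)"', re.S)
link1 = re.findall(link, html)     # the matching post links, in the same order
```

Because re.findall preserves document order, title1[i] and link1[i] belong to the same post, which is what lets us pair them up later when inserting rows.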
6. Connecting to the database
import pymysql

db = pymysql.connect("127.0.0.1", "root", "root", "crawler", charset="utf8")  # open the database connection
The first four arguments of pymysql.connect() are the host, user, password, and database name; the charset="utf8" argument simply ensures the encoding is correct, since otherwise data cannot be inserted in some cases.
cursor = db.cursor()  # create a Cursor object using the cursor() method
7. The MySQL INSERT statement
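The statement itself is not shown above, so here is a sketch under the assumption of a table blog(title, summary, pub_time, link) in the crawler database (the table and column names are hypothetical). Using %s placeholders lets pymysql escape the values instead of building the SQL by string concatenation:

```python
# hypothetical table: blog(title, summary, pub_time, link)
INSERT_SQL = (
    "INSERT INTO blog (title, summary, pub_time, link) "
    "VALUES (%s, %s, %s, %s)"
)


def make_rows(titles, summaries, times, links):
    """Zip the parallel lists produced by re.findall into row tuples."""
    return list(zip(titles, summaries, times, links))


def save_rows(db, rows):
    """Insert the rows through an open pymysql connection."""
    cursor = db.cursor()
    cursor.executemany(INSERT_SQL, rows)
    db.commit()  # without commit() the inserts are not persisted
```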
8. Putting the code together
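The original full listing is not reproduced here, but the pieces above can be assembled roughly as follows. The URL pattern, regexes, and table layout are the same assumptions as before (a minimal blog(title, link) table); call main() to actually run the crawl:

```python
import re
import urllib.request

PAGE_URL = "https://www.cnblogs.com/sitehome/p/{}"  # assumed pagination pattern
TITLE_RE = re.compile('<a class="titlelnk.*?>(.*?)</a>', re.S)
LINK_RE = re.compile('<a class="titlelnk" href="(.*?)"', re.S)


def fetch_page(page):
    """Download one list page as a str."""
    req = urllib.request.Request(
        PAGE_URL.format(page), headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def parse_posts(html):
    """Return (title, link) pairs found in one page of HTML."""
    return list(zip(TITLE_RE.findall(html), LINK_RE.findall(html)))


def main(pages=20):
    import pymysql  # imported here so the parsing helpers work without MySQL
    db = pymysql.connect("127.0.0.1", "root", "root", "crawler", charset="utf8")
    cursor = db.cursor()
    for page in range(1, pages + 1):
        for title, link in parse_posts(fetch_page(page)):
            # hypothetical minimal table: blog(title, link)
            cursor.execute(
                "INSERT INTO blog (title, link) VALUES (%s, %s)",
                (title, link))
    db.commit()
    db.close()
```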
The principle and the code are all here; to scrape other content, analyze the target site the same way. Of course, not every site can be crawled like this: some sites have anti-crawling measures, and handling them requires further techniques (request-rate control, proxy IP pools, and so on).