How the URL manager can be implemented:
1. Memory
Python memory
pending URL collection: set()
crawled URL collection: set()
2. Relational database
MySQL
urls(url, is_crawled)
3. Cache database (high performance; what large sites use)
Redis
pending URL collection: Set
crawled URL collection: Set
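The in-memory option above can be sketched with two Python sets. A minimal sketch; the class and method names are illustrative, not from the source:

```python
class UrlManager:
    """In-memory URL manager backed by two sets."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        # only accept URLs that are neither pending nor already crawled
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # move one URL from the pending set to the crawled set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

The two-set design makes duplicate detection an O(1) membership test, which is the whole point of the URL manager.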
Web Downloader
urllib2: Python's official base module
requests: third-party package, more powerful
urllib2 download method 1: direct request
###########################
import urllib2

# make the request directly
response = urllib2.urlopen('http://www.baidu.com')
# get the status code; 200 means success
print response.getcode()
# read the content
cont = response.read()
###########################
urllib2 download method 2: add data and HTTP headers
############################
import urllib2

# create a Request object
request = urllib2.Request(url)
# add form data (add_data takes a single url-encoded string, sent as the POST body)
request.add_data('a=1')
# add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')
# send the request and get the result
response = urllib2.urlopen(request)
############################
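urllib2 exists only in Python 2; in Python 3 the same functionality lives in urllib.request, where data and headers are set like this (a sketch of method 2 above; the URL and form field are the same placeholders as in the source):

```python
import urllib.request
import urllib.parse

# url-encode the form data; data must be bytes in Python 3
data = urllib.parse.urlencode({'a': '1'}).encode('utf-8')

# create a Request object carrying the data (makes it a POST)
request = urllib.request.Request('http://www.baidu.com', data=data)
# add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')

# sending it would be:
# response = urllib.request.urlopen(request)
```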
urllib2 download method 3: add a handler for special scenarios
HTTPCookieProcessor (pages that require cookies/login)
ProxyHandler (access through a proxy)
HTTPSHandler (HTTPS-encrypted pages)
HTTPRedirectHandler (pages with automatic redirects)
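These handlers are installed with build_opener. A minimal sketch of the cookie case using Python 3's urllib.request and http.cookiejar (in Python 2, HTTPCookieProcessor and build_opener come from urllib2 and the cookie jar from cookielib, but the pattern is the same):

```python
import http.cookiejar
import urllib.request

# create a cookie jar and a handler that stores cookies across requests
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# install the opener so every subsequent urlopen call goes through it
urllib.request.install_opener(opener)

# fetching a page would now record its cookies in cj:
# response = urllib.request.urlopen('http://www.baidu.com')
```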
Web Parser
A tool for extracting valuable data from web pages. Four options:
1. Regular expressions (treats the page as a string; fuzzy matching, complex for nested HTML)
2. html.parser (built into Python)
3. Beautiful Soup (third-party library, powerful)
4. lxml (third-party)
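Option 2, the built-in html.parser, needs no third-party install. A minimal sketch that collects the href of every `<a>` tag; the LinkCollector class and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
parser.feed('<div><a href="/view/123.htm">Python</a>'
            '<a href="/view/456.htm">Crawler</a></div>')
print(parser.links)  # → ['/view/123.htm', '/view/456.htm']
```

html.parser is event-driven (you override handler methods), which is why the later options that expose a searchable DOM, like Beautiful Soup, are usually more convenient.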
Beautiful Soup
A Python third-party library for extracting data from HTML or XML.
Official website: https://www.crummy.com/software/BeautifulSoup/
Install Beautiful Soup: pip install beautifulsoup4
Beautiful Soup syntax:
1. Create a BeautifulSoup object from the HTML page
2. Search for nodes with find_all / find (by tag name, attribute value, or node text)
3. Access each found node's name, attributes, and text
# create a BeautifulSoup object
from bs4 import BeautifulSoup

# create the BeautifulSoup object from an HTML document string
soup = BeautifulSoup(
    html_doc,              # HTML document string
    'html.parser',         # HTML parser
    from_encoding='utf8'   # encoding of the HTML document (for byte input)
)
# search nodes (find_all, find)
find_all(name, attrs, string)
# find all nodes with tag a
soup.find_all('a')
# find all a tags linking to /view/123.htm
soup.find_all('a', href='/view/123.htm')
# <a href='123.htm' class='abc'>Python</a>
# find all div nodes whose class is abc and whose text is Python
soup.find_all('div', class_='abc', string='Python')
Accessing a node's information:
# given the node: <a href='1.html'>Python</a>
# get the tag name of the found node
node.name
# get the href attribute of the found a node
node['href']
# get the link text of the found node
node.get_text()
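The three steps above can be tied together in one runnable sketch; the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

html_doc = '''
<html><body>
<a href="/view/123.htm" class="abc">Python</a>
<a href="/view/456.htm" class="abc">Crawler</a>
</body></html>
'''

# step 1: create the BeautifulSoup object from the HTML string
soup = BeautifulSoup(html_doc, 'html.parser')

# step 2: search for nodes
links = soup.find_all('a')

# step 3: access each node's name, attributes, and text
for node in links:
    print(node.name, node['href'], node.get_text())
```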
Crawling data with the Python crawler