Preface:
This series of articles is a brief introduction to web crawlers and will teach you how to crawl the content of a site in a simple way.
Readers are expected to have a basic understanding of HTML and the Python language.
(This series also doubles as my notes from learning crawlers, and it will keep being updated as my study deepens.)
Crawler Introduction:
A web crawler is a program that automatically fetches web content, and it is an important component of search engines.
A web crawler first obtains the source code of a website, then parses that source (looking at <a>, <p> tags, etc.) to extract the content you want.
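The fetch-then-parse idea above can be sketched with only the standard library. This is an illustrative example, not the tutorial's final code: instead of downloading a live page (which you would do with something like urllib.request.urlopen(url).read()), it parses an inline HTML string with html.parser so it runs anywhere.

```python
from html.parser import HTMLParser

# A stand-in for page source that a crawler would have downloaded.
SAMPLE_HTML = """
<html><body>
  <p>Welcome to the tutorial.</p>
  <a href="/page1">First link</a>
  <a href="/page2">Second link</a>
</body></html>
"""

class LinkAndTextExtractor(HTMLParser):
    """Collect every <a> href and the text inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and data.strip():
            self.paragraphs.append(data.strip())

def extract(html):
    """Parse HTML source and return (links, paragraph texts)."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    return parser.links, parser.paragraphs

links, paragraphs = extract(SAMPLE_HTML)
print(links)       # every <a> href found in the page
print(paragraphs)  # the text of every <p> tag
```

BeautifulSoup, installed below, does the same job with far less code; this sketch just shows what "parsing the source to get the content you want" means.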
Environment configuration:
Ubuntu (please search online for installation instructions; we choose Ubuntu because installing and running the software below is more convenient than on Windows).
Many tools can be used for crawling; we choose Python on Ubuntu to crawl web pages, and we store the crawled content in a MySQL database.
Required Software:
Python: comes pre-installed on Ubuntu, so there is nothing to install.
pip: Python's package installer; installing pip makes it convenient to download the Python libraries we need for web crawling.
Scrapy: a fast, high-level screen-scraping and web-crawling framework written in Python.
BeautifulSoup: a Python library that extracts data from HTML or XML files by parsing tags and attributes (e.g. <a>, <p>, id, class).
MySQL: a relational database management system that keeps data in separate tables; we use it to store the crawled data.
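To illustrate the "store crawled data in a database" step without requiring a running MySQL server, here is a sketch using Python's built-in sqlite3 as a stand-in; with MySQL set up as below, mysql.connector.connect(...) gives you a very similar connection/cursor/execute workflow.

```python
import sqlite3

def save_items(conn, items):
    """Create a pages table and insert (url, title) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", items
    )
    conn.commit()

def load_titles(conn):
    """Read back all stored titles, ordered by URL."""
    return [row[0] for row in conn.execute("SELECT title FROM pages ORDER BY url")]

conn = sqlite3.connect(":memory:")  # in-memory database, just for the example
save_items(conn, [("/page1", "First"), ("/page2", "Second")])
print(load_titles(conn))
```

The table name and columns here are hypothetical; in later articles the actual schema will depend on what you crawl.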
Software Installation steps:
Open a terminal with Ctrl+Alt+T.
1. Install pip: sudo apt-get install python-pip
2. Install Scrapy: pip install scrapy
3. Install BeautifulSoup4: pip install beautifulsoup4
4. Install the MySQL-related Python libraries:
(1) pip install mysql-connector-python
(2) pip install MySQL-python
(3) pip install mysql-utilities
5. Install MySQL:
(1) sudo apt-get install mysql-server
(2) sudo apt-get install mysql-client
(3) sudo apt-get install libmysqlclient-dev
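After running the steps above, a quick way to confirm the Python libraries installed correctly is to try importing them. This is a small sketch; the module names are assumed from the packages above (note that the beautifulsoup4 package is imported as bs4).

```python
import importlib

# Module names corresponding to the packages installed above (assumed).
REQUIRED = ["scrapy", "bs4", "mysql.connector"]

def check_modules(names):
    """Try to import each module; return (available, missing) name lists."""
    available, missing = [], []
    for name in names:
        try:
            importlib.import_module(name)
            available.append(name)
        except ImportError:
            missing.append(name)
    return available, missing

ok, missing = check_modules(REQUIRED)
print("available:", ok)
print("missing:", missing)
```

If anything shows up under "missing", re-run the corresponding install command before continuing.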
Environment configuration is tedious work, so be patient. Different versions of Ubuntu may run into all sorts of odd problems; my own experience is limited, so please search online when you hit one (-.-)
Once the environment is configured, we can start crawling pages ^v^
Learn Crawlers from Scratch (1): Environment Configuration