Python Simple Crawler 3: Python Crawler

We continue studying BeautifulSoup, this time organizing the results into categories and printing the output. See also Python Simple Crawler 1 and Python Simple Crawler 2.

The first two sections mainly showed how to use BeautifulSoup to capture webpage information and extract the corresponding images and titles.

That means we only know how to use the tool to browse and retrieve content; only we ourselves know what was crawled.

Now we need to organize the data into categories, name them, and print them, so that others can see what the title is and what the content is.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import json

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    # 'cookie': 'cnzzdata1260535040=242528197-null%7C1478672438',
}
url = 'http://www.beiwo.tv/index.php?s=vod-search-id-14-tid--area--year-inclusearch_year-order-gold.html'
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')

imgs = soup.select("ul.img-list.clearfix > li > a > img")
titles = soup.select("ul.img-list.clearfix > li > h5")
yanyuans = soup.select("ul.img-list.clearfix > li > p")
stars = soup.select("p.star > em")

J_data = {}
count = 0
for title, img, yanyuan, star in zip(titles, imgs, yanyuans, stars):
    data = {
        "title": title.get_text(),
        "img": img.get("src"),
        "actor": list(yanyuan.stripped_strings),
        "rating": star.get_text(),
    }
    J_data[count] = data
    count += 1
    print(data)

with open("test.txt", 'w') as f:
    f.write(json.dumps(J_data))

That is the complete code; now let me walk through it piece by piece:

First, we import the required modules in the standard way. Here I also import json, which I use to save the captured data into a txt file.

headers is a shorthand browser header used to disguise the request as coming from a browser, and url is the address of the page being scraped (many websites now have anti-crawling protection, making information increasingly difficult to crawl).

requests fetches the page, and the returned wb_data is parsed by BeautifulSoup using the lxml parser.

The captured information is: titles (the titles), imgs (the images), yanyuans (the actors), and stars (the ratings). Each name ends in s because select() returns a list of all matching elements.
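To see why the variables are plural, here is a minimal, self-contained sketch (a made-up snippet, not the real page) showing that select() always returns a list:

from bs4 import BeautifulSoup

# Tiny stand-in for the page structure the crawler targets.
html = "<ul class='img-list clearfix'><li><h5>A</h5></li><li><h5>B</h5></li></ul>"
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("ul.img-list.clearfix > li > h5")
print(titles)                           # [<h5>A</h5>, <h5>B</h5>]
print([t.get_text() for t in titles])   # ['A', 'B']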

J_data is the dictionary in which the results are stored, and count supplies the numeric keys. The zip method works like this: it pairs up the values at the same position in each list and yields them as tuples, row by row. Here I use it to group the corresponding pieces of information together for output.
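A tiny illustration of zip() with made-up values; note that zip() stops at the shortest list, which is why the four captured lists must stay aligned:

titles = ["Movie A", "Movie B"]
stars = ["8.5", "7.9"]

# Items at the same position are paired and unpacked one row at a time.
for title, star in zip(titles, stars):
    print(title, star)
# Movie A 8.5
# Movie B 7.9

print(list(zip(titles, stars)))  # [('Movie A', '8.5'), ('Movie B', '7.9')]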

The file is written with the commonly used with statement: it closes the file automatically without an explicit close() call, cleaning up after the block finishes. The effect is as follows:

Here we have sorted all the desired data into categories and printed them so that others can see the titles and contents. Note that because the rating and the actors sit under the same tag, there is a bug when an entry has no actor name; one possible guard is sketched below.
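One way to guard against the missing-actor case (my sketch, not the author's fix): fall back to a placeholder when the tag yields no actor strings:

from bs4 import BeautifulSoup

# Hypothetical entry whose <p> tag holds no actor text, the case warned
# about above.
p = BeautifulSoup("<li><p></p></li>", "html.parser").select_one("li > p")

actors = list(p.stripped_strings)
# Fall back to a placeholder so the record always has a non-empty list.
data = {"actor": actors if actors else ["unknown"]}
print(data)  # {'actor': ['unknown']}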

Let's take a look at what is saved in the txt file:

Many people think this output is garbled. In fact, \u6f14 is a Chinese character written in its Unicode escape form; if you read the text back in and print it, it displays normally. (The file is too long to show in full here.)
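A quick round-trip demonstrating this, using the character \u6f14 (演) from the example above; ensure_ascii is a standard json.dumps option, shown as an optional variation:

import json

# json.dumps escapes non-ASCII characters by default, which is why the
# file contains sequences like \u6f14 instead of Chinese characters.
print(json.dumps({"actor": "演员"}))
# {"actor": "\u6f14\u5458"}

# Reading the text back restores the original characters:
print(json.loads('{"actor": "\\u6f14\\u5458"}'))
# {'actor': '演员'}

# Optionally, ensure_ascii=False keeps the raw characters in the file
# (open the file with an explicit utf-8 encoding in that case).
print(json.dumps({"actor": "演员"}, ensure_ascii=False))
# {"actor": "演员"}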

The following code goes in a new .py file, which reads the saved data back:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

with open('test.txt', 'r') as f:
    dic = json.loads(f.readline())
    for i in range(len(dic)):
        print(dic[str(i)])

Import the json module.

Open the file test.txt in mode 'r' (read) and name the file object f. The test.txt generated by the crawler sits in the current directory, i.e. all three files live together; if you want it elsewhere, write the relative path.
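If you would rather not depend on the current working directory, one common variation (my sketch, not part of the original code) anchors the path to the script's own location:

import json
import os

# Build the path relative to this script, so it works no matter where
# the script is launched from.
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "test.txt")
with open(path, "r") as f:
    dic = json.loads(f.readline())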

Because the file holds only one line, f.readline() reads all of it; json.loads decodes the Unicode-escaped text that json.dumps wrote earlier, returning the dictionary dic.

A loop over the keys prints each value so you can see the dictionary's contents. The effect is as follows:

With that, the basic usage of BeautifulSoup is covered. Many extensions are possible, such as storing the captured data in MySQL or another database, or writing it into an xls spreadsheet; since my focus here is BeautifulSoup, I have not covered them, but they make good extension exercises (a sketch of the Excel option follows below).
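As a taste of that exercise, here is a minimal sketch of the Excel option, assuming openpyxl as the third-party module (the author does not name one) and the test.txt produced above; the output file name movies.xlsx is my own choice:

import json
from openpyxl import Workbook  # third-party: pip install openpyxl

# Load the crawled data saved by the crawler above.
with open("test.txt", "r") as f:
    dic = json.loads(f.readline())

wb = Workbook()
ws = wb.active
ws.append(["title", "img", "actor", "rating"])  # header row

for i in range(len(dic)):
    row = dic[str(i)]
    # The actor field is a list, so join it into one cell.
    ws.append([row["title"], row["img"], ", ".join(row["actor"]), row["rating"]])

wb.save("movies.xlsx")  # hypothetical output file name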

You can learn the basic usage and statements of databases, and learn to use a third-party Python module to write data into an Excel file. Finally, a reminder: only by practicing a lot will you get comfortable with these tools, discover problems, think them through, solve them, and improve. Thanks to the classmates and friends who have followed along; I will update from time to time with more useful libraries and methods.
