Python Simple Crawler 3: Python Crawler

We continue studying BeautifulSoup, this time organizing the results into categories and printing the output. See also Python Simple Crawler 1 and Python Simple Crawler 2.

The first two sections mainly showed how to use BeautifulSoup to capture webpage information and extract the corresponding images and titles.

That means we only know how to use the tool to browse and retrieve content; only we ourselves know what was crawled.

Now we need to organize the data into categories, name them, and print them, so that others can see what the title is and what the content is.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import json

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    # 'cookie': 'cnzzdata1260535040=242528197-null%7C1478672438',
}
url = 'http://www.beiwo.tv/index.php?s=vod-search-id-14-tid--area--year-inclusearch_year-order-gold.html'
wb_data = requests.get(url, headers=headers)
soup = BeautifulSoup(wb_data.text, 'lxml')

imgs = soup.select("ul.img-list.clearfix > li > a > img")
titles = soup.select("ul.img-list.clearfix > li > h5")
yanyuans = soup.select("ul.img-list.clearfix > li > p")
stars = soup.select("p.star > em")

J_data = {}
count = 0
for title, img, yanyuan, star in zip(titles, imgs, yanyuans, stars):
    data = {
        "title": title.get_text(),
        "img": img.get("src"),
        "actor": list(yanyuan.stripped_strings),
        "rating": star.get_text(),
    }
    J_data[count] = data
    count += 1
    print(data)

with open("test.txt", 'w') as f:
    f.write(json.dumps(J_data))

That is the complete code; now let me walk through it piece by piece:

First, we import the required modules in the standard way. Here I also import json, which I use to save the captured data into a txt file.

headers is a shorthand browser header used to disguise the request as coming from a browser, and url is the address of the page being scraped (many websites now have anti-crawling protection, making information increasingly difficult to crawl).

requests fetches the page, and the returned wb_data is parsed by BeautifulSoup using the lxml parser.

The captured information is: titles (the titles), imgs (the images), yanyuans (the actors), and stars (the ratings). Each name ends in s because select() returns a list of all matching elements.
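To see why the variables are plural, here is a minimal, self-contained sketch (a made-up snippet, not the real page) showing that select() always returns a list:

from bs4 import BeautifulSoup

# Tiny stand-in for the page structure the crawler targets.
html = "<ul class='img-list clearfix'><li><h5>A</h5></li><li><h5>B</h5></li></ul>"
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("ul.img-list.clearfix > li > h5")
print(titles)                           # [<h5>A</h5>, <h5>B</h5>]
print([t.get_text() for t in titles])   # ['A', 'B']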

J_data is the dictionary in which the results are stored, and count supplies the numeric keys. The zip method works like this: it pairs up the values at the same position in each list and yields them as tuples, row by row. Here I use it to group the corresponding pieces of information together for output.
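A tiny illustration of zip() with made-up values; note that zip() stops at the shortest list, which is why the four captured lists must stay aligned:

titles = ["Movie A", "Movie B"]
stars = ["8.5", "7.9"]

# Items at the same position are paired and unpacked one row at a time.
for title, star in zip(titles, stars):
    print(title, star)
# Movie A 8.5
# Movie B 7.9

print(list(zip(titles, stars)))  # [('Movie A', '8.5'), ('Movie B', '7.9')]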

The file is written with the commonly used with statement: it closes the file automatically without an explicit close() call, cleaning up after the block finishes. The effect is as follows:

Here we have sorted all the desired data into categories and printed them so that others can see the titles and contents. Note that because the rating and the actors sit under the same tag, there is a bug when an entry has no actor name; one possible guard is sketched below.
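One way to guard against the missing-actor case (my sketch, not the author's fix): fall back to a placeholder when the tag yields no actor strings:

from bs4 import BeautifulSoup

# Hypothetical entry whose <p> tag holds no actor text, the case warned
# about above.
p = BeautifulSoup("<li><p></p></li>", "html.parser").select_one("li > p")

actors = list(p.stripped_strings)
# Fall back to a placeholder so the record always has a non-empty list.
data = {"actor": actors if actors else ["unknown"]}
print(data)  # {'actor': ['unknown']}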

Let's take a look at what is saved in the txt file:

Many people think this output is garbled. In fact, \u6f14 is a Chinese character written in its Unicode escape form; if you read the text back in and print it, it displays normally. (The file is too long to show in full here.)
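A quick round-trip demonstrating this, using the character \u6f14 (演) from the example above; ensure_ascii is a standard json.dumps option, shown as an optional variation:

import json

# json.dumps escapes non-ASCII characters by default, which is why the
# file contains sequences like \u6f14 instead of Chinese characters.
print(json.dumps({"actor": "演员"}))
# {"actor": "\u6f14\u5458"}

# Reading the text back restores the original characters:
print(json.loads('{"actor": "\\u6f14\\u5458"}'))
# {'actor': '演员'}

# Optionally, ensure_ascii=False keeps the raw characters in the file
# (open the file with an explicit utf-8 encoding in that case).
print(json.dumps({"actor": "演员"}, ensure_ascii=False))
# {"actor": "演员"}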

The following code goes in a new .py file, which reads the saved data back:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

with open('test.txt', 'r') as f:
    dic = json.loads(f.readline())
    for i in range(len(dic)):
        print(dic[str(i)])

Import the json module.

Open the file test.txt in mode 'r' (read) and name the file object f. The test.txt generated by the crawler sits in the current directory, i.e. all three files live together; if you want it elsewhere, write the relative path.
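If you would rather not depend on the current working directory, one common variation (my sketch, not part of the original code) anchors the path to the script's own location:

import json
import os

# Build the path relative to this script, so it works no matter where
# the script is launched from.
path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "test.txt")
with open(path, "r") as f:
    dic = json.loads(f.readline())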

Because the file holds only one line, f.readline() reads all of it; json.loads decodes the Unicode-escaped text that json.dumps wrote earlier, returning the dictionary dic.

A loop over the keys prints each value so you can see the dictionary's contents. The effect is as follows:

With that, the basic usage of BeautifulSoup is covered. Many extensions are possible, such as storing the captured data in MySQL or another database, or writing it into an xls spreadsheet; since my focus here is BeautifulSoup, I have not covered them, but they make good extension exercises (a sketch of the Excel option follows below).
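As a taste of that exercise, here is a minimal sketch of the Excel option, assuming openpyxl as the third-party module (the author does not name one) and the test.txt produced above; the output file name movies.xlsx is my own choice:

import json
from openpyxl import Workbook  # third-party: pip install openpyxl

# Load the crawled data saved by the crawler above.
with open("test.txt", "r") as f:
    dic = json.loads(f.readline())

wb = Workbook()
ws = wb.active
ws.append(["title", "img", "actor", "rating"])  # header row

for i in range(len(dic)):
    row = dic[str(i)]
    # The actor field is a list, so join it into one cell.
    ws.append([row["title"], row["img"], ", ".join(row["actor"]), row["rating"]])

wb.save("movies.xlsx")  # hypothetical output file name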

You can learn the basic usage and statements of databases, and learn to use a third-party Python module to write data into an Excel file. Finally, a reminder: only by practicing a lot will you get comfortable with these tools, discover problems, think them through, solve them, and improve. Thanks to the classmates and friends who have followed along; I will update from time to time with more useful libraries and methods.
