Python crawler (3) -- Methods and steps for crawling large amounts of data with Python

Methods and steps for crawling large amounts of data with Python.

First, crawl the links we need

channel_extract.py
The links collected here are what we call the top-level category links:

from bs4 import BeautifulSoup
import requests

start_url = 'http://lz.ganji.com/wu/'
host_url = 'http://lz.ganji.com/'

def get_channel_urls(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('.fenlei > dt > a')
    # print(links)
    for link in links:
        page_url = host_url + link.get('href')
        print(page_url)

# get_channel_urls(start_url)

channel_urls = '''http://lz.ganji.com/jiaju/
http://lz.ganji.com/rirongbaihuo/
http://lz.ganji.com/shouji/
http://lz.ganji.com/bangong/
http://lz.ganji.com/nongyongpin/
http://lz.ganji.com/jiadian/
http://lz.ganji.com/ershoubijibendiannao/
http://lz.ganji.com/ruanjiantushu/
http://lz.ganji.com/yingyouyunfu/
http://lz.ganji.com/diannao/
http://lz.ganji.com/xianzhilipin/
http://lz.ganji.com/fushixiaobaxuemao/
http://lz.ganji.com/meironghuazhuang/
http://lz.ganji.com/shuma/
http://lz.ganji.com/laonianyongpin/
http://lz.ganji.com/xuniwupin/'''
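Since channel_urls is stored as one multi-line string, splitting it on whitespace gives back the list of category links. A minimal check (my own addition, assuming the channel_urls string defined above):

from channel_extract import channel_urls

channels = channel_urls.split()   # split the multi-line string into a list of links
print(len(channels))              # 16 category links
print(channels[0])                # http://lz.ganji.com/jiaju/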

Then, taking my crawl of 58同城 as an example, the task is to crawl the links of every category in the second-hand market, that is, the top-level category links mentioned above.
Find the common features of these links, output them with the function, and store them as multi-line text.

Second, get the links to the detail pages we need, together with their detail information

page_parsing.py

1. First, let's talk about our database:

First look at the code:

# Import the libraries
from bs4 import BeautifulSoup
import requests
import pymongo   # the library for operating MongoDB from Python
import re
import time

# Connect to MongoDB and create the database
client = pymongo.MongoClient('localhost', 27017)
ceshi = client['ceshi']                     # create the 'ceshi' database
ganji_url_list = ceshi['ganji_url_list']    # collection that stores the detail-page links
ganji_url_info = ceshi['ganji_url_info']    # collection that stores the detail-page info
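As a quick sanity check that the connection works (my own sketch, not part of the original script, using the same pymongo calls as the code in this article), insert one test document and read the collection back:

ganji_url_list.insert_one({'url': 'http://test.example/'})   # throwaway test document
print(ganji_url_list.find().count())                         # how many documents are stored
ganji_url_list.delete_one({'url': 'http://test.example/'})   # clean up the test document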
2. Check whether the page structure matches the structure we want; sometimes, for example, the page turns out to be a 404 (one simple check is sketched below).
3. Extract the links we want from the page, that is, the link to each detail page.
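A minimal sketch of point 2 (my own addition, not the author's code, and the function name is hypothetical): one simple way to skip pages that do not match the expected structure is to check the HTTP status code before parsing. The author's get_type_links below relies on the presence of the '.pagebox' element instead.

import requests

def page_looks_valid(url):
    # Sketch only: treat anything other than HTTP 200 (e.g. a 404 page)
    # as 'not the page we want' and skip it.
    wb_data = requests.get(url)
    return wb_data.status_code == 200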
For point 3, one technique is worth explaining:

item_link = link.get('href').split('?')[0]

What kind of object is this link, and what exactly is this get method?
It turns out that its type is:

<class 'bs4.element.Tag'>

If we want to read a single attribute, we can do it like this, for example, to get its class name:

print(soup.p['class'])
# ['title']

You can also use the get method and pass in the attribute name; the two are equivalent:

print(soup.p.get('class'))
# ['title']
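Here is a small, self-contained illustration of the behaviour described above (the HTML snippet is made up for demonstration); it also shows why split('?')[0] is handy for stripping the query string from an extracted href:

from bs4 import BeautifulSoup

html = '<p class="title"><a href="/jiaju/123.htm?from=list">sofa</a></p>'
soup = BeautifulSoup(html, 'lxml')

link = soup.select('p > a')[0]
print(type(link))                        # <class 'bs4.element.Tag'>
print(link.get('href'))                  # /jiaju/123.htm?from=list
print(link.get('href').split('?')[0])    # /jiaju/123.htm  (query string removed)
print(soup.p['class'])                   # ['title']
print(soup.p.get('class'))               # ['title']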

Let me put the code here:

# Crawl the detail-page links of all the goods:
def get_type_links(channel, num):
    list_view = '{0}o{1}/'.format(channel, str(num))
    # print(list_view)
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    linkon = soup.select('.pagebox')  # marker used to decide whether this is a page we need
    # If the selector copied from the browser looks like
    # 'div.pagebox > ul > li:nth-child(1) > a > span', delete the ':nth-child(1)' part
    # print(linkon)
    if linkon:
        link = soup.select('.zz > .zz-til > a')
        link_2 = soup.select('.js-item > a')
        link = link + link_2
        # print(len(link))
        for linkc in link:
            linkc = linkc.get('href')
            ganji_url_list.insert_one({'url': linkc})
            print(linkc)
    else:
        pass
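To make the list_view construction concrete, here is what the generated list-page URLs look like for one channel (my own illustration; page numbers are appended as o1, o2, ...):

channel = 'http://lz.ganji.com/jiaju/'
for num in range(1, 4):
    print('{0}o{1}/'.format(channel, str(num)))
# http://lz.ganji.com/jiaju/o1/
# http://lz.ganji.com/jiaju/o2/
# http://lz.ganji.com/jiaju/o3/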
4. Crawl the information we need from the detail pages

Here is the code:

# Crawl a Ganji detail page:
def get_url_info_ganji(url):
    time.sleep(1)
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    try:
        title = soup.select('head > title')[0].text
        timec = soup.select('.pr-5')[0].text.strip()
        type = soup.select('.det-infor > li > span > a')[0].text
        price = soup.select('.det-infor > li > i')[0].text
        place = soup.select('.det-infor > li > a')[1:]
        placeb = []
        for placec in place:
            placeb.append(placec.text)
        tag = soup.select('.second-dt-bewrite > ul > li')[0].text
        tag = ''.join(tag.split())
        # print(time.split())
        data = {
            'url': url,
            'title': title,
            'time': timec.split(),
            'type': type,
            'price': price,
            'place': placeb,
            'new': tag
        }
        ganji_url_info.insert_one(data)  # insert one record into the database
        print(data)
    except IndexError:
        pass
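A minimal usage sketch (my own addition, assuming get_type_links has already stored some links in ganji_url_list): parse just the first stored link to check that the selectors still work.

first = ganji_url_list.find_one()     # grab one stored detail-page link
if first:
    get_url_info_ganji(first['url'])  # should print the parsed data dict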
How do we write our main function?

main.py
Look at the code:

# First import the functions and data from the other files:
from multiprocessing import Pool
from page_parsing import get_type_links, get_url_info_ganji, ganji_url_list
from channel_extract import channel_urls

# Function that crawls all the links:
def get_all_links_from(channel):
    for i in range(1, 100):
        get_type_links(channel, i)

# Run this block second, to crawl all the detail pages:
# if __name__ == '__main__':
#     pool = Pool()
#     pool.map(get_url_info_ganji, [url['url'] for url in ganji_url_list.find()])
#     pool.close()
#     pool.join()

# Run the block below first, to crawl all the links:
if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_urls.split())
    pool.close()
    pool.join()
Fifth, the counting program

count.py
Used to show how many records have been crawled:

import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    # print(ganji_url_list.find().count())
    # time.sleep(5)
    print(ganji_url_info.find().count())
    time.sleep(5)
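An optional variant (my own sketch, not part of the original) that reports progress as 'parsed / collected' by comparing the two collections:

import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    total = ganji_url_list.find().count()   # links collected by get_type_links
    done = ganji_url_info.find().count()    # detail pages parsed so far
    print('{} / {} detail pages parsed'.format(done, total))
    time.sleep(5)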