Python crawler (3) -- Methods and steps for crawling large amounts of data with Python

Methods and steps for crawling large amounts of data with Python.

First, crawl the links we need

channel_extract.py
The links collected here are what we call the top-level category links:

from bs4 import BeautifulSoup
import requests

start_url = 'http://lz.ganji.com/wu/'
host_url = 'http://lz.ganji.com/'

def get_channel_urls(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('.fenlei > dt > a')
    # print(links)
    for link in links:
        page_url = host_url + link.get('href')
        print(page_url)

# get_channel_urls(start_url)

channel_urls = '''http://lz.ganji.com/jiaju/
http://lz.ganji.com/rirongbaihuo/
http://lz.ganji.com/shouji/
http://lz.ganji.com/bangong/
http://lz.ganji.com/nongyongpin/
http://lz.ganji.com/jiadian/
http://lz.ganji.com/ershoubijibendiannao/
http://lz.ganji.com/ruanjiantushu/
http://lz.ganji.com/yingyouyunfu/
http://lz.ganji.com/diannao/
http://lz.ganji.com/xianzhilipin/
http://lz.ganji.com/fushixiaobaxuemao/
http://lz.ganji.com/meironghuazhuang/
http://lz.ganji.com/shuma/
http://lz.ganji.com/laonianyongpin/
http://lz.ganji.com/xuniwupin/'''
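Since channel_urls is stored as one multi-line string, splitting it on whitespace gives back the list of category links. A minimal check (my own addition, assuming the channel_urls string defined above):

from channel_extract import channel_urls

channels = channel_urls.split()   # split the multi-line string into a list of links
print(len(channels))              # 16 category links
print(channels[0])                # http://lz.ganji.com/jiaju/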

Then, taking my crawl of 58同城 as an example, the task is to crawl the links of every category in the second-hand market, that is, the top-level category links mentioned above.
Find the common features of these links, output them with the function, and store them as multi-line text.

Second, get the links to the detail pages we need, together with their detail information

page_parsing.py

1. First, let's talk about our database:

First look at the code:

# Import the libraries
from bs4 import BeautifulSoup
import requests
import pymongo   # the library for operating MongoDB from Python
import re
import time

# Connect to MongoDB and create the database
client = pymongo.MongoClient('localhost', 27017)
ceshi = client['ceshi']                     # create the 'ceshi' database
ganji_url_list = ceshi['ganji_url_list']    # collection that stores the detail-page links
ganji_url_info = ceshi['ganji_url_info']    # collection that stores the detail-page info
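As a quick sanity check that the connection works (my own sketch, not part of the original script, using the same pymongo calls as the code in this article), insert one test document and read the collection back:

ganji_url_list.insert_one({'url': 'http://test.example/'})   # throwaway test document
print(ganji_url_list.find().count())                         # how many documents are stored
ganji_url_list.delete_one({'url': 'http://test.example/'})   # clean up the test document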
2. Check whether the page structure matches the structure we want; sometimes, for example, the page turns out to be a 404 (one simple check is sketched below).
3. Extract the links we want from the page, that is, the link to each detail page.
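A minimal sketch of point 2 (my own addition, not the author's code, and the function name is hypothetical): one simple way to skip pages that do not match the expected structure is to check the HTTP status code before parsing. The author's get_type_links below relies on the presence of the '.pagebox' element instead.

import requests

def page_looks_valid(url):
    # Sketch only: treat anything other than HTTP 200 (e.g. a 404 page)
    # as 'not the page we want' and skip it.
    wb_data = requests.get(url)
    return wb_data.status_code == 200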
For point 3, one technique is worth explaining:

item_link = link.get('href').split('?')[0]

What kind of object is this link, and what exactly is this get method?
It turns out that its type is:

<class 'bs4.element.Tag'>

If we want to read a single attribute, we can do it like this, for example, to get its class name:

print(soup.p['class'])
# ['title']

You can also use the get method and pass in the attribute name; the two are equivalent:

print(soup.p.get('class'))
# ['title']
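Here is a small, self-contained illustration of the behaviour described above (the HTML snippet is made up for demonstration); it also shows why split('?')[0] is handy for stripping the query string from an extracted href:

from bs4 import BeautifulSoup

html = '<p class="title"><a href="/jiaju/123.htm?from=list">sofa</a></p>'
soup = BeautifulSoup(html, 'lxml')

link = soup.select('p > a')[0]
print(type(link))                        # <class 'bs4.element.Tag'>
print(link.get('href'))                  # /jiaju/123.htm?from=list
print(link.get('href').split('?')[0])    # /jiaju/123.htm  (query string removed)
print(soup.p['class'])                   # ['title']
print(soup.p.get('class'))               # ['title']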

Let me put the code here:

# Crawl the detail-page links of all the goods:
def get_type_links(channel, num):
    list_view = '{0}o{1}/'.format(channel, str(num))
    # print(list_view)
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    linkon = soup.select('.pagebox')  # marker used to decide whether this is a page we need
    # If the selector copied from the browser looks like
    # 'div.pagebox > ul > li:nth-child(1) > a > span', delete the ':nth-child(1)' part
    # print(linkon)
    if linkon:
        link = soup.select('.zz > .zz-til > a')
        link_2 = soup.select('.js-item > a')
        link = link + link_2
        # print(len(link))
        for linkc in link:
            linkc = linkc.get('href')
            ganji_url_list.insert_one({'url': linkc})
            print(linkc)
    else:
        pass
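To make the list_view construction concrete, here is what the generated list-page URLs look like for one channel (my own illustration; page numbers are appended as o1, o2, ...):

channel = 'http://lz.ganji.com/jiaju/'
for num in range(1, 4):
    print('{0}o{1}/'.format(channel, str(num)))
# http://lz.ganji.com/jiaju/o1/
# http://lz.ganji.com/jiaju/o2/
# http://lz.ganji.com/jiaju/o3/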
4. Crawl the information we need from the detail pages

Here is the code:

# Crawl a Ganji detail page:
def get_url_info_ganji(url):
    time.sleep(1)
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    try:
        title = soup.select('head > title')[0].text
        timec = soup.select('.pr-5')[0].text.strip()
        type = soup.select('.det-infor > li > span > a')[0].text
        price = soup.select('.det-infor > li > i')[0].text
        place = soup.select('.det-infor > li > a')[1:]
        placeb = []
        for placec in place:
            placeb.append(placec.text)
        tag = soup.select('.second-dt-bewrite > ul > li')[0].text
        tag = ''.join(tag.split())
        # print(time.split())
        data = {
            'url': url,
            'title': title,
            'time': timec.split(),
            'type': type,
            'price': price,
            'place': placeb,
            'new': tag
        }
        ganji_url_info.insert_one(data)  # insert one record into the database
        print(data)
    except IndexError:
        pass
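A minimal usage sketch (my own addition, assuming get_type_links has already stored some links in ganji_url_list): parse just the first stored link to check that the selectors still work.

first = ganji_url_list.find_one()     # grab one stored detail-page link
if first:
    get_url_info_ganji(first['url'])  # should print the parsed data dict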
How do we write our main function?

main.py
Look at the code:

# First import the functions and data from the other files:
from multiprocessing import Pool
from page_parsing import get_type_links, get_url_info_ganji, ganji_url_list
from channel_extract import channel_urls

# Function that crawls all the links:
def get_all_links_from(channel):
    for i in range(1, 100):
        get_type_links(channel, i)

# Run this block second, to crawl all the detail pages:
# if __name__ == '__main__':
#     pool = Pool()
#     pool.map(get_url_info_ganji, [url['url'] for url in ganji_url_list.find()])
#     pool.close()
#     pool.join()

# Run the block below first, to crawl all the links:
if __name__ == '__main__':
    pool = Pool()
    pool.map(get_all_links_from, channel_urls.split())
    pool.close()
    pool.join()
Fifth, the counting program

count.py
Used to show how many records have been crawled:

import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    # print(ganji_url_list.find().count())
    # time.sleep(5)
    print(ganji_url_info.find().count())
    time.sleep(5)
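An optional variant (my own sketch, not part of the original) that reports progress as 'parsed / collected' by comparing the two collections:

import time
from page_parsing import ganji_url_list, ganji_url_info

while True:
    total = ganji_url_list.find().count()   # links collected by get_type_links
    done = ganji_url_info.find().count()    # detail pages parsed so far
    print('{} / {} detail pages parsed'.format(done, total))
    time.sleep(5)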