In operation, BS4 loads the entire document tree and then queries it for matching elements, which consumes more resources and delivers lower processing performance than XPath.
So why use BS4? Because it is simple enough!
Description Language | Parsing Efficiency | Difficulty of Use
Regular Expression   | Very high          | Difficult
XPath                | High               | Normal
BS4                  | Lower              | Easy
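To make the comparison concrete, here is a small sketch of my own (not from the original) that extracts a page title with each of the three techniques; the XPath variant assumes lxml is installed.

import re
from lxml import etree
from bs4 import BeautifulSoup

html = "<html><head><title>hello</title></head><body></body></html>"

# Regular expression: fastest, but hard to write correctly for real pages
print(re.search(r"<title>(.*?)</title>", html).group(1))

# XPath via lxml: fast and expressive
print(etree.HTML(html).xpath("//title/text()")[0])

# BS4: loads the whole document tree, slower, but the simplest to use
print(BeautifulSoup(html, "html.parser").title.get_text())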
Python uses bs4 to get the city categories from 58.com (58 Tongcheng)
This example shows how Python uses bs4 to fetch the city categories from 58.com. It is shared for your reference; the details are as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib
import os, datetime, sys
from bs4 import BeautifulSoup

reload(sys)
sys.setdefaultencoding("utf-8")
__baseurl__ = "http://bj.58.com/"
__initurl__ = "ht
I was reading the scraping chapter recently and came across an expression, s_urls[0]['href'], that I could not understand; I thought Python had arrays with non-numeric subscripts. After much searching I learned that this is tag querying in BeautifulSoup: https://stackoverflow.com/questions/5815747/beautifulsoup-getting-href?noredirect=1&lq=1

from bs4 import BeautifulSoup
from threading import Thread  # what does Thread mean?
import urllib.request
# location
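A minimal sketch of the access pattern that puzzled me: find_all returns a list of Tag objects, and a Tag supports dictionary-style access to its attributes, so the 'href' subscript is an attribute name, not an array index.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com">x</a>', 'html.parser')
s_urls = soup.find_all('a')   # a list of Tag objects
print(s_urls[0]['href'])      # dictionary-style attribute access -> http://example.com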
Reference links: use of bs4 and requests: https://www.cnblogs.com/baojinjin/p/6819389.html; installing pip: 80293806. Python 3.x ships with pip; if yours does not, please search Baidu for how to install it.

pip install beautifulsoup4 requests

from bs4 import BeautifulSoup
import requests

res = requests.get('https://etherscan.io/token/tokenholderchart/0x86fa049857e0209aa7d9e616f7eb3b3b78ecfdb0?range=10')
res.encoding = 'GBK'
soup = BeautifulSoup(res.text)
This article mainly addresses the case where Beautiful Soup has been installed successfully from the terminal, but the following error still occurs in IDLE:
>>> from bs4 import BeautifulSoup
Traceback (most recent call last):
  File "..."
    from bs4 import BeautifulSoup
ImportError: No module named 'bs4'
One of the possible reasons for this error (there are many) is the sudo p
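A quick way to check the usual culprit, a different interpreter behind the terminal and IDLE, is the sketch below; run it in both and compare. If the paths differ, install bs4 with the interpreter IDLE uses, e.g. /path/to/that/python -m pip install beautifulsoup4.

import sys
print(sys.executable)   # the interpreter actually running this code
print(sys.path)         # where it searches for modules such as bs4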
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Python's encoding issues are rather annoying: decode() decodes and encode() encodes. Setting # -*- coding: utf-8 -*- in the file header makes Python treat the source file as UTF-8.

# -*- coding: utf-8 -*-
__author__ = 'Administrator'
from bs4 import BeautifulSoup
import requests
import os
import sys
import io

def gethtml(url):
Find by tag:
soup.select("body a")
soup.select("html head title")

Find direct children of a tag:
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")   # [] for the sample document

Find sibling tags:
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")

Find by CSS class name:
soup.select(".sister")
soup.select("[class~=sister]")
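To make these selectors runnable, here is a minimal sketch using the "three sisters" snippet from the bs4 documentation:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select("head > title"))      # [<title>The Dormouse's story</title>]
print(soup.select("#link1 ~ .sister"))  # the two sisters that follow link1
print(soup.select(".sister"))           # all three <a class="sister"> tags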
# Crawl pictures
# Target site: http://699pic.com/sousuo-218808-13-1.html
import requests
from bs4 import BeautifulSoup
import os

r = requests.get('http://699pic.com/sousuo-218808-13-1.html')
# r.content returns the raw byte stream
soup = BeautifulSoup(r.content, 'html.parser')  # parse r.content with the HTML parser
# tu = soup.find_all('img')           # find all tags named "img"
tu = soup.find_all(class_="lazy")     # find all tags whose class is "lazy"
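A hedged continuation of the snippet above: save each matched image to disk. The data-original attribute is an assumption of mine (common for lazy-loaded images, but not confirmed by the original post).

os.makedirs("images", exist_ok=True)
for i, img in enumerate(tu):
    src = img.get("data-original") or img.get("src")  # assumed attribute names
    if not src:
        continue
    data = requests.get(src).content
    with open(os.path.join("images", "%d.jpg" % i), "wb") as f:
        f.write(data)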
To analyze who is the "king of spam posts", we first have to collect data on posts and posters. Here we test with the first 100 pages of the Baidu Tieba forum "Liyi":
# coding: utf-8
import urllib2
from bs4 import BeautifulSoup
import csv
import re
import sys

reload(sys)
sys.setdefaultencoding('utf-8')
# wb = write; ab+ = append mode
for k in range(0, 100):
    req = urllib2.Request('http://tieba.baidu.com/f?kw=liyi&ie=utf-8&pn=' + str(k * 50))
    csvfile = file('tiezi.csv', 'ab+')
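A hedged completion of the loop body in the same Python 2 style; the j_th_tit class used for post-title links on Tieba is my assumption, not something the truncated original confirms:

    html = urllib2.urlopen(req).read()
    soup = BeautifulSoup(html)
    writer = csv.writer(csvfile)
    for a in soup.find_all('a', class_='j_th_tit'):   # assumed title-link class
        writer.writerow([a.get_text().encode('utf-8'), a.get('href')])
    csvfile.close()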
soup = BeautifulSoup(html_doc)
soup is the result of BeautifulSoup parsing the formatted string. soup.title gets the title tag, and soup.p gets the first <p> tag in the document; to get all matching tags you have to use the find_all function. find_all returns a sequence, so loop over it to get each element in turn. get_text() returns the text of any tag object that BeautifulSoup has parsed; you can try print soup.p.get_text(). In fact, you can also read other attributes of a tag, such as i
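A minimal runnable sketch of the calls just described (the html_doc string here is my own stand-in):

from bs4 import BeautifulSoup

html_doc = "<p>first</p><p>second <a href='/x'>link</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.p)                    # the first <p> tag
for p in soup.find_all("p"):     # find_all returns all matching tags
    print(p.get_text())          # the text inside each tag
print(soup.a["href"])            # other attributes read like dict keys -> /x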
This is my first crawler. I chose to crawl this site because its URLs are very regular, not because of its pictures, not because of the pictures, not ...
First, the entry address of each photo set looks like this: http://www.mmjpg.com/mm/1
The address of a picture looks like this: http://img.mmjpg.com/2015/1/1.jpg
The picture URL contains a year; since I don't know which year each set belongs to, it is inconvenient to crawl down all the pictures directly.
So I looked up the picture address of the first picture f
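The original is cut off here, but the idea can be sketched: fetch the set page, read the first picture's URL to learn the year, and derive the rest from it. The img lookup below is my assumption.

import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.mmjpg.com/mm/1")
soup = BeautifulSoup(page.content, "html.parser")
first = soup.find("img")        # assumed: the first <img> is the photo
src = first["src"]              # e.g. http://img.mmjpg.com/2015/1/1.jpg
year = src.split("/")[-3]       # the year segment of the URL
print(year)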
1. Request the online resource:

import requests
res = requests.get('http://*******')
res.encoding = 'utf-8'
print(res.text)

This uses the requests get method to fetch the HTML. Whether to use get, post, and so on can be determined from the page's header information; Baidu, for example, can be fetched with get.

2. Parse the fetched page with BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')
print(soup)   # you can see the contents of the web page
for new
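The original breaks off at that for loop; a hedged guess at how it might continue (the .news selector is hypothetical):

for new in soup.select(".news"):   # hypothetical selector
    print(new.get_text())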
1. CentOS 7 uses Python 2.7 by default; do not touch it.
2. Download Python 3.5.1 from https://www.python.org/downloads/source/ (click "Gzipped source tarball"), and BS4 from https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/
3. Installation steps:
1) Go to the /usr/local directory: cd /usr/local
2) Create a new folder there: sudo mkdir python3
3) Decompress
The system comes with Python 2.7. Download the latest Python 3.5.2 from the official site: https://www.python.org/downloads/release/python-352/. Since CentOS does not come with apt-get, Python can only be downloaded and installed manually. If your Linux does have apt-get, simply run: sudo apt-get install python-bs4. BS4 downloads: https://www.crummy.com/software/BeautifulSoup/bs4/download/. Default path install Pytho
Readers may wonder about my title: writing just lxml and bs4, the names of the two Python modules, may not catch the general reader's attention. When web-page parsing techniques come up, the keywords people mention are usually BeautifulSoup and XPath, along with their respective modules (Python calls them modules, though on other platforms they are better known as libraries), and the module names themselves are rarely brought up directly.
Crawlers sometimes run into two situations that make normal crawling impossible:
(1) IP blocking (Meituan seems to do this);
(2) robots being forbidden from crawling (Amazon, for example).
Workaround: let's take the crawler code in the
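The original is cut off here; a common workaround for case (2), shown as an assumption rather than the author's actual fix, is to send a browser-like User-Agent header so the site does not treat the request as a robot:

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # pretend to be a browser
res = requests.get("https://www.amazon.com", headers=headers)
print(res.status_code)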
Subject: Maizi Academy (maiziedu.com). 1. Most of the video information lives under http://www.maiziedu.com/course/all/, and every video has its own ID, so the first address to query should be 'http://www.maiziedu.com/course/' + the ID.
Parse the page to get the title, and get the directory name used for creating the folder:
url_dict1 = {}
url = 'http://www.maiziedu.com/course/{}'.format(num)
page = urllib.request.urlopen(url)
context = page.read().decode('utf8')
title = re.search('...', context).group().strip()   # the regex pattern was lost from the original; see the sketch below
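Since the regex was lost above, here is a self-contained sketch of the same idea under stated assumptions: the course id and the <title> pattern are placeholders of mine, not the author's.

import os
import re
import urllib.request

num = 354   # hypothetical course id
url = 'http://www.maiziedu.com/course/{}'.format(num)
context = urllib.request.urlopen(url).read().decode('utf8')
match = re.search(r'<title>(.*?)</title>', context)   # assumed pattern
if match:
    title = match.group(1).strip()
    os.makedirs(title, exist_ok=True)   # create the folder named after the title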
You can refer directly to the bs4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-all
Note the following:
1. Some tag attributes cannot be used as search keyword arguments, such as the data-* attributes in HTML5:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

However, you can use the attrs parameter of the find_all() method:
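Completing that sentence with the form the documentation itself uses:

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]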