Python web crawler and Information extraction--5. Information organization and extraction method

Last Update:2018-02-27 Source: Internet

Author: User

Tags chr python web crawler


1. Three forms of information markup

(1) XML (eXtensible Markup Language)

<name> ... </name>    tag with content
<name />              empty tag with no content
<!-- -->              comment
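As a rough illustration of the three XML forms above, the sketch below parses a small document with Python's standard xml.etree.ElementTree module. The document and its tag names (person, name, skill, retired) are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names are invented for illustration.
doc = """
<person>
  <name>Alice</name>     <!-- a tag with content -->
  <skill>Python</skill>
  <skill>SQL</skill>
  <retired/>             <!-- an empty tag, no content -->
</person>
""".strip()

root = ET.fromstring(doc)

name = root.find("name").text                      # text inside <name>...</name>
skills = [s.text for s in root.findall("skill")]   # repeated tags give several values
retired_text = root.find("retired").text           # an empty tag has no text: None

print(name, skills, retired_text)
```

The default parser drops the comments, so only the tags and their content reach the program.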

(2) JSON (JavaScript Object Notation): typed key-value pairs, "key": "value"

"key": "value"
"key": ["value1", "value2"]       multiple values, organized with [ , ]
"key": {"subkey": "subvalue"}     nested key-value pairs, organized with { , }

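To see what "typed" means for JSON in practice, here is a minimal sketch using Python's standard json module. The record (name, age, skills, address) is invented sample data.

```python
import json

# Hypothetical sample record; all keys and values are invented for illustration.
text = """
{
  "name": "Alice",
  "age": 30,
  "skills": ["Python", "SQL"],
  "address": {"city": "Beijing", "zip": "100081"}
}
"""

data = json.loads(text)

print(type(data["name"]).__name__)   # a quoted value becomes a str
print(type(data["age"]).__name__)    # an unquoted number becomes an int
print(data["skills"][1])             # [ , ] becomes a list
print(data["address"]["city"])       # { , } becomes a nested dict
```

Because the types survive parsing, the program can use the values directly without converting strings by hand.
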
(3) YAML (YAML Ain't Markup Language): key-value pairs without type markers, key: value

Indentation expresses ownership (nesting)
- expresses parallel items in a list
| expresses a whole block of data
# starts a comment

key: value
key: # comment
  - value1
  - value2
key:
  subkey: subvalue

2. Comparison

XML

The earliest general-purpose information markup language; highly extensible, but verbose.

Typical use: information exchange and delivery on the Internet.

JSON

Information is typed, which suits processing by programs (especially JavaScript); more concise than XML.

Typical use: communication between mobile apps, cloud services, and nodes. It has no comments.

YAML

Information is untyped, the proportion of actual text content is the highest of the three, and it reads well.

Typical use: system configuration files of all kinds. Comments make it easy to read.

3. General methods of information extraction

Method One: Fully parse the markup form of the information, then extract the key information

Advantage: the information is parsed accurately
Disadvantage: the extraction process is cumbersome and slow

Method Two: Ignore the markup form and search for the key information directly

Advantage: the extraction process is simple and fast
Disadvantage: the accuracy of the results depends on the information content

Fusion method: Combine markup parsing with searching to extract the key information

This requires both a markup parser and a text search function
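The three approaches can be compared on a tiny example: collecting URLs from an HTML fragment. This is only a sketch. It uses the standard library (html.parser and re) rather than bs4 so that it runs anywhere, and the HTML fragment, the LinkParser class, and the URL pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: one absolute link, one URL in plain text, one relative link.
html = ('<p><a href="http://example.com/a">A</a> '
        'plain text http://example.com/c '
        '<a href="/local">L</a></p>')


class LinkParser(HTMLParser):
    """Method One: parse the markup and collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for key, value in attrs if key == "href")


parser = LinkParser()
parser.feed(html)
parsed_links = parser.links

# Method Two: ignore the markup and search the raw text for URL-like strings.
searched_links = re.findall(r'http://[^\s"<>]+', html)

# Fusion: parse first, then search within the parsed values.
fused_links = [u for u in parsed_links if re.match(r'http://', u)]

print(parsed_links)    # every href, including the relative one
print(searched_links)  # every URL-like string, including the one outside any tag
print(fused_links)     # only absolute URLs that really are link targets
```

The plain search also picks up the URL that is not a link at all, which is what "accuracy depends on the information content" means in practice.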

4. HTML content search methods based on the bs4 library

<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list holding the search results

- name: search string for tag names
- attrs: search string for tag attribute values; attributes can also be matched by keyword
- recursive: whether to search all descendants; default True
- string: search string for the string content between <>...</>

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
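The parameters of find_all() are easier to follow on a concrete page. The sketch below assumes the beautifulsoup4 package is installed; the HTML snippet, its class names, and its ids are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page; the classes, ids, and URLs are invented for illustration.
html = """
<html><body>
<p class="title"><b>Demo page</b></p>
<p class="course">Courses:
<a href="http://example.com/python" class="py1" id="link1">Basic Python</a> and
<a href="http://example.com/advanced" class="py2" id="link2">Advanced Python</a>.
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

a_tags = soup.find_all("a")                          # name: all <a> tags
by_attr = soup.find_all("p", "course")               # attrs: a plain string matches the class
by_id = soup.find_all(id="link1")                    # keyword argument: match one attribute
by_regex = soup.find_all(id=re.compile("link"))      # regular expressions are accepted too
top_level = soup.find_all("a", recursive=False)      # direct children of soup only
by_string = soup.find_all(string=re.compile("Python"))  # search the text between tags
shortcut = soup("a")                                 # same as soup.find_all("a")

print([a["id"] for a in a_tags])
print(top_level)
print(by_string)
```

With recursive=False the search stops at the direct children of soup, so the nested <a> tags are not found and the result is an empty list.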

Extension methods:

<>.find()                     searches and returns only the first result; same parameters as .find_all()
<>.find_parents()             searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent()              returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings()       searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling()        returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings()   searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling()    returns one result from the preceding sibling nodes; same parameters as .find()
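A short sketch of the extension methods on a three-item list, again assuming beautifulsoup4 is installed. The HTML fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for illustration.
html = "<div><ul><li>one</li><li>two</li><li>three</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                        # only the first match
second = first.find_next_sibling("li")         # the nearest following sibling
later = first.find_next_siblings("li")         # all following siblings, as a list
last = later[-1]
earlier = last.find_previous_siblings("li")    # preceding siblings, nearest first
parent = first.find_parent("div")              # one matching ancestor
parents = [p.name for p in first.find_parents(["ul", "div"])]  # ancestors, nearest first
missing = soup.find("table")                   # no match: find() returns None

print(first.string, second.string)
print([li.string for li in later])
print([li.string for li in earlier])
print(parent.name, parents, missing)
```

The singular methods return a single tag or None, while the plural methods always return a list, which may be empty.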

5. Example: a crawler for Chinese university rankings

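# CrawUnivRankingB.py
import requests
from bs4 import BeautifulSoup
import bs4


def gethtmltext(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        # The timeout value was lost in the source; 30 seconds is an assumed value.
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillunivlist(ulist, html):
    """Append [rank, school name, total score] for each table row to ulist."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # the fetch failed or the page has no table body
        return
    for tr in tbody.children:
        # Skip the whitespace strings between rows; keep only real tags.
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # same as tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printunivlist(ulist, num):
    """Print the first num rows as centered columns."""
    # {3} supplies the fill character for the school name; chr(12288) is a
    # full-width space, which keeps Chinese school names aligned.
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "School name", "Total score", chr(12288)))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = gethtmltext(url)
    fillunivlist(uinfo, html)
    # The row count was lost in the source; 20 is an assumed value.
    printunivlist(uinfo, 20)


main()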
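The tplt template in the print function is the least obvious part of the example, so here is a small offline sketch of how it behaves. The school name and score are placeholder values, not real ranking data.

```python
tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
fill = chr(12288)  # U+3000, the full-width (ideographic) space

# Placeholder row: the name means "Example University"; the score is made up.
name = "示例大学"
row = tplt.format("1", name, "90.0", fill)

fields = row.split("\t")
print([len(f) for f in fields])    # every field is padded to width 10
print(fields[1].count(fill))       # padding added around the 4-character name
```

In {1:{3}^10}, the inner {3} is replaced by the fourth argument before formatting, so the school name is centered in a 10-character field padded with full-width spaces. Chinese characters are as wide as a full-width space, so padding with it keeps the column aligned where ordinary ASCII spaces would not.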


