1. Three forms of information markup
(1) XML (eXtensible Markup Language): tag-based markup
<name> ... </name> a tag with content
<name/> an empty tag with no content
<!-- --> a comment
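As a small illustration of the three XML forms above, the sketch below parses a made-up document with Python's standard xml.etree.ElementTree module. The element names (person, name, nickname) are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a comment, a tag with content, and an empty tag.
XML_TEXT = """<person>
  <!-- this comment is dropped by the default parser -->
  <name>Alice</name>
  <nickname/>
</person>"""

root = ET.fromstring(XML_TEXT)

name_text = root.find("name").text          # content between <name> and </name>
nickname_text = root.find("nickname").text  # an empty tag has no text, so None
child_tags = [child.tag for child in root]  # the comment is not among the children

print(name_text, nickname_text, child_tags)
```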
(2) JSON (JavaScript Object Notation): typed key-value pairs, "key": value
"key": "value"
"key": ["value1", "value2"] multiple values are grouped with [ , ]
"key": {"subkey": "subvalue"} key-value pairs are nested with { , }
(3) YAML (YAML Ain't Markup Language): untyped key-value pairs, key: value
Indentation expresses ownership (nesting)
- expresses parallel (sibling) items
| expresses a whole block of data
# starts a comment
key: value
key: # comment
  - value1
  - value2
key:
  subkey: subvalue
2. Comparison
XML
The earliest general-purpose information markup language; highly extensible but verbose
Used for information exchange and delivery on the Internet
JSON
Information is typed and well suited to processing by programs (JavaScript); more concise than XML
Used for communication between mobile apps, the cloud, and nodes; does not support comments
YAML
Information is untyped; the highest proportion of actual text content; good readability
Used for all kinds of system configuration files; supports comments and is easy to read
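To make the comparison concrete, here is a minimal sketch that writes the same hypothetical record in XML and in JSON and parses both with the standard library. The record and its field names are invented for illustration. YAML is left out because the Python standard library has no YAML parser.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record written in both formats.
XML_TEXT = (
    "<univ><name>Demo University</name><rank>1</rank>"
    "<tags><tag>a</tag><tag>b</tag></tags></univ>"
)
JSON_TEXT = '{"name": "Demo University", "rank": 1, "tags": ["a", "b"]}'

root = ET.fromstring(XML_TEXT)
xml_rank = root.find("rank").text          # XML element text is always a string
json_rank = json.loads(JSON_TEXT)["rank"]  # JSON keeps the number as a number

print(type(xml_rank).__name__, type(json_rank).__name__)
print(len(XML_TEXT), len(JSON_TEXT))       # the JSON form is the shorter one
```

The XML value has to be converted by the program before it can be used as a number, while the JSON value already carries its type.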
3. General methods of information extraction
Method One: fully parse the markup format of the information, then extract the key information
Advantage: the information is parsed accurately
Disadvantage: the extraction process is cumbersome and slow
Method Two: ignore the markup format and search for the key information directly
Advantage: the extraction process is simple and fast
Disadvantage: the accuracy of the result depends on the content of the information
Fusion method: combine format parsing with searching to extract the key information
Requires a markup parser and a text search function
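The trade-off between Method One and Method Two can be sketched on one task: collecting link URLs from a page. The snippet and the LinkCollector class below are hypothetical. They use only the standard library (html.parser and re) and are meant as an illustration, not as the course's own implementation.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: one real link, plus a URL that only appears in plain text.
HTML_TEXT = (
    '<p>Visit <a href="http://example.com/a">A</a>.</p>'
    '<p>Plain text mentioning http://example.com/b is not a link.</p>'
)


class LinkCollector(HTMLParser):
    """Method One: parse the markup and keep the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(HTML_TEXT)
parsed_links = collector.links

# Method Two: ignore the markup and search the raw text for URL-like strings.
searched_links = re.findall(r"http://[\w./]+", HTML_TEXT)

print(parsed_links)
print(searched_links)
```

The search is shorter to write, but it also returns the URL that is only mentioned in the text, which is why its accuracy depends on the content.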
4. HTML content search methods based on the bs4 library
<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list containing the search results
- name: string to match against tag names
- attrs: string to match against tag attribute values; a specific attribute can be named
- recursive: whether to search all descendants; default True
- string: string to match against the text in <>...</>
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
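A minimal sketch of each find_all parameter follows. It assumes the bs4 package is installed, and the HTML snippet is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration.
HTML_TEXT = """<html><body>
<p class="title"><b>Demo page</b></p>
<p class="course">Courses:
<a href="http://example.com/basic" id="link1">Basic Python</a> and
<a href="http://example.com/advanced" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(HTML_TEXT, "html.parser")

# name: match by tag name
a_tags = soup.find_all("a")
# attrs: match by attribute value
course_ps = soup.find_all("p", attrs={"class": "course"})
# keyword argument: match a named attribute
link1 = soup.find_all(id="link1")
# string: match the text inside tags, here with a regular expression
python_texts = soup.find_all(string=re.compile("Python"))
# recursive=False: look only at direct children of soup (just <html> here)
top_level_a = soup.find_all("a", recursive=False)
# calling the object directly is shorthand for find_all
shorthand_a = soup("a")

print(len(a_tags), len(course_ps), len(top_level_a))
```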
Extension methods:
<>.find() searches and returns only one result; same parameters as .find_all()
<>.find_parents() searches the ancestor nodes and returns a list; same parameters as .find_all()
<>.find_parent() returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings() searches the following sibling nodes and returns a list; same parameters as .find_all()
<>.find_next_sibling() returns one result from the following sibling nodes; same parameters as .find()
<>.find_previous_siblings() searches the preceding sibling nodes and returns a list; same parameters as .find_all()
<>.find_previous_sibling() returns one result from the preceding sibling nodes; same parameters as .find()
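The extension methods can be tried on a small tree. This is a sketch that assumes bs4 is installed; the markup and the id values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling links inside a paragraph inside a div.
HTML_TEXT = (
    '<div id="box"><p id="intro">'
    '<a id="link1">one</a><a id="link2">two</a><a id="link3">three</a>'
    '</p></div>'
)
soup = BeautifulSoup(HTML_TEXT, "html.parser")

middle = soup.find("a", id="link2")  # find(): the first match only
missing = soup.find("table")         # find() returns None when nothing matches

parent_id = middle.find_parent("p")["id"]
ancestor_names = [t.name for t in middle.find_parents(["p", "div"])]
next_ids = [t["id"] for t in middle.find_next_siblings("a")]
prev_ids = [t["id"] for t in middle.find_previous_siblings("a")]
next_one = middle.find_next_sibling("a")["id"]

print(parent_id, ancestor_names, next_ids, prev_ids, next_one)
```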
5. Example: Chinese university ranking crawler
# CrawUnivRankingB.py
import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        # The timeout value was lost in the source; 30 seconds is an assumed value.
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, score] for each table row of the ranking page."""
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        # Skip the whitespace strings between rows; keep only real tags.
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    """Print the first num rows; the name column is padded with chr(12288)."""
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "School name", "Total score", chr(12288)))
    for u in ulist[:num]:
        print(tplt.format(u[0], u[1], u[2], chr(12288)))


def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    # The row count was lost in the source; 20 is an assumed value.
    printUnivList(uinfo, 20)  # 20 univs


main()
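The print template in the crawler relies on a nested format field: in {1:{3}^10} the fill character is taken from argument 3, which is chr(12288), the fullwidth space (U+3000). Padding Chinese names with a fullwidth space instead of an ASCII space keeps the columns aligned. The row below is hypothetical and only shows the padding.

```python
# Same template as in the crawler; field 1 takes its fill character from argument 3.
tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
fill = chr(12288)  # U+3000, the fullwidth space

# Hypothetical row, used only to show how the name column is padded.
row = tplt.format("1", "清华大学", "95.9", fill)
columns = row.split("\t")

print(row)
print(repr(columns[1]))  # the name centred in 10 characters, padded with U+3000
```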
Python web crawler and Information extraction--5. Information organization and extraction method