Python Web Crawler and Information Extraction -- 5. Information Organization and Extraction Methods


1. Three Forms of Information Markup

(1) XML (eXtensible Markup Language)

<name> ... </name>   tag with content
<name/>              tag with no content
<!-- ... -->         comment
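The three tag forms above can be parsed with Python's standard-library xml.etree.ElementTree. A minimal sketch (the <person>, <name>, and <retired> tags are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny document using all three forms:
doc = """<person>
    <name>Alice</name>  <!-- tag with content -->
    <retired/>          <!-- tag with no content -->
</person>"""

root = ET.fromstring(doc)
print(root.find('name').text)     # text inside <name>...</name>: 'Alice'
print(root.find('retired').text)  # an empty tag carries no text: None
```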

(2) JSON (JavaScript Object Notation): typed key-value pairs, key:value

"key": "value"
"key": ["value1", "value2"]    multiple values organized with [,]
"key": {"subkey": "subvalue"}  nested key-value pairs organized with {,}
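Python's standard-library json module maps these forms directly onto typed Python objects. A minimal sketch (the keys and values are made up for illustration):

```python
import json

# The three key-value forms above: single value, multi-value [..], nested {..}
text = '{"name": "Alice", "langs": ["python", "js"], "addr": {"city": "Beijing"}}'
obj = json.loads(text)

print(type(obj["langs"]).__name__)  # list -- JSON values carry type information
print(obj["addr"]["city"])          # nested key-value pair: 'Beijing'
```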

(3) YAML (YAML Ain't Markup Language): untyped key-value pairs, key:value

Indentation expresses nesting (ownership)

- expresses sibling (parallel) items

| expresses a whole block of text; # marks a comment

key: value
key:        # comment
 - value1
 - value2
key:
  subkey: subvalue
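A YAML document like the one above can be loaded with the third-party PyYAML package (an assumption here, not part of the standard library; install with pip install pyyaml). A minimal sketch:

```python
import yaml  # third-party PyYAML: pip install pyyaml

# The constructs above: '#' comment, '-' siblings, indentation for nesting
text = """
key1: value      # a comment
key2:
 - value1
 - value2
key3:
  subkey: subvalue
"""
data = yaml.safe_load(text)
print(data["key2"])            # ['value1', 'value2']
print(data["key3"]["subkey"])  # 'subvalue'
```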

2. Comparison

XML

The earliest general-purpose information markup language; highly extensible, but verbose.
Typical use: information interaction and transfer on the Internet.

JSON

Information carries type; well suited to program processing (especially JavaScript); more concise than XML.
Typical use: communication between mobile apps, the cloud, and nodes; no comments allowed.

YAML

Information carries no type; the highest proportion of the text is information; very readable.
Typical use: configuration files of all kinds; comments make them easy to read.

3. General methods of information extraction

Method One: fully parse the marked-up form of the information, then extract the key information.

Advantage: accurate parsing of the information
Disadvantage: the extraction process is cumbersome and slow

Method Two: ignore the markup form and search for the key information directly.

Advantage: the extraction process is simple and fast
Disadvantage: the accuracy of the results depends on the information content

Fusion method: combine formal parsing with searching to extract key information.

Requires both a markup parser and a text-search function.
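As a minimal sketch of the fusion method, the snippet below first parses the markup with bs4 (the third-party package used later in this article) and then searches inside the parsed tree with a regular expression; the HTML is made up for illustration:

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<html><body><p>Email: a@b.com</p><p>No address here</p></body></html>'
soup = BeautifulSoup(html, "html.parser")        # step 1: parse the markup form

# step 2: search the parsed tree for <p> tags whose string contains '@'
hits = soup.find_all('p', string=re.compile('@'))
for p in hits:
    print(p.string)   # Email: a@b.com
```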

4. HTML Content Lookup Based on the bs4 Library

<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list that stores the results of the lookup.

- name: string to match against tag names
- attrs: string to match against tag attribute values; supports retrieval by attribute
- recursive: whether to search all descendants, default True
- string: string to match within the string areas of <>...</>

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)

Extension methods:

<>.find() searches and returns only one result; same parameters as .find_all()
<>.find_parents() searches the ancestor nodes, returns a list; same parameters as .find_all()
<>.find_parent() returns one result from the ancestor nodes; same parameters as .find()
<>.find_next_siblings() searches the subsequent sibling nodes, returns a list; same parameters as .find_all()
<>.find_next_sibling() returns one result from the subsequent siblings; same parameters as .find()
<>.find_previous_siblings() searches the preceding sibling nodes, returns a list; same parameters as .find_all()
<>.find_previous_sibling() returns one result from the preceding siblings; same parameters as .find()
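A quick sketch of a few of these methods on a made-up snippet (assuming bs4 is installed):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<div id="box"><p>first</p><p>second</p></div>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find('p')                       # only one result, vs. find_all's list
print(p.string)                          # first
print(p.find_parent('div')['id'])        # box -- searches ancestor nodes
print(p.find_next_sibling('p').string)   # second -- next sibling node
```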

5. Chinese University Ranking Crawler Example

# CrawUnivRankingB.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):   # skip NavigableString children
            tds = tr('td')                    # equivalent to tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # chr(12288) is the fullwidth space, used as the fill character
    # so that Chinese school names align in columns
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "School name", "Total score", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 univs

main()

