Pitfalls of using bs4 and urllib2 to fetch web pages
Today I spent a day crawling news from the Sina portal with Python. The task itself is not difficult; the key is that I got stuck on the following three issues.
Issue 1: Sina news returns gzip-compressed data
After read()ing the data, you want to decode() the byte string into a unicode string. That is the usual approach, but it fails here because the body is still gzip-compressed and must be decompressed first.
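The fix can be sketched with the Python 3 standard library (the original post used urllib2 under Python 2.7; the URL handling here is illustrative): detect the gzip magic bytes and decompress before decoding.

```python
import gzip
import io
import urllib.request

def gunzip_if_needed(raw):
    # gzip streams begin with the magic bytes 0x1f 0x8b
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw

def fetch_text(url):
    # Decompress first, then decode -- decoding the gzip bytes directly
    # is exactly the pitfall described above.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        return gunzip_if_needed(resp.read()).decode("utf-8", errors="replace")
```

Note that the real page may declare a charset other than utf-8 (e.g. gbk), so check the page's encoding before choosing the codec.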
Python 3 practice: getting data from a website (Carbon Market Data-GD) with bs4/BeautifulSoup
Depending on your needs, you may want to obtain some data from a website only to find that the link to it is hidden; you have to inspect the source code in the browser to discover the real link.
In the case below, the data is crawled directly from that real link.
In addition, it turns out that the "lxml" table cannot be parsed directly with pandas' read_html.
# print the first tag whose id attribute equals 'gz_gszze'
print(soup.find(id='gz_gszze'))
# print the text content of that tag
print(soup.find(id='gz_gszze').get_text())
# get all the text content of the document
print(soup.get_text())
# print all the attribute information of the first <a> tag
print(soup.a.attrs)
# loop over the <a> tags and print each link's href attribute
for link in soup.find_all('a'):
    print(link.get('href'))
# loop over the children of soup.p
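For reference, here is a self-contained version of those calls against a made-up snippet of HTML (the id value is taken from the post, but the page content and links are placeholders):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <p id="gz_gszze">GDEA closing price</p>
  <a href="/page1">first</a>
  <a href="/page2">second</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find(id="gz_gszze"))             # the whole <p> tag
print(soup.find(id="gz_gszze").get_text())  # GDEA closing price
print(soup.a.attrs)                         # {'href': '/page1'}
for link in soup.find_all("a"):
    print(link.get("href"))
```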
I. Development environment
(1) Win10
(2) Python 2.7
(3) PyCharm
II. The class that saves the data to Excel
import xlwt

class SaveBallDate(object):
    def __init__(self, items):
        self.items = items
        self.run(self.items)

    def run(self, items):
        fileName = u'shuangse qiu.xls'.encode('GBK')
        book = xlwt.Workbook(encoding='utf8')
        sheet = book.add_sheet('Ball', cell_overwrite_ok=True)
        sheet.write(0, 0, u'lottery Date'.encode('utf8'))
Functions used for handling the JSON format:
import json
json.dumps(): converts a dictionary or list to a JSON-formatted string
json.loads(): converts a JSON-formatted string to a Python object
json.dump(): converts a dictionary or list to a JSON-formatted string and writes it to a file
json.load(): reads a JSON-formatted string from a file into a Python object
Front-end processing: converts a JSON-formatted string to a
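A minimal round-trip illustrating the four functions (the record is made up):

```python
import json

record = {"name": "double color ball", "numbers": [3, 7, 22]}
s = json.dumps(record)          # dict -> JSON string
back = json.loads(s)            # JSON string -> dict
print(back["numbers"])          # [3, 7, 22]

# dump/load do the same thing against a file object.
with open("record.json", "w") as f:
    json.dump(record, f)
with open("record.json") as f:
    print(json.load(f) == record)   # True
```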
Using regular expressions to find tag content:
recursive
soup.find_all('a', recursive=False)  # returns []: there is no <a> tag among the node's direct children
string
soup.find_all(string='Basic Python')  # ['Basic Python']
import re
soup.find_all(string=re.compile('python'))  # every string in which 'python' occurs
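A self-contained sketch of the three lookups above, against a made-up document:

```python
from bs4 import BeautifulSoup
import re

html = ("<html><body><p>Basic Python</p>"
        "<p>advanced python</p><a href='/x'>x</a></body></html>")
soup = BeautifulSoup(html, "html.parser")

# recursive=False searches only the direct children of <html> (just <body>)
print(soup.html.find_all("a", recursive=False))    # []
print(soup.find_all(string="Basic Python"))        # ['Basic Python']
# the regex is case-sensitive, so 'Basic Python' is not matched
print(soup.find_all(string=re.compile("python")))  # ['advanced python']
```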
I've been posting in a forum thread recently and wanted to save all of the thread starter's posts, but copying them one by one was too much trouble. So I made a very simple semi-automatic extraction tool. It has no login-and-crawl feature, because that would be more trouble than it's worth; it simply works without logging in.
In essence, it filters the thread starter's posts out by tag: you open the page, choose "show thread starter's posts only", save the page as HTML, and then run this program on the saved file.
#!/usr/bin/env python
The prettify() method of the bs4 library:
It can be used to print a tag:
For HTML containing Chinese text, it can also be printed directly:
Methods for searching HTML content with the bs4 library
name: a string matched against the tag name.
(import re imports the regular-expression library.)
attrs: a string matched against tag attribute values; like name, it can also be a regular expression.
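These parameters can be sketched as follows; the document and the class name are made up for illustration:

```python
from bs4 import BeautifulSoup
import re

html = "<html><body><p class='title'>你好，世界</p><b>bold</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())                     # re-indented HTML; Chinese prints fine
# name given as a regex: matches tag names starting with 'b' (<body>, <b>)
print(soup.find_all(re.compile("^b")))
# attrs lookup by attribute value
print(soup.find_all(attrs={"class": "title"}))
```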
Tag example: http://xyzp.haitou.cc/article/722427.html
The first step is simply to download each page; os.system("wget " + str(url)) or urllib2.urlopen(url) both work and are simple enough not to dwell on. Then the main act, extracting the information:

#!/usr/bin/env python
# coding=utf-8
from bs4 import BeautifulSoup
import codecs
import sys
import os
reload(sys)
sys.setdefaultencoding("utf-8")
import re
from pymongo import MongoClient

def get_jdstr(fname):
    soup = ""
    retdict = {}
soup.a.next_sibling.next_sibling
# and going backwards:
soup.a.previous_sibling
soup.a.previous_sibling.previous_sibling
# parallel (sibling) traversal of the tag tree
# traverse the siblings that follow
for sibling in soup.a.next_siblings:
    print(sibling)
# traverse the siblings that precede
for sibling in soup.a.previous_siblings:
    print(sibling)

HTML formatting and encoding with bs4
Formatting: when we call soup.prettify(), prettify() adds newline characters to the HTML so that the file is output in a properly readable form.
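A runnable sketch of sibling traversal over an assumed three-link paragraph:

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so siblings are the tags themselves.
html = "<p><a>one</a><a>two</a><a>three</a></p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.next_sibling)                 # <a>two</a>
print(soup.a.next_sibling.next_sibling)    # <a>three</a>
for sibling in soup.a.next_siblings:       # every sibling after the first <a>
    print(sibling)
```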
# https://www.crummy.com/software/beautifulsoup/bs4/doc/index.zh.html#find-all
# BeautifulSoup can parse HTML. Install it with pip install beautifulsoup4; the module is imported under the name bs4.
import bs4
noStarchSoup = bs4.BeautifulSoup(res.text)
# The bs4.BeautifulSoup() function returns a BeautifulSoup object.
I recently used a crawler in a project and read through a lot of modules, finally settling on requests and bs4 as the most useful; here I'll clear up the points that confused me.
requests loads the page; BeautifulSoup does the parsing.
response = requests.get('http://www.infoq.com/cn/articles')
print response.status_code
# >>> 200
Here 200 is the HTTP status code; 200 means the server processed the request successfully.
html = response.text
The line above retrieves the page source as text.
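Putting the two steps together, one possible shape of the flow is below (Python 3); the 'h3 a' selector is an assumption for illustration, not taken from infoq.com:

```python
import requests
from bs4 import BeautifulSoup

def extract_titles(html):
    # Pure parsing, no network: pull the text of every link under an <h3>
    # (assumed article-list layout).
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("h3 a")]

def fetch_titles(url):
    response = requests.get(url)
    if response.status_code != 200:
        raise RuntimeError("unexpected status: %d" % response.status_code)
    return extract_titles(response.text)
```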
{Code ...} The above is my code. After calling soup.find_all(), I get 64 tag fragments from the Coursera page; looping over them and writing the results to a file then yields the names of the 64 courses, as shown below.
Python web crawler and information extraction (2): BeautifulSoup
BeautifulSoup official introduction:
Beautiful Soup is a Python library that can extract data from HTML or XML files. Working with your favorite parser, it provides the usual ways of navigating, searching, and modifying the document.
https://www.crummy.com/software/BeautifulSoup/
Installing BeautifulSoup
Find "cmd.exe" in "C:\Windows\System3
1. Installing beautifulsoup4
easy_install method (easy_install must already be installed):
easy_install beautifulsoup4
pip method (pip must also already be installed). Note that PyPI also carries a package named BeautifulSoup; that is the release of Beautiful Soup 3, and installing it is not recommended here.
pip install beautifulsoup4
Debian or Ubuntu:
apt-get install python-
Learning notes for the Python crawler library BeautifulSoup. Contents:
What is BeautifulSoup?
bs4 usage
Importing the module
Selecting a parser
Searching by tag name
Searching with find/find_all
Searching with select
What is BeautifulSoup:
It is a Python library that can extract data from HTML or XML files, working through your favorite parser.
Learning notes on Python network data collection, chapters 1 and 2: data collection
If your English isn't up to the original, the Chinese edition is the only option, but the translation from the Posts and Telecom Press is really poor.
That's the aside; the main text follows.
We recommend that you install Python 3 or a later version of Python.
Python network data collection (3): storing data in CSV and MySQL
First, warm up by downloading all the pictures on a page.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.16193'}
start_url = 'https://www.pythonscraping.com'
r = requests.get(start_url, headers=headers)
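One way the warm-up could continue is sketched below; the page structure of pythonscraping.com is assumed rather than verified, and only the parsing helper avoids the network:

```python
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def image_urls(html, base_url):
    # Collect the absolute URL of every <img> on the page (pure parsing).
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, img["src"])
            for img in soup.find_all("img") if img.get("src")]

def download_images(start_url, dest="images"):
    # Fetch the page, then fetch and save each image it references.
    r = requests.get(start_url)
    os.makedirs(dest, exist_ok=True)
    for url in image_urls(r.text, start_url):
        data = requests.get(url).content
        with open(os.path.join(dest, os.path.basename(url)), "wb") as f:
            f.write(data)
```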
The content on this page is sourced from the Internet and does not represent Alibaba Cloud's opinion;
the products and services mentioned on this page have no relationship with Alibaba Cloud. If the
content of the page confuses you, please write us an email and we will handle the problem
within 5 days of receiving it.
If you find any instances of plagiarism from the community, please send an email to:
info-contact@alibabacloud.com
and provide relevant evidence. A staff member will contact you within 5 working days.