Python crawls English documents to learn words

Source: Internet
Author: User

Recently I started reading full-length English books. I can get through them roughly, but since I never passed CET-4 there are still key words that block my understanding of a sentence, and because I read on paper, looking words up is inconvenient. So I had an idea: gather the words from programming documents and study them together. That should be more targeted; just think about it, everyday literary words are unlikely to appear in technical documents, so memorizing them would not help much with reading technical material. I spent a few hours on it today and found it is entirely feasible, and the implementation is not difficult.

The main steps are as follows. First, get a translation interface. Translation sources come in two flavors: official APIs and pages you crawl. I looked at Baidu's translation page, which is not easy to crawl; Baidu does provide an official translation API, and once you apply to become a Baidu developer, the app key they issue can be used for translation, with a rich set of interfaces, but that still felt like a hassle. The other option is Google. I examined the requests behind the Google Translate page and found them surprisingly simple: there is just one set-cookie request, followed by the translation request itself. As you may have guessed, I ended up crawling Google's translation page.

Second, crawl the words out of the English document you want to analyze. A document contains a great many words, so I extract them with a crawler as well: given a page, pull out every word.

That is the idea. If you are interested, try it yourself first; if you would rather see my implementation straight away, you are very welcome. (A screenshot of the crawl-and-translate result appeared here; the image is not preserved in this copy.) Now for the code:
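One detail worth handling when calling the translation interface described above: the word goes into the URL as the q query parameter, so it should be URL-encoded rather than concatenated raw. Below is a minimal sketch of building such a URL with the standard library; `build_translate_url` is my own helper name for illustration, not part of the original code, and only a few of the interface's parameters are shown.

```python
from urllib.parse import urlencode

# Sketch: build the translate request URL with the word passed as the q
# parameter. urlencode() percent-encodes the word, so accented or otherwise
# non-ASCII words do not produce a malformed URL.
def build_translate_url(word):
    base_url = 'http://translate.google.cn/translate_a/single'
    params = urlencode({
        'client': 't',
        'sl': 'en',       # source language
        'tl': 'zh-CN',    # target language
        'ie': 'UTF-8',
        'oe': 'UTF-8',
        'q': word,        # the word to translate
    })
    return base_url + '?' + params

print(build_translate_url('naïve'))
```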

# -*- coding: utf-8 -*-
import re, urllib.parse, urllib.request, http.cookiejar

# The next four lines cache cookies. The Google translation interface will not
# translate for you unless the request carries a cookie it recognizes, and a
# plain urllib.request does not cache cookies by itself.
cj = http.cookiejar.LWPCookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(cj)
opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)

# Fetch data with a GET request. Note the headers: even though this is a GET,
# the back end of some websites checks whether the request has headers and, if
# not, treats it as a crawler and ignores it, so we simulate a browser.
def getdata(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/100'}
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    text = response.read().decode('utf-8')
    return text

# Visit the Google Translate page only once, to pick up the cookie.
google_logined = False

# Translation function
def translate(word):
    global google_logined
    if not google_logined:
        # The set-cookie request just mentioned: simply loading
        # http://translate.google.cn/ sets the cookie.
        getdata('http://translate.google.cn/')
        google_logined = True
    try:
        # This URL is the translation interface, with the word to translate
        # appended at the end.
        data = getdata("http://translate.google.cn/translate_a/single?client=t&sl=en&tl=zh-CN&hl=zh-CN&dt=bd&dt=ex&dt=ld&dt=md&dt=qc&dt=rw&dt=rm&dt=ss&dt=t&dt=at&dt=sw&ie=UTF-8&oe=UTF-8&oc=2&otf=1&srcrom=1&ssel=0&tsel=0&q=" + word)
        # The response actually contains a lot of material, but we only need the
        # first Chinese translation, so a regular expression pulls it out.
        reg = re.compile(r'^\[\[\["(.+?)","(.+?)"\]')
        tran = reg.findall(data)[0]
    except Exception:
        # At first I did not use try/except, and found that for some words the
        # translation comes back in a different JSON structure, so the regex
        # fails to match. Hence the try/except.
        return 'error ======================================>>> ' + word
    return tran

# You may have guessed this regular expression: it matches every word on a
# page. The accuracy is not great -- it will match the "span" in
# <span>apple</span> along with "apple" -- but a little post-processing can
# fix that; I will leave that for later.
parse_word_reg = re.compile('([A-Za-z]{3,})')

# A function for parsing the words on a page. The parameter is a URL; the
# content is matched with the regex above, and a set is used to deduplicate
# the result list before it is returned.
def parse_words(url):
    content = getdata(url)
    words = parse_word_reg.findall(content)
    words = list(set(words))
    return words

# Everything is ready; all that is left is the test. Feed in the URL of a
# technical document, and it starts translating the words one by one.
words = parse_words("https://docs.python.org/3/library/abc.html#module-abc")
for word in words:
    print(translate(word))
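As noted above, the word regex will also pick up tag names such as the "span" in <span>apple</span>. One minimal sketch of that post-processing (the class and function names here are mine, not from the original code) strips the markup with the standard-library HTMLParser before running the same word regex over the remaining text:

```python
import re
from html.parser import HTMLParser

# Collect only the text nodes of an HTML page, skipping tags and attributes.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

word_reg = re.compile('[A-Za-z]{3,}')

def parse_words_clean(html):
    extractor = TextExtractor()
    extractor.feed(html)
    text = ' '.join(extractor.chunks)
    # Deduplicate and sort for stable output.
    return sorted(set(word_reg.findall(text)))

print(parse_words_clean('<span>apple</span> and <b>banana</b>'))
# "span" and "b" are tag names, so they are no longer matched
```

This keeps the rest of the pipeline unchanged: `parse_words` could call a helper like this on the fetched content instead of matching the raw HTML.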

 
