The project needs a translation function: submit text to Google Translate and read back the returned result.
First, let's look at how the Google Translate web page talks to its server:
Request Processing
After submitting a translation in the browser, inspect what the request and response look like:
The request is submitted to the following URL:
url = http://translate.google.cn/translate_a/t
The submitted form contains the following fields:
sl = source language = en (English)
tl = target language = zh-CN (Simplified Chinese)
ie / oe = input/output encoding = UTF-8
q = query = "this is a dog"
This gives us our POST data:
values = {'client': 't', 'sl': 'en', 'tl': 'zh-CN', 'hl': 'zh-CN', 'ie': 'UTF-8', 'oe': 'UTF-8', 'prev': 'btn', 'ssel': '0', 'tsel': '0', 'q': text}
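As a small sketch of what actually gets sent (assuming text = "this is a dog", the example query above), urllib.urlencode turns this dictionary into the URL-encoded POST body:

import urllib

text = "this is a dog"   # the example query from above
values = {'client': 't', 'sl': 'en', 'tl': 'zh-CN', 'hl': 'zh-CN', 'ie': 'UTF-8', 'oe': 'UTF-8', 'prev': 'btn', 'ssel': '0', 'tsel': '0', 'q': text}
# produces something like "client=t&sl=en&tl=zh-CN&...&q=this+is+a+dog" (key order may vary)
print urllib.urlencode(values)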
Next, check the request headers:
They include a User-Agent (browser) string. We are not sure whether it is required, but we add one anyway:
browser = "Mozilla/5.0 (Windows NT 6.1; WOW64)"
Putting the above together, we build the request:
values = {'client': 't', 'sl': 'en', 'tl': 'zh-CN', 'hl': 'zh-CN', 'ie': 'UTF-8', 'oe': 'UTF-8', 'prev': 'btn', 'ssel': '0', 'tsel': '0', 'q': text}
url = "http://translate.google.cn/translate_a/t"
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
browser = "Mozilla/5.0 (Windows NT 6.1; WOW64)"
req.add_header('User-Agent', browser)
Then we fetch the page:
response = urllib2.urlopen(req)
get_page = response.read()
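The code above assumes the request always succeeds. A defensive variant (my addition, not part of the original flow; it continues from the req built above, with urllib2 already imported) would catch network errors:

try:
    response = urllib2.urlopen(req, timeout=10)
    get_page = response.read()
except urllib2.HTTPError as e:
    # the server answered, but with an error status code
    print "HTTP error:", e.code
except urllib2.URLError as e:
    # the server could not be reached at all
    print "URL error:", e.reason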
Response Processing
Now look at the response we get back:
A short query is not very revealing, but after submitting a longer sentence we can see the format Google Translate returns (the full response is too long to reproduce here):
[[first-sentence translation, original text, pronunciation], [second-sentence translation, original text, pronunciation], ...], [other information (word meanings and so on)]
Therefore, we can use the following two-step regular-expression match to extract the text:
text_page = re.search(r'\[\[.*?\]\]', get_page).group()
rex = re.compile(r'\[\".*?\",')
re.findall(rex, text_page)
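To make the two steps concrete, here is a sketch run on a hand-made string shaped like the format described above (an illustration, not an actual captured response):

# -*- coding: utf-8 -*-
import re

# hypothetical response: [[translation, original, pronunciation], ...], [other info]
get_page = '[[["这是","This is",""],["一条狗","a dog",""]],[["other info"]]]'

# step 1: keep only the leading [[...]] block that holds the sentence translations
text_page = re.search(r'\[\[.*?\]\]', get_page).group()

# step 2: grab each leading ["translation", fragment
rex = re.compile(r'\[\".*?\",')
matches = re.findall(rex, text_page)
# matches is now ['["这是",', '["一条狗",'] and still needs cleaning up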
Finally, we find that each matched item still carries extra characters (brackets, quotes, commas), so we strip them out:
item = item.replace('[', "")
item = item.replace('",', "")
item = item.replace('"', "")
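For items shaped like the matches above (of the form '["...",'), the three replace calls can also be written as a single strip of the unwanted characters at both ends; this is just a compact alternative, not what the original code uses:

# drop '[', '"' and ',' from both ends of the matched fragment
item = item.strip('[",')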
Final program
import re
import urllib
import urllib2


def translate(text):
    """translate English to Chinese"""
    values = {'client': 't', 'sl': 'en', 'tl': 'zh-CN', 'hl': 'zh-CN', 'ie': 'UTF-8', 'oe': 'UTF-8', 'prev': 'btn', 'ssel': '0', 'tsel': '0', 'q': text}
    url = "http://translate.google.cn/translate_a/t"
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    browser = "Mozilla/5.0 (Windows NT 6.1; WOW64)"
    req.add_header('User-Agent', browser)
    response = urllib2.urlopen(req)
    get_page = response.read()
    text_page = re.search(r'\[\[.*?\]\]', get_page).group()
    text_list = []
    rex = re.compile(r'\[\".*?\",')
    for item in re.findall(rex, text_page):
        item = item.replace('[', "")
        item = item.replace('",', "")
        item = item.replace('"', "")
        text_list.append(item)
    text_result = "".join(text_list)
    return text_result
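A quick usage sketch (Python 2, since the program relies on urllib and urllib2; the exact output depends on what the service returns):

if __name__ == '__main__':
    # translate the example sentence used earlier in the article
    print translate("this is a dog")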