A small Python script for URL decoding and Chinese text parsing (python url decoder)

Last updated: 2016-06-16. Source: Internet

Uploader: User


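The code is as follows (Python 2):

#! python
# -*- coding: utf8 -*-
# Encode the text as GBK, then turn every \xNN escape in its repr into %NN.
print(repr("測試警示，xxxx是大豬頭".decode("UTF8").encode("GBK")).replace("\\x","%"))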

Note that the codec passed to the first decode("UTF8") call must match the encoding declared at the top of the file.
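On Python 3, where str no longer has a decode method, roughly the same percent-encoding can be sketched with urllib.parse.quote. This is only a minimal sketch and not the original author's code: the helper name gbk_percent_encode and the sample text are assumptions introduced for illustration.

```python
from urllib.parse import quote, unquote

def gbk_percent_encode(text, encoding="gbk"):
    """Percent-encode text using the given byte encoding.

    Hypothetical helper, not part of the original post. safe="" makes
    quote() escape "/" as well; ASCII letters and digits are left as-is.
    """
    return quote(text, safe="", encoding=encoding)

sample = "测试"  # assumed sample text
encoded = gbk_percent_encode(sample)
print(encoded)

# Decoding with the same encoding restores the original text.
print(unquote(encoded, encoding="gbk"))
```

The encoding argument plays the same role as encode("GBK") in the one-liner above: the escapes only round-trip if the decoder is told the same encoding.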

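I first ran into this problem in a small JavaScript puzzle game, where the hint for one level read:

The first few levels are all very, very easy~~ This one is just a simple string transformation…

It was followed by a long string beginning with %5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684.
I had often seen this sort of thing in the browser address bar, but never knew how to turn it into something readable.
After some googling, I combined Python's URL decoding with its unicode-escape decoding and arrived at the following solution: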

The code is as follows (Python 2):

import urllib
escaped_str="%5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684%5Cu9875%5Cu9762%5Cu540d%5Cu5b57%5Cu662f%5Cx20%5Cx69%5Cx32%5Cx6a%5Cx62%5Cx6a%5Cx33%5Cx69%5Cx34%5Cx62%5Cx62%5Cx35%5Cx34%5Cx62%5Cx35%5Cx32%5Cx69%5Cx62%5Cx33%5Cx2e%5Cx68%5Cx74%5Cx6d"
# unquote turns %5C back into a backslash; unicode-escape then interprets \uXXXX and \xXX
print urllib.unquote(escaped_str).decode('unicode-escape')
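The same two-step decoding can be sketched for Python 3 as well. This is a hedged illustration rather than the author's script: the function name decode_escaped is hypothetical, the input is a shortened sample in the same style as the puzzle string, and the sketch assumes the unquoted text is pure ASCII.

```python
from urllib.parse import unquote

def decode_escaped(escaped):
    """Undo percent-encoding, then interpret the backslash escapes.

    Hypothetical helper. Assumes the unquoted text contains only ASCII.
    """
    # Step 1: "%5C" becomes a literal backslash, e.g. "%5Cu4e0b" -> backslash + "u4e0b".
    unquoted = unquote(escaped)
    # Step 2: the unicode_escape codec works on bytes, so encode to ASCII first.
    return unquoted.encode("ascii").decode("unicode_escape")

escaped_str = "%5Cu4e0b%5Cu4e00%5Cu5173%5Cx20%5Cx69"  # shortened sample
print(decode_escaped(escaped_str))
```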

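Recently I became interested in the Chinese terms in the gfwlist used by Firefox's autoproxy extension (those of you who have used a proxy will know what I mean). These URLs are all URL-encoded, for example http://zh.wikipedia.org/wiki/%E9%97%A8, so the URL-encoded Chinese characters have to be extracted with a regular expression. I wrote the following small script: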

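The code is as follows (Python 2):

import re
import urllib

# A URL-encoded Chinese character is a percent sign plus two hex digits,
# repeated three times, so match runs of three or more such escapes.
pattern = re.compile(r"((?:%[0-9A-Fa-f]{2}){3,})")

with open("listfile", "r") as f:
    for url_str in f:
        # findall returns a list (possibly empty), never None
        for trans in pattern.findall(url_str):
            # Convert each extracted run back into Chinese text
            print urllib.unquote(trans),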
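For comparison, here is a minimal Python 3 sketch of the same extraction step. The function name extract_encoded_text, the PERCENT_RUN constant, and the second sample line are assumptions made for illustration, and the sketch reads from an in-memory list instead of the listfile used above.

```python
import re
from urllib.parse import unquote

# Runs of three or more %XX escapes, as in the script above.
PERCENT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2}){3,}")

def extract_encoded_text(line, encoding="utf-8"):
    """Return the decoded text of every percent-encoded run in line.

    Hypothetical helper. Bytes that are invalid in the chosen encoding
    are replaced rather than raising an error.
    """
    return [unquote(run, encoding=encoding, errors="replace")
            for run in PERCENT_RUN.findall(line)]

lines = [
    "http://zh.wikipedia.org/wiki/%E9%97%A8",
    "http://example.com/a%20b",  # assumed sample: run is too short to match
]
for line in lines:
    print(extract_encoded_text(line))
```

Passing encoding="gbk" instead of the default would be the way to handle entries that were percent-encoded from GBK bytes.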

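However, this script still has shortcomings: some Chinese characters in the list file are still not decoded correctly, as in the following lines of test code.

The code is as follows (Python 2):

import urllib
a = "http://zh.wikipedia.org/wiki/%BD%F0%B6"
b = "http://zh.wikipedia.org/wiki/%E9%97%A8"
de = urllib.unquote
print de(a), de(b)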

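In the output, the former decodes correctly while the latter does not. Personally I suspect the cause may have something to do with Big5 encoding. If anyone knows a fix, please let me know.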

A follow-up reply:

de(a).decode("gbk", "ignore")
de(b).decode("utf8", "ignore")

This gives you the unicode form of these strings.
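The point that the two URLs carry bytes in different encodings can be sketched in Python 3 with urllib.parse.unquote_to_bytes, which stops at the raw-bytes stage so that the codec can be chosen explicitly. This is an illustrative sketch; the variable names raw_a, raw_b, text_a, and text_b are assumptions.

```python
from urllib.parse import unquote_to_bytes

a = "http://zh.wikipedia.org/wiki/%BD%F0%B6"
b = "http://zh.wikipedia.org/wiki/%E9%97%A8"

# Percent escapes become raw bytes; no character decoding happens yet.
raw_a = unquote_to_bytes(a)
raw_b = unquote_to_bytes(b)

# Each byte string needs its own codec. "ignore" drops bytes that do not
# form a complete character, such as the truncated trailing %B6 in a.
text_a = raw_a.decode("gbk", "ignore")
text_b = raw_b.decode("utf-8", "ignore")

print(text_a)
print(text_b)
```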

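The unquote you are using is not a decoder: it only turns the percent escapes back into raw bytes, so you still have to do the necessary decode and encode yourself. I have always used utf8 as my default environment, and I guess yours is probably gbk, which is why decoding the latter failed on your side. Guessing encodings is tiring work. It would be nice if everyone used utf8, but some people are used to gb.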

http://yac163.svn.sourceforge.net/viewvc/yac163/trunk/yac163-nox/Pic.py?revision=198&view=markup

See lines #102-147 of this very old code of mine, and add (…, "ignore") to every decode and encode call.

The code is as follows (Python 2):

def strdecode(string, charset=None):
    # Already unicode: nothing to decode.
    if isinstance(string, unicode):
        return string
    if charset:
        try:
            return string.decode(charset)
        except UnicodeDecodeError:
            # The suggested charset was wrong; fall back to guessing.
            return _strdecode(string)
    else:
        return _strdecode(string)

def _strdecode(string):
    # Try the candidate encodings in order: utf8, gb2312, gbk, gb18030.
    try:
        return string.decode('utf8')
    except UnicodeDecodeError:
        try:
            return string.decode('gb2312')
        except UnicodeDecodeError:
            try:
                return string.decode('gbk')
            except UnicodeDecodeError:
                return string.decode('gb18030')

def strencode(string, charset=None):
    # Already a byte string: nothing to encode.
    if isinstance(string, str):
        return string
    if charset:
        try:
            return string.encode(charset)
        except UnicodeEncodeError:
            # The suggested charset cannot represent the text; fall back.
            return _strencode(string)
    else:
        return _strencode(string)

def _strencode(string):
    # Same fallback order as _strdecode.
    try:
        return string.encode('utf8')
    except UnicodeEncodeError:
        try:
            return string.encode('gb2312')
        except UnicodeEncodeError:
            try:
                return string.encode('gbk')
            except UnicodeEncodeError:
                return string.encode('gb18030')
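The try-each-encoding idea above can be written more compactly as a loop. The following is a minimal Python 3 sketch under stated assumptions, not a port of the linked code: the names CANDIDATE_CHARSETS and decode_with_fallback are hypothetical, and the final errors="replace" fallback is an addition that the original functions do not have (they let the last decode raise).

```python
# Same candidate order as the functions above.
CANDIDATE_CHARSETS = ("utf-8", "gb2312", "gbk", "gb18030")

def decode_with_fallback(data, charset=None):
    """Decode bytes by trying a suggested charset, then the candidates.

    Hypothetical helper. Text that is already str is returned unchanged.
    """
    if isinstance(data, str):
        return data
    charsets = ((charset,) if charset else ()) + CANDIDATE_CHARSETS
    for name in charsets:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    # Assumption of this sketch: degrade gracefully instead of raising.
    return data.decode("utf-8", "replace")

print(decode_with_fallback("中文".encode("gbk")))
print(decode_with_fallback("中文".encode("utf-8")))
```

Trying utf-8 first is a reasonable order because GBK-encoded Chinese text is usually not valid UTF-8, whereas many UTF-8 byte sequences would decode under GBK without error and produce the wrong characters.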



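This article was originally written in Chinese and published on aliyun.com; an English version is also available, for information purposes only. This website makes no express or implied representations or warranties regarding the accuracy, completeness, or reliability of the article or of any translation of it. If you have any concerns or complaints about the article, please email info-contact@alibabacloud.com with a detailed description. Staff will contact you within 5 working days, and infringing content will be removed once verified.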
