I remember that when I first learned Python by writing crawlers, I often ran into encoding problems (in fact, Python 3 has very few of them...), and the requests library made them easy to deal with. Recently some programmers who are learning Python with me wanted to build an e-book website and needed a crawler for it, so I gave it a try... Of course, the first obstacle was an "encoding problem", and this time requests was no help.
I noticed that after searching on the novel website, the page jumps to a URL of this form: http://so.biquge.la/cse/search?s=7138806708853866527&q=%CD%EA%C3%C0%CA%C0%BD%E7
Searching for different content changes only the part after &q=. At first I thought it was encrypted (well, I really am a newbie...). An expert told me it is just an encoding, and that I should use urllib.parse.unquote (in Python 2 it is urllib.unquote).
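For the search URL above in particular, here is a minimal sketch of pulling the keyword out of the query string with the standard library. Using urlsplit and parse_qs is my own illustration, not something the original code did, and decoding as GBK is an assumption based on the site's page encoding, which is discussed further down.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://so.biquge.la/cse/search?s=7138806708853866527&q=%CD%EA%C3%C0%CA%C0%BD%E7"

# Take only the part after "?" and percent-decode every value.
# Assumption: the site encodes its search terms as GBK.
query = urlsplit(url).query
params = parse_qs(query, encoding="gbk")

# parse_qs returns a list per key, because a key may repeat.
keyword = params["q"][0]
print(keyword)
```

parse_qs calls the same decoding logic as unquote for each value, so the encoding argument has the same meaning in both.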
In Python 3, unquote is used like this:

    from urllib import parse

    city = parse.unquote('%E5%B1%B1%E8%A5%BF')  # encoding='utf-8' is the default
    print(city)  # 山西 (Shanxi)
This example came from someone I asked, and it runs perfectly. But when I applied the same pattern to my own string, the result was garbled. After checking, I found that it depends on the encoding of the web page (the percent-encoded strings here were copied from web pages). The page in the example is encoded as UTF-8, while the novel website I want to parse is encoded as GBK. So the code was changed as follows:

    name = parse.unquote('%ce%e4%b6%af%c7%ac%c0%a4', encoding='gb18030')  # 'gbk' also works
    print(name)  # 武动乾坤 (Wu Dong Qian Kun, "Martial Universe")
In other words, the first example implicitly uses the default, encoding='utf-8'. (PS: for the difference between GBK and GB18030, refer to this article.)
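To make the garbling concrete, here is a small sketch comparing the default decoding with the explicit GBK one. The variable names are only for illustration. By default unquote decodes with errors='replace', so bytes that are not valid UTF-8 quietly turn into U+FFFD replacement characters rather than raising an error.

```python
from urllib.parse import unquote

encoded = "%ce%e4%b6%af%c7%ac%c0%a4"  # GBK bytes, percent-encoded

# Default: decoded as UTF-8 with errors="replace", so the invalid
# byte sequences become U+FFFD replacement characters (garbled text).
garbled = unquote(encoded)

# With errors="strict" the mismatch is reported instead of hidden.
try:
    unquote(encoded, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Naming the right encoding gives the expected text.
decoded = unquote(encoded, encoding="gbk")

print(repr(garbled), strict_failed, decoded)
```

Passing errors='strict' while debugging can be handy, since a wrong encoding guess then fails loudly.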
Now that decoding works, the natural next question is: how do we encode the text back? Here is the reverse operation:
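
    x = parse.quote('武动乾坤', encoding='gb18030')
    print(x)

Output:

    %CE%E4%B6%AF%C7%AC%C0%A4

(quote emits uppercase hex digits; unquote accepts either case, so this is the same value as the lowercase string decoded earlier.)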
It is as simple as you would imagine: just change unquote to quote.
With this, I understand encoding problems a little better. Of course, there is still a long way to go!
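For a crawler, the encoded keyword usually has to go back into a search URL. A minimal sketch, assuming the site expects GBK and the same s and q parameters seen in the search URL earlier; using urlencode here is my own choice for illustration:

```python
from urllib.parse import urlencode

base = "http://so.biquge.la/cse/search"

# Parameter names and the s value are copied from the search URL
# shown earlier; GBK is an assumption about what the site expects.
params = {"s": "7138806708853866527", "q": "完美世界"}

# urlencode percent-encodes each value using the given encoding.
query = urlencode(params, encoding="gbk")
url = f"{base}?{query}"
print(url)
```

urlencode uses quote_plus by default, so a space in the keyword would become "+" rather than "%20"; pass quote_via=quote if the site needs the latter.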
Finally, thanks to the two experts in the group for their help: @irvine-song before the waste Emperor, and @ Fujian-Tianya.
I passed by Python3 (III): encoding problems again, urllib.parse.unquote