As mentioned earlier, the URL of each article is written to the item, but URLs vary in length.
To store every URL at the same length, add a field to the item that holds the MD5 hash of the URL.
An MD5 hex digest is always 32 characters long, so this field has a uniform length no matter how long the URL is.
Create a new folder named Util in the project root to hold the custom helper methods,
and create a new file, common.py, inside Util.
Write the following in it:
import hashlib

def get_md5(url):
    # hashlib works on bytes, so encode str input first
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()
Explanation of the encoding step
In Python 3 every str is Unicode text, while MD5 is computed over bytes, so a str has to be encoded (here as UTF-8) before it is hashed. Passing a str directly to hashlib raises a TypeError.
Unicode is the form text takes while it is being processed in memory, and UTF-8 is the compact byte form used to save space. In Python 2 this step was not needed, because a plain str there is already a byte string.
Because all Python 3 strings are Unicode, garbled text is also much less of a problem in Python 3.
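The encoding requirement can be checked directly. The following is a minimal sketch, assuming Python 3; the URL is a made-up example and the variable names are only for illustration.

```python
import hashlib

text = "http://example.com/article/1"  # hypothetical URL, for illustration only

# Passing a str straight to hashlib.md5 raises TypeError in Python 3.
try:
    hashlib.md5(text)
    str_accepted = True
except TypeError:
    str_accepted = False

# Encoding the str to UTF-8 bytes first works as expected.
digest = hashlib.md5(text.encode("utf-8")).hexdigest()

print(str_accepted)  # False
print(len(digest))   # 32
```

This is why get_md5 checks isinstance(url, str) before hashing: callers can pass either str or bytes.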
Finally, import the method in jobbole.py and write its result to the item field:
    from Util.common import get_md5  # adjust the package path to your project layout
    artical_item["url_object_id"] = get_md5(response.url)
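To see the effect on the item field, here is a small standalone sketch. It repeats the get_md5 helper so that it runs on its own, uses plain dicts in place of the Scrapy item, and the URLs are hypothetical.

```python
import hashlib


def get_md5(url):
    # Same helper as in Util/common.py, repeated so this snippet is self-contained.
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()


# Hypothetical URLs of clearly different lengths.
urls = [
    "http://example.com/1",
    "http://example.com/a/much/longer/path/to/an/article.html",
]

# A plain dict stands in for the Scrapy item here (an assumption for illustration).
items = [{"url": u, "url_object_id": get_md5(u)} for u in urls]

for item in items:
    # The URL length varies, but the url_object_id length is always 32.
    print(len(item["url"]), len(item["url_object_id"]))
```

Because the ID has a fixed length and is derived from the URL alone, it can serve as a compact key when the items are later written to the database.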
Now that all the item fields have been added, all that remains is to write to the database.
Scrapy basics: writing variable-length URLs to the item at a fixed length