Objective
For a variety of reasons (hardware among them), about 1.7 million Weibo records, roughly 2 TB of image data, had to be stored in MySQL. The data had previously lived in MongoDB, a non-relational database. Being unfamiliar with MySQL in all sorts of ways, I stepped into countless pits, and with all the back and forth it took 3 days to finish.
A Tour of the Pits
Building the Tables
Before saving the data I had to design the database, and I settled on 3 tables:
Weibo table: [id, userid, blog_text, lat, lng, created_time, reserve] — primary key: id
Picture table: [md5, pic_url, pic_bin, exif, reserve] — primary key: md5
Relationship table: [id, md5, reserve] — primary key: (id, md5); foreign keys: id references the Weibo table's id, md5 references the Picture table's md5
Creating the tables went smoothly apart from two columns: pic_bin and blog_text caused most of the trouble. For pic_bin I started with BLOB, only to discover at runtime that a BLOB holds at most 64 KB, nowhere near enough for Weibo images; switching to MEDIUMBLOB (up to 16 MB) basically meets the requirement. Then came blog_text, the first big pit I hit.
Naturally I first declared blog_text as TEXT, but once the crawler ran, some rows failed to insert with errors. A little screening showed that the failing Weibo texts contained emoji... After digging through a lot of material I learned the cause: MySQL's utf8 uses at most three bytes per character, while emoji need four, so the encoding has to be changed to utf8mb4. However, editing the MySQL configuration file on my Mac produced all kinds of bizarre errors, so in exasperation I changed TEXT to BLOB, and everything just worked. Besides, my local machine is a Mac while the MySQL I connect to runs on a remote Windows computer, so changing the local configuration would have been wasted effort anyway.
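Putting the design together, here is a minimal sketch of the DDL as I would reconstruct it. The table and column names match the insert statements in the full code below, but the sizes and the types of the auxiliary columns are my assumptions, as are the connection parameters:

import pymysql

# hypothetical connection parameters; substitute your own
conn = pymysql.connect(host='localhost', user='root', passwd='secret',
                       db='weibo', charset='utf8')
ddl = [
    # blog_text as BLOB mirrors the workaround above; with a working utf8mb4
    # setup, `blog_text TEXT CHARACTER SET utf8mb4` would be the cleaner choice
    """CREATE TABLE IF NOT EXISTS blog (
           id           VARCHAR(32)  NOT NULL,
           userid       VARCHAR(32),
           blog_text    BLOB,
           lat          DOUBLE,
           lng          DOUBLE,
           created_time VARCHAR(64),
           reserve      VARCHAR(255),
           PRIMARY KEY (id)
       )""",
    # MEDIUMBLOB (16 MB) instead of BLOB (64 KB) for the image binary
    """CREATE TABLE IF NOT EXISTS pics (
           md5     CHAR(32)     NOT NULL,
           pic_url VARCHAR(512),
           pic_bin MEDIUMBLOB,
           exif    TEXT,
           reserve VARCHAR(255),
           PRIMARY KEY (md5)
       )""",
    """CREATE TABLE IF NOT EXISTS relationship (
           id      VARCHAR(32) NOT NULL,
           md5     CHAR(32)    NOT NULL,
           reserve VARCHAR(255),
           PRIMARY KEY (id, md5),
           FOREIGN KEY (id)  REFERENCES blog (id),
           FOREIGN KEY (md5) REFERENCES pics (md5)
       )"""
]
with conn.cursor() as cursor:
    for stmt in ddl:
        cursor.execute(stmt)
conn.commit()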
Saving the Pictures
Then came the biggest pit of all! Because I use Python 3, the binary read from a picture is a bytes object whose repr carries a leading b''. It is precisely this b'' that breaks SQL string splicing: the quote after the b pairs up with the statement's own quotes, so the binary that follows sits outside the quotes and the statement errors out! Unlike an ordinary string, the binary cannot simply be escaped character by character; I tried many approaches and none of them worked. In the end I gave up and used Base64 to encode the binary into a string, saved that to the database, and decode it back whenever the picture is needed:
pic_bin = str(base64.b64encode(pic_bin))[2:-1]
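Incidentally, base64.b64encode(pic_bin).decode('ascii') is the cleaner way to strip the b'' wrapper than slicing. And the splicing problem disappears altogether with parameterized queries, where the driver escapes the raw bytes itself. The following is a sketch of that alternative, not what I did at the time; save_pic is a hypothetical helper, with table and column names taken from the schema above:

# parameterized insert: pymysql escapes the bytes, so no Base64 and no manual quoting
insert_pic_sql = ("INSERT IGNORE INTO pics (pic_url, pic_bin, md5, exif) "
                  "VALUES (%s, %s, %s, %s)")

def save_pic(conn, pic_url, pic_bin, md5, exif_json):
    # pic_bin is the raw bytes object downloaded from the image URL
    with conn.cursor() as cursor:
        cursor.execute(insert_pic_sql, (pic_url, pic_bin, md5, exif_json))
    conn.commit()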
Changing the Configuration File
Because the crawler uses Python multiprocessing, it moves about 8 GB of data per hour, and the image payloads are large; the packets exceeded MySQL's default limits and 'MySQL server has gone away' errors appeared. The cure is to raise two parameters in the configuration file:
max_allowed_packet = 600M
wait_timeout = 60000
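For reference, a sketch of where those lines live: on a typical install they go under the [mysqld] section of my.cnf (my.ini on Windows), and the server has to be restarted for a file change to take effect. The exact file path varies by platform, so treat this as an assumption about your setup:

[mysqld]
max_allowed_packet = 600M
wait_timeout = 60000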
Lost connection to MySQL server during query
The running program kept throwing this error. I hunted for the cause for ages, tried every remedy, read piles of material, and the error persisted. I honestly don't know what triggers it... Then it occurred to me: why should I care about the cause? If the connection is lost, just reconnect. conn.ping(True) checks whether the MySQL connection is alive and, if it has dropped, re-establishes it. That finally solved the problem.
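Distilled from the full listing below, the pattern looks like this: pymysql's ping(reconnect=True) pings the server and tries to revive the link itself, and the except branch rebuilds the connection by hand as a second line of defense before retrying.

def judge_conn(i):
    global conn
    try:
        conn.ping(True)  # reconnect=True: revive the connection if it dropped
        get_poiid_info(i)
    except pymysql.err.OperationalError:
        logging.error('reconnect')
        # ping could not bring the link back: open a fresh connection and retry
        conn = pymysql.connect(host=db_host, user=db_user, passwd=db_pass,
                               db=db, charset='utf8')
        get_poiid_info(i)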
Code Implementation
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Created by Baoyi on 2017/10/16

from multiprocessing.pool import Pool

import pymysql
import requests
import json
import exifread
from io import BytesIO
import configparser
import hashlib
import logging
import base64

# configure logging
logging.basicConfig(level=logging.WARNING,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='weibo.log',
                    filemode='w')

cf = configparser.ConfigParser()
cf.read("configparser.conf")

# read config
db_host = cf.get("mysql", "db_host")
db_port = cf.getint("mysql", "db_port")
db_user = cf.get("mysql", "db_user")
db_pass = cf.get("mysql", "db_pass")
db = cf.get("mysql", "db")

# create connection and cursor
conn = pymysql.connect(host=db_host, user=db_user, passwd=db_pass, db=db,
                       port=db_port, charset='utf8')
cursor = conn.cursor()

# insert statements
insert_blog_sql = (
    "INSERT IGNORE INTO blog (userid, id, blog_text, lat, lng, created_time) "
    "VALUES ('{uid}', '{id}', '{blog_text}', '{lat}', '{lng}', '{created_time}')")
insert_pic_sql = (
    "INSERT IGNORE INTO pics (pic_url, pic_bin, md5, exif) "
    "VALUES ('{pic_url}', '{pic_bin}', '{md5}', '{exif}')")
insert_relationship_sql = (
    "INSERT IGNORE INTO relationship (id, md5) VALUES ('{id}', '{md5}')")

uid = []
with open('./data/final_id.txt', 'r') as f:
    for i in f.readlines():
        uid.append(i.strip('\r\n'))


# fetch the full-size picture binary
def handle_pic(pic_url):
    large_pic_url = pic_url.replace('thumbnail', 'large')
    large_bin = requests.get(large_pic_url)
    return large_bin.content


def get_poiid_info(uid):
    try:
        url = 'https://api.weibo.com/2/statuses/user_timeline.json'
        load = {
            'access_token': 'xxxxxxxxxx',
            'uid': uid,
            'count': 100,  # per-page count (value assumed; 100 is the API maximum)
            'feature': 2,
            'trim_user': 1
        }
        get_info = requests.get(url=url, params=load, timeout=10)
        if get_info.status_code != 200:
            logging.warning('ConnectionError')
            pass
        info_json = json.loads(get_info.content)
        info_json['uid'] = uid
        statuses = info_json['statuses']
        # process and filter the weibo data
        for status in statuses:
            id = status['idstr']
            if status['geo'] is not None:
                lat = status['geo']['coordinates'][0]
                lng = status['geo']['coordinates'][1]
                pic_urls = status['pic_urls']

                # is it inside Beijing?
                if (115.7 < lng < 117.4) and (39.4 < lat < 41.6):
                    # if so, insert the blog record
                    blog_text = status['text'].replace("'", "''")
                    created_time = status['created_at']
                    try:
                        cursor.execute(insert_blog_sql.format(
                            uid=uid, id=id, blog_text=blog_text,
                            lat=lat, lng=lng, created_time=created_time))
                    except pymysql.err.OperationalError as e_blog:
                        logging.warning(e_blog.args[1])
                        pass
                    # conn.commit()

                    # process the pictures
                    for pic_url in pic_urls:
                        # fetch the original picture binary
                        pic_bin = handle_pic(pic_url['thumbnail_pic'])

                        # wrap the bytes in a file object so exifread can parse them
                        pic_file = BytesIO(pic_bin)
                        tag1 = exifread.process_file(pic_file, details=False, strict=True)
                        tag = {}
                        for key, value in tag1.items():
                            # drop four bulky EXIF attributes to trim the payload
                            if key not in ('JPEGThumbnail', 'TIFFThumbnail',
                                           'Filename', 'EXIF MakerNote'):
                                tag[key] = str(value)
                        tags = json.dumps(tag)  # the EXIF data as a JSON string

                        # hash the raw bytes directly (pic_file's read position
                        # is unreliable after exifread has consumed it)
                        md5 = hashlib.md5(pic_bin).hexdigest()

                        # base64-encode the binary picture into a string before saving
                        try:
                            cursor.execute(insert_pic_sql.format(
                                pic_url=pic_url['thumbnail_pic'].replace('thumbnail', 'large'),
                                pic_bin=str(base64.b64encode(pic_bin))[2:-1],
                                md5=md5, exif=tags))
                        except pymysql.err.OperationalError as e_pic:
                            logging.warning(e_pic.args[1])
                            pass
                        try:
                            cursor.execute(insert_relationship_sql.format(id=id, md5=md5))
                        except pymysql.err.OperationalError as e_relation:
                            logging.warning(e_relation)
                            pass
                        conn.commit()
                else:
                    logging.info(id + " isn't in Beijing")
                    pass
            else:
                logging.info(id + ' geo is null')
                pass
    except pymysql.err.OperationalError as e:
        logging.error(e.args[1])
        pass


def judge_conn(i):
    global conn
    try:
        conn.ping(True)
        get_poiid_info(i)
    except pymysql.err.OperationalError:
        logging.error('reconnect')
        conn = pymysql.connect(host=db_host, user=db_user, passwd=db_pass,
                               db=db, charset='utf8')
        get_poiid_info(i)


def handle_tuple(a_tuple):
    read_uid_set = []
    for i in a_tuple:
        read_uid_set.append(i[0])
    return set(read_uid_set)


if __name__ == '__main__':
    # skip uids whose blogs are already stored
    sql_find_uid = "SELECT userid FROM blog"
    cursor.execute(sql_find_uid)
    read_uid_tuple = cursor.fetchall()
    read_list = handle_tuple(read_uid_tuple)
    print(len(read_list))
    new_uid = set(uid).difference(read_list)
    print(len(new_uid))
    pool = Pool()
    pool.map(judge_conn, list(new_uid))