Grabbing avatars from email addresses: how to write a simple image-scraping script

Source: Internet
Author: User

I'll show you how to write a simple avatar-grabbing script.

We have hundreds of thousands of mailboxes on hand. The user system originally did nothing about profile pictures, and now we want to pull avatars for at least some of these users based on their email addresses. You could use the Gravatar service directly, but it is blocked in China, so even relaying it back is unreliable. The second option is QQ mail: analysing the data showed that more than half of those mailboxes are QQ addresses, so the goal is to find a way to fetch the avatars without going through OAuth.

Thinking and Technology choice

As a Pythonista, there are many crawler frameworks to choose from, such as Scrapy and PySpider (the latter even has a Chinese UI and a built-in scheduler).

A crawler framework does a lot for you: it packages the basics into parse callbacks, its core functions can follow "next page"-style links with depth-first or breadth-first crawling, and the better ones give you simple ways to do User-Agent disguise, proxy disguise, authentication, time scheduling, and so on.

But grabbing avatars for mailboxes just means building a URL and fetching it; nothing that fancy is needed, so requests is enough.

Things to do

    1. Download the pictures, but skip default pictures. QQ, for example, returns a default image in several sizes when the user has no avatar; I don't want that: no avatar means no file.

    2. The process must be able to resume from where it left off if interrupted (I don't want to re-crawl hundreds of thousands of mailboxes).

    3. Multiple processes should be usable to speed up the crawl.

Let's start with the implementation

The first step is to get the URL. If you don't mind that Gravatar is blocked, and that the QQ endpoint may change (it is, after all, not a documented address), the following is enough.

Getting the URL from the email address

Gravatar

Gravatar documentation

Gravatar Python implementation

Bring your own VPN if necessary.

There's not much to say about Gravatar: the URL is simply the MD5 hash of the (lowercased) email address.

Note that the parameter s is the size; Gravatar does this well, basically every size is available.
d is the default-image parameter: when you don't want a default avatar, set d=404 and Gravatar will return a 404 response instead. See the documentation for the other parameters.
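Putting the two parameters together, a minimal Gravatar URL builder might look like this (a sketch in Python 3, unlike the Python 2 script below; `gravatar_url` is a hypothetical helper name):

```python
import hashlib
from urllib.parse import urlencode  # Python 3; urllib.urlencode on Python 2


def gravatar_url(email, size=100, use_404=True):
    """Build a Gravatar URL: MD5 of the lowercased email, plus s/d params."""
    digest = hashlib.md5(email.strip().lower().encode('utf-8')).hexdigest()
    params = {'s': str(size)}
    if use_404:
        # d=404 makes Gravatar answer 404 instead of a generated default image
        params['d'] = '404'
    return 'https://secure.gravatar.com/avatar/{}?{}'.format(digest, urlencode(params))


print(gravatar_url('someone@example.com'))
```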

QQ

http://q4.qlogo.cn/g?b=qq&nk=491794128&s=1

The QQ endpoint is relatively easy to find (don't ask me how, I forgot).

nk is the QQ number; a QQ mail address also works.

s is the picture size. Probing around, I found these sizes: 1 2 3 4 5 40 41 100 140 160 240 640.
Those are all of them; among them 2 corresponds to 40 px and 4 corresponds to 100 px. But note that not everyone has a 100-size picture (avatars uploaded ten years ago and never changed; such users really exist, I have ...)
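As a sketch, building the QQ URL is just string formatting; the sample QQ number is the one from the post, and `qq_avatar_url` is a hypothetical helper name:

```python
def qq_avatar_url(nk, size=4):
    """Build the qlogo URL described above; nk is a QQ number or QQ-mail
    address, size is one of the values listed (e.g. 2 -> 40px, 4 -> 100px)."""
    return 'http://q4.qlogo.cn/g?b=qq&nk={}&s={}'.format(nk, size)


print(qq_avatar_url('491794128', size=1))
# http://q4.qlogo.cn/g?b=qq&nk=491794128&s=1
```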

This post explains how to get a QQ nickname and avatar from a QQ number without an AppID.
It mentions a PHP curl trick for fetching hotlink-protected resources; a pity it's PHP, so I ported it to a
Python version. Although the final implementation doesn't use it (QQ turned out to have a directly accessible endpoint, oh yeah), it may still come in handy.

Below are sample avatars at the five sizes; I'm not sure whether they can be displayed on GitHub, OSC, or SF.






Problems encountered, before the code

Like any crawler, you may need to disguise your User-Agent, or the crawler may get banned: while crawling QQ I found that after a while every avatar came back with size 0, so something was clearly up.
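A minimal sketch of the disguise: keep a pool of realistic User-Agent strings (the full script carries a much longer list) and pick one at random per request:

```python
import random

# two example desktop UA strings; the real script keeps a few dozen
AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) "
    "Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]


def get_random_headers():
    """Headers dict for requests.get(..., headers=get_random_headers())."""
    return {'User-Agent': random.choice(AGENTS)}
```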

You may notice in my code that I name each captured image <email>.jpg; that's because I later want to write a simple page to browse these pictures.

About Gravatar's share of users: the ratio kept dropping, from 1 in 40, to 1 in 60; by the time I had crawled 60,000 mailboxes it was roughly 1 in 100.

About ignoring default pictures: Gravatar is easy, just check for the 404. QQ is more trouble: first download its handful of default pictures and take their MD5s; then, whenever you download a QQ image, compare its MD5 against those. A match means it's a default picture, so skip it.
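The QQ default-picture check can be sketched like this; the two digests are the ones the script below ships with, collected by downloading the stock avatars once, and `is_default_qq_avatar` is a hypothetical helper name:

```python
import hashlib

# MD5 digests of QQ's stock default avatars (data taken from the script below)
DEFAULT_AVATAR_MD5S = {
    '11567101378fc08988b38b8f0acb1f74',
    '9d11f9fcc1888a4be8d610f8f4bba224',
}


def is_default_qq_avatar(image_bytes):
    """True when the downloaded bytes hash to a known default picture."""
    return hashlib.md5(image_bytes).hexdigest() in DEFAULT_AVATAR_MD5S
```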

Logs: make good use of them, since the log is what lets you resume from where you stopped.
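Since every log line starts with the running index, resuming is just a scan for the maximum first column; a hypothetical helper sketching the idea (the script's check_logfile works essentially the same way):

```python
import os


def last_logged_index(log_path):
    """Return the next index to process after a previous (possibly aborted) run."""
    last = 0
    if os.path.exists(log_path):
        with open(log_path) as log_file:
            for line in log_file:
                token = line.split(None, 1)[0] if line.strip() else ''
                if token.isdigit():
                    last = max(last, int(token))
    return last + 1  # continue from the line after the last logged one
```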

About multiprocessing: the simplest approach. Remember the idea from algorithms class: break a large task into small tasks. So the huge mailing list is split into several parts, and with a little support in the script you can run several processes at the same time.
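The splitting itself is trivial; a hypothetical split_email_list helper as a sketch, with file names following the email_list_<part>.json convention the script expects:

```python
def split_email_list(src_path, parts):
    """Split one email list into email_list_0.json .. email_list_<parts-1>.json
    so each part can be crawled by its own process."""
    with open(src_path) as f:
        emails = [line.strip() for line in f if line.strip()]
    chunk = (len(emails) + parts - 1) // parts  # ceiling division
    for i in range(parts):
        with open('email_list_{}.json'.format(i), 'w') as out:
            out.write('\n'.join(emails[i * chunk:(i + 1) * chunk]))
```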

On the Code

See here for the latest code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import hashlib
import urllib
import sys
import os
import random
from functools import partial

AGENTS = [
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 (.NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
]

# MD5 digests of QQ's default avatars; responses matching these are skipped
QQ_MD5_ESCAPE = ['11567101378fc08988b38b8f0acb1f74', '9d11f9fcc1888a4be8d610f8f4bba224']

LOG_FILES = 'scrapy_{}.log'
EMAIL_LIST = 'email_list_{}.json'
AVATAR_PATH = 'avatar/{}{}'

LOG_LEVEL_EXISTS = 'EXISTS'
LOG_LEVEL_NOTSET_OR_ERROR = 'NOTSET_OR_ERROR'
LOG_LEVEL_TYPE_ERROR = 'TYPE_ERROR'
LOG_LEVEL_ERROR = 'ERROR'
LOG_LEVEL_FAIL = 'FAIL'
LOG_LEVEL_SUCCESS = 'SUCCESS'
LOG_LEVEL_IGNORE = 'IGNORE'


def get_gravatar_url(email, default_avatar=None, use_404=False, size=100):
    data = {}
    if default_avatar and default_avatar.startswith('http'):
        data['d'] = default_avatar
    if use_404:
        data['d'] = '404'
    data['s'] = str(size)
    gravatar_url = "http://secure.gravatar.com/avatar/" + hashlib.md5(email.lower()).hexdigest() + "?"
    gravatar_url += urllib.urlencode(data)
    return gravatar_url


def get_random_headers():
    agent = random.choice(AGENTS)
    headers = {'User-Agent': agent}
    return headers


def check_logfile(part):
    # find the largest index in an existing log so we can resume after it
    last_scrapy_line = 1
    if os.path.exists(LOG_FILES.format(part)):
        with open(LOG_FILES.format(part)) as log_read:
            for line in log_read:
                last_scrapy_line = max(last_scrapy_line, int(line.split()[0]))
    print last_scrapy_line
    return last_scrapy_line + 1


def get_log_message(log_format='{index} {level} {email} {msg}',
                    index=None, level=None, email=None, msg=None):
    return log_format.format(index=index, level=level, email=email, msg=msg)

success_log = partial(get_log_message, level=LOG_LEVEL_SUCCESS, msg='scrapyed success')
exist_log = partial(get_log_message, level=LOG_LEVEL_EXISTS, msg='scrapyed already')
fail_log = partial(get_log_message, level=LOG_LEVEL_FAIL, msg='scrapyed failed')
not_qq_log = partial(get_log_message, level=LOG_LEVEL_TYPE_ERROR, msg='not qq email')
ignore_log = partial(get_log_message, level=LOG_LEVEL_TYPE_ERROR, msg='ignore email')
empty_size_log = partial(get_log_message, level=LOG_LEVEL_ERROR, msg='empty avatar')
unexcept_error_log = partial(get_log_message, level=LOG_LEVEL_ERROR, msg='unexcept error')


def write_log(log, msg):
    log.write(msg)
    log.write('\n')
    log.flush()


def save_avatar_file(filename, content):
    with open(filename, 'wb') as avatar_file:
        avatar_file.write(content)


def scrapy_context(part, suffix='.jpg', rescrapy=False, hook=None):
    last_scrapy_line = check_logfile(part)
    index = last_scrapy_line
    with open(LOG_FILES.format(part), 'a') as log:
        with open(EMAIL_LIST.format(part)) as list_file:
            for linenum, email in enumerate(list_file):
                if linenum < last_scrapy_line:
                    continue
                email = email.strip()
                if not rescrapy:
                    if os.path.exists(AVATAR_PATH.format(email, suffix)):
                        print exist_log(index=index, email=email)
                        index += 1
                        continue
                if not hook:
                    raise NotImplementedError()
                try:
                    hook(part, suffix=suffix, rescrapy=rescrapy, log=log,
                         index=index, email=email)
                except Exception as ex:
                    print unexcept_error_log(index=index, email=email)
                    write_log(log, unexcept_error_log(index=index, email=email))
                    raise ex
                index += 1


def scrapy_qq_hook(part, suffix='.jpg', rescrapy=False, log=None, index=None, email=None):
    if 'qq.com' not in email.lower():
        print not_qq_log(index=index, email=email)
        write_log(log, not_qq_log(index=index, email=email))
        return
    url = 'http://q4.qlogo.cn/g?b=qq&nk={}&s=4'.format(email)
    response = requests.get(url, timeout=10, headers=get_random_headers())
    if response.status_code == 200:
        # check whether the user has a large avatar; if not, request the small one
        if hashlib.md5(response.content).hexdigest() in QQ_MD5_ESCAPE:
            url = 'http://q4.qlogo.cn/g?b=qq&nk={}&s=2'.format(email)
            response = requests.get(url, timeout=10, headers=get_random_headers())
            if response.status_code == 200:
                if not len(response.content):
                    print empty_size_log(index=index, email=email)
                    write_log(log, empty_size_log(index=index, email=email))
    # status is checked again because the branch above may have replaced the response
    if response.status_code == 200:
        save_avatar_file(AVATAR_PATH.format(email, suffix), response.content)
        print success_log(index=index, email=email)
        write_log(log, success_log(index=index, email=email))
    else:
        print fail_log(index=index, email=email)
        write_log(log, fail_log(index=index, email=email))


def scrapy_gravatar_hook(part, suffix='.jpg', rescrapy=False, ignore_email_suffix=None,
                         log=None, index=None, email=None):
    if ignore_email_suffix and ignore_email_suffix in email.lower():
        print ignore_log(index=index, email=email)
        write_log(log, ignore_log(index=index, email=email))
        return
    response = requests.get(get_gravatar_url(email, use_404=True),
                            timeout=10, headers=get_random_headers())
    if response.status_code == 200:
        save_avatar_file(AVATAR_PATH.format(email, suffix), response.content)
        print success_log(index=index, email=email)
        write_log(log, success_log(index=index, email=email))
    else:
        print fail_log(index=index, email=email)
        write_log(log, fail_log(index=index, email=email))

scrapy_gravatar = partial(scrapy_context, hook=scrapy_gravatar_hook)
scrapy_qq = partial(scrapy_context, hook=scrapy_qq_hook)

FUNC_MAPPER = {
    'qq': scrapy_qq,
    'gravatar': scrapy_gravatar,
}

if __name__ == '__main__':
    scrapy_type = sys.argv[1]
    part = sys.argv[2]
    if scrapy_type not in FUNC_MAPPER:
        print 'type should in [qq|gravatar]'
        exit(0)
    FUNC_MAPPER[scrapy_type](part)
Simple usage

pip install requests

    1. Put scrapy_avatar.py under a folder, for example /opt/projects/scripts

    2. mkdir /opt/projects/scripts/avatar

    3. Put your email list in email_list_0.json

    4. python scrapy_avatar.py gravatar 0 or python scrapy_avatar.py qq 0

Simple description

When the email list is large, you can split it into multiple lists in order to use more processes,
for example email_list_0.json and email_list_1.json.
Then python scrapy_avatar.py gravatar 0 and python scrapy_avatar.py gravatar 1 give you two processes crawling at once.

For other features, please read the code and adapt the two hook methods.

Gripes
    1. Because this is a simple script, I was too lazy to use click for argument handling; the only dependency is requests, and I didn't bother writing argument validation.

    2. Originally the for loop in scrapy_context was written with a contextmanager yield, but it raised a strange RuntimeError: generator didn't stop, so I gave up and turned the yield into a hook method.

    3. QQ avatars have some odd quirks: for example, not everyone has a 100-size picture, but no one lacks the 40-size one, so the script tries the large picture first and falls back to the small one based on the check on the QQ side.

    4. The script includes no contexts or hooks for other methods; if you need them, modify it yourself.

It'll be cool to browse the results with Chrome. Attached: a simple way to serve the pictures from a Linux server with Flask + nginx.

Django is too heavy; Flask + nginx is enough, because there is no other requirement.

pip install flask

Put app.py next to the captured images, change the avatar root path inside the nginx config, drop the config into /etc/nginx/sites-enabled and reload nginx, and don't forget to add localtest to your hosts file.

Flask code, app.py (the original listing was truncated mid-function; the body of hello() below is a minimal reconstruction that just emits an img tag per captured file):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from flask import Flask, send_from_directory, safe_join  # helpers imported in the original, unused in this fragment
import os

app = Flask(__name__)
app.debug = True


@app.route("/")
def hello():
    avatars = sorted(os.listdir('avatar'))
    # one <img> per captured avatar; nginx serves the image files themselves
    return '\n'.join('<img src="/avatar/{}">'.format(name) for name in avatars)


if __name__ == '__main__':
    app.run(port=11111)
Nginx

upstream localtest-backend {
    server 127.0.0.1:11111 fail_timeout=0;
}
server {
    listen 80;
    server_name localtest.com;

    location ~ /avatar/(?P<file>.*) {
        root /opt/projects/scripts/new;
        try_files /avatar/$file /avatar/$file =404;
        expires 30d;
        gzip on;
        gzip_types text/plain application/x-javascript text/css application/javascript;
        gzip_comp_level 3;
    }

    location / {
        proxy_pass http://localtest-backend;
    }
}

