Python Crawler, Part 3: Sina Weibo Crawler in Practice

Source: Internet
Author: User
Tags instance method system log

The project and its documentation are on GitHub: https://github.com/poiu1235/weibo-catch

If you are interested, feel free to follow the project or give it a like.


The approach here is to keep digging deeper without setting any crawl boundary (boundaries are something to consider later).

The general idea is to log in with my own account and fetch the list of my own posts and my friends list.

Then, working from the friends list, I crawl each friend's post list and friends list in turn, and keep repeating this process of digging deeper and traversing.

Along the way I store the data in a MySQL database, and later add MongoDB as a second store.
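The traversal described above can be sketched as follows. This is only an illustration under my own assumptions, not the project's code: `crawl_order`, `get_friends`, and the `max_users` cap are hypothetical names, and the toy `graph` stands in for real Weibo data. It uses a queue, so accounts are visited level by level; the visited set is what keeps mutual follows from looping forever.

```python
from collections import deque


def crawl_order(start_id, get_friends, max_users=None):
    """Return user IDs in the order they would be crawled, starting at start_id.

    get_friends is a hypothetical callable that returns the IDs a user follows.
    max_users is an optional cap (an assumption; the post sets no boundary yet).
    """
    visited = {start_id}
    queue = deque([start_id])
    order = []
    while queue:
        uid = queue.popleft()
        order.append(uid)
        if max_users is not None and len(order) >= max_users:
            break
        for friend in get_friends(uid):
            if friend not in visited:  # skip accounts that are already scheduled
                visited.add(friend)
                queue.append(friend)
    return order


# Toy follow graph standing in for real data
graph = {"me": ["a", "b"], "a": ["b", "c"], "b": ["me"], "c": []}
order = crawl_order("me", lambda uid: graph.get(uid, []))
capped = crawl_order("me", lambda uid: graph.get(uid, []), max_users=2)
print(order)
print(capped)
```

In a real crawler the `flag` column of the user table could play the role of the visited set, so that an interrupted crawl can resume.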


First, a little Linux background: what each directory in the system is for.


/bin
"bin" is short for binary. This directory follows the UNIX layout and holds the commands users run most often, such as cp, ls, and cat.
/boot
Holds the core files used to boot Linux.
/dev
"dev" is short for device. This directory contains all of Linux's external devices, similar in function to the .sys driver files under DOS and Windows. In Linux, devices are accessed the same way as files; for example, /dev/hda represents the first physical IDE hard disk.
/etc
Stores the configuration files and subdirectories needed for system administration.
/home
Users' home directories. For example, a user named wang has the home directory /home/wang, which can also be written ~wang.
/lib
Holds the system's most basic dynamically linked shared libraries, which play the same role as .dll files in Windows. Almost every application uses these shared libraries.
/lost+found
Usually empty. After the system shuts down abnormally, it becomes the refuge for orphaned files, somewhat like .chk files under DOS.
/mnt
Empty by default; the system provides it so that users can temporarily mount other file systems.
/proc
A virtual directory that maps what is in system memory; you can read system information directly from it. In other words, its contents live in memory, not on the hard disk.
/root
The home directory of the system administrator (the superuser). As the owner of the system, root gets some privileges, such as a directory of its own.
/sbin
The "s" stands for system: this directory holds the administrative programs used by the system administrator (the superuser).
/tmp
Needless to say, this is where temporary files go.
/usr
The largest directory; nearly all the applications and files we use are stored here. It contains the following subdirectories:
/usr/X11R6
Holds the X Window System.
/usr/bin
Holds many applications.
/usr/sbin
Holds administrative programs for the superuser.
/usr/doc
The home base of Linux documentation.
/usr/include
Holds the header files needed to develop and compile applications on Linux.
/usr/lib
Holds commonly used dynamically linked shared libraries and static archives.
/usr/local
A /usr-style directory provided for ordinary users; it is the most appropriate place to install your own software.
/usr/man
Holds the manual (man) pages, which are the help documentation on Linux.
/usr/src
Holds Linux source code; enthusiasts should not miss it.
/var
Holds data that keeps growing. To keep /usr relatively stable, directories that are modified often can be placed here, and many system administrators do exactly that. Incidentally, the system log files are in /var/log.

Weibo crawling rules

Throughout the process I want to crawl two kinds of content.

One is the Weibo post content.

The other is friend relationships: the people a user follows (an active choice that strongly reflects interest), rather than the people who follow the user (fans; this is passive and far too noisy).

I store what I crawl in two places: a MySQL relational database, and text files on the file system.

The MySQL instance is the one on the NameNode of the Hadoop system set up earlier; its IP is 192.168.1.113.

A default MySQL installation binds to the local IP, so it only accepts local, single-machine access.

So the configuration has to change:

sudo nano /etc/mysql/my.cnf, then find bind-address = 127.0.0.1
Comment out this line, like so: #bind-address = 127.0.0.1

This allows access from any IP.
After commenting out this line, the host can no longer be omitted even for local access; it must be specified with the -h option.

Restart MySQL: sudo /etc/init.d/mysql restart
Authorize the user to connect remotely:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
FLUSH PRIVILEGES;
The first command reads as follows. *.*: the first * is the database name and the second * is the table name, so privileges on every table in every database are granted to the user. root: the account being granted. '%': the IP the user may connect from; % means any IP address can access the MySQL database. 'password': the password for the account; replace it with your MySQL root password.
The second command reloads the privilege information so that the settings take effect immediately.

Log in to MySQL with mysql -uroot -p
To log in to MySQL on another machine, use mysql -uroot -h192.168.1.113 -p

# Check the character encoding

SHOW VARIABLES LIKE 'character%';
# Change the character encoding

sudo service mysql stop

sudo nano /etc/mysql/my.cnf
# Under the [mysqld] section of the file, add the following two settings:
character_set_server=utf8
init_connect='SET NAMES utf8'
sudo service mysql start

The database and server character sets are now UTF-8, so Chinese text can be written.

The MySQL GUI tools I installed on Ubuntu are MySQL Workbench and HeidiSQL. People online advise against Navicat there, although on Windows Navicat is the one I find most comfortable.

The built-in 'mysql' schema under MySQL's data directory is not shown in Workbench by default.
In the menu, click Edit -> Preferences, open the SQL Editor section, and tick the "Show Data Dictionaries and Internal Schemas" check box.

Go back and refresh or reconnect, and it will appear.
The next step is to create the database and the tables.




With execute and Python string formatting, the format specifiers must match the column types. For columns name (char) and age (int), the statement is: cur.execute("INSERT INTO db.table (name, age) VALUES ('%s', %d)" % (v['name'], v['age'])), where v['name'] is a string and v['age'] is an int.
executemany needs no such type-specific formatting. For the same table the statement is:
values = [("Zhan", 25), ("Li", 33)]
cur.executemany("INSERT INTO db.table (name, age) VALUES (%s, %s)", values)

I don't know the exact reason; I can only say I fell into this pit myself. With executemany, generally only %s works, and other specifiers may raise an error. The inserted values come out correct (the column types seem to be matched automatically). If any expert knows why, I would appreciate the advice.
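As far as I understand it, the reason is that the placeholders in the two-argument form are parameter markers filled in by the driver, not printf-style format codes, so the driver decides how each Python value is rendered. The sketch below illustrates the single-row and many-row calls. It is an assumption-laden stand-in: it uses the standard-library sqlite3 module so it runs without a MySQL server, and sqlite3 writes the marker as ? where MySQLdb writes %s. The table `people` and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")

# Single row: the values travel separately from the SQL text
cur.execute("INSERT INTO people (name, age) VALUES (?, ?)", ("Wang", 41))

# Many rows: one statement plus a list of tuples (sample data)
rows = [("Zhan", 25), ("Li", 33)]
cur.executemany("INSERT INTO people (name, age) VALUES (?, ?)", rows)
conn.commit()

people = cur.execute("SELECT name, age FROM people ORDER BY age").fetchall()
print(people)
conn.close()
```

The same marker is used for the text column and the integer column, which mirrors the "only %s" behaviour seen with MySQLdb.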

Note: cursor.execute() accepts one or two arguments:
(1) cursor.execute("INSERT INTO resource (cid, name) VALUES (%s, %s)", (12, name))
With two arguments, MySQLdb escapes and quotes the string for you, so you do not have to escape it yourself. After this statement runs, the resource table has one more record: 12 \
(2) cursor.execute("INSERT INTO resource (cid, name) VALUES (%s, %s)" % (12, name))
This form uses Python string formatting to build the query, which is then passed to execute as a single argument. Here you must escape and quote the string yourself, so the statement above is wrong and should be changed to:
name = MySQLdb.escape_string(name)
cursor.execute("INSERT INTO resource (cid, name) VALUES (%s, '%s')" % (12, name))
The record inserted this way is the same as in (1): 12 \
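To make the difference between the two forms concrete, here is a small sketch. Again sqlite3 is swapped in for MySQLdb so the example runs on its own (so the marker is ? rather than %s), and the `name` value is a made-up string containing a single quote and a backslash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE resource (cid INTEGER, name TEXT)")

name = "O'Reilly \\ path"  # contains a single quote and a backslash

# Two-argument form: the driver takes care of quoting
cur.execute("INSERT INTO resource (cid, name) VALUES (?, ?)", (12, name))

# String-formatting form without escaping: the quote inside name breaks the SQL
try:
    cur.execute("INSERT INTO resource (cid, name) VALUES (%s, '%s')" % (13, name))
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False

stored = cur.execute("SELECT name FROM resource WHERE cid = 12").fetchone()[0]
print(stored == name, formatted_ok)
conn.close()
```

The two-argument form stores the string unchanged, while the unescaped formatted query fails to parse, which is why form (1) is the safer habit.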


CREATE DATABASE weibocatch;
USE weibocatch;
/* A Chinese character and a letter each count as one character */
CREATE TABLE w_user (
    wid char(10) PRIMARY KEY,
    wname varchar(MB) NOT NULL,
    recon tinyint(1),                 # 0 = not verified, 1 = verified
    color tinyint(1) DEFAULT NULL,    # 0 = yellow V (personal), 1 = blue V (enterprise)
    flag tinyint(2) DEFAULT 0,        # 0 = not crawled, 1 = crawl succeeded, 2 = crawl failed
    inserttime timestamp DEFAULT now()
);

CREATE TABLE w_error (wid char(10), exception varchar(2000));

CREATE TABLE w_relation (wid char(10), wfriendid char(10));
CREATE TABLE w_conn (
    weiboid char(),
    tag tinyint(1),    /* whether the post is a repost or an original: 1 = repost, 0 = original */
    atid char,
    atname varchar(m),
    PRIMARY KEY (weiboid, tag)
);
/* date and datetime columns cannot use a function as their default value, so the timestamp type is used instead */
CREATE TABLE w_content (
    weiboid char PRIMARY KEY,
    wid char(10),
    content varchar(1000),
    url varchar(1000),
    map varchar(m),
    label varchar(MB),
    intime varchar(MB),
    picid int,
    wfrom varchar(MB),
    inserttime timestamp DEFAULT now()
);
/* date and datetime columns cannot use a function as their default value, so the timestamp type is used instead */
CREATE TABLE w_transfer (
    weiboid char PRIMARY KEY,
    wid char(10),
    transferid char(+),
    content varchar(1000),
    remark varchar(1000),
    label varchar(MB),
    url varchar(1000),
    intime varchar(MB),
    picid int,
    wfrom varchar(m),
    inserttime timestamp DEFAULT now()
);
CREATE TABLE pic_reg (
    id int PRIMARY KEY AUTO_INCREMENT,
    path varchar(3000),
    label1 varchar(MB),
    label2 varchar(MB),
    label3 varchar(),
    title varchar(9000)
) AUTO_INCREMENT=10000;

INSERT INTO weibocatch.w_user (wid, wname, recon, color, flag) VALUES ('2468833122', 'poiu1235', 0, NULL, 0);
INSERT INTO weibocatch.w_user (wid, wname, recon, color, flag) VALUES ('2430104687', 'Jansenkaigen', 0, 0, 0);
SHOW GLOBAL VARIABLES LIKE 'auto_incre%';  -- global variables
SELECT * FROM weibocatch.w_relation;
SELECT now();
DELETE FROM weibocatch.pic_reg;
DELETE FROM weibocatch.w_transfer;
ALTER TABLE weibocatch.pic_reg AUTO_INCREMENT=10000;
SELECT * FROM weibocatch.w_user;
UPDATE weibocatch.w_user SET flag=0 WHERE wid = '2430104687';

Be sure to call self.conn.commit() once the insert and update operations are complete; otherwise the changes are not written to the database.

                try:
                    xxxx  # the insert/update statements go here
                except Exception as err:
                    self.conn.rollback()
                    print(err)
                finally:
                    # close the cursor and release resources
                    cursor.close()
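The commit/rollback pattern above can be tried end to end with the sketch below. It is only an illustration: sqlite3 replaces MySQLdb so no server is needed, `insert_user` is a hypothetical helper, and the table is a cut-down version of w_user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w_user (wid TEXT PRIMARY KEY, wname TEXT)")
conn.commit()


def insert_user(conn, wid, wname):
    """Insert one row; commit on success, roll back on any database error."""
    cursor = conn.cursor()
    try:
        cursor.execute("INSERT INTO w_user (wid, wname) VALUES (?, ?)", (wid, wname))
        conn.commit()
        return True
    except sqlite3.Error as err:
        conn.rollback()
        print(err)
        return False
    finally:
        # runs on both paths, so the cursor is always released
        cursor.close()


first = insert_user(conn, "2468833122", "poiu1235")
duplicate = insert_user(conn, "2468833122", "poiu1235")  # violates the primary key
count = conn.execute("SELECT COUNT(*) FROM w_user").fetchone()[0]
print(first, duplicate, count)
```

The second call hits the primary-key constraint, is rolled back, and leaves the table with a single row.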
                    

if wconn.tag.strip():  # check that the string is not empty


The database connection setup is wrapped in a function so that it can be called whenever a connection is needed.

self.inition()  # MySQL connections time out and drop easily (the default I ran into was 10 s), so connect right before each use


# When inserting a record, always use %s as the placeholder, even for numbers (not %d),
# because the driver converts the values for you: strings get quotes added automatically, numbers do not

# The value this function returns to a given connection is the first AUTO_INCREMENT value generated
# by the most recent statement on that connection that affected an AUTO_INCREMENT column.
# The value is not affected by other connections; each connection only sees the AUTO_INCREMENT values it generated itself.
# Second, LAST_INSERT_ID() is table-independent:
# if you insert into table a and then insert into table b, LAST_INSERT_ID() returns the ID value from table b.
cursor.execute("SELECT LAST_INSERT_ID()")
# Fetch just one record
result = cursor.fetchone()  # returns a tuple; a one-element tuple is printed with a trailing ",", which has no real significance
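The table-independent behaviour and the one-element tuple can both be seen in a small sketch. Note the swap: this uses sqlite3 and its last_insert_rowid() function as a stand-in for MySQL's LAST_INSERT_ID(), so it demonstrates the analogous per-connection behaviour rather than MySQL itself. Tables `a` and `b` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")

cur.execute("INSERT INTO a (v) VALUES ('x')")
cur.execute("INSERT INTO a (v) VALUES ('y')")  # table a now holds ids 1 and 2
cur.execute("INSERT INTO b (v) VALUES ('z')")  # table b gets id 1

cur.execute("SELECT last_insert_rowid()")
row = cur.fetchone()  # a one-element tuple
last_id = row[0]      # index into the tuple to get the plain integer
print(row, last_id)
conn.close()
```

The result refers to the insert into `b`, the most recent one on this connection, even though `a` has the larger id.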

# Take the second through the second-to-last character, i.e. strip the single quotes from both ends of the matched string
loginweb = loginweb[1:-1]

# reload(sys)
# sys.setdefaultencoding("utf-8")  # the IDE plugin flags this as an error, but the program runs normally

# str1.decode('gb2312') converts the GB2312-encoded string str1 to Unicode
# str2.encode('gb2312') converts the Unicode string str2 to GB2312 encoding
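The snippets in this post are Python 2, where str holds bytes. For comparison, here is a minimal sketch of the same two conversions in Python 3, where str is already Unicode and only bytes objects have decode(). The sample text is my own.

```python
text = "微博"                      # str (Unicode) in Python 3
gb_bytes = text.encode("gb2312")   # Unicode -> GB2312 bytes
utf8_bytes = text.encode("utf-8")  # Unicode -> UTF-8 bytes

back = gb_bytes.decode("gb2312")   # GB2312 bytes -> Unicode
print(len(gb_bytes), len(utf8_bytes), back == text)
```

GB2312 uses two bytes per Chinese character and UTF-8 uses three for these characters, which is why decoding with the wrong codec produces garbage or an error.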

# Set the read directory to the current directory
# This makes saving files convenient, but the cookie file can then no longer be read correctly because the current path has changed
# os.chdir("/home/luis/workspace/weibo-catch/picture/")
# os.getcwd()
# So writing out the full path directly saves a lot of trouble
ppth = "/home/luis/workspace/weibo-catch/picture/"

data = urllib.urlopen(picurl).read()
f = open(name, 'wb', 8192)  # set the file buffer to 8 KB; some pictures are large and may not be written in one go
f.write(data)
f.close()  # close() flushes the buffer before closing
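A slightly more defensive version of the save step might look like the sketch below. It assumes Python 3 and leaves out the download itself; `save_picture` is a hypothetical helper, and the bytes written are a fake image header rather than a real picture. Building the full path with os.path.join avoids depending on the current working directory, and the with block closes (and flushes) the file even if the write fails.

```python
import os
import tempfile


def save_picture(directory, filename, data):
    """Write raw image bytes to directory/filename and return the full path."""
    path = os.path.join(directory, filename)
    with open(path, "wb", 8192) as f:  # 8192-byte buffer, as in the snippet above
        f.write(data)
    return path


demo_dir = tempfile.mkdtemp()                      # throwaway directory for the demo
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16   # stand-in for downloaded bytes
saved = save_picture(demo_dir, "demo.png", fake_image)

with open(saved, "rb") as f:
    roundtrip = f.read()
print(saved, len(roundtrip))
```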

# f.read() returns a str (UTF-8 by default here), so you can call find() on it directly without a decode step

if countt == 10:
    subprocess.call("pause", shell=True)  # pause the program; replaces os.system('pause'), or use: if raw_input(): pass

# Note that every object extracted from a BeautifulSoup object is a tag object, so BeautifulSoup operations can be applied to it directly.
# It is not a string; to do string operations on it you must convert it with str().
# Beautiful Soup uses an automatic encoding-detection library to identify the document's encoding and converts it to Unicode.
# Documents output through Beautiful Soup are UTF-8 encoded, regardless of the encoding of the input document.
# Methods such as index and the simple find, however, work on the Unicode form and convert back to UTF-8 afterwards,
# i.e. decode first and encode back afterwards (decode('utf8'), then encode('utf8')).

time.sleep(2)  # pause for 2 seconds after each task

# Ordinary strings have rfind for searching from the end; BeautifulSoup objects do not.

# You cannot use remove here; remove, del and pop are only available on lists
# remove deletes by value, while del and pop delete by index
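A quick sketch of the three list operations mentioned above, using a made-up list. It also checks that plain strings really do lack remove and pop.

```python
items = ["a", "b", "c", "d", "b"]

items.remove("b")            # deletes the first element equal to "b"
after_remove = list(items)

del items[0]                 # deletes by index and returns nothing
after_del = list(items)

popped = items.pop(1)        # deletes by index and returns the element
after_pop = list(items)

string_has_remove = hasattr("abc", "remove")
string_has_pop = hasattr("abc", "pop")
print(after_remove, after_del, popped, after_pop)
```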

# Strips the whitespace from both ends of the string; s.strip(rm) removes, from the start and end of s, any characters that appear in rm.
# When rm is omitted, whitespace is removed by default (including '\n', '\r', '\t', ' ')
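The strip behaviour described above, shown on a few made-up strings. Note that the argument is treated as a set of characters, not as a prefix or suffix.

```python
raw = " \t hello weibo \r\n"
clean = raw.strip()              # default: whitespace on both ends

quoted = "'http://login.weibo.cn/'"
unquoted = quoted.strip("'")     # remove the surrounding single quotes

mixed = "xyhelloyx".strip("xy")  # any mix of 'x' and 'y' is stripped from both ends
print(repr(clean), unquoted, mixed)
```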

# get_text() returns all of the text content, while the contents list returns the pieces of content split up by the inner tags.
transfercon = contentcut2.get_text()


s = "original"
sx = s.decode('utf-8')  # be careful not to merge these two statements into one; the interpreter does not handle it
picurl = contentcut2.parent.next_sibling.find('a', text=sx)["href"]

# One more note: in Python, single and double quotes are equivalent and both process escape sequences; use a raw string (r'...') to suppress escaping


# Why can a method be called through the class name without instantiating an object first? It cannot, of course; it raises an error at run time.

# (This error is subtle: nothing complains while you write the code, because Python is interpreted,
# so the missing method is only discovered at run time.)
# The fix is either to instantiate the object and then call the method, or to make it a class method by adding the @classmethod decorator to the method
# and writing its first parameter as cls. Here I use the second approach and add the decorator.

# Class methods are decorated with classmethod.
# The implicit first argument of a class method is the class, that of an instance method is the instance, and a static method has no implicit argument
@classmethod
def findweibo(cls, sweb):
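The three kinds of methods can be compared side by side in the sketch below. The class `WeiboParser`, its `prefix` attribute, and the method bodies are all hypothetical; only the name `findweibo` and its signature come from the snippet above.

```python
class WeiboParser(object):
    prefix = "weibo:"

    def instance_tag(self, sweb):
        # instance method: needs an instance as its first argument
        return self.prefix + sweb

    @classmethod
    def findweibo(cls, sweb):
        # class method: receives the class itself as cls
        return cls.prefix + sweb

    @staticmethod
    def shout(sweb):
        # static method: no implicit argument at all
        return sweb.upper()


via_class = WeiboParser.findweibo("page")
via_static = WeiboParser.shout("page")

try:
    WeiboParser.instance_tag("page")  # called on the class without an instance
    failed = False
except TypeError:
    failed = True
print(via_class, via_static, failed)
```

Calling the instance method through the class fails only when that line actually runs, which is the subtle run-time error described above.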

# Note: join() is called outside the for loop, which means the main thread continues only after both threads started in the for loop have finished.
# Alternatively, combine a flag with a while loop to check whether the thread is still alive.
# is_alive(): return whether the thread is alive.
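The placement of join() described above can be sketched with the standard threading module. The `worker` function and the shared `results` list are made up for illustration.

```python
import threading

results = []
lock = threading.Lock()


def worker(n):
    # stand-in for one crawl task
    with lock:
        results.append(n * n)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()      # start everything first so the threads run concurrently
for t in threads:
    t.join()       # only then wait; joining inside the start loop would serialize them

all_done = not any(t.is_alive() for t in threads)
print(sorted(results), all_done)
```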

# findAll returns a list of tags

resp1 = soup.find('input', attrs={'name': 'vk'})['value']


# Regular expression containing Chinese characters
# Prefix the pattern with u rather than r; the code-point range for Chinese characters is [\u4e00-\u9fa5]
reg = u"<a href=('http://login.weibo.cn/[^\u4e00-\u9fa5]*?)[>]+?[\u4e00-\u9fa5]{2}</a>"
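Here is the pattern applied to a made-up fragment of HTML, to show what it captures. The sample markup and link text are assumptions, not real Weibo output. The captured group still carries the single quotes around the URL, which is why a [1:-1] slice follows.

```python
import re

# Hypothetical page fragment: a login link followed by an unrelated link
html = u"<a href='http://login.weibo.cn/login/?ns=1'>登录</a> <a href='http://weibo.cn/pub/'>热门</a>"
reg = u"<a href=('http://login.weibo.cn/[^\u4e00-\u9fa5]*?)[>]+?[\u4e00-\u9fa5]{2}</a>"

match = re.search(reg, html)
quoted_url = match.group(1)    # the URL, still wrapped in single quotes
loginweb = quoted_url[1:-1]    # drop the first and last character
print(loginweb)
```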


In any case, all the project files are on GitHub: https://github.com/poiu1235/weibo-catch



