The project and documentation are on GitHub at https://github.com/poiu1235/weibo-catch.
Interested readers can follow it.
I use a deep-mining approach here: no boundary is set on the crawl (something to be considered later).
The general idea is to log in with my own account and get my own Weibo list and friends list.
Then, following that friends list, crawl each friend's Weibo list and friends list in turn, digging and traversing ever deeper in this way.
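A minimal sketch of this traversal, here in breadth-first form, with a hypothetical fetch_friends function standing in for the actual page-scraping code:

```python
from collections import deque

def crawl(seed_uid, fetch_friends, max_users=1000):
    """Traverse the follow graph starting from our own account.

    fetch_friends(uid) is a stand-in for the real scraping code:
    it should return the list of user ids that `uid` follows.
    There is no natural boundary to the graph, so we stop after
    max_users accounts (the boundary the article says is still TBD).
    """
    seen = {seed_uid}
    queue = deque([seed_uid])
    order = []
    while queue and len(order) < max_users:
        uid = queue.popleft()
        order.append(uid)  # here the real crawler would also fetch uid's Weibo list
        for friend in fetch_friends(uid):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return order

# tiny fake follow graph for illustration
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl("a", lambda uid: graph.get(uid, [])))  # → ['a', 'b', 'c']
```

The same skeleton works depth-first by popping from the right end of the deque instead.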
I used a MySQL database for storage, and later a MongoDB database.
A bit of Linux background: what each folder in the system is for:
/bin
bin is short for binary. This directory, inherited from the UNIX directory structure, stores the commands users run most often, such as cp, ls, cat, and so on.
/boot
Here are some of the core files used when starting Linux.
/dev
dev is short for device. This directory holds all of Linux's external devices; it functions like .sys files in DOS and .vxd files under Windows. In Linux, devices are accessed in the same way as files. For example, /dev/hda represents the first physical IDE hard disk.
/etc
This directory is used to store the configuration files and subdirectories required for system administration.
/home
Users' home directories. For example, a user named wang has the home directory /home/wang, which can also be written ~wang.
/lib
This directory contains the system's most basic shared dynamic-link libraries, which act like .dll files in Windows. Almost all applications require these shared libraries.
/lost+found
This directory is usually empty; after an abnormal shutdown it becomes a refuge for homeless file fragments, a bit like .chk files under DOS.
/mnt
This directory is empty; the system provides it so users can temporarily mount other file systems.
/proc
This is a virtual directory: a mapping of system memory. We can obtain system information by accessing it directly; in other words, its contents live in memory, not on the hard disk.
/root
The home directory of the system administrator (also called the superuser). As the owner of the system, he must have some privileges, such as a directory of his own!
/sbin
The s stands for superuser: the management programs used by the system administrator are stored here.
/tmp
This directory, needless to say, must be the place to store some temporary files.
/usr
This is the largest directory; almost all the applications and files we use are stored in it. It contains the following subdirectories:
/usr/X11R6
The directory that stores the X Window system;
/usr/bin
Store many applications;
/usr/sbin
Some management programs for super users are put here;
/usr/doc
This is the home base of Linux documentation;
/usr/include
Header files needed for developing and compiling applications under Linux are found here;
/usr/lib
Stores common shared dynamic-link libraries and static archives;
/usr/local
This is the /usr area provided for ordinary users; it is the most suitable place to install software;
/usr/man
man is a synonym for help in Linux; this is the directory that stores the help documents;
/usr/src
Linux source code is kept in this directory; enthusiasts, don't pass it up!
/var
This directory holds things that keep growing. To keep /usr relatively stable, directories that are modified frequently can be placed here, and many system administrators do exactly that. Incidentally, the system's log files live in /var/log.
Crawl Weibo rules
In the whole process I want to crawl two kinds of content.
One is Weibo content;
the other is friend relationships (the people a user follows: active, a strong signal of interest) rather than follower relationships (fans: passive, too much noise).
I also plan to store what is crawled in two places: one is a MySQL relational database, the other is text files in the file system.
For MySQL I reuse the instance on the NameNode of the earlier Hadoop system, at IP 192.168.1.113.
Because a default MySQL installation is bound to an IP that limits it to local, single-machine access,
the configuration has to be changed:
sudo nano /etc/mysql/my.cnf and find the line bind-address = 127.0.0.1,
then comment it out: #bind-address = 127.0.0.1
to allow access from any IP.
After this is commented out, the host can no longer be omitted even for local access; it must be specified with the -h option.
Restart MySQL: sudo /etc/init.d/mysql restart
Authorize the user for remote connections:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;
FLUSH PRIVILEGES;
The first command reads as follows. *.*: the first * is the database name and the second * is the table name; together they grant all tables in all databases to the user. root: the account being granted. '%': the authorized client IP may be specified here; % means any IP address can access the MySQL server. 'password': the account's password; replace it with your actual MySQL root password.
The second command refreshes the privilege information, i.e. it makes our settings take effect immediately.
Log in to MySQL locally with: mysql -uroot -p
To log in to MySQL from another machine: mysql -uroot -h192.168.1.113 -p
# Check the encoding:
show variables like "character%";
# Change the encoding:
sudo service mysql stop
sudo nano /etc/mysql/my.cnf
Add the following two lines under the [mysqld] section of the file:
character_set_server=utf8
init_connect='SET NAMES utf8'
sudo service mysql start
The database character set and the server character set then both become UTF-8, so Chinese can be written.
As MySQL assistant tools on Ubuntu I installed Workbench and HeidiSQL; online, Navicat is not recommended for this, although under Windows Navicat is what I find handiest.
The default "mysql" schema in the data directory is not visible in Workbench, is it?
Click Edit -> Preferences -> SQL Editor in the menu and tick the checkbox in front of "Show Data Dictionaries and Internal Schemas".
Go back and refresh or reconnect, and it will appear.
And then comes some database- and table-creation work.
With execute() you use type-matched format specifiers. For a table with name (char) and age (int), the statement is: cur.execute("INSERT INTO db.table (name, age) VALUES (%s, %d)" % (v[name], v[age])), where v[name] is a string and v[age] is an int.
With executemany() you do not need to force the types; for the same table the statements are:
values = [("zhan", 25), ("li", 28)]
cur.executemany("INSERT INTO db.tables (name, age) VALUES (%s, %s)", values)
I do not know the exact reason; I can only say it was learned the hard way: with executemany() just use %s everywhere, other specifiers may raise errors. The inserted values come out correct (it seems to auto-match the column types?). If any expert knows why, please advise.
Note: cursor.execute() can accept either one argument or two:
(1) cursor.execute("INSERT INTO resource (cid, name) VALUES (%s, %s)", (12, name));
In this two-argument form, MySQLdb escapes and quotes the strings for you, so you do not escape anything yourself. After this statement runs, the resource table has one more record: cid 12 with the given name.
(2) cursor.execute("INSERT INTO resource (cid, name) VALUES (%s, %s)" % (12, name));
This form uses Python's own string formatting to build the query, i.e. execute() receives a single argument, so you must escape the string and add the quotation marks yourself. The statement above is therefore wrong and should be changed to:
name = MySQLdb.escape_string(name);
cursor.execute("INSERT INTO resource (cid, name) VALUES (%s, '%s')" % (12, name));
This inserts the same record as (1).
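The escaping behavior can be illustrated with a small pure-Python mock. To be clear, this is a simplified imitation of what MySQLdb's two-argument execute() does internally, not the library's actual code:

```python
def mysql_format(sql, params):
    # Simplified mock of MySQLdb's parameter substitution:
    # each parameter is escaped and quoted as a SQL literal, then
    # substituted into the %s placeholders with plain string
    # formatting -- which is why %s works even for integers.
    def literal(v):
        if isinstance(v, str):
            return "'" + v.replace("'", "\\'") + "'"
        return str(v)  # numbers need no quotes
    return sql % tuple(literal(v) for v in params)

print(mysql_format("INSERT INTO resource (cid, name) VALUES (%s, %s)",
                   (12, "o'neil")))
```

The printed statement has 12 unquoted and the name quoted with the inner apostrophe escaped, mirroring what the real driver produces.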
CREATE DATABASE weibocatch;
USE weibocatch;
/* a Chinese character and a letter each count as one character in char(n)/varchar(n) */
/* note: several column lengths were garbled in the original; the lengths below are plausible reconstructions */
CREATE TABLE w_user (
    wid CHAR(10) PRIMARY KEY,
    wname VARCHAR(30) NOT NULL,
    recon TINYINT(1),               /* 0 = not verified, 1 = verified */
    color TINYINT(1) DEFAULT NULL,  /* 0 = yellow V (personal), 1 = blue V (enterprise) */
    flag TINYINT(2) DEFAULT 0,      /* 0 = not yet crawled, 1 = crawled successfully, 2 = crawl failed */
    inserttime TIMESTAMP DEFAULT NOW()
);
CREATE TABLE w_error (wid CHAR(10), exception VARCHAR(200));
CREATE TABLE w_relation (wid CHAR(10), wfriendid CHAR(10));
CREATE TABLE w_conn (
    weiboid CHAR(20),
    tag TINYINT(1),                 /* whether the weibo is a repost or an original: 1 = repost, 0 = original */
    atid CHAR(10),
    atname VARCHAR(30),
    PRIMARY KEY (weiboid, tag)
);
/* the DATE and DATETIME types cannot use a function as the default value, hence the TIMESTAMP type */
CREATE TABLE w_content (
    weiboid CHAR(20) PRIMARY KEY,
    wid CHAR(10),
    content VARCHAR(500),
    url VARCHAR(200),
    map VARCHAR(200),
    label VARCHAR(200),
    intime VARCHAR(50),
    picid INT,
    wfrom VARCHAR(100),
    inserttime TIMESTAMP DEFAULT NOW()
);
/* same TIMESTAMP reasoning as above */
CREATE TABLE w_transfer (
    weiboid CHAR(20) PRIMARY KEY,
    wid CHAR(10),
    transferid CHAR(20),
    content VARCHAR(500),
    remark VARCHAR(200),
    label VARCHAR(200),
    url VARCHAR(200),
    intime VARCHAR(50),
    picid INT,
    wfrom VARCHAR(100),
    inserttime TIMESTAMP DEFAULT NOW()
);
CREATE TABLE pic_reg (
    id INT PRIMARY KEY AUTO_INCREMENT,
    path VARCHAR(200),
    label1 VARCHAR(200),
    label2 VARCHAR(200),
    label3 VARCHAR(200),
    title VARCHAR(9000)
) AUTO_INCREMENT=10000;
INSERT INTO weibocatch.w_user (wid, wname, recon, color, flag) VALUES ('2468833122', 'poiu1235', 0, NULL, 0);
INSERT INTO weibocatch.w_user (wid, wname, recon, color, flag) VALUES ('2430104687', 'jansenkaigen', 0, 0, 0);
SHOW GLOBAL VARIABLES LIKE 'auto_incre%';  -- global variable
SELECT * FROM weibocatch.w_relation;
SELECT now();
DELETE FROM weibocatch.pic_reg;
DELETE FROM weibocatch.w_transfer;
ALTER TABLE weibocatch.pic_reg AUTO_INCREMENT=10000;
SELECT * FROM weibocatch.w_user;
UPDATE weibocatch.w_user SET flag=0 WHERE wid='2430104687';
Insert and update operations must be followed by self.conn.commit(), otherwise the database will not be changed.
try:
    xxxx
except Exception as err:
    self.conn.rollback()
    print err
finally:
    # close the connection and release resources
    cursor.close()
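The same commit/rollback pattern, sketched with the standard-library sqlite3 module so it runs anywhere; the MySQLdb calls have the same shape, except that sqlite3 uses ? placeholders where MySQLdb uses %s:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w_user (wid TEXT PRIMARY KEY, wname TEXT)")

cursor = conn.cursor()
try:
    cursor.execute("INSERT INTO w_user VALUES (?, ?)", ("2468833122", "poiu1235"))
    conn.commit()      # without commit() the insert is never persisted
except Exception as err:
    conn.rollback()    # undo the partial work on any error
    print(err)
finally:
    cursor.close()     # always release the cursor

print(conn.execute("SELECT COUNT(*) FROM w_user").fetchone()[0])  # → 1
```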
if wconn.tag.strip():  # check that the string is not empty
I write the database-connection setup in a function, so it can be called whenever a connection is needed on the spot.
self.inition()  # the MySQL connection times out easily (default 10 s), so I simply re-declare it each time it is used
# When inserting a record, use %s for every placeholder, even for numbers; do not use %d,
# because the driver converts the values for you: strings automatically get quotes added, numbers do not.
# The value last_insert_id() returns for a given Connection object is the first AUTO_INCREMENT value produced by
# that Connection's most recent statement affecting an AUTO_INCREMENT column.
# The value is not affected by other Connection objects; each produces its own AUTO_INCREMENT values.
# Second, last_insert_id is not tied to a table: if you insert into table a and then into table b,
# LAST_INSERT_ID returns the id value from table b.
cursor.execute("SELECT last_insert_id()")  # fetch just one record:
result = cursor.fetchone()  # this returns a tuple; a one-element tuple prints with a trailing comma, e.g. (5,), which has no special meaning
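The one-element-tuple behavior of fetchone() is easy to see with sqlite3; sqlite's counterpart of MySQL's LAST_INSERT_ID() is last_insert_rowid(), and it is likewise per-connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
cur = conn.cursor()
cur.execute("INSERT INTO t (name) VALUES ('x')")
cur.execute("SELECT last_insert_rowid()")
result = cur.fetchone()
print(result)     # → (1,)  -- a one-element tuple, not a bare number
print(result[0])  # → 1
```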
loginweb = loginweb[1:-1]  # take from the second character to the second-to-last, i.e. remove the single quotes at both ends of the pattern string
# reload(sys)  # sys.setdefaultencoding("utf-8")  # added because the IDE plugin reported an error, though the program itself ran fine
# Setting the read directory to the current one with os.chdir("/home/luis/workspace/weibo-catch/picture/") is convenient for saving files,
# but once the current path changes the cookie file path can no longer be read correctly (check with os.getcwd()).
# So writing the full path directly saves a lot of trouble: ppth = "/home/luis/workspace/weibo-catch/picture/"
data = urllib.urlopen(picurl).read()
f = open(name, 'wb', 8192)  # set the file buffer to 8192 bytes (8 KB); some pictures are large and I was afraid of incomplete writes
f.write(data)
f.close()  # the close() method flushes the buffer and then closes the file
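The save step in runnable form, with local bytes standing in for the urlopen() call so the sketch is self-contained. The filename is illustrative; note that a buffering value of 8192 is 8 KB, not 8 MB:

```python
data = b"\x89PNG fake image bytes"  # stand-in for urllib.urlopen(picurl).read()
name = "test_pic.png"

f = open(name, "wb", 8192)  # 8192-byte (8 KB) write buffer
f.write(data)
f.close()                   # close() flushes the buffer and closes the file;
                            # beware: `f.close` without parentheses does nothing

with open(name, "rb") as check:
    print(check.read() == data)  # → True
```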
# Because the f.read() method reads the str type, which is UTF-8 by default here, you can search it directly as a str without a decode() transcoding step.
if countt == 10: subprocess.call("pause", shell=True)  # pause the program; this replaced the os.system('pause') approach, or use: if raw_input(): pass ("pause" is a Windows command, so on Linux the raw_input() variant is the one that actually pauses)
# Note that every object BeautifulSoup extracts is a "tag" object on which you can directly perform BeautifulSoup operations;
# it is not a string type, so to do string manipulation you must first convert it with str().
# Beautiful Soup uses an encoding auto-detection sub-library to identify the current document's encoding and convert it to Unicode,
# and when it outputs a document, whatever encoding the input was, the output is UTF-8.
# But operations like indexing and a simple find() convert to Unicode first, operate, and convert back to UTF-8 afterwards.
# i.e. transcode and then transcode back (decode('utf8') first, then encode('utf8'))
time.sleep(2)  # pause 2 seconds after completing each task
# Common string operations include rfind(), which searches from the end; BeautifulSoup has its own find methods.
# You cannot use remove() here; remove, del, and pop are methods only lists have.
# remove() deletes by value, while del and pop delete by index.
# s.strip(rm) removes the characters that appear in the sequence rm from both ends of the string s;
# when rm is empty, it removes whitespace by default (including '\n', '\r', '\t', ' ').
# The get_text() method gets all the text content, while the contents attribute gets the inner tag fragments as a list. transfercon = contentcut2.get_text()
s = "original"  # in the source this string is the Chinese marker for an original (non-reposted) post
sx = s.decode('utf-8')  # take care never to write these two statements on one line; the dynamic compilation does not recognize it
picurl = contentcut2.parent.next_sibling.find('a', text=sx)["href"]
# One more note: double quotes escape their contents, while single quotes do not.
# Why can a method seemingly be called directly through the class name without instantiating an object? Of course it cannot; it errors at run time.
# (This error is quite subtle and causes no error while coding, because Python is interpreted;
# the missing method is only discovered during the run.)
# Two solutions: either instantiate an object before calling the method, or make it a class method by adding the @classmethod decorator in the class
# and writing the method's first parameter as cls. Here I use the second approach, adding the decorator in the class.
# Class methods are decorated with @classmethod.
# A class method's implicit call argument is the class, whereas an instance method's implicit argument is an instance of the class; a static method has no implicit argument.
@classmethod
def findweibo(cls, sweb):
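A minimal illustration of the two fixes, with hypothetical class and method names:

```python
class WeiboFinder(object):
    prefix = "weibo:"

    def find_instance(self, sweb):
        # instance method: the implicit first argument is an instance,
        # so calling WeiboFinder.find_instance("x") fails at run time
        return self.prefix + sweb

    @classmethod
    def findweibo(cls, sweb):
        # class method: the implicit first argument is the class itself,
        # so it can be called directly on the class name
        return cls.prefix + sweb

print(WeiboFinder.findweibo("hello"))        # works without an instance → weibo:hello
print(WeiboFinder().find_instance("hello"))  # instance method needs an object → weibo:hello
```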
# Note: the join() calls are positioned outside the for loop, which means the main process must wait for both threads started in the for loop to finish before continuing.
# Alternatively, combine a flag with a while loop to determine whether the workers are still running.
# is_alive(): return whether the thread is alive.
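Why join() belongs outside the start loop: joining inside the loop would serialize the workers, while starting all threads first and joining afterwards lets them run in parallel. A sketch with two dummy workers:

```python
import threading
import time

results = []

def worker(n):
    time.sleep(0.1)  # pretend to crawl something
    results.append(n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()        # start both workers first
for t in threads:
    t.join()         # then wait for BOTH to finish -- join() sits outside the start loop

print(sorted(results))                     # → [0, 1]  (main thread continues only once both are done)
print(any(t.is_alive() for t in threads))  # → False
```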
# findAll returns a list of tags
resp1 = soup.find('input', attrs={'name': 'vk'})['value']
# A Chinese regular expression: prefix it with u, not r; the encoding range of Chinese characters is [\u4e00-\u9fa5]
reg = u"<a href=('http://login.weibo.cn/[^\u4e00-\u9fa5]*?)[>]+?[\u4e00-\u9fa5]{2}</a>"
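A self-contained check of the [\u4e00-\u9fa5] character class. The login-page regex above is a best-effort reconstruction of a garbled line, so this only demonstrates the core idea:

```python
import re

# [\u4e00-\u9fa5] covers the CJK unified ideographs block, i.e. common Chinese characters
han = re.compile(u"[\u4e00-\u9fa5]{2}")

print(bool(han.search(u"登录")))    # two Chinese characters → True
print(bool(han.search(u"login")))  # ASCII only → False
```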
In any case, all the project files are on GitHub: https://github.com/poiu1235/weibo-catch
Copyright notice: this article is the blogger's original work; do not reproduce it without the blogger's permission.
Python Crawler 3: Sina Weibo Crawler in Practice