[PHP] Building a network disk search engine: collecting and crawling Baidu Network Disk shared files to implement network disk search
The title sounds bigger than it is. Recently I used PHP to implement a simple network disk search program and hooked it up to the WeChat public platform: you send a keyword to the public account, and the account replies with the matching network disk resources. That is the whole function, similar to many existing network disk search sites. The collection and search programs are written in PHP, and the full-text and word-segmentation search parts use the open-source software xunsearch. The implementation process covers:
1. Obtain a batch of network disk users
2. Obtain each user's share list
3. Implement full-text and word-segmentation search with xunsearch
4. Develop the public platform interface
Feature demo:
Collect Baidu network disk users
Before collecting the share lists, we first need Baidu user information, so let me describe how to find a large number of Baidu Network Disk users. Open the browser developer tools and watch the HTTP requests, then open your own Baidu Network Disk home page at https://pan.baidu.com/pcloud/home, view the list of users you subscribe to, and observe the request that is sent:
https://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=0&bdstoken=bc329b0677cad94231e973953a09b46f&channel=chunlei&clienttype=0&web=1&logid=...

This request returns the list of users I subscribe to.
The meanings of the main parameters are: query_uk (my user id, which Baidu calls uk), limit (the number of entries per page), and start (the offset at which the page begins); the remaining parameters are not needed. For example, with limit=24 the third page starts at start=48.
The simplified interface address is therefore: https://pan.baidu.com/pcloud/friend/getfollowlist?query_uk={$uk}&limit=24&start={$start}
Generate the paginated subscription interface URLs
Suppose for now that I subscribe to 2,400 users, which is plenty. With 24 users shown per page, that makes 100 pages, so let's first generate the 100 URLs.
<?php
/**
 * Get subscribed users
 */
class UkSpider
{
    private $pages;      // number of pages to request
    private $start = 24; // number of users per page

    public function __construct($pages = 100)
    {
        $this->pages = $pages;
    }

    /**
     * Generate the interface URLs
     */
    public function makeUrl($rootUk)
    {
        $urls = array();
        for ($i = 0; $i < $this->pages; $i++) {
            $start = $this->start * $i;
            $url = "http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk={$rootUk}&limit=24&start={$start}";
            $urls[] = $url;
        }
        return $urls;
    }
}

$ukSpider = new UkSpider();
$urls = $ukSpider->makeUrl(3317165372);
print_r($urls);
The generated list of interface URLs looks like this:
Array
(
    [0] => http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=0
    [1] => http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=24
    [2] => http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=48
    [3] => http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=72
    [4] => http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=96
    [5] => http://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=120
    ...
)
Request the interface address with cURL
You could fetch the interface address with a simple file_get_contents() call, but I use the PHP cURL extension here because the request headers will need to be modified later when fetching the shared file lists.
The JSON structure returned by this interface is as follows:
{"Errno": 0, "request_id": 3319309807, "total_count": 3, "follow_list": [{"type":-1, "follow_uname ": "enthusiastic *** alliance", "avatar_url": "http://himg.bdimg.com/sys/portrait/item/7fd8667f.jpg", "intro": "", "user_type": 0, "is_vip": 0, "follow_count ": 0, "fans_count": 21677, "follow_time": 1493550371, "pubrent_count": 23467, "follow_uk": 3631952313, "album_count": 0 },{ "type ": -1, "follow_uname": "comment * Comment ", "Avatar_url": "http://himg.bdimg.com/sys/portrait/item/fa5ec198.jpg", "intro": "Wanli gold rush, for you to recommend high-quality novels, full of resource benefits! "," User_type ": 6," is_vip ": 0," follow_count ": 10," fans_count ": 5463," follow_time ": 1493548024," pubrent_count ": 2448, "follow_uk": 1587328030, "album_count": 0 },{ "type":-1, "follow_uname": "self-checking tickets", "avatar_url": "http://himg.bdimg.com/sys/portrait/item/8c5b2810.jpg ", "intro": "Nothing to watch a novel. "," User_type ": 0," is_vip ": 0," follow_count ": 299," fans_count ": 60771," follow_time ": 1493547941," pubrent_count ": 13404, "follow_uk": 1528087287, "album_count": 0}]}
If you want to build a comprehensive network disk search site, you can store all of this information in the database. Since I am only building a simple novel search site, I keep nothing but the uk numbers of the subscribed disk owners.
<?php
/**
 * Get subscribed users
 */
class UkSpider
{
    private $pages;      // number of pages to request
    private $start = 24; // number of users per page

    public function __construct($pages = 100)
    {
        $this->pages = $pages;
    }

    /**
     * Generate the interface URLs
     */
    public function makeUrl($rootUk)
    {
        $urls = array();
        for ($i = 0; $i < $this->pages; $i++) {
            $start = $this->start * $i;
            $url = "https://pan.baidu.com/pcloud/friend/getfollowlist?query_uk={$rootUk}&limit=24&start={$start}";
            $urls[] = $url;
        }
        return $urls;
    }

    /**
     * Get the subscribed users' uk numbers from one interface URL
     */
    public function getFollowsByUrl($url)
    {
        $result = $this->sendRequest($url);
        $arr = json_decode($result, true);
        if (empty($arr) || !isset($arr['follow_list'])) {
            return;
        }
        $ret = array();
        foreach ($arr['follow_list'] as $fan) {
            $ret[] = $fan['follow_uk'];
        }
        return $ret;
    }

    /**
     * Send an HTTP request with cURL
     */
    public function sendRequest($url, $data = null, $header = null)
    {
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
        if (!empty($data)) {
            curl_setopt($curl, CURLOPT_POST, 1);
            curl_setopt($curl, CURLOPT_POSTFIELDS, $data);
        }
        if (!empty($header)) {
            curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
        }
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        $output = curl_exec($curl);
        curl_close($curl);
        return $output;
    }
}

$ukSpider = new UkSpider();
$urls = $ukSpider->makeUrl(3317165372);

// Loop over the paginated URLs
foreach ($urls as $url) {
    echo "loading: " . $url . "\r\n";

    // Sleep for a random 7 to 11 seconds between requests
    $second = rand(7, 11);
    echo "sleep...{$second}s\r\n";
    sleep($second);

    // Send the request
    $followList = $ukSpider->getFollowsByUrl($url);

    // Stop once a page returns no data
    if (empty($followList)) {
        break;
    }
    print_r($followList);
}
This loops over the URLs generated in the previous step. Note that requests must be spaced out by several seconds, otherwise they will be blocked outright, and the loop should stop as soon as a page returns no data. The script must be run from the command line; in a browser it would simply time out.
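Since sendRequest() already accepts $data and $header parameters, custom request headers can be attached whenever a later interface needs them. Below is a minimal, hypothetical sketch of how that could look; the Referer and User-Agent values are placeholder assumptions for illustration, not headers the article specifies.

<?php
// Hypothetical usage sketch: passing custom request headers through
// sendRequest(). The header values below are placeholders, not confirmed
// requirements of any Baidu interface.
$ukSpider = new UkSpider();
$header = array(
    'Referer: https://pan.baidu.com/share/home?uk=3317165372',
    'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
);
$json = $ukSpider->sendRequest(
    'https://pan.baidu.com/pcloud/friend/getfollowlist?query_uk=3317165372&limit=24&start=0',
    null,     // no POST data, so the request remains a GET
    $header   // forwarded to CURLOPT_HTTPHEADER
);
print_r(json_decode($json, true));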
Collect user uk numbers in a loop
Create a MySQL table, for example uks, to store the collected user numbers. The table structure is as follows:
CREATE TABLE `uks` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `uk` varchar(100) NOT NULL DEFAULT '',
  `get_follow` tinyint(4) NOT NULL DEFAULT '0',
  `get_share` tinyint(4) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_2` (`uk`),
  KEY `uk` (`uk`)
);
Store each batch of collected numbers, then use those numbers as new starting points to find more subscribed disk owners. The important points for the loop are that uk is unique, and that get_follow defaults to 0 and is set to 1 once that user's subscription list has been fetched, so the same user is never collected twice. A minimal storage sketch follows below.
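The article does not show the storage code itself, so here is a minimal sketch of how it could look, assuming a PDO connection and the uks table above. The DSN, credentials, and overall flow are illustrative assumptions rather than the author's actual script.

<?php
// Minimal sketch: store collected uk numbers and reuse them as crawl roots.
// Assumes the `uks` table above; the DSN and credentials are placeholders.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=pan;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$ukSpider = new UkSpider();

// INSERT IGNORE relies on the UNIQUE KEY on `uk` to silently skip duplicates.
$insert = $pdo->prepare('INSERT IGNORE INTO uks (uk) VALUES (?)');

// Pick one stored uk whose subscription list has not been crawled yet.
$next = $pdo->query('SELECT id, uk FROM uks WHERE get_follow = 0 LIMIT 1')
            ->fetch(PDO::FETCH_ASSOC);

while ($next) {
    foreach ($ukSpider->makeUrl($next['uk']) as $url) {
        sleep(rand(7, 11)); // keep the interval between requests
        $followList = $ukSpider->getFollowsByUrl($url);
        if (empty($followList)) {
            break; // no more pages for this user
        }
        foreach ($followList as $uk) {
            $insert->execute(array($uk));
        }
    }
    // Mark this uk as processed so it is not crawled again.
    $pdo->prepare('UPDATE uks SET get_follow = 1 WHERE id = ?')
        ->execute(array($next['id']));

    $next = $pdo->query('SELECT id, uk FROM uks WHERE get_follow = 0 LIMIT 1')
                ->fetch(PDO::FETCH_ASSOC);
}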
The next article covers how to fetch each uk's share list and import it into the database.
For a live demo, follow the public account "online storage novels" or scan its QR code.