I have always wanted to build a website full of picture content. My first idea was to create a CMS and upload the pictures myself, but I never found the motivation for that. Eventually I gave up on it and studied cURL instead; scraping is a better way to implement the idea anyway.
Using PHP to "steal" pictures is like wearing socks with sandals: it works, but it hurts to look at.
First, the design of the PHP scraper. PHP has no built-in multithreading, so everything runs sequentially:
Fetch the target site's HTML page → parse the HTML to get the image links → read each image in binary mode and save it locally → rename it. That is the whole process.
You can run the program in two ways:
First: run it from the browser (it will usually hang; raise the timeout and the memory limit and it works, but the waiting in between is painful).
Second: start PHP from the command line (the CLI has no execution timeout).
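As a rough sketch of the browser route, something like the following would go at the top of the script. The directive names are standard PHP settings, but the specific values (300 seconds, 256M) are my own illustration, not from the original post:

```php
<?php
// Illustrative settings for running the scraper through a browser.
// The values (300 seconds, 256M) are arbitrary examples.
set_time_limit(300);             // raise the execution timeout
ini_set('memory_limit', '256M'); // raise the memory ceiling

echo ini_get('memory_limit'), "\n";
```

From the command line none of this is needed: the PHP CLI SAPI sets `max_execution_time` to 0 (unlimited) by default.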
The code is as follows:

```php
<?php
/**
 * HTML parsing class
 * Author: Summer
 * Date: 2014-08-22
 */
class Analytical
{
    public function __construct()
    {
        require_once('class/simple_html_dom.class.php');
        $this->_getDir();
    }

    private function _getDir()
    {
        $dir    = "../tmp/html/results/1";
        $imgBIG = "../tmp/img/jpg/big";
        $it = new DirectoryIterator($dir . "/");
        foreach ($it as $file) {
            // Use the isDot() method to filter out the "." and ".." directory entries
            if (!$it->isDot()) {
                $dirs = $dir . "/" . $file;
                $tmp  = explode(".", $file);
                $html = file_get_html($dirs);
                $ulArr = $html->find('img');
                foreach ($ulArr as $key => $value) {
                    if ($value->class == "u") {
                        $url        = "http://www.111cn.net" . $value->src;
                        $infomation = file_get_contents($url);
                        $result     = $this->saveHtml($infomation, $imgBIG, $tmp['0'] . ".jpg");
                        if ($result) {
                            echo $file . " OK\n";
                        }
                    }
                }
            }
        }
    }

    private function saveHtml($infomation, $filedir, $filename)
    {
        if (!$this->mkdirs($filedir)) {
            return 0;
        }
        $sf = $filedir . "/" . $filename;
        $fp = fopen($sf, "w");             // open the file in write mode
        $bytes = fwrite($fp, $infomation); // save the content
        fclose($fp);                       // close the file
        return $bytes;
    }

    // Create a directory, recursively
    private function mkdirs($dir)
    {
        if (!is_dir($dir)) {
            if (!$this->mkdirs(dirname($dir))) {
                return false;
            }
            if (!mkdir($dir, 0777)) {
                return false;
            }
        }
        return true;
    }
}

new Analytical();
```
The code above is the process of extracting the IMG link addresses from the HTML pages.
Two important tools are used:
1. simple_html_dom, a PHP extension for DOM parsing
2. PHP's DirectoryIterator
Once you understand these two, there is nothing difficult about this parsing class.
What about getting the pages to be parsed in the first place?
The principle is the same as above: take the page's URL, read the page through cURL, and get back an HTML string.
Then a file-writing function saves the HTML page to your local disk.
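The fetch step just described can be sketched like this. The helper name `fetch_page` and the particular options chosen are my own illustration, not code from the original program:

```php
<?php
// Minimal sketch: fetch a page over cURL and return the HTML as a string.
// fetch_page() is a hypothetical helper, not part of the original program.
function fetch_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // per-request timeout, in seconds
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Saving the fetched HTML locally, as the article describes:
// $html = fetch_page("http://www.111cn.net/somepage.html");
// file_put_contents("../tmp/html/results/1/somepage.html", $html);
```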
I also wanted to download the images on the page themselves (to get around other sites' anti-leech protection), so the design got more complicated.
The reason for splitting the steps apart is that simple_html_dom objects are very large, and the process is clearer when broken into stages.
Some people will ask: why go through saving the HTML locally instead of just matching it with regular expressions? BINGO! I couldn't be bothered to write the regex.
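For completeness, the regex route the author skipped might look like this. This is only a sketch: the pattern is mine, and it only handles plain quoted `src` attributes, nothing fancier:

```php
<?php
// Sketch of the regex alternative: pull img src attributes straight out of
// an HTML string, with no simple_html_dom and no local copy of the page.
$html = '<div><img class="u" src="/upload/a.jpg"><img src="/upload/b.png"></div>';

preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

print_r($matches[1]);
// The captured group holds the raw src values, e.g. /upload/a.jpg and /upload/b.png
```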