I'm using the PHPCrawl framework:
Step 1: set the start URL;
Step 2: set the content type to download: text/html;
Step 3: use a regular expression to set the rule for which URLs to follow;
Step 4: start crawling; the crawler fetches every URL that matches the rule from step 3;
Step 5: use regular expressions or a DOM parser to extract the data you need.
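The five steps above can be sketched roughly as follows with PHPCrawl's usual API (subclass the crawler and override the document handler). The start URL and the follow rule here are hypothetical placeholders, not from the original question:

```php
<?php
// Sketch of the five steps with PHPCrawl; URL and follow rule are placeholders.
require_once 'libs/PHPCrawler.class.php';

class MyCrawler extends PHPCrawler
{
    // Step 5: called for every received document; parse what you need here,
    // e.g. with a regex or by loading $DocInfo->content into DOMDocument.
    function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
    {
        echo $DocInfo->url . "\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL('http://www.example.com/');                      // step 1: start address
$crawler->addContentTypeReceiveRule('#text/html#');               // step 2: only text/html
$crawler->addURLFollowRule('#http://www\.example\.com/news/# i'); // step 3: URL rule
$crawler->go();                                                   // step 4: start crawling
```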
The problem is:
Some of the content is loaded via AJAX, and the request URL is assembled by JavaScript at runtime. What can I do so the crawler fetches those addresses? Adding them to the follow rules in step 3 doesn't work, because the URLs are built by the scripts themselves; they never appear in the page source, so the rule never matches anything.
Reply content:
Use the assembled address directly: check whether the AJAX call is a GET or a POST request, set the same parameters, make the request yourself with cURL, and then parse the returned data.
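A minimal sketch of that approach with PHP's cURL extension. The endpoint URL and parameters below are hypothetical; find the real ones in the browser's developer tools (network panel) or in the page's JavaScript:

```php
<?php
// Fetch an AJAX endpoint directly with PHP's cURL extension.
// URL and parameters are placeholders -- take the real ones from the
// browser's network panel or from the JavaScript that builds the request.
$url    = 'http://www.example.com/api/list';
$params = ['page' => 1, 'size' => 20];

$ch = curl_init();

// For a GET request, append the parameters to the URL:
curl_setopt($ch, CURLOPT_URL, $url . '?' . http_build_query($params));

// For a POST request, use these three lines instead:
// curl_setopt($ch, CURLOPT_URL, $url);
// curl_setopt($ch, CURLOPT_POST, true);
// curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($params));

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Some endpoints check this header before returning data to non-browser clients:
curl_setopt($ch, CURLOPT_HTTPHEADER, ['X-Requested-With: XMLHttpRequest']);

$response = curl_exec($ch);
curl_close($ch);

// AJAX endpoints typically return JSON, which is easier to parse than HTML:
$data = json_decode($response, true);
```

If the endpoint returns JSON, this also replaces step 5: instead of regex or DOM parsing, you get the data as a structured array straight from `json_decode()`.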