How to Write web crawlers in php?

Source: Internet
Author: User
How to Write web crawlers in PHP? 1. don't tell me that PHP is not suitable for this. I don't want to learn a new language to write crawlers. I know it can be implemented. 2. I have a certain degree of PHP programming basics, familiar with data structures and algorithms, and have basic network knowledge, such as TCPIP protocol. can you provide the name of a specific book? 4. Name of an online article. can I greedy for the source code? Thank you! How to Write web crawlers in PHP?
1. Don't tell me that PHP is not suitable for this. I don't want to learn a new language to write crawlers. I know it can be implemented.
2. I have a certain degree of PHP programming basics. I am familiar with data structures and algorithms and have basic network knowledge, such as TCP/IP protocols.
3. Can you provide the name of a specific book or an online article?
4. Can I greedy for source code?
Thank you! Reply content:
  • Pcntl_fork or swoole_process implement multi-process concurrency. It takes 500 ms to capture each web page, and 200 processes can be opened to capture 400 pages per second.
  • Curl can capture pages and set cookies to simulate logon.
  • Simple_html_dom implements page parsing and DOM Processing
  • To simulate a browser, use casperJS. Encapsulate a service interface with swoole extension to call the PHP Layer
Here, a crawler system is implemented based on the above technical solutions. It captures tens of millions of pages every day. You need this-Goutte, a simple PHP Web Scraper-FriendsOfPHP/Goutte · GitHub USTC Spider This is written in PHP. You can capture the target website, write data to the local device, and then directly read the local file. It is not difficult to implement content crawler in php. The curl and selenium mentioned above can already accomplish almost all possible tasks. However, if you still want to process the content, you 'd better add something that can process user interaction. casperjs is. Webbots, Spiders, and Screen Scrapers: Technical Analysis and Application practices. This afternoon, a Douban group that meets the requirements is captured through keywords. , Very rough. It's also just the beginning.
The problem is that you are still trying to solve the problem...
It's too slow... it's done by a single thread. I think the answer to the most votes is quite good. Prepare for further transformation. Php simulates logon to the educational administration system. During the test, the logon is successful, but the page is not displayed. The simplest method is to use the regular expression + get_file_contents to implement crawling.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.