Webbots, Spiders, and Screen Scrapers: Technical Analysis and Application Practice (Original Book, 2nd Edition)

Source: Internet
Author: User
Tags: PHP, regular expressions
Basic information about Webbots, Spiders, and Screen Scrapers: Technical Analysis and Application Practice (Original Book, 2nd Edition)

Original title: Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/cURL, Second Edition
Original publisher: No Starch Press
Author: Michael Schrenk (US)
Translators: Zhang Lei, Shen Xin
Series: Huazhang Programmer's Library
Publisher: China Machine Press
ISBN: 9787111417682
Publication date: May 2013
Format: 16开 (16mo)
Pages: 282
Edition and printing: 2-1
Category: Computers > Software and Programming > Network Programming

Introduction:

Webbots, Spiders, and Screen Scrapers: Technical Analysis and Application Practice (Original Book, 2nd Edition) is an authoritative work in the field of webbots, spiders, and screen scrapers. Widely recognized in the international security community, it distills fifteen years of experience from a senior network security expert. The book explains the technical principles and advanced techniques of webbots, spiders, and screen scrapers comprehensively and in detail, and demonstrates the design and development of nine commonly used webbots through case studies, making it highly practical. Beyond its rich theoretical and hands-on content, the book also discusses ideas for commercial use and repeatedly reminds developers to build law-abiding, non-intrusive, constructive webbots.

The book's 31 chapters are organized into four parts. Part 1 (Chapters 1-7) systematically introduces the concepts and technical principles behind webbots, spiders, and screen scrapers, the foundation needed to understand and use them. Part 2 (Chapters 8-16) carefully explains, through case studies, the design and development of nine kinds of common webbots, covering price monitoring, image capture, search-ranking detection, information aggregation, FTP operations, and reading and sending email; it offers very practical guidance. Part 3 (Chapters 17-25) collects a large number of advanced techniques, including the design of spiders, purchase bots and sniping ("seckill") bots, relevant cryptography, authentication methods, advanced cookie management, scheduling webbots and spiders, and scraping difficult websites with browser macros and modified iMacros. Part 4 (Chapters 26-31) broadens the picture, covering how to design stealthy webbots and spiders, write fault-tolerant webbots, design webbot-friendly websites, keep unwanted spiders off your site, and understand the relevant legal issues.
Contents: Webbots, Spiders, and Screen Scrapers: Technical Analysis and Application Practice (Original Book, 2nd Edition)

Part 1: Basic Concepts and Technology

Chapter 1: The Main Content of This Book
1.1 Discovering the Internet's true potential
1.2 For developers
1.2.1 Webbot developers are in demand
1.2.2 Writing webbots is fun
1.2.3 Webbots use "constructive hacking" techniques
1.3 For business managers
1.3.1 Customizing the Internet for your business
1.3.2 Capitalizing on the public's inexperience with webbots
1.3.3 Getting twice the result with half the effort
1.4 Conclusion

Chapter 2: Ideas for Webbot Projects
2.1 Inspiration from browser limitations
2.1.1 Webbots that aggregate and filter relevant information
2.1.2 Webbots that interpret online information
2.1.3 Personal-agent webbots
2.2 Starting with a crazy idea
2.2.1 Helping busy people free up their time
2.2.2 Running automatically to save money
2.2.3 Protecting intellectual property
2.2.4 Monitoring opportunities
2.2.5 Verifying access rights on a website
2.2.6 Creating an online clipping service
2.2.7 Finding unauthorized Wi-Fi networks
2.2.8 Tracking website technologies
2.2.9 Letting incompatible systems communicate
2.3 Conclusion

Chapter 3: Downloading Web Pages
3.1 Think of them as files, not web pages
3.2 Downloading files with PHP's built-in functions
3.2.1 Downloading files with fopen() and fgets()
3.2.2 Downloading files with the file() function
3.3 Introduction to the PHP/CURL library
3.3.1 Multiple transfer protocols
3.3.2 Form submission
3.3.3 Basic authentication
3.3.4 Cookies
3.3.5 Redirection
3.3.6 Agent name spoofing
3.3.7 Link management
3.3.8 Socket management
3.4 Installing PHP/CURL
3.5 The lib_http library
3.5.1 Getting familiar with the defaults
3.5.2 Using lib_http
3.5.3 Learning more about HTTP headers
3.5.4 Examining the lib_http source code
3.6 Conclusion

Chapter 4: Basic Parsing Techniques
4.1 Content mixed with tags
4.2 Parsing poorly formatted HTML files
4.3 The standard parsing routine
4.4 Using the lib_parse library
4.4.1 Splitting a string at a delimiter: the split_string() function
4.4.2 Extracting text between delimiters: the return_between() function
4.4.3 Parsing a data set into an array: the parse_array() function
4.4.4 Extracting attribute values: the get_attribute() function
4.4.5 Removing unwanted text: the remove() function
4.5 Useful PHP functions
4.5.1 Detecting whether one string is inside another
4.5.2 Replacing part of a string with another string
4.5.3 Parsing unformatted text
4.5.4 Measuring the similarity of strings
4.6 Conclusion
4.6.1 Don't trust a poorly coded web page
4.6.2 Parse in small steps
4.6.3 Don't render parsed results
4.6.4 Use regular expressions sparingly

Chapter 5: Advanced Parsing Techniques Using Regular Expressions
5.1 Pattern matching, the key to regular expressions
5.2 PHP regular expression types
5.2.1 PHP regular expression functions
5.2.2 Similarities to PHP's built-in functions
5.3 Learning pattern matching from examples
5.3.1 Extracting numbers
5.3.2 Detecting a sequence of characters
5.3.3 Matching alphabetic characters
5.3.4 Wildcard matching
5.3.5 Matching alternate values
5.3.6 Regular expressions for grouping and range matching
5.4 Regular expressions relevant to webbot developers
5.4.1 Extracting phone numbers
5.4.2 What to learn next
5.5 When to use regular expressions
5.5.1 Strengths of regular expressions
5.5.2 Disadvantages of pattern matching for parsing web pages
5.5.3 Which is faster: regular expressions or PHP's built-in functions
5.6 Conclusion

Chapter 6: Automating Form Submission
6.1 Reverse engineering form interfaces
6.2 Form handlers, data fields, form methods, and event triggers
6.2.1 Form handlers
6.2.2 Data fields
6.2.3 Form methods
6.2.4 Multipart encoding
6.2.5 Event triggers
6.3 Unpredictable forms
6.3.1 JavaScript can modify a form before submission
6.3.2 Form HTML is often unreadable
6.3.3 Cookies aren't part of the form but can affect its operation
6.4 Analyzing a form
6.5 Conclusion
6.5.1 Don't give away your identity
6.5.2 Simulate a browser correctly
6.5.3 Avoid form errors

Chapter 7: Handling Large Amounts of Data
7.1 Organizing data
7.1.1 Naming conventions
7.1.2 Storing data in structured files
7.1.3 Storing text data in a database
7.1.4 Storing images in a database
7.1.5 Database or file system?
7.2 Reducing the size of the data
7.2.1 Storing the addresses of image files
7.2.2 Compressing data
7.2.3 Removing formatting information
7.3 Generating image thumbnails
7.4 Conclusion

Part 2: Webbot Projects

Chapter 8: A Price-Monitoring Webbot
8.1 The target website
8.2 Designing the parsing script
8.3 Initialization and downloading the target web page
8.4 Further exploration

Chapter 9: An Image-Capturing Webbot
9.1 An example image-capturing webbot
9.2 Creating the image-capturing webbot
9.2.1 A binary-safe download routine
9.2.2 Directory structure
9.2.3 The main script
9.3 Further exploration
9.4 Conclusion

Chapter 10: A Link-Verification Webbot
10.1 Creating the link-verification webbot
10.1.1 Initializing the webbot and downloading the target web page
10.1.2 Setting the page base
10.1.3 Extracting the links
10.1.4 Running the verification loop
10.1.5 Generating fully resolved URLs
10.1.6 Downloading the fully resolved link
10.1.7 Displaying the page status
10.2 Running the webbot
10.2.1 lib_http_codes
10.2.2 lib_resolve_addresses
10.3 Further exploration

Chapter 11: A Search-Ranking Webbot
11.1 Introducing search-ranking detection
11.2 What does the search-ranking webbot do?
11.3 Running the search-ranking webbot
11.4 How the search-ranking webbot works
11.5 The search-ranking webbot script
11.5.1 Initializing variables
11.5.2 Starting the loop
11.5.3 Fetching the search results
11.5.4 Parsing the search results
11.6 Conclusion
11.6.1 The data source must be well formatted
11.6.2 Web search sites may treat webbots differently than web browsers
11.6.3 Scraping search engines is not a good idea
11.6.4 Getting familiar with the Google API
11.7 Further exploration

Chapter 12: An Information-Aggregation Webbot
12.1 Choosing data sources for webbots
12.2 An example information-aggregation webbot
12.2.1 Getting familiar with RSS feeds
12.2.2 Writing the information-aggregation webbot
12.3 Adding a filtering mechanism to the information-aggregation webbot
12.4 Further exploration

Chapter 13: An FTP Webbot
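As the contents above show, Part 1 revolves around two tasks: downloading pages with PHP/cURL (Chapter 3) and extracting data from them (Chapters 4 and 5). Purely as an illustration of that kind of technique, and not as code from the book or its lib_http/lib_parse libraries, the following minimal PHP sketch fetches a page with cURL and pulls out the text between two delimiters; the URL and both helper functions are placeholders invented for this example.

<?php
// A minimal sketch only (not code from the book or its libraries):
// download a page with PHP/cURL and extract the text between two markers.
// The URL and both helper functions are placeholders for illustration.

function download_page($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (example webbot)');
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

// Roughly the idea behind a return_between()-style helper: grab the
// substring that sits between a start marker and an end marker.
function between($string, $start, $end)
{
    $from = strpos($string, $start);
    if ($from === false) {
        return '';
    }
    $from += strlen($start);
    $to = strpos($string, $end, $from);
    return $to === false ? '' : substr($string, $from, $to - $from);
}

$html  = download_page('https://www.example.com/');  // placeholder URL
$title = between($html, '<title>', '</title>');
echo "Page title: $title\n";

The project chapters listed above (price monitoring, image capture, link verification, and so on) build on the book's lib_http and lib_parse libraries rather than raw cURL calls like these; per the table of contents, those libraries wrap PHP/cURL with convenient defaults and parsing helpers.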
