Writing Web Crawlers with Python is a practical guide to crawling web data with Python. It explains how to scrape data from static pages and how to manage server load through caching. The book also shows how to scrape data using AJAX URLs and the Firebug extension, and covers further crawling techniques such as rendering pages in a browser, managing cookies, and extracting data from complex sites protected by CAPTCHAs through form submission. Finally, it uses Scrapy to build an advanced web crawler and scrape several real-world sites.
Writing Web Crawlers with Python covers the following topics:
Crawling websites by following links;
Extracting data from pages with lxml;
Building a threaded crawler to download pages in parallel;
Caching downloaded content to reduce bandwidth consumption;
Parsing websites that rely on JavaScript;
Interacting with forms and sessions;
Solving CAPTCHAs on protected pages;
Reverse engineering AJAX calls;
Creating advanced crawlers with Scrapy.
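The first two topics above — following links and extracting data — can be sketched roughly as follows. The canned pages, URLs, and the regex-based link extraction are illustrative stand-ins only (the book itself parses HTML far more robustly with lxml); the core idea is the crawl queue plus a seen-set so each page is visited once:

```python
import re
from urllib.parse import urljoin

# Hypothetical canned pages standing in for real HTTP responses,
# so this sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

def download(url):
    """Stand-in for a real downloader (e.g. urllib or requests)."""
    return PAGES.get(url, "")

def link_crawler(seed_url):
    """Follow every link reachable from seed_url, visiting each page once."""
    crawl_queue = [seed_url]
    seen = set(crawl_queue)          # URLs already queued or visited
    visited = []
    while crawl_queue:
        url = crawl_queue.pop()
        visited.append(url)
        html = download(url)
        # A regex stands in for lxml here purely to keep the sketch short.
        for link in re.findall(r'href="(.*?)"', html):
            absolute = urljoin(url, link)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                crawl_queue.append(absolute)
    return visited

print(sorted(link_crawler("http://example.com/")))
```

The seen-set is what keeps a crawler from looping forever on sites whose pages link back to each other.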
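The caching topic can likewise be illustrated with a minimal disk cache, keyed by a hash of the URL; this is only a sketch of the general approach, not the book's implementation, and the fake download function exists just to make repeat fetches observable:

```python
import hashlib
import tempfile
from pathlib import Path

class DiskCache:
    """Minimal download cache: store each page on disk keyed by a hash
    of its URL, so repeat crawls skip the network entirely."""
    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return self.cache_dir / name

    def get(self, url):
        path = self._path(url)
        return path.read_text() if path.exists() else None

    def set(self, url, html):
        self._path(url).write_text(html)

downloads = 0   # counts simulated network fetches

def cached_download(url, cache):
    """Return the cached copy if present; otherwise 'download' and cache it."""
    global downloads
    html = cache.get(url)
    if html is None:
        downloads += 1                       # pretend network fetch
        html = f"<html>content of {url}</html>"
        cache.set(url, html)
    return html

cache = DiskCache(tempfile.mkdtemp())
first = cached_download("http://example.com/", cache)
second = cached_download("http://example.com/", cache)
print(first == second, downloads)   # second call hits the cache, → True 1
```

Hashing the URL sidesteps filesystem limits on path characters and length, which is why it is a common choice for cache keys.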
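The threaded-crawler topic boils down to several workers draining a shared queue of URLs. The sketch below uses only the standard library; the URL list and the sleeping download function are placeholders for real pages and real network latency:

```python
import queue
import threading
import time

URLS = [f"http://example.com/page{i}" for i in range(8)]  # hypothetical URLs

def download(url):
    """Stand-in for a slow network fetch."""
    time.sleep(0.01)
    return f"<html>{url}</html>"

def threaded_crawler(urls, num_threads=4):
    """Download every URL, spreading the work over num_threads workers."""
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return                       # queue drained, worker exits
            html = download(url)
            with lock:                       # results dict is shared
                results[url] = html

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

pages = threaded_crawler(URLS)
print(len(pages))   # → 8
```

Threads pay off for crawling because the work is I/O-bound: while one thread waits on the network, the others keep downloading.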
Who this book is for
This book is written for developers who want to build reliable data-scraping solutions. It assumes some experience with Python programming, but readers with a background in other programming languages can also follow the concepts and principles involved.