Python Network Data Collection (PDF)

Source: Internet
Author: User
Tags: nltk

Content Introduction

This book uses the simple yet powerful Python language to introduce network data collection, and it provides comprehensive guidance for collecting the many types of data found on the modern web. Part I covers the basic principles of network data collection: how to request information from a web server using Python, how to handle the server's response, and how to interact with websites in an automated way. Part II describes how to use web crawlers to test websites, automate processes, and access the network in more ways.
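As a rough illustration of the first two steps named above (requesting a page and handling the response), here is a minimal sketch using only the Python standard library. It is not taken from the book, which introduces BeautifulSoup for parsing; the names `TitleParser`, `extract_title`, and `fetch_title` are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the stripped page title, or None if the HTML has no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title is not None else None


def fetch_title(url):
    """Request a page and return its title, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
    except (HTTPError, URLError):
        # The server returned an error status or could not be reached.
        return None
    return extract_title(html)
```

Calling `fetch_title` with a page URL performs the request, while `extract_title` can be tried on any HTML string without network access. Returning `None` on failure is one possible design choice; a real crawler might log the error or retry instead.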

Author Profile

Ryan Mitchell

Ryan Mitchell is a data scientist and software engineer. She currently works at LinkeDrive in Boston, where she is responsible for developing the company's API and data analysis tools. Previously, she built web crawlers and network bots at Abine. She often consults on network data collection projects, mainly for the financial and retail sectors. She is also the author of Instant Web Scraping with Java.

Catalogue
Translator's Preface
Preface
Part I Building Crawlers
Chapter 1 A First Look at Web Crawlers
1.1 Network Connections
1.2 Introduction to BeautifulSoup
1.2.1 Installing BeautifulSoup
1.2.2 Running BeautifulSoup
1.2.3 Reliable Network Connections
Chapter 2 Complex HTML Parsing
2.1 You Don't Always Need a Hammer
2.2 Another Serving of BeautifulSoup
2.2.1 BeautifulSoup's find() and findAll()
2.2.2 Other BeautifulSoup Objects
2.2.3 Navigating Trees
2.3 Regular Expressions
2.4 Regular Expressions and BeautifulSoup
2.5 Getting Attributes
2.6 Lambda Expressions
2.7 Beyond BeautifulSoup
Chapter 3 Starting to Crawl
3.1 Traversing a Single Domain
3.2 Crawling an Entire Site
3.3 Crawling Across the Internet
3.4 Crawling with Scrapy
Chapter 4 Using APIs
4.1 API Overview
4.2 General API Conventions
4.2.1 Methods
4.2.2 Authentication
4.3 Server Responses
4.4 Echo Nest
4.5 Twitter API
4.5.1 Getting Started
4.5.2 A Few Examples
4.6 Google API
4.6.1 Getting Started
4.6.2 A Few Examples
4.7 Parsing JSON Data
4.8 Back to the Topic
4.9 More About APIs
Chapter 5 Storing Data
5.1 Media Files
5.2 Storing Data in CSV
5.3 MySQL
5.3.1 Installing MySQL
5.3.2 Basic Commands
5.3.3 Integrating with Python
5.3.4 Database Techniques and Best Practices
5.3.5 "Six Degrees" in MySQL
5.4 Email
Chapter 6 Reading Documents
6.1 Document Encoding
6.2 Plain Text
6.3 CSV
6.4 PDF
6.5 Microsoft Word and .docx
Part II Advanced Data Collection
Chapter 7 Data Cleaning
7.1 Cleaning Data in Code
7.2 Cleaning Data After Storage
Chapter 8 Natural Language Processing
8.1 Summarizing Data
8.2 Markov Models
8.3 Natural Language Toolkit
8.3.1 Installation and Setup
8.3.2 Statistical Analysis with NLTK
8.3.3 Part-of-Speech Analysis with NLTK
8.4 Other Resources
Chapter 9 Crawling Through Web Forms and Login Windows
9.1 The Python Requests Library
9.2 Submitting a Basic Form
9.3 Radio Buttons, Checkboxes, and Other Inputs
9.4 Submitting Files and Images
9.5 Handling Logins and Cookies
9.6 Other Form Issues
Chapter 10 Scraping JavaScript
10.1 Introduction to JavaScript
10.2 Ajax and Dynamic HTML
10.3 Handling Redirects
Chapter 11 Image Recognition and Text Processing
11.1 Overview of OCR Libraries
11.1.1 Pillow
11.1.2 Tesseract
11.1.3 NumPy
11.2 Processing Well-Formatted Text
11.3 Reading CAPTCHAs and Training Tesseract
11.4 Retrieving CAPTCHAs and Submitting Answers
Chapter 12 Avoiding Scraping Traps
12.1 A Note on Ethics
12.2 Making a Web Bot Look Like a Human User
12.2.1 Modifying Request Headers
12.2.2 Handling Cookies
12.2.3 Timing Is Everything
12.3 Common Form Security Measures
12.3.1 Hidden Input Field Values
12.3.2 Avoiding Honeypots
12.4 Problem Checklist
Chapter 13 Testing Websites with Crawlers
13.1 Introduction to Testing
13.2 Python Unit Tests
13.3 Selenium Unit Tests
13.4 Choosing Between Python Unit Tests and Selenium Unit Tests
Chapter 14 Remote Collection
14.1 Why Use a Remote Server
14.1.1 Preventing IP Addresses from Being Blocked
14.1.2 Portability and Extensibility
14.2 Tor Proxy Server
14.3 Remote Hosts
14.3.1 Running from a Web Host
14.3.2 Running from a Cloud Host
14.4 Other Resources
14.5 Moving Forward
Appendix A Introduction to Python
Appendix B Introduction to the Internet
Appendix C Legal and Ethical Constraints on Network Data Collection
About the Author
About the Cover
