My experience writing a Python crawler

Source: Internet
Author: User

Why write a Python crawler? Because a movie site I use has an annoying limitation: it lets you search for movies by tag, but only one tag at a time, with no two-tag or three-tag queries. Since each movie has several genre tags, I have to open each movie's detail page and check whether its tags are the ones I want. That is too much trouble, so I decided to write a Python crawler.

First, requirements analysis.

The process is as follows: get each movie's URL from the site's listing page -> open each movie's detail page -> check whether its tags meet the requirements, and if they do, return the movie's name -> save the URLs and names of the matching movies to a file -> go to the next page.

A simple requirement; basically two for loops solve the problem.
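A minimal sketch of those two loops, under some assumptions: the function names (crawl, save_matches) and the three helper callables passed in are hypothetical, not the code I actually wrote. The fetching and parsing steps are passed in as functions so the loop structure can be read on its own.

```python
def crawl(listing_pages, fetch, extract_movie_urls, extract_title_and_tags, wanted_tags):
    """Return (title, url) pairs for movies that carry every wanted tag.

    fetch, extract_movie_urls and extract_title_and_tags are hypothetical
    helpers supplied by the caller; they stand in for the real HTTP request
    and the real page parsing.
    """
    wanted = set(wanted_tags)
    matches = []
    for page_url in listing_pages:                           # outer loop: listing pages
        listing_html = fetch(page_url)
        for movie_url in extract_movie_urls(listing_html):   # inner loop: movies on the page
            title, tags = extract_title_and_tags(fetch(movie_url))
            if wanted <= set(tags):                          # every wanted tag is present
                matches.append((title, movie_url))
    return matches


def save_matches(matches, path):
    """Write one 'title<TAB>url' line per matching movie."""
    with open(path, "w", encoding="utf-8") as f:
        for title, url in matches:
            f.write(f"{title}\t{url}\n")
```

Moving to the next page is just the next iteration of the outer loop, so the caller only has to supply the list of listing page URLs.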

I have written Java before, and Java could do this, but I kept hearing "Python crawler, Python crawler", so I figured that since Python is supposed to be simple, I would use Python. It really is simple.

Install Python first; there is not much to say about that.

Then I searched for "python crawler" and skimmed a few articles; most recommended requests or Scrapy. Scrapy is a crawler framework, and I don't need a framework for such a simple requirement, so I decided to use requests.

pip install requests

After installing requests, I followed a tutorial and typed the following code into the editor that comes with Python:

import requests
response = requests.get('https://www.baidu.com/')
context = response.text
print(context)

Seeing the console print a long string of characters gave me a sense of accomplishment. It was my first Python program, after all.

But then a problem came up: the site I wanted to crawl can only be reached through a proxy that gets around the firewall. I searched online, found that requests supports proxies, and added the following code:

proxies = {'https': 'https://127.0.0.1:1080', 'http': 'http://127.0.0.1:1080'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
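For completeness, here is a sketch of how these two dictionaries are handed to requests, using the headers and proxies keyword arguments of requests.get. The fetch function name and the 10-second timeout are my own assumptions for illustration; the proxy address is the local one from above and will differ on other machines.

```python
# Local proxy and browser-like User-Agent, copied from the settings above.
PROXIES = {
    "https": "https://127.0.0.1:1080",
    "http": "http://127.0.0.1:1080",
}
HEADERS = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    )
}


def fetch(url, timeout=10):
    """Return the text of a page fetched through the proxy (hypothetical helper)."""
    # Imported here so the settings above can be inspected without requests installed.
    import requests

    response = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=timeout)
    response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return response.text
```

The timeout is worth having in a crawler: without it, a request through a dead proxy can hang indefinitely.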

It worked, but the code was getting longer and I needed a proper Python editor. I do have the "number one IDE in the universe", but I did not know that Visual Studio supports Python; after all, I only use it to write .NET programs. So I searched online, everyone recommended PyCharm, and that settled it.

I installed and set up PyCharm, then continued following the online tutorials.

Then I hit an error, and Google said it was an indentation problem (a rant: over the whole day, most of the problems I ran into were indentation problems). I checked, and the indentation looked fine. I copied the code into Notepad++ to check, and it turned out that PyCharm automatically converts tabs to four spaces. Because I had been editing the same .py file in both Notepad++ and PyCharm, some lines were indented with tabs and others with four spaces. Damn, it turns out Python does not allow mixing tabs and spaces for indentation. From then on I edited in Notepad++ and only ran the script with PyCharm.
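Tabs and spaces look the same in most editors, so this kind of mix is hard to spot by eye. A small sketch that reports which lines are indented with which character; the function name indentation_report and the sample source are made up for illustration.

```python
def indentation_report(source):
    """Group line numbers by the whitespace used to indent them."""
    report = {"tabs": [], "spaces": [], "mixed": []}
    for lineno, line in enumerate(source.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if not indent or not line.strip():
            continue  # skip unindented and blank lines
        if "\t" in indent and " " in indent:
            report["mixed"].append(lineno)
        elif "\t" in indent:
            report["tabs"].append(lineno)
        else:
            report["spaces"].append(lineno)
    return report


# Line 2 is indented with a tab, line 3 with four spaces.
sample = "for url in urls:\n\tpage = fetch(url)\n    print(page)\n"
report = indentation_report(sample)
```

If both the tabs and spaces lists are non-empty, the file has the problem described above. The standard library also ships a tabnanny module that checks files for ambiguous indentation.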

Good. I could at least use the get method of requests; I did not need the rest. Now I needed regular expressions to find the movie URLs. By the way, over lunch I skimmed Liao Xuefeng's Python tutorial and looked at the basic syntax.

Website: https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000

I do not know much about regular expressions, but I only needed a program that works, so I used the simplest brute-force approach: match the whole block that contains the URL, then cut off the extra parts with ordinary string slicing. This makes the regular expression much simpler to write.
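A sketch of that brute-force approach. The HTML shape (links of the form href="/movie/<number>.html") is an assumption made up for this example; the real site's markup is different, and the names BLOCK_PATTERN and extract_movie_paths are hypothetical.

```python
import re

# Match the whole attribute, quotes included, instead of using capture groups.
BLOCK_PATTERN = re.compile(r'href="/movie/\d+\.html"')
PREFIX = 'href="'


def extract_movie_paths(html):
    """Find every movie link block, then slice off the parts that are not the URL."""
    blocks = BLOCK_PATTERN.findall(html)
    # Drop the leading 'href="' and the trailing quote with plain slicing.
    return [block[len(PREFIX):-1] for block in blocks]


listing = (
    '<a href="/movie/1.html">A</a>'
    '<a href="/about.html">About</a>'
    '<a href="/movie/22.html">B</a>'
)
paths = extract_movie_paths(listing)
```

A capture group would do the same job in one step, but the slicing version needs nothing beyond the most basic pattern syntax.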

Loop over the URLs, use requests to fetch each page, use a regular expression to pull out the tag section of the page, and check whether the tags I want are among them.
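The tag check itself can be sketched like this. The markup (a div with class "tags" that contains one link per tag) and the function names are assumptions for illustration only.

```python
import re

# Hypothetical markup: <div class="tags"><a ...>Action</a> <a ...>Comedy</a></div>
TAG_SECTION = re.compile(r'<div class="tags">(.*?)</div>', re.S)
TAG_LINK = re.compile(r"<a[^>]*>([^<]+)</a>")


def extract_tags(html):
    """Return the set of tag names found in the page's tag section."""
    section = TAG_SECTION.search(html)
    if section is None:
        return set()
    return {name.strip() for name in TAG_LINK.findall(section.group(1))}


def has_all_tags(html, wanted):
    """True when every wanted tag appears on the page."""
    return set(wanted) <= extract_tags(html)


page = '<h1>Title</h1><div class="tags"><a href="/t/1">Action</a> <a href="/t/2">Comedy</a></div>'
```

Using a set makes the two-tag or three-tag query a single subset comparison, which is exactly the feature the site itself lacks.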

Then I hit another indentation error. I checked for a long time, and the formatting of the reported line was fine. Then I went through the whole script: I had written a try and forgotten the except. In that case, report an error about the try; why report an indentation error?

Another indentation error of the same kind: I checked for a long time, and again the reported line was fine. Then I noticed an else: with no body, where the code dropped straight out of the if block. Could that be the cause? I deleted the else: and the problem went away.
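Both mistakes can be reproduced without running a crawler by compiling small source strings. This is only a sketch: the snippets are made up, and the exact error message depends on the Python version, so the helper returns just the exception class name.

```python
def compile_error(source):
    """Return the name of the error raised when compiling source, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return type(exc).__name__
    return None


# A try block with no except or finally clause.
TRY_WITHOUT_EXCEPT = "try:\n    value = int('42')\nprint(value)\n"

# An else: with nothing indented under it.
EMPTY_ELSE = "if True:\n    value = 1\nelse:\nprint(value)\n"

# Fixed versions: add the except clause, and use pass for an empty branch.
FIXED = (
    "try:\n    value = int('42')\nexcept ValueError:\n    value = 0\n"
    "if value:\n    pass\nelse:\n    pass\n"
)
```

In both broken snippets the error is reported at the line after the real mistake, which matches my experience of staring at a line that had nothing wrong with it. If a branch really has nothing to do, pass keeps the block valid without deleting the else:.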

A real pitfall.

In general, Python, as a dynamically typed language, is very easy to pick up. I had never touched Python before, and within a day, with plenty of Googling, I wrote a usable script. With Java or C#, I might have spent that time setting up the environment, configuring the IDE, and learning the individual data types.

But Python's indentation really is a big pitfall.

The script does run slowly, although it is still much faster than checking every movie by hand. Multithreading is worth considering.
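I have not done this yet, but a multithreaded version could look roughly like the sketch below, using ThreadPoolExecutor from the standard library. The check_movie callable is a hypothetical function that fetches one movie page and returns True when its tags match; the worker count of 8 is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(movie_urls, check_movie, max_workers=8):
    """Run check_movie(url) for every URL in a thread pool and keep the matches."""
    urls = list(movie_urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        results = list(pool.map(check_movie, urls))
    return [url for url, matched in zip(urls, results) if matched]
```

Most of a crawler's time is typically spent waiting for the network, so several requests in flight at once should help even though each thread runs ordinary Python code.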

PS: Why do I need to find so many movies, you ask?

For the kind of movie you fast-forward through, the more the better, of course.
