Recently I have been planning to use Python for a web crawler graduation project. How should I approach this?

Python tip: writing a crawler is easy, especially in Python; writing a good crawler is hard.
A simple example: crawl all the code posted on http://paste.ubuntu.com.
Write a for loop and call a few urllib2 functions; the whole thing is 10 to 20 lines of code (a sketch follows below).
Difficulty 0
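A minimal sketch of that crawler, using Python 3's urllib.request (the modern successor to the urllib2 mentioned above). The paste ID range and URL pattern are assumptions for illustration; real paste URLs would have to be discovered from the site.

```python
# Minimal "Difficulty 0" crawler: loop over paste pages and save them.
# Sketch only -- the paste ID range and URL pattern are hypothetical.
import urllib.error
import urllib.request

for paste_id in range(24000000, 24000020):        # hypothetical ID range
    url = "http://paste.ubuntu.com/%d/" % paste_id
    try:
        html = urllib.request.urlopen(url).read() # raw bytes of the page
    except urllib.error.URLError:
        continue                                  # skip pastes that fail
    with open("paste_%d.html" % paste_id, "wb") as f:
        f.write(html)                             # save the page to disk
```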


Scenario:
1. The target server is slow and some pages never load; urlopen simply hangs on them (urlopen only gained a timeout parameter in Python 2.6).
2. The crawled pages come back garbled; you have to work out the page's character encoding.
3. The page is served gzip-compressed; you must either tell the server in the request headers not to compress, or decompress the file after downloading it.
4. Your crawler is too fast, and the server invites you to stop for tea (blocks you).
5. The server dislikes crawlers: it inspects the browser information in the request headers, so you have to forge it.
6. Overall crawler design: crawl breadth-first (BFS) or depth-first (DFS)?
7. How to store URLs in an efficient data structure so that pages already crawled are not crawled again.
8. Sites like 1024 (ahem) require you to log in before you can crawl their content; how do you obtain the cookies?

The issues above come up in almost every crawler. Thanks to Python's powerful libraries, each one adds only a little code; the two sketches below cover points 1-5 and 6-8.
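First, points 1-5 in a single fetch function, standard library only: a timeout, a forged User-Agent, gzip handling, encoding detection, and a politeness delay. The header values and the one-second delay are illustrative assumptions, not requirements.

```python
# Points 1-5 in one fetch function, standard library only.
# Header values and the one-second delay are illustrative assumptions.
import gzip
import time
import urllib.request

def fetch(url):
    req = urllib.request.Request(url, headers={
        # Point 5: pretend to be a browser so the server accepts us.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        # Point 3: advertise gzip support (or omit this to discourage it).
        "Accept-Encoding": "gzip",
    })
    resp = urllib.request.urlopen(req, timeout=10)  # point 1: never hang
    data = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        data = gzip.decompress(data)                # point 3: decompress
    # Point 2: use the charset the server declares, falling back to UTF-8.
    charset = resp.headers.get_content_charset() or "utf-8"
    time.sleep(1.0)   # point 4: be polite so the server doesn't block us
    return data.decode(charset, errors="replace")
```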
Difficulty 1
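And a sketch for points 6-8: a breadth-first crawl driven by a FIFO queue, a set of visited URLs to avoid re-crawling, and a cookie-processing opener so a login session persists across requests. handle_page and extract_links are placeholders you would supply (a regex-based extract_links appears under Difficulty 3 below).

```python
# Point 6: BFS via a FIFO queue.  Point 7: a set() deduplicates URLs.
# Point 8: a cookie jar keeps the login session across requests.
# handle_page and extract_links are placeholders to be filled in.
import http.cookiejar
import urllib.request
from collections import deque

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))
# Log in once through this opener (e.g. POST the login form); the jar
# then attaches the session cookie to every later request.

def crawl(seed_url, max_pages=100):
    visited = set()              # URLs we have already fetched (point 7)
    queue = deque([seed_url])    # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = opener.open(url, timeout=10).read().decode("utf-8", "replace")
        handle_page(url, html)                 # placeholder: store or parse
        for link in extract_links(url, html):  # placeholder link extractor
            if link not in visited:
                queue.append(link)
```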


Scenario:
1. With cookies comes the logout problem: how do you avoid the session being invalidated when the crawler hits a logout link mid-crawl?
2. If there is a verification code (CAPTCHA), how do you bypass or recognize it?
3. Too slow? Open 50 threads and crawl the site in parallel (see the thread-pool sketch below).


Difficulty 2
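A sketch of point 3 (and, via a shared session, point 1): a pool of 50 worker threads fetching pages in parallel. It assumes the third-party requests library is installed; the URL list is hypothetical.

```python
# "50 threads" from point 3, using a thread pool; the shared Session also
# keeps cookies alive (point 1).  Assumes the third-party requests library
# (pip install requests); the URL list is hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()     # shared cookies and keep-alive connections

def fetch(url):
    try:
        return url, session.get(url, timeout=10).text
    except requests.RequestException:
        return url, None         # record failures without killing the pool

urls = ["http://example.com/page/%d" % i for i in range(200)]  # placeholder
with ThreadPoolExecutor(max_workers=50) as pool:
    for url, html in pool.map(fetch, urls):
        if html is not None:
            print(url, len(html))    # stand-in for real processing
```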


Scenario:
1. For complex pages, you need a solid command of regular expressions to extract their links reliably (a sketch follows below).
2. Some tags are generated dynamically by JavaScript, and the JS itself may be obfuscated or encrypted; at the strange end of the spectrum there is even JSFuck.

Difficulty 3
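A sketch of the regex link extraction from point 1. A real HTML parser (lxml, BeautifulSoup) is usually safer, but the pattern below shows the idea; it also fills in the extract_links placeholder used in the BFS sketch above.

```python
# Pull links out of HTML with a regular expression and resolve them
# against the page URL.  Sketch only; a parser is more robust.
import re
from urllib.parse import urljoin

HREF_RE = re.compile(r'''<a\s[^>]*?href\s*=\s*["']([^"']+)["']''', re.I)

def extract_links(base_url, html):
    # Resolve relative paths against the page URL and drop fragments.
    return [urljoin(base_url, href.split("#")[0])
            for href in HREF_RE.findall(html)]

links = extract_links("http://example.com/a/",
                      '<a href="b.html">x</a> <a HREF="/c">y</a>')
print(links)   # ['http://example.com/a/b.html', 'http://example.com/c']
```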


In short, the heart of a crawler is simulating browser behavior; how complex the program gets depends on the features you want and the sites you crawl.
There is not much more to it; those are all the points I can think of for now, and additions are welcome (my own site is built on a crawler, too). First, a bare crawler is too easy to write on its own, so add some selling points, such as multithreading or some intelligence. Second, since you are building a crawler, talk to your advisor first; it would be bad if the advisor expects a search engine and you hand in a crawler. Finally, write a few pages to display the crawl results, which pads the code base and enriches the thesis. Read the Scrapy documentation; it is very easy to use. To raise the difficulty, implement a distributed crawler: you will also have to write management for the clients and the server, plus front-end pages to manage tasks and servers.
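Since Scrapy is recommended above, here is a minimal spider sketch (the start URL is a placeholder); Scrapy handles request scheduling, deduplication, and throttling for you.

```python
# Minimal Scrapy spider sketch; the start URL is a placeholder.
# Run with:  scrapy runspider page_spider.py -o pages.json
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Record something from every page we visit.
        yield {"url": response.url,
               "title": response.css("title::text").get()}
        # Follow every link; Scrapy deduplicates requests for us.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```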
If web-page data seems boring, you can scrape app data instead. For Requests + bs4, check my signature; it has plenty of examples to get you collecting data quickly, see for yourself! My own plan is the same: crawl the data first, then analyze it, and finally present the results on a web page. Also look at pyspider: binux/pyspider · GitHub.
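The Requests + bs4 combination mentioned above, as a minimal sketch (both are third-party packages: pip install requests beautifulsoup4; the URL is a placeholder):

```python
# Requests + bs4 in a few lines (pip install requests beautifulsoup4).
# The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://example.com/", timeout=10)
resp.raise_for_status()                        # fail loudly on HTTP errors
soup = BeautifulSoup(resp.text, "html.parser")

print(soup.title.string if soup.title else "(no title)")
for a in soup.find_all("a", href=True):        # every link with an href
    print(a["href"], a.get_text(strip=True))
```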

You may find some inspiration there. A simple crawler can be written in under 20 lines: httplib plus a regular expression will do.
The two most important parts of a graduation project are the idea and the technology; the two can be combined or complement each other.
If it is just a crawler, consider multithreading and distribution, that is, performance; that area runs deep and there is a lot to learn from it. Finally, write a beautiful UI, and a solid A+ is within reach.
If you have good ideas, implement one or two distinctive features; being technically modest is fine. To do the project well, consider two angles:
1. The project has technical quality and depth.
2. The project has practical value, that is, it can be applied to real life.

So either your crawler is technically impressive,
or the data it crawls is useful.
Of course, the data does not speak for itself: you have to organize and analyze it and draw conclusions, and only then is the project complete. If there is nothing technically distinctive, you can, for example, crawl videos from some site (pictures alone are too simple), add a few flourishes such as multithreading to fetch different pages at the same time, and once it is written, remember to open-source it!
