Recently I have been planning to use Python for a web crawler graduation project. How should I approach this?

Python tip: writing a crawler is easy, especially in Python; writing a good crawler is hard.
A simple example: crawl all the code posted on http://paste.ubuntu.com.
Write a for loop and call a few urllib2 functions; the whole thing is 10 to 20 lines of code (a sketch follows below).
Difficulty 0
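A minimal sketch of that crawler, using Python 3's urllib.request (the modern successor to the urllib2 mentioned above). The paste ID range and URL pattern are assumptions for illustration; real paste URLs would have to be discovered from the site.

```python
# Minimal "Difficulty 0" crawler: loop over paste pages and save them.
# Sketch only -- the paste ID range and URL pattern are hypothetical.
import urllib.error
import urllib.request

for paste_id in range(24000000, 24000020):        # hypothetical ID range
    url = "http://paste.ubuntu.com/%d/" % paste_id
    try:
        html = urllib.request.urlopen(url).read() # raw bytes of the page
    except urllib.error.URLError:
        continue                                  # skip pastes that fail
    with open("paste_%d.html" % paste_id, "wb") as f:
        f.write(html)                             # save the page to disk
```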


Scenario:
1. The target server is slow and some pages never load; urlopen simply hangs on them (urlopen only gained a timeout parameter in Python 2.6).
2. The crawled pages come back garbled; you have to work out the page's character encoding.
3. The page is served gzip-compressed; you must either tell the server in the request headers not to compress, or decompress the file after downloading it.
4. Your crawler is too fast, and the server invites you to stop for tea (blocks you).
5. The server dislikes crawlers: it inspects the browser information in the request headers, so you have to forge it.
6. Overall crawler design: crawl breadth-first (BFS) or depth-first (DFS)?
7. How to store URLs in an efficient data structure so that pages already crawled are not crawled again.
8. Sites like 1024 (ahem) require you to log in before you can crawl their content; how do you obtain the cookies?

The issues above come up in almost every crawler. Thanks to Python's powerful libraries, each one adds only a little code; the two sketches below cover points 1-5 and 6-8.
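First, points 1-5 in a single fetch function, standard library only: a timeout, a forged User-Agent, gzip handling, encoding detection, and a politeness delay. The header values and the one-second delay are illustrative assumptions, not requirements.

```python
# Points 1-5 in one fetch function, standard library only.
# Header values and the one-second delay are illustrative assumptions.
import gzip
import time
import urllib.request

def fetch(url):
    req = urllib.request.Request(url, headers={
        # Point 5: pretend to be a browser so the server accepts us.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        # Point 3: advertise gzip support (or omit this to discourage it).
        "Accept-Encoding": "gzip",
    })
    resp = urllib.request.urlopen(req, timeout=10)  # point 1: never hang
    data = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        data = gzip.decompress(data)                # point 3: decompress
    # Point 2: use the charset the server declares, falling back to UTF-8.
    charset = resp.headers.get_content_charset() or "utf-8"
    time.sleep(1.0)   # point 4: be polite so the server doesn't block us
    return data.decode(charset, errors="replace")
```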
Difficulty 1
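And a sketch for points 6-8: a breadth-first crawl driven by a FIFO queue, a set of visited URLs to avoid re-crawling, and a cookie-processing opener so a login session persists across requests. handle_page and extract_links are placeholders you would supply (a regex-based extract_links appears under Difficulty 3 below).

```python
# Point 6: BFS via a FIFO queue.  Point 7: a set() deduplicates URLs.
# Point 8: a cookie jar keeps the login session across requests.
# handle_page and extract_links are placeholders to be filled in.
import http.cookiejar
import urllib.request
from collections import deque

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))
# Log in once through this opener (e.g. POST the login form); the jar
# then attaches the session cookie to every later request.

def crawl(seed_url, max_pages=100):
    visited = set()              # URLs we have already fetched (point 7)
    queue = deque([seed_url])    # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = opener.open(url, timeout=10).read().decode("utf-8", "replace")
        handle_page(url, html)                 # placeholder: store or parse
        for link in extract_links(url, html):  # placeholder link extractor
            if link not in visited:
                queue.append(link)
```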


Scenario:
1. With cookies comes the logout problem: how do you avoid the session being invalidated when the crawler hits a logout link mid-crawl?
2. If there is a verification code (CAPTCHA), how do you bypass or recognize it?
3. Too slow? Open 50 threads and crawl the site in parallel (see the thread-pool sketch below).


Difficulty 2
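A sketch of point 3 (and, via a shared session, point 1): a pool of 50 worker threads fetching pages in parallel. It assumes the third-party requests library is installed; the URL list is hypothetical.

```python
# "50 threads" from point 3, using a thread pool; the shared Session also
# keeps cookies alive (point 1).  Assumes the third-party requests library
# (pip install requests); the URL list is hypothetical.
from concurrent.futures import ThreadPoolExecutor
import requests

session = requests.Session()     # shared cookies and keep-alive connections

def fetch(url):
    try:
        return url, session.get(url, timeout=10).text
    except requests.RequestException:
        return url, None         # record failures without killing the pool

urls = ["http://example.com/page/%d" % i for i in range(200)]  # placeholder
with ThreadPoolExecutor(max_workers=50) as pool:
    for url, html in pool.map(fetch, urls):
        if html is not None:
            print(url, len(html))    # stand-in for real processing
```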


Scenario:
1. For complex pages, you need a solid command of regular expressions to extract their links reliably (a sketch follows below).
2. Some tags are generated dynamically by JavaScript, and the JS itself may be obfuscated or encrypted; at the strange end of the spectrum there is even JSFuck.

Difficulty 3
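A sketch of the regex link extraction from point 1. A real HTML parser (lxml, BeautifulSoup) is usually safer, but the pattern below shows the idea; it also fills in the extract_links placeholder used in the BFS sketch above.

```python
# Pull links out of HTML with a regular expression and resolve them
# against the page URL.  Sketch only; a parser is more robust.
import re
from urllib.parse import urljoin

HREF_RE = re.compile(r'''<a\s[^>]*?href\s*=\s*["']([^"']+)["']''', re.I)

def extract_links(base_url, html):
    # Resolve relative paths against the page URL and drop fragments.
    return [urljoin(base_url, href.split("#")[0])
            for href in HREF_RE.findall(html)]

links = extract_links("http://example.com/a/",
                      '<a href="b.html">x</a> <a HREF="/c">y</a>')
print(links)   # ['http://example.com/a/b.html', 'http://example.com/c']
```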


In short, the heart of a crawler is simulating browser behavior; how complex the program gets depends on the features you want and the sites you crawl.
There is not much more to it; those are all the points I can think of for now, and additions are welcome (my own site is built on a crawler, too). First, a bare crawler is too easy to write on its own, so add some selling points, such as multithreading or some intelligence. Second, since you are building a crawler, talk to your advisor first; it would be bad if the advisor expects a search engine and you hand in a crawler. Finally, write a few pages to display the crawl results, which pads the code base and enriches the thesis. Read the Scrapy documentation; it is very easy to use. To raise the difficulty, implement a distributed crawler: you will also have to write management for the clients and the server, plus front-end pages to manage tasks and servers.
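Since Scrapy is recommended above, here is a minimal spider sketch (the start URL is a placeholder); Scrapy handles request scheduling, deduplication, and throttling for you.

```python
# Minimal Scrapy spider sketch; the start URL is a placeholder.
# Run with:  scrapy runspider page_spider.py -o pages.json
import scrapy

class PageSpider(scrapy.Spider):
    name = "pages"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Record something from every page we visit.
        yield {"url": response.url,
               "title": response.css("title::text").get()}
        # Follow every link; Scrapy deduplicates requests for us.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```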
If web-page data seems boring, you can scrape app data instead. For Requests + bs4, check my signature; it has plenty of examples to get you collecting data quickly, see for yourself! My own plan is the same: crawl the data first, then analyze it, and finally present the results on a web page. Also look at pyspider: binux/pyspider · GitHub.
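The Requests + bs4 combination mentioned above, as a minimal sketch (both are third-party packages: pip install requests beautifulsoup4; the URL is a placeholder):

```python
# Requests + bs4 in a few lines (pip install requests beautifulsoup4).
# The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://example.com/", timeout=10)
resp.raise_for_status()                        # fail loudly on HTTP errors
soup = BeautifulSoup(resp.text, "html.parser")

print(soup.title.string if soup.title else "(no title)")
for a in soup.find_all("a", href=True):        # every link with an href
    print(a["href"], a.get_text(strip=True))
```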

You may find some inspiration there. A simple crawler can be written in under 20 lines: httplib plus a regular expression will do.
The two most important parts of a graduation project are the idea and the technology; the two can be combined or complement each other.
If it is just a crawler, consider multithreading and distribution, that is, performance; that area runs deep and there is a lot to learn from it. Finally, write a beautiful UI, and a solid A+ is within reach.
If you have good ideas, implement one or two distinctive features; being technically modest is fine. To do the project well, consider two angles:
1. The project has technical quality and depth.
2. The project has practical value, that is, it can be applied to real life.

So either your crawler is technically impressive,
or the data it crawls is useful.
Of course, the data does not speak for itself: you have to organize and analyze it and draw conclusions, and only then is the project complete. If there is nothing technically distinctive, you can, for example, crawl videos from some site (pictures alone are too simple), add a few flourishes such as multithreading to fetch different pages at the same time, and once it is written, remember to open-source it!
