Planning a web-crawler graduation project in Python; any suggestions?

Source: Internet
Author: User
I'm a Python beginner with about five months to produce something. I'd like advice on what exactly to do, what to build it with, what the process looks like, and so on. I really am a novice, so any advice is appreciated.

Replies:

Writing a crawler is easy, especially in Python; whether it gets hard depends on what you want.
A simple example: crawl all the code snippets on http://paste.ubuntu.com.
Write a for loop and call a few urllib2 functions; that's basically 10 to 20 lines of code, as in the sketch below.
Difficulty 0
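A minimal sketch of that loop, in Python 3 (where urllib2 became urllib.request); the numeric paste-ID URL scheme and the range are assumptions for illustration only:

    import urllib.request

    # Hypothetical URL scheme: pastes addressed by a numeric ID.
    for paste_id in range(1, 101):                        # first 100 pastes
        url = "https://paste.ubuntu.com/%d/" % paste_id
        html = urllib.request.urlopen(url).read()         # fetch the raw page bytes
        with open("paste_%d.html" % paste_id, "wb") as f:
            f.write(html)                                 # save each page to a file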
Scenarios:
1. The web server is slow and some pages won't open; urlopen just hangs on them forever (since Python 2.6, urlopen accepts a timeout).
2. The pages you download come out garbled, so you have to work out the page encoding.
3. Pages are served gzip-compressed; either declare in the request headers that you don't accept compression, or decompress the downloaded page yourself.
4. Your crawler is too fast and the server bans it, so it has to pause for a sip of tea now and then.
5. The server dislikes crawlers and inspects the browser information in the request headers; how do you forge a User-Agent?
6. Overall crawler design: crawl breadth-first (BFS) or depth-first (DFS)?
7. How do you store URLs in an efficient data structure so that crawled pages aren't fetched twice?
8. Sites like the 1024 forums (runs away) require login before the content can be crawled; how do you get the cookies?
The problems above are all very common when writing crawlers, and thanks to Python's powerful libraries each one takes only a few extra lines of code. A sketch covering most of them follows the rating below.
Difficulty 1
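Here is a rough sketch touching most of the scenarios above: a timeout (1), encoding detection (2), manual gzip decompression (3), throttling (4), a forged User-Agent (5), BFS via a FIFO queue (6), and a visited set for deduplication (7). The seed URL and the link regex are placeholders, and real encoding detection is messier than this:

    import gzip, re, time, urllib.request
    from collections import deque

    HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}   # pose as a browser

    def fetch(url):
        req = urllib.request.Request(url, headers=HEADERS)
        resp = urllib.request.urlopen(req, timeout=10)     # never hang forever
        data = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            data = gzip.decompress(data)                   # undo gzip ourselves
        # Crude encoding detection: trust the Content-Type charset, else UTF-8.
        m = re.search(r"charset=([\w-]+)", resp.headers.get("Content-Type", ""))
        return data.decode(m.group(1) if m else "utf-8", errors="replace")

    def bfs_crawl(seed, limit=50):
        seen, queue = {seed}, deque([seed])                # the set prevents re-crawling
        while queue and len(seen) <= limit:
            url = queue.popleft()                          # FIFO queue, so breadth-first
            try:
                html = fetch(url)
            except Exception as exc:                       # dead or slow page: skip it
                print("failed:", url, exc)
                continue
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(1)                                  # throttle: a sip of tea per page

    bfs_crawl("https://example.com/")                      # placeholder seed URL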
Scenarios:
1. Cookies again: the site is bound to have a log-out link somewhere, so while crawling, how do you avoid hitting the various log-out URLs and invalidating your session?
2. If the pages you need to crawl sit behind a CAPTCHA, how do you bypass or recognize it?
3. Crawling is too slow, so open 50 threads and fetch the site's data in parallel (see the sketch below).
Difficulty 2
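A sketch for scenarios 1 and 3 (CAPTCHAs are a topic of their own): requests.Session keeps the login cookies across requests, anything that looks like a log-out link is skipped, and a pool of 50 threads fetches pages in parallel. The login endpoint, form fields, and page URLs are all made up for illustration, and sharing one Session across threads is common practice rather than an official thread-safety guarantee:

    from concurrent.futures import ThreadPoolExecutor
    import requests

    session = requests.Session()                      # carries cookies automatically
    session.post("https://example.com/login",         # hypothetical login endpoint
                 data={"user": "me", "password": "secret"})

    def grab(url):
        if "logout" in url.lower():                   # never follow log-out links
            return None
        return session.get(url, timeout=10).text

    urls = ["https://example.com/page/%d" % i for i in range(1000)]
    with ThreadPoolExecutor(max_workers=50) as pool:  # 50 concurrent fetchers
        pages = list(pool.map(grab, urls))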
Scenarios:
1. For a complex page, extracting its links effectively demands real fluency with regular expressions (see the sketch below).
2. Some tags are generated dynamically by JavaScript, and the JS itself may be obfuscated, occasionally even as JSFuck; how do you crawl those?
Difficulty 3
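For scenario 1, a sketch of regex-based link extraction that also resolves relative URLs; for scenario 2 a regex won't help, and the usual way out is to drive a real browser with something like Selenium:

    import re
    from urllib.parse import urljoin

    HREF_RE = re.compile(r"""<a\s[^>]*href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

    def extract_links(base_url, html):
        links = set()
        for href in HREF_RE.findall(html):
            if href.startswith(("javascript:", "#", "mailto:")):
                continue                              # not real navigation targets
            links.add(urljoin(base_url, href))        # resolve relative paths
        return links

    print(extract_links("https://example.com/a/",
                        '<a href="../b">b</a> <a href="#top">top</a>'))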
In short, the key is to simulate browser behavior. How complex the actual program gets depends on the features you want to implement and on the site you're crawling.
I haven't written that many crawlers, so that's all I can think of for now; additions welcome.

My graduation project was also a crawler, so let me answer from the angle of getting the project done. First, the crawler itself is very simple to write, so add a few selling points, such as multithreading or some "intelligence". Second, since you want to do a crawler, discuss it with your advisor beforehand; it would be awkward if the advisor expected a search engine and you delivered a crawler. Finally, write a few web pages to display the crawler's results: that both adds to the code volume and fleshes out the thesis. Read the Scrapy documentation, it is very useful (a minimal spider is sketched below). To raise the difficulty, build a distributed crawler, write a proper client and management server as well, and add a front-end page for managing tasks and servers.
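For a sense of what Scrapy buys you, here is a minimal spider; the start URL is a placeholder, and a real project would define items and pipelines on top:

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = "demo"
        start_urls = ["https://example.com/"]         # placeholder seed

        def parse(self, response):
            # Emit one record per page, then follow every link on it.
            yield {"url": response.url,
                   "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

Run it with "scrapy runspider demo_spider.py -o pages.json" and Scrapy handles the scheduling, deduplication, and politeness settings for you.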
If crawling web-page data gets boring, you can crawl app data instead ⊙▽⊙ with Requests + BS4. Have a look at my signature; there are plenty of examples there to get you collecting quickly, argue with me if you don't believe it! I'm also working on data capture and data analysis myself. Finally, for the web-page side, look at this project for ideas: binux/pyspider on GitHub.
Maybe you'll find some inspiration there. A simple crawler can be written in under 20 lines: httplib plus regular expressions (a similarly small sketch follows below).
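Roughly that size, using the Requests + BS4 combination mentioned above instead of httplib (bs4 is BeautifulSoup); the target URL is a placeholder:

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/", timeout=10)
    resp.raise_for_status()                           # fail loudly on HTTP errors
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):           # every anchor with an href
        print(a["href"], "->", a.get_text(strip=True))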
For a graduation project, the two things that matter most are the idea and the technology. The two can be combined and complement each other.
If it's just a crawler, think about multithreading and distribution. Performance alone can be taken very deep and will give the thesis plenty of substance. Top it off with a nice UI and that's a well-rounded A+, right?
If you have a good idea, implement one or two distinctive features; being somewhat weaker technically is fine as long as the idea is executed well. You can approach it from two angles:
1. The project is technically strong and goes deep.
2. The project has practical value and can be applied to real life.
So either your crawler is technically impressive,
or the data your crawler collects is genuinely useful.
Of course, the data won't speak for itself: you have to collate it, analyze it, and draw conclusions, and if you do that well your project holds up even if the technical side is thinner. For example, crawl videos from some site (too simple? then add a gimmick, such as crawling different pages with multiple threads at once). Oh, and open-source the program once it's written.
