I recently wrote a crawler in Python and had a very bad experience. Can I ask for advice?

Source: Internet
Author: User
I recently wrote a crawler in Python. I started with 3.4 and finished it with great effort, but at runtime it often stopped working. After I reduced the number of threads (16 -> 8) stability improved, but it still occasionally stopped. So I switched to Python 3.5 and found that some packages do not support 3.5. I then took advice to switch to the most popular version, 2.7, and the sqlite3 package started raising an error:
DatabaseError: malformed database schema (is_transient) - near "where": syntax error
There is no "where" in my SQL statement, yet this error is reported, and the original program ran fine under Python 3.4. Of course, I also made the changes needed for the differences between 3 and 2.

Finally, before I give up on Python, I want to ask: where is Python's advantage? Personally I feel the language itself is somewhat unstable and often produces inexplicable errors. Is this my problem or Python's problem?

Reply content:

Personally, I think the problems you run into in Python development are ones you commonly run into in other languages too. Working through them is how you deepen your own understanding.

The asker changed Python versions without ever finding the root cause, and ran straight into Python's weak spot: the versions differ a lot. I recommend searching for the error messages to understand the problem.

Python's advantages, as people commonly say, are that it is straightforward, quick to pick up, and has plenty of resources. My guess is that a page your crawler fetched contains the word "where", and you wrote that text into your SQL, so your statement ended up containing "where".
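If that guess is right and scraped page text is being spliced into the SQL string, placeholders avoid the problem. The following is only a minimal sketch using the standard sqlite3 module; the table name pages, the URL, and the sample text are made up for illustration and are not from the asker's program.

```python
import sqlite3

# Hypothetical scraped text: it contains quotes and the word "where".
page_text = "Where's the data? It's here"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, body TEXT)")

# Unsafe: string formatting lets the page content become part of the SQL.
unsafe_sql = "INSERT INTO pages VALUES ('http://example.com', '%s')" % page_text
try:
    conn.execute(unsafe_sql)
    unsafe_failed = False
except sqlite3.Error as exc:
    unsafe_failed = True
    print("unsafe insert failed:", exc)

# Safe: "?" placeholders keep the text as data, whatever it contains.
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("http://example.com", page_text),
)
stored = conn.execute("SELECT body FROM pages").fetchone()[0]
print("stored:", stored)
```

With placeholders the statement text never changes, so nothing on a web page can turn into SQL syntax.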
This kind of problem is usually your own. I think it comes down to programming philosophy.
In most cases we program to solve a problem. Programming to show off is actually rare.
Take crawlers as an example.
Writing a crawler and scraping data is usually a very small part (generally the first step) of a larger problem: you want the data for analysis, for filtering, for a posting bot, or even just to batch-download pictures of girls. The main effort, and the part that actually "generates income", comes afterwards. So naturally you write the crawler whichever way is fastest and takes the least effort.
In most cases Python is the best choice for this (at least the first time). You can rule out all compiled languages; more precisely, you can rule out any language without an interactive environment, because the cost of debugging is too high. So what is left? Python, Node.js, Ruby.
Node.js is designed around asynchrony, and for a program like a crawler, asynchrony feels a bit superfluous.
Ruby is also a good choice, but it has an innate flaw: its syntax is too "subtle". When using it you often fall, involuntarily, into the "I could make this more elegant" trap, and that elegance brings you no benefit. Your next crawler's business logic probably has nothing to do with this one, or only a slight resemblance, so it is hard to be elegant everywhere. Unless your main job is writing Ruby (not RoR), you will easily get stuck choosing between ways of doing the same thing.
Python, by contrast, is simple, blunt, and needs no thought. Its syntax is simple enough that you can focus on:

"How did the server send back a pile of garbage?"

"Why on earth are you blocking me?"

"Can't you load the data all in one go?"

"How can this JS function take SIGN2 first and then SIGN1!" (True story: the signature function of a well-known network disk is function(SIGN2, SIGN1), and I was stuck on it for two hours.)

"Boy, your page structure is a mess."

"Ha, a 502 error."

"I'm a genius, no doubt about it."

and other such important issues.

Once you have had enough of this experience, you notice that some things keep recurring. You start trying crawler frameworks, accumulate some handy libraries, extract some reusable code, and your thinking opens up. You find it would be easier to use the target site's JS directly, so you start running JS inside Python. You find that the download link seems to be fixed, so why not just call wget? The boss says, "Give me an application, with an interface," and you find that packaging the command-line program with py2exe and dragging together a C# + WPF form on top of it is not much trouble.
At this point you sigh "Python is really good" and finish the program over a cup of tea. (Performance, you say? Why do you think I am drinking tea? Just wait a bit. If that won't do, pay for better hardware.)
(End)

"I can't cook a tasty meal; is that the cookware's fault?"

"After reducing the number of threads (16 -> 8), stability improved, but it still occasionally stops working."

Obviously this happens because you do not understand multithreaded programming: the shared data is not protected by locks. If your logic is at all complex, debugging this after the fact is very hard.

If you do not know how to always account for the other threads in a multithreaded program and protect the necessary data with locks, I recommend that you do not use multithreading. (If you must use multiple threads, you can only use Rust: there, a multithreaded program that does not properly protect its data will not compile.)
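For what "protect the necessary data with locks" can look like in Python, here is a minimal sketch. The SeenUrls class, the worker function, and the example URLs are hypothetical names made up for illustration; the asker's actual shared data is unknown.

```python
import threading


class SeenUrls:
    """A set of visited URLs that is safe to share between threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def add_if_new(self, url):
        # Check and insert under one lock, so two threads
        # can never both claim the same URL.
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            return True


def worker(urls, seen, claimed):
    # Each thread writes only to its own "claimed" list.
    for url in urls:
        if seen.add_if_new(url):
            claimed.append(url)


# 500 entries, but only 50 distinct URLs.
urls = ["http://example.com/page/%d" % (i % 50) for i in range(500)]
seen = SeenUrls()
per_thread = [[] for _ in range(8)]
threads = [
    threading.Thread(target=worker, args=(urls, seen, claimed))
    for claimed in per_thread
]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_claimed = [url for claimed in per_thread for url in claimed]
print(len(all_claimed), "URLs claimed exactly once")
```

The point of the design is that the check and the insert happen inside the same locked section; doing them separately leaves a window in which two threads both see the URL as new.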

I wrote one multithreaded crawler, and have never written a crawler with multithreading again. Too much trouble.

Also, keep in mind that Python 3.x and 2.x are two similar but different languages, so it is normal for a program that works on one to fail on the other. 3to2 and 2to3 are only helper tools and do not solve the problem completely (otherwise there would be no incompatibility to speak of).

Let me go through this point by point:
"I finished it with great effort, but at runtime it often stops working. After reducing the number of threads (16 -> 8) stability improved, but it still occasionally stops working."
First, a crawler is an IO-bound program (network IO and disk IO). The bottleneck of such programs is mostly network and disk read/write speed. Multithreading can speed up a crawler to some extent, but the speedup cannot exceed min(egress bandwidth, disk write speed). Also, because of the GIL, Python's multithreading is a pitfall that beginners do not easily notice.
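To make the IO-bound point concrete, here is a small sketch with a bounded thread pool from the standard library. No real network is used: fake_fetch is a hypothetical stand-in that just sleeps, and the URLs and the worker count of 8 are arbitrary. Blocking IO (and time.sleep) releases the GIL, which is why threads help with waiting but would not help with CPU-heavy parsing.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a network request; sleeping releases the GIL like real IO."""
    time.sleep(0.05)
    return url, len(url)


urls = ["http://example.com/item/%d" % i for i in range(16)]

# One request after another.
start = time.perf_counter()
serial = [fake_fetch(u) for u in urls]
serial_seconds = time.perf_counter() - start

# At most 8 requests in flight at a time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    pooled = list(pool.map(fake_fetch, urls))  # map keeps the input order
pooled_seconds = time.perf_counter() - start

print("serial: %.2fs, pooled: %.2fs" % (serial_seconds, pooled_seconds))
```

Each task here returns its result instead of writing to shared state, so no locks are needed; the pool size is the one knob that caps how hard the crawler hits the network.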

"found that some packages do not support 3.5"
The Python 2/3 split is indeed a boneheaded move, but as far as I know most crawler-related Python libraries already support Python 3.x, so I do not know which libraries the asker is mainly using. Of course, I personally lean towards Python 2.7.x.

"so I took advice to switch to the most popular version, 2.7, and the sqlite3 package raised an error for no apparent reason:
DatabaseError: malformed database schema (is_transient) - near "where": syntax error
There is no "where" in my SQL statement, yet this error is reported, and the original program ran fine under Python 3.4. Of course, I also made the changes needed for the differences between 3 and 2."
That big "syntax error" has already told you that your SQL statement is wrong. As for choosing a database for a crawler: given how variable crawled data is, I suggest the asker use a NoSQL database. If you have not picked one yet or are unsure which to choose, I recommend starting with MongoDB.
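One other possibility, which nobody in this thread confirmed, is worth checking: the message complains about the database schema rather than about a query, and different Python installations can be linked against different SQLite library versions. Partial indexes (CREATE INDEX ... WHERE ...) were added in SQLite 3.8.0, and an older library cannot parse such a schema. The sketch below only shows how to check the library version; the table name cookies and its columns are assumptions made up for illustration, and only the index name is taken from the error message.

```python
import sqlite3

# Version of the SQLite library this interpreter is linked against
# (not the version of the Python sqlite3 module).
lib_version = tuple(int(part) for part in sqlite3.sqlite_version.split("."))
print("SQLite library:", sqlite3.sqlite_version)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cookies (name TEXT, transient INTEGER)")

# Partial indexes need SQLite 3.8.0 or newer.
supports_partial = lib_version >= (3, 8, 0)
if supports_partial:
    conn.execute(
        "CREATE INDEX is_transient ON cookies (transient) WHERE transient = 1"
    )

# The stored schema text is what any SQLite version must parse
# when it opens the database file.
schema = [
    row[0]
    for row in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'index'")
]
print(schema)
```

Running the first two statements under both the 3.4 and the 2.7 interpreter would show whether the two are using different SQLite libraries.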

"Finally, before I give up python, I want to ask: where is Python's advantage? Personally I feel the language itself is somewhat unstable and often produces inexplicable errors. Is this my problem or Python's problem?"
Here, my first suggestion is to write "python" as "Python"; that is the rigorous attitude a software engineer should have. Second, Python is not perfect, but as for "is this my problem or Python's problem", the answer is that it is the asker's problem. If a beginner could find a fatal bug in Python within three minutes, it would never have the market share it has today.


In fact, Python itself is very simple. There are some "advanced" usages, but even without them you can quickly write practical code. Programming involves many things beyond the programming language: network protocols, design patterns, mathematics, algorithms, and so on. This knowledge cuts across languages but is used whenever you program, and it is what separates high-quality code from low-quality code.

I can say very responsibly that it is your problem ... With the module + framework combinations that the masses have honed over the years, a crawler should not be going wrong in such creative ways. It can only mean that you have a fairly big gap in your understanding of either crawlers or Python.

The advantage is really that it is simple, needs little thought, and everything is ready to grab. You only need to understand code, HTTP, HTML, and a little JS. If you want to write to a database, it helps to know a little about databases, and if you don't, you can rely on an ORM and operate on the database through objects. All kinds of wheels are ready-made so nothing has to be rebuilt, and all kinds of full-featured frameworks take care of the details ...

Is there a more convenient choice? Yes. These days it is easier to write with Node, and besides ... of course, there are more pitfalls too.

If you can, post the code! Talk is cheap, show me the code.

Why doesn't the asker investigate why it stops working instead of changing versions? Python is not Windows →_→