Recently, many friends have asked me: "I'm teaching myself web crawling; how much do I need to learn before I can look for a job?"
This article covers my own experience with crawlers and with the job market, for reference only.
How much do you need to learn?
For now, let's target a junior crawler engineer position and simply list the requirements:
(Required)
Language: a general understanding of Python, Java, and Golang
Familiarity with multithreaded programming, network programming, and the HTTP protocol
A complete crawler project under your belt (ideally with full-site crawling experience; more on this below)
Anti-crawling countermeasures: cookies, IP pools, CAPTCHAs, and so on
Skilled use of distributed crawling
(Not required, but recommended)
Knowledge of message queues such as RabbitMQ, Kafka, and Redis
Experience with data mining, natural language processing, information retrieval, or machine learning
Familiarity with app data collection and man-in-the-middle proxies
Big data processing (Hive/MapReduce/Spark/Storm)
Databases: MySQL, Redis, MongoDB
Familiarity with Git and development in a Linux environment
The ability to read JavaScript code; this is really important
How to Improve
A tutorial or two is enough to get started. Speaking of Python, knowing requests is of course not enough; you also need to understand the Scrapy and PySpider frameworks, and you should understand how scrapy-redis works under the hood. A minimal Scrapy spider is only a few lines, as sketched below.
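To give a sense of scale, here is a minimal Scrapy spider; it is only a sketch, pointed at Scrapy's official practice site (quotes.toscrape.com) rather than any site mentioned in this article:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # Minimal spider against Scrapy's official practice site.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # Follow pagination links until they run out.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` and you have a working crawl in under twenty lines.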
Learn how to build a distributed crawl and how to solve the memory and speed problems you will run into.
For reference: what is the difference between scrapy-redis and plain Scrapy?
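The short answer, as I understand it: plain Scrapy keeps its request queue and duplicate filter in the memory of a single process, while scrapy-redis moves both into Redis, so several spider processes on several machines can share one queue and one seen-set. Switching over is mostly a matter of settings; a sketch (REDIS_URL assumes a local Redis):

```python
# settings.py: swap Scrapy's scheduler and dupefilter for the
# Redis-backed versions that scrapy-redis provides.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue in Redis between runs, so a crawl can be paused
# and resumed, or joined by more worker processes.
SCHEDULER_PERSIST = True

# Assumed local instance; point this at a shared Redis server
# when running workers on multiple machines.
REDIS_URL = "redis://localhost:6379"
```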
What is full-site crawling?
Take Lagou as the simplest example: search a keyword and you get 30 pages of results. Don't assume that crawling those 30 pages counts as full-site crawling; you should find a way to pull down all of the data.
One approach is to narrow the result set with search filters and work through it piece by piece until everything is covered.
At the same time, every job posting carries recommended positions, so you can also write a crawler that collects those recommendations.
During this process you need to know how to deduplicate; MongoDB works, and so does Redis (see the sketch after this paragraph).
For reference: how to increase data insertion speed in Scrapy.
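Here is a minimal sketch of that deduplication idea using a Redis set; the filter values and the crawl function are hypothetical placeholders, not Lagou's real API:

```python
import redis

# Connects to a local Redis instance by default.
r = redis.Redis()

# Hypothetical filter dimensions used to narrow each search below
# the 30-page cap; a real site exposes its own set of filters.
cities = ["beijing", "shanghai", "shenzhen"]
experience = ["0-1", "1-3", "3-5"]

def crawl(city, exp):
    """Placeholder: page through one filtered search and yield job IDs."""
    return []

for city in cities:
    for exp in experience:
        for job_id in crawl(city, exp):
            # SADD returns 1 only if the ID was not already in the set,
            # so each job is processed exactly once across all filters.
            if r.sadd("seen_jobs", job_id):
                pass  # new job: fetch the detail page, store it, etc.
```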
Actual Project Experience
Interviewers will definitely ask about this, for example:
Which websites have you crawled?
What was your maximum daily collection volume?
What was the toughest problem you ran into, and how did you solve it?
And so on.
So where do we find projects? Say I want to crawl Weibo data: just search GitHub; there is no shortage of projects.
Language selection
My own advice: ideally know Python, Java, and Golang. There are plenty of Java crawlers too, but online tutorials are almost all in Python, sadly.
Finally, Golang. Golang is really good; to put a number on it, Golang can download 20,000 pages per minute. Can Python?
About anti-crawling
You need to know what the common headers such as User-Agent and Referer are, how certain IDs are generated, and whether they are actually required. I don't know much about IP pools, so I won't comment on them. What does deserve attention is how blacklisting mechanisms are designed. Simulated login is also a must; the fuck-login project is worth studying, and you can even submit a PR to it.
Simulated login is really just making requests step by step while saving the cookies in a session.
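A minimal sketch of that idea with requests; the URLs and form fields below are hypothetical placeholders, and a real site will usually add CSRF tokens or CAPTCHAs that you must handle between the steps:

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Step 1: load the login page so the session picks up initial cookies
# (and, on a real site, any CSRF token embedded in the page).
session.get("https://example.com/login")

# Step 2: post the credentials; the session stores the cookies returned.
session.post(
    "https://example.com/login",
    data={"username": "user", "password": "secret"},
)

# Step 3: later requests reuse the saved cookies automatically,
# so pages behind the login are now reachable.
resp = session.get("https://example.com/profile")
print(resp.status_code)
```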
How to judge whether your ability is enough
Very simple: give yourself a task, such as crawling all of the questions on Zhihu.
How would you think about and design this project?
The above is only my personal opinion; if anything is lacking, feel free to point it out in the comments. I hope this helps you.
To close out the article, one more knowledge point: what is the most efficient way to concatenate strings in Python? The answer may surprise you.
Many articles on the web say that string concatenation should use the join method rather than the + operator, the claim being that the former is more efficient: it builds the new string at lower cost, whereas concatenating several strings with + has to allocate fresh memory for the string on every concatenation, which looks inefficient. The explanation sounds reasonable, but does the CPython interpreter really behave the way we just described?
I ran a test today, and the result may not be what you expect.
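The original post's code did not survive here, so below is a reconstruction of what the three functions being compared presumably looked like; the function names are my own:

```python
def concat_join(n):
    # Collect all the pieces, then join them in a single pass.
    return "".join(str(i) for i in range(1, n + 1))

def concat_format(n):
    # Rebuild the string with str.format on every iteration.
    s = ""
    for i in range(1, n + 1):
        s = "{}{}".format(s, i)
    return s

def concat_plus(n):
    # Append with "+" each time, reallocating as needed.
    s = ""
    for i in range(1, n + 1):
        s = s + str(i)
    return s
```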
The three functions above use join, format, and the + operator respectively to concatenate strings: the numbers from 1 to N are joined into one new string, such as 1234567891011......N.
The test script is roughly as follows (a sketch using timeit; it assumes the three functions above are defined in the same file):
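```python
import timeit

# Assumes concat_join, concat_format, and concat_plus from the
# sketch above are defined in this file.
if __name__ == "__main__":
    for n in [2 ** k for k in range(14)]:  # 1, 2, 4, ..., 8192
        for fn in (concat_join, concat_format, concat_plus):
            elapsed = timeit.timeit(lambda: fn(n), number=100)
            print("n={:<6} {:<14} {:.6f}s".format(n, fn.__name__, elapsed))
```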
Each group took 15 sample points, concatenating 1, 2, 4, 8, ... up to 8,192 numbers. From the measurements, with very small inputs there is almost no difference among the three; when fewer than about 20 strings are joined, + is even faster. As the number of strings grows, though, the join method pulls ahead, and + gets slower and slower. This holds whether you are on Python 2 or Python 3.
So the conclusion: if you are concatenating only a handful of strings, a dozen or so at most, + is fine and more direct; beyond a certain count you should switch to join. Only at large scale does the contrast between the two become obvious.