How far do you need to learn web crawling before you can find a job? Here is my advice to you!


Recently, many friends have asked me: "I'm teaching myself web crawling. How much do I need to learn before I can look for a job?"

This article shares my own experience with crawlers and with the job hunt, for reference only.

What level do you need to reach?

For now, let's target a junior crawler engineer position and simply list the requirements:

(Required)

Language selection: a general understanding of Python, Java, and Golang

Familiarity with multithreaded programming, network programming, and the HTTP protocol

Experience developing a complete crawler project (preferably full-site crawling; more on this below)

Anti-crawling knowledge: cookies, IP pools, CAPTCHAs, and so on

Skill in building distributed crawlers

(Not required, but recommended)

Knowledge of message queues such as RabbitMQ, Kafka, and Redis

Experience with data mining, natural language processing, information retrieval, or machine learning

Familiarity with app data collection and man-in-the-middle proxies

Big data processing (Hive/MapReduce/Spark/Storm)

Databases: MySQL, Redis, MongoDB

Familiarity with Git and with development in a Linux environment

The ability to read JavaScript code; this is really important

How to Improve

Reading a tutorial is enough to get started, but in Python, knowing requests is of course not enough; you also need to understand the Scrapy and PySpider frameworks, and you should understand how scrapy-redis works under the hood.
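For orientation, here is a minimal Scrapy spider, sketched against quotes.toscrape.com, the practice site the official Scrapy tutorial uses. Where plain requests fetches one page at a time, Scrapy gives you scheduling, retries, concurrency, and item pipelines for free:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` to see the whole flow end to end.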

Learn how to build a distributed crawler, and how to solve the memory and speed problems you will inevitably run into.

For reference: what is the difference between scrapy-redis and plain Scrapy?
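The short answer: scrapy-redis moves the request queue and the duplicate filter out of each process and into Redis, so multiple crawler processes, on one machine or many, share the same scheduling state. Enabling it is mostly configuration; a sketch of the relevant settings.py entries (the Redis URL is a placeholder for your own instance):

```python
# settings.py additions for scrapy-redis

# Schedule requests through Redis instead of the in-memory scheduler.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate request fingerprints in a shared Redis set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs, so a crawl can be paused and resumed.
SCHEDULER_PERSIST = True

# Placeholder: point every worker at the same Redis instance.
REDIS_URL = "redis://localhost:6379"
```

With these settings, starting the same spider on several machines makes them pull from one shared queue, which is what "distributed" means here.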

What does "full-site crawling" mean?

Take the simplest example, the job site Lagou: search for a keyword and you get 30 pages of results. Don't assume that crawling those 30 pages counts as full-site crawling; you should find a way to get all of the site's data.

The way to do it is to use the site's filters to narrow the range, and work through the slices one by one.

At the same time, each job posting has a list of recommended postings, so you can also write a crawler that follows the recommendations.

In this process you need to handle deduplication; MongoDB works for that, and so does Redis.
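A minimal sketch of the idea, with Redis doing the deduplication (the key name and the `fetch_recommended_ids` helper are hypothetical): `SADD` returns 1 only the first time a member is added, so the set doubles as a "have we seen this job ID?" check while the crawler walks the recommendation graph:

```python
import redis

# Connect to a local Redis instance (adjust host/port as needed).
r = redis.Redis(host="localhost", port=6379, db=0)

SEEN_KEY = "jobs:seen"  # hypothetical key holding all crawled job IDs

def fetch_recommended_ids(job_id):
    # Hypothetical helper: a real crawler would request the job page
    # and parse out the IDs of the recommended postings linked from it.
    return []

def crawl(seed_ids):
    # Breadth-first walk over postings, following recommendations.
    queue = list(seed_ids)
    while queue:
        job_id = queue.pop(0)
        # sadd returns 0 if the ID was already in the set, so it acts
        # as the deduplication filter: skip anything seen before.
        if r.sadd(SEEN_KEY, job_id) == 0:
            continue
        queue.extend(fetch_recommended_ids(job_id))
```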

For reference: how to increase data-insertion speed in Scrapy.
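One common way to speed up inserts is to buffer items and write them in batches rather than one at a time. A sketch of a Scrapy item pipeline doing this with pymongo (the database, collection, and batch size are placeholders):

```python
import pymongo

class MongoBatchPipeline:
    """Buffer items and flush them to MongoDB in batches, which costs
    one round trip per batch instead of one per item."""

    BATCH_SIZE = 500  # placeholder; tune for your workload

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["crawler"]["items"]  # placeholder names
        self.buffer = []

    def process_item(self, item, spider):
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.BATCH_SIZE:
            self.flush()
        return item

    def flush(self):
        if self.buffer:
            # insert_many sends the whole batch in one operation.
            self.collection.insert_many(self.buffer)
            self.buffer = []

    def close_spider(self, spider):
        self.flush()  # write whatever is left in the buffer
        self.client.close()
```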

Actual Project Experience

Interviewers will definitely ask about this, for example:

Which websites have you crawled?

What was your maximum daily collection volume?

What was the toughest problem you ran into, and how did you solve it?

And so on.

So where do you find projects? Say you want to crawl Weibo data: search GitHub, and you will find plenty of projects to learn from.

Language selection

My own advice: Python, Java, and Golang are the best ones to know. There are plenty of Java crawlers too, but online tutorials are almost all in Python, sadly.

Finally, Golang: Golang really is very good. To give one number, Golang can download on the order of 20,000 pages per minute. Can Python?

About anti-crawling

You need to know what common fields like User-Agent and Referer are, how certain IDs are generated, and whether they are actually required. I don't know much about the IP-pool side, so I won't cover it. What you do need to pay attention to is how sites design their ban mechanisms. Simulated login is also a must; the fuck-login project on GitHub is worth studying. Read the code, or better, submit a PR.

Simulated login is really just making requests step by step while preserving the cookie session.
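A minimal sketch of that idea (the URLs and form fields are hypothetical; real sites add CSRF tokens, CAPTCHAs, and encrypted parameters): a `requests.Session` stores the cookies from each response and sends them with the next request, so once the login POST succeeds, later requests are authenticated automatically:

```python
import requests

session = requests.Session()
# Common anti-crawling courtesy: a realistic User-Agent and Referer.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/login",
})

# Step 1: GET the login page first; many sites set initial cookies
# (and hide a token in the form) at this point.
session.get("https://example.com/login")

# Step 2: POST the credentials. Field names here are hypothetical.
resp = session.post(
    "https://example.com/login",
    data={"username": "alice", "password": "secret"},
)
resp.raise_for_status()

# Step 3: the session now carries the login cookies, so protected
# pages can be fetched directly.
profile = session.get("https://example.com/my/profile")
print(profile.status_code)
```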

How to judge whether you are good enough

Very simple. Here is a task: crawl every question on Zhihu.

How would you think about and design this project?

The above is only my personal opinion; if anything is lacking, feel free to leave a comment pointing it out. I hope it helps you.

To wrap up, one bonus knowledge point: what is the most efficient way to concatenate strings in Python? The answer may surprise you.

Many articles online say that you should use the join method rather than the + operator to concatenate strings, claiming join is more efficient because it builds the new string at lower cost: with +, every concatenation has to allocate memory for a new intermediate string, which sounds inefficient. The explanation seems reasonable, but does the CPython interpreter really behave this way?

I ran a test today, and the result may not be what you expect.

The test compares three functions, which use join, format, and the + operator respectively to concatenate the integers 0 through N into a single new string, such as 1234567891011...N.

Here is a test script along those lines (the original listing did not survive the repost, so the sketch below reconstructs it from the description):
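A minimal reconstruction using `timeit`, assuming 1,000 repetitions per sample size (the repetition count is my assumption; the powers-of-two grid up to 8,192 comes from the results described below):

```python
import timeit

def concat_join(n):
    # Build the string in one pass with str.join.
    return "".join(str(i) for i in range(n))

def concat_format(n):
    # Build the string piece by piece with str.format.
    s = ""
    for i in range(n):
        s = "{}{}".format(s, i)
    return s

def concat_plus(n):
    # Build the string piece by piece with the + operator.
    s = ""
    for i in range(n):
        s = s + str(i)
    return s

if __name__ == "__main__":
    # Sample sizes: 1, 2, 4, 8, ..., 8192 numbers per string
    # (the article's exact sample grid is approximated here).
    for n in [2 ** k for k in range(14)]:
        t_join = timeit.timeit(lambda: concat_join(n), number=1000)
        t_format = timeit.timeit(lambda: concat_format(n), number=1000)
        t_plus = timeit.timeit(lambda: concat_plus(n), number=1000)
        print("n={:5d}  join={:.4f}s  format={:.4f}s  +={:.4f}s".format(
            n, t_join, t_format, t_plus))
```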

I took 15 samples per group, concatenating 1, 2, 4, 8, ..., up to 8,192 numbers. The statistics show that at very small sizes there is almost no difference among the three; when fewer than about 20 strings are concatenated, + is actually slightly faster. But as the number of strings grows, the join method shows its strength, while + gets slower and slower. The picture is essentially the same on Python 2 and Python 3.

So the conclusion: if you are concatenating only a handful of strings, a few or a dozen or so, + is fine and more straightforward. Past a certain number, you should use join; the contrast between the two only becomes obvious on large inputs.
