Crawler Interview FAQ

Source: Internet
Author: User
Tags: data structures, database issues

Many people teaching themselves to write crawlers in Python don't find out what really matters until they sit through a job interview. Some have solid skills but perform badly on the day and miss the opportunity. After going through a number of interviews myself, I put together the following FAQ as a reference for anyone who wants to move into crawler work.

I. Project issues:

About 80% of the time, the interviewer's first question is about your previous projects, so prepare two recent projects whose techniques you can discuss in detail. You must have written the project code yourself. Reading someone else's source code, however carefully, does not teach you nearly as much as typing it out yourself. Questions that usually follow:

1. What anti-crawler measures did you run into when writing crawlers?

2. Which framework did you use, and why did you choose it? (I use Scrapy, so the framework questions below are about Scrapy.)


Two. Framework questions (Scrapy). The questions depend on the framework you name, but Scrapy comes up a lot.

1. The basic structure of Scrapy (what the five components are, and the full path a request takes once it is sent)

2. How Scrapy deduplicates requests (how the request fingerprint is used for deduplication)

3. What kinds of middleware Scrapy has, and which ones you have used

4. Where Scrapy middleware takes effect (aspect-oriented programming)
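The fingerprint-based deduplication question above can be illustrated with a minimal sketch. It assumes a fingerprint is a hash over the request method, a normalized URL, and the body. The names `canonicalize`, `fingerprint`, and `SeenFilter` are hypothetical and are not Scrapy's API; Scrapy's own implementation differs in detail.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Sort query parameters and drop the fragment so equivalent URLs match."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))


def fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Hash the parts of a request that identify it (hypothetical helper)."""
    h = hashlib.sha1()
    h.update(method.upper().encode())
    h.update(canonicalize(url).encode())
    h.update(body)
    return h.hexdigest()


class SeenFilter:
    """In-memory duplicate filter keyed by request fingerprint."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, method: str, url: str, body: bytes = b"") -> bool:
        fp = fingerprint(method, url, body)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False
```

Storing a fixed-length hash instead of the full URL keeps memory use predictable, and normalizing the URL first means that reordered query parameters do not cause the same page to be fetched twice.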

Three. Proxy questions

1. Why use a proxy

2. How to use a proxy (specific code: how the proxy is attached when sending a request)

3. What to do when a proxy stops working
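Since question 2 asks for specific code, here is one possible sketch. It assumes a Scrapy-style downloader middleware, where setting `request.meta["proxy"]` tells Scrapy's built-in proxy handling which proxy to use. The classes `ProxyPool` and `RandomProxyMiddleware`, and the failure threshold, are hypothetical names chosen for illustration. In a real project the middleware would also need to be enabled in the `DOWNLOADER_MIDDLEWARES` setting.

```python
import random


class ProxyPool:
    """Holds proxy URLs and drops any proxy that fails too many times in a row."""

    def __init__(self, proxies, max_failures=3):
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def pick(self):
        if not self._failures:
            raise RuntimeError("no working proxies left")
        return random.choice(list(self._failures))

    def report_failure(self, proxy):
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]  # treat the proxy as dead

    def report_success(self, proxy):
        if proxy in self._failures:
            self._failures[proxy] = 0  # reset the consecutive-failure count

    def __len__(self):
        return len(self._failures)


class RandomProxyMiddleware:
    """Shaped like a Scrapy downloader middleware; attaches a proxy to each request."""

    def __init__(self, pool):
        self.pool = pool

    def process_request(self, request, spider):
        request.meta["proxy"] = self.pool.pick()
        return None  # let the request continue through the middleware chain

    def process_exception(self, request, exception, spider):
        # A download error counts against the proxy that was used.
        self.pool.report_failure(request.meta.get("proxy"))
        return None
```

This covers question 3 in the simplest way: a proxy that keeps failing is removed from rotation. A fuller version would also retry the failed request through a different proxy and refill the pool when it runs low.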

Four. CAPTCHA handling

1. Handling CAPTCHAs at login

2. Handling CAPTCHAs that appear when you crawl too fast

3. How to recognize a CAPTCHA by machine

Five. Simulated login

1. The simulated login process

2. How to handle cookies

3. What to do when the website encrypts its request parameters
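For the cookie question, a minimal standard-library sketch follows. It assumes the login response returned one or more `Set-Cookie` headers and that later requests must send the values back in a `Cookie` header. The function names and the sample cookie names are hypothetical.

```python
from http.cookies import SimpleCookie


def cookies_from_response(set_cookie_headers):
    """Collect name/value pairs from a list of Set-Cookie header values."""
    jar = {}
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar


def cookie_header(jar):
    """Build the Cookie request header to send on later requests."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())
```

For example, a login response carrying `sessionid=abc123; Path=/; HttpOnly` and `csrftoken=xyz; Path=/` yields the request header `sessionid=abc123; csrftoken=xyz`. In practice an HTTP client's session object or a framework's cookie middleware usually does this bookkeeping for you.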

Six. Distributed crawling

1. How distributed crawling works

2. How to tell that a distributed crawl has finished

3. How deduplication works in a distributed crawl
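One common answer to the distributed deduplication question is that every worker checks the same shared set of fingerprints, for example a Redis set. The sketch below assumes that approach and replaces Redis with a hypothetical in-memory stand-in, `InMemorySetStore`, whose `sadd` returns 1 when the member is new and 0 when it already exists, as the Redis SADD command does for a single member. `SharedDupeFilter` and the key name are also assumptions.

```python
import hashlib


class InMemorySetStore:
    """Stand-in for a shared store; sadd mirrors the Redis SADD return value."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


class SharedDupeFilter:
    """Deduplicates URLs across workers by writing fingerprints to one shared set."""

    def __init__(self, store, key="crawler:seen"):
        self.store = store
        self.key = key

    def seen(self, url):
        fp = hashlib.sha1(url.encode()).hexdigest()
        # A single add-and-check call avoids a race between "check" and "add".
        return self.store.sadd(self.key, fp) == 0
```

Two workers built on the same store see each other's URLs: if worker A has already recorded a URL, `seen` returns `True` for worker B.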

Seven. Data storage and database issues

1. Differences between relational and non-relational databases

2. Which storage would you choose for crawled data, and why

3. The data types and features of various databases, for example how Redis achieves persistence, whether MongoDB supports transactions, and so on

Eight. The basics of Python

# There are a great many basic questions, but given the nature of crawler work some come up more often. They are summarized below.

1. The differences between Python 2 and Python 3, and how to migrate Python 2 code to a Python 3 environment

2. How string encoding differs between Python 2 and Python 3 (encoding problems are a real nuisance at work)

3. Iterators, generators, decorators

4. Python data types
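For question 3, it helps to have a small example of each concept ready. The sketch below shows a class-based iterator, the equivalent generator, and a parameterized decorator. The names `Countdown`, `countdown`, and `retry` are chosen only for illustration.

```python
import functools


class Countdown:
    """Iterator: implements __iter__ and __next__ by hand."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


def countdown(start):
    """Generator: the same sequence, with state kept by the function itself."""
    while start > 0:
        yield start
        start -= 1


def retry(times):
    """Decorator factory: call the wrapped function up to `times` times (times >= 1)."""

    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(times):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_error = exc
            raise last_error

        return wrapper

    return decorator
```

Both `list(Countdown(3))` and `list(countdown(3))` produce `[3, 2, 1]`. A decorator like `retry` is a natural fit for crawler code, where a fetch function may fail on a flaky network and succeed on the next attempt.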


Nine. Protocol issues

# To fetch data from a web page, a crawler has to simulate network communication over some protocol

1. The HTTP protocol: what a request consists of, what each field means, and how HTTPS differs from HTTP

2. Certificate issues

3. Various questions about TCP and UDP
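To make "what a request consists of" concrete, the sketch below assembles the raw bytes of an HTTP/1.1 request: a request line, header fields, a blank line, and an optional body. The function name `build_http_request` and the sample values are hypothetical. Real crawlers leave this to an HTTP library, so this is only a way to see the parts.

```python
def build_http_request(method, path, host, headers=None, body=b""):
    """Assemble an HTTP/1.1 request: request line, headers, blank line, body."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # The server needs the body length to know where the request ends.
        lines.append(f"Content-Length: {len(body)}")
    head = "\r\n".join(lines) + "\r\n\r\n"  # an empty line separates headers from body
    return head.encode("ascii") + body
```

With HTTPS the request bytes have the same structure. The difference is that they travel inside a TLS-encrypted connection, which is also where certificates come in.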

Ten. Data extraction questions

1. Which structured data extraction methods do you mainly use? You may be asked to write one or two examples.

2. Using regular expressions

3. How to extract dynamically loaded data

4. How to extract JSON data
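As one example of the kind an interviewer might ask for, the sketch below pulls links out of HTML with a regular expression and extracts JSON that a page embeds in a script tag, which is one common form of dynamically loaded data. The sample page, the `window.__DATA__` variable, and the function names are all invented for illustration. For anything beyond simple patterns, an HTML parser with XPath or CSS selectors is usually more robust than regular expressions.

```python
import json
import re

SAMPLE_HTML = """<html><body>
<a href="/item/1">First</a>
<a href="/item/2">Second</a>
<script>window.__DATA__ = {"items": [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]};</script>
</body></html>"""


def extract_links(html):
    """Return (href, text) pairs for simple anchor tags."""
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)


def extract_embedded_json(html):
    """Find the JSON object assigned to window.__DATA__ and parse it."""
    match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


def item_prices(data):
    """Walk the parsed JSON like any other dict/list structure."""
    return [item["price"] for item in data["items"]]
```

Once the JSON text is isolated, `json.loads` turns it into ordinary dicts and lists, so the extraction itself is just indexing. The same applies when the data comes from a separate API request instead of the page source.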

Eleven. Algorithm questions

# This one is hard to summarize. It mostly tests coding skill, and most interviewers will ask you to write an algorithm with lower time complexity. If you work in Python, it pays to study Python's built-in data structures.


That is the summary. Discussion is welcome. Every company has its own focus, but these are the basic, common questions.

I wish you all a good job and bug-free code.




