Python Crawler Interview Guide (FAQ)

Source: Internet
Author: User


  • Do you understand thread synchronization and asynchrony?

Thread synchronization: when multiple threads need the same resource, each one waits until the current access finishes before it proceeds. This keeps the resource consistent, but the waiting costs time and lowers efficiency.

Thread asynchrony: instead of idling while one resource is busy, a thread goes on to access other resources in the meantime. This is how the multithreading mechanism keeps the program busy.
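As a minimal sketch of synchronization, the snippet below has four threads update one shared counter under a `threading.Lock`. The names `counter` and `add_many` are made up for this illustration and do not come from the original article.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, taking the lock for each update."""
    global counter
    for _ in range(n):
        with lock:  # only one thread at a time may run this block
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

The lock is what makes each thread wait its turn. That waiting is the cost the answer above describes, and the consistent final count is the benefit.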

  • Do you understand synchronous and asynchronous network requests?

Synchronous: the client submits a request, waits for the server to process it, and receives the result. The client (for example, a browser) cannot do anything else during this period.

Asynchronous: the request is triggered by an event, the server processes it (the browser remains free to do other things in the meantime), and the client is notified when processing is complete.
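The difference can be sketched with `asyncio`. Here `fake_fetch` is a hypothetical stand-in for a network request (it only sleeps), so the example runs without any server. Awaiting the calls one after another mimics the synchronous pattern, and `asyncio.gather` keeps both requests in flight at once.

```python
import asyncio
import time

async def fake_fetch(name, delay):
    """Hypothetical stand-in for a network request: sleeps, then returns a label."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_sequentially():
    # Each request waits for the previous one to finish.
    return [await fake_fetch("a", 0.2), await fake_fetch("b", 0.2)]

async def run_concurrently():
    # Both requests are in flight at once; the event loop switches between them.
    return await asyncio.gather(fake_fetch("a", 0.2), fake_fetch("b", 0.2))

def timed(coro_fn):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start

seq_result, seq_time = timed(run_sequentially)
con_result, con_time = timed(run_concurrently)
print(seq_result, round(seq_time, 2))
print(con_result, round(con_time, 2))
```

The concurrent run should take roughly the time of the slowest single request, while the sequential run takes the sum of both.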

  • What are the advantages of linked list and sequential table storage?

1. Sequential table Storage

Principle: sequential table storage places data elements in a contiguous block of memory, which makes access efficient and fast. However, a fixed-size table cannot grow beyond its allocated capacity.

Advantage: fast access, with elements read and written directly by subscript

Disadvantages: 1. insert and delete operations are slow; 2. the length cannot be increased without allocating a new block.

For example, when an element is inserted or deleted, all the elements after that position must be moved to keep the table in order.

2. Linked List Storage

Principle: linked list storage allocates space dynamically while the program is running. As long as there is free memory, there is no storage overflow problem.

Advantage: insertion and deletion are fast, and the other elements stay where they are in memory. For example, to insert or delete an element you only need to change pointers.

Disadvantage: lookups are slow, because finding an element requires walking the list node by node from the head.
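The trade-off can be sketched in a few lines. The `Node` class and its helper functions are hypothetical names written for this comparison. A Python `list` plays the sequential table here, with the caveat that it resizes itself automatically, unlike a fixed-size array.

```python
class Node:
    """One element of a singly linked list."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insert_after(node, value):
    """Insert a new node after `node` by relinking pointers; no elements move."""
    node.next = Node(value, node.next)

def find(head, value):
    """Walk from the head until the value is found; cost grows with list length."""
    current = head
    while current is not None and current.value != value:
        current = current.next
    return current

def to_list(head):
    """Collect the values of a linked list into a Python list for display."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

# Linked list 1 -> 2 -> 4: find the node, then relink one pointer.
head = Node(1, Node(2, Node(4)))
insert_after(find(head, 2), 3)
linked = to_list(head)

# Sequential table: inserting shifts every later element one slot to the right.
array = [1, 2, 4]
array.insert(2, 3)
third = array[2]  # direct access by subscript

print(linked, array, third)
```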

  • How do I handle network latency and network exceptions when using redis to build a distributed system?

Because of network exceptions, a request in a distributed system has three possible outcomes (the "three states"): "success", "failure", and "timeout (unknown)".

When a "timeout" occurs, you can issue a read operation to check whether the RPC actually succeeded (banking systems do this, for example).

Another simple approach is to design the steps of the distributed protocol so that they can be retried safely, that is, to make them idempotent.
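A minimal sketch of the idempotent-retry idea follows. `FakeStore` is an in-memory stand-in invented for this example (it is not a redis client), and the request ID used for deduplication is an assumption about how idempotence might be implemented. The first call applies the write but loses the reply, which is the "timeout (unknown)" state.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for a remote service whose first
    reply is lost after the write has already been applied."""

    def __init__(self):
        self.balances = {"alice": 100}
        self.applied = set()  # request IDs that have already been processed
        self.calls = 0

    def deposit(self, request_id, account, amount):
        self.calls += 1
        if request_id not in self.applied:  # idempotence check
            self.balances[account] += amount
            self.applied.add(request_id)
        if self.calls == 1:
            # The write succeeded, but the caller never hears about it.
            raise TimeoutError("no reply")
        return self.balances[account]

def deposit_with_retry(store, request_id, account, amount, attempts=3):
    """Retry on timeout; safe only because deposit() ignores repeated IDs."""
    for _ in range(attempts):
        try:
            return store.deposit(request_id, account, amount)
        except TimeoutError:
            continue  # outcome unknown, so try again with the same request ID
    raise RuntimeError("gave up after retries")

store = FakeStore()
final = deposit_with_retry(store, "req-1", "alice", 50)
print(final, store.calls)
```

Without the request-ID check, the retry would credit the account twice.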

  • What is a data warehouse?

A data warehouse is a subject-oriented, integrated, non-volatile collection of data that reflects historical change over time. Its main purpose is to support managers' decision analysis.

The data warehouse collects historical data from the enterprise's internal and external business systems and archive files, and ultimately converts it into the strategic decision-making information the enterprise needs.

  • Features:

Subject-oriented: content is organized by business subject.
Integrated: data from different business sources has different characteristics, so it is converted to a unified encoding and format when it is loaded into the data warehouse. This keeps the data in the warehouse consistent.
Non-volatile: a data warehouse stores the different historical states of the data and does not update them in place.
Time-variant: timestamp fields are retained to record the state of each piece of data in different time periods.

Assume that a crawler obtains data quickly from the network and writes data to the local database slowly. What data structure is used?

  • Look up the solution online.
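The source leaves this question unanswered. One commonly given answer is a queue used as a buffer between the fast producer (the crawler) and the slow consumer (the database writer). The sketch below assumes that answer and uses Python's standard `queue.Queue`. The `crawl` and `store` functions and the `saved` list, which stands in for the local database, are hypothetical.

```python
import queue
import threading

buffer = queue.Queue(maxsize=100)  # bounded, so a fast producer is throttled
SENTINEL = None                    # marks the end of the stream
saved = []                         # stand-in for the local database

def crawl(n_pages):
    """Producer: pretend to fetch pages quickly and enqueue them."""
    for i in range(n_pages):
        buffer.put(f"page-{i}")    # blocks while the queue is full
    buffer.put(SENTINEL)

def store():
    """Consumer: take items off the queue and persist them one by one."""
    while True:
        item = buffer.get()
        if item is SENTINEL:
            break
        saved.append(item)         # a real writer would insert into the database here

producer = threading.Thread(target=crawl, args=(5,))
consumer = threading.Thread(target=store)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(saved)
```

Bounding the queue is a design choice. It caps memory use by making the crawler wait when the writer falls too far behind.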

Do you know about headless browsers (for example, Google's headless Chrome)?

A headless browser is a browser with no graphical interface. Because it is still a browser, it has everything a browser normally has; you simply cannot see the interface.

In Python, the selenium module can drive PhantomJS, a scriptable headless browser based on QtWebKit. PhantomJS is no longer actively developed and recent Selenium releases have dropped support for it, so headless Chrome or Firefox is the usual choice today.

Do you know the engines of MySQL databases?

  • InnoDB:

InnoDB is a robust transactional storage engine. It is used by many Internet companies and is a solid choice for operating on very large data sets.

InnoDB is the ideal choice in the following scenarios:

1. Update-intensive tables. The InnoDB storage engine is particularly well suited to handling many concurrent update requests.

2. Transactions. InnoDB is the standard MySQL storage engine that supports transactions.

3. Automatic crash recovery. Unlike some other storage engines, InnoDB tables can recover automatically after a crash.

4. Foreign key constraints. Among MySQL's commonly used storage engines, InnoDB is the one that supports foreign keys.

5. AUTO_INCREMENT columns are supported.

In general, InnoDB is a good choice if you need transaction support and have a high rate of concurrent reads and writes.

  • MEMORY:

The reason to use the MySQL MEMORY storage engine is speed. To get the fastest response time, table data is stored in system memory.

Although storing table data in memory does give high performance, all data in MEMORY tables is lost when the mysqld daemon crashes or restarts.

That speed comes with some drawbacks.

The MEMORY storage engine is generally used in the following situations:

1. The target data is small and frequently accessed. Because the data lives in memory, large tables can consume a lot of it. The max_heap_table_size parameter limits the maximum size of a MEMORY table.

2. The data is temporary and must be available immediately.

3. Losing the data in the MEMORY table unexpectedly would not have a substantial negative impact on the application.

  • What types of data structures does the redis database have?

Five Data Structures

String

When using a string, in most cases redis does not interpret or parse the value. Whether it holds JSON, XML, or plain text, to redis it is just a string: you can run strlen, append, and similar operations on it, but nothing that depends on its content. The basic commands are set, get, strlen, getrange, and append:

 SET key value
 GET key
 STRLEN key
 GETRANGE key start end
 APPEND key value

Strings often hold just a number. In that case redis can treat the string as a number and operate on it further with decr, decrby, incr, incrby, and incrbyfloat.

Hash

When using hash, the value itself is, in my view, a set of key-value pairs. Redis calls the key here a field (although, oddly, the command is hkeys rather than hfields), so the value is a set of field-value pairs. The basic commands are hset, hget, hmset, hmget, hgetall, hkeys, and hdel:

 HSET key field value
 HGET key field
 HMSET key field value [field value ...]
 HMGET key field [field ...]
 HGETALL key
 HKEYS key
 HDEL key field [field ...]

List

When using list, the value is an array of strings. You can push and pop as with a stack, except that both ends can be operated on; you can also access elements by index as with an array. The list commands are slightly more complex and fall mainly into two groups, prefixed L and R. L stands for LEFT or LIST: these commands either operate on the left end of the list or are not tied to either end. R stands for RIGHT: these commands operate on the right end of the list.
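The two-ended behaviour can be modelled locally with `collections.deque`. This is a Python analogy, not redis client code. The redis commands named in the comments (LPUSH, RPUSH, LPOP, RPOP, LRANGE, LINDEX) are listed only as rough counterparts.

```python
from collections import deque

items = deque()
items.appendleft("b")   # roughly LPUSH key b
items.appendleft("a")   # roughly LPUSH key a
items.append("c")       # roughly RPUSH key c
snapshot = list(items)  # roughly LRANGE key 0 -1

left = items.popleft()  # roughly LPOP key
right = items.pop()     # roughly RPOP key
middle = items[0]       # roughly LINDEX key 0

print(snapshot, left, right, middle)
```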

Set

Set stores a collection of unique values and, like a mathematical set, is unordered. It also supports set operations. Basic operations include sadd and sismember:

 SADD key member [member ...]
 SISMEMBER key member

Set operations include intersection (sinter), union (sunion), and difference (sdiff):

 SINTER key [key ...]
 SUNION key [key ...]
 SDIFF key [key ...]
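Python's built-in `set` behaves the same way, so the three set operations can be tried locally. This is an analogy rather than redis client code, and the sample values are made up.

```python
a = {"python", "redis", "mysql"}
b = {"redis", "mongodb"}

inter = a & b              # like SINTER a b
union = a | b              # like SUNION a b
diff = a - b               # like SDIFF a b
is_member = "redis" in a   # like SISMEMBER a redis

print(sorted(inter), sorted(union), sorted(diff), is_member)
```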

Sorted set

Sorted set is similar to set, but each element in a sorted set has a score, which can be used for sorting and ranking. Basic operations include zadd, zcount, and zrank:

 ZADD key score member [score member ...]
 ZCOUNT key min max
 ZRANK key member
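To make the semantics concrete, here is a small pure-Python model of the three commands. The functions `zadd`, `zcount`, and `zrank` below are hypothetical helpers named after the redis commands, not calls into a redis client. The model assumes the default behaviour: ranks are 0-based in ascending score order, and ZCOUNT bounds are inclusive.

```python
scores = {}  # member -> score

def zadd(member, score):
    """Add a member, or update its score if it already exists."""
    scores[member] = score

def zcount(low, high):
    """Count members whose score lies in the inclusive range [low, high]."""
    return sum(1 for s in scores.values() if low <= s <= high)

def zrank(member):
    """Return the 0-based rank by ascending score, or None if absent."""
    if member not in scores:
        return None
    ordered = sorted(scores, key=lambda m: (scores[m], m))
    return ordered.index(member)

zadd("alice", 90)
zadd("bob", 75)
zadd("carol", 82)

print(zrank("bob"), zrank("carol"), zrank("alice"), zcount(80, 100))
```

Redis itself keeps the members ordered internally, so it does not re-sort on every rank query as this model does.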

