Common Python crawler interview questions to win over your interviewer

Source: Internet
Author: User
Tags app service


Are you familiar with thread synchronization and asynchrony?

Thread synchronization: when multiple threads access the same resource, each one waits for the current access to finish before it proceeds. The waiting costs time, so efficiency is lower.
Thread asynchrony: while one thread is idle waiting on a resource, other threads go on accessing other resources, so the waiting time is not wasted.
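As a minimal sketch of the synchronized case, the snippet below has several threads increment one shared counter, each taking a lock before it touches the counter. The function name count_with_lock and its parameters are made up for illustration.

```python
import threading


def count_with_lock(n_threads=4, increments=10_000):
    """Increment one shared counter from several threads, one thread at a time."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Only one thread may hold the lock; the others wait here.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = count_with_lock()
```

The lock is what makes the access synchronized: it keeps the count correct, at the price of threads waiting on each other.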
Are you familiar with synchronous and asynchronous network requests?
Synchronous: the client submits a request and waits for the server to process it and return the result; during this time the client browser cannot do anything else.
Asynchronous: the request is triggered by an event and processed by the server, while the browser remains free to do other things.
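The difference can be sketched with asyncio, using asyncio.sleep as a stand-in for a real network round trip. The names fake_request, one_at_a_time and all_at_once are hypothetical, and no real server is contacted.

```python
import asyncio
import time


async def fake_request(name, delay=0.05):
    """Pretend to call a server; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def one_at_a_time(names):
    # Synchronous style: wait for each reply before sending the next request.
    return [await fake_request(n) for n in names]


async def all_at_once(names):
    # Asynchronous style: start every request, then collect the replies.
    return await asyncio.gather(*(fake_request(n) for n in names))


def timed(coro_fn, names):
    start = time.perf_counter()
    results = asyncio.run(coro_fn(names))
    return list(results), time.perf_counter() - start


names = ["a", "b", "c"]
sync_results, sync_seconds = timed(one_at_a_time, names)
async_results, async_seconds = timed(all_at_once, names)
```

Both versions return the same replies, but the asynchronous one overlaps the waiting, so its elapsed time should be close to one request rather than the sum of all three.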
What are the advantages and disadvantages of linked lists and sequential tables as storage structures?
1. Sequential table storage
Principle: a sequential table stores its data elements in a contiguous block of memory, so access is fast, but the length cannot grow dynamically.
Advantages: efficient access; elements are read and written directly by index (subscript).
Disadvantages: 1. insertion and deletion are slow; 2. the length cannot be increased.
For example: inserting or deleting an element requires shifting the elements after it to keep the table in order.
2. Linked list storage
Principle: a linked list allocates space dynamically while the program runs; as long as memory is available, storage will not overflow.
Advantages: insertion and deletion are fast and leave the other elements where they are. For example: inserting or deleting an element only requires changing where the pointers point.
Disadvantages: lookup is slow, because finding an element requires traversing the list node by node.
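A small sketch of the trade-off, using a Python list as the sequential table and a hand-written singly linked list. Node, insert_after, find and to_list are illustrative names, not taken from any library. Note that a Python list does grow automatically, so it only illustrates the indexing and shifting behaviour.

```python
class Node:
    """One element of a singly linked list."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def insert_after(node, value):
    """Insert a new node after `node` by re-pointing one link; nothing is shifted."""
    node.next = Node(value, node.next)
    return node.next


def find(head, value):
    """Walk the list from the head; return (node, number of links followed)."""
    steps = 0
    node = head
    while node is not None:
        if node.value == value:
            return node, steps
        node = node.next
        steps += 1
    return None, steps


def to_list(head):
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


# Sequential table: direct access by index, but inserting shifts later elements.
seq = [10, 20, 40]
seq.insert(2, 30)        # 40 has to move one slot to the right
third = seq[2]           # direct access by subscript

# Linked list: inserting only changes pointers, but lookup has to traverse.
head = Node(10)
second = insert_after(head, 20)
insert_after(second, 40)
insert_after(second, 30)  # only second.next changes
node, steps = find(head, 40)
```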
How do you handle network latency and network failures when using Redis to build a distributed system?
Because network failures happen, the result of a request in a distributed system has three possible states: "success", "failure", and "timeout (unknown)".
When a timeout occurs, you can check whether the RPC succeeded by issuing a follow-up read of the data (this is what banking systems do, for example).
Another simple approach is to design the distributed protocol so that every step can be safely retried, that is, to make the operations idempotent.
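Here is a minimal sketch of the idempotent-retry idea, with no real network or Redis involved. AccountServer, LossyChannel and deposit_with_retry are hypothetical names, and the request ID scheme is an assumption made for the example: the server remembers which request IDs it has already applied, so repeating a request after a timeout cannot apply it twice.

```python
class AccountServer:
    """Toy server whose deposit is idempotent per request ID."""

    def __init__(self):
        self.balance = 0
        self.applied = {}  # request_id -> balance after that request

    def deposit(self, request_id, amount):
        if request_id in self.applied:
            # Duplicate of a request already processed: do not apply it again.
            return self.applied[request_id]
        self.balance += amount
        self.applied[request_id] = self.balance
        return self.balance


class LossyChannel:
    """Delivers every request but loses the first `drop_replies` replies."""

    def __init__(self, server, drop_replies):
        self.server = server
        self.drop_replies = drop_replies

    def deposit(self, request_id, amount):
        result = self.server.deposit(request_id, amount)
        if self.drop_replies > 0:
            self.drop_replies -= 1
            # The caller cannot tell whether the server did the work.
            raise TimeoutError("reply lost")
        return result


def deposit_with_retry(channel, request_id, amount, attempts=3):
    """Retry with the SAME request ID until a reply arrives."""
    for _ in range(attempts):
        try:
            return channel.deposit(request_id, amount)
        except TimeoutError:
            continue
    raise TimeoutError(f"no reply after {attempts} attempts")


server = AccountServer()
channel = LossyChannel(server, drop_replies=2)
final_balance = deposit_with_retry(channel, "req-1", 100)
```

Even though the server received the request three times, the deposit is applied once, which is what makes blind retrying safe in the "timeout (unknown)" state.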
What is a data warehouse?
A data warehouse is a subject-oriented, integrated, non-volatile collection of data that reflects historical changes over time. It mainly supports managers' decision analysis.
A data warehouse collects historical data from sources such as internal and external business systems and archives, and ultimately transforms it into the strategic decision information an enterprise needs.
Characteristics:

    1. Subject-oriented: content is partitioned by business area;

    2. Integrated: because source data from different business systems has different characteristics, it must be loaded into the data warehouse using a uniform encoding format, so that data in the warehouse has a single, consistent representation;

    3. Non-volatile: the data warehouse does not update data in place; it preserves the states the data had at different points in history;

    4. Historical: each record keeps a timestamp field, which captures the state of the data at different times.

Suppose a crawler fetches data from the network quickly but writes it to local storage slowly. Which data structure is a good fit?
Open question, answers welcome (O°ω°o)
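The original leaves this question unanswered. One commonly suggested answer, offered here as an assumption rather than as the author's solution, is a queue used as a buffer between the fast fetcher and the slow writer (the producer-consumer pattern). In the sketch below, run_pipeline and its parameters are hypothetical, and appending to a list stands in for the slow local write.

```python
import queue
import threading


def run_pipeline(pages, buffer_size=100):
    """Fetch in one thread and write in another, with a bounded queue between them."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()   # sentinel telling the writer that fetching has finished
    written = []

    def fetcher():
        for page in pages:
            # put() blocks when the buffer is full, so a fast fetcher
            # cannot run arbitrarily far ahead of the slow writer.
            buffer.put(page)
        buffer.put(done)

    def writer():
        while True:
            item = buffer.get()
            if item is done:
                break
            written.append(item)  # stand-in for a slow local write

    threads = [threading.Thread(target=fetcher), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written


result = run_pipeline([f"page-{i}" for i in range(5)], buffer_size=2)
```

The queue is first-in, first-out, so with one fetcher and one writer the pages are written in the order they were fetched.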
Do you know about Google's headless browser?
A headless browser is a browser with no user interface. It is still a browser, so it has everything a browser should have; you just cannot see the interface.
PhantomJS, which can be driven through the Selenium module in Python, is one such headless browser; it is based on QtWebKit.
Do you know the storage engines of the MySQL database?
InnoDB:
InnoDB is a robust transactional storage engine used by many Internet companies; it gives users a powerful solution for operating very large data stores.
InnoDB is the ideal choice in the following cases:
1. Update-intensive tables. The InnoDB storage engine is well suited to handling many concurrent update requests.
2. Transactions. InnoDB is the standard MySQL storage engine that supports transactions.
3. Automatic crash recovery. Unlike other storage engines, InnoDB tables can recover automatically after a crash.
4. Foreign key constraints. Of the commonly used MySQL storage engines, only InnoDB supports foreign keys.
5. Support for the AUTO_INCREMENT column attribute.
In general, InnoDB is a good choice if transaction support is required and there is a high frequency of concurrent reads.
MEMORY:
The point of using MySQL's MEMORY storage engine is speed. To get the fastest response time, it uses system memory as the storage medium.
Although storing table data in memory does provide high performance, all data held in memory is lost when the mysqld daemon crashes.
The speed therefore comes with some drawbacks.
The MEMORY storage engine is typically used in the following situations:
1. The target data is small and accessed very frequently. Because storing data in memory consumes memory, the maximum size of a MEMORY table can be limited with the max_heap_table_size parameter.
2. The data is temporary and must be available immediately; it can then be stored in a MEMORY table.
3. Losing the data stored in the MEMORY table suddenly would have no substantial negative impact on the application service.
What data structures does the Redis database have?
There are 5 types of data structures.
String
When you use a string, Redis in most cases does not understand or parse its meaning. JSON, XML, or plain text are all the same to Redis: just a string, on which only STRLEN, APPEND, and other general string operations can be used; its content cannot be manipulated further. The basic commands are SET, GET, STRLEN, GETRANGE, and APPEND.
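The original example for these commands is missing from the page, so the following is only a sketch of how the five commands behave. To keep it runnable without a server, it uses InMemoryStrings, a hypothetical stand-in written for this example that mimics the method names of the redis-py client (set, get, strlen, getrange, append). It is not Redis itself and it only handles non-negative offsets.

```python
class InMemoryStrings:
    """Hypothetical stand-in that mimics five Redis string commands."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = str(value)
        return True

    def get(self, key):
        return self._data.get(key)

    def strlen(self, key):
        # Redis counts bytes; for ASCII text that equals the character count.
        return len(self._data.get(key, ""))

    def append(self, key, value):
        self._data[key] = self._data.get(key, "") + str(value)
        return len(self._data[key])  # like Redis, return the new length

    def getrange(self, key, start, end):
        # In Redis the `end` offset is inclusive.
        return self._data.get(key, "")[start:end + 1]


def string_demo(client, key="greeting"):
    """Run the five basic string commands against any client with these methods."""
    client.set(key, "hello")
    new_length = client.append(key, " world")
    return {
        "value": client.get(key),
        "length": client.strlen(key),
        "new_length": new_length,
        "first_five": client.getrange(key, 0, 4),
    }


result = string_demo(InMemoryStrings())
```

With redis-py installed and a Redis server running, the same string_demo function should also accept a real client such as redis.Redis(decode_responses=True). Redis stores and returns the value as an opaque string either way; it never looks inside it.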

