Synchronous vs. asynchronous
- Synchronous and asynchronous refer to the message communication mechanism (synchronous communication / asynchronous communication).
Synchronous means that when a call is made, it does not return until the result is ready; once the call returns, the return value is available.
In other words, the caller actively waits for the result of the call.
Asynchronous is the opposite: the call returns immediately after it is issued, without a result. When an asynchronous call is made, the caller does not get the result right away; instead, after the call completes, the callee notifies the caller through state, a notification, or a callback function.
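A minimal sketch in Python of the two call styles described above; `slow_task`, `async_task`, and the callback are illustrative names, and a thread plus callback stands in for whatever notification mechanism the callee actually uses.

```python
import threading
import time

def slow_task(x):
    time.sleep(1)          # simulate a slow operation
    return x * 2

# Synchronous: the caller blocks until the result is returned.
result = slow_task(10)
print("sync result:", result)

# Asynchronous: the call returns immediately; the callee notifies
# the caller through a callback once the result is ready.
def async_task(x, callback):
    def worker():
        callback(slow_task(x))
    threading.Thread(target=worker).start()

async_task(10, lambda r: print("async result:", r))
print("async call returned immediately")
```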
Talk about Python decorators, iterators, and yield
- Decorator: A decorator is essentially a closure function. Its role is to add extra functionality to other functions without modifying their code, and its return value is also a function object. Decorators fit cross-cutting requirements such as logging, performance testing, transaction handling, caching, and permission checks; with them we write much less repetitive code and improve productivity.
Iterator: An iterator is a way of accessing the elements of an iterable object, usually starting from the first element and ending only after all elements have been visited. An iterator can only move forward, never backward. It does not need to prepare all elements in advance; an element is only computed when the iteration reaches it, and earlier elements can be discarded, so iterators are well suited to traversing very large or even infinite sequences.
In essence, iteration calls the object's __next__ method (the iterator itself is obtained via __iter__); each call returns one element, and a StopIteration exception is raised when there is no next element.
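A small sketch of the two ideas above, using only the standard library: a logging decorator that wraps a function without changing its code, and a generator built with `yield` that produces elements lazily and raises StopIteration when exhausted.

```python
import functools

def log_calls(func):
    """Decorator: a closure that adds logging without touching func's code."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with {args}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    return a + b

def count_up_to(n):
    """Generator: elements are produced lazily, one per next() call."""
    i = 1
    while i <= n:
        yield i
        i += 1

print(add(1, 2))               # prints the log line, then 3
it = count_up_to(2)
print(next(it), next(it))      # 1 2; one more next() raises StopIteration
```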
What are the suitable scenarios for Python? What to do when a compute-intensive task is encountered?
- Application scenarios: website development and operations, financial analysis, server-side development, crawlers.
- IO-intensive tasks mostly involve network and disk operations; their CPU usage is low, so multithreading works well.
Compute-intensive tasks mainly consume CPU, so multiprocessing should be used. Python itself is relatively inefficient at raw computation, so compute-intensive parts are often written in C.
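A minimal sketch of using multiple processes for a compute-intensive task; the `cpu_heavy` function, input size, and pool size are illustrative.

```python
from multiprocessing import Pool

def cpu_heavy(n):
    """A compute-intensive task: sum of squares (illustrative only)."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # CPU-bound work: multiple processes bypass the GIL and use all cores.
    with Pool(processes=4) as pool:
        results = pool.map(cpu_heavy, [10**6] * 4)
    print(results)
```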
Talking about MySQL character set and collation
- Character set: the set of encodings used to define characters in the database. Common character sets: utf8, GBK, etc.
Collation: the rules for comparing characters, e.g. whether comparisons are case-sensitive, and whether characters are compared by their encoding or directly as binary data.
Talk about processes, threads, and coroutines?
- Answer: A process is a running activity of a program, with independent functionality, over a data set; it is the basic unit of resource allocation and scheduling in the operating system.
- Each process has its own independent memory space, and different processes communicate through inter-process communication (IPC). Because a process is heavyweight and owns independent memory, context switches between processes (stacks, registers, virtual memory, file handles, etc.) are expensive, but processes are relatively stable and safe.
- A thread is an entity within a process and the basic unit of CPU scheduling and dispatch; it is a smaller unit of execution than a process. A thread owns essentially no system resources of its own, only what is essential for running (a program counter, a set of registers, and a stack), but it shares all resources owned by the process with the other threads of the same process. Threads communicate mainly through shared memory; context switching is fast and the resource overhead is low, but compared with processes, data is more easily lost and threads are less stable.
- A coroutine is a user-mode lightweight thread whose scheduling is entirely controlled by the user. A coroutine has its own register context and stack. When the scheduler switches away, the register context and stack are saved; when it switches back, they are restored. Since the coroutine manipulates its own stack directly, there is essentially no kernel-switching overhead, and global variables can be accessed without locking, so context switches are very fast.
It is best to also describe how you used them in a project, with an example.
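One common form of coroutine in Python is `asyncio`; a minimal sketch of two coroutines sharing a single thread (function names and delays are illustrative):

```python
import asyncio

async def fetch(name, delay):
    """A coroutine: yields control at each await instead of blocking a thread."""
    await asyncio.sleep(delay)      # stands in for a network/IO wait
    return f"{name} done"

async def main():
    # Both coroutines run concurrently in a single thread, switching in
    # user space with no kernel thread-switch overhead.
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1))
    print(results)

asyncio.run(main())
```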
How do you handle thread safety?
Thread safety means that in a multi-threaded environment, a program still runs correctly when multiple threads execute simultaneously: shared data can be accessed by multiple threads, but only one thread accesses it at a time. The usual solution to resource contention in a multi-threaded environment is locking, which guarantees exclusive access. How do you lock? In distributed scenarios, distributed locks and load balancing play a similar role.
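A minimal sketch of protecting shared data with `threading.Lock`; the counter and thread count are illustrative.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Only one thread at a time may execute this critical section.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(100000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 400000 every time; without the lock the result can be lower
```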
Common Linux Commands
ls, cd, more, clear, mkdir, pwd, rm, grep, find, mv, su, date, etc.
What is object-oriented programming?
- Answer tip: Explain what object-oriented programming is, why we use it, and its key features.
Reply:
Object-oriented programming is a design and programming method aimed at software reuse. It organizes similar logic together with its data and state into classes, and reuses them in the software system through object instances, in order to improve development efficiency. Its key features are encapsulation, inheritance, and polymorphism.
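A small sketch of the three features named above (class names are illustrative):

```python
class Animal:
    def __init__(self, name):
        self._name = name            # encapsulation: state kept inside the object

    def speak(self):
        raise NotImplementedError

class Dog(Animal):                   # inheritance: reuse Animal's code
    def speak(self):
        return f"{self._name} says woof"

class Cat(Animal):
    def speak(self):
        return f"{self._name} says meow"

# Polymorphism: the same call behaves differently per concrete type.
for pet in (Dog("Rex"), Cat("Tom")):
    print(pet.speak())
```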
How can you improve the efficiency of Python code? Give at least two ways to improve runtime performance.
- 1. Use generators.
- 2. Rewrite key code with external extensions: Cython, PyInline, PyPy, Pyrex.
- 3. Optimize loops: avoid repeated attribute lookups on variables inside a loop.
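A minimal sketch contrasting a list with a generator; the element count is illustrative and exact sizes vary by Python version:

```python
import sys

# A list builds all elements in memory up front...
squares_list = [i * i for i in range(1_000_000)]

# ...while a generator produces them lazily, one at a time.
squares_gen = (i * i for i in range(1_000_000))

print(sys.getsizeof(squares_list))   # several megabytes
print(sys.getsizeof(squares_gen))    # a small, constant-size object

# Both can be consumed the same way:
print(sum(squares_gen))
```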
What are MySQL stored procedures and how do they work?
- Answer tip: First explain the principle of stored procedures, then expand a bit, or describe how you have used them in MySQL.
- Answer: A stored procedure is a programmable function that is created and saved in the database. It can consist of SQL statements and some special control structures. Stored procedures are useful when you want to perform the same operations from different applications or platforms, or to encapsulate specific functionality. A stored procedure in a database can be seen as an analogue of a method in object-oriented programming; it controls how data is accessed. Stored procedures usually have the following advantages:
- 1) The stored procedure can achieve a faster execution speed.
- 2) Stored procedures allow standard components to be programmed.
- 3) Stored procedures can be written with flow control statements, with a strong flexibility to complete complex judgments and more complex operations.
- 4) Stored procedures can be used as a security mechanism.
- 5) Stored procedures can reduce network traffic.
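A minimal sketch of calling a stored procedure from Python, assuming the `pymysql` driver is installed and that a stored procedure named `get_user_orders` already exists in the database (both are assumptions, not part of the original answer):

```python
import pymysql  # assumed to be installed; any DB-API driver works similarly

conn = pymysql.connect(host="localhost", user="app", password="***", db="shop")
try:
    with conn.cursor() as cur:
        # Call a hypothetical stored procedure; only the CALL statement
        # travels over the network, which reduces traffic for complex logic.
        cur.execute("CALL get_user_orders(%s)", (42,))
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```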
What kinds of bugs do you encounter at work and how to solve them?
- Answer tip: Don't mention trivial bugs; choose something that left an impression, or turn the question into a difficulty you ran into and solved.
- Reply:
- 1. When newly hired, I was not yet familiar with the business and the documentation was unclear, which led to a lot of overtime; or problems from the first code merge, a Python version upgrade, etc.
- 2. The first time I built a login module / payment module in a project I was not very familiar with it, and it took real effort.
- 3. The everyday bugs accumulated while writing code.
Describe the properties of a transaction (ACID)?
- 1. Atomicity: all operations in a transaction are indivisible in the database; either all complete or none execute.
- 2. Consistency: when several transactions execute in parallel, their results must be the same as if they had executed serially in some order.
- 3. Isolation: the execution of a transaction is not interfered with by other transactions, and its intermediate results must not be visible to other transactions.
- 4. Durability: for any committed transaction, the system must guarantee that its changes to the database are never lost, even if the database crashes.
What is the difference between Redis and MySQL?
- Answer tip: First answer the difference, then give an example of how you use each.
- Reply:
- Redis is an in-memory database: data is stored in memory, so access is fast.
- MySQL is a relational database with persistent storage on disk and rich functionality. Queries involve some disk IO, so data access is slower.
- I use MySQL more often, mainly for creating databases and tables and for CRUD operations; most of what I do is querying. For example, in the XXX project there was a search module, and the relatively simple approach I took was to implement search with fuzzy (LIKE) matching.
How do you protect Redis from being attacked?
- Reply:
- At work, to prevent Redis from being attacked, I do the following:
- 1. Master-slave replication.
- 2. Persistent storage; do not start Redis with the root account.
- 3. Set a complex password.
- 4. Do not allow key-mode login.
Tell me about MongoDB and how you use it?
- Answer tip: Explain what MongoDB is, its pros and cons, and how you usually use it.
- MongoDB is a document-oriented database system, written in C++. It does not support SQL but has its own powerful query syntax.
- MongoDB uses BSON as its data storage and transfer format. BSON is a JSON-like, binary-serialized document format that supports nested objects and arrays.
- MongoDB is a lot like MySQL: a document corresponds to a MySQL row, and a collection corresponds to a MySQL table.
- Cons: no transaction support, MongoDB takes up a lot of space, and the maintenance tools are not mature.
- Application Scenarios:
- 1. Website data: MongoDB is well suited for real-time inserts, updates, and queries, and provides the replication and high scalability needed for a site's real-time data storage.
- 2. Caching: because of its high performance, MongoDB also works as a caching layer in the information infrastructure. After a system restart, a persistent cache built on MongoDB can prevent the underlying data sources from being overloaded.
- 3. Large-volume, low-value data: storing such data in a traditional relational database can be expensive; before MongoDB, many programmers stored it in plain files.
- 4. Highly scalable scenarios: MongoDB is well suited for databases made up of dozens or hundreds of servers.
- 5. Storage of objects and JSON data: MongoDB's BSON format is ideal for storing and querying document-style data.
- 6. Rule of thumb: important data in MySQL, general data in MongoDB, temporary data in memcache.
What are the pros and cons of Redis and MongoDB?
- Answer tip: First state the differences, then the pros and cons.
- Answer: MongoDB and Redis are both NoSQL databases that use structured data storage. They differ in usage scenarios, mainly because of how they use memory mapping and how they handle persistence. MongoDB is recommended for cluster deployment and is designed with cluster scenarios in mind; Redis focuses more on sequential writes by the process, and although it supports clustering it is largely limited to the master-slave mode.
- Redis Benefits:
- 1) Excellent read and write performance.
- 2) Data persistence support, with both AOF and RDB persistence modes.
- 3) Master-slave replication support: the master automatically synchronizes data to the slaves, allowing read-write separation.
- 4) Rich data structures: in addition to plain string values, it supports hash, set, sorted set, list, etc.
- Disadvantages:
- 1) Redis has no automatic fault tolerance or recovery: if the master or a slave goes down, some front-end read/write requests fail until the machine restarts or the front-end IP is switched manually.
- 2) If the master goes down, data that was not yet synchronized to the slaves before the outage is lost, and switching IPs also introduces data inconsistency, reducing system availability.
- 3) Redis has limited support for online expansion: once cluster capacity reaches its limit, expanding online becomes complicated. To avoid this, operators must make sure there is enough headroom when the system goes online, which wastes a lot of resources.
- Advantages and disadvantages of MongoDB:
- Advantages: weak (eventual) consistency, which favors user access speed; the document-structured storage model makes data convenient to retrieve.
- Built-in GridFS for efficient storage of binary large objects such as photos and videos.
- Support for replica sets, master-standby, peer-to-peer replication, and auto-sharding; dynamic queries; full index support, extending into embedded objects and arrays.
- Cons: no transaction support; MongoDB takes up a lot of space; maintenance tools are not mature.
How does the database optimize query efficiency?
- Answer tip: answer the question point by point.
- Reply:
- 1. Storage engine choice: if the table requires transactions, use InnoDB because it is fully ACID-compliant; if transactions are not needed, MyISAM is a reasonable choice.
- 2. Split tables and databases; use master-slave replication.
- 3. To optimize queries, avoid full table scans: first consider building indexes on the columns used in WHERE and ORDER BY.
- 4. Avoid NULL checks on a column in the WHERE clause, which make the engine abandon the index and do a full table scan.
- 5. Avoid the != or <> operators in the WHERE clause, which also make the engine abandon the index and do a full table scan.
- 6. Avoid using OR to join conditions in the WHERE clause when one column is indexed and the other is not; this too makes the engine abandon the index and do a full table scan.
- 7. In UPDATE statements, if you only change one or two columns, do not update all columns; otherwise frequent calls cause significant performance cost and generate a large amount of log.
Database optimization scheme?
- Answer tip: answer the question point by point.
- 1. Optimize indexes, SQL statements, analyze slow queries;
- 2. Design the database strictly according to the design paradigm of the database;
- 3. Use the cache to save disk IO by placing frequently accessed data in the cache without the need for frequently changing data;
- 4. Optimize hardware: use SSDs, use disk queue technology (RAID0, RAID1, RAID5), etc.;
- 5. Use MySQL's internal table partitioning to spread data across different files, which improves disk read efficiency;
- 6. Vertical sub-table; Put some infrequently read data in a table, save disk I/O;
- 7. Master-Slave separate reading and writing, using master-slave replication to separate the read and write operations of the database;
- 8. Sub-database sub-table sub-machine (data volume is particularly large), the main principle is data routing;
- 9. Select the appropriate table engine and optimize the parameters;
- 10. Architecture-level caching, static and distributed;
- 11. Do not use full-text indexing;
- 12. Use faster storage where appropriate, for example keep frequently accessed data in NoSQL storage.
Redis basic data types and related commands?
- Answer tip: first answer the question, then pick one type and explain how to use it, and expand on how you use Redis.
- Answer: Redis supports five data types: string, hash, list, set, and zset (sorted set).
String is the most commonly used Redis type; string data is stored as key/value, and the value can hold any data. Common commands: set, get, decr, incr, mget, etc.
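A minimal sketch of the string commands listed above, assuming the `redis-py` client is installed and a Redis server is running locally:

```python
import redis  # redis-py, assumed installed; Redis assumed running locally

r = redis.Redis(host="localhost", port=6379, db=0)

r.set("page:views", 0)        # set / get on a string key
r.incr("page:views")          # incr / decr work on integer-valued strings
r.incr("page:views")
print(r.get("page:views"))    # b'2'
print(r.mget("page:views"))   # mget fetches several keys at once
```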
What are the usage scenarios for Redis?
- Answer tip: answer point by point.
Reply:
- 1. Fetching the latest N items of data
- 2. Leaderboards / Top-N operations
- 3. Applications that need precisely set expiration times
- 4. Counters
- 5. Uniqueness (dedup) operations, e.g. getting all distinct values within a certain period
- 6. Pub/Sub for building real-time messaging systems
- 7. Building queue systems
- 8. Caching
Explain bubble sort?
- Answer tip: explain the bubbling principle, preferably write it by hand, and expand to other sorting algorithms.
Bubble sort idea: repeatedly compare adjacent elements and swap them if they are in the wrong order.
```python
def bubble_improve(lst):
    """Bubble sort with an early-exit flag: stop once a full pass makes no swaps."""
    print(lst)
    flag = 1
    for index in range(len(lst) - 1, 0, -1):
        if flag:
            flag = 0
            for two_index in range(index):
                if lst[two_index] > lst[two_index + 1]:
                    lst[two_index], lst[two_index + 1] = lst[two_index + 1], lst[two_index]
                    flag = 1
        else:
            break
    print(lst)

lst = [10, 1, 35, 61, 89, 36]  # sample values; the original list was garbled in the source
bubble_improve(lst)
```
Talk about the role of Django middleware?
- Answer tip: answer the question directly.
Answer: Middleware is a processing layer between the request and the response; it is relatively lightweight and changes Django's input and output globally.
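A minimal sketch of a class-based Django middleware, assuming a modern Django version; the middleware name, timing header, and module path are illustrative:

```python
# myapp/middleware.py -- a minimal middleware sketch (names are illustrative)
import time

class TimingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response   # called once at server start

    def __call__(self, request):
        start = time.time()                # code run before the view
        response = self.get_response(request)
        response["X-Elapsed"] = str(time.time() - start)  # code run after the view
        return response

# settings.py: add "myapp.middleware.TimingMiddleware" to MIDDLEWARE
```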
Talk about MVVM?
- Answer tip: Explain MVVM, then circle back to the MVT pattern you are familiar with.
MVVM takes two-way binding between the view and the data model as its core idea: the View and the Model have no direct connection and interact through the ViewModel. The interaction between Model and ViewModel is bidirectional, so a change in view data updates the data source, and a change in the data source is immediately reflected in the view.
What do you know about Django?
- Answer tip: Say what Django is, then its pros and cons, and describe how you used it in a project.
- Answer: Django takes the comprehensive, batteries-included approach. It is best known for its fully automatic admin backend: just use the ORM and define simple model objects, and Django can automatically generate the database structure and a full-featured admin interface.
- Advantages:
Very high development efficiency.
Suitable for small and medium websites, or as a tool for quickly prototyping larger products.
Cleanly separates code from presentation; Django's template system fundamentally rules out writing code and processing data inside templates.
Disadvantages:
Its performance does not scale indefinitely;
a Django project usually needs refactoring to meet performance requirements once traffic reaches a certain scale.
Django's built-in ORM is highly coupled with the other modules in the framework.
Talk about jieba word segmentation?
- Answer tip: What is jieba segmentation and what can it do?
- Reply:
- jieba supports three segmentation modes:
- Precise mode: cuts the sentence as precisely as possible, suitable for text analysis;
- Full mode: scans out every word in the sentence that could form a phrase; very fast, but cannot resolve ambiguity;
- Search-engine mode: based on precise mode, long words are segmented again; higher recall, suitable for search-engine tokenization.
- Functions:
Word segmentation, custom dictionaries, keyword extraction, part-of-speech tagging, parallel segmentation, tokenize (returning each word together with its start position in the original text), and a ChineseAnalyzer for the Whoosh search engine.
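A minimal sketch of the three modes, assuming the `jieba` package is installed (the sample sentence is illustrative):

```python
import jieba  # assumed installed: pip install jieba

text = "我来到北京清华大学"

print("/".join(jieba.cut(text)))                  # precise mode (default)
print("/".join(jieba.cut(text, cut_all=True)))    # full mode: all possible words
print("/".join(jieba.cut_for_search(text)))       # search-engine mode: re-segments long words

jieba.add_word("自定义词")                          # add a custom dictionary entry
```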
How do you implement Django redirection? What is the status code?
- Answer tip: answer the question directly.
Use HttpResponseRedirect (or the redirect shortcut) together with reverse; status codes: 302 (temporary) and 301 (permanent).
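A minimal sketch, assuming a URL pattern named "home" exists in urls.py (the view names are illustrative):

```python
from django.http import HttpResponseRedirect
from django.shortcuts import redirect
from django.urls import reverse

def after_login(request):
    # 302 (temporary) redirect by default
    return HttpResponseRedirect(reverse("home"))      # assumes a URL pattern named "home"

def old_page(request):
    # the redirect() shortcut; permanent=True issues a 301 instead of a 302
    return redirect("home", permanent=True)
```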
How do you deduplicate crawled data? Describe a concrete algorithm.
- Answer tip: answer the question directly.
- Answer:
- 1. Use MD5 to generate an electronic fingerprint of each page and compare fingerprints to tell whether a page has changed.
- 2. Nutch-style deduplication: Nutch computes a digest, a 32-bit hash of each collected page's content; if two pages are exactly the same, their digest values are certain to be the same.
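A minimal sketch of MD5 fingerprint deduplication with `hashlib`; the in-memory `set` is illustrative and could be swapped for a database or Redis in practice:

```python
import hashlib

seen = set()

def is_duplicate(page_html: str) -> bool:
    """MD5 of the page content acts as an 'electronic fingerprint'."""
    fingerprint = hashlib.md5(page_html.encode("utf-8")).hexdigest()
    if fingerprint in seen:
        return True
    seen.add(fingerprint)
    return False

print(is_duplicate("<html>hello</html>"))  # False, first time seen
print(is_duplicate("<html>hello</html>"))  # True, identical content
```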
Is it better to write a crawler with multiple processes or multiple threads? Why?
- Answer tip: compare the pros and cons of multi-threaded and multi-process crawlers.
Answer: For IO-intensive code (file handling, web crawlers, etc.), multithreading effectively improves efficiency: with a single thread, an IO operation forces the program to wait and wastes time, while with multiple threads the interpreter can switch to thread B while thread A waits, so CPU resources are not wasted and throughput improves. In real data collection you have to balance speed and responsiveness against your machine's own hardware, and then decide between multiprocessing and multithreading.
What is the difference between NumPy and Pandas? What are their respective application scenarios?
- Answer tip: answer the question directly.
- NumPy is an extension package for numerical computation, focused on pure math on arrays.
Pandas is a data-processing module built on top of that matrix model. It provides the DataFrame data structure, which maps well onto the table structures used in statistical analysis, and exposes computation interfaces that can use NumPy or other methods underneath.
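A minimal sketch of the difference, assuming both packages are installed (sample data is illustrative):

```python
import numpy as np
import pandas as pd

# NumPy: pure numerical arrays and vectorized math
arr = np.array([[1, 2], [3, 4]])
print(arr.mean(axis=0))          # column means -> [2. 3.]

# Pandas: table-like DataFrame built on top of NumPy, with labels
df = pd.DataFrame({"price": [10.0, 12.5, 9.0], "volume": [100, 80, 120]})
print(df.describe())             # statistics per column
print(df[df["price"] > 9.5])     # label-based filtering
```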
How do you handle captchas?
- Answer tip: answer the question directly.
- 1. Handle the captcha within Scrapy itself.
- 2. Get the URL of the captcha image and call a third-party paid API to crack the captcha.
How is dynamic stock information crawled?
- Answer tip: first describe the crawling methods, then give an example.
- There are two ways to get stock data at the moment:
- 1. An HTTP/JavaScript interface that returns the data
- 2. A web-service interface
- Sina stock data interface
- Take Daqin Railway (stock code 601006) as an example. To get its latest quote, just visit Sina's stock data interface: http://hq.sinajs.cn/list=sh601006. This URL returns a string of text such as:
var hq_str_sh601006="Daqin Railway, 27.55, 27.25, 26.91, 27.55, 26.20, 26.91, 26.92, 22114263, 589824680, 4695, 26.91, 57590, 26.90, 14700, 26.89, 14300,
26.88, 15100, 26.87, 3100, 26.92, 8900, 26.93, 14230, 26.94, 25150, 26.95, 15220, 26.96, 2008-01-11, 15:05:32";
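A minimal sketch of fetching the interface above with `requests`; whether the endpoint is still publicly reachable, its exact encoding, and the meaning of each field are assumptions:

```python
import requests  # assumed installed

# The interface described above; availability may vary.
url = "http://hq.sinajs.cn/list=sh601006"
resp = requests.get(url, timeout=5)
resp.encoding = "gbk"                      # the feed is assumed to be GBK-encoded text

# The payload looks like: var hq_str_sh601006="...comma-separated fields...";
raw = resp.text.split('"')[1]
fields = raw.split(",")
print(fields[:4])                          # name and the first few quote fields
```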
Deduplication in Scrapy?
- Answer tip: cover each approach in an organized way.
- When the data volume is small, deduplication can be done directly in memory; in Python you can use set().
- When deduplication needs to be persisted, Redis's set data structure can be used.
- When the data volume is larger, long strings can first be compressed into 16/32/40-character digests with a hash algorithm, and then deduplicated with either of the two methods above.
- When the data volume reaches hundreds of millions (or billions, tens of billions), memory is limited and you must use a few "bits" per item to deduplicate. A Bloom filter maps each object onto several bits in memory and uses the 0/1 values of those bits to decide whether the object has already been seen. However, a Bloom filter lives in one machine's memory, which makes it hard to persist (a crash loses everything) and inconvenient for unified deduplication across a distributed crawler. If the Bloom filter's bits are kept in Redis instead, both problems are solved.
- Simhash works best for similarity: it converts a document into a 64-bit fingerprint (a "feature word"), and two documents are judged similar if the distance between their fingerprints is less than n (empirically, n is usually 3).
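A toy sketch of the Bloom-filter idea described above: each item is mapped to a few bit positions, and membership is judged from those bits. The sizes, hash count, and use of MD5 are illustrative; production code would use an existing Bloom filter library or Redis bitmaps:

```python
import hashlib

class SimpleBloomFilter:
    """A toy Bloom filter: each URL maps to k bit positions in a fixed bit array."""

    def __init__(self, size_bits=1 << 20, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = SimpleBloomFilter()
bf.add("http://example.com/page1")
print("http://example.com/page1" in bf)   # True
print("http://example.com/page2" in bf)   # almost certainly False
```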
What are the options for distributed task processing, and which one is best?
- Answer tip: first list the options, then analyze which one is better.
- Celery, Beanstalk, Gearman; personally I think Gearman is better.
The main reasons are as follows:
- 1) The technology is simple, so maintenance cost is low.
- 2) Simplicity first: it meets the current technical requirements (distributed task processing, support for both asynchronous and synchronous tasks, persistent task queues, simple maintenance and deployment).
- 3) There are mature use cases. Instagram uses Gearman for its image-processing tasks; that is a proven success we can learn from.
Differences between POST and GET?
- Answer tip: answer from several angles in an organized way.
- 1. With a GET request, the data is appended to the URL: the URL and the data are separated by "?", and multiple parameters are joined with "&". The URL is encoded in ASCII rather than Unicode, which means all non-ASCII characters must be encoded before transmission.
A POST request places the data in the body of the HTTP request packet (for example, item=bandsaw in the body is the actual transferred data). So the data of a GET request is exposed in the address bar, while the data of a POST request is not.
2. Size of transferred data:
The HTTP specification places no limit on URL length or on the amount of data transferred. In practice, however, browsers and servers do limit URL length, so the data transferred with a GET request is constrained by the URL length.
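A minimal sketch of the difference using `requests`; httpbin.org is used purely as an echo service and is an assumption, not part of the original answer:

```python
import requests  # assumed installed

# GET: parameters are appended to the URL after '?', joined with '&'
r = requests.get("https://httpbin.org/get", params={"q": "bandsaw", "page": 1})
print(r.url)            # .../get?q=bandsaw&page=1  -- data visible in the address

# POST: data travels in the request body, not in the URL
r = requests.post("https://httpbin.org/post", data={"item": "bandsaw"})
print(r.json()["form"]) # {'item': 'bandsaw'}
```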
Talk about what you know about Selenium and PhantomJS?
- Answer tip: explain what each is and how they are used together.
- Selenium is a web automation testing tool that drives a browser according to our instructions: it can automatically load pages, fetch the data we need, take screenshots, or verify that certain actions happened on the site. Selenium does not ship with a browser of its own and has no browser capabilities itself; it must be combined with a third-party browser. Sometimes we need it to run purely from code, so we can use a tool called PhantomJS in place of a real browser. The Selenium library exposes an API called WebDriver. WebDriver acts a bit like a browser: it loads websites, but it can also be used like BeautifulSoup or other selector objects to find page elements, interact with them (send text, click, etc.), and perform the other actions needed to run a web crawler.
PhantomJS is a WebKit-based "headless" (no interface) browser: it loads websites into memory and executes the JavaScript on the page, and because it never shows a graphical interface it runs more efficiently than a full browser.
Combining Selenium and PhantomJS gives a very powerful crawler that can handle JavaScript, cookies, headers, and anything else a real user would do.
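A minimal sketch of driving a headless browser with Selenium. PhantomJS support has been removed from recent Selenium releases, so headless Chrome is used here as the equivalent "no interface" browser; the URL and driver setup are assumptions:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")          # run without a visible window, like PhantomJS

driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
try:
    driver.get("https://example.com")
    print(driver.title)                      # the page is rendered, JS included
    print(len(driver.page_source))
finally:
    driver.quit()
```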
Commonly used anti-crawling countermeasures?
- Answer tip: answer from several angles in an organized way.
Answer: 1. Use proxies; 2. Reduce the access frequency; 3. Rotate the User-Agent; 4. Handle dynamically loaded HTML data; 5. Handle captchas; 6. Handle cookies.
What are the common ways of dealing with anti-crawler measures?
- Answer tip: answer from several angles in an organized way.
- 1) Headers-based anti-crawling is the most common strategy: many sites check the User-Agent in the request headers, and some also check the Referer (some resource sites use the Referer for hot-link protection). If you run into this mechanism, add headers to the crawler: copy a browser's User-Agent into the crawler's headers, or set the Referer to the target site's domain. Header-based detection is easily bypassed by modifying or adding headers in the crawler.
- 2) Anti-crawling based on user behavior
Some sites detect user behavior, such as the same IP visiting the same page many times in a short period, or the same account performing the same operation many times in a short period. Most sites are the former type, and IP proxies solve it: you can write a dedicated crawler that collects publicly available proxy IPs and saves the ones that pass validation. Such a proxy-IP crawler is needed often, so it is worth keeping one ready. With a large pool of proxy IPs you can switch IP every few requests, which is easy to do with requests or urllib2, and so the first kind of anti-crawling is easily bypassed.
For the second case, you can wait a random number of seconds between requests. On sites with logic loopholes, you can also bypass the "same account cannot repeat the same request within a short time" restriction by requesting a few times, logging out, logging back in, and continuing. (A sketch combining headers, proxies, and random delays appears after point 3 below.)
- 3) Anti-crawling on dynamic pages
Most of the above applies to static pages. On some sites, the data we need is fetched through AJAX requests or generated by JavaScript. First analyze the network traffic with Fiddler. If we can find the AJAX requests and work out the meaning of their parameters and responses, we can use the methods above: simulate the AJAX request directly with requests or urllib2 and parse the JSON response to get the data we need.
- Being able to simulate the AJAX request directly is ideal, but some sites encrypt all the parameters of their AJAX requests, so we simply cannot construct the request we need. In that case we use Selenium + PhantomJS: invoke a browser kernel and let PhantomJS execute the JS, simulating human operation and triggering the page's scripts. Everything from filling in forms to clicking buttons to scrolling the page can be simulated, without worrying about the specific request/response process; we just replay the whole human browsing flow and collect the data.
A framework like this bypasses almost all anti-crawling measures, because instead of merely disguising itself as a browser (adding headers only disguises the crawler to a degree), it is a browser: PhantomJS is simply a browser without an interface, except that it is not a human operating it. Selenium + PhantomJS can do much more as well, for example handling touch-style captchas.
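A minimal sketch combining the measures from points 1) and 2) above with `requests`; the User-Agent, Referer, proxy address, and URLs are placeholders:

```python
import random
import time
import requests  # assumed installed; header values and proxy addresses are placeholders

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # copy a real browser UA
    "Referer": "https://www.example.com/",                      # match the target site
}
proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}  # placeholder proxy

for url in ["https://www.example.com/page/1", "https://www.example.com/page/2"]:
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    print(url, resp.status_code)
    time.sleep(random.uniform(2, 5))   # random delay to avoid frequency-based detection
```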