A big part of my work over the past year has been building a cloud platform. In short, it is a PaaS for users inside the company: on our platform a user can conveniently scale a stand-alone service program into a multi-instance program and expose it externally as a platform service. Let me share some of that here.
First, let's look at what problem we solved for users, i.e. where their pain point was.
Why not provide the basic algorithm to applications directly as a library?
If the basic algorithm is handed over as a library, perhaps a dynamic library, then the application has to worry about build dependencies, about dictionaries and other model files, and the development cost is high. Worse, whenever the underlying algorithm or its model files are upgraded, every application has to be notified and modified. If the basic algorithm is unstable and occasionally core-dumps, it inevitably drags down the application's service quality. And if the algorithm is resource-hungry, say heavy on CPU or memory, it may compete with the application itself for resources.
Turning the basic algorithm into a service
To avoid the drawbacks of shipping the basic algorithm to applications as a dynamic library, we can provide it as a network service instead; this is what I mean by service-izing the basic algorithm. The basic algorithm becomes a network service, and users simply call that service. As long as the service interface stays unchanged, the algorithm can be upgraded however it likes without the application modifying anything. Open-source RPC solutions such as Thrift satisfy this need very well.
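To make the idea concrete, here is a minimal sketch of "basic algorithm as a network service", using Go's standard net/rpc package as a stand-in for Thrift or sofa-pbrpc. The Segmenter type and its trivial per-character "algorithm" are invented for illustration; the point is that dictionaries, model files, and library dependencies all stay behind the RPC boundary.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Segmenter is a hypothetical stand-in for some basic algorithm.
// Callers only ever see the RPC interface, so the implementation
// (and its model files) can be upgraded freely on the server side.
type Segmenter struct{}

type SegmentArgs struct{ Text string }
type SegmentReply struct{ Tokens []string }

// Segment exposes the algorithm as an RPC method; here it just
// splits the text into single characters.
func (s *Segmenter) Segment(args SegmentArgs, reply *SegmentReply) error {
	for _, r := range args.Text {
		reply.Tokens = append(reply.Tokens, string(r))
	}
	return nil
}

func main() {
	srv := rpc.NewServer()
	srv.Register(new(Segmenter))

	// An in-process pipe stands in for a real TCP listener.
	cliConn, srvConn := net.Pipe()
	go srv.ServeConn(srvConn)

	client := rpc.NewClient(cliConn)
	defer client.Close()

	var reply SegmentReply
	if err := client.Call("Segmenter.Segment", SegmentArgs{Text: "abc"}, &reply); err != nil {
		panic(err)
	}
	fmt.Println(reply.Tokens) // [a b c]
}
```

The application's side of the contract is just the method name and the two message types; everything else is the platform's problem.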
How to guarantee quality of service
Once the basic algorithm is a service, how do we guarantee the quality of service and latency requirements mentioned above? Reliability in particular. If machines never failed, network gear always ran normally, and the service were stable enough to never core-dump and never need an upgrade, then we could just grab a few machines and start it up. Reality is crueler. First, the service will have faults, and it will be upgraded constantly (a program that never needs upgrading is either perfect or unused; anyone who writes code knows you can never write the perfect program, and writing less-bad code is the realistic pursuit). Upgrades must not affect quality of service. The service also cannot monopolize its machines; it will very likely be co-located with other services, so choosing suitable machines for deployment is itself a problem. And often one service instance is not enough to meet demand, so you deploy multiple instances, which raises even more questions: which instance should a request go to, how to load-balance, and so on.
Problems faced by service platformization
Below, I'll call the process of taking the service-ized basic algorithm and exposing it externally with a promised quality of service the "platformization" of the service. That is, a basic-algorithm service provided through the platform gets the corresponding quality-of-service guarantees. So what are the challenges of platformization?
If every basic algorithm were platformized on its own, behind closed doors, it would take an enormous amount of duplicated work. That is exactly the point of a PaaS: do all the generic service-ization work once, and free the strategy developers to focus on strategy. So what problems does the platform solve; where are the pain points of a service-ized basic algorithm?
Platformizing a service is not simply deploying it on a few machines and exposing it, especially once a quality of service has been promised to applications. Moreover, to raise machine utilization, several services will be co-located on the same machines, so placing services properly is itself a problem. Beyond that, a platformized service has to take care of many things; users generally care about the following:
1. Service interface definition: how to define the interface so that users can integrate on a self-service basis
2. Service addressing: how the application finds the backend strategy service, how requests are load-balanced, and how the application is notified when backend addresses change
3. Service launch: bringing a new service online must not affect existing services, while still satisfying the new service's resource requirements
4. Service expansion: as requests grow, how to add instances to meet the application's needs
5. Machine failure: when a machine fails, how to keep quality of service unaffected and migrate the instances on the failed machine to other nodes
6. Service upgrade: how to upgrade without compromising quality of service
7. Access statistics: how to count traffic and service latency, broken down by user and by service, similar to a Hadoop Counter; i.e. a reasonably simple, general-purpose counter
8. Monitoring and alerting: when the service misbehaves, how to raise alarms so that problems are discovered and fixed at the earliest moment
9. Access control and flow control: some services are confidential and may only be used by designated users; and an application's request rate needs to be capped by agreement
10. Dynamic scaling of service instances: to save resources and raise utilization, instances should scale out when load is high to meet computing demand, and scale in when load is low so the machines can serve other services.
The platform's architecture exists, of course, to solve the problems that platformizing the basic algorithms runs into. But beyond the user-visible issues above, the platform itself must also handle:
1. Resource management and scheduling: deciding which machines are assigned to a service
2. Service deployment: once machines are assigned, how to download the service instance's package and start the service after the download completes
3. Address reporting and updating: once an instance is up, how it reports its address so applications can use it, and how applications are notified when addresses change
4. Resource isolation: give a service only its promised quota, such as so many CPU cores and so much memory, so that one service's problems cannot affect other services
5. No performance penalty: isolation should use virtualization techniques with negligible overhead; the goal is for users to see the same performance as if they were using the physical machine directly
6. A billing system that charges according to the resources consumed. For a company-internal platform where the machines belong to the platform and users merely use them, this can be skipped; but if service providers contribute machines in exchange for corresponding computing resources, it is still needed.
Because of all the problems above, RDs cannot focus on implementing their core strategies; they have to do a great deal of platform work instead. Some of it can be handled by manual operations, but that makes the service too expensive to maintain. Manual operations also lag behind events, which can degrade quality of service in some scenarios. And machine resource utilization is yet another complex problem on top of that.
How to solve the above problems
So how do we solve these problems? Here are some possible approaches:
1. Defining the service interface: use a common RPC solution, such as Thrift. We use sofa-pbrpc (https://github.com/baidu/sofa-pbrpc), an RPC framework widely used inside Baidu, with a rich set of components suited to the company's internal application scenarios.
2. Service addressing: at its simplest, define an interface that tells the application the addresses of the backend service. sofa-pbrpc ships an addressing component: the application only needs to specify a name and addressing is done for it, and the component also implements load-balancing strategies that spread requests evenly across backend instances. You can also build addressing on ZooKeeper: when a service instance starts, it registers its address under a ZK node (as an ephemeral node, so the entry is removed automatically if the instance's ZK connection drops), and applications simply watch that node for data changes. The simplest load-balancing strategy is to send requests to the backends in sequence. The catch is that machines may differ in processing power, i.e. the cluster is heterogeneous, so you can track each node's capacity and, when one node's pending requests pile up for too long, prefer the nodes that are processing requests faster.
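The two balancing strategies just mentioned, plain round-robin and "prefer the node with the fewest pending requests", can be sketched in a few lines of Go. The balancer type and its field names are my own invention for illustration:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// backend tracks one service instance and its in-flight request
// count, so a heterogeneous cluster can be balanced by pending load
// rather than blindly round-robin.
type backend struct {
	addr    string
	pending int64
}

type balancer struct {
	backends []*backend
	next     uint64 // round-robin cursor
}

// pickRoundRobin hands requests out in sequence.
func (b *balancer) pickRoundRobin() *backend {
	i := atomic.AddUint64(&b.next, 1)
	return b.backends[int(i-1)%len(b.backends)]
}

// pickLeastPending prefers the node with the fewest outstanding
// requests, which automatically favours faster machines.
func (b *balancer) pickLeastPending() *backend {
	best := b.backends[0]
	for _, be := range b.backends[1:] {
		if atomic.LoadInt64(&be.pending) < atomic.LoadInt64(&best.pending) {
			best = be
		}
	}
	return best
}

func main() {
	lb := &balancer{backends: []*backend{
		{addr: "10.0.0.1:8000"}, {addr: "10.0.0.2:8000"},
	}}
	lb.backends[0].pending = 5 // first node is busy
	fmt.Println(lb.pickRoundRobin().addr)   // 10.0.0.1:8000
	fmt.Println(lb.pickLeastPending().addr) // 10.0.0.2:8000
}
```

In a real deployment the pending counters would be incremented and decremented around each RPC, and the backend list would come from the addressing component.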
3. Service launch: this requires cluster resource management and scheduling. The user submits a launch request with resource requirements, such as how many CPU cores, how much memory and network each instance needs, plus the number of instances, and the cluster chooses machine nodes that satisfy the request and assigns them to the service. There is plenty of industry research on such allocation algorithms; the core tension is raising resource utilization while still guaranteeing quality of service. If you are interested, look up the Dominant Resource Fairness (DRF) algorithm.
Once the compute resources are allocated, the service is deployed onto the chosen nodes and started after deployment completes.
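As a concrete illustration of DRF, here is a toy scheduler loop that always gives the next task to the user with the smallest dominant share (their largest allocated fraction of any single resource). The cluster size and per-task demands below follow the classic DRF example of 9 CPUs and 18 GB with one memory-heavy and one CPU-heavy user; everything else is a simplified sketch, not a production scheduler:

```go
package main

import "fmt"

// user holds a per-task demand vector and its running allocation,
// for a toy two-resource cluster: <CPU cores, memory GB>.
type user struct {
	name   string
	demand [2]float64 // per-task <cpu, mem>
	alloc  [2]float64
}

// dominantShare is the user's largest allocated fraction of any resource.
func dominantShare(u *user, total [2]float64) float64 {
	s := 0.0
	for i := range total {
		if f := u.alloc[i] / total[i]; f > s {
			s = f
		}
	}
	return s
}

// drfStep gives one more task to the user with the smallest dominant
// share whose demand still fits in the cluster; it returns that user
// (nil if nothing fits) and the updated usage.
func drfStep(users []*user, total, used [2]float64) (*user, [2]float64) {
	var pick *user
	for _, u := range users {
		fits := used[0]+u.demand[0] <= total[0] && used[1]+u.demand[1] <= total[1]
		if fits && (pick == nil || dominantShare(u, total) < dominantShare(pick, total)) {
			pick = u
		}
	}
	if pick != nil {
		for i := range used {
			pick.alloc[i] += pick.demand[i]
			used[i] += pick.demand[i]
		}
	}
	return pick, used
}

func main() {
	total := [2]float64{9, 18} // 9 CPUs, 18 GB
	users := []*user{
		{name: "A", demand: [2]float64{1, 4}}, // memory-heavy
		{name: "B", demand: [2]float64{3, 1}}, // CPU-heavy
	}
	used := [2]float64{}
	for {
		var p *user
		p, used = drfStep(users, total, used)
		if p == nil {
			break
		}
	}
	// A ends with <3 CPU, 12 GB>, B with <6 CPU, 2 GB>: equal
	// dominant shares of 2/3 each.
	fmt.Println(users[0].alloc, users[1].alloc) // [3 12] [6 2]
}
```

Each user ends up with the same dominant share, which is exactly the fairness property DRF is after.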
4. Service expansion: when the service's load rises, the number of instances has to grow. This is much like a service launch: allocate computing resources, deploy, start the new instances, and so on.
5. Machine failure: every cloud platform must plan for this; once the number of machines is large, failures become practically inevitable. The faulty machine has to be taken offline, i.e. marked unavailable by the resource-scheduling module, and then the service instances running on it must be migrated to other nodes.
6. Service upgrade: this process need not involve the resource-scheduling module. One option is an in-place upgrade on each node where an instance runs: download the new package directly, and once the download completes and the environment is ready, stop the old process and start the new deployment. A restart like this inevitably affects quality of service. If the service does not require 100% availability, this simplest method may be acceptable; the client side can also add retry logic, if the extra latency is tolerable.
The other method is to allocate some fresh nodes, start new service instances there, remove the old machines from the addresses the applications see, and make sure no new requests reach the old instances. When an old instance's load drops to zero, take it offline. This is slightly more complex but does not affect quality of service. However, if the service has hundreds or thousands of instances, the cost is high, and there may not even be enough spare nodes to complete the upgrade. You can of course upgrade in batches, which is undoubtedly more complicated still.
The in-place method has one advantage: because the upgrade happens in place, rolling back to the old version is fast. No package needs to be downloaded again, so the rollback takes only as long as a process restart.
7. Access statistics: every internet company has a module like this: collect distributed logs, aggregate them, and present them, producing reports of the service's call volume, latency, and so on. The open-source community has many implementations. Typically, each service instance prints access-statistics logs, an agent deployed on each node collects them and sends them to something like a message queue, and worker processes then consume and aggregate them.
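The "relatively simple universal Counter" can be as small as a mutex-protected map keyed by user and service. This sketch omits the log-collection agents and the message queue and only shows the aggregation core:

```go
package main

import (
	"fmt"
	"sync"
)

// counters aggregates access statistics keyed by user and service,
// roughly in the spirit of a Hadoop Counter: many workers call Add
// concurrently, and a reporter periodically reads the totals.
type counters struct {
	mu sync.Mutex
	m  map[string]int64
}

func newCounters() *counters { return &counters{m: make(map[string]int64)} }

func (c *counters) Add(user, service string, n int64) {
	c.mu.Lock()
	c.m[user+"/"+service] += n
	c.mu.Unlock()
}

func (c *counters) Get(user, service string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[user+"/"+service]
}

func main() {
	c := newCounters()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // 100 concurrent requests from one user
		wg.Add(1)
		go func() { defer wg.Done(); c.Add("app-x", "segmenter", 1) }()
	}
	wg.Wait()
	fmt.Println(c.Get("app-x", "segmenter")) // 100
}
```

Because the key carries both user and service, the same structure distinguishes different users accessing different services, which is exactly what the reports need.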
8. Monitoring and alerting: monitor the platform's overall resource usage, each node's resource usage such as CPU idle and memory, the state of each service instance, and so on. When a metric crosses an abnormal threshold, alert the user or the cluster administrator.
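Threshold-based alerting of this kind can be sketched in a few lines. The metric names and thresholds here are made up, and a real system would route alerts to on-call rather than print them:

```go
package main

import "fmt"

// rule pairs a metric name with the threshold above which we alert.
type rule struct {
	metric    string
	threshold float64
}

// checkThresholds compares sampled metrics against the configured
// rules and returns one alert line per breach.
func checkThresholds(samples map[string]float64, rules []rule) []string {
	var alerts []string
	for _, r := range rules {
		if v, ok := samples[r.metric]; ok && v > r.threshold {
			alerts = append(alerts, fmt.Sprintf("ALERT %s=%.2f > %.2f", r.metric, v, r.threshold))
		}
	}
	return alerts
}

func main() {
	// Hypothetical samples from one node's agent.
	samples := map[string]float64{"cpu_idle_pct": 3, "mem_used_pct": 91}
	rules := []rule{{"mem_used_pct", 85}}
	for _, a := range checkThresholds(samples, rules) {
		fmt.Println(a)
	}
}
```

The interesting engineering is elsewhere: collecting the samples reliably and deduplicating noisy alerts, which this sketch deliberately skips.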
9. Access control and flow control: access control can be tied into the company's account system, which gives very fine-grained control; a simpler practice is to restrict by caller IP, i.e. an IP whitelist. Flow control matters especially for offline services: they care about throughput, but a single Hadoop job can put enormous pressure on the backend, so flow control ensures the backend service is not overwhelmed and that jobs do not compete each other's resources away. Online services sometimes need flow control too, e.g. rejecting a fraction of requests: that beats everyone being slow and nobody being served well. Speaking of which, this reminds me of the slow-node problem: a cluster is not afraid of dead nodes, it is afraid of a half-dead node that runs extremely slowly, which in bad cases can drag down the entire platform. See this article: http://danluu.com/limplock/
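A token bucket is a common minimal mechanism for the flow-control side: each caller's bucket refills at an agreed rate up to a burst size, and requests are rejected, not queued, once the tokens run out, so a heavy offline job cannot crush the backend. A sketch with made-up rates:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket refills at `rate` tokens per second up to `burst`.
// Allow spends one token if available, otherwise rejects.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64
	burst  float64
	rate   float64
	last   time.Time
}

func newBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{tokens: burst, burst: burst, rate: rate, last: time.Now()}
}

func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate // refill since last call
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	b := newBucket(10, 2) // hypothetical agreement: 10 req/s, burst of 2
	allowed := 0
	for i := 0; i < 5; i++ { // 5 back-to-back requests
		if b.Allow() {
			allowed++
		}
	}
	// On a fast machine only the burst of 2 gets through immediately;
	// the rest are rejected until the bucket refills.
	fmt.Println(allowed)
}
```

The same structure, keyed per user in a map, gives each application its own agreed rate.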
10. Dynamic scaling of service instances: some readers will ask, if an instance has nothing to compute, doesn't it just sit there idle? But it still occupies memory at the very least, and clusters generally account for a service's allocation in units of CPU and memory, so whatever a service reserves is guaranteed to it and unavailable to others. For online services, treating those reserved resources as "virtual" and overcommitting them can hurt quality of service in some scenarios, which is why genuinely scaling in and out matters.
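The scale-out/scale-in decision itself can be a simple rule, of the kind common in autoscalers: target enough instances that the average load per instance sits near a set point, clamped to a floor and ceiling so the service never scales to zero or runs away. The numbers below are illustrative only:

```go
package main

import "fmt"

// desiredInstances is a toy scaling rule: size the service so that
// average load per instance lands near targetLoad, clamped to
// [lo, hi].
func desiredInstances(current int, avgLoad, targetLoad float64, lo, hi int) int {
	n := int(float64(current)*avgLoad/targetLoad + 0.5) // round to nearest
	if n < lo {
		n = lo
	}
	if n > hi {
		n = hi
	}
	return n
}

func main() {
	// 4 instances at 90% load with a 60% target: scale out to 6.
	fmt.Println(desiredInstances(4, 0.9, 0.6, 2, 20)) // 6
	// 4 instances at 15% load: scale in, but stop at the floor of 2.
	fmt.Println(desiredInstances(4, 0.15, 0.6, 2, 20)) // 2
}
```

The hard part in practice is not this arithmetic but damping it, so that a load spike does not make the instance count oscillate.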
Any one of the problems above can spawn many more problems of its own. So this article is really the list of things I wanted to share that deserve attention when you build a cloud platform. That's about it.
Over the past year we have implemented such a cloud platform in Go, and I heartily recommend Go here; that is what the "go" in the title means. Go is the systems programming language of the cloud era, just as C was the server programming language of the last decade. Learn it and you will see what that sentence means.
Finally, an advertisement: Baidu colleagues who see this article can search for Sofacloud on the intranet, a cloud platform that thoroughly cures strategy developers' pain. On top of that, our architectural optimizations can save business teams up to 85% of their machines!! We currently handle tens of billions of computations per day, run thousands of compute instances, and serve dozens of product lines!!! What are you waiting for? Hurry up and migrate to our platform!!!
After switching to Sofacloud, even your hair grows back, am I right?!! Looking good, am I right?!!
Cloud, go!
Copyright notice: this is the author's original article; please do not reproduce it without the author's permission.