Summary of middleware technology experience in system design planning

Source: Internet
Author: User
Tags failover

I. Middleware technology experience in system design planning

1.1 Middleware application architecture selection

The central question in this architecture choice is whether the business logic lives in the foreground (front-end) application or in the background middleware application. One of the reasons for introducing a middleware platform in the first place is to manage business logic centrally. The debate persists mainly because developers do not understand the business logic deeply enough and, during application design, have not shaken off the influence of the earlier two-tier structure. It also persists because centralizing business logic increases the development workload. We believe that concentrating business logic on the middleware server not only helps manage that logic centrally but also has clear advantages for later operation, maintenance, and software upgrade management. The cost is a considerably larger development workload.

1.2 Design and optimization principles of middleware service distribution

In Tuxedo, a server can be understood as a UNIX process, and a service as a function inside that server process. Users can divide servers and services at will: they can put all services in a single server, or adopt a "one service, one server" approach. Tuxedo places no restrictions on this.

The performance of middleware applications directly affects the stability and continuity of front-end business services. Factors that affect it include hardware resources, database response time, and the quality of the middleware application software; the distribution of middleware application services also plays a vital role. Here is an example from our experience. When the front end enters the charging interface, the application calls a service named A, which is hosted in a server named SA. SA also hosts many other services, including an online statistics service B with a long execution time. When every SA instance is busy executing service B, the front end's call to service A gets no response and the charging interface cannot be opened, which interrupts service at the front end. This is a typical system fault caused by improper service distribution.

One of the most basic principles we follow in system planning and design is to put "similar services" in the same server. "Similar" here means that the services have similar size, execution time, complexity, or function. Consider an extreme example: service A is very simple, with an average execution time of 100 milliseconds, while service B queries the database and may take 20 seconds. Placing these two services in the same server can have serious consequences. The basic scheduling unit of the operating system is the process, and service execution within a process is serialized. Tuxedo places the requests sent to the same server in the same message queue. While service B is executing, a request for service A may have to wait in the queue for up to 20 seconds before it runs, even though its own execution time is only 100 milliseconds. This is clearly intolerable and should be avoided.
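The queueing effect described above can be illustrated with a minimal simulation. This is only a sketch of a single-threaded server draining one FIFO queue; the function name and the arrival times are assumptions made for illustration, while the 100 ms and 20 s execution times come from the example in the text.

```python
def fifo_response_times(requests):
    """Simulate one single-threaded server with one FIFO request queue.

    requests: list of (name, arrival_s, service_s), already in arrival order.
    Returns a list of (name, wait_s, response_s).
    """
    clock = 0.0  # time at which the server becomes free
    results = []
    for name, arrival, service in requests:
        start = max(clock, arrival)   # cannot start before the server is free
        wait = start - arrival        # time spent sitting in the queue
        clock = start + service       # server is busy until this moment
        results.append((name, wait, clock - arrival))
    return results


# Services A (0.1 s) and B (20 s) share a server; B's request is queued first.
shared = fifo_response_times([("B", 0.0, 20.0), ("A", 0.0, 0.1)])

# Service A in a server of its own: nothing is ahead of it in the queue.
separate = fifo_response_times([("A", 0.0, 0.1)])

for name, wait, response in shared + separate:
    print(f"{name}: waited {wait:.1f} s, answered after {response:.1f} s")
```

In the shared case the request for A waits behind B for the full 20 seconds; in its own server it does not wait at all.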

In practice, users cannot fully estimate the execution time and call frequency of each service while designing and developing an application. The division of servers and services is therefore tuned after the application has been developed: record the execution time and frequency of each service during system operation, then adjust the distribution according to that data. The following is a systematic summary of the experience we have gained in practice.

1. Differentiate different types of services: business operations or simple data access

Services in middleware applications can be divided into two types: business operations and data access. It is best to separate the two types at design time. Business-operation services are essentially data modification operations, and data-access services are essentially data query operations. In practice, any operation that modifies data will inevitably also query data; such operations should be classified as the first type.

2. Request/response services must be placed in different servers from conversational (session) services

In Tuxedo, a server supports either request/response services or conversational (session) services; the two communication modes are mutually exclusive. Request/response is highly efficient and is the most frequently used communication type; tpcall, tpacall, and tpforward all belong to it. The conversational mode is suitable for transferring large amounts of data, at the cost of lower efficiency. The two types of services should therefore be placed in separate servers.

3. Services with similar execution time are placed in the same server

A Tuxedo application environment may contain hundreds or even thousands of services, and their execution times differ considerably. A single-threaded server may offer several services, but it executes them serially. Suppose a server has two services, A and B, where A takes 100 ms and B takes 1000 ms; that is, A could run ten times in the time B runs once. While B is executing, ten requests for A may be waiting in the queue, and the server's throughput drops sharply. A natural principle, therefore, is to place services with similar execution times in the same server.

4. Services with the same execution frequency are placed in the same server.

In a typical Tuxedo application system, not all services are executed at the same frequency. Some are called frequently, while others are called only occasionally. Separating them prevents occasionally called services from blocking frequently called ones.
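Principles 3 and 4 can be turned into a first-pass grouping once execution time and call frequency have been recorded. The sketch below is one possible heuristic, not a Tuxedo facility: it buckets services by the order of magnitude of their mean execution time and by a "hot"/"cold" frequency class. The service names, the measurements, and the 60 calls-per-minute threshold are all hypothetical.

```python
import math
from collections import defaultdict


def suggest_groups(stats, hot_calls_per_min=60.0):
    """Group services that have a similar execution time and call frequency.

    stats: {service_name: (mean_exec_ms, calls_per_min)}
    Returns {(time_bucket, freq_class): [service names]}, where time_bucket is
    the order of magnitude of the mean execution time in milliseconds.
    """
    groups = defaultdict(list)
    for name, (mean_ms, per_min) in sorted(stats.items()):
        bucket = int(math.floor(math.log10(max(mean_ms, 1.0))))
        freq = "hot" if per_min >= hot_calls_per_min else "cold"
        groups[(bucket, freq)].append(name)
    return dict(groups)


# Hypothetical measurements collected while the system was running.
measured = {
    "GET_BALANCE": (30.0, 900.0),
    "CHECK_USER": (50.0, 600.0),
    "OPEN_ACCOUNT": (40.0, 2.0),
    "DAILY_STATS": (15000.0, 0.5),
    "MONTH_REPORT": (20000.0, 0.2),
}

groups = suggest_groups(measured)
for key, names in sorted(groups.items()):
    print(key, names)
```

Each resulting group is a candidate for its own server. Services near a bucket boundary still need manual review, since a fixed threshold can split services that are in fact similar.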

5. Avoid deadlock

Two types of deadlock can occur in Tuxedo. In the first, service A in a server calls service B in the same server. A single-threaded server executes its services serially: when service A calls service B, service A has not finished and waits for service B's result, but service B cannot start because the server is still busy executing service A. The call eventually fails with an error.

In the second, services in two servers call each other. For example, service A in server1 calls service X in server2 while, at the same time, service Y in server2 calls service B in server1. The cause is the same as in the first case: each process can only execute services serially.

Both kinds of deadlock should be avoided during application design. The first is easy to spot; the second is less obvious.
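Both deadlock patterns show up as cycles in a graph of which server calls which. The following sketch checks a design for such cycles. It assumes every server runs as a single single-threaded instance, which is the worst case; the data structures and function name are illustrative, and the service and server names follow the example in the text.

```python
def find_call_cycles(service_home, calls):
    """Report server-level call cycles that can deadlock single-threaded servers.

    service_home: {service: server hosting it}
    calls: list of (caller_service, callee_service)
    Returns a sorted list of cycles, each a tuple of server names.
    """
    # Collapse service-to-service calls into a server-to-server graph.
    graph = {}
    for caller, callee in calls:
        graph.setdefault(service_home[caller], set()).add(service_home[callee])

    cycles = set()

    def walk(node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt in path:
                cycle = path[path.index(nxt):]
                # Rotate so the same cycle is always recorded in one form.
                k = cycle.index(min(cycle))
                cycles.add(tuple(cycle[k:] + cycle[:k]))
            else:
                walk(nxt, path + [nxt])

    for start in sorted(graph):
        walk(start, [start])
    return sorted(cycles)


home = {"A": "server1", "B": "server1", "X": "server2", "Y": "server2"}

# Case 1: service A calls service B in the same server.
same_server = find_call_cycles(home, [("A", "B")])

# Case 2: A (server1) calls X (server2) while Y (server2) calls B (server1).
cross_server = find_call_cycles(home, [("A", "X"), ("Y", "B")])

print(same_server)
print(cross_server)
```

A reported cycle is a potential deadlock rather than a certain one: with several instances of a server the calls may still get through, but the design remains fragile under load.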

1.3 Selecting the failover method for the middleware application platform

In a real production environment, there are basically two failover mechanisms on the Tuxedo platform. One uses Tuxedo's own failover mechanism together with two WSNADDR addresses configured on the front end. The other uses dual-host (cluster) software.

With the first method, the two middleware servers are configured in MP mode and each starts a WSL process, and two IP addresses are configured on the middleware front end. In our own experiments we found that Tuxedo's failover mechanism has only one advantage: it requires no additional operating system or software support. It has several disadvantages, the main ones being:

1. Switching the master between the two middleware hosts must be done manually; it cannot be done automatically.

2. With two WSNADDR addresses configured on the front end, if the first address fails it takes a long time to connect to the second address, and the same wait is incurred on every service call.
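The second disadvantage can be made concrete with a small model of a client that tries its configured addresses in order. This is a sketch of the behaviour described above, not of Tuxedo's internal logic; the host names, the 30-second connect timeout, and the 0.05-second cost of a successful connection are hypothetical values.

```python
def connect_with_failover(addresses, is_up, connect_timeout_s, connect_cost_s=0.05):
    """Try each address in order, as a client with an address list would.

    Returns (chosen_address or None, seconds spent before a connection exists).
    """
    spent = 0.0
    for addr in addresses:
        if is_up(addr):
            return addr, spent + connect_cost_s
        spent += connect_timeout_s  # a dead address costs a full timeout
    return None, spent


ADDRESSES = ["//hostA:7110", "//hostB:7110"]  # hypothetical hosts and port
DOWN = {"//hostA:7110"}                       # the first host has failed


def is_up(addr):
    return addr not in DOWN


chosen, per_connect = connect_with_failover(ADDRESSES, is_up, connect_timeout_s=30.0)

# With short connections, every service call pays this penalty again.
calls = 100
total_penalty = calls * per_connect
print(chosen, per_connect, total_penalty)
```

Because the dead address is always tried first, the penalty never amortizes for clients that reconnect on every call, which is why the wait is described as recurring on each service call.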

Failover through dual-host software means that both middleware servers are configured in SHM mode and failover is implemented with the IP switching function of the dual-host software. This method requires adding the WSL information of the other host to each host's configuration file. When the other host fails, the dual-host software switches the IP address and runs a switchover script that starts the backup WSL on the surviving host. This method has the following advantages:

1. The switchover is completed automatically without manual intervention.

2. Front-end services are not affected, ensuring business continuity.

Disadvantages:

1. In normal operation, when the middleware is started on each host, the standby WSL fails to start. This does not affect normal use; the standby WSL takes effect only when the other host fails.

2. Additional dual-machine configuration and script configuration are required.

After this comparison, we selected the dual-host software solution to implement middleware failover. We expect it to serve well in future operation and maintenance.

1.4 Connecting the middleware to the database

There are two ways to connect a Tuxedo middleware application to the database. One is to connect through XA; the other is to connect to the database directly in the application.

Connecting through XA means using the relational database's XA interface: the connection information is added to the OPENINFO parameter in the Tuxedo configuration file, and tpopen (which in turn calls xa_open) is called in the server's initialization code, so the database connection is made when the server starts.

Connecting directly means that the application itself connects to the database. The connection does not have to be made at server startup.

Both methods are used in the BOSS system. The main difference is that if a business operation is a distributed transaction, the XA method must be used. The detailed comparison is as follows:

1. XA connections support distributed transactions, which guarantees transactional consistency across different databases. For that very reason, they add overhead to both the middleware system and the database system, and this overhead can sometimes affect system performance.

2. Direct database connections from the application cannot implement distributed transactions, but they also put less load on the system.

In short, the choice of database connection method has a crucial impact on the performance of the entire system. Our principle is not to use XA connections wherever distributed transactions are not involved. Relying too heavily on the XA interface places a great deal of extra load on the system and causes many additional faults.

II. Middleware technology experience in system maintenance

In the daily maintenance of the middleware, we learned some basic Tuxedo knowledge, worked out a set of effective troubleshooting methods, and accumulated experience with common faults. These are introduced below.

2.1 License

The license is controlled by the license file lic.txt, located in the udataobj directory. There are two types, SDK and NS2, which cannot be used together. The license controls the number of concurrent users, with 10% redundancy. The number of concurrent users here means the number of connections that front-end programs have opened with tpinit and not yet closed with tpterm at a given moment. This leads to the distinction between long (persistent) and short connections in Tuxedo. With a long connection, a program calls tpinit once, makes a series of service calls, and calls tpterm only when the system is restarted or very idle. With a short connection, the program calls tpinit before each service call and tpterm immediately after the call ends. Short connections spend extra time connecting on every call, which affects response time; long connections consume concurrent-user slots. Different applications therefore call for different designs. For interface-type applications, which call services frequently, connection time makes up a large share of the total, so long connections should be used. For front-end applications, connection time is negligible compared with business operation time, so short connections should be used to save license resources.
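The trade-off between long and short connections can be estimated with a rough model. The sketch below is an illustration only: it assumes steady traffic and uses Little's law (average connections in use equals call rate times holding time) for the short-connection case, which the source does not state. All workload figures are hypothetical.

```python
def connection_profile(calls_per_s, service_s, connect_s, clients, mode):
    """Estimate per-call time and average concurrent connections.

    mode "long":  each client holds one connection for its whole lifetime.
    mode "short": each call does tpinit, the service call, then tpterm.
    """
    if mode == "long":
        return {"per_call_s": service_s, "concurrent": float(clients)}
    if mode == "short":
        per_call = connect_s + service_s
        # Little's law: average connections open = arrival rate x holding time.
        return {"per_call_s": per_call, "concurrent": calls_per_s * per_call}
    raise ValueError("mode must be 'long' or 'short'")


# Interface-type application: one client, 50 fast calls per second.
iface_long = connection_profile(50.0, 0.02, 0.2, clients=1, mode="long")
iface_short = connection_profile(50.0, 0.02, 0.2, clients=1, mode="short")

# Front-end application: 200 terminals, each making one call per minute.
front_long = connection_profile(200 / 60.0, 0.5, 0.2, clients=200, mode="long")
front_short = connection_profile(200 / 60.0, 0.5, 0.2, clients=200, mode="short")

print(iface_long, iface_short)
print(front_long, front_short)
```

Under these assumed numbers, short connections make each interface call about eleven times slower, while for the front-end application they cut the average number of occupied license slots from 200 to fewer than 3.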

2.2 Client connection problems

When the client cannot connect to the middleware, we need to check:

1. Whether WSNADDR is set on the client, and set correctly.

2. Whether MAXWSCLIENTS and MAXACCESSERS in the system configuration file are set correctly.

3. Whether WSL is started, and whether there are enough WSH processes.

4. Whether a firewall exists. If it does, the corresponding address mapping and port must be specified in the WSL configuration.

5. Whether the operating system has enough socket resources.

Once these problems are resolved, the client should be able to connect.
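Several of the checks above (WSNADDR value, WSL running, firewall) can be narrowed down by testing whether a plain TCP connection to the configured address succeeds. The sketch below does only that. It assumes WSNADDR holds a single address in the //host:port form and does not handle address lists or other formats; the helper names are hypothetical.

```python
import socket


def parse_wsnaddr(wsnaddr):
    """Split a single '//host:port' address into (host, port)."""
    if not wsnaddr.startswith("//"):
        raise ValueError("expected an address of the form //host:port")
    host, sep, port = wsnaddr[2:].rpartition(":")
    if not sep or not host:
        raise ValueError("expected an address of the form //host:port")
    return host, int(port)


def can_reach(wsnaddr, timeout_s=3.0):
    """Return True if a TCP connection to the address can be opened."""
    host, port = parse_wsnaddr(wsnaddr)
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

For example, `can_reach("//mwhost:7110")` (a hypothetical host and port) returning False points to the network, the firewall, or a WSL that is not listening. A result of True only shows that the port is open; it does not prove that a workstation handler is available.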

In Tuxedo, the TMTRACE environment variable is very useful for locating and troubleshooting faults. We usually set it to on at the system monitoring workstation. When the front end reports a system fault, we go straight to the faulty module to reproduce the fault, then open the ULOG file in C:\ or in the Tuxedo installation directory. The file contains the failure information for the tpcall. Using this information and the server-to-service mapping table, the problem service can be found quickly. In this way, maintenance personnel can locate the error and resolve the fault without knowing the program's internal flow.

On the Tuxedo server, several types of files deserve attention: the ULOG file, the XA interface trace (TRC) file, sqlnet.log, and the standard output file stdout (whose name can also be defined by the application). These files contain a wide range of error and warning information about system operation. ULOG records middleware startup and shutdown and some warnings about system configuration and operation. The XA trace file records errors from the XA interface, including some errors concerning distributed transactions. sqlnet.log records failed connections between the application and the database; if the file carries the current date and time, connections between the application and the database have been failing. stdout records the application's runtime output. Tracking and inspecting these files is of great value for detecting faults promptly and proactively.

2.3 Transaction control problems

1. Transaction boundary problems: the principle here is that whoever starts a transaction ends it, where "who" mainly means the front end or the back end of the middleware. In Tuxedo, a transaction can be started either in the foreground program or in the background program, and each choice has advantages and disadvantages. Starting transactions at the front end increases network traffic but keeps front-end and back-end operations consistent in exceptional circumstances. Starting transactions at the back end reduces network traffic but makes it hard to guarantee that consistency when exceptions occur. We start transactions in the foreground program. Whichever side is chosen, the principle that the initiator ends the transaction must be followed.

2. Notes on using the XA interface: first, configure the XA resource manager (RM) file. Oracle databases generally use libclntsh.a; we can first build the sample program, extract the connection libraries it uses, and add them to the RM file. Next, for releases before Oracle8i, XA must be authorized: as an Oracle DBA user, execute grant select on dba_pending_transactions to public.

3. Transaction suspension problem: when XA is used, we often encounter a situation in which the server process is still running but does no work, and the related log files keep reporting the error "the current process is already in a local transaction". At that point the fault can be cleared only by restarting the server. This is in fact a transaction control problem. The root cause is that, before starting a global transaction, the server executed a local transaction that was neither committed nor rolled back. There are three possible causes in the program code:

(1) tpbegin is not called before calling a server-side service that contains DML statements.

(2) The return value of tpbegin is not checked.

(3) The return value of a tpcall made after tpbegin is not checked. After the global transaction times out, the program does not exit promptly from the subsequent calls, so the next tpcall generates a local transaction.
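The three causes above share one remedy: check every return value and stop at the first failure. The sketch below shows that control flow in Python with stand-in callables; it is not the Tuxedo API, which is a C interface. The stand-ins only mimic the C convention of returning -1 on failure, and every name in the block is hypothetical.

```python
def run_global_transaction(begin, steps, commit, abort):
    """Run steps inside one transaction, checking every return value.

    begin, commit, abort and each step are callables that return -1 on
    failure, standing in for tpbegin, tpcommit, tpabort and tpcall.
    Returns True only if the transaction was committed.
    """
    if begin() == -1:
        return False          # never issue service calls without a transaction
    for step in steps:
        if step() == -1:
            abort()           # whoever began the transaction also ends it
            return False      # and no later call runs outside a transaction
    return commit() != -1


# Stand-ins that record what was called, to demonstrate the flow.
log = []


def returns(name, code):
    def call():
        log.append(name)
        return code
    return call


committed = run_global_transaction(
    begin=returns("begin", 0),
    steps=[returns("call1", 0), returns("call2", -1), returns("call3", 0)],
    commit=returns("commit", 0),
    abort=returns("abort", 0),
)
print(committed, log)
```

In the demonstration the second call fails, so the transaction is aborted and the third call is never issued. That is the behaviour that prevents a later call from running as a local transaction.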

There is also a special problem: TMS service suspension. The pq command in tmadmin shows many requests waiting in the TMS queue. In general, the only option is to wait for execution to complete. In severe cases, the corresponding database parameter needs to be adjusted; in Oracle, set max_commit_propagation_delay >= 90000.

In actual operation and maintenance we found that TMS suspension sometimes still occurs even after the database parameter is adjusted, so we investigated the issue specifically. TMS hangs because it is waiting for DX exclusive locks in the database to be released. The locked objects are generally Oracle system objects, and the exclusive lock is generally caused by DML statements executed inside global transactions; when such DML statements run slowly, TMS ends up waiting on the lock. We also found that ordinary query statements do not take locks (apart from the lm_lock lock in OPS), but a query placed inside a global transaction takes a shared DX lock, and a query issued inside a global transaction through the Oracle toolkit dbms_sql takes an exclusive DX lock. Based on these results, we recommend the following during program development: avoid XA global transactions where they are not needed, keep queries outside transactions whenever possible, and avoid developing with the Oracle toolkit. Our developers have achieved good results by following these principles.

4. Transaction timeout settings

Transaction timeout control is very important in a middleware application system; running without timeouts is disastrous. We have experienced serious system faults caused by missing or improper timeout settings. The most direct symptom was a system error reporting that the number of global transactions was insufficient: new transactions could not be started, and front-end services stopped. Three main timeouts are involved. The first is the tpbegin timeout T1, passed as the parameter of tpbegin in the code, which controls the completion time of the entire transaction. The second is the XA connection timeout T2 (SesTm in OPENINFO in the configuration file), which controls how long to wait for distributed transaction locks within the same connection. The third is T3 (_distributed_lock_timeout), the time the database waits for a non-distributed-transaction lock on a database object to be released. These three timeouts work together and have the following relationship T1

2.4 The service cannot be started normally

In actual maintenance work, we sometimes find that the service cannot be started: all services have been shut down, yet tmboot fails. This is generally because the system's IPC resources were not released. IPC resources are the operating system resources used for inter-process communication, including semaphores, shared memory, and message queues. On a UNIX operating system, the ipcs command shows IPC resource usage. When the service cannot be started, run ipcs and you will find that IPC resources owned by the user of the Tuxedo runtime environment have not been released. In that case, run the ipcrm command to clear those IPC resources and then start the service.
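The cleanup step can be prepared with a small helper that reads ipcs output and lists the matching ipcrm commands for review. This sketch assumes the column layout printed by ipcs on Linux (key, id, owner, ...); other UNIX systems format the output differently and would need a different parser. The sample output and the user name tuxedo are hypothetical, and the helper only builds command strings without executing anything.

```python
def ipcrm_commands(ipcs_output, owner):
    """Build ipcrm commands for every IPC resource owned by `owner`.

    Assumes Linux-style ipcs output: a banner line per resource type,
    followed by rows of the form 'key id owner ...'.
    """
    flag = None
    commands = []
    for line in ipcs_output.splitlines():
        lower = line.lower()
        if "message queues" in lower:
            flag = "-q"
        elif "shared memory" in lower:
            flag = "-m"
        elif "semaphore" in lower:
            flag = "-s"
        else:
            fields = line.split()
            # Data rows start with a hexadecimal key such as 0x00003039.
            if flag and len(fields) >= 3 and fields[0].startswith("0x") and fields[2] == owner:
                commands.append("ipcrm %s %s" % (flag, fields[1]))
    return commands


SAMPLE = """\
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x00003039 32768      tuxedo     660        0            0
0x0000303a 32769      oracle     660        0            0

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0000beef 65537      tuxedo     660        1048576    0

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x0000beef 98305      tuxedo     660        25
"""

commands = ipcrm_commands(SAMPLE, "tuxedo")
for command in commands:
    print(command)
```

The generated commands should be run only after confirming that no Tuxedo process of that user is still alive, because removing IPC resources from under a running domain will break it.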
