Research on memory overhead of giant data objects in PHP

First of all, please do not misunderstand: these are not my research results on this issue; rather, I would like to ask for your help in analyzing and studying it. :)

Simplified problem background: In a website implemented in PHP, every program file includes a common.php file at the beginning. For business reasons, common.php defines a "giant" data object: an array containing roughly 500k int values, such as $huge_array = array(1, 2, 3, ..., 500000). Only "read" access to $huge_array is allowed throughout the system. Assume the system must run continuously and stably under 100 concurrent requests.

Question 1: $huge_array occupies about 10 MB in the common.php source file (that is not the problem) and might occupy about 4 MB once loaded into memory (just a rough estimate; the exact size is not the point here). The problem is that every time PHP handles an HTTP request, it runs in an independent process (or thread). Does each of them have to load this 4 MB block into memory all over again? If the memory cannot be shared, nearly 400 MB of physical memory could be occupied at the same time, which is bad both for memory usage and for memory-access efficiency.

Question 2: When a cache mechanism such as APC or XCache is enabled, we know it caches opcode, but it seems only to eliminate the repeated work of compiling the script. Can it also act as shared memory for variables loaded at runtime? I hope it can play some role here; after all, that large number of int values must also exist as part of the opcode.

Question 3: If XCache's opcode cache cannot be used for this purpose, can I operate XCache directly? That is, instead of writing $huge_array in common.php, store it in the cache engine.

This post intends to analyze whether this usage causes a memory bottleneck and, if there is a problem, how to optimize it. Of course, trying to shrink the giant data object itself is the first thing worth considering, but data structures and algorithms are not the topic here. You are welcome to share your own views on this issue; even better, design some feasible tests to verify them. If you do not have time to write code, as long as the plan looks reasonable, I am willing to write the test code ^_^
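Before debating, the per-process cost is easy to measure directly. A minimal measurement sketch, assuming the 500k element count from the question; range() stands in for the literal array in common.php, and the absolute numbers depend on the PHP version (PHP 7's packed arrays are far smaller than PHP 5's zval-per-element layout):

```php
<?php
// Sketch: estimate how much memory one process spends holding the
// giant array, using memory_get_usage(). The element count comes
// from the question; range() is a stand-in for the literal
// $huge_array = array(1, 2, 3, ..., 500000) in common.php.

$before = memory_get_usage();

$huge_array = range(1, 500000);

$after = memory_get_usage();

printf("array of %d ints costs about %.1f MB in this process\n",
       count($huge_array), ($after - $before) / (1024 * 1024));
```

Running this under 100 independent worker processes would multiply whatever number it prints by 100, which is exactly the concern in Question 1.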


------ Solution --------------------
Since it is more than 500k entries of data, shouldn't you consider indexing it and processing it in segments? Why must all the data be loaded at once???
------ Solution --------------------
1. Yes, each process needs to load it into memory again.
2. Since it is included by all program files, it should be cached, because it is part of the opcode.
3. This is a thankless exercise; it is better to optimize the data structure and algorithm.
------ Solution --------------------
1. As the reply above said.
2. Even if it is part of the opcode, it still needs optimizing.
3. If compiling the file takes longer than fetching the data by index, why compile it at all? You compile the data only to fetch the value corresponding to a key; once you have another way to fetch data by key, why must you use PHP's array structure? This is a bit like asking whether to read the data from a txt file or from an included php file.

For example, the content of 1.txt is as follows:
1111
2222
3333
4444
......

The content of 1.php is as follows:
$a = array(
    1111,
    2222,
    3333,
    4444,
    ......
);

Every process that includes 1.php must initialize $a before it can read $a[3].
With 1.txt, you only need to optimize how to quickly fetch the 4th (3 + 1) line; all processes can share the file content.

But there is a problem here: concurrency. How do you handle concurrent access reasonably?

There is no general way around it: either trade space for time (each process holds its own 4 MB copy and never waits), or trade time for space (a single shared 4 MB copy, but with some waiting to fetch the data).


------ Solution --------------------
I think shared memory is necessary for handling big data; otherwise every HTTP process will add memory pressure because of it. XCache and APC are both opcode caches: even though PHP can skip lexical analysis and go straight to executing the opcode, it still has to allocate memory for the large array, once per HTTP request. If instead you store this large array in memcache, multiple HTTP requests share one piece of memory. Since there are no write operations, 100 concurrent reads pose no problem for memcache; even if read/write concurrency were involved, memcached supports optimistic locking (CAS).
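A minimal sketch of the memcache approach described above, assuming a memcached server at 127.0.0.1:11211 and the pecl Memcached extension; the key name huge_array and the helper load_huge_array() are made up for illustration, and the sketch falls back to a per-process array when the extension or server is unavailable:

```php
<?php
// Sketch: store the array once in memcached, then let every request
// read the shared copy instead of rebuilding the array per process.
// Assumes a memcached server on 127.0.0.1:11211 and the pecl
// Memcached extension; falls back to a local array otherwise.

function load_huge_array() {
    // Stand-in for the 500k-int array defined in common.php,
    // shrunk here so the sketch runs quickly.
    return range(1, 1000);
}

function get_shared_array() {
    if (class_exists('Memcached')) {
        $m = new Memcached();
        $m->addServer('127.0.0.1', 11211);
        $cached = $m->get('huge_array');
        if ($cached !== false) {
            return $cached;      // shared copy already populated
        }
        $data = load_huge_array();
        if ($m->set('huge_array', $data)) {
            return $data;        // first request pays the build cost
        }
    }
    // No extension / no reachable server: per-process copy.
    return load_huge_array();
}

$huge_array = get_shared_array();
echo $huge_array[3], "\n";
```

Note that $m->get() hands back an unserialized copy, so a request that fetches the whole array still materializes its own PHP copy of it; the bigger win is storing the values under individual keys and fetching only the elements a request actually needs.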
------ Solution --------------------
First, put this stuff in memcache.

It's enough.
------ Solution --------------------
Reference:
The array initial value containing a large number of int values must, as constant data, be part of the opcode. What I am not sure about yet: when the program assigns it to a variable, will a separate piece of memory be allocated to hold it, or will the variable simply point at the address where the opcode is stored? Then the question becomes: is the memory block holding the opcode shared among processes?


Do not think of it that way. The opcode cache just skips the parse/compile step. For example:
$a = 1 + 1;
compiles to roughly
ZEND_ADD ~0 1 1
where ~0 stands in for $a, and ~0 must still be allocated its own space. Otherwise, how could PHP access the data? If one process modified the value in place, would that not affect another process? The logical correctness of each process could no longer be guaranteed, because every process runs the same code and could not know whether its own $a was still in its initial state.

Therefore, the opcode cache does not cache data: it only caches the code (script) in another format! The actual data must still be allocated at runtime!


In addition, the example (1.txt could of course be replaced by any backing store, such as memcache or SQL) is just an interface for you to implement. For example, you could read from one allocated memory region and reuse it to fetch the next piece of content; that way, the memory footprint of each process can be reduced.

Pseudocode example:


Change

foreach ($datas as $data) {
    // do something
}

into

while ($data = get()) {
    // do something
}

This reduces memory consumption at the cost of extra time to obtain each piece of data.
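One way to implement such a get() in PHP 5.5+ is a generator, which yields one value at a time instead of materializing the whole data set; read_values() below is a made-up helper standing in for "fetch the next row from 1.txt / memcache / SQL":

```php
<?php
// Sketch: iterate over the data one element at a time via a
// generator, so only a single value is held in memory at once.

function read_values($path) {
    $file = new SplFileObject($path);
    foreach ($file as $line) {
        $line = trim($line);
        if ($line !== '') {
            yield (int)$line;   // produce one value on demand
        }
    }
}

// Build a small 1.txt-style fixture for the sketch.
$path = tempnam(sys_get_temp_dir(), 'huge');
file_put_contents($path, "1111\n2222\n3333\n4444\n");

$sum = 0;
foreach (read_values($path) as $value) {
    $sum += $value;             // one value in memory at a time
}
echo $sum, "\n";                // prints 11110

unlink($path);
```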

------ Solution --------------------
I really want to join this topic. Unfortunately, my personal skills are limited and I don't know much about cache or anything.


Let me talk about my opinion from another perspective --

Web programming must consider concurrency, including the number of server connections, rather than just solving the problem in isolation.
This is something many desktop developers do not understand, which leads to sayings like "PHP is bad".

PHP's advantage is solving simple problems quickly: finish, send the result to the client, end the connection, and be free for the next request.
Complex computing should really be done in a more low-level language on a higher-quality server.

For complex computing, a core language and high-performance servers are worthwhile, because the main purpose is to compute the result.
For example, NASA's servers and the programs running on them must be precise; an error in some distant decimal place could send a rocket into the Earth.

On an information-dissemination network, however, waste is costly, because in every second you consume, your information could have been delivered to one or even dozens more people.
Memory consumption and time consumption are alike in this respect: if memory usage is too high, TCP connections may become unstable.


Therefore, a large number of parameters may make a program flexible and adaptable to more users, but for access over the public network it is a failure.
We should cut these parameters down and segment the user groups; it is enough for each user to choose among a few parameters.
For example, a global website does not bundle all languages together and let the program adapt via parameters; instead, the user selects a language, and code is written for that language only.

For complex computing, use other languages or components to complete the established processing steps and shorten the web program's response path.
Reply #6 above describes one such situation; of course it may not fit your exact needs, but it fits the business logic of a website.


After saying so much, I have probably still not solved your problem. I feel honored if you were able to read this far ......
------ Solution --------------------
4 MB? Work it out; that is not too large.

Simplify it down to a single value, then test the peak memory of your currently running PHP ....


------ Solution --------------------
See whether this code helps you; there is no need to load the whole file into memory.

<?php
/*
111
222
333
444
555

*/

$file = new SplFileObject(__FILE__);
$file->seek(3); // line number, counted from 0
echo $file->current(), "\n";
?>


Tested on an xml file of about 50 MB: it takes less than 0.01 seconds. The larger the line number, the more time is required.
