Research on the memory cost of huge data objects in PHP

Source: Internet
Author: User
Tags: apc, spl
First of all, please don't misunderstand: it's not that I want to publish research results on this issue; rather, I'd like to ask you all to help me analyze this problem together :)

The simplified background: in a PHP-implemented web site, every program file includes a common file, common.php, at the beginning. Now, due to business needs, a "mega" data object (an array containing about 500k int values) is defined in common.php, something like $huge_array = array(1, 2, 3, ..., 500000), and the entire system only ever reads $huge_array. Assume the system needs to keep running stably under 100 concurrent requests.

Question 1: $huge_array probably occupies about 10M in the common.php source file (that's not a problem) and perhaps 4M once loaded into memory (just an estimate; precisely measuring its size is not the point of this post). The problem is this: since PHP handles each HTTP request in a separate process (or thread), doesn't each request have to load this ~4M memory block all over again? If the memory cannot be shared, 100 concurrent requests could occupy nearly 400M of physical memory at once, which is a problem both for memory consumption and for memory access efficiency.
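(For a rough sense of scale, a measurement could look like the sketch below; memory_get_usage() is the real PHP call, and range(1, 500000) stands in for the literal array in common.php.)

<?php
// Rough estimate of the array's in-memory footprint.
// range(1, 500000) stands in for the literal array in common.php.
$before = memory_get_usage();
$huge_array = range(1, 500000);
$after = memory_get_usage();
printf("Array of %d ints uses about %.1f MB\n",
       count($huge_array), ($after - $before) / 1048576);
?>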

Question 2: when a caching mechanism such as APC or XCache is enabled, we know it can cache the opcode, which also eliminates the repeated script-compilation step. But can it additionally act as shared memory for variables loaded at runtime? I hope it can; after all, that big pile of int values certainly exists as part of the opcode.

Question 3: if using XCache as an opcode cache doesn't achieve the goal, would operating XCache directly be effective? That is, instead of writing $huge_array literally in common.php, store it in the cache engine.
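(If it came to that, a minimal sketch of the direct-cache variant using APC's user cache might look like this; apc_store()/apc_fetch() are the real APC functions, and the key name 'huge_array' is made up for illustration.)

<?php
// Warm the user cache once; later requests only fetch.
// 'huge_array' is an arbitrary key chosen for this sketch.
$huge_array = apc_fetch('huge_array', $hit);
if (!$hit) {
    $huge_array = range(1, 500000); // stand-in for the literal data
    apc_store('huge_array', $huge_array);
}
echo $huge_array[12345];
?>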

The point of this post is to determine whether this usage leads to a memory bottleneck, and how to optimize it if so. Of course, reducing the size of the mega data object itself is the first thing to consider, but data structures and algorithms are not the topic here. You're welcome to post your own analysis of the issue; even better if you can design a runnable test to verify your ideas. If you don't have time to write code, as long as the scheme sounds reasonable, I'm happy to write the test code myself ^_^




Replies to the discussion (solutions)

Since there are more than 500k entries, shouldn't you consider indexing the data and processing it in segments? Why load it all at once???

1. Yes, it has to be reloaded into memory each time.
2. Since all the program files include it, it should get cached, because it is part of the opcode.
3. That would be a very tedious thing; better to optimize the data structure and algorithm instead.

1. As the poster above said.
2. It is part of the opcode, but of course you still have to optimize.
3. If the compile time is greater than the time it takes to fetch the data by index, why compile at all? You compile just to look up the corresponding data by key; when you already have a way to resolve data by key, why must the data use a PHP array structure? Isn't it a bit like choosing between getting data from a TXT file and getting it from an included PHP file?

For example, the contents of 1.txt are as follows
1111
2222
3333
4444
......

The contents of 1.php are as follows
$a = array(
    1111,
    2222,
    3333,
    4444,
    ......
);

If every process includes 1.php, each one initializes $a and then reads $a[3].
If it is 1.txt, then you only need to optimize how to quickly fetch the 4th (3+1) line, and all processes can share the file.

But there is a problem here: how do you reasonably handle the concurrency?

There is no universal way around this problem: either trade space for time (400M, but no waiting under concurrency) or trade time for space (4M, but with waiting to fetch the data).

I think shared memory is necessary for handling big data; otherwise every HTTP process will add memory pressure because of your big data. XCache and APC are opcode caches, right? That means even if PHP skips lexical parsing and goes straight to the opcode-execution phase, it still has to allocate memory for the large array; the point is that each HTTP request allocates that memory once. If you put this large array into memcache, multiple HTTP requests share one piece of memory. Since there is no write operation, 100 concurrent reads are no problem for memcache; and even when read/write concurrency is involved, memcached supports optimistic locking.
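(A minimal sketch of this memcache variant, using the pecl/memcached extension; the host, port, and key name are assumptions.)

<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211); // assumed local memcached instance

$huge_array = $mc->get('huge_array'); // arbitrary key for this sketch
if ($mc->getResultCode() === Memcached::RES_NOTFOUND) {
    $huge_array = range(1, 500000);   // stand-in for the real data
    $mc->set('huge_array', $huge_array);
}
echo $huge_array[12345];
?>

(Note that get() still unserializes the whole array into the requesting PHP process, which is exactly the objection raised further down the thread.)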

Just put this thing into memcache.

Fetch only what you need, where you need it; that's enough.

You can put this huge chunk of data on the client side.
Handle it as JSON.
On the server side, use file_put_contents("?.js", $value, LOCK_EX) to output the .js file; the file format is as follows:

var class = [
    {"_p": [1, 26, "?/shadow?display"],
     "_l": ["20|extra-large table (11+ people)", "21|12-person table", "22|16-person table",
            "23|18-person table", "24|20-person table", "143|wedding/banquet/?", "144|???",
            "145|show?", "146|show?", "147|?", "148|dance?", "149|night??", "150|disco pub",
            "151|MTV room", "152|karaoke room", "153|???"]},
    {"_p": [2, 17, "new point?1"], "_l": [""]},
    {"_p": [2, 18, "new point?2"], "_l": [""]},
    {"_p": [4, 15, "vehicle"], "_l": [""]}
];

When the client needs this data, use jQuery's
$.getScript("?.js", function () {});
to read the JSON data; that consumes only network bandwidth and none of the server's memory.

1. Yes, it has to be reloaded into memory each time.
I think so too!

2. Since all the program files include it, it should get cached, because it is part of the opcode.
The array's initial value, containing a large number of ints, is constant data and definitely part of the opcode. What I'm not sure about is whether, when the program assigns it to a variable, a new block of memory is allocated to load it. If no new block is allocated and the variable instead points directly at the address where the opcode is stored, then the question comes back: is the memory block storing the opcode shared between processes?

3. That would be a very tedious thing; better to optimize the data structure and algorithm instead.
I think so too! Operating the cache engine directly still, in the end, comes down to assigning a variable; it doesn't seem to offer any more chance of sharing memory than the earlier idea of "sharing memory indirectly through the cached opcode".

2. It is part of the opcode, but of course you still have to optimize.
I didn't quite follow: do you mean that once XCache is in use, memory usage is already optimized? Or do I still have to do my own optimization work?

3. If the compile time is greater than the time it takes to fetch the data by index, why compile at all? You compile just to look up the corresponding data by key; when you already have a way to resolve data by key, why must the data use a PHP array structure? Isn't it a bit like choosing between getting data from a TXT file and getting it from an included PHP file?
For example, the contents of 1.txt are as follows
1111
2222
33 ...
You're right, this really is one optimization idea. However, because there are so many key/value pairs, breaking them all up and making each key/value its own cache entry doesn't seem to match the typical use of a cache (it looks a lot like NoSQL, hehe). Anyway, that's not the purpose of this post; only if the original scheme truly comes to nothing will I consider optimization schemes along these lines.

I think shared memory is necessary for handling big data; otherwise every HTTP process will add memory pressure because of your big data. XCache and APC are opcode caches, right? That means even if PHP skips lexical parsing and goes straight to the opcode-execution phase, it still has to allocate memory for the large array; the point is that each HTTP request allocates that memory once.
This is exactly the problem!

If you put this large array into memcache, multiple HTTP requests share one piece of memory. Since there is no write operation, 100 concurrent reads are no problem for memcache; and even when read/write concurrency is involved, memcached supports optimistic locking.
It certainly doesn't require any "write" operation; "read-only" is enough. But here's the problem: once this huge data object sits in memcache, it only occupies memory inside the cache engine — how do you use it? Don't you still have to assign it to a variable? Memcache is a "network cache", so the data must be transferred into the PHP process, which again means newly allocated memory. From the perspective of "hoping to access this huge data in shared memory", memcache's odds are worse than in-process cache engines like APC and XCache :)

The array's initial value, containing a large number of ints, is constant data and definitely part of the opcode. What I'm not sure about is whether, when the program assigns it to a variable, a new block of memory is allocated to load it. If no new block is allocated and the variable instead points directly at the address where the opcode is stored, then the question comes back: is the memory block storing the opcode shared between processes?

Don't overthink it: opcode merely skips the lexing and compilation steps. For example:
$a = 1 + 1;
translated into opcode becomes
ZEND_ADD ~0, 1, 1
where ~0 stands in for $a. When it actually runs, ~0 still needs its own separate storage; otherwise how would PHP get at the data? If another process could modify this value, wouldn't it affect the other processes? Then no process could guarantee its own logical correctness, because every process runs the same code; how would I know whether my $a is still in its freshly initialized state?

So the opcode cache does not cache data; it just caches the code (script) in another format! The actual data still has to be allocated memory anew at runtime!

For more about opcode, you can click me.

Another example of what I said (of course 1.txt can be replaced by any tool: memcache, SQL, etc.): the point is that you can design an interface that reads into one allocated block of memory, and when you want the next piece of content you reuse that same block, so each process's memory footprint shrinks.

Pseudo-code example:


while ($data = get()) {
    // do something
}

in place of

foreach ($datas as $data) {
    // do something
}

Reduce memory consumption by increasing the time spent fetching each individual piece of data.
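(One way to flesh out that pseudo-code, assuming the data lives one value per line in a text file; get() here is a hypothetical helper, not a built-in.)

<?php
// Hypothetical get(): stream one value per call instead of
// holding all 500k values in the process at once.
function get($fh) {
    $line = fgets($fh);
    return ($line === false) ? false : (int) trim($line);
}

$fh = fopen('1.txt', 'r'); // assumed data file, one int per line
while (($data = get($fh)) !== false) {
    // do something with $data
}
fclose($fh);
?>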

I'd really like to join this topic; unfortunately my personal level is limited and I don't understand caching, so I'll say nothing.


Let me offer a personal opinion from another angle.

Web programming must always consider concurrency, including the server's connection count, rather than simply solving the problem in isolation.
This is what many people coming from desktop development fail to understand, which leads to remarks like "PHP sucks".

PHP's strength is solving simple problems quickly: finish, hand the result to the client, end the connection, and free up for the next request.
Complex computation should really be done in a more low-level language on higher-quality servers.

For complex computation, the resources consumed by a core language and a high-performance server are worthwhile, because the main purpose is to compute the result.
For example, NASA's servers and the applications running on them should strive for perfection; a decimal place wrong at the billionth digit might produce a "Mars crashes into the Earth" result.

But for an information-dissemination network the loss is huge, because in every second you consume, your information could have been passed on to one or even dozens more people.
Memory consumption is no different from time consumption: excessive memory usage also makes TCP connections unstable.


So a large number of parameters may make a program very flexible and able to adapt to more users, but for ordinary web access it is a losing choice.
Those parameters should be trimmed and the user base classified, so that each user reaches the goal with only a few choices.
For example, a globally oriented web site doesn't throw all languages together and adapt via program parameters; it lets the user choose a language and writes code only for that language.

If you have complex computation, you should use other languages or components to complete certain steps and lighten your web program's response.
What #6 describes is one such case; it may not fit your needs, but the formulation is consistent with the business logic of a content-spreading web site.


Having said all this, I've probably still failed to solve your problem; if you read this far, I'm honored ...

Just 4M; that's not too big.

Simplify it down to a single value, then measure your currently running PHP's memory peaks and compare.

See whether this code helps you; it does not load the whole file into memory:

<?php
$file = new SplFileObject('1.txt');
$file->seek(3); // line number, starting from 0
echo $file->current() . '<br>';
?>


Testing on an XML file of about 50M takes less than 0.01 seconds; the larger the line number, the more time it takes.

Sorry for taking so long to follow up on this post; in the meantime I used spare time at work to cram some opcode-related knowledge.

Thanks to Hnxxwyq for the reply on floor #10; the explanation of opcode was very enlightening and prompted me to find a little tool for analyzing opcode (see VLD; it's a good thing). Along the way I also went through building PHP on Windows again, hehe.
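(For anyone who wants to reproduce the dump below: once the VLD extension is built, it is typically invoked along these lines; the extension file name and path are system-dependent.)

php -d extension=vld.so -d vld.active=1 test.php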

The result of the exercise: for this program (reconstructed here from the compiled variables and ops below):

<?php
$huge_array = array(1, 2, 3);
$elem = $huge_array[2];
?>

the compiled opcode is:

number of ops:  10
compiled vars:  !0 = $huge_array, !1 = $elem
line     # *  op                      fetch   ext  return  operands
---------------------------------------------------------------------
   2     0  >   EXT_STMT
         1      INIT_ARRAY                           ~0      1
         2      ADD_ARRAY_ELEMENT                    ~0      2
         3      ADD_ARRAY_ELEMENT                    ~0      3
         4      ASSIGN                                       !0, ~0
   3     5      EXT_STMT
         6      FETCH_DIM_R                          $4      !0, '2'
         7      ASSIGN                                       !1, $4
   4     8    > RETURN                                       1
         9*   > ZEND_HANDLE_EXCEPTION

From this it's easy to see, as Hnxxwyq said, that the "mega" data object is built up by opcodes executed one by one; what gets cached is only the opcode, and at runtime the data object still occupies newly allocated memory. So the opcode-cache route seems hopeless, and I can only look for other optimization options.



Turn it into an in-memory database, so it stays in memory and you don't have to go through the whole loading process every time.

To snmr_com: thank you for your warm replies; I read them carefully :)

Stepping back to think things over is often necessary, and I basically agree with most of the points you made. Back to my specific question: since the goal of using the caching mechanism to remove the memory bottleneck has been dashed, I'm afraid I can only consider other optimization options. The method you gave on floor #13 is quite creative, hehe. Actually, if a data-file storage scheme is used, the data need not live in the PHP program file itself, and "locate-by-line" access is less efficient than binary storage. Personally I prefer shared-memory schemes like shmop; perhaps it's a stereotype (maybe biased), but I always feel memory access is faster than file access. This is really a very open question: a concrete optimization plan must target concrete requirements, and what I gave in the original post is only a "simplified background", nowhere near enough to design the actual optimization. In any case, thank you very much for your enthusiastic help ^_^
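(For the record, the read side of the shmop scheme I have in mind would look roughly like this; the key 0x1234 is made up, and the segment is assumed to have been populated once by a warm-up script that wrote the 500000 ints with pack('i*', ...).)

<?php
// Read one int from a shared-memory segment holding 500000
// packed 32-bit ints; all PHP processes attach to the same segment.
$shm = shmop_open(0x1234, 'a', 0, 0);     // attach read-only
$index = 12345;                            // element we want
$bytes = shmop_read($shm, $index * 4, 4);  // 4 bytes per int
$val = unpack('i', $bytes);
echo $val[1];
shmop_close($shm);
?>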

1. The program above was written in one piece just to make it easy for you to test; you missed where I said I tested on XML, i.e. an external file.

2. For loading data fast I also lean toward binary, but the problem with binary is that it can't be searched as plain text; you'd have to load the whole thing to search it. SplFileObject's full-text search, however, doesn't need to load the whole file, used together with other functions such as sscanf.

3. It wasn't my idea; while learning SPL I saw foreigners discussing how to parse a 1G SQL file, and I'm just passing their scheme along.

CSDN limits consecutive replies, so I'll answer several friends together here; please forgive me :)

Since there are more than 500k entries, shouldn't you consider indexing the data and processing it in segments? Why load it all at once???
This can be considered when designing the optimization scheme; thank you!

Just put this thing into memcache.
Fetch only what you need, where you need it; that's enough.
Memcache's programming interface is simple, and as a data-storage scheme that is its advantage. But in the problem context I described, it doesn't seem to resolve the principal contradiction.

You can put this huge chunk of data on the client side.
Handle it as JSON.
......
This method will have scenarios that suit it better; for my particular problem it isn't appropriate, because this "mega" data object serves server-side business logic, and sending 10M of data to the browser as part of the web page would not be right :)

Just 4M; that's not too big.
Simplify it down to a single value, then measure your currently running PHP's memory peaks and compare.
A single 4M isn't big. What I'm weighing now is that under 100 concurrent requests, without "sharing", it becomes 400M, and the construction and destruction of each of those memory blocks is non-negligible overhead.

Turn it into an in-memory database, so it stays in memory and you don't have to go through the whole loading process every time.
That's one line of thinking; it seems NoSQL solutions could also address similar problems.

If APC can't cache it,
using memcache should be the better plan.

1. The program above was written in one piece just to make it easy for you to test; you missed where I said I tested on XML, i.e. an external file.
2. For loading data fast I also lean toward binary, but the problem with binary is that it can't be searched as plain text; you'd have to load the whole thing to search it. SplFileObject's full-text search, however, doesn't need to load the whole file, used together with other functions such as sscanf.
3. It wasn't my idea; while learning SPL I saw foreigners discussing how to parse a 1G SQL file, and I'm just passing their scheme along.
There are a bunch of treasures in SPL that I had never noticed before.

If you don't want to load it all into memory at once, put it in a file; when a lookup is needed, do a binary search.

The SplFileObject approach should probably work the same way.
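(A sketch of that binary-search idea, assuming the values are sorted and stored as fixed-width 4-byte ints in a binary file; the file name data.bin and the layout are assumptions.)

<?php
// Binary search over a file of sorted, packed 32-bit ints,
// without ever loading the whole file into memory.
function find_in_file($path, $needle) {
    $fh = fopen($path, 'rb');
    $lo = 0;
    $hi = filesize($path) / 4 - 1;  // number of 4-byte records minus one
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        fseek($fh, $mid * 4);
        $rec = unpack('i', fread($fh, 4));
        if ($rec[1] === $needle) {
            fclose($fh);
            return $mid;            // found: return the record index
        } elseif ($rec[1] < $needle) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    fclose($fh);
    return -1;                      // not found
}

echo find_in_file('data.bin', 12345);
?>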

Thanks everyone for participating; I think my original questions basically have their answers.

To floor #6: sorry, I don't know what logic CSDN's point-allocation interface follows; I wasn't able to award points to your reply :(

Compile it into an extension.
