When processing a large amount of data, Xiao Dingdong encountered a memory leak.
Recently, we have been testing the effect of applying Chinese word segmentation to the WebLucene search engine.
We used an XML file of about 1.2 GB as the source data.
The sizes of the resulting indexes compare as follows:
Source File: 1.2 GB
Index generated with word segmentation: 2217 MB
Index generated with binary (bigram) word segmentation: 2618 MB
Difference: 401 MB
For a more detailed comparison, see "Comprehensive comparison between Chinese word segmentation and binary word segmentation".
The following is a file-by-file comparison of the two indexes. The main difference lies in the term data: the term frequency, term position, term index, and term information files.
Index file | Word segmentation (121 MB) | Binary word segmentation (146 MB) | Contents
deletable  | 4 B                        | 4 B                               |
_fg4.f10   | 19 KB                      | 19 KB                             |
_fg4.f11   | 19 KB                      | 19 KB                             |
_fg4.f12   | 19 KB                      | 19 KB                             |
_fg4.f13   | 19 KB                      | 19 KB                             |
_fg4.f19   | 19 KB                      | 19 KB                             |
_fg4.fdt   | 80 MB                      | 80 MB                             | Field values
_fg4.fdx   | 156 KB                     | 156 KB                            | Field index
_fg4.fnm   | 135 B                      | 135 B                             | Normalization factor
_fg4.frq   | 12 MB                      | 23 MB                             | Term frequency
_fg4.prx   | 26 MB                      | 36 MB                             | Term positions
_fg4.tii   | 15 KB                      | 74 KB                             | Term index
_fg4.tis   | 1.1 MB                     | 5.8 MB                            | Term information
segments   | 17 B                       | 17 B                              |
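A file-by-file comparison like this can also be produced with a small program instead of reading directory listings by hand. The following is only a minimal sketch: the class `IndexSizeCompare` and its methods are hypothetical helpers, not part of Lucene or WebLucene, and it assumes each index sits in its own directory. Sizes are grouped by file extension so that two indexes with different segment names can still be compared.

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Lucene or WebLucene.
class IndexSizeCompare {

    // "_fg4.frq" -> "frq"; names without a dot ("segments") are returned unchanged.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? fileName : fileName.substring(dot + 1);
    }

    // Sum the sizes (in bytes) of the regular files directly inside dir, per extension.
    static Map<String, Long> sizeByExtension(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IllegalArgumentException("Not a readable directory: " + dir);
        }
        Map<String, Long> sizes = new TreeMap<>();
        for (File f : files) {
            if (f.isFile()) {
                sizes.merge(extensionOf(f.getName()), f.length(), Long::sum);
            }
        }
        return sizes;
    }

    // For every extension present in either map, compute (size in b) - (size in a).
    static Map<String, Long> difference(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> diff = new TreeMap<>(b);
        for (Map.Entry<String, Long> e : a.entrySet()) {
            diff.merge(e.getKey(), -e.getValue(), Long::sum);
        }
        return diff;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java IndexSizeCompare.java <indexDirA> <indexDirB>");
            return;
        }
        Map<String, Long> a = sizeByExtension(new File(args[0]));
        Map<String, Long> b = sizeByExtension(new File(args[1]));
        // Positive numbers mean the second index is larger for that file type.
        for (Map.Entry<String, Long> e : difference(a, b).entrySet()) {
            System.out.printf("%-10s %,15d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

Run it with the two index directories as arguments; file types that grow under binary word segmentation, such as frq, prx, tii and tis here, show up as positive differences.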
Two problems were encountered during the test:
1. Memory leak
   The memory problem shows up in two ways:
   (1) Memory usage increases over time (a memory leak?), as can be seen with the top command on Linux;
   (2) After the program had been running for about half an hour, memory usage suddenly increased, and CPU usage increased with it.
2. High CPU usage
   CPU usage rises together with memory usage: when memory usage climbs to around 99.9%, CPU usage jumps.
Therefore, the first thing we need to solve is the growing memory usage.
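The top command reports the memory of the whole process, so it can help to also log the heap usage that the JVM itself reports while indexing runs. Below is a minimal sketch using the standard `java.lang.management` API (available since Java 5). The class `HeapWatch`, its method names, and the one-second interval are our own assumptions for illustration, and the loop in `main` is only a stand-in for the real indexing work.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or WebLucene.
class HeapWatch {

    // Heap currently in use, in bytes, as reported by the JVM.
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // One log line per sample, with the size rounded down to whole megabytes.
    static String formatSample(long elapsedSeconds, long usedBytes) {
        return "t=" + elapsedSeconds + "s used=" + (usedBytes / (1024 * 1024)) + "MB";
    }

    // Start a daemon thread that prints a heap sample every intervalMillis.
    static Thread start(final long intervalMillis) {
        final long startedAt = System.currentTimeMillis();
        Thread watcher = new Thread(() -> {
            try {
                while (true) {
                    long elapsed = (System.currentTimeMillis() - startedAt) / 1000;
                    System.out.println(formatSample(elapsed, usedHeapBytes()));
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException e) {
                // Interrupted: stop sampling.
            }
        }, "heap-watch");
        watcher.setDaemon(true); // do not keep the JVM alive just for logging
        watcher.start();
        return watcher;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread watcher = start(1000);
        // Stand-in for the indexing job: retain 1 MB more every second.
        List<byte[]> retained = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            retained.add(new byte[1024 * 1024]);
            Thread.sleep(1000);
        }
        watcher.interrupt();
    }
}
```

If the logged value keeps climbing over a long run and never falls back after garbage collection, that points to objects being retained somewhere, which is the case the leak detection tools are meant to track down.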
Lhelper also recommended a number of tools and articles:
http://www.samspublishing.com/articles/article.asp?p=23618&seqnum=7&rl=1
http://tech.ccidnet.com/pub/article/c1112_a265199_p1.html
Check Java Memory Leak
Tips
Memory leaks, be gone
If anyone has experience with this kind of problem, we would be glad to hear it.
Related links:
[Sandbox] Two Test Modules for Lucene Chinese Word Segmentation
» Grassland development Diary
BEA releases JRockit 5.0 Update and Memory Leak Detector Tool