How to count the frequency of each word in a 2 GB file

Source: Internet
Author: User
I'm a beginner and have run into a problem: I need to count how often each word occurs in a 2 GB file. Even after changing the memory limit, I always get the error "Allowed memory size of xxxx bytes exhausted". Simply counting lines or characters does produce a result. How can I optimize this?
ini_set("memory_limit", "1");

function calcWordFrequence($sFilePatch) {
    $aWordsInFile = array();
    $fp = fopen($sFilePatch, "r");
    while (!feof($fp)) {
        // Reads up to the next "\n"; the whole "line" is held in memory
        $sOneLineWords = (string) fgets($fp);
        $aOneLineWords = str_word_count($sOneLineWords, 1);
        foreach ($aOneLineWords as $v) {
            // Every word in the file is appended here, so this array grows with the file
            array_push($aWordsInFile, $v);
        }
    }
    fclose($fp);
    // Counting happens only after all words have been collected
    $aRes = array_count_values($aWordsInFile);
    arsort($aRes);
    return $aRes;
}

print_r(calcWordFrequence("2013.mp4"));


Reply to discussion (solution)

This can't be solved this way: with a 2 GB file, simply opening it will use up nearly all of the machine's memory. Consider a distributed design for the storage.

Is there a way in code to split the file into several parts and count each part separately, or to output only the most frequent word?

Use the split command to cut the file into smaller files, count each one, and add up the results.
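The merge step of that approach could look roughly like the sketch below. It assumes the file has already been cut into pieces small enough to load one at a time (for text, cutting on line boundaries with split -l avoids breaking a word in two). The function names mergeCounts() and countWordsInPieces() and the piece file names are made up for illustration; they do not come from the thread.

```php
<?php
// Add the counts from one piece into the running total.
// mergeCounts() is a hypothetical helper name.
function mergeCounts(array $total, array $partial): array
{
    foreach ($partial as $word => $n) {
        $total[$word] = ($total[$word] ?? 0) + $n;
    }
    return $total;
}

// Count words piece by piece; only one piece is in memory at a time.
// Assumption: each piece is small enough for file_get_contents().
function countWordsInPieces(array $piecePaths): array
{
    $total = [];
    foreach ($piecePaths as $path) {
        $text = file_get_contents($path);
        if ($text === false) {
            throw new RuntimeException("Cannot read $path");
        }
        $partial = array_count_values(str_word_count($text, 1));
        $total = mergeCounts($total, $partial);
    }
    arsort($total); // most frequent words first
    return $total;
}
```

If only the most frequent words are needed, something like array_slice(countWordsInPieces($pieces), 0, 10, true) would keep the top ten entries of the sorted result.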

Only text files have the concept of a line.
The 2013.mp4 you are testing with is obviously not a text file.
If the file contains no \n, or the first \n appears only far into the file, your $sOneLineWords = fgets($fp); will read a huge amount of data in one call and exhaust memory.
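One way around the missing-newline problem is to read fixed-size chunks with fread() instead of lines, and to count words as they are read rather than collecting them all first. The following is a minimal sketch, not a definitive implementation: countWordsChunked() and its $chunkSize parameter are hypothetical names, and it assumes words are made of ASCII letters, apostrophes and hyphens, which matches the default behaviour of str_word_count().

```php
<?php
// Count words by reading fixed-size chunks, so a single read never
// depends on where (or whether) "\n" occurs in the file.
// countWordsChunked() and $chunkSize are hypothetical names.
function countWordsChunked(string $path, int $chunkSize = 1048576): array
{
    $fp = fopen($path, "rb");
    if ($fp === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $counts = [];
    $carry = ""; // partial word left over at the end of the previous chunk
    while (!feof($fp)) {
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false || $chunk === "") {
            break;
        }
        $buffer = $carry . $chunk;
        // Walk back over trailing word characters: that word may
        // continue in the next chunk, so hold it back for now.
        $cut = strlen($buffer);
        while ($cut > 0) {
            $ch = $buffer[$cut - 1];
            if (!ctype_alpha($ch) && $ch !== "'" && $ch !== "-") {
                break;
            }
            $cut--;
        }
        $carry = (string) substr($buffer, $cut);
        $buffer = (string) substr($buffer, 0, $cut);
        foreach (str_word_count($buffer, 1) as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    fclose($fp);
    // Count whatever was still held back when the file ended.
    foreach (str_word_count($carry, 1) as $word) {
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    arsort($counts); // most frequent words first
    return $counts;
}
```

Memory use then depends on the chunk size and on the number of distinct words, not on the file size. Two caveats: a very long unbroken run of letters would still make the held-back text grow, and on binary data such as an .mp4 the resulting "words" are not meaningful.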

If it is a text file such as a log, you can use PHP's SplFileObject class, which is well suited to working through large files. I have used it to analyze Nginx access logs of more than 5 GB.
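For a real text file, a line-by-line count with SplFileObject might look like the sketch below. The function name countWordsByLine() is hypothetical, and the sketch assumes every line is short enough to fit in memory, as is usual for logs. The key difference from the code in the question is that each word is counted immediately instead of being pushed onto one large array.

```php
<?php
// Iterate over a text file one line at a time and count words as we go.
// countWordsByLine() is a hypothetical name for this sketch.
function countWordsByLine(string $path): array
{
    $file = new SplFileObject($path, "r");
    $file->setFlags(SplFileObject::DROP_NEW_LINE);
    $counts = [];
    foreach ($file as $line) {
        // Only the current line is held in memory here.
        foreach (str_word_count((string) $line, 1) as $word) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // most frequent words first
    return $counts;
}
```

The result array holds one entry per distinct word, so it stays far smaller than a list of every word occurrence in the file.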