Question:
There are 250 million unsigned integers, stored in a file. You need to count the non-repeated numbers (the numbers that appear only once). The available memory is limited to 600 MB, so an efficient algorithm is required.
Ideas:
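This many numbers cannot all be read into memory (250 million 4-byte integers take about 1 GB), so some processing is required. Imagine a flag array holding true or false values that indicate whether a number has already been read. The simplest scheme uses the number itself as the subscript of the array: on reading 234432, check whether flag[234432] is true or false. This is very convenient (it is much like hashing, with the number serving as its own key). Note that a single true/false flag per number only records whether the number has appeared, so it directly yields the count of distinct numbers; telling "appeared exactly once" apart from "appeared more than once" needs a second flag per number.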
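A toy version of this lookup, as a sketch only: the value range is shrunk to 0..9 so that a plain Python list of booleans can stand in for the flag array. The names first_time_flags and MAX_VALUE are illustrative and do not come from the original.

```python
MAX_VALUE = 9  # toy range; the real problem needs subscripts up to 2**32 - 1


def first_time_flags(numbers, max_value=MAX_VALUE):
    """For each number read, report whether it is being seen for the first time."""
    seen = [False] * (max_value + 1)  # seen[n] is the flag for the number n
    result = []
    for n in numbers:
        result.append(not seen[n])  # the number itself is the subscript
        seen[n] = True
    return result


print(first_time_flags([2, 3, 4, 4, 3, 2]))
# [True, True, True, False, False, False]
```

A list of booleans is far too large for the full 32-bit range, which is why the size of each flag matters.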
The main problem now is how to size the flag array. An unsigned int (4 bytes) ranges from 0 to 2^32-1, so the array must be large enough to be accessed with the subscript 2^32-1. Each flag is only true or false, so one bit per number is enough. How big is the flag array?
2^32 bits = 2^29 bytes = 2^19 KB = 2^9 MB = 512 MB, which is below the 600 MB limit.
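The size calculation can be checked with a few lines of arithmetic. This is only a sanity check; flag_array_mb is a hypothetical helper name, and "MB" here means 2^20 bytes, as in the calculation above.

```python
VALUES = 2**32   # every possible 32-bit unsigned integer
LIMIT_MB = 600   # memory limit from the question


def flag_array_mb(bits_per_flag):
    """Size in MB (2**20 bytes) of a flag array covering all 32-bit values."""
    return VALUES * bits_per_flag / 8 / 2**20


print(flag_array_mb(1))  # 512.0: one bit per flag fits under 600 MB
print(flag_array_mb(8))  # 4096.0: one byte per flag would not fit
```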
Summary:
Use bits to indicate whether a number has appeared.
Use the number directly as the subscript into the bit array.
Calculate the number of non-repeating elements in hundreds of millions
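The points above could be put together roughly as follows. This is a minimal sketch under several assumptions that are not in the original: the file is taken to hold raw 4-byte little-endian values, and the names read_uint32, count_distinct and count_appearing_once are made up for illustration. count_distinct is the one-bit-per-number scheme described here. count_appearing_once is a swapped-in variant for the strict "appears only once" reading: it keeps two bits per number and covers the value range in two passes over the input, so each pass still needs only 2 x 2^31 bits = 512 MB.

```python
import io
import struct


def read_uint32(stream, chunk_values=1 << 16):
    """Yield 32-bit unsigned integers from a binary stream.

    Assumption: values are stored back to back as 4-byte little-endian words.
    """
    while True:
        chunk = stream.read(4 * chunk_values)
        if not chunk:
            return
        usable = len(chunk) - len(chunk) % 4  # drop a trailing partial word
        for (n,) in struct.iter_unpack("<I", chunk[:usable]):
            yield n


def count_distinct(numbers, value_range=2**32):
    """Count different values using one flag bit per possible value."""
    bits = bytearray((value_range + 7) // 8)  # 512 MB when value_range is 2**32
    distinct = 0
    for n in numbers:
        byte, mask = n >> 3, 1 << (n & 7)  # number -> position in the bit array
        if not bits[byte] & mask:
            bits[byte] |= mask
            distinct += 1
    return distinct


def count_appearing_once(open_numbers, value_range=2**32, passes=2):
    """Count values that occur exactly once (variant, not from the original).

    Keeps two flag bits per value ("seen" and "repeated"). The value range is
    split into `passes` slices and the input is re-read once per slice.
    `open_numbers` is a callable returning a fresh iterator over the input.
    """
    span = -(-value_range // passes)  # ceiling division
    once = 0
    for p in range(passes):
        low = p * span
        seen = bytearray((span + 7) // 8)
        repeated = bytearray((span + 7) // 8)
        for n in open_numbers():
            if not low <= n < low + span:
                continue  # this value belongs to another pass
            i = n - low
            byte, mask = i >> 3, 1 << (i & 7)
            if not seen[byte] & mask:
                seen[byte] |= mask
                once += 1
            elif not repeated[byte] & mask:
                repeated[byte] |= mask
                once -= 1  # second sighting: the value is no longer unique
    return once


# Tiny demonstration with a 4-bit value range instead of 32 bits.
data = struct.pack("<6I", 3, 5, 3, 7, 5, 3)
print(count_distinct(read_uint32(io.BytesIO(data)), value_range=16))  # 3
print(count_appearing_once(lambda: read_uint32(io.BytesIO(data)), value_range=16))  # 1
```

A pure Python loop would be slow for 250 million values; the sketch is meant to show the bit-array logic, which carries over directly to a lower-level language.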