How to Use Python to Analyze 1.4 Billion Rows of Data! A Senior Programmer Shows You How to Work at the Hundreds-of-Millions Scale!

Source: Internet
Author: User

Challenge

The 1-gram dataset expands to many gigabytes of data on disk, which is a sizable amount to read into Python. Python can easily process gigabytes of data in one go, but when the data has to be broken up and processed piece by piece, things get slower and less memory efficient.

In total, the 1.4 billion rows (1,430,727,243) are spread across 38 source files, covering 24 million (24,359,460) distinct words (plus part-of-speech tags, see below), counted over the years 1505 to 2008.

Loading the data

All of the code/examples below were run on a 2016 MacBook Pro with 8 GB of RAM. Performance will be better on hardware or cloud instances with a larger RAM configuration.

The 1-gram data is stored in the files as tab-separated values and looks like this:
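The screenshot of the raw lines from the original post isn't reproduced here. Each line follows the Google Books 1-gram layout (word, year, match count, volume count), so a few rows might look roughly like this (the counts below are purely illustrative):

```
Python	1990	288	161
Python	1991	313	137
Python	1992	401	152
```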

Each record contains the following fields:

1. The word itself (the 1-gram)
2. The year of publication
3. The total number of times the word was seen that year
4. The total number of books containing the word that year

To generate the chart we want, we only need three pieces of information about each row, namely:

1. Is this the word we are interested in?
2. The year of publication
3. The total number of times the word was used

By extracting only this information, most of the overhead of handling variable-length string data is avoided, but we still need to compare string values to work out which rows are about the word we care about. This is the kind of work pytubes can do:
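The original pytubes snippet isn't reproduced in this copy. As a rough sketch of the same extraction logic in plain Python/NumPy (pytubes streams this far more efficiently in native code; the file glob and target word below are placeholders, and POS-tagged variants of the word are ignored):

```python
import glob
import numpy as np

WORD = "Python"                                    # the word we care about
FILES = glob.glob("googlebooks-eng-all-1gram-*")   # placeholder path pattern

def extract_rows(files, word):
    # Yield (is_word, year, count) for every line of every 1-gram file.
    for name in files:
        with open(name, encoding="utf-8") as f:
            for line in f:
                ngram, year, count, _volumes = line.rstrip("\n").split("\t")
                yield (1 if ngram == word else 0, int(year), int(count))

# Illustrative only: materialising 1.4 billion tuples like this would exhaust
# memory; pytubes performs the same extraction as a stream instead.
one_grams = np.array(list(extract_rows(FILES, WORD)), dtype=np.int64)
```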

After about 170 seconds (just under 3 minutes), one_grams is a NumPy array containing almost 1.4 billion rows of data, which looks like this (table header added for illustration):

╒═════════╤══════╤═══════╕
│ is_word │ year │ count │
╞═════════╪══════╪═══════╡
│ 0       │ 1799 │ 2     │
├─────────┼──────┼───────┤
│ 0       │ 1804 │ 1     │
├─────────┼──────┼───────┤
│ 0       │ 1805 │ 1     │
├─────────┼──────┼───────┤
│ 0       │ 1811 │ 1     │
├─────────┼──────┼───────┤
│ 0       │ 1820 │ ...   │
╘═════════╧══════╧═══════╛

From here, it's just a matter of using NumPy methods to calculate things:

The total number of words used per year

Google displays each word's frequency as a percentage (the number of times the word occurs in a year divided by the total number of words in that year), which is more useful than the raw counts. To calculate these percentages, we need to know the total number of words per year.

Fortunately, NumPy makes this very simple:
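The NumPy snippet from the original isn't included in this copy; one straightforward way to compute the per-year totals (assuming the column layout shown in the table above) is a weighted bincount:

```python
YEAR_COL, COUNT_COL = 1, 2   # column layout as in the table above

years = one_grams[:, YEAR_COL]
counts = one_grams[:, COUNT_COL]

# word_counts_per_year[y] is the total number of words Google counted in year y
word_counts_per_year = np.bincount(years, weights=counts)
```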

Plotting this chart shows how many words Google collected each year:
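The chart itself was an image in the original; a minimal matplotlib sketch to reproduce it might look like this:

```python
import matplotlib.pyplot as plt

plt.plot(word_counts_per_year)
plt.xlabel("year")
plt.ylabel("total words counted")
plt.show()
```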

It is clear that before 1800 the amount of data falls off sharply, which distorts the final results and hides the patterns we are interested in. To avoid this, we only import data from 1800 onwards:
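In the original this filter was applied while loading; an equivalent mask on the already-loaded array, as a sketch, would be:

```python
# Keep only rows from 1800 onwards; the pre-1800 data is too sparse to be useful.
one_grams = one_grams[one_grams[:, YEAR_COL] >= 1800]
```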

This leaves 1.3 billion rows of data (only 3.7% of the rows are from before 1800).

Python's percentage of words per year

Working out what percentage of each year's words are "Python" is now particularly simple.
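Again, the original snippet isn't reproduced here; a sketch of the calculation (per-year counts for the target word divided by the per-year totals computed above) could look like this:

```python
IS_WORD_COL = 0

python_rows = one_grams[one_grams[:, IS_WORD_COL] == 1]
python_counts = np.bincount(
    python_rows[:, YEAR_COL],
    weights=python_rows[:, COUNT_COL],
    minlength=word_counts_per_year.shape[0],
)

# Percentage of all words counted in each year that are "Python"; guard
# against years with no data to avoid dividing by zero.
word_counts = np.divide(
    100.0 * python_counts,
    word_counts_per_year,
    out=np.zeros_like(python_counts),
    where=word_counts_per_year > 0,
)
```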

Plotting word_counts gives:

The shape looks similar to Google's version.

Performance

Google generates its chart in about 1 second, compared with roughly 8 minutes for this script, which is understandable: Google's word-count backend works from a fully precomputed view of the dataset.

For example, precomputing the total number of words used per year and keeping it in a separate lookup table would save significant time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
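As a sketch of the first idea (the cache file name is just a placeholder), the per-year totals could be computed once, saved, and reloaded by later runs:

```python
import os

TOTALS_FILE = "year_totals.npy"   # hypothetical cache file

if os.path.exists(TOTALS_FILE):
    word_counts_per_year = np.load(TOTALS_FILE)
else:
    word_counts_per_year = np.bincount(years, weights=counts)
    np.save(TOTALS_FILE, word_counts_per_year)
```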

What this exploration does show is that, with NumPy, the fledgling pytubes library, Python and standard commodity hardware, it is possible to load, process and extract arbitrary statistics from a dataset of over a billion rows in a reasonable amount of time.

Results:

Comparing with Google's version (without any baseline adjustments):

More filtering logic: tube.skip_unless() is a relatively simple way to filter rows, but it lacks the ability to combine conditions (and/or/not). For some use cases this could shrink the volume of loaded data and make loading faster.

Better string matching: simple tests such as startswith, endswith, contains and is_one_of could easily be added, significantly improving the effectiveness of loading string data.

Thanks for reading! Pretty impressive, right? 1.4 billion rows is no small number!
