How to Use Python to Analyze 1.4 Billion Rows of Data! A Senior Programmer Shows You How to Work at the Hundreds-of-Millions Scale!

Source: Internet
Author: User

Challenge

The 1-gram dataset expands to many gigabytes of data on disk, which is a sizable amount to read into Python. Python can easily process gigabytes of data in one go, but when the data has to be broken up and processed piece by piece, things get slower and less memory efficient.

In total, the 1.4 billion rows (1,430,727,243) are spread across 38 source files, covering 24 million (24,359,460) distinct words (plus part-of-speech tags, see below), counted over the years 1505 to 2008.

Loading the data

All of the code/examples below were run on a 2016 MacBook Pro with 8 GB of RAM. Performance will be better on hardware or cloud instances with a larger RAM configuration.

The 1-gram data is stored in the files as tab-separated values and looks like this:
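The screenshot of the raw lines from the original post isn't reproduced here. Each line follows the Google Books 1-gram layout (word, year, match count, volume count), so a few rows might look roughly like this (the counts below are purely illustrative):

```
Python	1990	288	161
Python	1991	313	137
Python	1992	401	152
```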

Each record contains the following fields:

1. The word itself (the 1-gram)
2. The year of publication
3. The total number of times the word was seen that year
4. The total number of books containing the word that year

To generate the chart we want, we only need three pieces of information about each row, namely:

1. Is this the word we are interested in?
2. The year of publication
3. The total number of times the word was used

By extracting only this information, most of the overhead of handling variable-length string data is avoided, but we still need to compare string values to work out which rows are about the word we care about. This is the kind of work pytubes can do:
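The original pytubes snippet isn't reproduced in this copy. As a rough sketch of the same extraction logic in plain Python/NumPy (pytubes streams this far more efficiently in native code; the file glob and target word below are placeholders, and POS-tagged variants of the word are ignored):

```python
import glob
import numpy as np

WORD = "Python"                                    # the word we care about
FILES = glob.glob("googlebooks-eng-all-1gram-*")   # placeholder path pattern

def extract_rows(files, word):
    # Yield (is_word, year, count) for every line of every 1-gram file.
    for name in files:
        with open(name, encoding="utf-8") as f:
            for line in f:
                ngram, year, count, _volumes = line.rstrip("\n").split("\t")
                yield (1 if ngram == word else 0, int(year), int(count))

# Illustrative only: materialising 1.4 billion tuples like this would exhaust
# memory; pytubes performs the same extraction as a stream instead.
one_grams = np.array(list(extract_rows(FILES, WORD)), dtype=np.int64)
```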

After about 170 seconds (just under 3 minutes), one_grams is a NumPy array containing almost 1.4 billion rows of data, which looks like this (table header added for illustration):

╒═════════╤══════╤═══════╕
│ is_word │ year │ count │
╞═════════╪══════╪═══════╡
│ 0       │ 1799 │ 2     │
├─────────┼──────┼───────┤
│ 0       │ 1804 │ 1     │
├─────────┼──────┼───────┤
│ 0       │ 1805 │ 1     │
├─────────┼──────┼───────┤
│ 0       │ 1811 │ 1     │
├─────────┼──────┼───────┤
│ 0       │ 1820 │ ...   │
╘═════════╧══════╧═══════╛

From here, it's just a matter of using NumPy methods to calculate things:

The total number of words used per year

Google displays each word's frequency as a percentage (the number of times the word occurs in a year divided by the total number of words in that year), which is more useful than the raw counts. To calculate these percentages, we need to know the total number of words per year.

Fortunately, NumPy makes this very simple:
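The NumPy snippet from the original isn't included in this copy; one straightforward way to compute the per-year totals (assuming the column layout shown in the table above) is a weighted bincount:

```python
YEAR_COL, COUNT_COL = 1, 2   # column layout as in the table above

years = one_grams[:, YEAR_COL]
counts = one_grams[:, COUNT_COL]

# word_counts_per_year[y] is the total number of words Google counted in year y
word_counts_per_year = np.bincount(years, weights=counts)
```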

Plotting this chart shows how many words Google collected each year:
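The chart itself was an image in the original; a minimal matplotlib sketch to reproduce it might look like this:

```python
import matplotlib.pyplot as plt

plt.plot(word_counts_per_year)
plt.xlabel("year")
plt.ylabel("total words counted")
plt.show()
```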

It is clear that before 1800 the amount of data falls off sharply, which distorts the final results and hides the patterns we are interested in. To avoid this, we only import data from 1800 onwards:
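In the original this filter was applied while loading; an equivalent mask on the already-loaded array, as a sketch, would be:

```python
# Keep only rows from 1800 onwards; the pre-1800 data is too sparse to be useful.
one_grams = one_grams[one_grams[:, YEAR_COL] >= 1800]
```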

This leaves 1.3 billion rows of data (only 3.7% of the rows are from before 1800).

Python's percentage of words per year

Working out what percentage of each year's words are "Python" is now particularly simple.
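Again, the original snippet isn't reproduced here; a sketch of the calculation (per-year counts for the target word divided by the per-year totals computed above) could look like this:

```python
IS_WORD_COL = 0

python_rows = one_grams[one_grams[:, IS_WORD_COL] == 1]
python_counts = np.bincount(
    python_rows[:, YEAR_COL],
    weights=python_rows[:, COUNT_COL],
    minlength=word_counts_per_year.shape[0],
)

# Percentage of all words counted in each year that are "Python"; guard
# against years with no data to avoid dividing by zero.
word_counts = np.divide(
    100.0 * python_counts,
    word_counts_per_year,
    out=np.zeros_like(python_counts),
    where=word_counts_per_year > 0,
)
```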

Plotting word_counts gives:

The shape looks similar to Google's version.

Performance

Google generates its chart in about 1 second, compared with roughly 8 minutes for this script, which is understandable: Google's word-count backend works from a fully precomputed view of the dataset.

For example, precomputing the total number of words used per year and keeping it in a separate lookup table would save significant time. Likewise, storing the word usage in a separate database/file and indexing the first column would eliminate almost all of the processing time.
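As a sketch of the first idea (the cache file name is just a placeholder), the per-year totals could be computed once, saved, and reloaded by later runs:

```python
import os

TOTALS_FILE = "year_totals.npy"   # hypothetical cache file

if os.path.exists(TOTALS_FILE):
    word_counts_per_year = np.load(TOTALS_FILE)
else:
    word_counts_per_year = np.bincount(years, weights=counts)
    np.save(TOTALS_FILE, word_counts_per_year)
```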

What this exploration does show is that, with NumPy, the fledgling pytubes library, Python and standard commodity hardware, it is possible to load, process and extract arbitrary statistics from a dataset of over a billion rows in a reasonable amount of time.

Results:

Comparing with Google's version (without any baseline adjustments):

More filtering logic: tube.skip_unless() is a relatively simple way to filter rows, but it lacks the ability to combine conditions (and/or/not). For some use cases this could shrink the volume of loaded data and make loading faster.

Better string matching: simple tests such as startswith, endswith, contains and is_one_of could easily be added, significantly improving the effectiveness of loading string data.

Thanks for reading! Pretty impressive, right? 1.4 billion rows is no small number!
