Python tutorial: using multiple processes to handle large amounts of data in limited memory

Brief introduction

This is a brief tutorial on how to process a large amount of data in a limited amount of memory.

When working with clients, you sometimes find that their "database" is really just a warehouse of CSV or Excel files, and you have to work with what is there, often without being able to update their data warehouse. In most cases it would be better to load these files into a simple database framework, but time may not allow for it. The approach described here is dictated by time, machine hardware, and the working environment.

Here is a good example: suppose you have a pile of tables (not Neo4j, MongoDB, or any other kind of database, just tables stored as CSVs, TSVs, and so on), and if you joined them all together, the resulting data frame would be too large to fit into memory. The first idea is to split the work into parts and process them one at a time. That plan works, but it is slow, unless we use multiple cores (a rough sketch of the idea is shown below).
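To make the idea concrete, here is a minimal sketch of splitting the work across files and cores, assuming a hypothetical tables/ directory of CSV files and a trivial per-file step (process_file and the directory name are illustrative, not part of the actual task):

import glob
import pandas as pd
from multiprocessing import Pool, cpu_count

def process_file(path):
    # each worker reads only its own table, so no single process
    # ever holds more than one table in memory
    df = pd.read_csv(path)
    return len(df)  # stand-in for the real per-table work

if __name__ == '__main__':
    files = glob.glob('tables/*.csv')       # hypothetical pile of CSV tables
    with Pool(cpu_count()) as pool:
        counts = pool.map(process_file, files)
    print(sum(counts))

Each worker returns only a small result, so the combined output stays far smaller than the input.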
Goal

The goal here is to pick out the relevant job titles from a set of about 10,000 titles, join those titles with the government's occupation codes, join that result with the corresponding state (administrative unit) information, and then use features generated with Word2vec to enhance the existing features in our client's pipeline.

This task has to finish quickly; nobody wants to wait around for it. Think of it as a join across many tables, done without a standard relational database.
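The Word2vec enhancement step is not covered by the sample script below. A minimal sketch of what it might look like, assuming a model trained with gensim (the library choice, model file name, and averaging scheme are assumptions, not details from the original pipeline):

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load('job_titles.w2v')   # hypothetical pre-trained model

def title_vector(title):
    # represent a job title as the average of its word vectors
    tokens = [t for t in title.lower().split() if t in model.wv]
    if not tokens:
        return None
    return np.mean([model.wv[t] for t in tokens], axis=0)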
Data

Sample Script

The following sample script shows how multiprocessing can speed up an operation within a limited amount of memory. The first part of the script is specific to this particular task and can safely be skipped; please focus on the second part, which contains the multiprocessing engine.

# import the necessary packages
import pandas as pd
import us
import numpy as np
from multiprocessing import Pool, cpu_count, Queue, Manager

# The data in one particular column is numeric, but in that horrible Excel form
# where '12000' is '12,000', with that beautiful, useless comma in there.
# Did I mention Excel bothers me?
# Instead of converting the numbers right away, we convert them when we need to.
def median_maker(column):
    return np.median([int(x.replace(',', '')) for x in column])

# dictionary_of_dataframes contains a dataframe with information for each title;
#   e.g. the title is 'Data Scientist'
# related_title_score_df is the dataframe of information for that title;
#   columns = ['title', 'score'], where title is a similar_title and score is
#   how closely the two are related, e.g. 'Analyst', 0.871
# code_title_df contains columns ['code', 'title']
# oes_data_df is a HUGE dataframe with all of the Bureau of Labor Statistics (BLS)
#   data for a given time period (YAY free data, BOO bad census data!)

def job_title_location_matcher(title, location):
    try:
        related_title_score_df = dictionary_of_dataframes[title]
        # we limit dataframe1 to only those related_titles that are above
        # a previously established threshold
        related_title_score_df = related_title_score_df[related_title_score_df['score'] > 80]

        # we merge the related titles with another table and its codes
        codes_reltitles_scores = pd.merge(code_title_df, related_title_score_df)
        codes_reltitles_scores = codes_reltitles_scores.drop_duplicates()

        # merge the two dataframes by the codes
        merged_df = pd.merge(codes_reltitles_scores, oes_data_df)

        # limit the BLS data to the state we want
        all_merged = merged_df[merged_df['area_title'] == str(us.states.lookup(location).name)]

        # calculate some summary statistics for the time we want
        group_med_emp, group_mean, group_pct10, group_pct25, group_median, group_pct75, group_pct90 = \
            all_merged[['tot_emp', 'a_mean', 'a_pct10', 'a_pct25',
                        'a_median', 'a_pct75', 'a_pct90']].apply(median_maker)
        row = [title, location, group_med_emp, group_mean, group_pct10,
               group_pct25, group_median, group_pct75, group_pct90]

        # convert it all to strings so we can combine them when writing to file
        row_string = [str(x) for x in row]
        return row_string
    except:
        # if it doesn't work for a particular title/state just throw it out,
        # there are enough to make this insignificant
        'doing nothing'
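Used on its own, the matcher takes a single (title, state) pair and returns one output row as a list of strings, or None when the lookup fails (the values below are illustrative):

row = job_title_location_matcher('Data Scientist', 'CA')
# e.g. ['Data Scientist', 'CA', '12500', '98000', ...], or None on failure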

Something magical is happening here:

# runs the function and puts the answers in the queue
def worker(row, q):
    ans = job_title_location_matcher(row[0], row[1])
    if ans is not None:   # failed lookups return None, so skip them
        q.put(ans)

# this writes to the file while there are still things that could be in the queue
# this allows multiple processes to write to the same file without blocking each other
def listener(q, filename):
    f = open(filename, 'w')
    while 1:
        m = q.get()
        if m == 'kill':
            break
        f.write(','.join(m) + '\n')
        f.flush()
    f.close()

def main():
    # load all your data, then throw out all unnecessary tables/columns
    filename = 'skill_test_pool.txt'

    # set up the necessary multiprocessing tasks
    manager = Manager()
    q = manager.Queue()
    pool = Pool(cpu_count() + 2)
    watcher = pool.apply_async(listener, (q, filename))

    jobs = []
    # titles_states is a dataframe of millions of job titles and the states they were found in
    for i in titles_states.itertuples(index=False):
        job = pool.apply_async(worker, (i, q))
        jobs.append(job)

    for job in jobs:
        job.get()
    q.put('kill')
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
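A note on the design: Manager().Queue() gives every worker process a handle to the same queue, and only the listener process ever touches the output file, so the workers never block one another on file I/O. The pool is sized at cpu_count() + 2, presumably so that the listener can occupy one of the extra slots while the workers keep all of the cores busy.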

 

Because each data frame is a different size (about 100 GB in total), it is impossible to hold all of the data in memory at once. The final data frame is written to disk row by row, so the complete data frame never sits in memory, yet we still get all of the calculations and joins done. The "standard method" here would be to simply call a write_line function at the end of job_title_location_matcher, but then only one instance is processed at a time. Depending on the number of titles/states we need to handle, that would take about 2 days. With multiprocessing, it takes only about 2 hours.
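For reference, the single-process "standard method" described above would look roughly like this (a sketch, reusing the same titles_states data frame and an already opened output file f):

# serial baseline: one (title, state) pair at a time, written immediately
for i in titles_states.itertuples(index=False):
    line = job_title_location_matcher(i[0], i[1])
    if line is not None:
        f.write(','.join(line) + '\n')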

Although readers may not have access to the exact environment this tutorial deals with, multiprocessing lets you push past many hardware limitations. This example ran on a c3.8xlarge Ubuntu EC2 instance with 32 cores and 60 GB of memory (a lot of memory, but still not enough to hold all of the data at once). The key point is that we effectively processed about 100 GB of data on a 60 GB machine, with a speedup of roughly 25x. Running a large job as multiple processes on a multi-core machine makes much better use of the hardware. Some readers may already know this technique, but for everyone else, multiprocessing can bring a lot of benefit. By the way, this part is a continuation of the skill assets in the job-market blog post.
