Applying a space-for-time trade-off and hash key-value pairs to string processing in R

Last Update:2015-06-05 Source: Internet

Author: User

I have recently been working with traffic data. Each record holds a timestamp, a license plate, and the address of the intersection the vehicle passed, and the data volume is large. This article uses the time-ordered sequence of intersections that each car passes to generate a directed connectivity graph of Guiyang's road network, that is, to determine which pairs of intersections are connected in both directions and which in only one.

I. Notes on the data

The license plate number and the intersection address are strings.
The time is in datetime format.
The data set contains about 6.8 million records.

II. The original algorithm code

    rm(list = ls(all = TRUE))
    gc()
    library(RODBC)

    # Connect to the MySQL test database
    channel = odbcConnect("Transport-connector-r", uid = "Transport", pwd = "Transport")
    sqlTables(channel)  # show the tables in the test database

    # Retrieve the Guiyang vehicle records (license plate, intersection passed)
    # from test.transport20140901, ordered by plate and then by time
    transections_data <- sqlQuery(channel, "select plate,address from transport20140901 where plate like '贵A%' order by plate,time")
    odbcClose(channel)

    # Read the sorted intersection addresses from file
    address_file <- file("/home/wanglinlin/transport/address.txt", "r")
    sorted_address <- readLines(address_file)
    close(address_file)
    #sorted_address[256]

    # Initial matrix of the directed connectivity graph of Guiyang intersections
    transection_count <- length(sorted_address)
    tansport_map <- matrix(0, transection_count, transection_count)
    #tansport_map

    # Look up the position number of a target address name in the address table
    find_address <- function(target, address_table) {
      len = length(address_table)
      for (i in 1:len)
        if (target == address_table[i]) return(i)
      return(0)
    }

    # Build the directed-graph matrix of the Guiyang traffic map from local vehicle records
    transport_data_count <- 6725490
    counter <- transport_data_count - 1
    transection_id_one = find_address(transections_data[1, 2], sorted_address)
    for (i in 1:counter) {
      transection_id_two = find_address(transections_data[i + 1, 2], sorted_address)
      # Consecutive records with the same plate mean the car drove from one intersection to the next
      if (transections_data[i, 1] == transections_data[i + 1, 1]) {
        tansport_map[transection_id_one, transection_id_two] <- 1
      }
      transection_id_one <- transection_id_two
    }
    write.table(tansport_map, "/home/wanglinlin/transport/tansport_map_two.txt", row.names = FALSE, col.names = FALSE)

The core of the code above is the body of the for loop. The number of iterations can hardly be reduced, and each iteration performs two time-consuming operations:

find_address(transections_data[i+1,2], sorted_address)
transections_data[i,1] == transections_data[i+1,1]

The first operation finds the position of a string in an array (the position of the current intersection address in the address list). The second checks whether two strings are equal (whether two consecutive records carry the same license plate). Both are string operations and are quite time-consuming. In fact, find_address is already an optimization: the earlier version used the which function, which finds every matching position before the first one is taken and therefore traverses the entire list on each lookup.
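The lookup strategies mentioned above can be compared on a small example. This is a minimal sketch: the address strings are hypothetical placeholders, and the use of base R's match() is a suggestion that does not appear in the original code.

```r
# Hypothetical toy address table; the real one is read from address.txt.
address_table <- c("Road A / Road B", "Road C / Road D", "Road E / Road F")
target <- "Road C / Road D"

# 1. which(): compares the target against every element, then keeps the first hit.
pos_which <- which(address_table == target)[1]

# 2. Explicit loop with early exit, in the style of find_address.
find_address <- function(target, address_table) {
  for (i in seq_along(address_table)) {
    if (target == address_table[i]) return(i)
  }
  0  # not found
}
pos_loop <- find_address(target, address_table)

# 3. match(): base R's first-match lookup; it also accepts a vector of targets.
pos_match <- match(target, address_table)

print(c(pos_which, pos_loop, pos_match))
```

All three return the same position for a key that exists. They differ for a missing key: the loop returns 0, while which(...)[1] and match() return NA.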
I started the program at around 3:00 yesterday, and by 9:15 today the loop had completed only 14,412 iterations. At that rate the whole program would take tens of days, which is unacceptable. Last night I kept thinking about how to change the algorithm. Ideally the string operations would be eliminated by converting the license plates and intersection addresses to numbers, so that comparisons are made on numbers and efficiency improves greatly. The natural way to do this is a hash table of key-value pairs. After a long search, I found that R's hash package supports this. With the hash package, the intersection addresses and license plates are stored in memory as key-value pairs. Finding the position of an intersection becomes a lookup of its value in the hash table, and comparing license plates becomes a comparison of the values looked up for each plate. The final code is as follows:
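The idea of mapping each string to an integer through a key-value table can be sketched in base R alone. This example uses an environment as the hash table instead of the hash package, so it is a substitute technique rather than the code the article runs. The keys and the name address_ids are hypothetical.

```r
# Hypothetical sample keys; in the article the keys are plates and intersection addresses.
addresses <- c("Road A / Road B", "Road C / Road D", "Road E / Road F")

# An environment created with hash = TRUE is backed by a hash table.
address_ids <- new.env(hash = TRUE, parent = emptyenv())
for (i in seq_along(addresses)) {
  address_ids[[addresses[i]]] <- i  # key: address string, value: position number
}

# Look up a value by its string key.
id <- address_ids[["Road E / Road F"]]

# Check whether a key is present without raising an error.
known <- exists("Road Z", envir = address_ids, inherits = FALSE)

print(id)
print(known)
```

The table costs extra memory because every key is stored once more, but each lookup then avoids scanning the address list. That is the space-for-time trade named in the title.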

    rm(list = ls(all = TRUE))
    gc()
    library(RODBC)
    library(hash)

    # Connect to the MySQL test database
    channel = odbcConnect("Transport-connector-r", uid = "transport", pwd = "Transport")
    sqlTables(channel)  # show the tables in the test database

    # Retrieve the Guiyang vehicle records (license plate, intersection passed)
    # from test.transport20140901, ordered by plate and then by time
    transections_data <- sqlQuery(channel, "select plate,address from transport20140901 where plate like '贵A%' order by plate,time")

    # Find all Guiyang license plates and hash them into a key-value table
    plates <- sqlQuery(channel, "select distinct plate from transport20140901 where plate like '贵A%'")
    odbcClose(channel)
    plate_list = (as.matrix(plates))[, 1]
    plate_count = length(plate_list)
    plate_hash_pairs = hash(plate_list, 1:plate_count)

    # Read the sorted intersection addresses from file and hash them as well
    address_file <- file("/home/wanglinlin/transport/address.txt", "r")
    sorted_address <- readLines(address_file)
    sorted_address_hash_pairs <- hash(sorted_address, 1:269)
    close(address_file)
    #sorted_address

    # Initial matrix of the directed connectivity graph of Guiyang intersections
    transection_count <- length(sorted_address)
    transport_map <- matrix(0, transection_count, transection_count)
    #tansport_map

    # Build the directed-graph matrix of the Guiyang traffic map from local vehicle records
    transport_data_count <- 6725490
    counter <- transport_data_count - 1

    # Spot-check a few lookups before the main loop
    plate_hash_pairs[[as.character(transections_data[1, 1])]]
    plate_hash_pairs[[as.character(transections_data[2, 1])]]
    sorted_address_hash_pairs[[as.character(transections_data[1, 2])]]
    sorted_address_hash_pairs[[as.character(transections_data[2, 2])]]

    for (i in 1:counter) {
      # Compare plates through their integer hash values
      if (plate_hash_pairs[[as.character(transections_data[i, 1])]] == plate_hash_pairs[[as.character(transections_data[i + 1, 1])]]) {
        # Mark the edge from the current intersection to the next one
        transport_map[sorted_address_hash_pairs[[as.character(transections_data[i, 2])]], sorted_address_hash_pairs[[as.character(transections_data[i + 1, 2])]]] <- 1
      }
    }
    write.table(transport_map, "/home/wanglinlin/transport/transport_map.txt", row.names = FALSE, col.names = FALSE)
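As an alternative sketch, and not the author's method, the same adjacency matrix can in principle be built without an explicit loop by using base R's vectorized match() and matrix indexing. The toy records below are hypothetical. The sketch assumes the rows are already ordered by plate and then by time, and that every address appears in sorted_address.

```r
# Hypothetical toy records, already ordered by plate and then by time.
records <- data.frame(
  plate   = c("P1", "P1", "P1", "P2", "P2"),
  address = c("X1", "X2", "X3", "X3", "X1"),
  stringsAsFactors = FALSE
)
sorted_address <- c("X1", "X2", "X3")

n <- nrow(records)

# Translate every address to its row/column index in one call.
ids <- match(records$address, sorted_address)

# TRUE where row i and row i + 1 belong to the same vehicle.
same_plate <- records$plate[-n] == records$plate[-1]

transport_map <- matrix(0, length(sorted_address), length(sorted_address))

# A two-column matrix of (from, to) pairs sets all edges at once.
edges <- cbind(ids[-n][same_plate], ids[-1][same_plate])
transport_map[edges] <- 1

print(transport_map)
```

Here vehicle P1 contributes the edges X1 to X2 and X2 to X3, and vehicle P2 contributes X3 to X1. The pair of rows where the plate changes from P1 to P2 is skipped.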

As for the end result: when I got to my computer at 8:30 this morning, the program had already finished. The result data is not shown here. Overall, the new algorithm is hundreds of times more efficient.

This article is an English version of an article originally published in Chinese on aliyun.com.
