Trading space for time in R: using hash key-value pairs in string processing

Source: Internet
Author: User

Recently I have been working with traffic data. Each record contains a time, a license plate, and the address of the intersection the car passed through, and the data volume is large. This article uses the time-ordered sequence of intersections each car passes through to generate a connectivity map of Guiyang's traffic, that is, which intersections are connected to each other and whether each connection is two-way or one-way.
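To make the goal concrete, here is a minimal sketch on a made-up data set. The plates P1 and P2 and the intersections A to C are hypothetical and not taken from the real data. It assumes the records are already sorted by plate and then by time, so each pair of consecutive records from the same car marks a directed edge between two intersections.

```r
# Toy records, already sorted by plate and then by time (hypothetical values).
records <- data.frame(
  plate   = c("P1", "P1", "P1", "P2", "P2"),
  address = c("A",  "B",  "A",  "B",  "C"),
  stringsAsFactors = FALSE
)
intersections <- c("A", "B", "C")

# adjacency[i, j] == 1 means some car drove from intersection i to intersection j.
adjacency <- matrix(0, length(intersections), length(intersections),
                    dimnames = list(intersections, intersections))

for (i in seq_len(nrow(records) - 1)) {
  # Only consecutive records of the same car form an edge.
  if (records$plate[i] == records$plate[i + 1]) {
    adjacency[records$address[i], records$address[i + 1]] <- 1
  }
}

# A connection is two-way if both directions were observed, one-way otherwise.
two_way <- adjacency == 1 & t(adjacency) == 1
one_way <- adjacency == 1 & t(adjacency) == 0
```

Here the matrix is indexed by intersection name for readability. The code in this article works with position numbers instead.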

I. Notes on the data

    • License plate numbers and intersection addresses are strings
    • Time is in date-time format
    • The data set has about 6.8 million records

II. The original algorithm code

    rm(list = ls(all = TRUE))
    gc()
    library(RODBC)

    # Connect to the MySQL test database
    channel = odbcConnect("Transport-connector-r", uid = "transport", pwd = "transport")
    sqlTables(channel)  # show the tables in the test database

    # Retrieve the Guiyang vehicles from test.transport20140901:
    # license plate and intersection passed, sorted by plate and time
    transections_data <- sqlQuery(channel, "select plate,address from transport20140901 where plate like '贵A%' order by plate,time")
    odbcClose(channel)

    # Read the sorted intersection addresses from the file
    address_file <- file("/home/wanglinlin/transport/address.txt", "r")
    sorted_address <- readLines(address_file)
    close(address_file)
    #sorted_address[256]

    # Initial matrix of the directed connectivity graph of Guiyang intersections
    transection_count <- length(sorted_address)
    tansport_map <- matrix(0, transection_count, transection_count)
    #tansport_map

    # Find the position number of the target address name in the address table
    find_address <- function(target, address_table) {
      len = length(address_table)
      for (i in 1:len)
        if (target == address_table[i]) return(i)
      return(0)
    }

    # Build the directed graph matrix of the Guiyang traffic map
    # from the local vehicle records
    transport_data_count <- 6725490
    counter <- transport_data_count - 1
    transection_id_one = find_address(transections_data[1, 2], sorted_address)
    for (i in 1:counter) {
      transection_id_two = find_address(transections_data[i + 1, 2], sorted_address)
      # Consecutive records of the same car: mark the edge one -> two
      if (transections_data[i, 1] == transections_data[i + 1, 1]) {
        tansport_map[transection_id_one, transection_id_two] <- 1
      }
      transection_id_one <- transection_id_two
    }
    write.table(tansport_map, "/home/wanglinlin/transport/tansport_map_two.txt", row.names = FALSE, col.names = FALSE)

The core of the code above is the body of the for loop. The number of iterations is unlikely to be reduced, and the loop contains two time-consuming operations:

    • find_address(transections_data[i+1,2], sorted_address)
    • transections_data[i,1] == transections_data[i+1,1]

These two operations find the position of a string in an array (the position of the current intersection address in the address list) and compare two strings for equality (whether two license plates are the same). Both are string operations and are quite time-consuming. In fact, find_address is already an optimized version: originally I used the which function, which finds all matching positions and then returns the first one, traversing the entire list on every lookup.
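The difference between the two lookups can be sketched as follows. The address names are made up, and `first_with_which` is a hypothetical helper name for the earlier which-based version. Base R's `match` also returns the first matching position and is shown only for comparison.

```r
address_table <- c("north gate", "south gate", "east gate", "south gate")

# which() compares the target with every element; we then keep the first hit.
first_with_which <- function(target, address_table) {
  hits <- which(address_table == target)
  if (length(hits) == 0) 0 else hits[1]
}

# The loop version returns at the first hit instead of scanning the whole table.
find_address <- function(target, address_table) {
  for (i in seq_along(address_table)) {
    if (target == address_table[i]) return(i)
  }
  0
}

pos_which  <- first_with_which("south gate", address_table)
pos_loop   <- find_address("south gate", address_table)
pos_match  <- match("south gate", address_table, nomatch = 0)
pos_absent <- find_address("west gate", address_table)
```

All three lookups agree on the position. An address that is not in the table yields 0.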
I started running the program around 3:00 yesterday, and by 9:15 today the loop had completed only 14,412 iterations. At that rate the whole program would take tens of days, which is unacceptable. Last night I kept looking for a way to change the algorithm so that the string operations could be avoided. The best approach is to convert license plates and intersection addresses to numbers and compare the numbers, which greatly improves efficiency. The natural way to do this is with hash key-value pairs, and after a long search I found that R's hash package supports this. With the hash package, intersection addresses and license plates are stored in memory as key-value pairs, so finding an intersection's position becomes a hash lookup, and comparing two license plates becomes comparing the values looked up for each plate. The final code is as follows:

    rm(list = ls(all = TRUE))
    gc()
    library(RODBC)
    library(hash)

    # Connect to the MySQL test database
    channel = odbcConnect("Transport-connector-r", uid = "transport", pwd = "transport")
    sqlTables(channel)  # show the tables in the test database

    # Retrieve the Guiyang vehicles from test.transport20140901:
    # license plate and intersection passed, sorted by plate and time
    transections_data <- sqlQuery(channel, "select plate,address from transport20140901 where plate like '贵A%' order by plate,time")

    # Find all Guiyang plate numbers and hash them into a key-value table
    plates <- sqlQuery(channel, "select distinct plate from transport20140901 where plate like '贵A%'")
    odbcClose(channel)
    plate_list = (as.matrix(plates))[, 1]
    plate_count = length(plate_list)
    plate_hash_pairs = hash(plate_list, 1:plate_count)

    # Read the sorted intersection addresses from the file and hash them
    address_file <- file("/home/wanglinlin/transport/address.txt", "r")
    sorted_address <- readLines(address_file)
    sorted_address_hash_pairs <- hash(sorted_address, 1:269)
    close(address_file)
    #sorted_address[256]

    # Initial matrix of the directed connectivity graph of Guiyang intersections
    transection_count <- length(sorted_address)
    transport_map <- matrix(0, transection_count, transection_count)
    #tansport_map

    # Build the directed graph matrix of the Guiyang traffic map
    # from the local vehicle records
    transport_data_count <- 6725490
    counter <- transport_data_count - 1

    # Spot-check the lookups on the first two records
    plate_hash_pairs[[as.character(transections_data[1, 1])]]
    plate_hash_pairs[[as.character(transections_data[2, 1])]]
    sorted_address_hash_pairs[[as.character(transections_data[1, 2])]]
    sorted_address_hash_pairs[[as.character(transections_data[2, 2])]]

    for (i in 1:counter) {
      # Compare the integer ids of the two plates instead of the strings
      if (plate_hash_pairs[[as.character(transections_data[i, 1])]] == plate_hash_pairs[[as.character(transections_data[i + 1, 1])]]) {
        transport_map[sorted_address_hash_pairs[[as.character(transections_data[i, 2])]], sorted_address_hash_pairs[[as.character(transections_data[i + 1, 2])]]] <- 1
      }
    }
    write.table(transport_map, "/home/wanglinlin/transport/transport_map.txt", row.names = FALSE, col.names = FALSE)
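The key-value idea used above can be tried without installing anything. The sketch below uses a base R environment as a stand-in for the hash package, an assumption made only so that the example is self-contained. The helper name `build_lookup` and the sample keys are hypothetical. Each string is stored once together with its position, and later lookups go by key rather than by scanning a vector.

```r
# Build a key -> position table backed by a hashed environment.
build_lookup <- function(keys) {
  lookup <- new.env(hash = TRUE)
  for (i in seq_along(keys)) {
    assign(keys[i], i, envir = lookup)
  }
  lookup
}

sorted_address <- c("north gate", "south gate", "east gate")
address_lookup <- build_lookup(sorted_address)

# Look up by key; [[ ]] on an environment returns NULL for an unknown key.
id_south   <- address_lookup[["south gate"]]
id_east    <- get("east gate", envir = address_lookup)
id_unknown <- address_lookup[["west gate"]]
```

The table costs extra memory, which is the space being traded for faster lookups.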

The end result: when I got to the computer at 8:30 this morning, the program had already finished. The result data is not shown here. Overall, the new algorithm is hundreds of times more efficient.
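As a further variation on the same idea (not the method used in this article), the strings can be converted to integers in one step with base R's `match`, after which the loop can be replaced by vector comparisons and matrix indexing. This is a minimal sketch on made-up data; `records`, `intersections`, and the plate values are hypothetical.

```r
records <- data.frame(
  plate   = c("P1", "P1", "P1", "P2", "P2"),
  address = c("A",  "B",  "A",  "B",  "C"),
  stringsAsFactors = FALSE
)
intersections <- c("A", "B", "C")

# Convert every string to an integer id once, up front.
plate_id   <- match(records$plate, unique(records$plate))
address_id <- match(records$address, intersections)

n <- nrow(records)
# TRUE where record i and record i + 1 belong to the same car.
same_car <- plate_id[-n] == plate_id[-1]

# Each row of 'edges' is a (from, to) pair of intersection ids.
edges <- cbind(address_id[-n], address_id[-1])[same_car, , drop = FALSE]

transport_map <- matrix(0, length(intersections), length(intersections))
# Indexing a matrix with a two-column matrix addresses one cell per row.
transport_map[edges] <- 1
```

The sketch assumes every address appears in the intersection list. Addresses missing from the list would produce NA ids and would need to be filtered out before the final assignment.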


