The text is as follows:
# Date, serial number, visited page URL, page number (xth page viewed), visitor IP, access time, is entry page, operating system, browser, language, time zone, screen resolution, screen color depth, province, city, access provider, internet access location, Alexa installed
2014-7-17 11452775496 http://www.imaibo.net/space/178120 1 59.41.23.101 2014-7-17 13:38:14 0 Windows XP Chrome 21.0.8
2014-7-17 11452775466 http://www.imaibo.net/space/24649 3 211.143.83.60 2014-7-17 13:38:13 0
2014-7-17 11452775362 http://www.imaibo.net/live/216945 4 119.130.98.225 2014-7-17 13:38:12 0
2014-7-17 11452775357 http://www.imaibo.net/space/170907 6 58.50.218.120 2014-7-17 13:38:12 0
2014-7-17 11452775291 http://www.imaibo.net/space/113374 1 14.158.204.146 2014-7-17 13:38:11 1 Windows XP Chrome 21.0.1180 zh-CN 8 1360x768 32 Guangdong Guangdong province Telecom 0149
...
Approach:
1. First store the key for each field of the BSON document in a list, for example:
keyarray = ["date", "swiftnum", "url", "page", "ip", "atime", "is_inpoint", "system", "browser", "locale", "timezone",
"screen", "colornum", "province", "city", "aprovider", "netcase", "isalexa", "ipheader", "count", "keyword", "domain", "incomeurl"]
2. Read the text data from the file and iterate over the records with a for-in loop. Split each line into a list of field values with the split method of the re module, using a tab pattern such as pattern = r'\t+'. (A pattern like r'\t*' can match the empty string, and from Python 3.7 on re.split also splits at empty matches, so it is best avoided.)
3. Traverse each record's list with another for-in loop, taking each list element as the value and the corresponding element of keyarray as the key. Each text record is thereby converted into one BSON document.
4. To prevent the same record from being inserted twice, the following scheme is used:
A. Find the key that uniquely identifies each record (in this case: swiftnum).
B. Each time a record's BSON document is inserted into MongoDB, append its swiftnum value to the file swiftnums.txt.
C. Before inserting a document, check whether its swiftnum value is already in swiftnums.txt. If it is not, insert the document; otherwise skip it.
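Steps 2 and 3 can be tried in isolation, without a database. The following is a minimal sketch, not the author's exact code: the names KEYS, FIELD_SEP and record_to_document are made up for illustration, and the sample line is one of the rows shown earlier with the tab separators assumed to be restored.

```python
import re

# Field names taken from the key list in step 1.
KEYS = ["date", "swiftnum", "url", "page", "ip", "atime", "is_inpoint",
        "system", "browser", "locale", "timezone", "screen", "colornum",
        "province", "city", "aprovider", "netcase", "isalexa", "ipheader",
        "count", "keyword", "domain", "incomeurl"]

# One or more tabs; unlike r'\t*', this pattern can never match an empty string.
FIELD_SEP = re.compile(r"\t+")


def record_to_document(line, keys=KEYS):
    """Split one tab-separated record and pair each field with its key."""
    fields = FIELD_SEP.split(line.rstrip("\r\n"))
    # zip stops at the shorter sequence, so a short record just gets fewer keys.
    return dict(zip(keys, fields))


# Sample record (assumption: fields are separated by tabs in the real file).
sample = ("2014-7-17\t11452775466\thttp://www.imaibo.net/space/24649\t3\t"
          "211.143.83.60\t2014-7-17 13:38:13\t0\n")
doc = record_to_document(sample)
print(doc["swiftnum"], doc["ip"])
```

Note that splitting on runs of tabs collapses empty columns, so fields after an empty value shift to the left. If empty columns can occur in the data, splitting on a single tab with line.split("\t") may be the safer choice.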
The code is as follows:
import re
import pymongo

# ----- Connect to MongoDB -----
# MongoClient replaces pymongo.Connection, which was removed in PyMongo 3.
conn = pymongo.MongoClient("localhost", 27017)
db = conn.flowcounter
flowlist = db.flowcollection

# The text file to import
file = open("test.txt", "r")
# Stores the swiftnum of every inserted record to prevent repeated insertions.
# Mode "a+" creates the file if it is missing and always writes at the end.
swiftnums = open("swiftnums.txt", "a+")
swiftnums.seek(0)

# Keys for each field when a record is converted to a BSON document
keyarray = ["date", "swiftnum", "url", "page", "ip", "atime", "is_inpoint", "system", "browser", "locale", "timezone",
            "screen", "colornum", "province", "city", "aprovider", "netcase", "isalexa", "ipheader", "count", "keyword", "domain", "incomeurl"]

# Regex used to split a text record: one or more tabs
pattern = r'\t+'

# swiftnum values of the records already in the database
swiftnum_list = swiftnums.read().split(",")

for line in file.readlines()[1:]:  # skip the header line
    data = {}
    li_list = re.split(pattern, line.rstrip("\r\n"))  # split the record into a list
    for key, field in zip(keyarray, li_list):
        data[key] = field  # convert the record to a BSON document
    if "swiftnum" not in data:  # skip lines that have too few fields
        continue
    if data["swiftnum"] not in swiftnum_list:  # check for duplicates
        flowlist.insert_one(data)
        swiftnums.write(data["swiftnum"] + ",")  # remember the inserted key
        swiftnum_list.append(data["swiftnum"])  # also catch repeats within this run
swiftnums.close()
file.close()
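The duplicate check from step 4 can also be exercised without a running MongoDB server. The sketch below is an illustration under assumptions: load_seen and insert_new are hypothetical helper names, a plain Python list stands in for the collection, and a set replaces the list of known swiftnum values so that each lookup does not scan every stored key.

```python
import os
import tempfile


def load_seen(path):
    """Return the set of swiftnum values already recorded in the file."""
    if not os.path.exists(path):
        return set()
    with open(path, "r") as f:
        return {s for s in f.read().split(",") if s}


def insert_new(documents, seen, path, insert):
    """Insert documents whose swiftnum is unseen; return how many were inserted."""
    inserted = 0
    with open(path, "a") as f:
        for doc in documents:
            key = doc["swiftnum"]
            if key in seen:
                continue        # already stored, skip it
            insert(doc)         # in real use this could be collection.insert_one
            f.write(key + ",")  # remember the key for later runs
            seen.add(key)       # and for the rest of this run
            inserted += 1
    return inserted


# Demo: a list stands in for the MongoDB collection (assumption for testing).
store = []
path = os.path.join(tempfile.mkdtemp(), "swiftnums.txt")
docs = [{"swiftnum": "11452775496"}, {"swiftnum": "11452775466"},
        {"swiftnum": "11452775496"}]
first = insert_new(docs, load_seen(path), path, store.append)
second = insert_new(docs, load_seen(path), path, store.append)
print(first, second, len(store))  # prints "2 0 2"
```

The first call inserts the two distinct records and skips the repeated one; the second call reads the keys back from the file and inserts nothing.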