Saving each line of a text file to MongoDB with Python while preventing duplicate insertions

Source: Internet
Author: User
Tags: locale

The input text looks like this:

# Date, serial number, visited page URL, page x (the x-th page viewed), visitor IP, access time, is entry page, operating system, browser, language, time zone, screen resolution, screen color depth, province, city, access provider, internet access location, Alexa installed

2014-7-17 11452775496 http://www.imaibo.net/space/178120 1 59.41.23.101 2014-7-17 13:38:14 0 Windows XP Chrome 21.0.8
2014-7-17 11452775466 http://www.imaibo.net/space/24649 3 211.143.83.60 2014-7-17 13:38:13 0
2014-7-17 11452775362 http://www.imaibo.net/live/216945 4 119.130.98.225 2014-7-17 13:38:12 0
2014-7-17 11452775357 http://www.imaibo.net/space/170907 6 58.50.218.120 2014-7-17 13:38:12 0
2014-7-17 11452775291 http://www.imaibo.net/space/113374 1 14.158.204.146 2014-7-17 13:38:11 1 Windows XP Chrome 21.0.1180 ZH-CN 8 1360x768 32 Guangdong Guangdong province Telecom 0149

...

Approach:

1. First store the key of each field of the BSON document in a list, for example:

keyarray = ["Date", "Swiftnum", "url", "page", "IP", "Atime", "Is_inpoint", "System", "browser", "locale", "timezone",
"Screen", "Colornum", "Province", "City", "Aprovider", "Netcase", "Isalexa", "Ipheader", "count", "keyword", "Domain", "Incomeurl"]

2. Read the text data from the file and iterate over the records with a for-in loop. Split each line with the split function of the re module, using a predefined pattern that matches the tab separators, so that each record becomes a list of field values.

3. Iterate over each record's list with a for-in loop, taking each list element as the value and the corresponding element of keyarray as the key. Each text record thus becomes one BSON document.

4. To prevent the same record from being inserted twice, the following scheme is used:

A. Find the key-value pair that uniquely identifies each record (here: Swiftnum).

B. Each time a record's BSON document is inserted into MongoDB, write its Swiftnum value to the file swiftnums.txt.

C. Before inserting a record's BSON document, check whether its Swiftnum value is already in swiftnums.txt. If it is not, insert the document; otherwise skip it.
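Steps 2 to 4 can be sketched without a database, as a rough illustration only. The names parse_line, load_seen and insert_new are hypothetical, the key list is shortened, and a plain Python list stands in for the MongoDB collection. The sketch also uses str.split("\t") instead of the re module, which is a substitution rather than the article's method.

```python
import os
import tempfile

# Hypothetical, shortened key list for illustration; the article's list is longer.
KEYS = ["Date", "Swiftnum", "url", "page", "IP"]


def parse_line(line, keys=KEYS):
    """Split one tab-separated record and pair each field with its key."""
    fields = line.rstrip("\n").split("\t")
    # zip stops at the shorter sequence, so short records simply get fewer keys
    return dict(zip(keys, fields))


def load_seen(path):
    """Read the comma-separated Swiftnum values recorded by earlier runs."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {value for value in fh.read().split(",") if value}


def insert_new(lines, seen_path, insert):
    """Call insert(doc) for every record whose Swiftnum has not been seen yet."""
    seen = load_seen(seen_path)
    inserted = 0
    with open(seen_path, "a") as fh:
        for line in lines:
            doc = parse_line(line)
            key = doc.get("Swiftnum")
            if not key or key in seen:
                continue  # blank line or duplicate: skip it
            insert(doc)
            fh.write(key + ",")
            seen.add(key)
            inserted += 1
    return inserted


# Usage: a plain list stands in for the MongoDB collection (an assumption).
records = [
    "2014-7-17\t11452775496\thttp://www.imaibo.net/space/178120\t1\t59.41.23.101\n",
    "2014-7-17\t11452775466\thttp://www.imaibo.net/space/24649\t3\t211.143.83.60\n",
]
store = []
seen_file = os.path.join(tempfile.mkdtemp(), "swiftnums.txt")
first_run = insert_new(records, seen_file, store.append)
second_run = insert_new(records, seen_file, store.append)
print(first_run, second_run, len(store))  # prints: 2 0 2
```

Keeping the seen values in a set makes each membership check cheap, and adding every new Swiftnum to the set during the loop also catches duplicates that occur within the same input file.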

The code is as follows:

import re
import pymongo

# ----- Connect to MongoDB -----
conn = pymongo.MongoClient("localhost", 27017)
db = conn.flowcounter
flowlist = db.flowcollection

# Text file to store
file = open("test.txt", "r")

# Holds the Swiftnum value of every inserted record, to prevent repeated insertions
# (the file must already exist, because it is opened in "r+" mode)
swiftnums = open("swiftnums.txt", "r+")

# Key of each field when a record is converted to a BSON document
keyarray = ["Date", "Swiftnum", "url", "page", "IP", "Atime", "Is_inpoint", "System", "browser", "locale", "timezone",
            "Screen", "Colornum", "Province", "City", "Aprovider", "Netcase", "Isalexa", "Ipheader", "count", "keyword", "Domain", "Incomeurl"]
# Regular expression used to split a text record: one or more tabs
pattern = r'\t+'
# Swiftnum values of the records already in the database
swiftnum_list = swiftnums.read().split(",")
for line in file.readlines()[1:]:  # skip the header line
    if not line.strip():
        continue  # ignore blank lines
    i = 0
    data = {}
    li_list = re.split(pattern, line.rstrip("\n"))  # split the record into a list
    for field in li_list:
        data[keyarray[i]] = field  # convert the record to a BSON document
        i += 1
    if data['Swiftnum'] not in swiftnum_list:  # check for duplicates
        flowlist.insert_one(data)
        swiftnums.write(data['Swiftnum'] + ",")  # remember the inserted record
        swiftnum_list.append(data['Swiftnum'])  # also catch duplicates within this file
swiftnums.close()
file.close()
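
The choice of split pattern affects how fields line up with their keys. The short comparison below uses a made-up record with one empty field. It is only an illustration of how re.split and str.split behave, not part of the original program.

```python
import re

# Hypothetical record whose third field is empty (two tabs in a row)
line = "2014-7-17\t11452775496\t\t1\n"
text = line.rstrip("\n")

# A pattern that requires at least one tab splits on runs of tabs,
# so the empty third field disappears and later fields shift left.
by_run = re.split(r"\t+", text)

# Splitting on each single tab keeps the empty field in place.
by_tab = text.split("\t")

# A pattern that can match the empty string, such as \t*, also splits
# between ordinary characters on Python 3.7 and later.
by_star = re.split(r"\t*", text)

print(by_run)   # ['2014-7-17', '11452775496', '1']
print(by_tab)   # ['2014-7-17', '11452775496', '', '1']
print(len(by_star) > len(by_tab))
```

If records can contain empty fields, splitting on single tabs keeps every value under the right key. Splitting on runs of tabs is only safe when every field is always filled in.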
