Overview
The Selenium modularization article showed why parameterization is needed. This article introduces how to read test data from an external TXT file.
How to open a file
The following two functions can be used to open a file:
1. open(file_name, access_mode)
file_name: the file path and name;
access_mode: the access mode; the options are listed below, and the default is r when none is given:
- r: read;
- w: write;
- a: append;
- +: read and write;
- b: binary access;
2. The file() function
The built-in file() function is equivalent to open(), as the documentation describes:
>>> help(open)
open(...)
    open(name[, mode[, buffering]]) -> file object

    Open a file using the file() type, returns a file object.  This is the
    preferred way to open a file.  See file.__doc__ for further information.
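The access modes listed above can be tried out with a short sketch. It is written for Python 3, where only open() exists (file() was removed), while the article's own code targets Python 2. The scratch file name and the use of the with statement are assumptions added for illustration.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "users.txt")

# 'w' creates the file (or truncates an existing one) for writing.
with open(path, "w") as fp:
    fp.write("admin,admin\n")

# 'a' appends to the end of the file instead of truncating it.
with open(path, "a") as fp:
    fp.write("guest,guest\n")

# 'r' is the default, so open(path) behaves like open(path, 'r').
with open(path) as fp:
    text = fp.read()

# Adding 'b' returns the raw bytes without any decoding.
with open(path, "rb") as fp:
    raw = fp.read()

print(repr(text))
print(repr(raw))
```

The with statement closes the file even if an exception is raised, which avoids having to call close() by hand.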
Reading an English TXT file
Next comes reading the content of a TXT file. Python provides several ways to read a file:
- read() reads the entire file
- readlines() reads the entire file into a list of lines
- readline() reads one line at a time
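The difference between the three methods is easiest to see side by side. The following is a small sketch in Python 3 syntax (the three calls behave the same way in Python 2); the scratch file and its content are assumptions made up for the demonstration.

```python
import os
import tempfile

# Hypothetical scratch file holding three "name,password" lines.
path = os.path.join(tempfile.mkdtemp(), "users.txt")
with open(path, "w") as fp:
    fp.write("admin,admin\nguest,guest\ntest,test\n")

with open(path) as fp:
    whole = fp.read()           # one string containing the entire file

with open(path) as fp:
    all_lines = fp.readlines()  # list of lines, each keeping its '\n'

with open(path) as fp:
    first = fp.readline()       # only the first line
    second = fp.readline()      # the next call continues with the next line

print(repr(whole))
print(all_lines)
print(repr(first), repr(second))
```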
Now suppose the TXT file stores test data made up of user login names and passwords, one pair per line:
admin,admin
guest,guest
test,test
Reading the file line by line is a good fit for this layout, as in the following example:
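# coding: utf-8
# Python 2 syntax (print statement)
import codecs

def str_reader_txt(address):
    fp = open(address, 'r')
    users = []
    pwds = []
    lines = fp.readlines()
    for data in lines:
        # each line has the form "name,password"
        name, pwd = data.split(',')
        # remove the trailing tab / carriage return / newline
        name = name.strip('\t\r\n')
        pwd = pwd.strip('\t\r\n')
        users.append(name)
        pwds.append(pwd)
        print "user:%s(len(%d))" % (name, len(name))
        print "pwd:%s(len(%d))" % (pwd, len(pwd))
    # close the file before returning; a close() placed after return never runs
    fp.close()
    return users, pwds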
The script above reads the TXT file line by line with readlines() and cuts each line with split() to obtain the user name and password. Note that each line read still carries its trailing carriage return and newline, so these have to be filtered out with strip().
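Why strip() matters can be checked on a single line. This is a minimal sketch; parse_line is a hypothetical helper name introduced here, not part of the original script.

```python
def parse_line(line):
    """Split one 'name,password' line and trim the line-ending characters."""
    name, pwd = line.split(",")
    return name.strip("\t\r\n"), pwd.strip("\t\r\n")

raw_line = "admin,admin\r\n"

# Without strip() the password still carries '\r\n', so its length is 7, not 5.
unstripped_pwd = raw_line.split(",")[1]
print(len(unstripped_pwd))

name, pwd = parse_line(raw_line)
print(name, pwd, len(pwd))
```

A password compared or typed into a form with the stray '\r\n' attached would not match the intended value, which is why the filtering step is needed.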
Reading a Chinese TXT file
In real testing, however, Chinese user names and passwords may also need to be entered. Will the test still pass? Change the user names in the test TXT file to Chinese, as follows (the names are shown here in English translation):
Administrator,admin
Guest,guest
Tester,test
Running the script above against this file shows that it fails on the Chinese content: the characters are displayed garbled. There are two solutions:
Method One
# coding: utf-8
# Python 2 syntax (print statement, str.decode)
import codecs

def str_reader_txt(address):
    fp = open(address, 'r')
    users = []
    pwds = []
    lines = fp.readlines()
    for data in lines:
        print type(data)
        data = data.decode("GB18030")  # handle the Chinese encoding problem
        print type(data)
        name, pwd = data.split(',')
        name = name.strip('\t\r\n')
        pwd = pwd.strip('\t\r\n')
        users.append(name)
        pwds.append(pwd)
        print "user:%s(len(%d))" % (name, len(name))
        print "pwd:%s(len(%d))" % (pwd, len(pwd))
    # close the file before returning; a close() placed after return never runs
    fp.close()
    return users, pwds
This method calls decode("GB18030") on each line before splitting it, after which the Chinese content is displayed correctly.
Reason Description
Unicode in Python generally refers to a unicode object; for example, the unicode object for '哈哈' is u'\u54c8\u54c8'.
A str is a byte array: it is the stored form of a unicode object after encoding (the encoding may be UTF-8, GBK, cp936, or GB2312). By itself it is only a stream of bytes with no other meaning. To display the content of this byte stream meaningfully, it must be decoded with the correct encoding.
When the script above runs, type(data) prints the type of data before and after decode:
<type 'str'>
<type 'unicode'>
As you can see, when a file is opened with the built-in open(), read() returns a str:
- read(): the result is a str; if the content contains Chinese, it has to be passed through decode() with the correct encoding and converted to unicode before it can be displayed correctly.
- write(): if the argument is unicode, encode() it with the encoding you want to write; if it is a str in some other encoding, first decode() it with that str's own encoding to get unicode, then encode() it with the target encoding.
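The decode and encode steps described above can be sketched without any file at all. The snippet is written for Python 3, where the Python 2 str corresponds to bytes and the Python 2 unicode corresponds to str; the variable names are illustrative.

```python
text = u"\u54c8\u54c8"                 # the Unicode text '哈哈'

# Encoding produces the byte form that would be stored in a file.
gb_bytes = text.encode("gb18030")
print(type(gb_bytes), len(gb_bytes))   # byte string, 2 bytes per character here

# Decoding with the same encoding recovers the text.
decoded = gb_bytes.decode("gb18030")
print(type(decoded), len(decoded))

# Converting between two encodings always goes through Unicode:
# decode with the source encoding, then encode with the target encoding.
utf8_bytes = gb_bytes.decode("gb18030").encode("utf-8")
print(len(utf8_bytes))                 # 3 bytes per character in UTF-8
```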
Method Two (recommended)
Specify GB18030 directly when opening the file, and the content read can be used as is. This method works for both Chinese and English TXT files.
# coding: utf-8
# Python 2 syntax (print statement)
import codecs

def str_reader_txt(address):
    # codecs.open decodes the content while reading
    fp = codecs.open(address, 'r', "GB18030")
    # fp = open(address, 'r')
    users = []
    pwds = []
    lines = fp.readlines()
    for data in lines:
        name, pwd = data.split(',')
        name = name.strip('\t\r\n')
        pwd = pwd.strip('\t\r\n')
        users.append(name)
        pwds.append(pwd)
        print "user:%s(len(%d))" % (name, len(name))
        print "pwd:%s(len(%d))" % (pwd, len(pwd))
    # close the file before returning; a close() placed after return never runs
    fp.close()
    return users, pwds
Note: codecs.getreader achieves the same effect, as follows:
# coding: utf-8
# Python 2 syntax (file() built-in)
import codecs

def str_reader_txt_csv(address):
    f = file(address, 'rb')
    users = []
    pwds = []
    # wrap the binary file in a reader that decodes GB18030
    csv = codecs.getreader('GB18030')(f)
    for data in csv:
        name, pwd = data.split(',')
        name = name.strip('\t\r\n')
        pwd = pwd.strip('\t\r\n')
        users.append(name)
        pwds.append(pwd)
    # close the file before returning; a close() placed after return never runs
    f.close()
    return users, pwds
Reason Description
The codecs module provides an open() function that takes the encoding to use when opening a file. Reading from a file opened this way returns unicode.
When writing, if the argument is unicode, it is encoded with the encoding given to open() and then written;
if it is a str, it is first decoded to unicode using the character encoding declared in the source file, and then handled as above. Compared with the built-in open(), this approach is less prone to encoding problems, so it is the recommended one.
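The same idea, letting the file object do the decoding and encoding, can be sketched with io.open, which accepts an encoding argument in the same spirit as codecs.open and is available in Python 2.6+ and Python 3 (where it is the built-in open). Note that this swaps in io.open for codecs.open; the file name and the sample line '管理员,admin' are assumptions chosen for illustration.

```python
import io
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "users_cn.txt")

# Writing: Unicode text is encoded to GB18030 by the file object itself.
with io.open(path, "w", encoding="gb18030") as fp:
    fp.write(u"\u7ba1\u7406\u5458,admin\n")   # '管理员,admin'

# Reading: lines arrive already decoded, so no decode() call is needed.
users, pwds = [], []
with io.open(path, "r", encoding="gb18030") as fp:
    for line in fp:
        name, pwd = line.split(",")
        users.append(name.strip())
        pwds.append(pwd.strip())

print(users, pwds, len(users[0]))
```

Because the text is decoded on the way in, len() counts characters rather than bytes, which is what a test comparing user names normally expects.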
Why use the GB18030 encoding format
The comparison test below shows the results of reading with GB18030 and with UTF-8:
On Windows, documents are saved as ANSI by default, and on a Simplified Chinese system ANSI stands for GB2312 encoding (strictly speaking code page 936, that is GBK, which is a superset of GB2312).
If you change the save format to UTF-8 when saving the TXT file, it can be opened with UTF-8 encoding, but the character length differs, for the following reason:
One thing that needs mentioning is the BOM (Byte Order Mark). When a file is saved, the encoding used is not stored with it, so to open it we have to remember which encoding it was saved with and open it with that same encoding, which causes a lot of trouble.
You might object that Notepad never asks you to choose an encoding when it opens a file. To check, take a TXT document saved as UTF-8, start Notepad first, and then open the document through the File menu.
UTF introduces the BOM to identify its own encoding: if the first few bytes read are one of the following, the text that follows uses the corresponding encoding:
BOM_UTF8 '\xef\xbb\xbf'
BOM_UTF16_LE '\xff\xfe'
BOM_UTF16_BE '\xfe\xff'
How can you get at the BOM in a UTF-8 file? The codecs module provides the constant codecs.BOM_UTF8 for reference; it is not explained in detail here.
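The length difference caused by the BOM can be reproduced with a short sketch. It relies on codecs.BOM_UTF8 and on the standard 'utf-8-sig' codec, which skips a leading BOM when decoding; the scratch file and its content are assumptions made up for the demonstration.

```python
import codecs
import io
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "bom.txt")

# Simulate a file saved as UTF-8 with a BOM in front of the text.
with open(path, "wb") as fp:
    fp.write(codecs.BOM_UTF8 + u"admin".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as the character u'\ufeff' at the start.
with io.open(path, "r", encoding="utf-8") as fp:
    with_bom = fp.read()

# 'utf-8-sig' drops a leading BOM while decoding.
with io.open(path, "r", encoding="utf-8-sig") as fp:
    without_bom = fp.read()

print(len(with_bom), len(without_bom))  # 6 5
```

The extra character is invisible when printed, which is why a user name read from a BOM-prefixed file can look right and still be one character too long.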
The differences and relations between GB2312, GBK and GB18030
Reference link: http://www.zhihu.com/question/19677619
That article is fairly comprehensive and clear. In summary:
- GBK is fully compatible with GB2312.
- GB 18030 is fully compatible with GB 2312 and largely compatible with GBK, and it supports all the unified Chinese characters of GB 13000 and Unicode, 70,244 Chinese characters in total.
GB 18030, in full the national standard GB 18030-2005 "Information technology: Chinese coded character set", is the latest internal-code character set of the People's Republic of China and a revision of GB 18030-2000 "Information technology: Chinese ideograms coded character set for information interchange, extension for the basic set".
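The compatibility relations above can be probed with Python's built-in codecs. This is a rough sketch in Python 3 syntax; the three sample characters and the helper name can_encode are choices made here for illustration, not taken from the referenced article.

```python
simplified = u"\u6c49\u5b57"   # '汉字', present in all three standards
traditional = u"\u9f8d"        # '龍', in GBK and GB18030 but not in GB2312
rare = u"\U00020000"           # a CJK Extension B character, beyond GBK

def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Characters covered by GB2312 get identical bytes in GBK and GB18030.
same = (simplified.encode("gb2312")
        == simplified.encode("gbk")
        == simplified.encode("gb18030"))
print(same)

for enc in ("gb2312", "gbk", "gb18030"):
    print(enc, can_encode(traditional, enc), can_encode(rare, enc))

# GB18030 represents characters outside GBK with four bytes.
print(len(rare.encode("gb18030")))
```

Since GB18030 can encode everything the two older standards can, decoding with GB18030 is the safest single choice when the exact GB variant of a file is unknown.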
Summary of the Chinese text processing workflow
The best way to work with Chinese data is as follows:
1. Decode early (decode the file content into Unicode as soon as it is read, then move on to the next step)
2. Unicode everywhere (use Unicode for all processing inside the program)
3. Encode late (encode back to the required encoding only at the end, for example when writing the final result to the result file)
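The three steps can be sketched as one small function. It is written in Python 3 syntax; convert_users, its parameters, the sample data, and the output format "name:password" are all hypothetical and only serve to show where each step sits.

```python
def convert_users(raw_bytes, src_encoding="gb18030", dst_encoding="utf-8"):
    """Turn 'name,password' lines into 'name:password' lines in a new encoding."""
    # 1. Decode early: bytes become Unicode text as soon as they arrive.
    text = raw_bytes.decode(src_encoding)

    # 2. Unicode everywhere: all splitting and trimming works on text.
    rows = []
    for line in text.splitlines():
        name, pwd = line.split(",")
        rows.append(u"%s:%s" % (name.strip(), pwd.strip()))

    # 3. Encode late: convert back to bytes only when producing the output.
    return u"\n".join(rows).encode(dst_encoding)

# Sample input: '管理员,admin' and '游客,guest' stored as GB18030 bytes.
sample = u"\u7ba1\u7406\u5458,admin\r\n\u6e38\u5ba2,guest\r\n".encode("gb18030")
result = convert_users(sample)
print(result.decode("utf-8"))
```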
A few points need explanation:
* The "correct" encoding means that the specified encoding must match the encoding the string is actually in. This is not always easy to determine. Generally, for Simplified Chinese characters entered directly there are two possibilities: GB2312 (GBK, GB18030) and UTF-8.
* GB2312, GBK, and GB18030 are essentially one family of encoding standards; each later one only extends the character set of the one before.
* UTF-8 and the GB encodings are not compatible with each other.
* To convert a str to unicode, either of the following two methods works (shown for a GB2312-encoded str):
- unicode(str, 'gb2312')
- str.decode('gb2312')
* In addition, when a string literal contains Chinese, define it as a unicode literal: str = u'汉字'.
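The two conversion methods and the incompatibility between UTF-8 and the GB encodings can be checked with a short sketch. It is written for Python 3, where str(b, 'gb2312') plays the role of the Python 2 call unicode(s, 'gb2312'); the variable names are illustrative.

```python
gb_bytes = u"\u6c49\u5b57".encode("gb2312")   # the bytes for '汉字'

# Two equivalent ways to turn GB2312 bytes into Unicode text.
via_constructor = str(gb_bytes, "gb2312")     # Python 2: unicode(s, 'gb2312')
via_method = gb_bytes.decode("gb2312")        # same spelling as in Python 2

# Decoding the same bytes as UTF-8 fails, because the encodings are incompatible.
try:
    gb_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(via_constructor == via_method, utf8_ok)
```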
Resources
In-depth analysis of Python Chinese garbled problem
http://www.jb51.net/article/26543.htm
Python character encoding in detail
http://www.cnblogs.com/huxi/archive/2010/12/05/1897271.html
Detailed python Chinese coding and processing
http://my.oschina.net/leejun2005/blog/74430
Encode and decode of Python string--solving garbled problem
http://blog.csdn.net/lxdcyh/article/details/4018054