"R" data import read Read.table function detailed, how to read irregular data (fill=t)

Source: Internet
Author: User

The function read.table is the most convenient way to read a rectangular grid of data. Because there are many situations you might encounter in practice, several variant functions are also provided; they call read.table but change some of its default arguments.

Note that read.table is not an efficient way to read a large numeric matrix: use the scan function for that.

Some of the issues that need to be considered are:

  1. Encoding issues

    If the file contains non-ASCII character fields, make sure it is read in the correct encoding. This is mainly an issue when reading Latin-1 files in a UTF-8 locale, which can be handled with

              read.table(file("file.dat", encoding = "latin1"))

    Note that this works in any locale that can represent Latin-1 strings.

  2. Header line

    We recommend setting the header argument explicitly. By convention, the header line has entries only for the columns and none for the row labels, so it is one field shorter than the remaining lines. (If R sees this, it sets header = TRUE.) If the file has a header entry (possibly empty) for the row labels, read it with

              read.table("file.dat", header = TRUE, row.names = 1)

    Column names can be given explicitly via col.names; explicit names override the header line (if present).
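As a minimal sketch of the two header layouts, the snippet below uses the text argument of read.table so that no file is needed. The sample data and the names short_header, full_header, d_auto and d_explicit are made up for illustration.

```r
# Header has one field fewer than the data lines, so the first
# column is used as row names automatically.
short_header <- c("height weight",
                  "ann 160 55",
                  "bob 175 70")
d_auto <- read.table(text = short_header, header = TRUE)

# Header also has an entry for the row-label column, so the
# row-name column has to be named explicitly.
full_header <- c("name height weight",
                 "ann 160 55",
                 "bob 175 70")
d_explicit <- read.table(text = full_header, header = TRUE, row.names = 1)

rownames(d_auto)       # "ann" "bob"
names(d_explicit)      # "height" "weight"
```

Both calls should give a data frame with two data columns and the names as row labels.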

  3. Separator

    Normally, looking at the file will show which field separator is used, but for whitespace-separated files there is a choice between the default sep = "" (any white space, that is spaces, tabs or newlines, acts as a separator), sep = " " and sep = "\t". Note that the choice of separator affects how quoted strings are read.

    If you have a tab-delimited file containing empty fields, be sure to use sep = "\t".
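The effect of the separator on empty fields can be seen on a single line. This is only a sketch: the sample line and the names fields_ws, fields_tab and d are invented for the example.

```r
line <- "1\t\t3"   # tab-delimited line whose middle field is empty

# Default separator: runs of white space count as one separator,
# so the empty field disappears.
fields_ws <- scan(text = line, what = "", quiet = TRUE)

# Explicit tab separator: the empty field is kept.
fields_tab <- scan(text = line, what = "", sep = "\t", quiet = TRUE)

length(fields_ws)    # 2
length(fields_tab)   # 3

# The same line inside a small tab-delimited table.
d <- read.table(text = c("a\tb\tc", line, "4\t5\t6"),
                header = TRUE, sep = "\t")
is.na(d$b[1])        # TRUE: the empty numeric field is read as NA
```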

  4. Quoting

    By default, character strings can be quoted by either " or ', and in each case all characters up to the matching quote are taken as part of the string. The set of valid quoting characters (which may be empty) is controlled by the quote argument. For sep = "\n" the default is changed to quote = "".

    If no separator is specified, quotes inside a quoted string must be escaped C-style, that is, by preceding them immediately with a backslash \.

    If a separator is specified, quotes inside a quoted string follow the spreadsheet convention and are escaped by doubling them. For example

              'One string isn''t two',"one more"

    can be read by

              read.table("testfile", sep = ",")

    This does not work with the default separator.
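To try the doubled-quote convention, a line of that form can be written to a temporary file and read back. This is a sketch: the temporary file stands in for "testfile", and the name d is chosen for the example.

```r
testfile <- tempfile()

# One line: a single-quoted field containing a doubled quote,
# followed by a double-quoted field.
writeLines("'One string isn''t two',\"one more\"", testfile)

d <- read.table(testfile, sep = ",")

as.character(d$V1)   # "One string isn't two"
as.character(d$V2)   # "one more"

unlink(testfile)
```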

  5. Missing values

    By default, the file is assumed to use the character string NA to represent missing values, but this can be changed with the na.strings argument, a character vector of one or more representations of missing values.

    Empty fields in numeric columns are also regarded as missing values.

    In numeric columns, the values NaN, Inf and -Inf are accepted.
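A short sketch of these rules, assuming a made-up file in which -999 and n/a are used as missing-value codes (the data and the name d are illustrative only):

```r
lines <- c("id,score,grade",
           "1,3.5,A",
           "2,-999,B",    # -999 is a missing-value code here
           "3,,n/a",      # empty numeric field, and n/a as a code
           "4,Inf,C")     # Inf is accepted in a numeric column

d <- read.table(text = lines, header = TRUE, sep = ",",
                na.strings = c("NA", "-999", "n/a"))

is.na(d$score)           # FALSE  TRUE  TRUE FALSE
is.na(d$grade)           # FALSE FALSE  TRUE FALSE
is.infinite(d$score[4])  # TRUE
```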

  6. Lines with trailing empty fields omitted

    Files exported from a spreadsheet often have all trailing empty fields (and their separators) omitted. To read such files, set fill = TRUE.
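The following sketch shows irregular data of this kind being read with fill = TRUE. The sample lines and the names failed and d are invented for the example. Note that read.table works out the number of columns from the first five lines of input, so a longer line that only appears later may require col.names to be given.

```r
lines <- c("a,b,c",
           "1,2,3",
           "4,5",     # trailing field and its separator omitted
           "6")       # two trailing fields omitted

# Without fill = TRUE the short lines are an error.
failed <- inherits(
  try(read.table(text = lines, header = TRUE, sep = ","), silent = TRUE),
  "try-error")

# With fill = TRUE the short lines are padded with missing values.
d <- read.table(text = lines, header = TRUE, sep = ",", fill = TRUE)

failed    # TRUE
nrow(d)   # 3
d$c       # 3 NA NA
```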

  7. White space in character fields

    If a separator is specified, leading and trailing white space in character fields is regarded as part of the field. To strip it, use strip.white = TRUE.
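As a small illustration (the data and the names kept and stripped are hypothetical), the same comma-separated lines can be read with and without strip.white:

```r
lines <- c("name,city",
           " Ann , Oslo ",
           "Bob,Rome")

# White space around the fields is kept by default when sep is set.
kept <- read.table(text = lines, header = TRUE, sep = ",")

# strip.white = TRUE removes leading and trailing white space.
stripped <- read.table(text = lines, header = TRUE, sep = ",",
                       strip.white = TRUE)

as.character(kept$name)[1]       # " Ann "
as.character(stripped$name)[1]   # "Ann"
```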

  8. Blank lines

    By default, read.table ignores blank lines. This can be changed by setting blank.lines.skip = FALSE, which is only useful in conjunction with fill = TRUE, perhaps to use blank lines to indicate missing cases in a regular layout.

  9. Classes for the variables

    Unless you take special action, read.table chooses a suitable class for each variable in the data frame. It tries in turn logical, integer, numeric and complex, moving on if any entry is not missing and cannot be converted. If all of these fail, the variable is converted to a factor (or, since R 4.0.0, where stringsAsFactors defaults to FALSE, left as a character vector).

    The arguments colClasses and as.is provide greater control. as.is suppresses the conversion of character vectors to factors (only). colClasses sets the desired class for each column in the input.

    Note that colClasses and as.is are specified per column, not per variable, and so include the column of row names (if any).
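One common reason to set colClasses is to stop identifiers with leading zeros from being read as numbers. The sketch below uses made-up data; the names auto and typed are chosen for the example.

```r
lines <- c("id zip flag",
           "1 02134 T",
           "2 90210 F")

# Automatic choice: zip looks like an integer, so the leading zero is lost.
auto <- read.table(text = lines, header = TRUE)

# Explicit classes, one per input column.
typed <- read.table(text = lines, header = TRUE,
                    colClasses = c("integer", "character", "logical"))

class(auto$zip)   # "integer"
typed$zip         # "02134" "90210"
```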

  10. Comments

    By default, read.table uses # as a comment character. If it is encountered (except in quoted strings), the rest of the line is ignored. Lines containing only white space and a comment are treated as blank lines.

    If it is known that there will be no comments in the data file, it is safer (and may be faster) to use comment.char = "".
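The safety point matters when # is a legitimate part of the data. A minimal sketch with invented data (the names with_comments and no_comments are illustrative):

```r
lines <- c("id tag",
           "1 a#b",
           "2 c#d")

# Default: everything from # to the end of the line is dropped.
with_comments <- read.table(text = lines, header = TRUE)

# Comment handling switched off: # is an ordinary character.
no_comments <- read.table(text = lines, header = TRUE, comment.char = "")

as.character(with_comments$tag)   # "a" "c"
as.character(no_comments$tag)     # "a#b" "c#d"
```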

  11. Escapes

    Many operating systems have a convention of using backslash as an escape character in text files, but Windows does not (and uses backslash in path names). It is optional in R whether such conventions are applied to data files.

    Both read.table and scan have a logical argument allowEscapes. This has been FALSE by default since R 2.2.0, and backslashes are then only interpreted as escaping quotes (in the circumstances described above). If it is set to TRUE, C-style escapes are interpreted, namely the control characters \a, \b, \f, \n, \r, \t, \v and octal and hexadecimal representations like \040 and \0x2A. Any other escaped character is treated as itself, including backslash.

The convenience functions read.csv and read.delim provide arguments to read.table appropriate for CSV and tab-delimited files exported from spreadsheets in English-speaking locales. The variations read.csv2 and read.delim2 are appropriate for use in countries where the comma is used as the decimal point; read.csv2 also expects semicolons as field separators.
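For instance, a file with decimal commas and semicolon separators can be read either with read.csv2 or with the equivalent read.table arguments. The data below and the names d_csv2 and d_table are made up for this sketch.

```r
lines <- c("x;y",
           "1,5;2",
           "3,25;4")

# Convenience wrapper: sep = ";" and dec = "," are the defaults.
d_csv2 <- read.csv2(text = lines)

# The same settings spelled out for read.table.
d_table <- read.table(text = lines, header = TRUE, sep = ";", dec = ",")

d_csv2$x    # 1.50 3.25
d_table$x   # 1.50 3.25
```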

If the options to read.table are specified incorrectly, the error message will usually be of the form

     Error in scan(file = file, what = what, sep = sep, :
             line 1 did not have 5 elements

or

     Error in read.table("files.dat", header = TRUE) :
             more columns than column names

This may give enough information to find the problem, but the auxiliary function count.fields can be useful to investigate further.
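A sketch of how count.fields might be used to locate the offending lines, assuming a small comma-separated file written to a temporary location (the file contents and the names n_fields and bad_lines are hypothetical):

```r
f <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3",
             "4,5",        # one field short
             "6,7,8,9"),   # one field too many
           f)

# Number of fields found on each line of the file.
n_fields <- count.fields(f, sep = ",")

# Lines whose field count differs from that of the header line.
bad_lines <- which(n_fields != n_fields[1])

n_fields    # 3 3 2 4
bad_lines   # 3 4

unlink(f)
```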

Efficiency can be important when reading large data grids. It will help to specify comment.char = "", to give colClasses as one of the atomic vector types (logical, integer, numeric, complex, character or perhaps raw) for each column, and to give nrows, the number of rows to be read (a mild over-estimate is better than not specifying this at all).
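Put together, these settings might look as follows. This is a sketch on a small generated file; the file, the column names id and value, and the row count are assumptions made for the example, and any speed-up will only be noticeable on much larger inputs.

```r
f <- tempfile()
n <- 1000

# Write a small two-column table to read back.
write.table(data.frame(id = seq_len(n), value = seq_len(n) / 10),
            f, row.names = FALSE)

d <- read.table(f, header = TRUE,
                colClasses = c("integer", "numeric"),  # no type guessing
                nrows = n + 100,                       # mild over-estimate
                comment.char = "")                     # no comment scanning

dim(d)   # 1000 2

unlink(f)
```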

"R" data import read Read.table function detailed, how to read irregular data (fill=t)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.