MaxCompute Tunnel uploading typical problem scenarios

Source: Internet
Author: User
Tags: failover, table definition

Abstract: Describes typical data, network, and billing problems encountered when uploading data with the MaxCompute Tunnel command-line tool. This article is updated over time; feedback is welcome.

Data issues
Q: Using the Tunnel Java SDK to upload data, can the uploaded data be distributed automatically across partitions?
A: Currently Tunnel cannot automatically distribute uploaded data across partitions: each upload session targets a single table or a single partition of a table. For a partitioned table you must specify the target partition, and for a multi-level partitioned table you must specify the last-level partition. For details, refer to the Java SDK documentation.

Q: Using the Tunnel Java SDK to upload data to a partitioned table, can the SDK dynamically create different partitions based on the data?
A: No. Partitions must be created first and then specified when uploading with the SDK. Alternatively, you can upload the data into a staging table on MaxCompute and then distribute it into partitions with a dynamic-partition SQL statement.
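
As a minimal Java SDK sketch of both answers, assuming the odps-sdk-core client library; the credentials, endpoints, and table/partition names are placeholders:

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.account.Account;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class PartitionUploadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials/endpoint -- replace with your own.
        Account account = new AliyunAccount("<accessId>", "<accessKey>");
        Odps odps = new Odps(account);
        odps.setEndpoint("<odps_endpoint>");
        odps.setDefaultProject("<project>");

        // Create the target partition up front; Tunnel will not create it for you.
        // For a multi-level partitioned table, spell out every level down to the last.
        PartitionSpec spec = new PartitionSpec("sale_date='201312',region='hangzhou'");
        odps.tables().get("sale_detail").createPartition(spec, true); // true: skip if it exists

        // Each upload session is bound to exactly one table or partition.
        TableTunnel tunnel = new TableTunnel(odps);
        TableTunnel.UploadSession session =
                tunnel.createUploadSession("<project>", "sale_detail", spec);

        RecordWriter writer = session.openRecordWriter(0); // block 0
        Record record = session.newRecord();
        record.set(0, "sample value"); // illustrative payload
        writer.write(record);
        writer.close();
        session.commit(new Long[] {0L});
    }
}
```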

Q: Why does running the command line tunnel upload D:\test test/pt="time" in DataIDE to upload to a partition report the error: FAILED: error occurred while running tunnel command?
A: tunnel upload is a statement of the MaxCompute Tunnel command-line tool; DataIDE does not support it.

Q: When uploading data with the Tunnel command-line tool, the upload is divided into 50 blocks. Everything starts normally, but at the 22nd block the upload fails, retries 5 times, and then skips directly to uploading the 23rd block. Why does this happen?
A: Block concept: one block corresponds to one HTTP request. Multiple blocks can be uploaded concurrently, and each block is atomic: a single request either fully succeeds or fully fails, without polluting other blocks.
Retransmission is limited to a fixed number of retries; once that limit is exceeded, the tool moves on to the next block. After the upload completes, check for missing data with a select count(*) statement.
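
A sketch of this block model in the Java SDK, assuming an already-created UploadSession (see the previous sketch); the retry budget of 5 mirrors the behavior described above, and the data layout is illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class BlockUploadSketch {
    // One block == one HTTP request; each block succeeds or fails atomically.
    // Only the blocks passed to commit() become visible in the table.
    static void uploadBlocks(TableTunnel.UploadSession session,
                             List<List<Object[]>> blocks) throws Exception {
        List<Long> committed = new ArrayList<>();
        for (long blockId = 0; blockId < blocks.size(); blockId++) {
            int retriesLeft = 5; // illustrative retry budget, as in the answer above
            while (true) {
                try {
                    RecordWriter writer = session.openRecordWriter(blockId);
                    for (Object[] row : blocks.get((int) blockId)) {
                        Record record = session.newRecord();
                        for (int i = 0; i < row.length; i++) {
                            record.set(i, row[i]);
                        }
                        writer.write(record);
                    }
                    writer.close();
                    committed.add(blockId);
                    break; // this block is done
                } catch (Exception e) {
                    if (--retriesLeft == 0) {
                        break; // give up on this block and move on, as the client does
                    }
                }
            }
        }
        session.commit(committed.toArray(new Long[0]));
    }
}
```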

Q: A local server collects 10 GB of logs daily that need to be uploaded to MaxCompute, but the tunnel upload command only reaches about 300 KB/s. How can the upload speed be improved?
A: The tunnel upload command does not impose a rate limit. The bottleneck is network bandwidth and server performance. To improve throughput, consider splitting the table and uploading/downloading the data in parallel from multiple ECS instances.

Q: In a shell script, how can I combine the following two commands into one to upload data from a TXT file to a MaxCompute table? The commands are as follows:
/odpscmd/bin/odpscmd

tunnel upload "$FILE" project.table

A: Refer to the client documentation on command-line startup parameters; the shell start command is:
/odpscmd/bin/odpscmd -e "tunnel upload $FILE project.table"

Q: When uploading data with the tunnel upload command, why does the upload fail if the data contains carriage returns or spaces?
A: If the data contains carriage returns or spaces, choose delimiters that do not appear in the data and specify them with -rd (record delimiter) and -fd (field delimiter). If you cannot replace the separators inside the data, upload each record as a single line and then parse it with a UDF.

For example, the following data contains carriage returns; using "," as the field delimiter (-fd) and "@" as the record delimiter (-rd), it can be uploaded normally:

Data content:

Upload command:

odps@ MaxCompute_DOC>tunnel u d:\data.txt sale_detail/sale_date=201312,region=hangzhou -s false -fd "," -rd "@";

Upload results:
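
For the UDF fallback mentioned above (uploading each record as a single STRING line and parsing it afterwards), a hypothetical minimal MaxCompute Java UDF might look like this; the class name and the in-data separator "@@" are assumptions:

```java
import com.aliyun.odps.udf.UDF;

// Hypothetical helper UDF: the whole record was uploaded as one STRING column,
// and this UDF extracts a single field from it by index.
public class SplitField extends UDF {
    public String evaluate(String line, Long index) {
        if (line == null || index == null) {
            return null;
        }
        String[] parts = line.split("@@"); // assumed in-data separator
        int i = index.intValue();
        return (i >= 0 && i < parts.length) ? parts[i] : null;
    }
}
```

After registering the jar as a resource and creating the function, you could call it in SQL, e.g. select SplitField(raw_line, 0) from t_raw;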

Q: When uploading data with the tunnel upload command using "," as the column separator, how do I handle a description field whose data itself contains commas or "|" symbols?
A: If the description field itself contains commas, convert the data delimiter to another symbol and then specify that symbol with -fd for the upload, for example:

A user posted a requirement to upload Excel data from a Windows environment via the tunnel upload command, where the sheet itself contained ",". First, change the default separator Excel uses when saving CSV files through the Windows settings: on Windows 7, open Control Panel > Clock, Language, and Region, select Change the date, time, or number format, and click Additional settings. Since the original data contains no "$" character, this example sets the list separator to "$", as shown:

When the setting is complete, save the data as a CSV file in Excel, transcode it to UTF-8 (the encoding Tunnel uses by default) with a text editor such as Notepad++, and check that the file delimiter has changed to "$":

Title $ Location $ Salary $ Company $ Company profile link $ Company type $ Company size $ Industry $ Work experience $ Education $ Headcount $ Publish time $ Label $ Job information $ Office address $ Company information $ Page URL $ Collection time

Upload the data with the tunnel upload command, specifying the delimiter (the table must be created on MaxCompute beforehand); the upload succeeds:


Q: The tunnel upload command separates fields with commas by default, but my CSV file is also comma-separated, and a column in the file contains commas enclosed in quotation marks. How should this be handled?
A: Use a different delimiter in the CSV file and specify it with the -fd parameter.

In general, if the data contains many symbols that may conflict with the delimiter, you can customize a delimiter to avoid the conflict, such as #@#@@ or #@#@@$.

Q: Uploading data with the tunnel upload command fails with the memory overflow error java.lang.OutOfMemoryError: Java heap space. What is the reason?

A: The error shows that memory overflowed while uploading the data. The tunnel upload command supports uploading large volumes of data, so if memory overflows, the likely cause is that the record delimiter and field delimiter are set incorrectly, causing the entire file to be treated as a single record, cached in memory, and then split, which produces the overflow error.

In this case, test with a small amount of data first, get -fd and -rd right, and then upload the full data set.

Q: When uploading data with the tunnel upload command, I need to upload multiple data files to one table. Is there a way to script uploading all the data files in a folder?
A: The tunnel upload command supports uploading a single file or a directory (one level deep).

For example, the following command uploads the data in the folder d:\data:

odps@ MaxCompute_DOC>tunnel u d:\data sale_detail/sale_date=201312,region=hangzhou -s false;

For details, see Tunnel command operations.

Q: Importing a folder reports the error "column mismatch", but each file in the folder can be imported on its own. Is it because the files are too large?

A: In this case, add -dbr=false -s true to the upload command to validate the data format. Column mismatch usually occurs because the number of columns is wrong: the most likely cause is that the column delimiter is set incorrectly, or the file ends with a blank line, which the delimiter splits into as many empty columns as there are separators.

Q: Using the tunnel upload command to upload two files, after the first file's command finishes, the second file is not uploaded and no error message is shown; the second upload command is simply not executed. The upload commands are as follows:
D:\odps\bin\odpscmd.bat -e "tunnel upload d:\data1.txt sale_detail/sale_data=201312 -fd="$" -mbr=5 --scan=true;"

D:\odps\bin\odpscmd.bat -e "tunnel upload d:\data2.txt sale_detail/sale_data=201312 -fd="$" -mbr=5 --scan=true;"

A: In old versions of the MaxCompute command-line client, the --scan upload parameter had a parameter-passing problem in resume mode. Remove --scan=true and retry.

Q: Using the tunnel upload command to upload all files in a directory to a table, with automatic partition creation, the command is tunnel upload /data/2018/20180813/*.json app_log/dt=20180813 -fd '@@' -acp true; and it reports an error:
Unrecognized option: -acp
FAILED: error occurred while running tunnel command
A: This error usually occurs because an unsupported option or character was used. The tunnel upload command does not support wildcards or regular expressions.

Q: Uploading file data with the tunnel upload command reports an error. Is there a command like MySQL's -f that forces skipping of bad data and continues uploading?
A: This error occurs because of a data format problem, such as a wrong data type. Refer to Tunnel command operations and use the -dbr true parameter to ignore dirty data (extra columns, missing columns, column data type mismatches, and so on). -dbr defaults to false, meaning dirty data is not ignored; when it is set to true, data that does not conform to the table definition is skipped.

Q: Why does uploading file data with the tunnel upload command report the following error?

A: This error indicates that the session is currently uploading or downloading, so no further operations can be performed on it.

Q: Why does uploading file data with the Tunnel SDK report a duplicate-commit error?

A: The error shows that the problem occurs while closing the writer. There are several possibilities:

The writer being closed has already been closed.
The session the writer belongs to has been closed.
The session has already been committed.

You can rule out these possible causes one by one, for example by printing the current status of the writer and the session.
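
A minimal sketch of the expected writer/session lifecycle that avoids all three situations; the single block id and the record payload are illustrative:

```java
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class WriterLifecycleSketch {
    // Correct order: open writer -> write -> close writer -> commit session.
    // Closing the same writer twice, or writing after session.commit(), causes
    // the duplicate-commit / status errors described above.
    static void uploadOnce(TableTunnel.UploadSession session) throws Exception {
        RecordWriter writer = session.openRecordWriter(0);
        try {
            Record record = session.newRecord();
            record.set(0, "value"); // illustrative payload
            writer.write(record);
        } finally {
            writer.close(); // close exactly once, before commit
        }
        session.commit(new Long[] {0L}); // commit exactly once per session
    }
}
```
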
Q: When uploading data with the Tunnel SDK, after packaging a UDF into a jar for upload, is there a size requirement for the jar?
A: The jar cannot exceed 10 MB. If it does, it is recommended to upload the data with the MaxCompute Tunnel command-line tool instead.

Q: Is there a limit on the size of the data when uploading with the tunnel upload command line?
A: The tunnel upload command line does not limit the size of the data to be uploaded.

Q: When uploading a CSV file with the tunnel upload command line, how do I skip the first row of headers and upload the rest of the data?
A: We recommend skipping the table header with the -h parameter.

Q: Are there any partition restrictions when importing into MaxCompute with the Tunnel batch data channel SDK?
A: When importing into MaxCompute with the Tunnel batch data channel SDK, up to 60,000 partitions are currently supported.

An excessive number of partitions makes statistics and analysis very inconvenient. MaxCompute limits the maximum number of instances in a single job, and the number of instances is closely tied to the amount of input data and the number of partitions, so it is recommended to evaluate the business first and choose an appropriate partitioning strategy to avoid the impact of too many partitions.

For more information on partitioned tables, refer to Partition.

In addition, MaxCompute supports Tunnel batch uploads via the Python SDK; refer to the data upload/download configuration in the Python SDK.

Q: Uploading 80 million records in one go, the ODPS Tunnel RecordWriter.close() fails at the end with the following error:
ErrorCode=StatusConflict, ErrorMessage=You cannot complete the specified operation under the current upload or download status.

A: This error indicates that the session status is wrong; it is recommended to create a new session and re-upload the data. Judging from the error, an earlier operation probably closed or committed the session. Note that a separate session is required for each partition.

To prevent this error from being caused by multiple commits, first check whether the data upload succeeded, and upload again only if it failed. You can refer to the multithreaded upload example.
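
A hedged sketch of that recovery pattern: one session per partition, and on failure the session is recreated rather than reused (the table, partition, retry budget, and payload are placeholders):

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.PartitionSpec;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;
import com.aliyun.odps.tunnel.TunnelException;

public class RetryPerPartitionSketch {
    // Each partition gets its own session. A session that has been committed
    // (or closed) must not be reused -- recreate it and upload again instead.
    static void uploadWithRetry(Odps odps, String project, String table,
                                String partition, int maxAttempts) throws Exception {
        TableTunnel tunnel = new TableTunnel(odps);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            TableTunnel.UploadSession session =
                    tunnel.createUploadSession(project, table, new PartitionSpec(partition));
            try {
                RecordWriter writer = session.openRecordWriter(0);
                Record record = session.newRecord();
                record.set(0, "value"); // illustrative payload
                writer.write(record);
                writer.close();
                session.commit(new Long[] {0L});
                return; // success: this session was committed exactly once
            } catch (TunnelException e) {
                // e.g. StatusConflict: fall through and retry with a *new* session.
            }
        }
        throw new Exception("upload failed after " + maxAttempts + " attempts");
    }
}
```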

Q: How can I use TunnelBufferedWriter to avoid errors when uploading bulk data with the Tunnel SDK?
A: The MaxCompute Java SDK added the BufferedWriter API in version 0.21.3-public, which simplifies data upload and provides fault tolerance.

BufferedWriter hides the block concept from the user: from the user's point of view, you open a writer on the session and write records. Internally, BufferedWriter first caches the records in a client-side buffer and opens an HTTP connection to upload them once the buffer fills up.

BufferedWriter does its best to tolerate faults and ensure the data is uploaded. For instructions, refer to the BufferedWriter usage guide.
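
A minimal usage sketch, assuming the UploadSession.openBufferedWriter() entry point that accompanies BufferedWriter; the project and table names and the payload are placeholders:

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.data.RecordWriter;
import com.aliyun.odps.tunnel.TableTunnel;

public class BufferedWriterSketch {
    // With the buffered writer there are no explicit block ids: records are
    // buffered client-side and flushed over HTTP as the buffer fills.
    static void upload(Odps odps, String project, String table) throws Exception {
        TableTunnel tunnel = new TableTunnel(odps);
        TableTunnel.UploadSession session = tunnel.createUploadSession(project, table);

        RecordWriter writer = session.openBufferedWriter();
        for (int i = 0; i < 1000; i++) {
            Record record = session.newRecord();
            record.set(0, "row-" + i); // illustrative payload
            writer.write(record);
        }
        writer.close();
        session.commit(); // no block list needed in buffered mode
    }
}
```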

Q: After a CSV file is successfully imported with the tunnel upload command line, why has a large part of the original content disappeared, replaced by "-"?
A: This is probably because data in the wrong encoding format was uploaded to the table, or the delimiter was used incorrectly. It is recommended to normalize the original data and then upload it.

Q: Does uploading with the tunnel upload command line support referencing a table's configuration?
A: You can run the tunnel upload command from a shell script: execute the script through /odpscmd/bin/odpscmd -e and include the table configuration within the script.

Q: When uploading data with the Tunnel SDK, select queries afterwards are often slow and SQL statements perform poorly.
A similar situation may be caused by too many small files in MaxCompute, which hurts performance. How do I deal with too many small files?
A:
Causes of small files:

The distributed file system used by MaxCompute stores data in blocks; files smaller than one block (the default block size is 64 MB) are called small files.

Currently MaxCompute produces small files in the following scenarios:

The reduce stage of a computation produces a large number of small files;
Small files are generated during Tunnel data collection;
Various temporary files generated during job execution and expired files retained by the recycle bin, mainly categorized as:
TABLE_BACKUP: tables in the recycle bin that have exceeded the retention period
FUXI_JOB_TMP: temporary directories for job runs
TMP_TABLE: temporary tables generated during job runs
INSTANCE: logs kept in the meta table while jobs run
LIFECYCLE: data tables or partitions that have exceeded their lifecycle
INSTANCEPROFILE: profile information recorded after job submission and execution
VOLUME_TMP: data with no meta information but with paths on Pangu
TEMPRESOURCE: one-time temporary resource files used by user-defined functions
FAILOVER: temporary files retained when a system failover occurs
Too many small files has the following effects:

It hurts map instance performance: by default one small file corresponds to one instance, so too many small files waste resources and degrade overall execution performance.
It puts pressure on the distributed file system and reduces effective disk utilization; in severe cases the file system becomes unavailable.
Command to view the number of small files in a table:

desc extended <table_name>;

How to handle small files
Small files produced for different reasons require different handling:

(1) Small files generated during the reduce stage

Use insert overwrite on the source table (or partition), or write to a new table and delete the source table.

(2) Small files generated during Tunnel data collection

When calling the Tunnel SDK, submit once the buffer reaches 64 MB;
When using the console, avoid frequently uploading small files; accumulate data and upload it in one batch;
If importing into a partitioned table, set a lifecycle on the partitions so that expired data is cleaned up automatically;
insert overwrite the source table (or partition);
ALTER merge mode, merging via the command line:
set odps.merge.cross.paths=true;
set odps.merge.max.partition.count=100; --10 partitions are optimized by default; here it is set to optimize 100 partitions.
ALTER TABLE tablename [PARTITION] MERGE SMALLFILES;

(3) Temporary tables

Add a lifecycle when creating a temporary table so that it is garbage-collected automatically after it expires.

Network problems
Q: Why does uploading data with the tunnel upload command report the error java.io.IOException: Error writing request body to server?
A: This exception occurs while uploading data to the server, and is usually caused by the network link disconnecting or timing out during the upload:

The data source may not be a local file but may need to be fetched from somewhere such as a database, so the write stalls waiting for data acquisition and times out. Currently, if an UploadSession receives no data for 600 seconds during an upload, it is considered timed out.
The upload goes through the public-network endpoint, and unstable public network quality causes the timeout.
Workarounds:

Fetch the data first, and only then call the Tunnel SDK to upload it.
A block can carry 64 MB to 1 GB of data, preferably no more than 10,000 records, to avoid timeouts caused by retries. A session can hold at most 20,000 blocks.

If the data is on an ECS instance, configure the appropriate endpoint per Access domains and data centers to speed up the upload and save money.
Q: Uploading data with the tunnel upload command line, the classic-network endpoint is configured, but why does it still connect to the public-network Tunnel endpoint?
A: In the configuration file odps_config.ini, tunnel_endpoint must be configured in addition to endpoint. Refer to Access domains and data centers for the configuration. Currently only the Shanghai region does not require setting the Tunnel endpoint.
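
For SDK users there is an analogous setting: the Tunnel endpoint can be set explicitly on TableTunnel, as in this sketch (all endpoints and credentials are placeholders):

```java
import com.aliyun.odps.Odps;
import com.aliyun.odps.account.AliyunAccount;
import com.aliyun.odps.tunnel.TableTunnel;

public class TunnelEndpointSketch {
    static TableTunnel buildTunnel() {
        Odps odps = new Odps(new AliyunAccount("<accessId>", "<accessKey>"));
        odps.setEndpoint("<odps_endpoint>"); // service endpoint
        odps.setDefaultProject("<project>");

        TableTunnel tunnel = new TableTunnel(odps);
        // Without this, traffic may go through the public-network Tunnel
        // endpoint; set the classic-network/intranet endpoint explicitly.
        tunnel.setEndpoint("<tunnel_endpoint>");
        return tunnel;
    }
}
```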

Q: Does uploading data with the tunnel upload command line support rate limiting? Uploading too fast takes up too much of the server's I/O capacity.
A: Currently the tunnel upload command line does not support rate limiting; this needs to be handled separately through the SDK.

Billing issues
Q: Bandwidth is charged when uploading data with the tunnel upload command line. Is it billed before or after the data is compressed?
A: Billing is based on the bandwidth after Tunnel compresses the data.

Original link

This article is original content of the Alibaba Cloud Yunqi community and may not be reproduced without permission.
