Hadoop O&M Record Series (6)


In the past two days, several business departments reported that their Hive jobs over the logs of one particular day kept failing. Looking at the logs, the errors differed, but they were basically all OOM. It took me a whole afternoon to track down the cause, so I want to write it down here.


The problem was strange: the OOM happened in the map stage. Each map task corresponds to one data block and holds at most a block's worth of data in memory, so how could it overflow? Normally it is the reduce stage that overflows. The first step was to look at the error logs reported by the TaskTracker on each Hadoop node.

Node 1

20:59:06,467 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
    at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:231)
    at sun.nio.cs.StreamDecoder.<init>(StreamDecoder.java:211)
    at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:50)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:57)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:211)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
    at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
    at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
    at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


Node 2

20:56:37,012 INFO org.apache.hadoop.mapred.Task: Communication exception: java.lang.OutOfMemoryError: Java heap space
    at java.io.BufferedReader.<init>(BufferedReader.java:80)
    at java.io.BufferedReader.<init>(BufferedReader.java:91)
    at org.apache.hadoop.util.ProcfsBasedProcessTree.constructProcessInfo(ProcfsBasedProcessTree.java:396)
    at org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree(ProcfsBasedProcessTree.java:151)
    at org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues(LinuxResourceCalculatorPlugin.java:401)
    at org.apache.hadoop.mapred.Task.updateResourceCounters(Task.java:808)
    at org.apache.hadoop.mapred.Task.updateCounters(Task.java:830)
    at org.apache.hadoop.mapred.Task.access$600(Task.java:66)
    at org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:666)
    at java.lang.Thread.run(Thread.java:662)


Node 3

21:02:26,489 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
    at com.sun.org.apache.xerces.internal.xni.XMLString.toString(XMLString.java:185)
    at com.sun.org.apache.xerces.internal.parsers.AbstractDOMParser.characters(AbstractDOMParser.java:1185)
    at com.sun.org.apache.xerces.internal.xinclude.XIncludeHandler.characters(XIncludeHandler.java:1085)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:464)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:232)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1168)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1119)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1063)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:416)
    at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:521)
    at org.apache.hadoop.io.nativeio.NativeIO.ensureInitialized(NativeIO.java:120)
    at org.apache.hadoop.io.nativeio.NativeIO.getOwner(NativeIO.java:103)
    at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:116)
    at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java:191)
    at org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTruncater.java:342)
    at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:134)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)


The error messages still differ from node to node, which is a headache. Further down, however, the logs look alike: in one way or another every node ends up reporting a seemingly insignificant file error.

20:58:47,568 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists
20:58:47,569 WARN org.apache.hadoop.mapred.Child: Error running child
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists
    at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:167)
    at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:312)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:385)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:257)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: EEXIST: File exists
    at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
    at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:161)
    ... 7 more


This last log entry is identical on every node. After a map task hits the OOM, the TaskTracker tries to rerun the map and re-create a file handle for the same TaskLog; but the previous attempt had already created that file, and because it was killed by the OOM the handle was never closed properly, so the retry is terminated with a "file already exists" write failure. The relevant source code is as follows:


/**
 * Open the specified File for write access, ensuring that it does not exist.
 * @param f the file that we want to create
 * @param permissions we want to have on the file (if security is enabled)
 *
 * @throws AlreadyExistsException if the file already exists
 * @throws IOException if any other error occurred
 */
public static FileOutputStream createForWrite(File f, int permissions)
    throws IOException {
  if (skipSecurity) {
    return insecureCreateForWrite(f, permissions);
  } else {
    // Use the native wrapper around open(2)
    try {
      FileDescriptor fd = NativeIO.open(f.getAbsolutePath(),
        NativeIO.O_WRONLY | NativeIO.O_CREAT | NativeIO.O_EXCL,
        permissions);
      return new FileOutputStream(fd);
    } catch (NativeIOException nioe) {
      if (nioe.getErrno() == Errno.EEXIST) {
        throw new AlreadyExistsException(nioe);
      }
      throw nioe;
    }
  }
}




Debugging this was a bit tricky. The map stage is normally not prone to overflow; when it does overflow, it is most likely in the sort and spill phase, i.e., during the quicksort of the circular memory buffer whose size is controlled by io.sort.mb. That buffer size had already been set explicitly, and mapred.map.child.java.opts was set to 1.5 GB, which should be plenty. On top of that, the overflow errors differed from node to node, though on nodes 1 and 2 they came down to heap allocation and buffered read/write problems.
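For reference, these settings can be checked or overridden per Hive session roughly as below. This is only a sketch: the io.sort.mb value is a placeholder rather than the cluster's actual setting, and the parameter names are the old MRv1 ones used here.

-- Placeholder values; only the ~1.5 GB map heap is taken from the text above
SET io.sort.mb=512;                        -- size (MB) of the circular map-side sort buffer (placeholder)
SET mapred.map.child.java.opts=-Xmx1536m;  -- roughly 1.5 GB heap for each map task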


Going back to the HiveQL itself: it is very simple, just a count(distinct uid) plus a group by on a few fields, so the problem seemed to lie with the uid. The HQL queried one specific day in April. I tried selecting a day in March instead: fine. A day in May: also fine, even though May's logs are certainly bigger than April's. That rules out data volume; a single day only produces on the order of a hundred GB of logs, so this could not be a map-stage OOM caused by too much data.
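The failing query had roughly this shape (table, column, and partition names are hypothetical; only the count(distinct uid) plus group by structure comes from the description above):

-- Hypothetical reconstruction of the failing query's shape
SELECT dim_a, dim_b, COUNT(DISTINCT uid) AS uv
FROM access_log                       -- hypothetical table name
WHERE dt = '${hiveconf:target_day}'   -- the problematic day in April (exact date not given)
GROUP BY dim_a, dim_b;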


Continuing to narrow things down, the turning point arrived: selecting the day before was fine, and the day after was fine too, so the problem had to be in that one day's log. A select distinct length(uid) over it showed one uid of 71 bytes, whereas normally they are all 32 bytes. Excluding that uid, i.e., adding where length(uid) != 71, made the same group by statement as in the original HQL run normally right away, while rerunning the original HQL still failed. So that uid was the suspect.
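A sketch of those two checks, using the same hypothetical table and partition names as above:

-- 1. Look at the distribution of uid lengths on the problematic day
SELECT DISTINCT LENGTH(uid)
FROM access_log
WHERE dt = '${hiveconf:target_day}';

-- 2. Rerun the aggregation with the 71-byte uid excluded
SELECT dim_a, dim_b, COUNT(DISTINCT uid) AS uv
FROM access_log
WHERE dt = '${hiveconf:target_day}'
  AND LENGTH(uid) != 71
GROUP BY dim_a, dim_b;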


I then wrote a separate select * from xxx where length(uid) = 71 to pull out this uid, and it turned out to contain special characters; that was the ghost. Most likely nginx hit a TCP error while receiving that log line and the data-cleaning step did not strip it. Because of the special characters, the data stream Hadoop reads during the local map may contain stray EOF or NOP bytes, so it can neither be read with BufferedReader nor decoded with StreamDecoder; this carries over to the map when it executes the local task, and the InputFormat specified in the job.xml under mapred.local.dir cannot process the record correctly. To put it bluntly, the record cannot be handled normally, and the task eventually overflows.
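The extraction query, again with the hypothetical table name, plus an optional regex check for non-printable characters (the regex check is an extra assumption, not something the original analysis used):

-- Pull out the raw record(s) carrying the abnormal 71-byte uid
SELECT *
FROM access_log
WHERE dt = '${hiveconf:target_day}'
  AND LENGTH(uid) = 71;

-- Optional: confirm the uid contains characters outside the printable ASCII range
SELECT uid
FROM access_log
WHERE dt = '${hiveconf:target_day}'
  AND uid RLIKE '[^ -~]';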


This problem is tricky, yet it is probably also quite common: log cleansing can never be 100% correct. A single line of special characters in the log cost me an entire afternoon. With more than a hundred GB of logs per day you cannot inspect them by eye, and grinding through them with awk and regular expressions is not practical either, so you have to find a workaround; in the end I used Hive to locate the bad line. For Hadoop cluster O&M you really have to understand both Hadoop internals and the business. Fortunately, before taking over cluster O&M I had written map/reduce jobs myself, so I have a bit of a programming background and know some of the business, which makes bugs like this easier to analyze. Still, I have to say it out loud: special characters are a huge pitfall, and the cleaning program needs extra filter conditions for them.
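As a sketch of the kind of extra filter the cleaning step could apply (it assumes, per the analysis above, that a valid uid is exactly 32 bytes; the printable-character check and the table name are additional assumptions):

-- Keep only records whose uid has the expected length and no non-printable characters
SELECT *
FROM access_log_raw                    -- hypothetical staging table
WHERE LENGTH(uid) = 32                 -- normal uid length, per the analysis above
  AND NOT (uid RLIKE '[^ -~]');        -- drop uids containing special/non-printable characters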


One bad line among hundreds of millions: a classic case of a single mouse dropping spoiling the whole pot of soup.


Incidentally, the quote blocks in 51cto's new editor seem hard to use: I put the logs inside quote blocks, but no formatting appears to be applied.


This article was originally posted on the "Practice Tests Truth" blog. Please do not reproduce it without permission.
