Hadoop has a "Recycle Bin" (trash) feature for recovering files that were deleted within a recent window of time. If the same file has been deleted more than once, you can even restore the copy from a specific deletion. The feature is off by default; to turn it on, add the following configuration to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>fs.trash.interval</name>
  <value>10</value>
</property>
With this configuration, Hadoop enables the Recycle Bin and empties it every 10 minutes (the value of fs.trash.interval is given in minutes).
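To check that the setting has taken effect, here is a minimal sketch (the /data/fruit.data path and the hadoop user name are placeholders, and the exact command output can vary by Hadoop version): delete a file and verify that it is moved into the trash rather than removed outright.

```sh
# Delete a file; with fs.trash.interval > 0 it is moved into the trash
# instead of being removed immediately.
hdfs dfs -rm /data/fruit.data
# The command should print something along the lines of:
#   Moved: 'hdfs://<namenode>/data/fruit.data' to trash at: hdfs://<namenode>/user/hadoop/.Trash/Current

# The deleted file sits under .Trash/Current, mirroring its original path.
hdfs dfs -ls /user/hadoop/.Trash/Current/data
```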
If you delete the same file or directory several times within one collection cycle, every deleted copy is kept in the trash, so you can recover the copy that was deleted at a particular point in time.
As an example:
| Point in time | Action | Trash content |
| --- | --- | --- |
| 12:40 | Recycle Bin emptied | Empty |
| 12:41 | Delete fruit.data | fruit.data |
| 12:42 | Re-upload fruit.data and delete it again | fruit.data, fruit.data1446352935186 |
| 12:45 | Re-upload fruit.data and delete it again | fruit.data, fruit.data1446352935186, fruit.data1446353100390 |
| 12:50 | Recycle Bin emptied | Empty |
According to the table above, when fruit.data is deleted for the second time at 12:42, it appears in the Recycle Bin as fruit.data1446352935186; the number appended to the name is the timestamp of that deletion. So, before the Recycle Bin is emptied at 12:50, we can recover the copy deleted at 12:41 or at 12:45 (or any deletion in between).
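As a sketch of the recovery step (again with placeholder paths and the hadoop user), the trash can be listed to find the timestamped copies, and `hdfs dfs -mv` puts the chosen one back:

```sh
# The trash keeps every copy deleted within the cycle; later deletions of a
# file with the same name get a millisecond-timestamp suffix.
hdfs dfs -ls /user/hadoop/.Trash/Current/data
#   .../data/fruit.data                  <- deleted at 12:41
#   .../data/fruit.data1446352935186     <- deleted at 12:42
#   .../data/fruit.data1446353100390     <- deleted at 12:45

# Restore the copy deleted at 12:42 by moving it back and renaming it.
hdfs dfs -mv /user/hadoop/.Trash/Current/data/fruit.data1446352935186 /data/fruit.data
```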
This is especially handy in combination with Hive. Scheduled jobs frequently insert or overwrite data in Hive tables, so the trash ends up holding many versions of a table's files. If you want to look at a table's data as it was at a certain moment, restoring from the trash the copy that was deleted at that time is particularly useful.
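A hedged sketch of that Hive scenario, assuming a default warehouse layout under /user/hive/warehouse, a hypothetical mydb.db/fruit_table table, and jobs that replace the table's files through the normal HDFS delete path so the old files land in the trash:

```sh
# List the versions of the table's files that have accumulated in the trash.
hdfs dfs -ls /user/hadoop/.Trash/Current/user/hive/warehouse/mydb.db/fruit_table*

# Copy the version that was deleted at the moment of interest back into the
# table directory (or into a scratch location) so it can be queried again.
hdfs dfs -cp /user/hadoop/.Trash/Current/user/hive/warehouse/mydb.db/fruit_table1446352935186/* \
    /user/hive/warehouse/mydb.db/fruit_table/
```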