
Restoring files in bulk from AWS S3 Glacier Deep Archive

Last Updated: 9/7/22

Fun With AWS (Amazon Web Services) << That's a joke

I do not have time to create a full write-up right now, but if you move 10 TB of data into AWS S3 Glacier Deep Archive and then want to move it all to a different S3 or S3 Glacier tier, the process is not straightforward or simple. It took a lot of testing, research, and many AWS support tickets, and there is no single place where it is all documented. All I have time for right now is an outline of the process. I used the AWS DataSync service to get my data into Glacier Deep Archive over about a week. Then my company wanted it all back. (Note: I should have used AWS S3 Storage Gateway instead of DataSync, because Storage Gateway can preserve NTFS security ACLs, up to 10 per file.)

 

Restore files in bulk from S3 Glacier Deep Archive and move them to the S3 Glacier Instant Retrieval tier:

* Keep in mind that files under 40 KB cannot be stored in Glacier Deep Archive and are automatically placed in the S3 Standard tier instead. As an example, you might end up with 4+ million objects in Deep Archive and another 4+ million in S3 Standard in the same bucket. (A quick way to check how your own bucket splits across storage classes is sketched below.)
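
If you want a quick sanity check of that split before committing to the inventory-report workflow below, a small boto3 (Python) sketch like the following works; the bucket name is a placeholder, and on a bucket with millions of objects the inventory report further down is the much cheaper and faster tool.

    # Sketch: count objects and bytes per storage class by paging through
    # list_objects_v2. The bucket name is a placeholder.
    from collections import Counter

    import boto3

    s3 = boto3.client("s3")
    counts, total_bytes = Counter(), Counter()

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-archive-bucket"):
        for obj in page.get("Contents", []):
            counts[obj["StorageClass"]] += 1
            total_bytes[obj["StorageClass"]] += obj["Size"]

    for storage_class in counts:
        print(f"{storage_class}: {counts[storage_class]} objects, "
              f"{total_bytes[storage_class] / 1024**3:.1f} GiB")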

 

If your situation is anything like mine, you get to learn the following AWS skills if you do not already have them: AWS S3, S3 Glacier Deep Archive restores, S3 Batch Operations, AWS Athena, AWS Lambda functions, and maybe more (AWS CloudTrail, etc.).




Inventory Report:

  • Locate your archive bucket in S3 and enable a full inventory report. [This can take anywhere from a few hours to 48 hours to run.] There is a way to get an email notification when the report is done. (A scripted version of this step is sketched after this list.)
  • Once the inventory report is done, it contains a list of all files (objects) in your S3 bucket. This includes all Deep Archive items and any other items in the bucket that did not get stored in Deep Archive.
  • Note: The inventory report is a small JSON file (manifest.json) that references multiple csv.gz files containing the actual compressed CSV object list. This format is accepted as input for S3 Batch Operations jobs even though the file you select is a tiny JSON file.
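
If you would rather script the inventory report than click through the console, here is a minimal boto3 sketch of the same thing; the bucket names, report ID, and destination prefix are placeholders, and the equivalent settings live under the bucket's Management > Inventory configurations in the console.

    # Sketch: enable a daily CSV inventory report on the archive bucket.
    # All names below are placeholders.
    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_inventory_configuration(
        Bucket="my-archive-bucket",          # bucket holding the Deep Archive objects
        Id="full-inventory",
        InventoryConfiguration={
            "Id": "full-inventory",
            "IsEnabled": True,
            "IncludedObjectVersions": "Current",
            "Schedule": {"Frequency": "Daily"},
            # StorageClass is the field the Athena query later filters on.
            "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"],
            "Destination": {
                "S3BucketDestination": {
                    "Bucket": "arn:aws:s3:::my-inventory-reports",  # destination bucket ARN
                    "Format": "CSV",
                    "Prefix": "inventory",
                }
            },
        },
    )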

Filtering the Inventory Report with Athena:

  • Next, we want to leverage S3 Batch Operations to create a job from the bucket inventory report. The key word here is "want". You cannot go to that step yet, even though AWS support will direct you there. If you do, the job will fail, because the report contains more than just Deep Archive objects and the list must be 100% pure (objects all in the same tier) for the batch restore job. Trust me.
  • The actual next step is to import the S3 bucket inventory report into AWS Athena. AWS support can provide a nice article (ok link1, better link2) on how to create a table from the inventory data fairly easily. Make sure to set your query output bucket in the Athena settings.
  • After the table is created in Athena, use a query to select only the Deep Archive objects: select bucket, key from your_inventory_table where storage_class = 'DEEP_ARCHIVE' (Athena string literals take single quotes, not double quotes), or, if the columns got shifted as mine did, filter on whichever column actually ended up holding the storage class, e.g. select bucket, key from your_inventory_table where encryption_status = 'DEEP_ARCHIVE'. (Read this link about how to format your query so you can avoid the extra step of removing the CSV headers. A scripted version of this query is sketched after this list.)
  • Execute the Athena query.
  • CSV headers: If the links above about CSV headers worked, you can skip this step. If you still need to remove the CSV headers, download the output CSV file to your computer, open it in Notepad++ (note the roughly 2 GB file size limit in that program), and remove the header line at the top: "bucket","key". Save the file and upload it back to your S3 bucket. You are now ready for the next step.
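
For reference, the same query can be kicked off with boto3 instead of the Athena console. This is only a sketch: the database, table, and output location are placeholders, and the column name may differ if your table ended up with shifted columns like mine did.

    # Sketch: run the DEEP_ARCHIVE filter query in Athena and wait for the CSV.
    # Database, table, and output location are placeholders.
    import time

    import boto3

    athena = boto3.client("athena")

    QUERY = """
    SELECT bucket, key
    FROM s3_inventory_table
    WHERE storage_class = 'DEEP_ARCHIVE'
    """

    started = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/restore-list/"},
    )

    query_id = started["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            print(query_id, state)   # the result CSV lands in the OutputLocation above
            break
        time.sleep(5)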

The Actual Data Restore:




  • Create an S3 Batch Operations job with the "Restore" operation to restore the files from Deep Archive, using your freshly prepared 2-column CSV file (3 columns if you are including version IDs). Make sure to pick a reasonable number of days to keep the restored copies available; you need at least enough time to run another job that touches every file in the next step. A couple of days should be enough, but allow time for weekends and for errors in your jobs. I recommend at least 7 days unless you know your data size really well. (A scripted version of this job is sketched after this list.)
  • Next step: use another S3 Batch Operations job (with the same CSV input file) to copy all the objects from their temporary restored state to a permanent home in the same or another bucket. Make sure to pick the correct destination storage class. In my case that was S3 Glacier Instant Retrieval.
  • When your copy job finishes, you will likely have some failures. This is expected for any object over 5 GB: there is a 5 GB limit on a single S3 copy operation. You might be saying, "WTF!? Does AWS not realize people sometimes have files over 5 GB?" I know I did. Do not worry: S3 does let you copy objects over 5 GB, you just cannot do it with the native copy command. You have to use another AWS product, AWS Lambda functions, and there is an AWS blog post all about how to do this. Why can they not do this magic in the background when you call the native S3 copy command? I do not know. They want your money, and IT is never simple; you should know that by now. UPDATE: Apparently they do the magic for you if you use the CLI. If you have a small number of files over 5 GB, you can use the AWS CLI to run the copy command below (a scripted equivalent is sketched after this list). If you have a lot of files over 5 GB, you may need to build a CLI script or use the blog post I linked to.
    • aws s3 cp "s3://bucket/path" "s3://bucket/path" --storage-class GLACIER_IR
    • Note: if you are using a CSV inventory report from the S3 bucket as I did, the object keys are URL-encoded, so you need to decode the paths: replace all "+" signs with a space and then run a URL-decode utility to convert "%2B" and the like back to the correct characters. I used the MIME Tools plugin in Notepad++ to run the URL decode; it handles everything well, you just have to do the "+" to space conversion yourself with a replace-all. (The decoding is also covered in the sketch after this list.)
  • You may still have small files (< 40 KB) sitting in the Standard tier of your bucket. If those need to go somewhere else, use another S3 Batch Operations job to copy them to the correct storage class. Use the skills above to dump a CSV from Athena containing only the S3 Standard objects, remove the header from the CSV, and use it as input for the S3 Batch Operations job.
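
For reference, here is roughly what the "Restore" Batch Operations job from the first bullet above looks like when scripted with boto3 instead of the console wizard. This is a sketch, not a drop-in: the account ID, role ARN, bucket names, and manifest key are placeholders, and the role needs the usual Batch Operations trust policy plus s3:RestoreObject permissions.

    # Sketch: create the bulk-restore Batch Operations job from the cleaned
    # 2-column (bucket,key) CSV. All IDs, ARNs, and names are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3control = boto3.client("s3control")

    ACCOUNT_ID = "111111111111"
    MANIFEST_BUCKET = "my-inventory-reports"
    MANIFEST_KEY = "restore-list/deep_archive_keys.csv"

    # Batch Operations wants the manifest object's ETag.
    etag = s3.head_object(Bucket=MANIFEST_BUCKET, Key=MANIFEST_KEY)["ETag"].strip('"')

    job = s3control.create_job(
        AccountId=ACCOUNT_ID,
        ConfirmationRequired=False,
        Priority=10,
        RoleArn=f"arn:aws:iam::{ACCOUNT_ID}:role/s3-batch-ops-role",
        Description="Bulk restore from Deep Archive",
        Operation={
            "S3InitiateRestoreObject": {
                "ExpirationInDays": 7,      # keep restored copies long enough for the copy job
                "GlacierJobTier": "BULK",   # cheapest; use STANDARD if you are in a hurry
            }
        },
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],   # add "VersionId" if your CSV has 3 columns
            },
            "Location": {
                "ObjectArn": f"arn:aws:s3:::{MANIFEST_BUCKET}/{MANIFEST_KEY}",
                "ETag": etag,
            },
        },
        Report={
            "Enabled": True,
            "Bucket": f"arn:aws:s3:::{MANIFEST_BUCKET}",
            "Format": "Report_CSV_20180820",
            "Prefix": "batch-reports/restore",
            "ReportScope": "FailedTasksOnly",
        },
    )
    print(job["JobId"])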

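And for the objects over 5 GB (and the URL-encoded keys that come out of the inventory CSV), here is a hedged boto3 sketch as an alternative to the Lambda route; boto3's managed copy switches to multipart copy automatically, which is the same reason the plain CLI command above works. Bucket names and the example key are placeholders, and source and destination can be the same object if you only want to change its storage class.

    # Sketch: decode an inventory-report key and copy the (restored) object to
    # S3 Glacier Instant Retrieval. Works past the 5 GB CopyObject limit because
    # boto3's managed copy falls back to multipart copy. Names are placeholders.
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    SOURCE_BUCKET = "my-archive-bucket"
    DEST_BUCKET = "my-archive-bucket"     # same bucket or a different one

    # unquote_plus does both the '+' -> space replacement and the %XX decoding.
    encoded_key = "some+folder/big%2Bfile.vhdx"   # example value from the CSV
    key = urllib.parse.unquote_plus(encoded_key)

    s3.copy(
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        Bucket=DEST_BUCKET,
        Key=key,
        ExtraArgs={"StorageClass": "GLACIER_IR"},
    )
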
You should be done now. You are now an expert on AWS and need to update your resume. Also, expect your next AWS bill to have all sorts of interesting things on it.




