I recently had a problem to solve: removing hundreds of thousands of files from Amazon S3. It had to be a common problem, right? Well, it may well be a common problem, but the solution was less than common.
I tried a few of the available tools. The tool from the Amazon S3 site kept erroring out, and I was never sure why. I then tried third-party tools for managing S3 buckets, but they either errored out or behaved as if they had worked when, as I later determined, they had done nothing.
I posted my need on Twitter and was pointed to a solution I had not thought of (thanks @Kishfy): use Ruby. There is a great open-source project named S3Nukem whose sole purpose is to remove Amazon S3 buckets.
S3Nukem
This is an open-source project hosted on GitHub. Installation and setup are pretty simple (from the GitHub repo readme). First, install the required gems.
For Ruby >= 1.9:
sudo gem install dmarkow-right_aws --source http://gems.github.com
The docs don't mention it, but I also needed to install the right_http_connection gem; the above command fails unless it is installed.
For Ruby < 1.9:
sudo gem install right_aws
Install S3Nukem:
curl -O http://github.com/lathanh/s3nukem/raw/master/s3nukem
Make it executable:
chmod 755 s3nukem
Run this in the same directory where the curl command above was executed.
Usage:
Usage: ./s3nukem [options] buckets...
Options:
-a, --access ACCESS Amazon Access Key (required)
-s, --secret SECRET Amazon Secret Key (required)
-t, --threads COUNT Number of simultaneous threads (default 10)
-h, --help Show this message
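For readers curious how a Ruby script might accept options shaped like these, here is a minimal sketch using the standard library's OptionParser. It is an illustration only, not S3Nukem's actual source: the helper name parse_nuke_args and the example key and bucket values are made up.

```ruby
require 'optparse'

# Hypothetical helper, not taken from S3Nukem: parses an argument list
# shaped like the usage text above and returns [options, bucket_names].
def parse_nuke_args(argv)
  options = { threads: 10 } # default thread count, per the usage text

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ./s3nukem [options] buckets..."
    opts.on("-a", "--access ACCESS", "Amazon Access Key (required)") { |v| options[:access] = v }
    opts.on("-s", "--secret SECRET", "Amazon Secret Key (required)") { |v| options[:secret] = v }
    opts.on("-t", "--threads COUNT", Integer, "Number of simultaneous threads (default 10)") { |v| options[:threads] = v }
    # OptionParser supplies -h/--help on its own.
  end

  # parse (without the bang) leaves argv untouched and returns the leftover
  # arguments, which here are the bucket names.
  buckets = parser.parse(argv)
  unless options[:access] && options[:secret]
    raise ArgumentError, "access and secret keys are required"
  end
  [options, buckets]
end

# Made-up values, for illustration only.
options, buckets = parse_nuke_args(%w[-a EXAMPLEACCESS -s EXAMPLESECRET -t 20 my-old-bucket])
puts "threads: #{options[:threads]}, buckets: #{buckets.join(', ')}"
```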
Running the application in a terminal window shows large numbers of files being deleted.
This script is fast. I tried running it under both Ruby 1.8.7 and 1.9.2, and 1.9.2 was quite a bit faster. I didn't run any benchmarks, but the difference was noticeable; my goal was really just to delete a large number of files. Ruby 1.9.2's thread handling really shines here, and the ability to control the number of threads from the command line is really nice.
The nice thing about this version of the script is the cap on the number of items to be deleted at a time: 1000 * thread_count, where thread_count defaults to 10 (so 10,000 items by default). With this limit in place, the script won't chew up all your system memory.
This script is designed to delete an entire bucket, but it could be modified to remove just the contents or a directory tree within the bucket. I may do this for a project I am working on that needs such functionality.
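One common way to get this kind of cap in Ruby is a bounded queue shared by a pool of worker threads. The sketch below shows the general idea under my own assumptions; it is not S3Nukem's source. The method name drain_with_bounded_queue and its parameters are hypothetical, the block stands in for a real S3 delete call, and the keyword arguments mean it needs a newer Ruby than the 1.8.7 and 1.9.2 versions mentioned above.

```ruby
# Sketch of a bounded work queue feeding a pool of threads.
# Not S3Nukem's source: drain_with_bounded_queue and its parameters are
# hypothetical, and the block stands in for a real S3 delete call.
def drain_with_bounded_queue(keys, thread_count: 10, per_thread_cap: 1000, &delete_one)
  # SizedQueue#push blocks once the queue holds this many items,
  # so memory use stays bounded however many keys there are.
  queue = SizedQueue.new(per_thread_cap * thread_count)

  workers = Array.new(thread_count) do
    Thread.new do
      # A nil item tells the worker to stop.
      while (key = queue.pop)
        delete_one.call(key)
      end
    end
  end

  keys.each { |key| queue.push(key) }     # producer side, blocks at the cap
  thread_count.times { queue.push(nil) }  # one stop signal per worker
  workers.each(&:join)
end

# Usage with fake keys; a thread-safe Queue collects what was "deleted".
deleted = Queue.new
fake_keys = (1..5_000).map { |i| "file-#{i}.txt" }
drain_with_bounded_queue(fake_keys, thread_count: 4, per_thread_cap: 100) { |key| deleted << key }
puts "deleted #{deleted.size} keys"
```

With the defaults, the queue never holds more than 1000 * 10 = 10,000 keys, no matter how many objects are waiting to be deleted.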
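As a rough idea of what the directory-tree variant could look like: S3 has no real directories, only keys that look like paths, so deleting a "tree" comes down to selecting the keys that share a prefix. The sketch below works on a plain in-memory list of made-up keys; keys_under_prefix is a hypothetical helper, not part of S3Nukem, and a real change would have to hook into the script's own listing and deletion code.

```ruby
# Hypothetical helper, not part of S3Nukem: pick the keys that sit under a
# "directory" prefix. A key such as "logs/2011/a.txt" only looks like a path.
def keys_under_prefix(keys, prefix)
  # Force a trailing slash so "logs" does not also match "logs-old/...".
  prefix = "#{prefix}/" unless prefix.end_with?("/")
  keys.select { |key| key.start_with?(prefix) }
end

# Made-up keys, for illustration only.
all_keys = ["logs/2011/a.txt", "logs/2011/b.txt", "logs-old/c.txt", "images/d.png"]
to_delete = keys_under_prefix(all_keys, "logs")
# to_delete now holds only the two keys under "logs/"
puts to_delete
```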