Delete Large Numbers of Amazon S3 Files using Ruby

I recently ran into a problem I needed to solve: remove hundreds of thousands of files from Amazon S3.  I mean, it had to be a common problem, right?  Well, it may well be a common problem, but the solution was less than common.

I tried a few of the tools available, starting with the tool from the Amazon S3 site, but it kept erroring out and I was never sure why.  I then turned to third-party tools for managing S3 buckets, but they either errored out or behaved as if they worked when, as I later determined, they did nothing.

I posted my need on Twitter and was pointed to a solution (thanks @Kishfy) I had not thought of: use Ruby.  There is a great open-source project named S3Nukem whose sole purpose is to remove Amazon S3 buckets.

S3Nukem

This is an open source project hosted on GitHub.   Installation and setup are pretty simple (from the GitHub repo readme): install the required gems.

For Ruby >= 1.9:

sudo gem install dmarkow-right_aws --source http://gems.github.com

The docs don’t mention it, but I needed to install the right_http_connection gem; the above command fails unless it is installed.
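If it is missing on your system, installing it the same way should take care of it:

sudo gem install right_http_connection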

For Ruby < 1.9:

sudo gem install right_aws

Install S3Nukem:

curl -O http://github.com/lathanh/s3nukem/raw/master/s3nukem

Make it executable:

chmod 755 s3nukem

Run this in the same directory where the curl command above was executed.

Usage:

Usage: ./s3nukem [options] buckets...

Options:
    -a, --access ACCESS              Amazon Access Key (required)
    -s, --secret SECRET              Amazon Secret Key (required)
    -t, --threads COUNT              Number of simultaneous threads (default 10)
    -h, --help                       Show this message
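
For example, deleting a single bucket with 20 threads looks something like this (the access key, secret key, and bucket name here are just placeholders):

./s3nukem -a YOUR_ACCESS_KEY -s YOUR_SECRET_KEY -t 20 my-bucket-to-delete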

Running the application in a terminal window shows large numbers of files being deleted:

[Screenshot: s3nukem running in a terminal window, deleting files]

This script is fast.  I tried running it under both Ruby 1.8.7 and 1.9.2, and 1.9.2 was quite a bit faster.  I didn’t run any benchmarks, but it was noticeably faster; my goal was really just to delete a large number of files.  Ruby 1.9.2 thread handling really shines here, and being able to control the number of threads from the command line is really nice.

The nice thing about this version of the script is the cap on the number of items queued for deletion at any one time, 1000 * thread_count, where the thread count defaults to 10.  With this limit in place the script won’t chew up all your system memory.
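
To make the threading and the memory cap concrete, here is a minimal sketch of the same idea written against the right_aws gem: list a batch of keys, delete them with a pool of worker threads, and repeat.  This is not S3Nukem itself; the credentials, bucket name, and thread count are placeholders, and the right_aws calls (list_bucket, delete, delete_bucket) are written from memory, so check them against the gem’s documentation before relying on this.

require 'rubygems'
require 'right_aws'
require 'thread'

ACCESS_KEY   = 'YOUR_ACCESS_KEY'   # placeholder
SECRET_KEY   = 'YOUR_SECRET_KEY'   # placeholder
BUCKET       = 'my-bucket'         # placeholder
THREAD_COUNT = 10                  # s3nukem's default

s3 = RightAws::S3Interface.new(ACCESS_KEY, SECRET_KEY)

loop do
  # S3 returns at most 1000 keys per listing request, so only a
  # bounded number of keys is ever held in memory at once.
  keys = s3.list_bucket(BUCKET, 'max-keys' => 1000).map { |o| o[:key] }
  break if keys.empty?

  queue = Queue.new
  keys.each { |k| queue << k }

  # Drain the queue with a small pool of worker threads.
  threads = (1..THREAD_COUNT).map do
    Thread.new do
      loop do
        key = queue.pop(true) rescue break
        s3.delete(BUCKET, key)
      end
    end
  end
  threads.each(&:join)
end

# Once the bucket is empty it can be removed entirely.
s3.delete_bucket(BUCKET)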

This script is designed to delete an entire bucket but could be modified to just remove the content or a directory tree within the bucket.  I may do this for a project I am working on which has a need for such functionality.
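
For what it is worth, limiting the listing to a prefix should be all that is needed for that; something along these lines (again assuming right_aws, with the prefix value as a placeholder), while skipping the final delete_bucket call:

# Only list keys under a given pseudo-directory, e.g. "logs/2010/".
keys = s3.list_bucket(BUCKET, 'prefix' => 'logs/2010/',
                              'max-keys' => 1000).map { |o| o[:key] }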

  • http://twitter.com/caseyf Casey Forbes

    I recently had to delete a massive number of files and Amazon’s s3-curl.pl turned out to be the fastest option (also, I ran several in parallel)

    http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128&categoryID=47

  • http://accidentaltechnologist.com Rob Bazinet

    @casey – very cool, thanks for the info. I will have to check it out. Have you looked at S3Nukem? I wonder how well Ruby 1.9.2 fares against Perl in this instance. It would be interesting to compare.

    By the way, how did you find my blog? Are you a regular reader? I am a huge fan of the work you write about on the Ravelry blog, good stuff going on with Rails.

  • http://one.valeski.org Jud Valeski

    this is a big problem w/ S3. here are my trials and tribulations with it (parallelizing is key to solving it for really big sets).

    http://one.valeski.org/2010/05/take-2-amazon-s3-file-deletion-fail.html