Streaming off disk to S3 in Perl

Tue Oct 21 09:30:10 2014

At the day job we're all Perl and all AWS. Lately we've been having problems moving multi-gig files up to S3, as it was sucking ALL the RAM off the box. A little investigating revealed that we were reading the whole file into RAM (and then swap) in order to push it to S3, and that's what needed to change.

CastingWords has been using S3 for ages now, through an in-house module built on LWP and an early signature version*. We could implement Multipart Upload, but we don't have it yet. The simplest thing to do was to stream directly from disk to S3. This doesn't require any new API code, just a bit of optimization. Two things, basically.

First up, S3 needs an MD5 of the file, and since we're not sticking the sucker in memory anymore we need to compute it from the file. That's simple: Digest::MD5, which we were already using, has an addfile method, and a quick inspection of the code shows that it does the right thing and doesn't load the whole file at once:

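    use Digest::MD5;
    open(my $fh, '<', $filename) or die "Can't open '$filename': $!";
    binmode($fh);    # read raw bytes, no newline translation
    # addfile reads from the handle in chunks instead of slurping the file
    $md5 = Digest::MD5->new->addfile($fh)->b64digest;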
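One thing worth checking here: the Digest::MD5 documentation notes that b64digest does not pad its output, while a Content-MD5 header is normally sent as padded base64. If your upload code doesn't already add the padding, a minimal sketch of a helper might look like the following. The name file_md5_base64 is made up for illustration, and whether you need the padding at all depends on how your S3 module builds its headers.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper (not part of our in-house module): returns the
# padded base64 MD5 of a file, reading it in chunks via addfile.
sub file_md5_base64 {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Can't open '$filename': $!";
    binmode($fh);
    my $b64 = Digest::MD5->new->addfile($fh)->b64digest;
    close($fh);
    # An MD5 digest is 16 bytes, so its base64 form is 22 characters
    # plus two '=' of padding, which Digest::MD5 leaves off.
    return $b64 . '==';
}
```

Memory use stays flat regardless of file size, since addfile only ever holds one read buffer at a time.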

Next up, the streaming itself. That's one trick with an easy gotcha. Looking at LWP, it turns out that it will only stream POSTs, not PUTs, and S3 needs a PUT. Thankfully HTTP::Request::StreamingUpload exists. The gotcha is that LWP::Protocol::http inserts a Transfer-Encoding: chunked header if you forget to include a Content-Length on a file upload, and S3 replies to that header with a 501 and 'A header you provided implies functionality that is not implemented'. So the final code here looks like:

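    use HTTP::Request::StreamingUpload;
    # The body is read from the file at path as the request is sent
    $req = HTTP::Request::StreamingUpload->new($args->{method} => $args->{url},
                                               path           => $args->{filename});
    # Explicit Content-Length stops LWP from adding Transfer-Encoding: chunked
    $req->content_length(-s $args->{filename});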
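For anyone without the surrounding module, here is a rough standalone sketch of the same idea. The helper name build_streaming_put and its arguments are invented for this example, and the S3 authentication headers are left out entirely, so treat it as an outline rather than a working uploader.

```perl
use strict;
use warnings;
use HTTP::Request::StreamingUpload;

# Hypothetical helper: build a PUT whose body is streamed from disk,
# with Content-Length set up front. S3 auth headers are NOT added here.
sub build_streaming_put {
    my ($url, $filename, $md5_b64) = @_;

    my $size = -s $filename;
    die "Can't stat '$filename': $!" unless defined $size;

    my $req = HTTP::Request::StreamingUpload->new(
        PUT  => $url,
        path => $filename,
    );
    # Without this, LWP falls back to Transfer-Encoding: chunked
    $req->content_length($size);
    # Optional: pass the base64 MD5 computed from the file earlier
    $req->header('Content-MD5' => $md5_b64) if defined $md5_b64;
    return $req;
}
```

The returned object is a plain HTTP::Request, so it can be signed and then handed to LWP::UserAgent's request method like any other request.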

And now we're streaming our files from disk!

* "Transfer-Encoding: chunked" does work if you are using S3 signature version 4, something I did not find until writing up this blog post. Amazon could work on the findability here. Still, the signature calculation doesn't look fun to implement, and this sucker's pretty easy.