
This queue is for tickets about the Net-Amazon-Glacier CPAN distribution.

Report information
The Basics
Id: 81219
Status: open
Priority: 0
Queue: Net-Amazon-Glacier

People
Owner: Nobody in particular
Requestors: treed [...] imvu.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Memory Usage [partial fix included]
Date: Thu, 15 Nov 2012 15:27:43 -0800
To: bug-Net-Amazon-Glacier [...] rt.cpan.org
From: Ted Reed <treed [...] imvu.com>
It turns out that when you're uploading large files, the memory usage is a bit extreme. It's basically 7x the size of the file. It seems like this is because things are passing the payload around by value instead of by reference, so each function call makes another copy. With 100MB payloads, the memory usage is pretty noticeable.

I've got a patch that removes one of the by-value copies, but most of the rest can't be fixed without either going into the libraries that Net::Amazon::Glacier uses or reimplementing their functionality. You may or may not want to look into either of those solutions, but I thought I'd mention it and pass the limited fix along. HTTP::Request::Common has a special dynamic mode that will stream payloads, which you might be able to use, but you wouldn't be able to use Net::Amazon::Signature::V4 with it, as far as I can tell (a sketch of that mode follows the patch below).

We (IMVU) contribute this code specifically under the GPL Version 1 and Artistic License (Perl) Version 1.

Here is the patch:

--- a/lib/Net/Amazon/Glacier.pm
+++ b/lib/Net/Amazon/Glacier.pm
@@ -150,7 +150,7 @@
             'x-amz-sha256-tree-hash' => $th->get_final_hash(),
             'x-amz-content-sha256' => sha256_hex( $content ),
         ],
-        $content
+        \$content
     );
     return 0 unless $res->is_success;
     if ( $res->header('location') =~ m{^/([^/]+)/vaults/([^/]+)/archives/(.*)$} ) {
@@ -209,7 +209,7 @@
     my $res = $self->_send_receive(
         POST => "/-/vaults/$vault_name/jobs",
         [ ],
-        encode_json($content_raw),
+        \encode_json($content_raw),
     );

     return 0 unless $res->is_success;
@@ -245,7 +245,7 @@
     my $res = $self->_send_receive(
         POST => "/-/vaults/$vault_name/jobs",
         [ ],
-        encode_json($content_raw),
+        \encode_json($content_raw),
     );

     return 0 unless $res->is_success;
@@ -309,13 +309,15 @@
 sub _craft_request {
     my ( $self, $method, $url, $header, $content ) = @_;
     my $host = 'glacier.'.$self->{region}.'.amazonaws.com';
+    $content //= \undef;
+
     my $total_header = [
         'x-amz-glacier-version' => '2012-06-01',
         'Host' => $host,
         'Date' => strftime( '%Y%m%dT%H%M%SZ', gmtime ),
         $header ? @$header : ()
     ];
-    my $req = HTTP::Request->new( $method => "https://$host$url", $total_header, $content);
+    my $req = HTTP::Request->new( $method => "https://$host$url", $total_header, $$content);
     my $signed_req = $self->{sig}->sign( $req );
     return $signed_req;
 }
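For reference, here is a minimal sketch of the HTTP::Request::Common dynamic upload mode mentioned above. It is not part of the distribution or the patch; the endpoint URL, form field name, and file path are placeholders, and as noted above the resulting multipart/form-data request is not obviously compatible with Net::Amazon::Signature::V4's signing.

#!/usr/bin/perl
# Sketch only: placeholder endpoint, field name, and file path.
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# With this flag set, POST() returns a request whose content is a code
# reference that reads the file in chunks instead of slurping it into memory.
$HTTP::Request::Common::DYNAMIC_FILE_UPLOAD = 1;

my $req = POST(
    'https://example.com/upload',                          # placeholder endpoint
    Content_Type => 'form-data',
    Content      => [ archive => ['/path/to/large-archive'] ],
);

# LWP::UserAgent calls the code ref repeatedly to stream the body.
print ref( $req->content ), "\n";    # prints "CODE"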
From: pyry [...] automattic.com
On Thu Nov 15 18:28:15 2012, treed@imvu.com wrote:
> It turns out that when you're uploading large files, the memory usage is a
> bit extreme. It's basically 7x the size of the file.
I also noticed this. I submitted a patch to Net::Amazon::Signature::V4 that needs to be applied before this one: https://rt.cpan.org/Public/Bug/Display.html?id=81864

The attached patch buffers ~1 MiB at a time, and it's now possible to upload multi-GB files with reasonable memory usage (20 MiB or so). The file is still read twice, once for TreeHash and once for the content SHA256; that could be optimized, but I didn't really need it.
Subject: 0001-Fix-memory-issues-with-large-files.patch
From a07df000ff5e28e3a27902e8bc5f21a7ae4da309 Mon Sep 17 00:00:00 2001
From: Pyry Hakulinen <pyry@automattic.com>
Date: Sun, 9 Dec 2012 22:46:30 +0200
Subject: [PATCH] Fix memory issues with large files

This patch eliminates the content buffering and passing around, and reduces
memory usage to sane levels even for multi-GB files.

This was already reported in:
https://rt.cpan.org/Public/Bug/Display.html?id=81219

This patch depends on:
https://rt.cpan.org/Public/Bug/Display.html?id=81864

The SHA calculation could be optimized by moving it to the TreeHash module;
that way we could read the file only once. But it mostly affects huge files.
---
 Net-Amazon-Glacier-0.12/lib/Net/Amazon/Glacier.pm | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/Net-Amazon-Glacier-0.12/lib/Net/Amazon/Glacier.pm b/Net-Amazon-Glacier-0.12/lib/Net/Amazon/Glacier.pm
index 276a405..b695049 100644
--- a/Net-Amazon-Glacier-0.12/lib/Net/Amazon/Glacier.pm
+++ b/Net-Amazon-Glacier-0.12/lib/Net/Amazon/Glacier.pm
@@ -135,22 +135,31 @@ sub upload_archive {
     croak "no archive path given" unless $archive_path;
     croak 'archive path is not a file' unless -f $archive_path;
     $description //= '';
-    my $content = read_file( $archive_path );

     my $th = Net::Amazon::TreeHash->new();
     open( my $content_fh, '<', $archive_path ) or croak $!;
     $th->eat_file( $content_fh );
-    close $content_fh;
     $th->calc_tree;

+    my $content_sha = Digest::SHA->new("sha256");
+    $content_sha->addfile( $archive_path );
+
+    seek( $content_fh, 0, 0 );
+    my $reader_sub = sub {
+        my $content;
+        read( $content_fh, $content, 1048576 );
+        return $content;
+    };
+
     my $res = $self->_send_receive(
         POST => "/-/vaults/$vault_name/archives",
         [
             'x-amz-archive-description' => $description,
             'x-amz-sha256-tree-hash' => $th->get_final_hash(),
-            'x-amz-content-sha256' => sha256_hex( $content ),
+            'x-amz-content-sha256' => $content_sha->hexdigest,
+            'content-length' => -s $archive_path,
         ],
-        $content
+        $reader_sub
     );
     return 0 unless $res->is_success;
     if ( $res->header('location') =~ m{^/([^/]+)/vaults/([^/]+)/archives/(.*)$} ) {
-- 
1.7.10.4
I am the author of the TreeHash module. There is an eat_data method in TreeHash which takes a reference to a data chunk. You can read the data from the file in _chunks_, one by one, in your code and pass each chunk by _reference_ to eat_data. This way you won't have any performance/memory problems. Also note that if the data size is <= 1 MiB, TreeHash(data) == SHA256(data).
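For reference, a rough single-pass sketch of that suggestion, assuming eat_data accepts a reference to each chunk as described above; the archive path is a placeholder and the 1 MiB chunk size mirrors the patch above. It computes the tree hash and the whole-content SHA-256 in one read of the file, which is the optimization mentioned in the earlier messages.

#!/usr/bin/perl
# Sketch only: feed each 1 MiB chunk by reference to TreeHash::eat_data and
# to Digest::SHA, so the archive is read once and never held in memory whole.
use strict;
use warnings;
use Net::Amazon::TreeHash;
use Digest::SHA;

my $archive_path = '/path/to/archive';    # placeholder

my $th  = Net::Amazon::TreeHash->new();
my $sha = Digest::SHA->new('sha256');

open( my $fh, '<', $archive_path ) or die $!;
binmode $fh;
while ( read( $fh, my $chunk, 1048576 ) ) {
    $th->eat_data( \$chunk );    # pass the chunk by reference, as suggested
    $sha->add( $chunk );
}
close $fh;

$th->calc_tree;
my $tree_hash   = $th->get_final_hash();
my $content_sha = $sha->hexdigest;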
I remembered now that parameters to Perl functions are actually always passed by reference (the elements of @_ are aliases to the caller's arguments). Copying happens when those parameters get modified, returned, or copied into a scalar (like copy-on-write).

### Small memory usage

#!/usr/bin/perl
use strict;
use warnings;

my $s = "a" x 100_000_000;

sub mysub {
    print `ps aux|grep $$|grep -v grep|awk '{print \$6}'`;
    return length($_[0]);
}

print `ps aux|grep $$|grep -v grep|awk '{print \$6}'`;
my $x = mysub($s);

### Double memory usage

#!/usr/bin/perl
use strict;
use warnings;

my $s = "a" x 100_000_000;

sub mysub {
    my (@a) = @_;
    print `ps aux|grep $$|grep -v grep|awk '{print \$6}'`;
    return length($a[0]);
}

print `ps aux|grep $$|grep -v grep|awk '{print \$6}'`;
my $x = mysub($s);

### Small memory usage again

#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw/sha256_hex/;

my $s = "a" x 100_000_000;

sub mysub {
    print `ps aux|grep $$|grep -v grep|awk '{print \$6}'`;
    my $x = \shift;
    return sha256_hex($$x);
}

print `ps aux|grep $$|grep -v grep|awk '{print \$6}'`;
print mysub($s);