Skip Menu |

This queue is for tickets about the Net-SSLeay CPAN distribution.

Report information
The Basics
Id: 128505
Status: rejected
Priority: 0/
Queue: Net-SSLeay

People
Owner: Nobody in particular
Requestors: leonerd-cpan [...] leonerd.org.uk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 1.85
Fixed in: 1.86_04



Subject: Memory leak from Net::SSLeay::connect
I have a long-running process which talks lots of SSL, by constantly starting and shutting down new connections talking to many hosts. Previously I had found and fixed some memory leaks visible at the SV heap level, and now the process no longer leaks according to SV count, but yet the virtual memory size still increases very linearly over time. Having run this for a while in valgrind, I get the following leak report: ==9498== 109,623 (7,392 direct, 102,231 indirect) bytes in 22 blocks are definitely lost in loss record 143 of 143 ==9498== at 0x4C2BBAF: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==9498== by 0xE34442D: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE27BA27: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE27987E: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE279F5C: ASN1_item_ex_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE279FDA: ASN1_item_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3B156C: d2i_X509_AUX (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE35763D: PEM_ASN1_read_bio (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3A152C: X509_load_cert_file (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3A0C6F: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3A516A: X509_STORE_CTX_get_by_subject (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3A588C: X509_STORE_CTX_get1_issuer (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3AA67E: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xE3AB47F: X509_verify_cert (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==9498== by 0xDF915D7: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==9498== by 0xDFA3DCA: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==9498== by 0xDFA14FE: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==9498== by 0xDF99F60: SSL_do_handshake (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==9498== by 0xDD367B0: XS_Net__SSLeay_connect (SSLeay.c:2454) ==9498== by 0x242F1D: Perl_pp_entersub (pp_hot.c:3988) ==9498== ==9498== LEAK SUMMARY: ==9498== definitely lost: 7,829 bytes in 49 blocks ==9498== indirectly lost: 107,020 bytes in 1,745 blocks I'm still investigating this, so I hope to have more details and analysis later. -- Paul Evans
On Wed Feb 13 16:43:01 2019, PEVANS wrote: Show quoted text
> I'm still investigating this, so I hope to have more details and > analysis later.
On further investigation I have a reproducible minimal test. The test appears to be specific to certain hosts. Attached is the script. In particular, most servers don't leak; e.g. www.google.com www.github.com www.ctrlo.com lenio.ctrlo.com A few do, such examples are: app4.infosaas.uk web3.infosaas.uk Unsure yet if it's related to the SSL certs themselves, or the way these hosts communicate SSL, or some other factor. -- Paul Evans
Subject: make-many-http-hits.pl
#!/usr/bin/perl use strict; use warnings; use IO::Async::Loop; use Net::Async::HTTP; # Most servers don't leak; e.g. # www.google.com # www.github.com # www.ctrlo.com # lenio.ctrlo.com # # A few do, such examples are: # app4.infosaas.uk # web3.infosaas.uk my $HOST = "app4.infosaas.uk"; my $loop = IO::Async::Loop->new; # Don't add a single UA because that will use connection keepalive. We'll # burn memory faster by constantly making them and throwing them away again sub do_a_request { my ( $host ) = @_; print STDERR "Requesting $host...\n"; $loop->add( my $ua = Net::Async::HTTP->new ); $ua->GET( "https://$host" )->then( sub { print STDERR "Done $host\n"; $loop->remove( $ua ); Future->done; }); } while(1) { $loop->delay_future( after => 1 )->get; do_a_request( $HOST )->get; }
Using script gives me a nicely repeatable: ==5996== 1,883,000 (140,800 direct, 1,742,200 indirect) bytes in 400 blocks are definitely lost in loss record 125 of 125 ==5996== at 0x483577F: malloc (vg_replace_malloc.c:299) ==5996== by 0x7DD2D38: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7CF9F48: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7CF79AA: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7CF858C: ASN1_item_ex_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7CF860A: ASN1_item_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E52E14: d2i_X509_AUX (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7DE682C: PEM_ASN1_read_bio (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E42A15: X509_load_cert_file (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E42122: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E4676A: X509_STORE_CTX_get_by_subject (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E46EAC: X509_STORE_CTX_get1_issuer (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E4BBBD: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7E4CCB5: X509_verify_cert (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1) ==5996== by 0x7BE1EC7: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==5996== by 0x7C032D3: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==5996== by 0x7C05834: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==5996== by 0x7BFF392: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==5996== by 0x7BEB1C3: SSL_do_handshake (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1) ==5996== by 0x7B3F64F: XS_Net__SSLeay_connect (SSLeay.c:2451) ==5996== by 0x25CF44: Perl_pp_entersub (pp_hot.c:5232) ==5996== by 0x215CA9: Perl_runops_debug (dump.c:2539) ==5996== by 0x16DFD3: Perl_call_sv (perl.c:3004) ==5996== by 0x4875C23: ??? (in /usr/lib/x86_64-linux-gnu/perl5/5.28/auto/Linux/Epoll/Epoll.so) ==5996== by 0x25CF44: Perl_pp_entersub (pp_hot.c:5232) ==5996== by 0x215CA9: Perl_runops_debug (dump.c:2539) ==5996== by 0x179C7D: S_run_body (perl.c:2694) ==5996== by 0x179C7D: perl_run (perl.c:2617) ==5996== by 0x13E36D: main (perlmain.c:122) The number "400 blocks" is nicely round probably because I let the script run 200 iterations of the loop. -- Paul Evans
Now with full symbol names and source location information because I installed libssl1.1-dbgsym: ==1803== 1,759,200 (140,800 direct, 1,618,400 indirect) bytes in 400 blocks are definitely lost in loss record 85 of 85 ==1803== at 0x483577F: malloc (vg_replace_malloc.c:299) ==1803== by 0x7DAFD38: CRYPTO_zalloc (mem.c:230) ==1803== by 0x7CD6F48: asn1_item_embed_new (tasn_new.c:122) ==1803== by 0x7CD49AA: asn1_item_embed_d2i (tasn_dec.c:306) ==1803== by 0x7CD558C: ASN1_item_ex_d2i (tasn_dec.c:124) ==1803== by 0x7CD560A: ASN1_item_d2i (tasn_dec.c:114) ==1803== by 0x7E2FE14: d2i_X509_AUX (x_x509.c:118) ==1803== by 0x7DC382C: PEM_ASN1_read_bio (pem_oth.c:31) ==1803== by 0x7E1FA15: X509_load_cert_file (by_file.c:90) ==1803== by 0x7E1F122: get_cert_by_subject (by_dir.c:317) ==1803== by 0x7E2376A: X509_STORE_CTX_get_by_subject (x509_lu.c:307) ==1803== by 0x7E23EAC: X509_STORE_CTX_get1_issuer (x509_lu.c:669) ==1803== by 0x7E28BBD: get_issuer (x509_vfy.c:2852) ==1803== by 0x7E28BBD: build_chain (x509_vfy.c:2991) ==1803== by 0x7E28BBD: verify_chain (x509_vfy.c:216) ==1803== by 0x7E29CB5: X509_verify_cert (x509_vfy.c:295) ==1803== by 0x7BBEEC7: ssl_verify_cert_chain (ssl_cert.c:430) ==1803== by 0x7BE02D3: tls_process_server_certificate (statem_clnt.c:1907) ==1803== by 0x7BE2834: ossl_statem_client_process_message (statem_clnt.c:1045) ==1803== by 0x7BDC392: read_state_machine (statem.c:634) ==1803== by 0x7BDC392: state_machine.part.5 (statem.c:432) ==1803== by 0x7BC81C3: SSL_do_handshake (ssl_lib.c:3579) ==1803== by 0x7B2C64F: XS_Net__SSLeay_connect (SSLeay.c:2451) ... -- Paul Evans
And now with a bit of annotation from the source code files themselves: ==1803== 1,759,200 (140,800 direct, 1,618,400 indirect) bytes in 400 blocks are definitely lost in loss record 85 of 85 ==1803== at 0x483577F: malloc (vg_replace_malloc.c:299) ==1803== by 0x7DAFD38: CRYPTO_zalloc (mem.c:230) ==1803== by 0x7CD6F48: asn1_item_embed_new (tasn_new.c:122) if (embed) { memset(*pval, 0, it->size); } else { *pval = OPENSSL_zalloc(it->size); if (*pval == NULL) goto memerr; } ==1803== by 0x7CD49AA: asn1_item_embed_d2i (tasn_dec.c:306) if (!*pval && !ASN1_item_ex_new(pval, it)) { ASN1err(ASN1_F_ASN1_ITEM_EMBED_D2I, ERR_R_NESTED_ASN1_ERROR); goto err; } ==1803== by 0x7CD558C: ASN1_item_ex_d2i (tasn_dec.c:124) rv = asn1_item_embed_d2i(pval, in, len, it, tag, aclass, opt, ctx, 0); ==1803== by 0x7CD560A: ASN1_item_d2i (tasn_dec.c:114) if (ASN1_item_ex_d2i(pval, in, len, it, -1, 0, 0, &c) > 0) return *pval; ==1803== by 0x7E2FE14: d2i_X509_AUX (x_x509.c:118) ((Via macro IMPLEMENT_ASN1_FUNCTIONS(X509) )) ret = d2i_X509(a, &q, length); ==1803== by 0x7DC382C: PEM_ASN1_read_bio (pem_oth.c:31) d2i_of_void *d2i; ret = d2i(x, &p, len); ==1803== by 0x7E1FA15: X509_load_cert_file (by_file.c:90) x = PEM_read_bio_X509_AUX(in, NULL, NULL, ""); ==1803== by 0x7E1F122: get_cert_by_subject (by_dir.c:317) if (type == X509_LU_X509) { if ((X509_load_cert_file(xl, b->data, ent->dir_type)) == 0) break; ==1803== by 0x7E2376A: X509_STORE_CTX_get_by_subject (x509_lu.c:307) lu = sk_X509_LOOKUP_value(ctx->get_cert_methods, i); j = X509_LOOKUP_by_subject(lu, type, name, &stmp); ==1803== by 0x7E23EAC: X509_STORE_CTX_get1_issuer (x509_lu.c:669) xn = X509_get_issuer_name(x); ok = X509_STORE_CTX_get_by_subject(ctx, X509_LU_X509, xn, obj); These bottom few frames are concerning themselves with reading issuer certs off the disk, in order to verify the one received on the new connection. I have additionally made the observation that the two leaking certs were generated by Lets Encrypt Issuer: C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3 and the two nonleaking ones by DigiCert: Issuer: C = US, O = DigiCert Inc, OU = www.digicert.com, CN = RapidSSL RSA CA 2018 There may be something in that; perhaps something related to loading the issuer CA cert off the disk, rather than relating to the cert served by the https server *itself* As it happens, my personal server also uses LE, so I can also test against that one and see if similar things happen there. Either way; this is feeling verymuch like a bug inside libssl itself, and not the fault of Net::SSLeay, but lacking anywhere else to write about it at present I shall continue here for now. Perhaps when I have something more concrete I'll send it upstream. -- Paul Evans
Thanks for investigating this, Paul. I was just trying to reproduce the problem locally using make-many-http-hits.pl, but wasn't able to (with Perl 5.22.2, OpenSSL 1.1.1b, Net-SSLeay master branch). Could you share some more details about your setup?
On Thu Mar 07 13:41:50 2019, CHRISN wrote: Show quoted text
> Thanks for investigating this, Paul. I was just trying to reproduce > the problem locally using make-many-http-hits.pl, but wasn't able to > (with Perl 5.22.2, OpenSSL 1.1.1b, Net-SSLeay master branch). Could > you share some more details about your setup?
Oooh, that's interesting. Same host? In case it's relevant then: $ perl -v This is perl 5, version 28, subversion 1 (v5.28.1) built for x86_64-linux-gnu-thread-multi (from debian package) ii perl 5.28.1-4 amd64 Larry Wall's Practical Extraction a $ perlmodversion Net::SSLeay 1.85 ii libssl1.1:amd64 1.1.1a-1 amd64 Secure Sockets Layer toolkit - sha I'm only using libssl 1.1.1a-1. I can try updating to 1.1.1b from debian/unstable and see if that changes things. -- Paul Evans
On Thu Mar 07 18:46:43 2019, PEVANS wrote: Show quoted text
> On Thu Mar 07 13:41:50 2019, CHRISN wrote:
> > Thanks for investigating this, Paul. I was just trying to reproduce > > the problem locally using make-many-http-hits.pl, but wasn't able to > > (with Perl 5.22.2, OpenSSL 1.1.1b, Net-SSLeay master branch). Could > > you share some more details about your setup?
> > Oooh, that's interesting. Same host?
Yep, app4.infosaas.uk. The only change I made was to hardcode the while loop to 200 iterations, figuring that'd be enough to reproduce the problem. Show quoted text
> In case it's relevant then: > > $ perl -v > > This is perl 5, version 28, subversion 1 (v5.28.1) built for x86_64- > linux-gnu-thread-multi > > (from debian package) > ii perl 5.28.1-4 amd64 Larry Wall's Practical > Extraction a > > $ perlmodversion Net::SSLeay > 1.85 > > ii libssl1.1:amd64 1.1.1a-1 amd64 Secure Sockets Layer > toolkit - sha > > I'm only using libssl 1.1.1a-1. I can try updating to 1.1.1b from > debian/unstable and see if that changes things.
I've just tried with a self-built copy of 1.1.1a but still couldn't reproduce it, so maybe there's more to it (although I agree that it doesn't look like a leak in Net-SSLeay). I'll ask Heikki if he can reproduce it on his end...
On Thu Mar 07 21:46:23 2019, CHRISN wrote: Show quoted text
> Yep, app4.infosaas.uk. The only change I made was to hardcode the > while loop to 200 iterations, figuring that'd be enough to reproduce > the problem.
Show quoted text
> I've just tried with a self-built copy of 1.1.1a but still couldn't > reproduce it, so maybe there's more to it (although I agree that it > doesn't look like a leak in Net-SSLeay). I'll ask Heikki if he can > reproduce it on his end...
Indeed. I've updated to 1.1.1b and that still leaks for this host. $ ./pull-to-csv.pl -i 5 "# timestamp",process_virtual_memory_bytes,process_resident_memory_bytes,perl_heap_svs "2019-03-08 10:54:21",46489600,37793792,83442 "2019-03-08 10:54:26",46628864,37806080,82878 "2019-03-08 10:54:31",46628864,37982208,82890 "2019-03-08 10:54:36",46764032,37982208,83513 "2019-03-08 10:54:41",46764032,37982208,82915 ... "2019-03-08 10:55:46",47722496,39030784,83079 ... "2019-03-08 10:56:36",48427008,39571456,84240 "2019-03-08 10:56:41",48574464,39841792,83225 "2019-03-08 10:56:46",48574464,39841792,83239 What's interesting is that my own home server is also using LE, using the same authority: app4.infosaas.uk.PEM.txt: X509v3 Authority Key Identifier: app4.infosaas.uk.PEM.txt- keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1 -- home.leonerd.org.uk.PEM.txt: X509v3 Authority Key Identifier: home.leonerd.org.uk.PEM.txt- keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1 -- But yet doesn't show this leaking behaviour. $ ./pull-to-csv.pl -i 5 "# timestamp",process_virtual_memory_bytes,process_resident_memory_bytes,perl_heap_svs "2019-03-08 10:45:50",43565056,34701312,60914 "2019-03-08 10:45:55",43565056,34729984,60958 "2019-03-08 10:46:00",43565056,34729984,60967 "2019-03-08 10:46:05",43708416,34729984,60974 "2019-03-08 10:46:10",43708416,34729984,61625 "2019-03-08 10:46:15",43708416,34729984,60990 "2019-03-08 10:46:20",43708416,34729984,60999 ... "2019-03-08 10:47:25",43708416,34729984,61112 "2019-03-08 10:47:30",43708416,34729984,61121 "2019-03-08 10:47:35",43708416,34729984,61130 In case it still somehow related to the issuer cert, I notice mine is loading the cert *twice* for the leaking host, but only once for the nonleaky one: $ strace -e file ... ... Requesting app4.infosaas.uk (5 of 200)... stat("/usr/lib/ssl/certs", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0 openat(AT_FDCWD, "/usr/lib/ssl/certs", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 8 stat("/usr/lib/ssl/certs/4f06f81d.0", 0x7fff47d47a00) = -1 ENOENT (No such file or directory) stat("/usr/lib/ssl/certs/2e5ac55d.0", {st_mode=S_IFREG|0644, st_size=1200, ...}) = 0 openat(AT_FDCWD, "/usr/lib/ssl/certs/2e5ac55d.0", O_RDONLY) = 8 stat("/usr/lib/ssl/certs/2e5ac55d.1", 0x7fff47d47a00) = -1 ENOENT (No such file or directory) stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3687, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3687, ...}) = 0 stat("/usr/lib/ssl/certs", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0 openat(AT_FDCWD, "/usr/lib/ssl/certs", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 10 stat("/usr/lib/ssl/certs/4f06f81d.0", 0x7fff47d47a00) = -1 ENOENT (No such file or directory) stat("/usr/lib/ssl/certs/2e5ac55d.0", {st_mode=S_IFREG|0644, st_size=1200, ...}) = 0 openat(AT_FDCWD, "/usr/lib/ssl/certs/2e5ac55d.0", O_RDONLY) = 10 stat("/usr/lib/ssl/certs/2e5ac55d.1", 0x7fff47d47a00) = -1 ENOENT (No such file or directory) stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3687, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3687, ...}) = 0 Done app4.infosaas.uk $ strace -e file ... ... Requesting home.leonerd.org.uk (17 of 200)... stat("/usr/lib/ssl/certs", {st_mode=S_IFDIR|0755, st_size=20480, ...}) = 0 openat(AT_FDCWD, "/usr/lib/ssl/certs", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 8 stat("/usr/lib/ssl/certs/4f06f81d.0", 0x7ffcd8b037a0) = -1 ENOENT (No such file or directory) stat("/usr/lib/ssl/certs/2e5ac55d.0", {st_mode=S_IFREG|0644, st_size=1200, ...}) = 0 openat(AT_FDCWD, "/usr/lib/ssl/certs/2e5ac55d.0", O_RDONLY) = 8 stat("/usr/lib/ssl/certs/2e5ac55d.1", 0x7ffcd8b037a0) = -1 ENOENT (No such file or directory) Done home.leonerd.org.uk If it's useful to compare if the file is identical to yours: $ sha256sum /usr/lib/ssl/certs/2e5ac55d.0 139a5e4a4e0fa505378c72c5f700934ce8333f4e6b1b508886c4b0eb14f4be99 /usr/lib/ssl/certs/2e5ac55d.0 -- Paul Evans
On Fri Mar 08 06:03:38 2019, PEVANS wrote: Show quoted text
> In case it still somehow related to the issuer cert, I notice mine is > loading the cert *twice* for the leaking host, but only once for the > nonleaky one:
Further on this, I've done some gdb'ing (gdb) break X509_load_cert_file (gdb) break X509_verify_cert Running in either leaky or non-leaky case it starts off the same: Breakpoint 2, X509_verify_cert (ctx=ctx@entry=0x555555a983b0) at ../crypto/x509/x509_vfy.c:255 255 SSL_DANE *dane = ctx->dane; (gdb) bt #0 X509_verify_cert (ctx=ctx@entry=0x555555a983b0) at ../crypto/x509/x509_vfy.c:255 #1 0x00007ffff7425ea8 in ssl_verify_cert_chain (s=s@entry=0x555556bc0210, sk=sk@entry=0x555556bc2340) at ../ssl/ssl_cert.c:430 #2 0x00007ffff7447433 in tls_process_server_certificate (s=0x555556bc0210, pkt=0x7fffffffd950) at ../ssl/statem/statem_clnt.c:1907 #3 0x00007ffff7449a05 in ossl_statem_client_process_message (s=0x555556bc0210, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:1045 #4 0x00007ffff7443423 in read_state_machine (s=0x555556bc0210) at ../ssl/statem/statem.c:634 #5 state_machine (s=0x555556bc0210, server=0) at ../ssl/statem/statem.c:432 #6 0x00007ffff742f1d4 in SSL_do_handshake (s=0x555556bc0210) at ../ssl/ssl_lib.c:3579 #7 0x00007ffff7501650 in XS_Net__SSLeay_connect (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.c:2451 ... Breakpoint 1, X509_load_cert_file (ctx=ctx@entry=0x555556bb7ca0, file=0x555556bd3810 "/usr/lib/ssl/certs/2e5ac55d.0", type=1) at ../crypto/x509/by_file.c:81 81 in = BIO_new(BIO_s_file()); (gdb) bt #0 X509_load_cert_file (ctx=ctx@entry=0x555556bb7ca0, file=0x555556bd3810 "/usr/lib/ssl/certs/2e5ac55d.0", type=1) at ../crypto/x509/by_file.c:81 #1 0x00007ffff730d2a3 in get_cert_by_subject (xl=0x555556bb7ca0, type=X509_LU_X509, name=<optimized out>, ret=0x7fffffffd690) at ../crypto/x509/by_dir.c:317 #2 0x00007ffff73118fb in X509_STORE_CTX_get_by_subject (vs=vs@entry=0x555555a983b0, type=type@entry=X509_LU_X509, name=name@entry=0x555556bb5860, ret=ret@entry=0x555556bd37a0) at ../crypto/x509/x509_lu.c:307 #3 0x00007ffff731203d in X509_STORE_CTX_get1_issuer (issuer=0x7fffffffd780, ctx=0x555555a983b0, x=0x5555558b4690) at ../crypto/x509/x509_lu.c:669 #4 0x00007ffff7316d5e in get_issuer (cert=0x5555558b4690, ctx=0x555555a983b0, issuer=0x7fffffffd780) at ../crypto/x509/x509_vfy.c:2852 #5 build_chain (ctx=<optimized out>) at ../crypto/x509/x509_vfy.c:2991 #6 verify_chain (ctx=0x555555a983b0) at ../crypto/x509/x509_vfy.c:216 #7 0x00007ffff7317e56 in X509_verify_cert (ctx=ctx@entry=0x555555a983b0) at ../crypto/x509/x509_vfy.c:295 (inside above) ... But at this point, a non-leaking host will stop after that - just one invocation of X509_verify_cert that invokes X509_load_cert_file just once. On a leaking host, there are a number more invocations: Breakpoint 2, X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:255 255 SSL_DANE *dane = ctx->dane; (gdb) bt #0 X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:255 #1 0x00007ffff72ad540 in OCSP_basic_verify (bs=bs@entry=0x555556be94c0, certs=certs@entry=0x0, st=st@entry=0x5555569e61a0, flags=<optimized out>, flags@entry=0) at ../crypto/ocsp/ocsp_vfy.c:88 #2 0x00007ffff7515863 in XS_Net__SSLeay_OCSP_response_verify (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.xs:6688 #3 0x00005555556405d1 in Perl_pp_entersub (my_perl=0x555555869260) at pp_hot.c:5232 #4 0x00005555556368f6 in Perl_runops_standard (my_perl=0x555555869260) at run.c:42 #5 0x0000555555678909 in S_docatch (my_perl=0x555555869260, firstpp=firstpp@entry=0x555555686200 <Perl_pp_entereval>) at pp_ctl.c:3241 #6 0x000055555568681c in Perl_pp_entereval (my_perl=0x555555869260) at pp_ctl.c:4379 #7 0x00005555556368f6 in Perl_runops_standard (my_perl=0x555555869260) at run.c:42 #8 0x00005555555a9d75 in Perl_call_sv (my_perl=0x555555869260, sv=sv@entry=0x555556b839e8, flags=flags@entry=2) at perl.c:3004 #9 0x00007ffff75108ca in tlsext_status_cb_invoke (ssl=<optimized out>, arg=<optimized out>) at SSLeay.xs:839 #10 0x00007ffff74497a9 in tls_process_initial_server_flight (s=0x555556b9ebc0) at ../ssl/statem/statem_clnt.c:2831 #11 tls_process_initial_server_flight (s=0x555556b9ebc0) at ../ssl/statem/statem_clnt.c:2813 #12 0x00007ffff7449880 in tls_process_server_done (s=0x555556b9ebc0, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:2877 #13 0x00007ffff7449a45 in ossl_statem_client_process_message (s=0x555556b9ebc0, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:1060 #14 0x00007ffff7443423 in read_state_machine (s=0x555556b9ebc0) at ../ssl/statem/statem.c:634 #15 state_machine (s=0x555556b9ebc0, server=0) at ../ssl/statem/statem.c:432 #16 0x00007ffff742f1d4 in SSL_do_handshake (s=0x555556b9ebc0) at ../ssl/ssl_lib.c:3579 #17 0x00007ffff7501650 in XS_Net__SSLeay_connect (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.c:2451 Breakpoint 2, X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:255 255 SSL_DANE *dane = ctx->dane; (gdb) bt #0 X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:255 #1 0x00007ffff7425ea8 in ssl_verify_cert_chain (s=s@entry=0x555556bf2440, sk=sk@entry=0x5555569ffe30) at ../ssl/ssl_cert.c:430 #2 0x00007ffff7447433 in tls_process_server_certificate (s=0x555556bf2440, pkt=0x7fffffffd980) at ../ssl/statem/statem_clnt.c:1907 #3 0x00007ffff7449a05 in ossl_statem_client_process_message (s=0x555556bf2440, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:1045 #4 0x00007ffff7443423 in read_state_machine (s=0x555556bf2440) at ../ssl/statem/statem.c:634 #5 state_machine (s=0x555556bf2440, server=0) at ../ssl/statem/statem.c:432 #6 0x00007ffff742f1d4 in SSL_do_handshake (s=0x555556bf2440) at ../ssl/ssl_lib.c:3579 #7 0x00007ffff7501650 in XS_Net__SSLeay_connect (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.c:2451 Breakpoint 1, X509_load_cert_file (ctx=ctx@entry=0x555556bed5a0, file=0x555556c05bf0 "/usr/lib/ssl/certs/2e5ac55d.0", type=1) at ../crypto/x509/by_file.c:81 81 in = BIO_new(BIO_s_file()); (gdb) bt #0 X509_load_cert_file (ctx=ctx@entry=0x555556bed5a0, file=0x555556c05bf0 "/usr/lib/ssl/certs/2e5ac55d.0", type=1) at ../crypto/x509/by_file.c:81 #1 0x00007ffff730d2a3 in get_cert_by_subject (xl=0x555556bed5a0, type=X509_LU_X509, name=<optimized out>, ret=0x7fffffffd6c0) at ../crypto/x509/by_dir.c:317 #2 0x00007ffff73118fb in X509_STORE_CTX_get_by_subject (vs=vs@entry=0x5555559b2ba0, type=type@entry=X509_LU_X509, name=name@entry=0x555556beffd0, ret=ret@entry=0x555556c05b80) at ../crypto/x509/x509_lu.c:307 #3 0x00007ffff731203d in X509_STORE_CTX_get1_issuer (issuer=0x7fffffffd7b0, ctx=0x5555559b2ba0, x=0x555556b968a0) at ../crypto/x509/x509_lu.c:669 #4 0x00007ffff7316d5e in get_issuer (cert=0x555556b968a0, ctx=0x5555559b2ba0, issuer=0x7fffffffd7b0) at ../crypto/x509/x509_vfy.c:2852 #5 build_chain (ctx=<optimized out>) at ../crypto/x509/x509_vfy.c:2991 #6 verify_chain (ctx=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:216 #7 0x00007ffff7317e56 in X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:295 #8 0x00007ffff7425ea8 in ssl_verify_cert_chain (s=s@entry=0x555556bf2440, sk=sk@entry=0x5555569ffe30) at ../ssl/ssl_cert.c:430 #9 0x00007ffff7447433 in tls_process_server_certificate (s=0x555556bf2440, pkt=0x7fffffffd980) at ../ssl/statem/statem_clnt.c:1907 #10 0x00007ffff7449a05 in ossl_statem_client_process_message (s=0x555556bf2440, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:1045 #11 0x00007ffff7443423 in read_state_machine (s=0x555556bf2440) at ../ssl/statem/statem.c:634 #12 state_machine (s=0x555556bf2440, server=0) at ../ssl/statem/statem.c:432 #13 0x00007ffff742f1d4 in SSL_do_handshake (s=0x555556bf2440) at ../ssl/ssl_lib.c:3579 #14 0x00007ffff7501650 in XS_Net__SSLeay_connect (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.c:2451 Breakpoint 2, X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:255 255 SSL_DANE *dane = ctx->dane; (gdb) bt #0 X509_verify_cert (ctx=ctx@entry=0x5555559b2ba0) at ../crypto/x509/x509_vfy.c:255 #1 0x00007ffff72ad540 in OCSP_basic_verify (bs=bs@entry=0x555556c07630, certs=certs@entry=0x0, st=st@entry=0x555556bb5250, flags=<optimized out>, flags@entry=0) at ../crypto/ocsp/ocsp_vfy.c:88 #2 0x00007ffff7515863 in XS_Net__SSLeay_OCSP_response_verify (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.xs:6688 #3 0x00005555556405d1 in Perl_pp_entersub (my_perl=0x555555869260) at pp_hot.c:5232 #4 0x00005555556368f6 in Perl_runops_standard (my_perl=0x555555869260) at run.c:42 #5 0x0000555555678909 in S_docatch (my_perl=0x555555869260, firstpp=0x5555556870c0 <Perl_pp_entertry>) at pp_ctl.c:3241 #6 0x00005555556368f6 in Perl_runops_standard (my_perl=0x555555869260) at run.c:42 #7 0x00005555555a9d75 in Perl_call_sv (my_perl=0x555555869260, sv=sv@entry=0x555556b3c6c8, flags=flags@entry=2) at perl.c:3004 #8 0x00007ffff75108ca in tlsext_status_cb_invoke (ssl=<optimized out>, arg=<optimized out>) at SSLeay.xs:839 #9 0x00007ffff74497a9 in tls_process_initial_server_flight (s=0x555556bf2440) at ../ssl/statem/statem_clnt.c:2831 #10 tls_process_initial_server_flight (s=0x555556bf2440) at ../ssl/statem/statem_clnt.c:2813 #11 0x00007ffff7449880 in tls_process_server_done (s=0x555556bf2440, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:2877 #12 0x00007ffff7449a45 in ossl_statem_client_process_message (s=0x555556bf2440, pkt=<optimized out>) at ../ssl/statem/statem_clnt.c:1060 #13 0x00007ffff7443423 in read_state_machine (s=0x555556bf2440) at ../ssl/statem/statem.c:634 #14 state_machine (s=0x555556bf2440, server=0) at ../ssl/statem/statem.c:432 #15 0x00007ffff742f1d4 in SSL_do_handshake (s=0x555556bf2440) at ../ssl/ssl_lib.c:3579 #16 0x00007ffff7501650 in XS_Net__SSLeay_connect (my_perl=<optimized out>, cv=<optimized out>) at SSLeay.c:2451 These latter invocations appear to be coming from a nested perl call (the perl_call_sv in the stack trace), related to the tls_status_cb. This doesn't seem to happen in the nonleaking host case. -- Paul Evans
Stack traces summarised: Nonleak just does one iteration of: X509_load_cert_file get_cert_by_subject X509_STORE_CTX_get_by_subject X509_STORE_CTX_get1_issuer get_issuer build_chain verify_chain X509_verify_cert ssl_verify_cert_chain tls_process_server_certificate ossl_statem_client_process_message read_state_machine state_machine SSL_do_handshake XS_Net__SSLeay_connect Leaky case does that, followed by X509_verify_cert OCSP_basic_verify XS_Net__SSLeay_OCSP_response_verify Perl_pp_entersub Perl_runops_standard S_docatch Perl_pp_entereval Perl_runops_standard Perl_call_sv tlsext_status_cb_invoke tls_process_initial_server_flight tls_process_initial_server_flight tls_process_server_done ossl_statem_client_process_message read_state_machine state_machine SSL_do_handshake XS_Net__SSLeay_connect then repeats the entire process a second time, this time curiously lacking the Perl_pp_entereval. I haven't yet analysed where in relation to these stacktraces the actual leak happens. That's next... -- Paul Evans
On Thu Mar 14 11:18:33 2019, PEVANS wrote: Show quoted text
> Stack traces summarised: > > Nonleak just does one iteration of: >
... Show quoted text
> tls_process_server_certificate > ossl_statem_client_process_message > read_state_machine > state_machine > SSL_do_handshake > XS_Net__SSLeay_connect
This part of the stacktrace is in response of (both) servers sending a TLS_ST_CR_CERT message, which ssl/statem/statem_clnt.c processes. Show quoted text
> Leaky case does that, followed by >
... Show quoted text
> tls_process_server_done > ossl_statem_client_process_message > read_state_machine > state_machine > SSL_do_handshake > XS_Net__SSLeay_connect
This part is as a result of TLS_ST_CR_SRVR_DONE, which the nonleaking server seems not to issue. -- Paul Evans
Much more prodding by using $ PERL_DESTRUCT_LEVEL=2 valgrind --vgdb=yes --vgdb-error=0 debugperl make-many-http-hits.pl meanwhile $ gdb debugperl (gdb) target remote | vgdb (gdb) break X509_load_cert_file I trace what happens inside int X509_load_cert_file() (crypto/x509/by_file.c:74) I print the value of x after line 90 each time: (gdb) p x $8 = (X509 *) 0x4fd8490 ... (gdb) p x $11 = (X509 *) 0x536d280 thus obtaining two addresses. Once it finishes one iteration I ask for a leak report and inspect some blocks: (gdb) monitor leak_check full definiteleak ... ==19506== 7,603 (+7,603) (704 (+704) direct, 6,899 (+6,899) indirect) bytes in 2 (+2) blocks are definitely lost in loss record 7,659 of 8,055 ==19506== at 0x483577F: malloc (vg_replace_malloc.c:299) ==19506== by 0x7DAFD38: CRYPTO_zalloc (mem.c:230) ==19506== by 0x7CD6F18: asn1_item_embed_new (tasn_new.c:122) ==19506== by 0x7CD498A: asn1_item_embed_d2i (tasn_dec.c:306) ==19506== by 0x7CD556C: ASN1_item_ex_d2i (tasn_dec.c:124) ==19506== by 0x7CD55EA: ASN1_item_d2i (tasn_dec.c:114) ==19506== by 0x7E2FFE4: d2i_X509_AUX (x_x509.c:118) ==19506== by 0x7DC388C: PEM_ASN1_read_bio (pem_oth.c:31) ==19506== by 0x7E1FBA5: X509_load_cert_file (by_file.c:90) ==19506== by 0x7E1F2A2: get_cert_by_subject (by_dir.c:317) ==19506== by 0x7E238FA: X509_STORE_CTX_get_by_subject (x509_lu.c:307) ==19506== by 0x7E2403C: X509_STORE_CTX_get1_issuer (x509_lu.c:669) (gdb) monitor block_list 7659 ==19506== 7,603 (+7,603) (704 (+704) direct, 6,899 (+6,899) indirect) bytes in 2 (+2) blocks are definitely lost in loss record 7,659 of 8,055 ==19506== at 0x483577F: malloc (vg_replace_malloc.c:299) ==19506== by 0x7DAFD38: CRYPTO_zalloc (mem.c:230) ==19506== by 0x7CD6F18: asn1_item_embed_new (tasn_new.c:122) ==19506== by 0x7CD498A: asn1_item_embed_d2i (tasn_dec.c:306) ==19506== by 0x7CD556C: ASN1_item_ex_d2i (tasn_dec.c:124) ==19506== by 0x7CD55EA: ASN1_item_d2i (tasn_dec.c:114) ==19506== by 0x7E2FFE4: d2i_X509_AUX (x_x509.c:118) ==19506== by 0x7DC388C: PEM_ASN1_read_bio (pem_oth.c:31) ==19506== by 0x7E1FBA5: X509_load_cert_file (by_file.c:90) ==19506== by 0x7E1F2A2: get_cert_by_subject (by_dir.c:317) ==19506== by 0x7E238FA: X509_STORE_CTX_get_by_subject (x509_lu.c:307) ==19506== by 0x7E2403C: X509_STORE_CTX_get1_issuer (x509_lu.c:669) ==19506== 0x4FD8490[352] ==19506== 0x4E97940[16] indirect loss record 4359 ... ==19506== 0x536D280[352] ==19506== 0x57C3A00[256] indirect loss record 6690 ... and sure enough, those same two memory addresses appear in the loss records. -- Paul Evans
By comparison, if I pick the nonleaky host, I don't get this leak and instead I can see where it was freed: (gdb) p x $2 = (X509 *) 0x5001110 ... (gdb) monitor check_memory addressable 0x5001110 Address 0x5001110 len 1 not addressable: bad address 0x5001110 Address 0x5001110 is 0 bytes inside a block of size 352 free'd ==22165== at 0x48369AB: free (vg_replace_malloc.c:530) ==22165== by 0x7CD692B: asn1_item_embed_free (tasn_fre.c:115) ==22165== by 0x7CD69A4: ASN1_item_free (tasn_fre.c:20) ==22165== by 0x7E22FF8: X509_OBJECT_free (x509_lu.c:463) ==22165== by 0x7E10E3F: OPENSSL_sk_pop_free (stack.c:368) ==22165== by 0x7E2352C: sk_X509_OBJECT_pop_free (x509_vfy.h:58) ==22165== by 0x7E2352C: X509_STORE_free (x509_lu.c:225) ==22165== by 0x7BC77C8: SSL_CTX_free (ssl_lib.c:3110) ==22165== by 0x7BC77C8: SSL_CTX_free (ssl_lib.c:3080) ==22165== by 0x7B3E0D6: XS_Net__SSLeay_CTX_free (SSLeay.xs:1847) ==22165== by 0x25D444: Perl_pp_entersub (pp_hot.c:5232) ==22165== by 0x2160C9: Perl_runops_debug (dump.c:2539) ==22165== by 0x16E39C: Perl_call_sv (perl.c:3021) ==22165== by 0x2658E9: S_curse (sv.c:6938) Block was alloc'd at ==22165== at 0x483577F: malloc (vg_replace_malloc.c:299) ==22165== by 0x7DAFD38: CRYPTO_zalloc (mem.c:230) ==22165== by 0x7CD6F18: asn1_item_embed_new (tasn_new.c:122) ==22165== by 0x7CD498A: asn1_item_embed_d2i (tasn_dec.c:306) ==22165== by 0x7CD556C: ASN1_item_ex_d2i (tasn_dec.c:124) ==22165== by 0x7CD55EA: ASN1_item_d2i (tasn_dec.c:114) ==22165== by 0x7E2FFE4: d2i_X509_AUX (x_x509.c:118) ==22165== by 0x7DC388C: PEM_ASN1_read_bio (pem_oth.c:31) ==22165== by 0x7E1FBA5: X509_load_cert_file (by_file.c:90) ==22165== by 0x7E1F2A2: get_cert_by_subject (by_dir.c:317) ==22165== by 0x7E238FA: X509_STORE_CTX_get_by_subject (x509_lu.c:307) ==22165== by 0x7E2403C: X509_STORE_CTX_get1_issuer (x509_lu.c:669) -- Paul Evans
Applying a debug-print patch: --- openssl-1.1.1a.orig/crypto/x509/by_file.c +++ openssl-1.1.1a/crypto/x509/by_file.c @@ -98,6 +98,8 @@ int X509_load_cert_file(X509_LOOKUP *ctx goto err; } } + fprintf(stderr, "LEODEBUG: add x=%p to store=%p\n", + x, ctx->store_ctx); i = X509_STORE_add_cert(ctx->store_ctx, x); if (!i) goto err; --- openssl-1.1.1a.orig/crypto/x509/x509_lu.c +++ openssl-1.1.1a/crypto/x509/x509_lu.c @@ -209,6 +209,10 @@ void X509_STORE_free(X509_STORE *vfy) if (vfy == NULL) return; + + fprintf(stderr, "LEODEBUG: free store=%p\n", + vfy); + CRYPTO_DOWN_REF(&vfy->references, &i, vfy->lock); REF_PRINT_COUNT("X509_STORE", vfy); if (i > 0) Yields in a leaky case: Requesting app4.infosaas.uk (1 of 200)... LEODEBUG: add x=0x55a94ac6e3f0 to store=0x55a94bf33600 LEODEBUG: add x=0x55a94ab6f100 to store=0x55a94bf6b020 Done app4.infosaas.uk LEODEBUG: free store=0x55a94bf33600 LEODEBUG: free store=0x55a94bf6b020 Requesting app4.infosaas.uk (2 of 200)... LEODEBUG: add x=0x55a94bf76a50 to store=0x55a94c1d97d0 LEODEBUG: add x=0x55a94acdbfb0 to store=0x55a94bf6a650 Done app4.infosaas.uk LEODEBUG: free store=0x55a94c1d97d0 LEODEBUG: free store=0x55a94bf6a650 ... Whereas a non-leak is: Requesting home.leonerd.org.uk (1 of 200)... LEODEBUG: add x=0x5572c5cf04e0 to store=0x5572c6d8e930 Done home.leonerd.org.uk LEODEBUG: free store=0x5572c6d8e930 Requesting home.leonerd.org.uk (2 of 200)... LEODEBUG: add x=0x5572c70cf090 to store=0x5572c70ce4f0 Done home.leonerd.org.uk LEODEBUG: free store=0x5572c70ce4f0 ... Looking at `monitor leak_check` in the leaky case reveals that it's usually both occasions of pointer that get leaked. -- Paul Evans
PROGRESS! By adding more debug that prints the `references` count of the X509 objects at the time the store is thrown away, I get: $ PERL_DESTRUCT_LEVEL=2 debugperl make-many-http-hits.pl Requesting app4.infosaas.uk (1 of 200)... LEODEBUG: add x=0x56498d56c3f0 to store=0x56498e910c20 LEODEBUG: free store=0x56498e910c20 LEODEBUG: store contains 1 items [0] = x509=0x56498d56c3f0 (references=2) LEODEBUG: add x=0x56498d6ba8d0 to store=0x56498e948de0 Done app4.infosaas.uk LEODEBUG: free store=0x56498e948de0 LEODEBUG: store contains 1 items [0] = x509=0x56498d6ba8d0 (references=2) Requesting app4.infosaas.uk (2 of 200)... LEODEBUG: add x=0x56498e949960 to store=0x56498e937a70 LEODEBUG: add x=0x56498d64cb00 to store=0x56498e9400b0 Done app4.infosaas.uk LEODEBUG: free store=0x56498e937a70 LEODEBUG: store contains 1 items [0] = x509=0x56498e949960 (references=2) LEODEBUG: free store=0x56498e9400b0 LEODEBUG: store contains 1 items [0] = x509=0x56498d64cb00 (references=2) vs $ PERL_DESTRUCT_LEVEL=2 debugperl make-many-http-hits.pl -h home.leonerd.org.uk Requesting home.leonerd.org.uk (1 of 200)... LEODEBUG: add x=0x55deeef3d570 to store=0x55def0203ac0 Done home.leonerd.org.uk LEODEBUG: free store=0x55def0203ac0 LEODEBUG: store contains 1 items [0] = x509=0x55deeef3d570 (references=1) Requesting home.leonerd.org.uk (2 of 200)... LEODEBUG: add x=0x55deeefab130 to store=0x55def0203b60 Done home.leonerd.org.uk LEODEBUG: free store=0x55def0203b60 LEODEBUG: store contains 1 items [0] = x509=0x55deeefab130 (references=1) It seems the X509 objects have leaked a reference so when the store throws them away, the refcount doesn't hit zero yet but remains at 1. I wonder where that went... -- Paul Evans

Message body is not shown because it is too large.

On Fri Mar 15 13:52:40 2019, PEVANS wrote: Show quoted text
> Applying some gdb `watch` magic to see where the refcount gets > tickled.
Analysing that long list of stack traces, I make the following summary: * X509 cert gets created (references=1) * gets added to store (references=2) * local variable gets dropped (references=1) this seems fine so far. * some internal fiddling within ssl_verify_cert_chain ... as a result of tls_process_server_certificate references ends up being 2 Then the OCSP basic verify step comes in, invoked via tlsext_status_cb_invoke, through a separate perl interpreter level, working within OCSP_basic_verify. That fiddles with the cert but leaves reference count as 2 again by the end. Then crucially are three modifications that appear unbalanced, all invoked from inside Net::SSLeay::OCSP_cert2ids: up (references=3) X509_up_ref X509_STORE_CTX_get_by_subject X509_STORE_CTX_get1_issuer -> find_issuer -> Net::SSLeay::OCSP_cert2ids ... tlsext_status_cb_invoke tls_process_initial_server_flight -> tls_process_initial_server_flight -> tls_process_server_done up (references=4) X509_up_ref X509_STORE_CTX_get1_issuer -> find_issuer -> Net::SSLeay::OCSP_cert2ids ... tlsext_status_cb_invoke tls_process_initial_server_flight -> tls_process_initial_server_flight -> tls_process_server_done down (references=3) CRYPTO_DOWN_REF -> asn1_do_lock -> asn1_item_embed_free -> ASN1_item_free X509_OBJECT_free X509_STORE_CTX_get1_issuer -> find_issuer -> Net::SSLeay::OCSP_cert2ids ... tlsext_status_cb_invoke tls_process_initial_server_flight -> tls_process_initial_server_flight -> tls_process_server_done I believe it is this unbalancing step where the leak comes from, because at this point references=3, not restored to 2. Hereafter, the SSL library seems to clean itself up OK: * SSL_free cleans up one ref (references=2) * SSL_CTX_free cleans up one ref (references=1) were it not for that unbalancing step at Net::SSLeay::OCSP_cert2ids, I believe now the reference count would be 0, and the memory reclaimed. -- Paul Evans
On Fri Mar 15 14:36:40 2019, PEVANS wrote: Show quoted text
> I believe it is this unbalancing step where the leak comes from, > because at this point references=3, not restored to 2. Hereafter, the > SSL library seems to clean itself up OK:
Looking in SSLeay.xs at `find_issuer` function, I wonder if this explains it. The function has two cases; maybe it'll find the issuer inside the "chain" argument - if so it returns a value taken directly from that STACK_OF(X509) objects: issuer = sk_X509_value(chain,i); that wouldn't increment a refcount. If it fails to find it there, it asks the X509_STORE for its issuer: int ok = X509_STORE_CTX_get1_issuer(&issuer,stx,cert); Looking at the stack traces, it appears that this latter function does result in the X509's refcount being incremented. It appears that the other callers of `find_issuer` don't take account of this fact. Speculating on this, I'm going to add a refcount-increment step to the first half of that function, and then have all of find_issuer()'s callers remember to drop it again after they're done with the result. -- Paul Evans
On Fri Mar 15 14:43:58 2019, PEVANS wrote: Show quoted text
> Speculating on this, I'm going to add a refcount-increment step to the > first half of that function, and then have all of find_issuer()'s > callers remember to drop it again after they're done with the result.
I believe we're onto a winner :) The attached patch implements this strategy, and so far under my testing hasn't managed to leak any memory of any of the previously-problematic hostnames. -- Paul Evans
Subject: rt128505.patch
=== modified file 'SSLeay.xs' --- SSLeay.xs 2019-02-15 18:23:30 +0000 +++ SSLeay.xs 2019-03-15 18:49:33 +0000 @@ -1651,6 +1651,7 @@ if ( X509_check_issued(sk_X509_value(chain,i),cert) == X509_V_OK ) { TRACE(2,"found issuer in chain"); issuer = sk_X509_value(chain,i); + X509_up_ref(issuer); } } } @@ -6600,7 +6601,9 @@ croak("no OCSP request for self-signed certificate"); if (!(issuer = find_issuer(cert,store,chain))) croak("cannot find issuer certificate"); - if (!(id = OCSP_cert_to_id(EVP_sha1(),cert,issuer))) + id = OCSP_cert_to_id(EVP_sha1(),cert,issuer); + X509_free(issuer); + if(!id) croak("out of memory for generating OCSP certid"); pi = NULL; @@ -6696,6 +6699,7 @@ ERR_clear_error(); /* clear error from last OCSP_basic_verify */ if (last && (issuer = find_issuer(last,store,chain))) { OCSP_basic_add1_cert(bsr, issuer); + X509_free(issuer); TRACE(1,"run OCSP_basic_verify with issuer for last chain element"); RETVAL = OCSP_basic_verify(bsr, NULL, store, flags); }
On Fri Mar 15 14:43:58 2019, PEVANS wrote: Show quoted text
> The function has two cases; maybe it'll find the issuer inside the > "chain" argument - if so it returns a value taken directly from that > STACK_OF(X509) objects: > > issuer = sk_X509_value(chain,i); > > that wouldn't increment a refcount. If it fails to find it there, it > asks the X509_STORE for its issuer: > > int ok = X509_STORE_CTX_get1_issuer(&issuer,stx,cert); > > Looking at the stack traces, it appears that this latter function does > result in the X509's refcount being incremented. It appears that the > other callers of `find_issuer` don't take account of this fact.
I suspect this might also explain why some hosts leak and some don't. The leak is all about whether the chain in the original server cert happened to contain the issuer in the right place, or not. Whether it did, makes the `find_issuer` function take one of two paths, resulting in mismatched refcounts depending on which path it took. -- Paul Evans
On Fri Mar 15 18:53:33 2019, PEVANS wrote: Show quoted text
> On Fri Mar 15 14:43:58 2019, PEVANS wrote:
> > Speculating on this, I'm going to add a refcount-increment step to > > the > > first half of that function, and then have all of find_issuer()'s > > callers remember to drop it again after they're done with the result.
> > I believe we're onto a winner :) > > The attached patch implements this strategy, and so far under my > testing hasn't managed to leak any memory of any of the previously- > problematic hostnames.
Thanks for the huge amount of investigative work you did on this, Paul, and I'm glad you found the source of the problem in the end. Unfortunately, I realised what was going on as soon as you mentioned find_issuer() --- Steffen Ullrich already spotted and fixed this specific leak with that function and its callers (albeit in a different context), which explains why I wasn't able to reproduce the leak with the git master branch: https://github.com/radiator-software/p5-net-ssleay/commit/0a648a564176b539e2022f61d4130fdbd9b7ef05 Steffen's fix was quite similar to yours, the only difference being that he used X509_dup(issuer) in find_issuer() where you instead used X509_up_ref(issuer). Given the way the return value from find_issuer() is used in SSLeay.xs, I think X509_dup() is the best fit here. Sorry for not connecting the dots sooner, it would've saved you a lot of hassle. Thanks again for all the effort you put into tracking this down!