This queue is for tickets about the Date-Manip CPAN distribution.

Report information
The Basics
Id: 102188
Status: resolved
Priority: 0/
Queue: Date-Manip

People
Owner: Nobody in particular
Requestors: gdg [...] zplane.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Surprising ParseDate() performance differential, DM5 vs. DM6
Date: Tue, 17 Feb 2015 15:42:33 -0700
To: bug-Date-Manip [...] rt.cpan.org
From: Glenn Golden <gdg [...] zplane.com>
Hello,

Wondering if you'd mind having a look at the attached simple ParseDate() benchmark script to see if you get results similar to what I'm seeing, between DM5 and DM6. On my setups, I see a differential of roughly a factor of 3 in favor of DM5, as evaluated on the following systems:

    x686    Perl 5.20.1
    x86-64  Perl 5.20.1
    x86-64  Perl 5.16.3

I'm not suggesting this is a bug or even complaining about the efficiency in an absolute sense, since the doc is clear that efficiency was not a prime design consideration of the module; but it does seem to contradict this statement about DM5 vs. DM6 relative performance in Date::Manip.3pm:

    Performance Issues
    ------------------
    Considerable time has been spent speeding up Date::Manip, and fairly
    simple benchmarks show that version 6 is around twice as fast as
    version 5.

I understand that the above statement was probably meant to be applicable to the package as a whole, i.e. over the entire set of D::M functions. But I think that many readers could gain the impression that using DM6 is likely to nearly always provide a speed improvement over DM5, and so perhaps might not even attempt a comparison. But a 3x slowdown in what is probably one of the most commonly used routines might make the difference between choosing 5 over 6 in some applications. (That was the case for me, in a script in which ParseDate() was the primary workhorse in an inner loop.)

So I'm wondering if you'd be amenable to a doc update to the above blurb that points out that efficiency crossovers may occur in some routines, just to bring it to users' attention, even though DM5 is no longer in active development. If you agree with this approach, I'd be happy to submit a doc patch.

Thanks,
Glenn

Your timing, though accurate, does not reflect anything useful. Let me explain.

Date::Manip 5.x made extensive use of caching... i.e. if you called a subroutine with the same arguments, rather than redoing the calculation, it made use of the cached value determined previously.

Date::Manip 6.x also caches values, but it is a far more complex and powerful module, and it does more extra work than 5.x, so there is more overhead for each date operation. However, the actual work is far more optimized, so even with the overhead (much of which is actually designed to make the operation faster), overall it is faster.

Your simple script does not parse a single date 3000 times... instead, it parses a single date and then uses cached values the other 2999 times. Since 5.x doesn't have as much overhead, you're really not exercising any of the date operations, so the factor of 3 is not surprising... but also not representative of anything useful.

If you change your script to be a more real world script, you'll find the results shift in favor of 6.x. For example, change the middle portion of your script to read:

#######################################
my $t0 = [gettimeofday()];
foreach my $d (1..12) {
   foreach my $h (10..22) {
      foreach my $m (10..30) {
         foreach my $s (10..15) {
            # Double quotes, so that $d, $h, $m and $s are interpolated.
            my $date_in  = "Fri, $d Jan 2015 $h:$m:$s -0500";
            my $date_out = ParseDate($date_in);
}}}}
my $t1 = [gettimeofday()];
my $et = tv_interval($t0, $t1);
#######################################

Now, your script is actually parsing unique dates (19,656 of them to be exact). They are all within a few days of each other, and this is a very real-world situation (you might get it parsing a log file for example). Now, 6.x is faster (though not by a lot) than 5.x.

If you switch to the OO method (which is actually how 6.x is best used), then you get a slight speedup (by discarding the overhead of calling the functional interface, which immediately calls the OO methods). That script now looks like:

#######################################
use Date::Manip::Date;
my $obj = new Date::Manip::Date;

printf(STDERR "\n");
#printf(STDERR "D::M Version: %s\n", DateManipVersion(1));

my $t0 = [gettimeofday()];
foreach my $d (1..12) {
   foreach my $h (10..22) {
      foreach my $m (10..30) {
         foreach my $s (10..15) {
            my $date_in = "Fri, $d Jan 2015 $h:$m:$s -0500";
            #my $date_out = ParseDate($date_in);
            $obj->parse($date_in);
}}}}
#######################################

Finally, with the OO method, you can optimize it by discarding some parts of the parsing. So, if you change the parse line to:

   $obj->parse($date_in,"noiso8601");

you get additional speedup since ISO 8601 dates are the first ones parsed, and your format falls into the second category of dates parsed.

With these changes, the ratio of run times is:

   5.x      1.000
   6.x      0.937
   6.x OO   0.935
   + opt    0.894

At some point, I do intend to add a lot more timing information... it interests me, and will be useful, but I haven't taken the time to do all the work to do it yet. However, Date::Manip 6 will outperform 5 in practically every way.
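
As a quick sanity check on the loop bounds above, the following minimal sketch (not part of the original benchmark) builds the same date strings without parsing them and counts how many are distinct. It also shows why the string has to be double-quoted: Perl does not interpolate variables inside single quotes, so a single-quoted version would hand the parser the same literal on every iteration. The helper names `build_dates` and `count_unique` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build every date string produced by the nested loops
# in the benchmark, without parsing anything.
sub build_dates {
    my @dates;
    foreach my $d (1..12) {
        foreach my $h (10..22) {
            foreach my $m (10..30) {
                foreach my $s (10..15) {
                    # Double quotes so $d, $h, $m and $s are interpolated.
                    push @dates, "Fri, $d Jan 2015 $h:$m:$s -0500";
                }
            }
        }
    }
    return @dates;
}

# Hypothetical helper: count the distinct strings in a list.
sub count_unique {
    my %seen;
    $seen{$_}++ for @_;
    return scalar keys %seen;
}

my @dates = build_dates();
printf "%d strings, %d unique\n", scalar(@dates), count_unique(@dates);

# Single quotes suppress interpolation: each copy is the same literal text.
my @literal = ('Fri, $d Jan 2015 $h:$m:$s -0500') x 3;
printf "single-quoted: %d unique out of %d\n", count_unique(@literal), scalar(@literal);
```

The first line of output should report 12 * 13 * 21 * 6 = 19656 strings, all unique, which matches the count quoted in the reply.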
Subject: Re: [rt.cpan.org #102188] Surprising ParseDate() performance differential, DM5 vs. DM6
Date: Wed, 18 Feb 2015 07:16:26 -0700
To: Sullivan Beck via RT <bug-Date-Manip [...] rt.cpan.org>
From: Glenn Golden <gdg [...] zplane.com>
Sullivan Beck via RT <bug-Date-Manip@rt.cpan.org> [2015-02-18 08:08:51 -0500]:
> Your timing, though accurate, does not reflect anything useful. Let me
> explain.
First, thanks for your quick and detailed response, really appreciate your time.

> Your simple script does not parse a single date 3000 times... instead, it
> parses a single date and then uses cached values the other 2999 times.
> Since 5.x doesn't have as much overhead, you're really not exercising any
> of the date operations, so the factor of 3 is not surprising... but also
> not representative of anything useful.
>
> If you change your script to be a more real world script, you'll find the
> results shift in favor of 6.x. For example, change the middle portion of
> your script to read:
I'm puzzled by this, because that trivial benchmark was of course only a very simplified view of the loop in my actual code (in which different dates were being parsed each time) and yet in the real code I was still seeing a large DM5 vs DM6 differential. (As an aside, the differential in the real code was even larger than the factor of 3 observed using the repetitive date example, perhaps as large as 5 or 6, but I didn't attempt to characterize it very closely at that time.)

I supplied the benchmark repetitive-date example only because it displayed in a simple way a consistent relative speed differential, even though that difference was not as large as in my real code.

So let me look into this more and try to create a more realistic minimal example if I can, that uses an actual subset of date strings obtained from the real application. Perhaps there is some mistake in my code that has fooled me into thinking the DM5/DM6 speed differential is larger than it is, and so the simple repetitive-date benchmark example simply "confirmed" what was not really true.

Anyway... I will post when I have more info. Thanks again for your time on this.

Glenn
Subject: Re: [rt.cpan.org #102188] Resolved: Surprising ParseDate() performance differential, DM5 vs. DM6
Date: Wed, 18 Feb 2015 11:17:46 -0700
To: Sullivan Beck via RT <bug-Date-Manip [...] rt.cpan.org>
From: Glenn Golden <gdg [...] zplane.com>
OK, here's another minimal example, using unique dates (read from a file, outside the timing loop) on each iteration. I see a speed ratio of about 7:1 with this set of dates, in favor of DM5. This is realistic data from my app. (The dates are extracted from an actual collection of email messages.)
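
The structure described here (load the dates first, time only the parsing) can be sketched roughly as follows. The attached scripts are not reproduced in this ticket, so this is an illustration and not the original code: `load_dates` and `time_parser` are hypothetical names, and the parser is passed in as a code reference so the same harness could be pointed at either Date::Manip version. `Time::HiRes` is a core Perl module.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: read one date string per line, outside the timed region.
sub load_dates {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    chomp(my @dates = <$fh>);
    close($fh);
    return grep { length } @dates;   # drop empty lines
}

# Hypothetical helper: time only the calls to the parser.
# Returns (elapsed seconds, number of inputs for which the parser
# returned a false value).
sub time_parser {
    my ($parser, @dates) = @_;
    my $failed = 0;
    my $t0 = [gettimeofday()];
    foreach my $date_in (@dates) {
        my $date_out = $parser->($date_in);
        $failed++ unless $date_out;
    }
    my $elapsed = tv_interval($t0, [gettimeofday()]);
    return ($elapsed, $failed);
}

# Usage sketch, assuming Date::Manip is installed and dates.txt exists:
#   use Date::Manip;
#   my @dates = load_dates('dates.txt');
#   my ($secs, $failed) = time_parser(\&ParseDate, @dates);
#   printf "%d dates, %d failed, %.3f s\n", scalar(@dates), $failed, $secs;
```

Counting failed parses alongside the elapsed time is a deliberate choice in this sketch: a run in which many inputs are rejected early would otherwise look misleadingly fast.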


Okay, I understand what you're seeing now, and it is a legitimate difference... though I'm still not sure it's a justification for using 5.x.

5.x does NO timezone handling (or I should say, no CORRECT timezone handling). There's a hash with offsets (i.e. -0500) hardcoded to a timezone... but they are wrong as often as they are right and conversions are often off by some multiple of hours. So, with you reading a large number of dates with offsets on them, 5.x is basically doing no valid timezone handling, so it may be faster than 6.x... but it's wrong much of the time.

6.x of course does real timezone handling, and does it correctly, so the results you get from running your script are correct (where they are usually wrong with 5.x).

So, 5.x is faster... but gives incorrect results with respect to timezones. Not a good reason to use it.
One other thing that I wanted to note, but omitted in my previous reply: If you look in the initial test script that I included, the dates DO have an offset, but they are all the same offset. In real life, this is the more common problem since most of the dates that you read will come from a single timezone. Your newest example (which is still real life, but a little less common) has them coming from many different timezones. This does slow down Date::Manip since it has to calculate and cache descriptions of so many different timezones, so it's making a lot less use of some of the internal speedups (i.e. internally cached values). So, when the dates all come from a single timezone, 6.x outperforms 5.x. When they come from many different timezones, 5.x outperforms 6.x... but it gets the values wrong.
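
The effect described above can be illustrated with a toy cache keyed on the offset string. This is only a conceptual sketch and says nothing about Date::Manip's actual internals: `describe_zone`, `lookup_zone`, and `count_misses` are made-up names, and the "expensive" step is a trivial stand-in. With a single offset the costly path runs once; with many distinct offsets it runs once per offset.

```perl
use strict;
use warnings;

my %zone_cache;     # offset string => cached description
my $misses = 0;     # how many times the expensive path ran

# Stand-in for the expensive work of building a timezone description.
sub describe_zone {
    my ($offset) = @_;
    $misses++;
    my ($sign, $hh, $mm) = $offset =~ /^([+-])(\d\d)(\d\d)$/
        or die "Bad offset: $offset";
    my $seconds = ($hh * 3600 + $mm * 60) * ($sign eq '-' ? -1 : 1);
    return { offset => $offset, seconds => $seconds };
}

# Return the cached description, computing it only on first use.
sub lookup_zone {
    my ($offset) = @_;
    return $zone_cache{$offset} //= describe_zone($offset);
}

# Reset the cache, look up every offset in the list, report the miss count.
sub count_misses {
    my @offsets = @_;
    %zone_cache = ();
    $misses = 0;
    lookup_zone($_) for @offsets;
    return $misses;
}

my @single = ('-0500') x 1000;                               # one zone, many dates
my @mixed  = map { sprintf('%+03d00', $_) } (-11 .. 12);     # 24 distinct offsets
printf "single offset: %d miss(es) in %d lookups\n", count_misses(@single), scalar(@single);
printf "mixed offsets: %d miss(es) in %d lookups\n", count_misses(@mixed), scalar(@mixed);
```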
Subject: Re: [rt.cpan.org #102188] Surprising ParseDate() performance differential, DM5 vs. DM6
Date: Sat, 21 Feb 2015 09:04:59 -0700
To: Sullivan Beck via RT <bug-Date-Manip [...] rt.cpan.org>
From: Glenn Golden <gdg [...] zplane.com>
Sullivan Beck via RT <bug-Date-Manip@rt.cpan.org> [2015-02-18 13:57:00 -0500]:
> Okay, I understand what you're seeing now, and it is a legitimate
> difference...
Aha, yep, agree that the DM6 TZ processing is the culprit, not the caching. Several more comparative experiments using datasets with/without TZ info confirm that this is the source of the issue, thanks.

> So, 5.x is faster... but gives incorrect results with respect to timezones.
> Not a good reason to use it.
Wellll... allow me to push back on that a little from a user's perspective: I can't agree with it as a general statement; it depends on the application tradeoffs. In my case, it happens that I'm not really concerned with DM5's TZ errors (which occur only during DST, presumably because the TZ's are all numerical) but cannot tolerate the large 7:1 speed differential.

Would you consider perhaps adding just a minor informative change to the man page text to point this out? E.g., instead of just

    Considerable time has been spent speeding up Date::Manip, and fairly
    simple benchmarks show that version 6 is around twice as fast as
    version 5.

maybe add one sentence something like this:

    One exception is parsing of dates containing timezone information,
    such as RFC5322 email 'Date' headers: DM5 may be considerably faster
    than DM6 in this case because it handles timezones in a simpler
    (though incorrect) way.

That extra sentence offers the reader practically useful information which he can use to determine his own accuracy vs. speed tradeoff. Just a suggestion.

Also, as an aside: In comparative tests with very large datasets (so as to wash out any overhead differences between DM5 and DM6) it looks like DM6 is around 2x faster than DM5 without TZs, but roughly 4x slower with TZs. This suggests that the TZ handling in DM6 represents something like 80% of the total parsing time per date. Is there perhaps some simple improvement (maybe even just a bug?) which could cut this down?

In any case... thanks again for your time in looking into this.

Glenn
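
The "something like 80%" estimate can be reproduced with a back-of-the-envelope calculation. The sketch below assumes (this is not stated in the ticket) that DM5's per-date cost is about the same with and without timezone information, and that all of DM6's extra cost on dates with timezones is timezone handling. The helper name `tz_share` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate the fraction of DM6's per-date time spent on
# timezone handling, given the two observed ratios.
#   $speedup_no_tz : how many times faster DM6 is than DM5 without timezones
#   $slowdown_tz   : how many times slower DM6 is than DM5 with timezones
# Assumes DM5's per-date cost is the same in both cases (normalised to 1).
sub tz_share {
    my ($speedup_no_tz, $slowdown_tz) = @_;
    my $dm6_no_tz = 1 / $speedup_no_tz;   # DM6 cost without timezones
    my $dm6_tz    = $slowdown_tz;         # DM6 cost with timezones
    return ($dm6_tz - $dm6_no_tz) / $dm6_tz;
}

# 2x faster without TZs, 4x slower with TZs, as reported above.
printf "Estimated TZ share of DM6 parse time: %.1f%%\n", 100 * tz_share(2, 4);
```

With those inputs the estimate is (4 - 0.5) / 4 = 87.5%, the same ballpark as the figure quoted in the message. If DM5 itself spends noticeable time on offsets, the true share would differ.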