Bug #102284 for Date-Manip: ParseDate() TZ processing is a significant (~8x) performance bottleneck

Sun Feb 22 12:16:23 2015 gdg [...] zplane.com - Ticket created

Subject:	ParseDate() TZ processing is a significant (~8x) performance bottleneck
Date:	Sun, 22 Feb 2015 10:15:35 -0700
To:	Sullivan Beck via RT <bug-Date-Manip [...] rt.cpan.org>
From:	Glenn Golden <gdg [...] zplane.com>

Just for completeness, filing this as a separate ticket pertaining only to DM6, rather than confusing the issue with differential comparisons relative to DM5, per the earlier ticket (102188). The intent is just to document the behavior of DM6 in hopes that at some future time it might be examined to see if the with-TZ performance can perhaps be improved, since it's clearly a bottleneck as it stands now.

The two attached datasets (dateset3.txt and dateset3_notz.txt) contain about 2400 date strings drawn from email messages, and are identical except that in the latter set the TZ strings have been removed. The script displays timing results for both DM5 and DM6, but the intent here is just to draw attention to the DM6 ParseDate() performance differential between dates with vs. without TZs. I see about 7x - 9x across three machines (two x86-64, one x686) and two perl versions (5.16.3 and 5.20.1). This suggests that something like 80% of the ParseDate() realtime is spent in handling the TZ info.

Again, I'm not complaining, since realtime performance was explicitly not a design consideration; just trying to document the behavior and provide a helpful example in case it becomes of interest at a later time to look into potential speedups.

Thanks,
Glenn
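As a rough check on the estimate in this message: if parsing with TZ strings takes r times as long as parsing without them, and the non-TZ work is assumed to cost the same in both runs, then the share of time spent on TZ handling is 1 - 1/r. The sketch below is in Python purely for illustration; the function name is made up and is not part of Date::Manip.

```python
def tz_time_fraction(slowdown: float) -> float:
    """Fraction of with-TZ parse time attributable to TZ handling.

    Assumption: the non-TZ parsing work costs the same in both runs, so
    with_tz = base + tz_cost and slowdown = with_tz / base.
    """
    if slowdown < 1:
        raise ValueError("slowdown must be >= 1")
    return 1.0 - 1.0 / slowdown


# The 7x to 9x range reported in the ticket.
for ratio in (7, 8, 9):
    print(f"{ratio}x slower -> {tz_time_fraction(ratio):.1%} of time in TZ handling")
```

Under that assumption, the reported 7x to 9x range works out to roughly 86% to 89%, which is in line with the "something like 80%" estimate.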


Mon Feb 23 16:09:05 2015 sbeck [...] cpan.org - Correspondence added

I have spent some time adding some optimizations and that has sped up timezone handling somewhat. Unfortunately, I don't anticipate being able to get anywhere near even with Date::Manip 5 in terms of speed because there is just too much overhead for each new timezone. When you load almost all of the timezone modules (which your test case does), it'll be slow. I have added a note to this effect in the docs... but I've also stated (correctly) that the results for 5.x cannot be relied upon for accuracy. I'm afraid that if timezones are important to you... you just have to pay the price.

Incidentally, this isn't unique to Date::Manip. DateTime will have the same issue since it also stores all of the timezones in separate modules that have to be read in upon first use.

I'm going to continue to play with this and try to get more optimizations, but at this point, I am pretty comfortable with the current state of things (meaning that it is working as desired and expected, and while I hope to speed things up, I don't think I'll be getting factors of 2 out of it).

Mon Feb 23 16:09:06 2015 The RT System itself - Status changed from 'new' to 'open'

Fri Feb 27 22:03:00 2015 gdg [...] zplane.com - Correspondence added

Subject:	Re: [rt.cpan.org #102284] ParseDate() TZ processing is a significant (~8x) performance bottleneck
Date:	Fri, 27 Feb 2015 20:02:42 -0700
To:	Sullivan Beck via RT <bug-Date-Manip [...] rt.cpan.org>
From:	Glenn Golden <gdg [...] zplane.com>

Sullivan Beck via RT <bug-Date-Manip@rt.cpan.org> [2015-02-23 16:09:06 -0500]:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=102284 >
>
> Unfortunately, I don't anticipate being able to get anywhere near even with Date::Manip 5 in terms of speed because there is just too much overhead for each new timezone. When you load almost all of the timezone modules (which your test case does), it'll be slow.

Aha... I did not appreciate earlier that so much TZ info was being loaded for this test case. I naively thought that since the test case TZs were all expressed as numeric offsets, there would be no need to load more than 24 TZ modules. But now I understand (I think?) that it has to load every module whose offset is equal to the given numerical offset. Is that right? (My question is only for curiosity at this point.)

> Incidentally, this isn't unique to Date::Manip. DateTime will have the same issue since it also stores all of the timezones in separate modules that have to be read in upon first use.

Date::Parse seems to handle RFC-5322 dates ok (except for a few that were malformed and technically not in compliance) so wound up using that when I needed blazing speed.

> I'm going to continue to play with this and try to get more optimizations, but at this point, I am pretty comfortable with the current state of things (meaning that it is working as desired and expected, and while I hope to speed things up, I don't think I'll be getting factors of 2 out of it).

Understandable, and totally in agreement given the design philosophy of the module. Once again I thank you for your time in looking into this and responding to the queries. It's really an extremely useful and well-documented module, thanks for writing and maintaining it. Glenn

Sat Feb 28 07:35:14 2015 sbeck [...] cpan.org - Correspondence added


> Aha... I did not appreciate earlier that so much TZ info was being loaded for this test case. I naively thought that since the test case TZs were all expressed as numeric offsets, there would be no need to load more than 24 TZ modules. But now I understand (I think?) that it has to load every module whose offset is equal to the given numerical offset. Is that right? (My question is only for curiosity at this point.)

You are correct. Offsets are not timezones... offsets are simply the current offset from GMT. A timezone is a full description of time changes over all of history and includes any number of offsets, abbreviations (EDT, EST, etc.), and critical dates (dates where the offset and/or abbreviation change).

So, when you use an offset of -0700, Date::Manip has to check every timezone that has ever used that offset, and then it tries to determine which timezone you are actually referring to. In actuality, your test case loaded over 340 timezones (there are right around 450 total timezones), so you really managed to come up with a worst-case scenario.

Luckily, I consider this case fairly rare in that in most real-life situations, you'll be parsing dates in your timezone, and that's it. Also, although your test case seemed large, it was only a few thousand dates, and spreading the overhead of loading so many timezones over such a small number of dates exaggerated the difference. If you were to bump up the number of dates by several orders of magnitude, I believe that the ratio of 6.xx to 5.xx would come down.
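To make the offset-versus-timezone distinction concrete, here is a minimal sketch in Python of why a single offset such as -0700 maps to several candidate timezones. The table is a tiny hand-written subset covering only current rules, and both the table and the function names are illustrative assumptions. This is not how Date::Manip stores or searches its zone data.

```python
from collections import defaultdict

# Illustrative subset only: zone name -> UTC offsets (minutes east of UTC)
# used under current rules. Real zone data covers hundreds of zones plus
# their historical changes.
ZONE_OFFSETS = {
    "America/Denver": {-420, -360},       # MST / MDT
    "America/Phoenix": {-420},            # MST all year
    "America/Los_Angeles": {-480, -420},  # PST / PDT
    "America/New_York": {-300, -240},     # EST / EDT
}


def zones_by_offset(zone_offsets):
    """Invert the table: offset in minutes -> set of zones that use it."""
    index = defaultdict(set)
    for zone, offsets in zone_offsets.items():
        for minutes in offsets:
            index[minutes].add(zone)
    return index


def parse_offset(text):
    """Convert '+HHMM' or '-HHMM' to minutes east of UTC."""
    sign = -1 if text[0] == "-" else 1
    return sign * (int(text[1:3]) * 60 + int(text[3:5]))


index = zones_by_offset(ZONE_OFFSETS)
candidates = sorted(index[parse_offset("-0700")])
print(candidates)
```

Even in this four-zone table, -0700 matches three zones, so the offset alone cannot say which one was meant.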

> > Incidentally, this isn't unique to Date::Manip. DateTime will have the same issue since it also stores all of the timezones in separate modules that have to be read in upon first use.

> Date::Parse seems to handle RFC-5322 dates ok (except for a few that were malformed and technically not in compliance) so wound up using that when I needed blazing speed.

Date::Parse is not going to help you get timezones correct. As a matter of fact, Date::Parse is closely related to Date::Manip 5.xx. Date::Parse is written by Graham Barr and is an extension to a simple timezone module that he wrote quite a long time ago, which I borrowed and expanded for use in Date::Manip 5.xx. Both are able to parse a very small subset of timezone offsets and abbreviations. Neither is guaranteed to do them right though.

For example, if you parse a date with an offset of -0700, is that in daylight saving time or in standard time? Neither module knows when the critical dates are (i.e. when it switched) so they just guess you're referring to standard time usually. Unfortunately, in the US, standard time only lasts for 1/3 of the year! So 2/3 of the time, parsing a date with an offset will assign it to the wrong timezone. What this means is that you will not be able to do calculations or conversions with the date, because your results will be wrong. Also, if you look at any timezone information about the date (i.e. what timezone or abbreviation it has), it'll be wrong so often as to be useless.

To wrap all of this up... If you want to parse dates where timezone information is not important, there are any number of options. If you want to parse dates where timezone information is important, there are only two options: Date::Manip 6.xx and DateTime. Both are slower than your other options.

Sat Mar 07 14:51:22 2015 gdg [...] zplane.com - Correspondence added

Subject:	Re: [rt.cpan.org #102284] ParseDate() TZ processing is a significant (~8x) performance bottleneck
Date:	Sat, 7 Mar 2015 12:51:04 -0700
To:	Sullivan Beck via RT <bug-Date-Manip [...] rt.cpan.org>
From:	Glenn Golden <gdg [...] zplane.com>

Sullivan Beck via RT <bug-Date-Manip@rt.cpan.org> [2015-02-28 07:35:14 -0500]:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=102284 >
>
> You are correct. Offsets are not timezones... offsets are simply the current offset from GMT. A timezone is a full description of time changes over all of history and includes any number of offsets, abbreviations (EDT, EST, etc.), and critical dates (dates where the offset and/or abbreviation change).

OK, got it, thanks. I've thought about this a lot more now, and did some experiments over the past week to understand better how DM6 behaves in the face of dates with offsets. I still have some further questions and comments though, hope you'll bear with me on this.

> So, when you use an offset of -0700, Date::Manip has to check every timezone that has ever used that offset, and then it tries to determine which timezone you are actually referring to.

But if a date is supplied with only an offset and no explicit putative TZ -- as is the case for the "dateset3.txt" example dataset from the earlier post -- then DM can't uniquely determine which TZ was intended. The best it can do is choose one among the (possibly several) feasible TZs, and then assign that TZ to the date object. (At least, that seems to be how it behaves, based on a number of experiments. Is that right?)

If so, I was wondering if you'd entertain knocking around some ideas about a potential enhancement to DM, in the form of an option that allows date strings of this specific type (offset only, no explicit putative TZ) to be special-cased during parsing. The general idea would be that enabling this option would affect validation and TZ assignment something along the following lines:

* Validation would be relaxed, so as to avoid the need to load all the feasible TZs (which I presume is required in order to validate that the given {absolute time, offset} pair actually corresponds to a period contained in some extant TZ in your database). Exactly what might be meant by "relaxed" is open to some debate. For example, perhaps accept only offsets having the minimum granularity among all actual TZs (I think this would be 15 mins?). Or perhaps just have a small in-core table listing all valid TZ offsets and permit the given offset iff it appears in that table.

* Assigning a TZ to the date object would become somewhat tricky with the option enabled, since there might be no actual tabled TZ satisfying both the offset and the {absolute time, offset} pair simultaneously. But perhaps instead there could be the notion of a "pseudo-TZ" especially for offset-only cases, which would be dynamically created based on the offset, rather than drawn from the database. (For example PSEUDO+0100, or PSEUDO-0445, etc.)
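The two points above could be sketched roughly as follows. This is a Python illustration of the idea only, not a proposal for the actual Date::Manip implementation. The function name `pseudo_tz`, the 15-minute granularity, and the 14-hour bound are all assumptions made for the example.

```python
import re
from datetime import timedelta, timezone

_OFFSET_RE = re.compile(r"([+-])(\d{2})(\d{2})")


def pseudo_tz(offset: str, granularity_min: int = 15) -> timezone:
    """Build a fixed-offset 'pseudo-TZ' from a '+HHMM' / '-HHMM' string.

    Relaxed validation (assumed rules): the offset must be a multiple of
    `granularity_min` minutes and no more than 14 hours from UTC. No zone
    database is consulted.
    """
    match = _OFFSET_RE.fullmatch(offset)
    if match is None:
        raise ValueError(f"not a +HHMM/-HHMM offset: {offset!r}")
    sign, hours, minutes = match.group(1), int(match.group(2)), int(match.group(3))
    if minutes >= 60:
        raise ValueError(f"minutes out of range: {offset!r}")
    total = hours * 60 + minutes
    if total % granularity_min != 0 or total > 14 * 60:
        raise ValueError(f"implausible offset: {offset!r}")
    delta = timedelta(minutes=total)
    if sign == "-":
        delta = -delta
    # The name admits that only the offset is known, not the real zone.
    return timezone(delta, name=f"PSEUDO{offset}")


tz = pseudo_tz("-0445")
print(tz.tzname(None), tz.utcoffset(None))
```

A date carrying such an object keeps the correct instant and the supplied offset, but makes no claim about daylight saving rules or a real zone name.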
The benefits of enabling the option would be (a) much faster parsing of offset-only dates, and (b) avoidance of assigning artificially chosen TZs from the database, in preference for a more "honest" pseudo-TZ assignment which essentially admits, "we have no clue what TZ you mean, but the information in this pseudo-TZ is consistent with your supplied offset."

I have some more detailed ideas on the above, but first wanted to ask whether this even sounds like something you'd consider debating about or whether, due to implementation constraints or lack of utility, it is simply a non-starter.

Anyway... just tossing it out as a hopefully constructive suggestion. I'm very well aware that simple-sounding suggestions may be hugely complicated to actually implement, and not even worth considering for that reason of course. Just thinking out loud.

One more topic though, related to (and motivating) the above: Honestly, I can't agree with your assessment:

> Luckily, I consider this case fairly rare in that most real-life situations, you'll be parsing dates in your timezone, and that's it.

IMO, the parsing of large sets of dates having different offsets is not a corner case at all, but probably reasonably common, especially for email processing. Any application parsing email "Date" or "Received" header fields will very quickly wind up having to handle a significant fraction of all the possible offsets. The "dateset3.txt" dataset from my earlier post was derived in exactly this way, not artificially at all, by simply extracting "Date" headers from a few of my own mail folders.

And if "Received" headers are considered as well, then an even wider fraction of offsets is going to quickly arise: Even in the case when senders are all located in a geographically narrow area and their "Date" fields are closely clustered in a few TZs, there will generally be a much wider array of "Received" dates because server locations (which handle intermediate movements of the message) are scattered all over.

I can't justify the above based on any objective facts that I'm aware of, just opining based on my own experiences.

Anyway... once again, thank you for all your time on this.

Regards,
Glenn
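One way to put numbers on this claim for a given mailbox is to tally the distinct offsets that appear in its "Date" values. Below is a minimal sketch using Python's standard `email.utils.parsedate_to_datetime`. The helper name is made up, and the sample values are simply the Date headers from this ticket plus one malformed string.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def offset_histogram(date_headers):
    """Count how often each numeric UTC offset appears in RFC 5322 dates."""
    counts = Counter()
    for value in date_headers:
        try:
            parsed = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            continue  # malformed header: skip it
        if parsed.tzinfo is None:
            continue  # "-0000" means no zone information was given
        counts[parsed.strftime("%z")] += 1
    return counts


sample = [
    "Sun, 22 Feb 2015 10:15:35 -0700",
    "Fri, 27 Feb 2015 20:02:42 -0700",
    "Mon, 23 Feb 2015 16:09:06 -0500",
    "not a date",
]
histogram = offset_histogram(sample)
print(sorted(histogram.items()))
```

Running the same tally over "Received" headers as well would show how quickly the set of distinct offsets grows for a particular mail archive.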