Sullivan Beck via RT <bug-Date-Manip@rt.cpan.org> [2015-02-28 07:35:14 -0500]:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=102284 >
>
> You are correct. Offsets are not timezones... offsets are simply
> the current offset from GMT. A timezone is a full description of
> time changes over all of history and includes any number of offsets,
> abbreviations (EDT, EST, etc.), and critical dates (dates where
> the offset and/or abbreviation change).
>
OK, got it, thanks.
I've thought about this a lot more now, and did some experiments over the past
week to understand better how DM6 behaves in the face of dates with offsets.
I still have some further questions and comments, though; I hope you'll bear
with me on this.
> So, when you use an offset of -0700, Date::Manip has to check every timezone
> that has ever used that offset, and then it tries to determine which
> timezone you are actually referring to.
>
But if a date is supplied with only an offset and no explicit putative TZ --
as is the case for the "dateset3.txt" example dataset from the earlier post --
then DM can't uniquely determine which TZ was intended. The best it can do is
choose one among the (possibly several) feasible TZs, and then assign that TZ
to the date object. (At least, that seems to be how it behaves, based on a
number of experiments. Is that right?)
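For reference, the experiments were basically along these lines (a minimal
sketch using the OO interface; the date string is arbitrary, and I'm relying on
the zone-related printf directives to see what got assigned):

    use strict;
    use warnings;
    use Date::Manip::Date;

    my $date = Date::Manip::Date->new();

    # A date carrying only an offset, no named timezone -- as in dateset3.txt.
    my $err = $date->parse("Sat, 28 Feb 2015 07:35:14 -0700");
    die $date->err() . "\n" if $err;

    # DM6 appears to settle on one of the feasible zones and assign it to the
    # object; the zone-related printf directives show what it chose.
    print $date->printf("%Y-%m-%d %H:%M:%S  abbrev=%Z  offset=%z\n");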
If so, I was wondering if you'd entertain knocking around some ideas about a
potential enhancement to DM, in the form of an option that allows date strings
of this specific type (offset only, no explicit putative TZ) to be special-cased
during parsing.
The general idea would be that enabling this option would change validation
and TZ assignment along the following lines:
* Validation would be relaxed, so as to avoid the need to load all the
feasible TZs (which I presume is required in order to validate that the
given {absolute time, offset} pair actually corresponds to a period
contained in some extant TZ in your database).
Exactly what might be meant by "relaxed" is open to some debate. For
example, perhaps accept only offsets having the minimum granularity
among all actual TZs (I think this would be 15 mins?). Or perhaps just have
a small in-core table listing all valid TZ offsets and permit the given
offset iff it appears in that table.
* Assigning a TZ to the date object would become somewhat tricky with the
option enabled, since there might be no tabled TZ in which the supplied offset
is actually in effect at the given absolute time. But perhaps instead there
could be the notion of a "pseudo-TZ" especially for offset-only cases, which
would be dynamically created based on the offset rather than drawn from the
database (for example PSEUDO+0100, PSEUDO-0445, etc.). A rough sketch of both
of these ideas follows this list.
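Here is that sketch; it's standalone, doesn't touch any real DM internals, and
every name in it is made up:

    use strict;
    use warnings;

    # Hypothetical relaxed validation: check the offset against a small
    # in-core rule/table instead of loading every TZ that has ever used it.
    # The 15-minute granularity is only an assumption on my part.
    sub offset_is_plausible {
        my ($offset) = @_;                       # e.g. "-0700" or "+0545"
        return 0 unless $offset =~ /^([+-])(\d{2})(\d{2})$/;
        my $minutes = ($1 eq '-' ? -1 : 1) * ($2 * 60 + $3);
        return 0 if $minutes < -12 * 60 || $minutes > 14 * 60;  # rough real-world range
        return 0 if $minutes % 15;               # assumed granularity
        return 1;
    }

    # Hypothetical pseudo-TZ: created dynamically from the offset rather than
    # drawn from the zoneinfo database.
    sub make_pseudo_tz {
        my ($offset) = @_;
        return "PSEUDO" . $offset;               # e.g. "PSEUDO-0445"
    }

    my $offset = "-0445";
    print make_pseudo_tz($offset), "\n" if offset_is_plausible($offset);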
The benefits of enabling the option would be (a) much faster parsing of offset-
only dates, and (b) avoiding the artificial choice of a TZ from the database in
favor of a more "honest" pseudo-TZ assignment which essentially admits, "we have
no clue what TZ you mean, but the information in this pseudo-TZ is consistent
with your supplied offset."
I have some more detailed ideas on the above, but first wanted to ask whether
this even sounds like something you'd be open to discussing or whether, due to
implementation constraints or lack of utility, it is simply a non-starter.
Anyway... just tossing it out as a hopefully constructive suggestion. I'm very
well aware that simple-sounding suggestions can be hugely complicated to
actually implement, and may not even be worth considering for that reason, of
course. Just thinking out loud.
One more topic, though, related to (and motivating) the above: honestly, I
can't agree with your assessment:
> Luckily, I consider this case fairly rare in that most real-life situations,
> you'll be parsing dates in your timezone, and that's it.
>
IMO, the parsing of large sets of dates having different offsets is not a corner
case at all, but probably reasonably common, especially for email processing.
Any application parsing email "Date" or "Received" header fields will very
quickly wind up having to handle a significant fraction of all the possible
offsets. The "dateset3.txt" dataset from my earlier post was derived in exactly
this way, not artificially at all, by simply extracting "Date" headers from a
few of my own mail folders. And if "Received" headers are considered as well,
then an even wider range of offsets is going to arise quickly: even when the
senders are all located in a geographically narrow area and their "Date" fields
are closely clustered in a few TZs, there will generally be a much wider array
of "Received" dates, because the servers that handle the intermediate hops of a
message are scattered all over.
I can't justify the above with any objective facts that I'm aware of; I'm just
opining based on my own experience.
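(In case it helps, the kind of extraction I'm describing is nothing more
elaborate than the following; the mbox path is made up, and folded "Received:"
headers aren't handled at all, so treat it purely as a sketch.)

    use strict;
    use warnings;
    use Date::Manip::Date;

    my $mbox = "Mail/inbox";                     # hypothetical mbox file
    open my $fh, '<', $mbox or die "cannot open $mbox: $!\n";

    my %offsets;
    my $date = Date::Manip::Date->new();

    while (my $line = <$fh>) {
        # Naive header handling: "Date:" is the date itself; "Received:"
        # carries its date after the final semicolon.
        my $str = $line =~ /^Date:\s*(.+)/         ? $1
                : $line =~ /^Received:.*;\s*(.+)/  ? $1
                : next;
        next if $date->parse($str);              # nonzero return means the parse failed
        $offsets{ $date->printf("%z") }++;       # tally the GMT offsets seen
    }

    printf "%-6s %d\n", $_, $offsets{$_} for sort keys %offsets;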
Anyway... once again, thank you for all your time on this.
Regards,
Glenn