Bug #87684 for DBD-CSV: Note that huge $TMP folders may cause the test to run slow

Wed Aug 07 17:04:41 2013 ANDK [...] cpan.org - Ticket created

Subject: Note that huge $TMP folders may cause the test to run slow

In the Changelog I find this irritating sentence:

  Note that huge $TMP folders may cause the test to run slow

Does this mean that the reason for running into timeouts often might be the fact that my /tmp/ directory is huge? Please explain.

If you actually depend on a small /tmp/ directory, please issue a warning when you encounter a huge one. What does 'huge' mean? Many files? Big files? And what's the reasoning?

Thanks,

Thu Aug 08 02:51:16 2013 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #87684] Note that huge $TMP folders may cause the test to run slow
Date:	Thu, 8 Aug 2013 08:50:55 +0200
To:	bug-DBD-CSV [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Wed, 7 Aug 2013 17:04:42 -0400, "Andreas Koenig via RT" <bug-DBD-CSV@rt.cpan.org> wrote:

> Wed Aug 07 17:04:41 2013: Request 87684 was acted upon.
> Transaction: Ticket created by ANDK
> Queue: DBD-CSV
> Subject: Note that huge $TMP folders may cause the test to run slow
> Broken in: 0.41
> Severity: (no value)
> Owner: Nobody
> Requestors: ANDK@cpan.org
> Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87684 >
>
> In the Changelog I find this irritating sentence:

Irritating? It just states what it said.

> Note that huge $TMP folders may cause the test to run slow
>
> Does this mean that the reason for running into timeouts often
> might be the fact that my /tmp/ directory is huge? Please explain.

It means that the *tests* for the module itself, executed ONLY when you install the module, will include one specific test that will require considerable time if your $TMP is huge.

The reason is to test the new functionality of enabling DBD::CSV to look for tables in multiple folders/directories from the same database handle, enabling you to split the CSV files across logical folders. You might look at it like using schemas in a relational database.

Future changes will most likely include the possibility to add readonly flags to those extra folders or to specific files in these folders.

> If you actually depend on a small /tmp/ directory, please issue a warning
> when you encounter a huge one. What does 'huge' mean? Many files? Big
> files? And what's the reasoning?

Huge means *many* files, not big files. The reasoning is explained above.

What the functionality does is scan the content of the folder to see if there are files suitable for the current handle. Using attributes like f_ext might speed up that search in many cases:

    my $dbh = DBI->connect ("dbi:CSV:", undef, undef, {
        f_ext            => ".csv/r",
        RaiseError       => 1,
        PrintError       => 1,
        FetchHashKeyName => "NAME_lc",
        });

But that is all in the docs.

If I read your comments on perlmonks, I think you are not in search of DBD::CSV, but you require Text::CSV_XS or Text::CSV. These are the actual parsers and are also used in DBD::CSV under the hood.

Hope this helps. Enjoy CSV!

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using perl5.00307 .. 5.19  porting perl5 on HP-UX, AIX, and openSUSE
http://mirrors.develooper.com/hpux/  http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/
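To make the "scan the folder, then filter by extension" idea concrete, here is a minimal sketch using only core Perl modules. It is not DBD::CSV's actual code: the helper name candidate_tables and the sample file names are made up for illustration, and the filter is a plain suffix match that only approximates what f_ext does.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of DBD::CSV): list the plain files in $dir,
# optionally keeping only names that end in $ext.
sub candidate_tables {
    my ($dir, $ext) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my @files = grep { -f File::Spec->catfile ($dir, $_) } readdir $dh;
    closedir $dh;
    @files = grep { /\Q$ext\E\z/i } @files if defined $ext;
    return sort @files;
}

# Build a small scratch folder with a mix of CSV and unrelated files.
my $dir = tempdir (CLEANUP => 1);
for my $name (qw( orders.csv users.csv notes.txt core.12345 )) {
    open my $fh, ">", File::Spec->catfile ($dir, $name)
        or die "Cannot create $name: $!";
    close $fh;
}

my @all = candidate_tables ($dir);
my @csv = candidate_tables ($dir, ".csv");
print "without filter:   @all\n";
print "with .csv filter: @csv\n";
```

The whole directory still has to be read in both cases. The filter only reduces how many names remain as candidates afterwards.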

Thu Aug 08 02:51:16 2013 The RT System itself - Status changed from 'new' to 'open'

Fri Aug 09 02:41:21 2013 ANDK [...] cpan.org - Correspondence added

Sorry for the unspecified term 'irritating', I posted this in a bit of a hurry. What I was trying to say was:

(1) that a testing setup that scans my /tmp/ directory, or any other of my directories, should always be justified. Testing the ability to scan several directories can also be done by generating several directories and scanning those; this would be less intrusive and more to the point of testing. There is no inherent reason to scan my /tmp/ directory, so I'm irritated when testing software does it.

(2) if testing software can have extremely different behaviour depending on the content of my /tmp/ directory, I would expect that this is communicated to the tester in the form of a warning, a question to answer, or an option to activate it, not as a sentence in the Changes file.

(3) I have plenty of seemingly hanging tests for DBD-CSV-0.41. Maybe this test or this new functionality has a bug, or maybe my /tmp/ directory is huge. We will only find out if the test stops scanning my /tmp/ directory. Please consider this for the next release.

Apart from that, I have not posted on PerlMonks for quite a while. Maybe you can point me to the posting?

Thanks,

Fri Aug 09 03:03:44 2013 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #87684] Note that huge $TMP folders may cause the test to run slow
Date:	Fri, 9 Aug 2013 09:03:27 +0200
To:	bug-DBD-CSV [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Fri, 9 Aug 2013 02:41:21 -0400, "Andreas Koenig via RT" <bug-DBD-CSV@rt.cpan.org> wrote:

> Queue: DBD-CSV
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87684 >
>
> Sorry for the unspecified term 'irritating', I posted this in a bit of a hurry. What I was trying to say was:
>
> (1) that a testing setup that scans my /tmp/ directory, or any other of
> my directories should always be justified. Testing the ability to scan
> several directories can also be done by generating several directories
> and scan those, this would be less intrusive and more to the point of
> testing. There is no inherent reason to scan my /tmp/ directory, so
> I'm irritated when a testing software does it.

The testing of /tmp/ can be changed (I will document that) by temporarily setting the environment variable that designates what should be used as tmp. Usually that is something like $TMP or $TMPDIR.

There are several reasons to use $TMP. You might or might not agree that any of these reasons is a justification, but given that we want to run on all supported OSes (Linux, Unix, MacOSX, HP-UX, AIX, VMS, and Windows to name a few) we chose, after a lot of discussion with people using those non-standard OSes, to rely on File::Spec->tmpdir () as

 1. It is used quite extensively in other modules
 2. It returns something guaranteed to be usable
 3. It works on all OSes. Transparently

If we create our own folder, and take the absolute path of that, the functionality that we want to test isn't tested.

Again, if you do not want this test to be run using /tmp at the single moment that you install the module (remember that this is not used unless specified by *you* during actual use of the module), set $TMPDIR.

  tmpdir
      Returns a string representation of the first writable directory
      from a list of possible temporary directories. Returns the current
      directory if no writable temporary directories are found. The list
      of directories checked depends on the platform; e.g.
      File::Spec::Unix checks $ENV{TMPDIR} (unless taint is on) and /tmp.

          $tmpdir = File::Spec->tmpdir ();
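As a small illustration of the advice to set $TMPDIR, the sketch below creates an empty scratch directory and starts a child perl with TMPDIR pointing at it, much like running `TMPDIR=... make test` would. It assumes taint mode is off and that the child perl is the same interpreter ($^X). The variable names are only illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# An empty scratch directory standing in for a "clean" temp location.
my $empty = tempdir (CLEANUP => 1);

# Start a child perl with TMPDIR set for that process only and ask it
# what File::Spec->tmpdir () resolves to.
my $seen = do {
    local $ENV{TMPDIR} = $empty;
    qx{"$^X" -MFile::Spec -e "print File::Spec->tmpdir"};
};

print "scratch directory:      $empty\n";
print "child perl sees tmpdir: $seen\n";
```

A child process is used so that the override is in place before File::Spec->tmpdir () is called for the first time in that process.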

> (2) if a testing software can have extremely different behaviour > depending on the content of my /tmp/ directory, I would expect > that this is communicated to the tester in form of a warning or > a question to answer or an option to activate it, not as a > sentence in the Changes file.

The *behavior* of the tests didn't change other than that the tests now work on *all* supported OSes instead of just the ones that I had access to. The *only* impact is that on some OSes the tests will take longer if that folder is stuffed with a lot of files. The content is not used at all.

> (3) I have plenty of seemingly hanging tests for DBD-CSV-0.41. Maybe > this test or this new functionality has a bug or maybe my /tmp/ > directory is huge. We will only find out if the test stops scanning > my /tmp/ directory. Please consider this for the next release.

Set %TMPDIR% before running the tests.

> Apart from that I have not posted for quite a while on PerlMonks.
> Maybe you can point me to the posting?

Chatterbox

Aug 07 21:30:55 <cbstream> [Lady_Aleena] Which is the preferred CSV parser, Text or DBD CSV? I have an actual CSV file I would like to parse.
Aug 07 21:31:55 <cbstream> [Lady_Aleena] This thing has actual commas and a lot of quotes, and the first line is the headings.
Aug 07 21:36:02 <cbstream> [Lady_Aleena] My data files are all pipe separated values with no quotes so I use my own little fuction to parse mine normally. I haven't tried using a module in many years.
Aug 07 21:39:07 <cbstream> [Lady_Aleena] I think I need more caffiene. I forgot Text::CSV was hard to use.
Aug 07 21:55:03 <cbstream> [Lady_Aleena] Now I remember why I rolled my own parser for sep values files.
Aug 07 21:57:07 <cbstream> [Lady_Aleena] There are no headings saying "Putting your data into a hash".
Aug 07 22:06:52 <cbstream> [Lady_Aleena] /me gets something to eat.
Aug 07 22:25:29 <cbstream> [Lady_Aleena] /me gives up.
Aug 07 22:29:47 <cbstream> [Lady_Aleena] For me it figures Outlook 2003 can't make a usable CSV file. When it made the file, it didn't ignore line breaks in some fields, so put parts of fields on different lines. (face palm)
Aug 07 22:33:22 <cbstream> [Lady_Aleena] I can't parse it then, since neither Text nor DBD CSV goes from file to aoh or hoh in the documentation.
Aug 07 22:44:11 <cbstream> [Lady_Aleena] MLX, I need to figure out how to make it so the first line of the CSV file is read as the headings.
Aug 07 22:44:11 <cbstream> [Lady_Aleena] I think.
Aug 07 22:53:26 <cbstream> [space_monk] @Lady_Aleena: you can use Text:CSV to read your pipe separated data - just change the separator character when creating an instance
Aug 07 22:55:34 <cbstream> [Lady_Aleena] space_monk, my pipe separated files are a breeze to parse with my home rolled function, but parsing the csv created by Outlook 2003 with all its quirks is difficult since the documentation to the various CSV modules is incomplete.
Aug 07 23:00:11 <cbstream> [Lady_Aleena] Also, there isn't any special characters used at the end of each record.
Aug 08 08:21:57 <TuxCM-> [Lady_Aleena] late, but maybe you read backlog. "Which is the preferred CSV parser, Text or DBD CSV?" - Technically [mod://DBD::CSV] is not a parser: is is a DBD for [mod://DBI] to enable a database interface over CSV files. It uses [mod://Text::CSV_XS] to parse, but can be forced to use [mod://Text::CSV]
Aug 08 08:22:31 <TuxCM-> [Lady_Aleena] late, but maybe you read backlog. "Which is the preferred CSV parser, Text or DBD CSV?" - Technically [mod://DBD::CSV] is not a parser: is is a DBD for [mod://DBI] to enable a database interface over CSV files. It uses [mod://Text::CSV_XS] to parse, but can be forced to use [mod://Text::CSV]
Aug 08 08:22:34 <cbstream> [Tux] [Lady_Aleena] late, but maybe you read backlog. "Which is the preferred CSV parser, Text or DBD CSV?" - Technically [ http://search.cpan.org/perldoc?DBD%3A%3ACSV |DBD::CSV] is not a parser: is is a DBD for [ http://search.cpan.org/perldoc?DBI |DBI] to enable a database interface over CSV files. It uses [ http://search.cpan.org/perldoc?Text%3A%3ACSV_XS |Text::CSV_XS] t
Aug 08 08:27:35 <TuxCM-> [Lady_Aleena] "I can't parse it then, since neither Text nor DBD CSV goes from file to aoh or hoh" - read again, then use <c>getline_all ()</c> for AoA or <c>getline_hr_all ()</c> for AoH. The latter needs some setup
Aug 08 08:27:37 <cbstream> [Tux] [Lady_Aleena] "I can't parse it then, since neither Text nor DBD CSV goes from file to aoh or hoh" - read again, then use <c>getline_all ()</c> for AoA or <c>getline_hr_all ()</c> for AoH. The latter needs some setup

> Thanks,


Sat Aug 10 01:54:52 2013 ANDK [...] cpan.org - Correspondence added

Thanks for the clarifications. At the moment my smoker has only a medium sized /tmp/ directory with 200000 files. In this environment make test runs in 83 secs. If I set $TMPDIR to an empty directory, it is reduced to 11 secs. I can extrapolate that the test takes one hour when my /tmp/ directory has 10M files.

My distroprefs now say:

    match:
      distribution: '/DBD-CSV-\d'
    test:
      commandline: "TMPDIR=`mktemp -d /tmp/DBD-CSV-test-tempdirectory-XXXX` make test"

That makes my smokers sane again.

I'm still not convinced that this test should be kept as is. A smoker should not run into timeouts based on the contents of the /tmp/ directory.
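The extrapolation above can be reproduced with a few lines of arithmetic. This is only a sketch that assumes the run time grows linearly with the number of files in the temp directory. It uses the two measurements quoted in the message (83 s with 200000 files, 11 s with an empty directory), and the variable and function names are made up for illustration.

```perl
use strict;
use warnings;

# Measurements quoted in the ticket.
my ($files_measured, $secs_full, $secs_empty) = (200_000, 83, 11);

# Assumed linear model: a fixed base cost plus a constant cost per file.
my $secs_per_file = ($secs_full - $secs_empty) / $files_measured;

sub estimated_seconds {
    my ($files) = @_;
    return $secs_empty + $secs_per_file * $files;
}

my $estimate = estimated_seconds (10_000_000);
printf "estimated run time for 10M files: %.0f s (about %.0f minutes)\n",
    $estimate, $estimate / 60;
```

Under that assumption the estimate comes out at roughly 3611 seconds, which matches the "one hour" figure.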

Sat Aug 10 02:20:34 2013 ANDK [...] cpan.org - Correspondence added

And BTW, Lady Aleena is not me.

Sat Aug 10 03:51:40 2013 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #87684] Note that huge $TMP folders may cause the test to run slow
Date:	Sat, 10 Aug 2013 09:51:24 +0200
To:	bug-DBD-CSV [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Sat, 10 Aug 2013 02:20:34 -0400, "Andreas Koenig via RT" <bug-DBD-CSV@rt.cpan.org> wrote:

> Queue: DBD-CSV
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87684 >
>
> And BTW, Lady Aleena is not me.

Sorry. I know the both of you and answered two tickets almost simultaneously.

Sat Aug 10 04:38:36 2013 CHORNY [...] cpan.org - Correspondence added

I've encountered this problem when DBD-CSV was killed (after 30 minutes of testing) in a CPAN smoke run. The new version of File::Temp that is included in perl 5.18 generated many files (after a day of smoking it can be 25000-30000). I'm using Windows. The output is:

>perl -Mblib t/55_dir_search.t

ok 1 - use DBI;
ok 2 - DSN for C:\strawberry180\cpan\build\DBD-CSV-0.41-5Zjyxs\output1980
ok 3 - DSN for t
ok 4 - DSN for C:\DOCUME~1\c\LOCALS~1\Temp

-- 
Alexandr Ciornii, http://chorny.net

Sat Aug 10 04:39:05 2013 CHORNY [...] cpan.org - Requestor CHORNY added

Sat Aug 10 10:55:39 2013 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #87684] Note that huge $TMP folders may cause the test to run slow
Date:	Sat, 10 Aug 2013 16:55:18 +0200
To:	bug-DBD-CSV [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Sat, 10 Aug 2013 01:54:52 -0400, "Andreas Koenig via RT" <bug-DBD-CSV@rt.cpan.org> wrote:

> Queue: DBD-CSV
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87684 >
>
> Thanks for the clarifications. At the moment my smoker has only a medium sized /tmp/ directory with 200000 files. In this environment make test runs in 83 secs. If I set $TMPDIR to an empty directory, it is reduced to 11 secs. I can extrapolate that the test takes one hour when my /tmp/ directory has 10M files.
>
> My distroprefs now say:
>
>     match:
>       distribution: '/DBD-CSV-\d'
>     test:
>       commandline: "TMPDIR=`mktemp -d /tmp/DBD-CSV-test-tempdirectory-XXXX` make test"
>
> That makes my smokers sane again.
>
> I'm still not convinced that this test should be kept as is. A smoker should not run into timeouts based on the contents of the /tmp/ directory.

After more thought, I think the right solution in this case might be to skip the tests for that folder under AUTOMATED_TESTING.
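One way to picture the AUTOMATED_TESTING proposal is sketched below. It is not the code from the DBD-CSV test suite or from the later commit: the helper dirs_to_search and the directory names "output" and "t" are hypothetical, and the only assumption is that the test would leave the system temp directory out of its search list when running under a smoker.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: the folders a directory-search test would visit.
# The system temp directory is only included for non-automated runs.
sub dirs_to_search {
    my ($automated) = @_;
    my @dirs = ("output", "t");
    push @dirs, File::Spec->tmpdir () unless $automated;
    return @dirs;
}

my @interactive = dirs_to_search (0);
my @smoker      = dirs_to_search (1);
my @current     = dirs_to_search ($ENV{AUTOMATED_TESTING});

print "interactive run searches: @interactive\n";
print "smoker run searches:      @smoker\n";
print "this process would search ", scalar @current, " folder(s)\n";
```

With this shape, a smoker never touches the temp directory, while an interactive install still does unless $TMPDIR is pointed elsewhere.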

Sun Aug 11 00:04:00 2013 ANDK [...] cpan.org - Correspondence added

Depending on AUTOMATED_TESTING is action at a distance, isn't it? I mean every user would need a warning and an opt out as long as there is a test that scans /tmp/.

Sun Aug 11 05:15:58 2013 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #87684] Note that huge $TMP folders may cause the test to run slow
Date:	Sun, 11 Aug 2013 11:15:41 +0200
To:	bug-DBD-CSV [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Sun, 11 Aug 2013 00:04:00 -0400, "Andreas Koenig via RT" <bug-DBD-CSV@rt.cpan.org> wrote:

> Queue: DBD-CSV
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=87684 >
>
> Depending on AUTOMATED_TESTING is action at a distance, isn't it? I
> mean every user would need a warning and an opt out as long as there
> is a test that scans /tmp/.

Your points and use-case are clear.

Is https://github.com/perl5-dbi/DBD-CSV/commit/dafa218010c32783e0023d doing what you would like to see as default behavior?

(Docs not yet updated for the use of $TMPDIR)

Sun Aug 11 13:30:33 2013 ANDK [...] cpan.org - Correspondence added


> Is https://github.com/perl5-dbi/DBD-CSV/commit/dafa218010c32783e0023d > doing what you would like to see as default behavior?

Yes, it does. Thanks a lot, your effort is highly appreciated!

Thu Aug 29 03:08:56 2013 HMBRAND [...] cpan.org - Status changed from 'open' to 'patched'

Thu Jul 03 12:20:42 2014 HMBRAND [...] cpan.org - Fixed in 0.43 added

Thu Jul 03 12:20:43 2014 HMBRAND [...] cpan.org - Status changed from 'patched' to 'resolved'