Subject: New feature standardization: non-blocking IO
Hello,
I am the author of Parse::MediaWikiDump, which frequently has to deal with the English
Wikipedia dump files, which currently sit at 22 gigabytes. I'm on a never-ending quest to make
that module process them as fast as possible. Because Parse::MediaWikiDump exposes a
pull-style API (the caller asks for the next item), I need non-blocking IO from the XML
parser, which currently limits me to XML::Parser.
XML::Parser works well; however, XML::SAX::ExpatXS is considerably faster. I would like to
switch over to ExpatXS to gain the speed, but I need to retain the non-blocking IO feature of
XML::Parser. I've looked into XML::SAX::ExpatXS, and it has existing, seemingly unused
hooks for parsing a document one piece at a time. I'm confident I can enable this
feature fairly easily. However, after searching for "non-blocking perl SAX" I see a lot of
people who are unhappy that non-blocking IO is not part of the standard. I propose
that it become part of it.
Here is my proposal:
* Add a new feature called http://xml.org/sax/features/non-blocking
* Use the following methods for the non-blocking API:
  - parse_start() - set up the parser instance and get it ready to accept data; no return
    value
  - parse_more($data) - parse a piece of the document and invoke any callbacks required;
    returns true if everything is ok or false if not
  - parse_done() - signal that there is no more of the document; returns true if everything
    is ok or false otherwise
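
To make the intended call sequence concrete, here is a rough sketch of how a pull-style
reader could sit on top of the three methods. ToyIncrementalParser, its on_title callback
and next_title() are made-up names for illustration only. The toy class is not a real XML
parser or SAX driver; it only recognises complete <title> elements so the example runs with
core Perl. A real driver would fire SAX events from parse_more() instead.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a SAX driver exposing the proposed methods.
# NOT a real XML parser: it only spots complete <title>...</title> elements.
package ToyIncrementalParser;

sub new {
    my ($class, %args) = @_;
    return bless { on_title => $args{on_title}, buffer => '' }, $class;
}

# parse_start(): reset state and get ready to accept data; no return value.
sub parse_start {
    my ($self) = @_;
    $self->{buffer} = '';
    return;
}

# parse_more($data): take one piece of the document and invoke the callback
# for every element that is now complete; returns true if everything is ok.
sub parse_more {
    my ($self, $data) = @_;
    $self->{buffer} .= $data;
    while ($self->{buffer} =~ s{\A.*?<title>(.*?)</title>}{}s) {
        $self->{on_title}->($1);
    }
    return 1;
}

# parse_done(): signal end of document; false if a <title> was left open.
sub parse_done {
    my ($self) = @_;
    return $self->{buffer} =~ /<title>/ ? 0 : 1;
}

package main;

# Small in-memory document standing in for a dump file.
my $document = '<mediawiki><page><title>Perl</title></page>'
             . '<page><title>Expat</title></page></mediawiki>';
open my $fh, '<', \$document or die "cannot open in-memory document: $!";

my @queue;         # events produced by callbacks, waiting to be pulled
my $finished = 0;  # set once parse_done() has been called
my $parser = ToyIncrementalParser->new(on_title => sub { push @queue, $_[0] });
$parser->parse_start;

# Pull method: feed the parser only until at least one event is queued.
sub next_title {
    while (!@queue && !$finished) {
        my $got = read($fh, my $chunk, 16);   # deliberately tiny chunks
        die "read failed: $!" unless defined $got;
        if ($got) {
            $parser->parse_more($chunk) or die "parse error";
        }
        else {
            $parser->parse_done or die "document ended early";
            $finished = 1;
        }
    }
    return shift @queue;   # undef once the document is exhausted
}

my @titles;
while (defined(my $title = next_title())) {
    push @titles, $title;
}
```

The point of the sketch is that next_title() never reads more of the input than it needs
to produce the next item, which is what the pull API requires and what a single blocking
parse() call cannot offer.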
This follows the API exposed by XML::SAX::Expat::Incremental, which is unfortunately built
on top of XML::Parser, so it won't give me the speed increase I need. I think that
standardizing non-blocking IO is a worthwhile endeavor.