Subject: HTML output too tight
When extracting text from HTML, too much whitespace is discarded. For example, processing my page http://www.chrisdolan.net, I get text with the words butted up against each other:
Chris Dolanskip navigationChristopher J. DolanHomeAboutProjectsTalkChris Dolan is a software developer living in Madison, Wisconsin. With a PhD in Astronomy, he has a very strong math and science background. He started programming professionally as a teenager in the late 1980s. During free time, he is an active participant in several online software development communities and is an avid bicyclist. ? 2005 Chris Dolan | xhtml, css, gpg vcard
If I edit File::Extract::HTML and add the "tighten => 0" option to the HTML::TreeBuilder constructor, I get more useful output, but still with a little too much whitespace:
Chris Dolan skip navigation Christopher J. Dolan Home About Projects Talk Chris Dolan is a software developer living in Madison, Wisconsin. With a PhD in Astronomy, he has a very strong math and science background. He started programming professionally as a teenager in the late 1980s. During free time, he is an active participant in several online software development communities and is an avid bicyclist. ? 2005 Chris Dolan | xhtml, css, gpg vcard
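For reference, the workaround described above can be sketched roughly as follows. This is only an illustration, not the actual File::Extract::HTML code: the html_to_text helper and the sample markup are hypothetical, and the final whitespace-collapsing step is my own assumption about how the remaining extra whitespace might be trimmed. Passing tighten => 0 to the constructor is the change I tested; as far as I can tell it is not a prominently documented option, so treat it with some caution.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper for illustration; not part of File::Extract::HTML.
sub html_to_text {
    my ($html) = @_;

    # tighten => 0 asks the tree builder to keep whitespace-only text
    # around block-level elements instead of discarding it.
    my $tree = HTML::TreeBuilder->new(tighten => 0);
    $tree->parse($html);
    $tree->eof;

    my $text = $tree->as_text;
    $tree->delete;    # release the tree explicitly

    # Assumed post-processing step: collapse whitespace runs into a
    # single space and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Sample markup is made up for this sketch.
print html_to_text("<h1>Chris Dolan</h1>\n<p>skip navigation</p>\n"), "\n";
```

Collapsing whitespace after extraction keeps word boundaries intact while removing the surplus that tighten => 0 leaves behind, but a proper fix would probably belong inside the module itself.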