Specifying DOCTYPE for xml in Perl


lazyme
04-09-2010, 03:08 AM
Hi,

I have written a Perl script which scrapes a website and generates an HTML file. I pass this file on to a Java servlet. When parsing the XML in the servlet, I sometimes get an org.xml.sax.SAXParseException. I noticed that the exception occurs because the generated XML sometimes contains named entities like nbsp, Iuml, etc., which the XML parser cannot resolve. Is there some way I can get around the problem?

After a bit of searching online, I found that declaring entities like
<!ENTITY nbsp CDATA "*"> is one way to get well-formed XML. But how do I declare the entities in the Perl file?

Any help would be appreciated.

Thanks.

Sixtease
04-13-2010, 03:57 AM
I suggest converting the named entities to numeric ones, using one of these modules:
HTML::Entities::Numbered (http://search.cpan.org/perldoc?HTML::Entities::Numbered)
or XML::Entities (http://search.cpan.org/perldoc?XML::Entities) (the author of which is, incidentally, me).
Both can convert named entities into numeric character references, which an XML parser accepts without any entity declarations.
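
For illustration, here is a minimal sketch of that conversion in plain Perl. It does not use either module's API. The helper name numify_entities and the four-entry lookup table are assumptions made up for this example, and a real script would rely on one of the modules above for the full entity list.

```perl
use strict;
use warnings;

# Illustrative subset only (assumption): a real table would cover
# every HTML named entity, which is what the CPAN modules provide.
my %CODEPOINT = (
    nbsp   => 160,
    Iuml   => 207,
    eacute => 233,
    copy   => 169,
);

# Hypothetical helper: rewrite known named entities as numeric
# character references. Names missing from the table, including the
# five entities XML predefines (amp, lt, gt, quot, apos), are left
# exactly as they were.
sub numify_entities {
    my ($text) = @_;
    $text =~ s{&([A-Za-z][A-Za-z0-9]*);}
              {exists $CODEPOINT{$1} ? "&#$CODEPOINT{$1};" : "&$1;"}ge;
    return $text;
}

# prints <p>na&#207;ve&#160;&amp; co</p>
print numify_entities('<p>na&Iuml;ve&nbsp;&amp; co</p>'), "\n";
```

Running the scraped markup through a step like this before writing the file means the servlet's parser only ever sees numeric references, so no DOCTYPE or entity declarations are needed.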

lazyme
04-13-2010, 11:54 AM
Thanks Sixtease,

I was able to get around the problem using the same method!