DTD declarations in XML files are evil
8:39 PM, Tue, Mar 4 2008
Yesterday I went to a customer site to install a revision of a software package Chiral Software is developing for them. The application worked fine on our machine here at the office. After shutting down their JBoss server and deploying the new EAR file, we got:
java.lang.IllegalStateException: entityManager is null
    at org.jboss.seam.framework.EntityQuery.validate(EntityQuery.java:39)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    .....
java.lang.RuntimeException: org.dom4j.DocumentException: jboss.com Nested exception: jboss.com
    at org.jboss.seam.navigation.Pages.getDocumentRoot(Pages.java:956)
    at org.jboss.seam.navigation.Pages.parse(Pages.java:942)
    .......
Caused by: org.dom4j.DocumentException: jboss.com Nested exception: jboss.com
    at org.dom4j.io.SAXReader.read(SAXReader.java:484)
    at org.dom4j.io.SAXReader.read(SAXReader.java:343)
    at org.jboss.seam.util.XML.getRootElement(XML.java:21)
    at org.jboss.seam.navigation.Pages.getDocumentRoot(Pages.java:952)
It's not good when that happens during a customer installation.
I knew what the problem was. I had previously filed a JIRA on this issue, and discussed it in a blog entry. I went back to the office, and quickly found the guilty file. It was login.page.xml, a file I hadn't touched since we switched the application from Seam 1.2 to Seam 2.0. It had this DTD line:
<!DOCTYPE page PUBLIC "-//JBoss/Seam Pages Configuration DTD 1.2//EN" "http://jboss.com/products/seam/pages-1.2.dtd">
Presumably Seam 1.2 had the pages-1.2.dtd file in its classpath, but Seam 2.0 does not, so dom4j attempted to fetch that file over the net. The customer site had blocked outgoing connections from their internal server, as a prudent security practice.
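Finding files like this by hand gets tedious in a larger project. Here is a rough sketch of a scanner that walks a directory and reports XML files whose DOCTYPE points at an http(s) URL. The class name DoctypeScan, its methods, and the regular expression are my own illustration, not part of Seam or dom4j, and the pattern is deliberately simple, so treat the results as hints rather than a full XML parse.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists XML files whose DOCTYPE references a remote DTD.
class DoctypeScan {

    // Matches a DOCTYPE declaration containing a quoted http(s) URL.
    static final Pattern REMOTE_DOCTYPE =
        Pattern.compile("<!DOCTYPE\\s+[^>]*?[\"'](https?://[^\"']+)[\"']");

    // Returns the remote DTD URL found in the text, or null if there is none.
    static String remoteDtdUrl(String xml) {
        Matcher m = REMOTE_DOCTYPE.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    // Walks the tree under root and collects "file -> url" for each hit.
    static List<String> scan(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile)
                         .filter(p -> p.toString().endsWith(".xml"))
                         .collect(Collectors.toList());
        }
        List<String> hits = new ArrayList<>();
        for (Path p : files) {
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            String url = remoteDtdUrl(text);
            if (url != null) {
                hits.add(p + " -> " + url);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Scans the directory given as the first argument, or the current one.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String hit : scan(root)) {
            System.out.println(hit);
        }
    }
}
```

Run it against the exploded EAR or the source tree before heading to the customer site. Any file it prints is one the parser may try to fetch a DTD for if no local copy is available.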
The solution was simple. I took out the DTD line entirely. Now dom4j won't have any file to attempt to retrieve. I would recommend removing all DTD lines from your XML files until dom4j fixes this. Silently fetching files over the network, and including them as artifacts in your application, is broken behavior for an XML parser.
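For code where you create the XML parser yourself, the standard SAX EntityResolver interface is the usual hook for keeping a parser off the network. The sketch below uses the JDK's built-in DocumentBuilder rather than dom4j, and the names OfflineDtdExample, NO_FETCH and parse are made up for illustration. It does not help with the stack trace above, where Seam's own code drives the parser, so deleting the DTD line is still the fix there.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical example: parse XML that has a DOCTYPE without fetching the DTD.
class OfflineDtdExample {

    // Answers every external entity request (including the DOCTYPE's DTD)
    // with an empty document, so the parser never opens a network connection.
    static final EntityResolver NO_FETCH =
        (publicId, systemId) -> new InputSource(new StringReader(""));

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // we only want well-formedness checking
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(NO_FETCH);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<!DOCTYPE page PUBLIC \"-//JBoss/Seam Pages Configuration DTD 1.2//EN\" "
            + "\"http://jboss.com/products/seam/pages-1.2.dtd\">\n"
            + "<page/>";
        Document doc = parse(xml);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "page"
    }
}
```

The trade-off is that any defaults or entities the real DTD would have supplied are lost, which is the same result as removing the DOCTYPE line by hand.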
Of all institutions on the Web, you might think that W3C would be most supportive of DTD declarations, because they exist to support standards-based web applications. In fact, their servers are constantly being pounded by XML parsers fetching the same DTD files, over and over again, even though those files never change. They push out 100 million DTD fetches per day, and none of these requests are necessary. The XML parsers on the client side are making users wait while they fetch DTD files which aren't needed, and meanwhile W3C is paying the bandwidth bills for it. W3C says:
But many of these systems continue to re-request the same DTDs from our site thousands of times over, even after we have been serving them nothing but 503 errors for hours or days. Why are these systems bothering to request these resources at all if they don't care about the response? (For repeat offenders we eventually block the IPs at the TCP level as well.)
There's also a security issue here. It's possible that a carefully constructed replacement DTD for pages.xml might be able to get the parser to ignore parts of it, such as Restrict blocks. Given that these DTDs are served over plain HTTP, without any kind of certificate, this is potentially dangerous.
Until dom4j fixes its handling of DTD lines so that fetching can be disabled while parsing, don't put DTD lines in your XML files.