Ensuring Valid XML Content
December 13, 2002 Timothy Prickett Morgan
Hey, David:
I will preface this email by saying that my knowledge of XML is limited, but I’ve been able to use the Xerces parser to move XML data to a physical file using QSH/Java.
Now, if you’re receiving an XML file from a customer, how do you guarantee that the customer’s XML document references your external DTD? What would prevent the customer from embedding an “invalid” DTD in the XML document?
— Chris
The problem you describe is one of the limitations you have to deal with when you use a Document Type Definition (DTD) to validate XML documents. A DTD enables you to do some high-level checking of an XML document: you can check its basic structure, but you cannot check the actual content, such as whether an element holds a date or a number.
You cannot easily ignore a DTD embedded in XML content. In addition to providing validation, DTDs are used to supply default information. For example, you can define an entity in a DTD that provides a replacement value for entity references in an XML document.
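As a rough illustration of that dependency, the sketch below parses a small document whose internal DTD declares one entity and prints the attribute value after the parser has expanded the reference. It uses the JAXP parser bundled with the JDK rather than JDOM, and the class name EntityDemo, the method companyOf, and the sample XML are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityDemo {
    // Hypothetical sample: an internal DTD subset that declares a single entity.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE personnel [\n"
      + "  <!ENTITY company-name \"Big Company\">\n"
      + "]>\n"
      + "<personnel company=\"&company-name;\"/>";

    /** Parses the XML text and returns the root element's company attribute. */
    static String companyOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // The parser has already replaced &company-name; with the entity's value.
            return doc.getDocumentElement().getAttribute("company");
        } catch (Exception e) {
            // Parse errors (for example an undeclared entity) end up here.
            throw new IllegalStateException("Parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(companyOf(XML)); // prints "Big Company"
    }
}
```

If you strip the DOCTYPE from the sample text, the parse fails outright, because the document then refers to an entity that was never declared.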
In this situation, an XML schema, which is more flexible than a DTD, might work better to validate the data you receive. Schemas can coexist with DTDs and allow you to check an XML document’s structure and content. The ability to check content allows you to type-check the elements and attributes contained in an XML document. For example, you can make sure that start_date elements contain a date, and age elements contain a positive integer. In addition, you can use a schema to check that an attribute’s values fall in a certain range or contain certain values.
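To make the type checking concrete, here is a minimal sketch that validates a few small documents against an inline schema. The employee element, the grade attribute, and the class name TypeCheckDemo are assumptions invented for this example, and the code uses the javax.xml.validation API that ships with current JDKs, not the JDOM and Xerces setup used later in this reply.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class TypeCheckDemo {
    // Hypothetical schema: start_date must be a date, age a positive integer,
    // and the optional grade attribute must fall between 1 and 10.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='employee'>"
      + " <xs:complexType>"
      + "  <xs:sequence>"
      + "   <xs:element name='start_date' type='xs:date'/>"
      + "   <xs:element name='age' type='xs:positiveInteger'/>"
      + "  </xs:sequence>"
      + "  <xs:attribute name='grade'>"
      + "   <xs:simpleType>"
      + "    <xs:restriction base='xs:int'>"
      + "     <xs:minInclusive value='1'/>"
      + "     <xs:maxInclusive value='10'/>"
      + "    </xs:restriction>"
      + "   </xs:simpleType>"
      + "  </xs:attribute>"
      + " </xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    /** Returns true when the XML text satisfies the schema above. */
    static boolean isValid(String xml) {
        Validator validator;
        try {
            validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
        } catch (SAXException e) {
            throw new IllegalStateException("The schema itself is broken", e);
        }
        try {
            // With no error handler installed, the first violation is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<employee grade='3'><start_date>2002-12-13</start_date>"
                    + "<age>42</age></employee>";
        String badDate = good.replace("2002-12-13", "yesterday");
        String badGrade = good.replace("grade='3'", "grade='11'");
        System.out.println("good:     " + isValid(good));
        System.out.println("badDate:  " + isValid(badDate));
        System.out.println("badGrade: " + isValid(badGrade));
    }
}
```

A DTD could only say that start_date and age contain character data; the schema rejects the second and third documents because of the values themselves.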
If you decide that DTDs are more trouble than they are worth, you might also want to consider using Simple Object Access Protocol (SOAP). The SOAP specification prohibits DTDs in messages, and SOAP is gaining popularity, which lets you sidestep this issue without appearing too rigid.
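If you want to apply the same no-DTD rule to plain XML that you receive, one possible approach is to have the parser reject any document that carries a DOCTYPE. The sketch below assumes a parser that recognizes the Xerces feature http://apache.org/xml/features/disallow-doctype-decl, as the parser bundled with current JDKs does; older Xerces releases may not support it, so check the feature list for your version. The class name NoDoctypeDemo and the method accepts are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NoDoctypeDemo {
    // Assumption: the parser in use recognizes this Xerces feature.
    static final String DISALLOW_DOCTYPE =
        "http://apache.org/xml/features/disallow-doctype-decl";

    /** Returns true when the XML text parses and contains no DOCTYPE. */
    static boolean accepts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(DISALLOW_DOCTYPE, true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps them off System.err.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Covers a forbidden DOCTYPE as well as ordinary well-formedness errors.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<personnel company='Big Company'/>";
        String withDtd = "<!DOCTYPE personnel [<!ENTITY x 'y'>]>" + plain;
        System.out.println("plain:   " + accepts(plain));
        System.out.println("withDtd: " + accepts(withDtd));
    }
}
```

The trade-off is the one described above: a document that relies on its DTD for entity values or defaults is turned away rather than parsed without them.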
I created an XML document with an internal DTD and validated it against a schema using JDOM b8. I like to use JDOM because it simplifies processing of XML documents. Under the covers, JDOM uses a parser that you can specify. Because it has the best schema support, I specified the Xerces 2.2.1 parser from the Apache Software Foundation.
In the example, the XML document has an internal DTD with an entity that supplies a company name. If you ignore or remove the DTD, the entity reference in the personnel element's company attribute (“&company-name;”) will not be replaced with “Big Company.” Here is that XML document, which you should save as personal.xml.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE personnel [
  <!ENTITY company-name "Big Company">
  <!ELEMENT family (#PCDATA)>
  <!ELEMENT given (#PCDATA)>
  <!ELEMENT name (family,given)>
  <!ELEMENT nationality (#PCDATA)>
  <!ELEMENT person (name,nationality)>
  <!ATTLIST person id CDATA #IMPLIED>
  <!ELEMENT personnel (person+)>
  <!ATTLIST personnel company CDATA #IMPLIED>
  <!ATTLIST personnel xmlns:xsi CDATA #IMPLIED
                      xsi:noNamespaceSchemaLocation CDATA #IMPLIED>
]>
<personnel company="&company-name;"
           xsi:noNamespaceSchemaLocation="personal.xsd"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <person id="Big.Boss">
    <name><family>Boss</family> <given>Big</given></name>
    <nationality>Roman</nationality>
  </person>
  <person id="one.worker">
    <name><family>Worker</family> <given>One</given></name>
    <nationality>Greek</nationality>
  </person>
  <person id="two.worker">
    <name><family>Worker</family> <given>Two</given></name>
    <nationality>Phoenician</nationality>
  </person>
  <person id="three.worker">
    <name><family>Worker</family> <given>Three</given></name>
    <nationality>Greek</nationality>
  </person>
  <person id="four.worker">
    <name><family>Worker</family> <given>Four</given></name>
    <nationality>Greek</nationality>
  </person>
</personnel>
The schema supplied with the example checks to see that the XML document is properly structured. In addition to structure, the schema ensures that the ID and name element values are unique and that the nationality specified is either Greek or Roman. Here is the schema I used, which you should save as personal.xsd.
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name="personnel">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="person" minOccurs='1' maxOccurs='unbounded'/>
      </xs:sequence>
      <xs:attribute name="company" type="xs:string" use="required"/>
    </xs:complexType>
    <xs:unique name="unique1">
      <xs:selector xpath="person"/>
      <xs:field xpath="name/given"/>
      <xs:field xpath="name/family"/>
    </xs:unique>
    <xs:key name='empid'>
      <xs:selector xpath="person"/>
      <xs:field xpath="@id"/>
    </xs:key>
  </xs:element>
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="name"/>
        <xs:element ref="nationality"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID" use='required'/>
    </xs:complexType>
  </xs:element>
  <xs:element name="name">
    <xs:complexType>
      <xs:all>
        <xs:element ref="family"/>
        <xs:element ref="given"/>
      </xs:all>
    </xs:complexType>
  </xs:element>
  <xs:element name="family" type='xs:string'/>
  <xs:element name="given" type='xs:string'/>
  <xs:element name="email" type='xs:string'/>
  <xs:element name="nationality">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="Roman"/>
        <xs:enumeration value="Greek"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
</xs:schema>
The Java program I wrote applies the schema to the XML document. Save the source in a demo directory as ValidDocument.java. Next, edit the program so that it specifies the correct location for the personal.xsd file. Here is the source for the program:
package demo;

import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/**
 * Class ValidDocument provides an XML document validated against the personal.xsd schema.
 * @author David Morris
 */
public class ValidDocument {
    // XML file to read
    File file;

    public ValidDocument(File file) {
        this.file = file;
    }

    public Document build() throws JDOMException, IOException {
        // Create a new SAXBuilder that uses the Xerces parser with validation turned on
        SAXBuilder builder = new SAXBuilder("org.apache.xerces.parsers.SAXParser", true);
        // Uncommenting the following line ensures that the document received stands alone
        // builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        // Turn on XML Schema validation
        builder.setFeature("http://apache.org/xml/features/validation/schema", true);
        // Schema to apply; edit this URL to match the location of personal.xsd on your system
        builder.setProperty(
            "http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation",
            "file:///C:/projects/examples/src/xml/personal.xsd");
        // Close the stream even when parsing or validation fails
        FileInputStream in = new FileInputStream(file);
        try {
            return builder.build(in);
        } finally {
            in.close();
        }
    }

    // Placeholder: validation already happens while build() parses the file
    public Document validate(Document doc) throws JDOMException, IOException {
        return doc;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java demo.ValidDocument <xml-file>");
            System.exit(1);
        }
        try {
            File file = new File(args[0]);
            ValidDocument validDocument = new ValidDocument(file);
            Document doc = validDocument.build();
            // Output the document to System.out
            XMLOutputter outputter = new XMLOutputter();
            outputter.output(doc, System.out);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}
Before compiling or running this program, add xercesImpl.jar and jdom.jar to your CLASSPATH. On my system, I stored the documents and program in the /temp directory. I switched to the temp directory and used the following commands to compile and run the program:
javac demo/ValidDocument.java
java demo.ValidDocument personal.xml
Running the program resulted in the following output:
Value ‘Phoenician’ is not facet-valid with respect to enumeration ‘[Roman, Greek]’
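A parser without an error handler typically stops at the first violation. If you would rather collect every problem in a customer's document before rejecting it, you can install an error handler that records each message. The sketch below is an assumption-laden illustration, not part of the program above: it uses the javax.xml.validation API bundled with current JDKs, a cut-down inline version of the nationality rule from personal.xsd, and a hypothetical class named CollectErrorsDemo.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CollectErrorsDemo {
    // Cut-down, hypothetical version of the nationality rule from personal.xsd.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='personnel'><xs:complexType><xs:sequence>"
      + "<xs:element name='nationality' maxOccurs='unbounded'>"
      + "<xs:simpleType><xs:restriction base='xs:string'>"
      + "<xs:enumeration value='Roman'/><xs:enumeration value='Greek'/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates the XML text and returns every validation error message. */
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            Validator validator =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)))
                    .newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored here */ }
                // Record the error and return, so validation carries on.
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                // Fatal errors (not well-formed input) cannot be recovered from.
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String xml = "<personnel><nationality>Greek</nationality>"
                   + "<nationality>Phoenician</nationality>"
                   + "<nationality>Egyptian</nationality></personnel>";
        for (String problem : validate(xml)) {
            System.out.println(problem);
        }
    }
}
```

A validator may report more than one message per bad value, so treat the list as a set of findings to show the sender, not as a count of bad elements.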
I have covered a lot of ground here, but there is no easy way to deal with DTDs embedded in XML content.
— David