PHP 5 DOM and XMLReader: Reading XML with Namespace (Part 1)

By alif - Posted on 07 April 2009

PHP 5's DOM extension and XMLReader make it easy to read XML files. The good thing about PHP 5's DOM classes (mainly DomDocument, DomNodeList and DomNode) is that they implement the standard DOM features specified by the W3C. W3C's reference on DOM can be viewed here. So anyone who has used the DOM before (say, in JavaScript) should find PHP 5's DOM easy to grasp.

The following are the methods and properties of PHP 5's DOM that I commonly use:

getElementsByTagName
getAttribute
childNodes
nodeName
nodeValue
getElementsByTagNameNS

Here's a Simple XML File called test.xml:

<?xml version="1.0" encoding="ISO-8859-1"?>
<library>
 <book isbn="781">
   <name>SCJP 1.5</name>
   <info><![CDATA[Sun Certified Java Programmer book]]></info>
 </book>
 <book isbn="194">
   <name>jQuery is Awesome!</name>
   <info><![CDATA[jQuery Reference Book]]></info>
 </book>	
</library>

Below I explain how to read this XML. First, load the file into a DomDocument:

$dom = new DomDocument();
$dom->load('test.xml');
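
load() (and its string counterpart loadXML()) returns false when the XML cannot be parsed, so it is worth checking the result before going further. Here is a minimal sketch of that check. The malformed $brokenXml string is a made-up input for illustration, and loadXML() is used instead of load() only so that the snippet does not depend on a file.

```php
<?php
// Collect libxml parse errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical malformed input: the <book> tag is never closed.
$brokenXml = '<library><book isbn="781"></library>';

$dom = new DOMDocument();
$loaded = $dom->loadXML($brokenXml);

$messages = array();
if ($loaded === false) {
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError object with line and message properties.
        $messages[] = 'Line ' . $error->line . ': ' . trim($error->message);
    }
    libxml_clear_errors();
}

print_r($messages);
```

With a well-formed file the same check simply passes and $messages stays empty.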

$dom now holds the parsed XML file. Next, getElementsByTagName returns the list of elements (nodes) named 'book':

$bookElemList = $dom->getElementsByTagName('book');

$bookElemList is a DomNodeList object containing the DomNode for each 'book' element. It has a property, length, which gives the number of nodes in the list, and a method, item(index), which returns the node at the given index. Below, I loop through $bookElemList and store the contents of each 'book' in an associative array. To read an attribute, I use the getAttribute method as shown below:

$bookList = array();
// run a for loop to iterate through all bookElemList index.
for($i=0;$i<$bookElemList->length;$i++) {
	$bookList[$i] = array (
          // get the isbn attribute of the book element and store it in book_isbn
	  'book_isbn' => $bookElemList->item($i)->getAttribute('isbn'),
          // get 'name' element inside bookElemList at $i index.
	  'name'      => $bookElemList->item($i)->getElementsByTagName('name')->item(0)->nodeValue,
	  'info'      => $bookElemList->item($i)->getElementsByTagName('info')->item(0)->nodeValue
	);
 
}
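
A DomNodeList can also be walked with foreach, which avoids the repeated item($i) calls. The sketch below is only an alternative way of writing the same loop. It uses an inline, shortened copy of the library XML (an assumption, so that the snippet runs without test.xml).

```php
<?php
// Inline, shortened stand-in for test.xml.
$xml = <<<'XML'
<library>
 <book isbn="781"><name>SCJP 1.5</name></book>
 <book isbn="194"><name>jQuery is Awesome!</name></book>
</library>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$bookList = array();
// foreach hands back each 'book' element in document order.
foreach ($dom->getElementsByTagName('book') as $bookElem) {
    $bookList[] = array(
        'book_isbn' => $bookElem->getAttribute('isbn'),
        'name'      => $bookElem->getElementsByTagName('name')->item(0)->nodeValue,
    );
}

print_r($bookList);
```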

Instead of getting name and info separately, I could have used the childNodes property to walk through the child elements, as below. (Note that I have to check nodeType to make sure each node is an element. This is required because the DOM treats the whitespace between tags as text nodes. If you want to avoid checking nodeType, remove the whitespace from the XML before reading it.) The values of nodeType are listed on W3C's page.

$bookList = array();
for($i=0;$i<$bookElemList->length;$i++) {
  $bookList[$i]['book_isbn'] = $bookElemList->item($i)->getAttribute('isbn');
 
 foreach($bookElemList->item($i)->childNodes as $eachChild) {
  if( $eachChild->nodeType == XML_ELEMENT_NODE )  // ensure the node is an Element (value 1)
   $bookList[$i][$eachChild->nodeName] = $eachChild->nodeValue;
 }
}
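
One way to drop the whitespace without editing the file is DomDocument's preserveWhiteSpace property, set before loading. The sketch below assumes the whitespace sits only between tags, as in test.xml, and uses an inline copy of one book so it runs on its own.

```php
<?php
// Inline stand-in for one book from test.xml, indentation included.
$xml = <<<'XML'
<library>
 <book isbn="781">
   <name>SCJP 1.5</name>
   <info><![CDATA[Sun Certified Java Programmer book]]></info>
 </book>
</library>
XML;

$dom = new DOMDocument();
// Must be set before loading: whitespace-only text nodes are then skipped.
$dom->preserveWhiteSpace = false;
$dom->loadXML($xml);

$book = $dom->getElementsByTagName('book')->item(0);
$fields = array('book_isbn' => $book->getAttribute('isbn'));

// No nodeType check here: only the element children are left.
foreach ($book->childNodes as $child) {
    $fields[$child->nodeName] = $child->nodeValue;
}

print_r($fields);
```

If the XML mixes real text and elements inside the same parent, the nodeType check is still the safer route.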

However, I prefer to fetch the contents by name, because in most cases I only need the text of a few elements. Looping over childNodes copies every child element into the array, which wastes memory on large XML files with many elements.

Here's the print_r output of $bookList:

Array
(
    [0] => Array
        (
            [book_isbn] => 781
            [name] => SCJP 1.5
            [info] => Sun Certified Java Programmer book
        )
 
    [1] => Array
        (
            [book_isbn] => 194
            [name] => jQuery is Awesome!
            [info] => jQuery Reference Book
        )
 
)

The above was a very simple XML file. Now let's parse a slightly more complex one that uses namespaces. An XML namespace avoids name conflicts between XML elements by qualifying them with a prefix that is bound to a URI. Brief info on XML namespaces can be viewed here.

I chose to read an XML file featured in JWPlayer's setup wizard. It can be viewed here: JWPlayer's RSS XML

Here's the XML:

<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
	<channel>
		<title>Example media RSS playlist for the JW Player</title>
		<link>http://www.longtailvideo.com</link>
 
		<item>
			<title>Big Buck Bunny - FLV Video</title>
			<link>http://www.bigbuckbunny.org/</link>
 
			<description>Big Buck Bunny is a short animated film by the Blender Institute, part of the Blender Foundation. Like the foundation's previous film Elephants Dream, the film is made using free and open source software.</description>
			<media:credit role="author">the Peach Open Movie Project</media:credit>
			<media:content url="http://www.longtailvideo.com/jw/upload/bunny.flv" type="video/x-flv" duration="33" />
		</item>
 
		<item>
			<title>Big Buck Bunny - MP3 Audio with thumb</title>
			<link>http://www.bigbuckbunny.org/</link>
 
			<description>Big Buck Bunny is a short animated film by the Blender Institute, part of the Blender Foundation. Like the foundation's previous film Elephants Dream, the film is made using free and open source software.</description>
			<media:credit role="author">the Peach Open Movie Project</media:credit>
			<media:content url="http://www.longtailvideo.com/jw/upload/bunny.mp3" type="audio/mpeg" duration="33" />
			<media:thumbnail url="http://www.longtailvideo.com/jw/upload/bunny.jpg" />
		</item>
 
		<item>
			<title>Big Buck Bunny - PNG Image with start</title>
 
			<link>http://www.bigbuckbunny.org/</link>
			<description>Big Buck Bunny is a short animated film by the Blender Institute, part of the Blender Foundation. Like the foundation's previous film Elephants Dream, the film is made using free and open source software.</description>
			<media:group>
				<media:credit role="author">the Peach Open Movie Project</media:credit>
				<media:content url="http://www.longtailvideo.com/jw/upload/bunny.png" type="image/png" duration="20" start="10" />
			</media:group>
		</item>
 
	</channel>
</rss>

Here's the first tag from the file, which declares the XML namespace:

<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">

The namespace is declared on the first line by xmlns:media, so 'media' is the namespace prefix used in this XML file, while the namespace URI is 'http://search.yahoo.com/mrss/'. In an element such as media:credit, 'credit' is the local name.

To read a node with a namespace, the following method can be used:

$dom->getElementsByTagNameNS('namespaceURI', 'local_Name_of_Node');
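
To make the difference between prefix, local name and namespace URI concrete, here is a small sketch on a made-up, cut-down feed (the item content is an assumption, not the real JWPlayer file). It looks up the URI bound to the 'media' prefix with lookupNamespaceURI and then passes that URI, plus the local name, to getElementsByTagNameNS.

```php
<?php
// Made-up, cut-down feed that declares the same 'media' namespace.
$xml = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Sample item</title>
      <media:credit role="author">Sample author</media:credit>
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// Resolve the URI that the 'media' prefix is bound to on the root element.
$mediaNs = $dom->documentElement->lookupNamespaceURI('media');

// The lookup takes the namespace URI and the local name, never the prefix.
$credit = $dom->getElementsByTagNameNS($mediaNs, 'credit')->item(0);

echo $credit->nodeName, "\n";      // qualified name: prefix plus local name
echo $credit->prefix, "\n";        // the prefix on its own
echo $credit->localName, "\n";     // the local name on its own
echo $credit->namespaceURI, "\n";  // the URI the prefix stands for
```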

The code below explains how to read the above XML

// load the file on the DOM
$dom = new DomDocument();
$dom->load('http://www.longtailvideo.com/jw/upload/mrss.xml');
 
$itemList 		= array();
 
// get the list of Items.
$itemElemList 	= $dom->getElementsByTagName('item');
for($i=0;$i<$itemElemList->length;$i++) {
	$itemList[$i] = array (
		'title'       => $itemElemList->item($i)->getElementsByTagName('title')->item(0)->nodeValue,
		'link'        => $itemElemList->item($i)->getElementsByTagName('link')->item(0)->nodeValue,
		'description' => $itemElemList->item($i)->getElementsByTagName('description')->item(0)->nodeValue,			
		'credit'      => $itemElemList->item($i)->getElementsByTagNameNS('http://search.yahoo.com/mrss/', 'credit')->item(0)->nodeValue,
		'content_url' => $itemElemList->item($i)->getElementsByTagNameNS('http://search.yahoo.com/mrss/', 'content')->item(0)->getAttribute('url'),
	);
 
}

Here's the print_r output of $itemList:

Array
(
    [0] => Array
        (
            [title] => Big Buck Bunny - FLV Video
            [link] => http://www.bigbuckbunny.org/
            [description] => Big Buck Bunny is a short animated film by the Blender Institute, part of the Blender Foundation. Like the foundation's previous film Elephants Dream, the film is made using free and open source software.
            [credit] => the Peach Open Movie Project
            [content_url] => http://www.longtailvideo.com/jw/upload/bunny.flv
        )
 
    [1] => Array
        (
            [title] => Big Buck Bunny - MP3 Audio with thumb
            [link] => http://www.bigbuckbunny.org/
            [description] => Big Buck Bunny is a short animated film by the Blender Institute, part of the Blender Foundation. Like the foundation's previous film Elephants Dream, the film is made using free and open source software.
            [credit] => the Peach Open Movie Project
            [content_url] => http://www.longtailvideo.com/jw/upload/bunny.mp3
        )
 
    [2] => Array
        (
            [title] => Big Buck Bunny - PNG Image with start
            [link] => http://www.bigbuckbunny.org/
            [description] => Big Buck Bunny is a short animated film by the Blender Institute, part of the Blender Foundation. Like the foundation's previous film Elephants Dream, the film is made using free and open source software.
            [credit] => the Peach Open Movie Project
            [content_url] => http://www.longtailvideo.com/jw/upload/bunny.png
        )
 
)
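
Not every item carries every media element: in the feed above only the second item has a media:thumbnail. Calling item(0) on an empty list returns null, so chaining getAttribute onto it would fail. The sketch below shows one way to guard against that. firstMediaElement is a hypothetical helper name of my own, and the feed and example.com URL are made-up stand-ins.

```php
<?php
// Hypothetical helper: first element with the given local name in the
// media namespace below $parent, or null when there is none.
function firstMediaElement(DOMElement $parent, $localName)
{
    $list = $parent->getElementsByTagNameNS('http://search.yahoo.com/mrss/', $localName);
    return $list->length > 0 ? $list->item(0) : null;
}

// Made-up feed: the first item has no thumbnail, the second one does.
$xml = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item><title>Without thumb</title></item>
    <item>
      <title>With thumb</title>
      <media:thumbnail url="http://example.com/bunny.jpg" />
    </item>
  </channel>
</rss>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$thumbs = array();
foreach ($dom->getElementsByTagName('item') as $item) {
    $thumb = firstMediaElement($item, 'thumbnail');
    // Store null instead of calling getAttribute() on a missing element.
    $thumbs[] = $thumb !== null ? $thumb->getAttribute('url') : null;
}

var_dump($thumbs);
```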

So far I have explained reading XML by loading it into a DomDocument. An important thing to realize is that when an XML file is loaded into a DomDocument, the entire file is parsed into an in-memory tree, which is what makes it possible to walk through every node in the XML.

But if the XML is very large, loading it via DomDocument is unwise, because the entire file is held in memory. For that case PHP 5 provides another class: XMLReader. In part 2 of this article, I explain how to use XMLReader.
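
As a small taste of the difference, here is a rough sketch of XMLReader moving through the library XML one node at a time instead of building a tree. It is not the code from part 2. The inline XML is a shortened stand-in for test.xml, and for a real file open('test.xml') would be used in place of XML().

```php
<?php
// Inline, shortened stand-in for test.xml.
$xml = <<<'XML'
<library>
 <book isbn="781"><name>SCJP 1.5</name></book>
 <book isbn="194"><name>jQuery is Awesome!</name></book>
</library>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$isbns = array();
// read() advances to the next node; only the current node is available.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'book') {
        $isbns[] = $reader->getAttribute('isbn');
    }
}
$reader->close();

print_r($isbns);
```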

Attachment	Size
test.xml	284 bytes
test.php.txt	867 bytes
jwplayer_rss.php.txt	906 bytes