XML to JSON in PHP


The story of an odyssey Posted by on March 03, 2011

Last Friday, on the occasion of the April Zend Framework Bug-Hunt, I started looking at bug ZF-3257. It is an issue in the Zend_Json class that shows up when converting certain XML documents to JSON, such as this one:

$xml = '<a><b id="foo"></b>bar</a>';

The result of Zend_Json::fromXml($xml, false), where false means that XML attributes are not ignored, was:

{"a":{"b":{"@attributes":{"id":"foo"}}}}

As you can see, the bar value of the a element is not represented in the JSON. The issue also shows up with other XML documents: in general, when an XML node has a single character data child, any attributes are lost.

For instance, the following code:

$xml = '<a><b id="foo">bar</b></a>';
echo Zend_Json::fromXml($xml, false);

Produced the output:

{"a":{"b":"bar"}}

In this case the id attribute and its value foo are lost.

After a first investigation, I discovered that the problem was not so straightforward. First of all, there is no common standard for translating an XML document into JSON. IBM recently proposed the JSONx standard, but it is still only a proposal (you can read it here).

Regarding the Zend_Json implementation, the bug "seems" to be related to the SimpleXML extension of PHP. I say "seems" because I will show that it is possible to implement a valid conversion from XML to JSON using SimpleXML, so for me it is quite ambiguous whether the problem lies in SimpleXML. Anyway, let's continue with the discussion.

If you try to use SimpleXML to load the following XML:

$xml = '<a><b id="foo"></b>bar</a>';
$simpleXml = simplexml_load_string($xml);
var_dump($simpleXml);

You will get the following var dump:


object(SimpleXMLElement)#1 (1) {
  ["b"]=>
  object(SimpleXMLElement)#2 (1) {
    ["@attributes"]=>
    array(1) {
      ["id"]=>
      string(3) "foo"
    }
  }
}

Where is the bar value? It seems that the SimpleXMLElement doesn't contain it, but if you cast the SimpleXMLElement object to a string using:

$bar = (string) $simpleXml;

or

$bar = strval($simpleXml);

you will get the value. That means the bar value is inside the SimpleXMLElement object. Coming back to the previous question: does bug ZF-3257 depend on a problem in SimpleXML? I would say yes and no. Yes, because the bar value is not represented correctly in the dump of the object; no, because the bar value is nevertheless stored in the object.
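To make the mismatch concrete, here is a minimal sketch (the variable names are mine, not taken from Zend_Json) that compares the property view of the object, which is what var_dump() prints, with the result of the string cast. As far as I can tell json_encode() works from the same property view, so I would expect it to drop the text too:

```php
<?php
// Mixed content: <a> holds a child element and a text node.
$xml = '<a><b id="foo"></b>bar</a>';
$simpleXml = simplexml_load_string($xml);

// The array cast exposes the same properties that var_dump() prints:
// the child element "b" is there, the text node is not.
$properties = (array) $simpleXml;

// The string cast returns the text directly inside <a>.
$text = (string) $simpleXml;

// json_encode() is expected to serialize the property view only.
$json = json_encode($simpleXml);

echo implode(',', array_keys($properties)), "\n";
echo $text, "\n";
echo $json, "\n";
```

The text is reachable, but only if the conversion code asks for it explicitly with a cast.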

Regarding this Hamlet-like question, I opened a bug on the PHP.net web site: Bug #54632.

I also discovered that other PHP implementations that use SimpleXML to convert XML to JSON are affected by this problem. For instance, the one proposed by IBM in the article Convert XML to JSON in PHP, written by S. Nathan, Edward J Pring, and John Morar, has the same problem.

I decided to fix bug ZF-3257 by proposing a new conversion algorithm from XML to JSON in PHP. My solution uses a special key (@text) to store the text value of an XML element, but only if the element also contains attributes or sub-elements (as in the previous examples). If you convert a simple XML element with only a text value, like <a>foo</a>, the JSON is {"a":"foo"}, which is quite intuitive.

Using this algorithm the translation of the following XML:


<a><b id="foo"></b>bar</a>

in JSON, will be:


{"a":{"b":{"@attributes":{"id":"foo"}},"@text":"bar"}}
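The rule can be tried outside Zend Framework with a small standalone sketch. The helpers xml_node_to_array() and xml_to_json() are hypothetical names of mine, and this is a simplified illustration of the idea rather than the Zend_Json code: it ignores recursion limits, Zend_Json_Expr handling, and the option to skip attributes.

```php
<?php
/**
 * Hypothetical helper (not part of Zend_Json): converts a SimpleXMLElement
 * into a nested PHP array, keeping attributes under "@attributes" and
 * the text of elements with attributes or children under "@text".
 */
function xml_node_to_array(SimpleXMLElement $node)
{
    $result = array();

    // Collect attributes, if any, under the "@attributes" key.
    $attributes = array();
    foreach ($node->attributes() as $attrName => $attrValue) {
        $attributes[$attrName] = (string) $attrValue;
    }
    if (count($attributes) > 0) {
        $result['@attributes'] = $attributes;
    }

    // Convert child elements; repeated names are folded into a list.
    $seen = array();
    foreach ($node->children() as $child) {
        $childName = $child->getName();
        $childValue = xml_node_to_array($child);
        if (!isset($seen[$childName])) {
            $result[$childName] = $childValue;
            $seen[$childName] = 1;
        } else {
            if ($seen[$childName] === 1) {
                // Second occurrence: wrap the first value in a list.
                $result[$childName] = array($result[$childName]);
            }
            $result[$childName][] = $childValue;
            $seen[$childName]++;
        }
    }

    // Text directly inside this element (not inside its children).
    $text = trim((string) $node);

    // An element with no attributes and no children collapses to its text.
    if (count($result) === 0) {
        return $text;
    }
    if ($text !== '') {
        $result['@text'] = $text;
    }
    return $result;
}

// Hypothetical wrapper: parses the XML and encodes the root as JSON.
function xml_to_json($xml)
{
    $root = simplexml_load_string($xml);
    return json_encode(array($root->getName() => xml_node_to_array($root)));
}

echo xml_to_json('<a>foo</a>'), "\n";
echo xml_to_json('<a><b id="foo"></b>bar</a>'), "\n";
```

The first call prints {"a":"foo"}, and the second prints the JSON with the @text key shown above.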

You can find the corrected implementation of the XML to JSON conversion in the Zend_Json class of Zend Framework starting from version 1.11.6, which should be released shortly. Below is the source code of the changes to the Zend_Json class for the conversion algorithm. In particular, I rewrote the _processXml static method and added the new _getXmlValue method, which handles Zend_Json_Expr in the text and attribute values of the XML.

/**
 * Return the value of an XML attribute text or the text between
 * the XML tags
 *
 * In order to allow Zend_Json_Expr from xml, we check if the node
 * matches the pattern that tries to detect if it is a new Zend_Json_Expr
 * if it matches, we return a new Zend_Json_Expr instead of a text node
 *
 * @param SimpleXMLElement $simpleXmlElementObject
 * @return Zend_Json_Expr|string
 */
protected static function _getXmlValue($simpleXmlElementObject)
{
    // Matches the literal text: new Zend_Json_Expr("...") or new Zend_Json_Expr('...')
    $pattern = '/^[\s]*new Zend_Json_Expr[\s]*\([\s]*[\"\']{1}(.*)[\"\']{1}[\s]*\)[\s]*$/';
    $matchings = array();
    // Cast explicitly: preg_match() works on the text content of the node
    $match = preg_match($pattern, (string) $simpleXmlElementObject, $matchings);
    if ($match) {
        return new Zend_Json_Expr($matchings[1]);
    } else {
        return (trim(strval($simpleXmlElementObject)));
    }
}

/**
 * _processXml - Contains the logic for xml2json
 *
 * The logic in this function is a recursive one.
 *
 * The main caller of this function (i.e. fromXml) needs to provide
 * only the first two parameters i.e. the SimpleXMLElement object and
 * the flag for ignoring or not ignoring XML attributes. The third parameter
 * will be used internally within this function during the recursive calls.
 *
 * This function converts the SimpleXMLElement object into a PHP array by
 * calling a recursive (protected static) function in this class. Once all
 * the XML elements are stored in the PHP array, it is returned to the caller.
 *
 * Throws a Zend_Json_Exception if the XML tree is deeper than the allowed limit.
 *
 * @param SimpleXMLElement $simpleXmlElementObject
 * @param boolean $ignoreXmlAttributes
 * @param integer $recursionDepth
 * @return array
 */
protected static function _processXml ($simpleXmlElementObject, $ignoreXmlAttributes, $recursionDepth=0)
{
    // Keep an eye on how deeply we are involved in recursion.
    if ($recursionDepth > self::$maxRecursionDepthAllowed) {
        // XML tree is too deep. Exit now by throwing an exception.
        require_once 'Zend/Json/Exception.php';
        throw new Zend_Json_Exception(
            "Function _processXml exceeded the allowed recursion depth of " .
            self::$maxRecursionDepthAllowed);
    } // End of if ($recursionDepth > self::$maxRecursionDepthAllowed)
    $childrens= $simpleXmlElementObject->children();
    $name= $simpleXmlElementObject->getName();
    $value= self::_getXmlValue($simpleXmlElementObject);
    $attributes= (array) $simpleXmlElementObject->attributes();
    if (count($childrens)==0) {
        if (!empty($attributes) && !$ignoreXmlAttributes) {
            foreach ($attributes['@attributes'] as $k => $v) {
                $attributes['@attributes'][$k]= self::_getXmlValue($v);
            }
            if (!empty($value)) {
                $attributes['@text']= $value;
            }
            return array($name => $attributes);
        } else {
            return array($name => $value);
        }
    } else {
        $childArray= array();
        foreach ($childrens as $child) {
            $childname= $child->getName();
            $element= self::_processXml($child,$ignoreXmlAttributes,$recursionDepth+1);
            if (array_key_exists($childname, $childArray)) {
                if (empty($subChild[$childname])) {
                    $childArray[$childname]=array($childArray[$childname]);
                    $subChild[$childname]=true;
                }
                $childArray[$childname][]= $element[$childname];
            } else {
                $childArray[$childname]= $element[$childname];
            }
        }
        if (!empty($attributes) && !$ignoreXmlAttributes) {
            foreach ($attributes['@attributes'] as $k => $v) {
                $attributes['@attributes'][$k]= self::_getXmlValue($v);
            }
            $childArray['@attributes']= $attributes['@attributes'];
        }
        if (!empty($value)) {
            $childArray['@text']= $value;
        }
        return array($name => $childArray);
    }
}
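
To see what the _getXmlValue pattern accepts, the regular expression can be tried on its own. This is only a sketch: extract_json_expr() is a hypothetical helper of mine that returns the captured expression as a plain string instead of building a Zend_Json_Expr object, so it runs without Zend Framework.

```php
<?php
// Same pattern as in _getXmlValue(): it captures the quoted argument of
// a literal "new Zend_Json_Expr(...)" text.
$pattern = '/^[\s]*new Zend_Json_Expr[\s]*\([\s]*[\"\']{1}(.*)[\"\']{1}[\s]*\)[\s]*$/';

// Hypothetical helper: returns the captured expression, or null when the
// text is ordinary content.
function extract_json_expr($pattern, $text)
{
    $matches = array();
    if (preg_match($pattern, $text, $matches)) {
        return $matches[1];
    }
    return null;
}

// A text node that asks for a raw JavaScript expression.
var_dump(extract_json_expr($pattern, 'new Zend_Json_Expr("function(){ return 1; }")'));

// Ordinary text does not match and is treated as a plain string value.
var_dump(extract_json_expr($pattern, 'bar'));
```

Both single and double quotes around the argument are accepted, and whitespace around the parentheses is ignored.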