Microsoft’s Jean Paoli on the XML document debate

I spoke to Microsoft’s Jean Paoli, General Manager for Interoperability and Office Open XML Architecture, on the hot topic of the Microsoft Office XML formats and their standardisation. Paoli was an editor on the W3C committee which specified XML 1.0 in 1998.

Why a hot topic? Briefly, in 1999 Sun purchased an office suite called Star Office and released the code as open source, creating a free office suite called Open Office. One of its likely goals was to undermine the dominance of Microsoft Office by providing a free alternative. In 2002 work began on standardizing an XML specification, an evolution of the Open Office document formats, as Open Document Format (ODF); in 2006 it became an ISO standard. The pairing of Open Office with an ISO standard document format is proving attractive, especially to government departments, and threatens Microsoft’s near-monopoly in office productivity software.

Perhaps in response to ODF, Microsoft set to work standardizing its own XML document format, this time based on the files used by Microsoft Office, and called Office Open XML (OOXML). It has achieved ECMA standardization, and is now proceeding towards ISO, a move ardently opposed by IBM and other ODF supporters.

Paoli talked about how Microsoft has been working with XML in Office for a long time, since Office 2000 in fact. It’s true, and I remember the launch of Office 2000 and the demonstration of round-tripping Office documents between HTML and the native Office formats, thanks to embedded XML tags. It turned out to be a feature more cursed than admired, because of the bloat it added to Word documents exported to HTML, but nevertheless he is right: Microsoft has been working seriously on XML in Office for many years, and it must frustrate the company to see the later ODF specifications sneaking ahead in the standards game. Let’s also note that the decision to have Office 2007 save by default in XML is a bold move, ensuring immediate wide usage of OOXML, but also risking the annoyance of existing customers, most of whom know and care little about document formats but just want seamless interchange. Sending a .docx to a Mac user, for example, may cause real difficulty for the recipient.

Different goals

“Open Document Format and Office Open XML have very different goals”, says Paoli, responding to the claim that the world needs only one standard XML format for office documents. “Both of them are formats for documents … both are good.”

What’s distinctive about the goals of OOXML? Primarily, to have full fidelity with pre-existing binary documents created in Microsoft Office. “What people want is to make sure that their billions of important documents can be saved in a format where they don’t lose any information. As a design goal, we said that those formats have to represent all the information that enables high-fidelity migration from the binary formats”, says Paoli. He mentions work with institutions including the British Library and the US Library of Congress, concerned to preserve the information in their electronic archive.

I asked Paoli if such users could get equally good fidelity by converting their documents to ODF. “Absolutely not,” he says. “I am very clear on that. Those two formats are done for different reasons.”

What can go wrong? Paoli gives as an example the myriad ways borders can be drawn round tables in Microsoft Office and all its legacy versions. “There are 100 ways to draw the lines around a table,” he says. “The Open XML format has them all, but ODF which has not been designed for backward compatibility, does not have them. It’s really the tip of the iceberg. So if someone translates a binary document with a table to ODF, you will lose the framing details. That is just a very small example.”

Another benefit Paoli claims for OOXML is performance. “A lot of things are designed differently because we believe it will work faster. The spreadsheet format has been designed for very big spreadsheets because we know our users, especially in the finance industry, use very large spreadsheets. They use spreadsheets like databases. It’s not that one is better than the other, it’s that they have been designed for different things.”

I asked Paoli what would be the consequences if in fact OOXML does not become an ISO standard. He will not answer the question directly, but is defensive. “This is a long process. We will continue discussing what we should do better. It’s not like a yes or no. But what’s important is that it is already an ECMA standard. Some governments told us they would prefer it were an ISO standard. So we know that, and respect that.

“We have been in discussion with the IDABC (Interoperable Delivery of European eGovernment Services to Public Administrations, Businesses and Citizens). In 2004, the IDABC said ‘Microsoft should consider the merits of submitting the XML format to an international standards body of their choice.’ We responded to the IDABC specific ask. It’s surprising to see some people now saying this is a bad thing, when the EU asked us to standardise the format.”

See my further comments about IDABC below. It is clear though that Paoli is upset by what he sees as an international campaign against OOXML orchestrated by IBM, the sole naysayer in the ECMA voting. “There are IBM employees going to ISO, and saying a lot of technically incorrect things. When ODF went to ISO Microsoft did not interfere. IBM is betting on ODF, to have governments preferentially buying IBM software. It is OK to compete, but using this kind of argument around is it an open format or not … it’s widely known now, Office Open XML is an open format, even the EU says it is.”

I put it to Paoli that OOXML is hard to implement because of all its legacy support, some of which is currently not well documented. “I don’t believe that at all. It’s actually the opposite,” he says. He makes the point that third parties like Corel, which have previously implemented support for binary formats like .doc and .xls, should find it easy to transition to OOXML. “We believe Open XML adoption by vendors like Corel will be very easy because they have already been doing 90% of the work, doing the binary formats. The features are already there.”

I have been critical of the Microsoft-sponsored open source converter between OOXML and ODF, and its integration with Microsoft Office. Wouldn’t it have been better if Microsoft’s own Office team had worked on this, and come up with something of higher quality?

“It is a version 1, honestly,” he says, adding “I am sure it is not perfect. On performance, we were surprised by the delay that you got. In terms of the fidelity of the translation, that’s why we put it in the open. I am sure this is going to evolve. There are going to be things that will not be able to be translated because the formats are different.”

I mistrust Microsoft’s motives here. Paoli points to the conversion errors as evidence of how poorly ODF can represent legacy Office documents. My hunch is that this has more to do with the poor quality of the converter. Nor is its open source status any excuse. This component, or an alternative converter, is critical to the future of Microsoft Office if, as expected, significant numbers of institutions standardise on ODF. Without a good converter, mandating ODF is in effect mandating non-use of Microsoft Office.

Finally, I asked Paoli whether there will ever be a reference implementation of OOXML other than Microsoft Office. “Absolutely,” he says. “It was announced by both Corel and Sun. They are going to fully implement Office Open XML. Novell also integrated the translator into Open Office. Sun developers also posted on their blog that they are implementing Office Open XML.”

This is a stretch, to say the least. It’s true that Sun says here that there will be support for OOXML in Star Suite (the commercial version of Open Office):

Q: Will StarSuite be compatible with the new ‘Microsoft Office Open XML Formats’ – the new file format for the next release of Microsoft Office?

A: Yes, StarSuite will be compatible with the new file format. Microsoft had not published a specification at the launch time of StarSuite 8. The next release of StarSuite will be able to load and save those files.

Having said that, even if it delivers some sort of import filter, the idea that Sun is preparing a reference implementation of OOXML is laughable. It’s also true that Corel has announced its support for both OOXML and ODF:

Corel’s pragmatic approach to emergent XML file formats provides customers with maximum flexibility, lowers costs and reduces risk by insulating customers against committing to a standard that may not become adopted.

But once again this is not a reference implementation, merely a promise of compatibility, with who knows how long a list of errors and omissions.

I am bewildered by Paoli’s response to my question. Surely he understands the difference between a reference implementation and an import/export filter? Here’s Wikipedia, quoting from NIST (National Institute of Standards and Technology):

A reference implementation is, in general, an implementation of a specification to be used as a definitive interpretation for that specification. During the development of the … conformance test suite, at least one relatively trusted implementation of each interface is necessary to (1) discover errors or ambiguities in the specification, and (2) validate the correct functioning of the test suite.

Some closing thoughts

On the face of it standardising Office Open XML is a benefit. Even if there is no full implementation other than Microsoft Office, it helps developers working with the formats, by making them less of a moving target. So what reason is there to oppose standardisation?

Well, if a customer is offered two office suites, and one can tick the ISO box whereas the other cannot, that could well swing the deal, especially in government and academic markets. It follows that opposing standardisation is a good way to damage Microsoft in one of its core markets. However, such motivations are not meant to drive standards bodies.

Does it matter if OOXML is not standardised with ISO? For those with a commercial interest, of course it does. For users, it could matter if they are forced to switch from Microsoft Office to Open Office solely because ODF is the ISO standard, and suffer loss of productivity or failures in working with existing documents.

In saying this, I am presuming that Microsoft Office is a poor choice if you want to work with ODF – correct, I think. Switching from one office suite to another can be costly, not only because of training, but also because of the large number of templates, macros and applications which rely on Microsoft Office.

Of course there may also be good reasons for migrating from Microsoft Office to Open Office. One is that Open Office is free, open source and cross-platform, while Microsoft Office is not. It is easy to build a case for Open Office without needing to play the ISO card.

In practice, I doubt that ISO standardisation of OOXML would much hold back ODF adoption. Even Microsoft’s own arguments may count against OOXML. Microsoft seems to be saying that Office Open XML is designed primarily to be able to translate to and from Microsoft Office binary formats without loss of fidelity. That is its foremost argument for wanting OOXML standardized alongside ODF, since ODF does not have this goal. This legacy support is costly, because it bloats the specification. Should we then conclude that ODF is the best specification for new documents, while OOXML is mainly suitable for archiving legacy documents? That seems logical; yet I am sure Microsoft would resist such a conclusion. If ISO standardization is achieved, Microsoft is not going to go to its customers and say, “Use OOXML for legacy documents, and ODF for new documents”. No, it is going to say, “Use ISO Standard OOXML for all your documents.” I suggest there is doublethink here.

Finally, those on both sides of this debate could do better in presenting their case. On the IBM/ODF side there is open hostility; while Microsoft does too little to engage with the community and to have the technical debate it claims to welcome (I make an honourable exception for Brian Jones, who is a model advocate on his blog). Silly comments about reference implementations and the poor quality of the Microsoft-sponsored OOXML/ODF converter do not help.

Postscript on IDABC recommendations

The documents from the IDABC referenced by Paoli are here. The recent PEGSCO (Pan-European eGovernment Services) Committee report is an interesting read. As Paoli notes, it welcomes the standardisation of both OOXML and ODF, and adds:

Both the ODF and the OpenXML document format specifications are XML based, promising great opportunities to explore the information contained in documents via tools other than traditional office suites. Examples of such exploration include indexing of document collections, automatic extraction of metadata from documents, search-engines, extraction of specific information for re-use, etc.
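As a rough illustration of the kind of exploration the report has in mind, here is a minimal Python sketch that pulls metadata out of an OOXML package without any office suite. It relies on the package being a ZIP archive of XML parts, and it assumes the core properties sit at docProps/core.xml, which is where Microsoft Office conventionally writes them (strictly, the location should be looked up through the package relationships). The function names and the sample data are hypothetical, for illustration only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part of an OOXML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def extract_core_metadata(package):
    """Return title and creator from an OOXML package (path or file-like object).

    Assumes the conventional part name docProps/core.xml.
    """
    with zipfile.ZipFile(package) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", default="", namespaces=NS),
        "creator": root.findtext("dc:creator", default="", namespaces=NS),
    }


def build_sample_package():
    """Build a tiny in-memory stand-in for a .docx (hypothetical sample data)."""
    core = (
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:title>Quarterly report</dc:title>"
        "<dc:creator>A. Author</dc:creator>"
        "</cp:coreProperties>"
    ) % (NS["cp"], NS["dc"])
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", core)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(extract_core_metadata(build_sample_package()))
```

Passing the path of a real .docx to extract_core_metadata should work the same way, provided the file keeps its core properties in the conventional place. An ODF file is also a ZIP archive, with its metadata in a meta.xml part, so a similar approach would apply there with different part names and namespaces.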

That said, more is said about adopting ODF than OOXML, and the report is worried about the existence of two standards:

Member State experts have identified the perceived compatibility problems between ISO 26300 (ODF) based products and the commercial applications that dominate the offices of today’s administrations as the main barrier for the use of open document exchange and storage formats. The potential arrival of a second international standard for revisable documents may mean that administrations will be required to support multiple formats leading to more complexity and increased costs. Although filters, translators and plug-ins may theoretically enable interoperability, experience shows that multiple transformations of formats may lead to problems, especially as there is no complete mapping between all features of each of the different standards. Technical experts that are familiar with both standards also indicate that there remain, for each of the two standards, a number of technical problems to be solved.

On to the key section, recommendations:

Industry, industry consortia and international standardisation bodies are invited:

6.6. To work together towards one international open document standard, acceptable to all, for revisable and non-revisable documents respectively.

So standards are good; one standard is better. Comfort for both sides here.

11 thoughts on “Microsoft’s Jean Paoli on the XML document debate”

  1. Forgive me for saying (probably rather glibly) that it sounds like the Betamax/VHS format wars all over again. What decided the outcome of that particular battle was the relative availability of titles – information in other words – in either format. I wonder if the deciding factor is going to be similar: how much information exists in a compatible format to the ultimately successful one?

  2. You report that “Paoli talked about how Microsoft has been working with XML in office for a long time, since Office 2000 in fact. It’s true…”

    Yes indeed, but you could argue that IBM has been doing it even longer, with SGML. SGML isn’t XML, of course, but the issues should be fairly similar. If Microsoft is suffering from a little “not invented here” syndrome, then so may IBM…

  3. “Microsoft seems be saying that Office Open XML is designed primarily to be able to translate to and from Microsoft Office binary formats without loss of fidelity.”

    Except that it is not true at all.

    I’ll give you two counter-examples, but you can find a ton more.

    1) VBA macros. MS OOXML does not define them, therefore they are a) not expressed using XML b) not made interoperable. In practice, it means you need Office 2007 to use your existing VBA macros. Office 2007 is not just a UI on top of OOXML. Office 2007 is a UI on top of 15 years of legacy and proprietary protocols and formats. Out of which OOXML is an emanation.

    2) Charts. There is a new chart engine in Office 2007. But it does not respect existing charts created using an older version of Office. There are MANY bugs, it’s just incredible. Oh, and of course, they don’t look the same anyway visually, so any “full fidelity” requirement à la SOX puts OOXML out of the game.

    The level of hypocrisy in MS blogs is astounding. You seem to be following them pretty well too. I am surprised by the dissonance between what they say in their blogs, and the huge lobbying in private.

    I’ll add a couple more things.

    One of the new scenarios enabled by the “custom xml parts” (again, if you read their blogs, you must have heard of this stuff) is the ability to bind xml sources and a control+layout so that it enables the equivalent of data queries (which we’ve had in Excel for many years already), just with a source which is part of the package, contrary to the typical external data source connection. Well this stuff, besides the declaration (which includes, big surprise, GUIDs and stuff like that) requires the actual Office 2007 run-time to work. So whenever MS says this stuff is interoperable, they cannot mean you can take this stuff away in another application. Because you can’t. This binding is more or less the same as the embedding of VBA macros. It’s all application-specific, and only Microsoft’s own suite knows how to instantiate this stuff.

    The same goes for Word’s document chunk merge feature. A number of people over at openxmldeveloper.org (MS owned, with paid MS consultants answering questions) are willing to do this with just a ZIP and XML stack, failing to understand that the merging feature requires an actual instance of Word to do it given the complexity/remapping of all the underlying objects.

    Anyway, this comment is way too long already. I guess you get an idea where I am at…

  4. Tim, a very nice post, but… I wished you would have pressed more follow-up questions. Will we see an interview with the other side of this issue, perhaps with Sam Wier or Bob Sutor? That would offer balance. Just let them respond here, on your blog, if you will. Anyway, here are some of the statements that jumped out at me:

    One of its “likely goals was to undermine” the dominance of Microsoft Office by providing a free alternative.
    Really? Can you source this belief?

    …it must frustrate the company to see the later ODF specifications “sneaking ahead” in the standards game.
    ODF went through a laborious, non fast-tracked ISO certification. Microsoft refused to participate in ODF’s development even though they were invited. How is that “sneaking ahead?”

    The Open XML format has them all, but ODF which has not been designed for backward compatibility, does not have them. It’s really the tip of the iceberg. So if someone translates a binary document with a table to ODF, you will lose the framing details. That is just a very small example.
    Has Microsoft published the .doc spec publicly? Then why should ODF worry about the past? It’s not ODF’s concern to worry about Microsoft’s past formats. (Understand that the .doc format alone changed six times in the last eight versions of Office!) That’s Microsoft’s legacy problem, not ODF’s.

    …third parties like Corel, which have previously implemented support for binary formats like .doc and .xls, should find it easy to transition to OOXML.
    They have? Where are their file converters? What’s taking so long? If it was so easy, why was Novell’s first effort so incredibly poor? (You’ve got to ask the follow-up question man!)

    On the IBM/ODF side there is open hostility.
    Yes, I would say I’m openly hostile to lies, in every part of life, work, and business. Aren’t we all?

    Silly comments about reference implementations and the poor quality of the Microsoft-sponsored OOXML/ODF converter do not help.
    By “silly” do you mean critically honest? That statement is confusing.

  5. Thanks for your comments Zaine.

    I wished you would have pressed more follow-up questions.

    Actually this was a long interview, about twice as long as I normally do. Of course I have not quoted everything that was said. I would like to have asked more questions, but Paoli tends to give lengthy replies and we simply ran out of time.

    Will we see an interview with the other side of this issue, perhaps with Sam Wier or Bob Sutor? That would offer balance. Just let them respond here, on your blog, if you will.

    Sure, they can comment, though I think you have conflated a couple of names there 🙂 And possibly an interview, if it can be worked out.

    — One of its “likely goals was to undermine” the dominance of Microsoft Office by providing a free alternative.
    Really? Can you source this belief?

    I’d refer you for example to many comments from Scott McNealy over the years. Here’s a quick link for example.

    ODF went through a laborious, non fast-tracked ISO certification. Microsoft refused to participate in ODF’s development even though they were invited. How is that “sneaking ahead?”

    Just that Microsoft was an early XML adopter, both in general and in Office.

    — Silly comments about reference implementations and the poor quality of the Microsoft-sponsored OOXML/ODF converter do not help.
    By “silly” do you mean critically honest? That statement is confusing.

    I mean only that it is silly to say that Corel or Sun is producing a reference implementation of Office Open XML. I don’t think that claim helps the case.

    Tim

  6. FWIW, I think what Joel Spolsky said in his blog about product specifications (Monday, October 2, 2000) et alii, is probably the most relevant comment anyone could make on the current document file format standards issue/brouhaha.

    To be brief, that if you don’t get to work on your specifications, and just get to work, you’ll come a cropper.

    And you’ll probably see why I for one distrust ECMA 376. It is intended to represent at least ten years of hidden (and/or reverse-engineered) file formats, most of which turn out to have niggling incompatibilities with the ones immediately before them or after them, and wild differences if you get to three or more major versions between them.

    So it’s just going to incorporate those incompatibilities? That sounds like bad engineering. It’s going to harmonize them? That sounds better, but how? It’s going to ignore them? How, while retaining this widespread compatibility?

    I don’t think Paoli makes this clear.

    And I think the example given of tables with borders in DOC format and their non-transfer to ODF is somewhat ingenuous. Isn’t that supposed to be a matter for the application plus file filter to handle? As long as ODF can put a border around a table, as long as the application supporting ODF can give a set of styles for borders, then it’s not a major issue – borders around tables do not as a general rule, represent significant information. If I insist on a particular style and only that one style, and will not be satisfied with anything else, no matter how close it gets, then I should see a psychiatrist, not a programmer.

    I’m sure Joel Spolsky could enlarge on that, if Paoli ever asked him.

  7. Too bad you didn’t ask Mr. Paoli about compliance with the other half of XML – the Extensible Stylesheet Language (XSL) compliance and its vital component XSL Transformations (XSLT) compliance. Most users of XML intend to use XSLT to transform XML data from one format to another. OOXML apparently isn’t compliant, otherwise XSLT would be used to transform OOXML to ODF and other formats. Until this is fixed, OOXML is DOA, it might as well be a “.doc” or RTF format. There’s nothing really beneficial for the customers in OOXML.

  8. James

    Too bad you didn’t ask Mr. Paoli about compliance with the other half of XML – the Extensible Stylesheet Language (XSL) compliance and its vital component XSL Transformations (XSLT) compliance.

    Have you got any more detail about this? It’s odd because apparently the Microsoft-sponsored ODF/OOXML converter uses XSL.

    I actually said that it should NOT use XSL because it tends to be slow and memory-hungry; not good if you are working with large documents.

    Tim

  9. I’ve heard it mentioned several times now that Microsoft’s OOXML format isn’t XSL-compliant. Perhaps I’m being obtuse here, but as far as I was led to believe, any XML can be handled by an XSL style sheet. I could be wrong, of course, in which case I would be very interested to know how XML documents can be structured in order to break an XSL stylesheet, either a specific stylesheet or in general.

  10. Tim,

    A small typo perhaps?
    On line 5 (6) you wrote:

    “in 2006 it because an ISO standard.”

    Did you mean “became”?

    Cheers
    Malcolm
