Category Archives: xml

Microsoft Open XML embarrassment: spaces go missing between words

Microsoft’s controversial Office Open XML format, now officially called just Open XML*, has an embarrassing bug in its Office 2010 and/or Office 2007 implementation, as reported by Dennis O’Reilly on CNET.

In a nutshell: if you save a document from Word 2010 using the default .docx format, and send it to a Word 2007 user who has a different default printer driver, then a few seemingly random spaces may get dropped from between words or sentences when it is opened on the other machine. If the document is then saved in Word 2007, the spaces remain missing when it is re-opened in Word 2010.

The consequences for one user were severe:

I had this same problem the other day, when I finished writing an in-class essay on my laptop (Win7 64-bit, Office 2010 32-bit), transferred it to a classroom computer (WinXP, Office 2007), and printed the document. I was out of time, so I had to turn in the paper without reading over the printed copy. I had triple-checked the essay on my laptop, so it had no spelling or formatting errors, right?

I got my essay back, and I had 20% of my grade taken away due to frequent spacing errors between words. Shocked, I double-checked my original copy of the document, and there were no spacing errors. Even more perplexing, I opened the file on a classroom computer, and, sure enough, I found many spacing errors between words and sentences.

Now, as I understand it, a large part of the point of Open XML is to preserve fidelity in archived documents, so I consider this a significant bug.

I’ll speculate a bit on why this problem occurs. It is a bug; but it also reflects the fact that Word is a word processor, not a professional text layout tool. Word processor documents may change formatting slightly according to the printer driver installed; and I’d guess that the missing spaces occur when the line breaks are altered by a different printer driver.

This is why a workaround is for both users to set Adobe PDF as the default printer driver, making them consistent. Another workaround is to revert to the old binary .doc format.

It is still quite wrong for spaces to disappear in this manner, though the bug could be in Word 2007 rather than in Word 2010.

I also notice that nobody from Microsoft has officially commented on the problem. Disclosure is important.

Update: Microsoft has now commented and says:

This is an issue related to how Word 2007 opened files. In other words, the issue is not with Word 2010, it was a defect in the file / open code of Word 2007 that caused the problem. Reports that Open XML caused this issue are not accurate. We discovered and fixed the issue in Word 2007 as part of a release that first appeared on September 25, 2008, well before shipping Office 2010.

The suggested remedy is to apply Office 2007 Service Pack 2.

If you have already applied this and still get the problem, please inform Microsoft – and I would be interested too.

*Note: Although Microsoft sites like this one say Open XML, I’m told that the official name is still Office Open XML or possibly something like ISO/IEC 29500:2008 Office Open XML File Formats.

Office Web Apps better than Open Office for .docx on Linux

I’ve been reviewing Office and SharePoint 2010, and trying out Ubuntu Lucid Lynx, so I thought I would put the two together with a small experiment.

I borrowed a document from Microsoft’s press materials for Office 2010. Perhaps surprisingly, they are in .doc format, not the Open XML .docx that was introduced in Office 2007. That didn’t suit my purposes, so I converted it to .docx using Save As in Office 2010.


Then I stuck it on SharePoint 2010.

Next, I downloaded it to Ubuntu and opened it in Open Office. It was not a complete disaster, but the formatting was badly messed up.

Finally, still in Ubuntu, I navigated to SharePoint and viewed the same document there. It looked fine.

Even better, I was able to click Edit in Browser, make changes, and save. The appearance is not quite WYSIWYG in edit mode, but is the same as in IE on Windows.

The exercise illustrates two points. One is that Open Office is not a good choice for working with Open XML – incidentally, the document looked fine when opened in the old binary .doc format. The other is that SharePoint 2010 and Office Web Apps will have real value on mixed networks suffering from document compatibility issues with Office and its newer formats.

Office 2010 offers choice of Open Document or Microsoft XML formats

I was surprised to see the following dialog after an in-place upgrade of Office 2007 to Office 2010:


Admittedly there is a strong steer towards the Microsoft formats which, we are told, are “designed to support all the features of Microsoft Office”.

On the other hand, this was an in-place upgrade and default save options were already present in Office 2007. Given that most in-place upgrades preserve settings – which is part of the point of an in-place upgrade – you would expect it just to keep the old defaults.

I’m guessing therefore that this is aimed at appeasing/convincing regulators and governments that Microsoft Office plays nice with standards.

That said, there is little reason to choose the ODF format unless it is required. It will cause problems with formatting and content, and is especially risky with Excel spreadsheets.

If you want to use ODF, save money and get more complete support by using OpenOffice.

Update: Neowin has some background here.

Dancing on a pin: Microsoft belatedly answers Open XML critics

Microsoft’s Doug Mahugh has replied to accusations from ISO expert Alex Brown that the company is doing little to implement its own Open XML standard. The issue is that the XML document formats in Office 2007 are, from the ISO perspective, meant to be “Transitional” – a compromised format designed to interoperate with existing binary documents – and that the standard Microsoft is meant to be implementing is “Strict”, an improved standard that can more easily be implemented by others.

Mahugh says:

I’d like to state clearly and unequivocally at this time that we will support reading and writing of ISO/IEC 29500 Strict no later than the next major release of Office, code-named Office “15.”

He doesn’t say whether or not “Strict” will be the default in Office 15, which we can expect to see in around 2013. This is the real pain-point for users: if the default changes, the result is the frustration of sending or receiving unreadable documents.

Microsoft is dancing on a pin. On the one hand, it wants to convince governments, academics and other standards-sensitive organisations that Microsoft Office does the right thing. On the other hand, the benefit to users of breaking document compatibility for the sake of ISO compliance is rather invisible.

Document compatibility is the thinking behind having read-only support for Strict in Office 2010 (and coming to Office 2007). If Microsoft can get read-only support widely deployed, then in 2013 the Strict documents that start to circulate will not be so problematic.
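As a rough illustration of what telling the two variants apart can involve, the sketch below looks only at the XML namespace of the root element of a word-processing part. It assumes the commonly documented WordprocessingML namespaces for Transitional and Strict; the function name and the sample strings are invented for this example, and a namespace check is only a hint, not a conformance test.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as commonly documented for the two variants (assumption:
# only the main WordprocessingML namespace is inspected here).
TRANSITIONAL_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
STRICT_NS = "http://purl.oclc.org/ooxml/wordprocessingml/main"


def classify_main_part(document_xml):
    """Guess the variant from the root element's namespace (heuristic only)."""
    root = ET.fromstring(document_xml)
    # ElementTree writes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    if namespace == STRICT_NS:
        return "strict"
    if namespace == TRANSITIONAL_NS:
        return "transitional"
    return "unknown"


# Hypothetical, heavily trimmed sample parts for demonstration.
SAMPLE_TRANSITIONAL = '<w:document xmlns:w="%s"><w:body/></w:document>' % TRANSITIONAL_NS
SAMPLE_STRICT = '<w:document xmlns:w="%s"><w:body/></w:document>' % STRICT_NS

print(classify_main_part(SAMPLE_TRANSITIONAL))
print(classify_main_part(SAMPLE_STRICT))
```

A reader that recognises only the first namespace would be unable to open the second document, which is why getting read support deployed ahead of write support matters.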

The approach is not completely unreasonable; these things take time. That said, Microsoft’s communication of its intentions has been poor. Further, Mahugh does not answer the parts of Alex Brown’s post that address quality:

It is also a worrying commentary on the standards-savvyness of the Office developers that the first amateur attempts of part-time outsiders find problems with documents which Redmond’s internal QA processes have missed. I confidently predict that fuller validation of Office document is likely to reveal many problems both with those documents, and with the Standard itself, over the coming years.

My perspective on this as a journalist is that Microsoft has not considered Open XML or standards compliance worth even a mention, either in its publicity so far or in its detailed reviewers’ guide for Office 2010. That suggests it is not much of a priority.

So full support in 2013 or thereabouts. My expectation is that by then saving and editing documents online will be more common than it is today, and that the assumptions the Office team seems to make about the steady progress of its huge desktop suite are likely to prove faulty.

Microsoft accused of failure to observe Open XML standards process

XML specialist Alex Brown, who was involved in the ISO standardisation of Microsoft’s Open XML – still perhaps best known as OOXML – says Microsoft has failed to honour the commitments it made when the standard was approved. In particular, it seems little progress has been made between Office 2007 and Office 2010. The key problem is that Microsoft implemented Open XML before it was standardised. There were numerous changes made during the standardisation process, but what to do about the existing implementation? Loosely, the existing unacceptable format was given a “Transitional” status, while the more satisfactory, corrected format was called “Strict”. Microsoft promised to implement the “Strict” variant as soon as it could. Brown adds:

I was convinced at the time, and remain convinced today, that the division of OOXML into Strict and Transitional variants was the innovation which allowed the Standard to pass. Enough National Bodies could then vote in good conscience for OOXML knowing that their preferred, Strict, variant would be under their control into the future while the Transitional variant (which – remember – they had effectively rejected in 2007) would remain purely for the purpose of accurately specifying old documents: a useful aim in itself.

It is now two years since Open XML was approved, and Microsoft is on the brink of releasing a new version of Office. So does Office 2010 implement Open XML Strict? Apparently not – it’s the Transitional version. That is bad enough; worse still, according to Brown, it does not even conform correctly to that:

It is also a worrying commentary on the standards-savvyness of the Office developers that the first amateur attempts of part-time outsiders find problems with documents which Redmond’s internal QA processes have missed. I confidently predict that fuller validation of Office document is likely to reveal many problems both with those documents, and with the Standard itself, over the coming years.

Note that Brown is basing his remarks on the preview of Office 2010; we have not seen the final release yet. I can believe that Microsoft may fix some issues, but it looks vanishingly unlikely that Office 2010 will implement the “Strict” standard which ISO approved.

Brown’s remarks shed light on something I noticed when reviewing the preview:

As for Open XML, it’s notable that Microsoft neglects to mention it at all in its Reviewer’s Guide, even though this is supposedly the release that will fully implement ISO/IEC 29500. It is odd how this has gone from a cause to campaign for, to not-worth-mentioning in just over a year. To be fair, few users ever cared about XML formats themselves: it is only when documents get scrambled or fail to open that such things become important.

No wonder Microsoft said nothing about it, if in reality it has lost interest in conformance.

I think it is a good thing for Microsoft to standardise its Office formats. Selfish manipulation of standards committees on the other hand is not acceptable. One thing is for sure: if Brown is right and

without a change of direction, the entire OOXML project is now surely heading for failure.

then the company will only have itself to blame. Its nightmare will re-emerge: entire governments mandating OpenOffice for the sake of standards conformance.

That said, and despite the hype, I regard Office 2010 as a minor release. 64-bit Excel, a few tweaks, and a first foray into browser-hosted versions. Microsoft often displays this pattern, following up a release with major changes – Office 2007, for example – with one that is really just a refinement of what went before. It is not impossible that somewhere in the corridors of Redmond a team is working on a new Office that does a much better job with the Open XML standard.

Over to Microsoft – serious about Open XML? Or just doing the minimum necessary to protect a lucrative market dominance – maybe a bit less than the minimum?

Update: Microsoft’s Doug Mahugh has replied to Brown’s comments here. I am writing separately about this.

More patent nonsense: Microsoft loses in Office custom XML appeal

Microsoft has lost its appeal in a case where a small company called i4i claims that Office 2003 and 2007 infringe its patent on embedding custom XML within a Word document. This is not the XML that defines the content and layout of the document. It is XML contained within the document that Word itself does not understand, because it conforms to a custom schema, and which will not be displayed unless you write code to parse it and output some sort of result to the document.

Microsoft now says:

With respect to Microsoft Word 2007 and Microsoft Office 2007, we have been preparing for this possibility since the District Court issued its injunction in August 2009 and have put the wheels in motion to remove this little-used feature from these products. Therefore, we expect to have copies of Microsoft Word 2007 and Office 2007, with this feature removed, available for U.S. sale and distribution by the injunction date.  In addition, the beta versions of Microsoft Word 2010 and Microsoft Office 2010, which are available now for downloading, do not contain the technology covered by the injunction.

The key phrase here is “little-used feature”. It is true, in that the vast majority of Word documents do not use it; the only users affected will be those who have built custom solutions which use it in some kind of workflow or for data analysis.

Why did Microsoft lose? Here I have to admit my lack of legal knowledge; though I’m aware that Microsoft’s track record in court is not good. One interesting aspect of the case reported here is that Microsoft was proven, by an email from January 22 2003, to have been aware of the patent and products from i4i:

we saw [i4i’s products] some time ago and met its creators. Word 11 will make it obsolete

says the internal email; Word 11 is another name for Word 2003.

That said, intuitively both the patent and the decision seem odd to me, in that XML is specifically designed to allow data with a custom schema to be embedded within a document defined by another schema. But does the i4i patent cover every XML document out there that does this – such as, for example, XHTML documents that include microformats? The answer, as I understand it, is no, because the patent is about how the custom XML is stored, not that it exists. Here’s a quote from the patent itself:

The present invention is based on the practice of separating encoding conventions from the content of a document. The invention does not use embedded metacoding to differentiate the content of the document, but rather the metacodes of the document are separated from the content and held in distinct storage in a structure called a metacode map, whereas document content is held in a mapped content area … delivering a complete document would entail delivering both the content and a metacode map which describes it.

In other words, the custom XML is not stored directly within the containing document, but in a separate file, together with an instruction that says “please insert me at location x”.

Is that really any different? Intuitively, I doubt it. What we think of as single files are often in reality a number of sections bundled together, such as a header part and a content part. Further, what we think of as a single file may be stored in several locations, with metadata that defines how to get from one part to the next.
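To make the “please insert me at location x” idea concrete, here is a toy sketch in Python. It is only an illustration of keeping content and markup apart and recombining them by position; the names `CONTENT`, `METACODE_MAP` and `merge` are invented for this example, and nothing here is taken from the patent’s claims or from Word’s implementation.

```python
# Raw content with no markup in it at all.
CONTENT = "Invoice 42 due Friday"

# Hypothetical "metacode map": each entry says which code to insert
# at which character offset of the content.
METACODE_MAP = [
    (0, "<invoice>"),
    (8, "<number>"),
    (10, "</number>"),
    (21, "</invoice>"),
]


def merge(content, metacode_map):
    """Rebuild a marked-up document from separate content and map."""
    pieces = []
    cursor = 0
    # sorted() is stable, so codes sharing an offset keep their listed order.
    for position, metacode in sorted(metacode_map, key=lambda entry: entry[0]):
        pieces.append(content[cursor:position])
        pieces.append(metacode)
        cursor = position
    pieces.append(content[cursor:])
    return "".join(pieces)


MERGED = merge(CONTENT, METACODE_MAP)
print(MERGED)
```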

An Office 2007 document such as .docx is in reality a ZIP archive which contains several separate files, organised according to the Open Packaging Convention; if the i4i patent has wider implications, it strikes me that they would be for the OPC rather than for XML itself.
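The “ZIP archive of separate files” point is easy to see with Python’s standard `zipfile` module. The sketch below builds a tiny stand-in package in memory, using part names of the kind a .docx typically contains, and lists them. The XML bodies are placeholders rather than valid Open XML, and the `customXml/item1.xml` part is included purely as an assumed example of where custom data might sit.

```python
import io
import zipfile


def build_toy_package():
    """Create a small in-memory ZIP laid out like a .docx (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("customXml/item1.xml", "<invoice/>")
    return buffer.getvalue()


def list_parts(package_bytes):
    """Return the names of the files bundled inside the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        return archive.namelist()


PARTS = list_parts(build_toy_package())
for name in PARTS:
    print(name)
```

To try it on a real document, open the file in binary mode and pass its bytes to `list_parts`.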

I don’t claim any expertise in whether or not i4i has a valid claim against Microsoft or others. I do have an opinion though, which is that this kind of patent litigation does not benefit either the industry or the general public. This particular case concerns me because the patent strikes me as generic, and one that could be applied elsewhere, which means more effort expended on working around legal issues rather than on improving the software we use; and because even if the feature in Word is “little used”, the concept is an important one that still has great potential – though now probably not in Microsoft Office.


Docx on a Mac: still rough without Microsoft Word

I’ve been living on a Mac recently, while thoroughly investigating the new Snow Leopard. One of the questions that interests me: how difficult is it to use a Mac in a Windows-centric environment? One facet of this is Microsoft’s latest document formats, introduced with Office 2007: docx, xlsx and pptx. What if you get sent one of these, and don’t have Mac Office 2008 installed?

I downloaded a document on Azure blob storage from Microsoft – a random example. I opened it in four different applications: Apple’s TextEdit, which comes with docx support built-in; Microsoft Word 2008; Pages from Apple’s iWork 09, and NeoOffice, the Mac-specific port of OpenOffice. In the image below, Word is on the left, TextEdit on the right, and NeoOffice in the foreground.

Word 2008 opened it perfectly, as far as I could tell.

TextEdit crashed on the first attempt. On the second attempt it loaded, preserving the text but losing most of the formatting. Not a bad result, considering the scope of the application.

Pages was the best of the three non-Microsoft applications. It gave me a warning about paragraph borders being lost, but did not mention that the diagrams were messed up (Pages is on the right):

Image corruption in Pages with docx

NeoOffice made a fair stab at the formatting, but included some extraneous characters (you can spot these at top left in the screen grab) and omitted the pictures completely.

As a final test, I used Word’s Save As feature to convert the document to plain old .doc. This opened fine in Pages and in NeoOffice, though I have to say TextEdit gave a mixed result: the formatting was better, but the hyperlinked table of contents came out worse in .doc than in .docx.

Conclusion: don’t send .docx to Mac users unless you are sure that they have the latest Microsoft Word.

OpenDocument comes to Microsoft Word and Excel

After the intense interest in OOXML vs ODF during last year’s ISO document standardisation wars, I’m surprised that the inclusion of OpenDocument support in the newly-released Office 2007 SP2 has attracted so little attention. Well, not really surprised. The general public doesn’t care much about document formats as such, just that the documents they send and receive open OK. The anti-OOXML fervour was about exploiting a chink in the armour of Microsoft’s de facto near-monopoly in Office suites.

Well, Microsoft has ticked the box now. I haven’t done exhaustive tests; but I did some sanity checks. I opened a .docx (OOXML) in Word, saved it as OpenDocument Text; opened it in OpenOffice, saved it out to a new .odt document, opened that in Word, saved it out as docx. And you know what? It looks the same. Even the styles are still there. What’s more, the conversion was fast and convenient, just a Save As. All in all, a contrast with the wretched experience I had with the earlier Microsoft-sponsored converter.

Next, I tried a small stress-test; a .doc bidding card for Contract Bridge that has some tricky tables. This document crashed WordPerfect’s .odt converter. Word could happily save it as .odt and reopen. Opening the exported .odt in OpenOffice showed some minor differences – part of the table went slightly out of alignment, as the illustration shows (Word is on the left, OpenOffice 3.0 on the right), but nothing drastic.

Is this the end of the format wars? Not quite; there is still a long list of features not supported by the conversion, and if you want an easy life it still pays to stay with one vendor’s Office suite. My impression though is that Microsoft has done a decent job, and that for everyday documents the conversion will work as expected.

For the OpenDocument crowd, getting the format incorporated into Microsoft Office is a victory of sorts, but not the real goal, which is to establish it as the universal document format. Microsoft is betting that its inclusion will help it sell Office, but that customers will still mostly use .doc or .docx (and the Excel equivalents). If enough institutions mandate OpenDocument, that bet could yet fail, but right now that looks unlikely.


Update: Ivan Zlatev reports on a less successful import here.

Update 2

While word processing import and export is reasonable in some circumstances, there is a deal-breaking problem with spreadsheet import and export: all formulae are either ignored or broken. That is, you can save from Excel to .ods, open in Calc, and get cells like msoxl:=SUM(C6:C8) (in plain text). You can save from Calc, open in Excel, and find formulae converted to plain text. If you save and open sheets from Excel, but in .ods format, it works; the clue why is in the rendering. It appears Microsoft has stuck by the letter of the standard, which does not specify how formulae work, but broken any kind of meaningful interoperability.
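The behaviour described above is what you would expect if each application only recognises formula strings carrying a prefix it knows about. The following Python sketch models that under stated assumptions: the `msoxl:` example is the one quoted above, while the `oooc` prefix, the function names and the whole recognition rule are hypothetical stand-ins, not the actual logic of Excel or Calc.

```python
def split_formula(raw):
    """Split 'prefix:=EXPR' into (prefix, '=EXPR'); no prefix gives ('', raw)."""
    prefix, separator, expression = raw.partition(":=")
    if separator and prefix.isalnum():
        return prefix, "=" + expression
    return "", raw


def interpret(raw, known_prefixes):
    """Model a consumer that evaluates only formulas whose prefix it recognises."""
    prefix, expression = split_formula(raw)
    if not expression.startswith("="):
        return "text"
    if prefix == "" or prefix in known_prefixes:
        return "formula"
    # Unknown prefix: fall back to showing the raw string as plain text.
    return "text"


STORED = "msoxl:=SUM(C6:C8)"

# Hypothetical consumer that does not know the 'msoxl' prefix.
print(interpret(STORED, {"oooc"}))
# Hypothetical consumer that does know it.
print(interpret(STORED, {"msoxl"}))
```

Stripping the prefix is not a fix in general, since the expression syntax behind it can differ between applications as well.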

More OOXML than ODF on the Internet, according to Google

In May 2007, IBM’s Rob Weir made a point of how few of Microsoft’s Office Open XML documents were available on the Internet. Here are his figures from back then:

odt 85,200
ods 20,700
odp 43,400
Total ODF 149,300

docx 471
xlsx 63
pptx 69
Total OOXML 603

The ODF formats are those used by Open Office, Star Office, and Lotus Symphony. Now that Office 2007 has been out for a while, I thought it would be interesting to repeat his test, using the same methodology (as I understand it), a Google filetype search. I added the macro variants to the list as this seems fair, though they don’t affect the total much:

odt    82,000
ods    16,600
odp    26,100
Total ODF 124,700

docx    87,400
docm    1,440
xlsx    14,900
xlsm    738
pptx    31,400
pptm    1,300
Total OOXML 137,178

Let me say at once, I’m not sure this is significant. For one thing, I’m suspicious of Google’s arithmetic (in all search totals, not just these). For another, I reckon it is a mistake to put either format on the public Web: PDF, RTF, or even Microsoft’s thoroughly well-supported binary formats are more fit for purpose.

Even so, it is quite a turnaround. What is particularly odd is that the ODF figures appear to have declined. Again, it could just be that Google changed its way of estimating the totals.
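For anyone who wants to check the arithmetic, the totals in the two lists above can be reproduced with a few lines of Python. The counts are copied from those lists; the variable and function names are just for this sketch.

```python
# Counts copied from the two lists above (Google filetype search estimates).
MAY_2007 = {"odt": 85_200, "ods": 20_700, "odp": 43_400,
            "docx": 471, "xlsx": 63, "pptx": 69}
REPEAT = {"odt": 82_000, "ods": 16_600, "odp": 26_100,
          "docx": 87_400, "docm": 1_440, "xlsx": 14_900,
          "xlsm": 738, "pptx": 31_400, "pptm": 1_300}

ODF_EXTENSIONS = ("odt", "ods", "odp")


def family_totals(counts):
    """Return (ODF total, OOXML total) for one set of search counts."""
    odf = sum(n for ext, n in counts.items() if ext in ODF_EXTENSIONS)
    ooxml = sum(n for ext, n in counts.items() if ext not in ODF_EXTENSIONS)
    return odf, ooxml


ODF_THEN, OOXML_THEN = family_totals(MAY_2007)
ODF_NOW, OOXML_NOW = family_totals(REPEAT)
ODF_CHANGE = ODF_NOW - ODF_THEN

print("ODF:  ", ODF_THEN, "->", ODF_NOW, "change", ODF_CHANGE)
print("OOXML:", OOXML_THEN, "->", OOXML_NOW)
```

On these figures the ODF total falls by 24,600, while the OOXML total grows by a factor of more than 200.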

Incidentally, I doubt that this has anything to do with ISO standardization, especially considering that the current OOXML implementation in Office 2007 does not conform. It has everything to do with the popularity of Microsoft Office and its default settings for saving documents.

10 things you might not have known about XAML

I’ve written a short piece on XAML for the Register. Here are a few things you might not have known about Microsoft’s Extensible Application Markup Language:

1. It is not just for WPF (Windows Presentation Foundation); it is also used as a language for Workflow Foundation (WF). Microsoft has hinted that we will see more XAML applications announced at the forthcoming PDC.

2. XAML doesn’t have to be XML – see the intro to the XAML Object Mapping Specification 2006, which says that “any physical representation may be used.”

3. XAML is a small core and distinct from XAML vocabularies. The huge WPF is a XAML vocabulary. WF is another vocabulary.

4. Although XAML is usually represented as XML, it is near-impossible to create an XML Schema to validate it usefully. Here’s where Microsoft explains why.

5. In Visual Studio 2005, a huge but imperfect .xsd schema file was used for validation and to drive IntelliSense (things like code completion) in the XAML editor. In Visual Studio 2008 Microsoft abandoned that idea and uses a language service instead.

6. The core idea behind XAML is to be a declarative language for .NET. WPF is merely an early application for XAML.

7. XPS, Microsoft’s fixed-layout language that competes (just about) with Adobe’s PDF, uses XAML that is a subset of WPF. This means that you can actually display XPS documents in Silverlight – there’s no need for a viewer, it is native Silverlight code.

8. When you compile a Silverlight application, the XAML stays as XAML, albeit bundled into a resource.

9. Silverlight allows you to write inline XAML within HTML.

10. XAML rhymes with Camel. Sorry, you knew that already. But did you know that CAML (Compiled Application Markup Language) is XAML compiled to MSIL (Microsoft Intermediate Language)? Microsoft tested this idea in pre-release versions of WPF, but apparently the performance benefits were disappointing and it was less compact than BAML (Binary Application Markup Language), a tokenized representation of XAML. Silverlight doesn’t bother with either: XAML is saved as a resource in a .NET DLL, and then zipped as part of the .XAP package by which a Silverlight application is delivered.
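The idea in points 3 and 6, a small core that maps markup onto objects plus a separate vocabulary of types, can be illustrated with a toy analogy in Python. This is not XAML and not how any Microsoft loader works: the `Panel` and `Button` classes, the `VOCABULARY` table and the `load` function are all invented for the sketch, and real XAML features such as namespaces and property elements are left out.

```python
import xml.etree.ElementTree as ET


class Panel:
    def __init__(self):
        self.children = []
        self.name = ""


class Button:
    def __init__(self):
        self.children = []
        self.content = ""


# The "vocabulary": which element name creates which class.
VOCABULARY = {"Panel": Panel, "Button": Button}


def load(element):
    """The "core": element -> object, attributes -> properties, nesting -> children."""
    obj = VOCABULARY[element.tag]()
    for key, value in element.attrib.items():
        setattr(obj, key.lower(), value)
    for child in element:
        obj.children.append(load(child))
    return obj


MARKUP = '<Panel Name="root"><Button Content="OK"/><Button Content="Cancel"/></Panel>'
ROOT = load(ET.fromstring(MARKUP))
print(ROOT.name, [button.content for button in ROOT.children])
```

Swapping in a different `VOCABULARY` table would let the same loader build a different kind of object tree, which is roughly the relationship between the core and vocabularies such as WPF and WF.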
