Can OEBPS Documents handle special characters? Equations?

OEBPS-conformant reading systems are required to display a set of special characters that includes accented characters, monotonic Greek characters, common mathematical characters, publishing-related punctuation characters (such as en and em dashes), and a few other characters. The full list of these characters, along with their character entity and Unicode representations, is given in Appendix D of the OEBPS.
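One way to keep special characters safe while a file passes through ASCII-only tools is to write them as numeric character references, which are part of XML itself. The following is a minimal Python sketch of that idea, not a procedure taken from the OEBPS; the function name `to_numeric_references` and the sample string are made up for illustration.

```python
def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal XML character reference."""
    # "xmlcharrefreplace" is a standard Python error handler for encoding.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# An accented letter, an em dash, and a Greek alpha, written as escapes.
sample = "caf\u00e9 \u2014 \u03b1"
print(to_numeric_references(sample))  # prints: caf&#233; &#8212; &#945;
```

Whether a given reading system can actually display a referenced character is a separate question, answered by the list in Appendix D.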

The OEBPS does not require that reading systems display all Unicode characters (which would permit the display of a vast number of world writing systems, ancient and modern), although it does require that unfamiliar Unicode characters not crash the device, and be signaled in some way (e.g. by a question mark) to the device user.

The OEBPS does not provide a native method of coding and displaying complex mathematical or chemical equations. The only way to do this in an OEBPS Publication is to typeset the equations and convert the typeset equations into image files.

How does coding Basic OEBPS Documents differ from coding HTML for a web page?

In general terms, a lot of the sloppy practices that are accepted by Web browsers are not permitted by XML and OEBPS.

In particular:

It is worth noting that the W3C has reformulated HTML 4.0 as XML, calling the result XHTML. For the most part, any restriction true of XHTML will be true of OEBPS as well.
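As a rough illustration of the general point, the Python sketch below feeds two fragments to a standard XML parser. The first uses habits that Web browsers tolerate (an unquoted attribute value, unclosed paragraphs, inconsistent tag case); the second is the same content written to XML rules. The helper name `is_well_formed` and both sample strings are assumptions made for this example, and the check covers well-formedness only, not conformance to the OEBPS.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Browser-tolerated HTML: unquoted attribute, unclosed <p>, mixed-case tags.
sloppy = "<body><P align=center>First paragraph<p>Second paragraph</body>"

# The same content with quoted attributes, closed tags, and consistent case.
strict = '<body><p class="first">First paragraph</p><p>Second paragraph</p></body>'

print(is_well_formed(sloppy))  # prints: False
print(is_well_formed(strict))  # prints: True
```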

How can I convert from my word-processing or page-layout program to an OEBPS Document?

Many people are hoping for a push-button conversion to OEBPS Publications from all sorts of word-processing and page-layout programs. This is barely possible for some formats, and only if text and some basic appearance characteristics are all that matter. Other formats are very poorly suited to automated XML conversion.

The problem is that eBook capabilities, and the XML markup and technology necessary to support them, go far beyond appearances. Push-button conversions cannot add intelligent markup, such as hyperlinking and complex structure-based markup, to a text; only humans can do that. Moreover, because appearances can be deceptive, push-button OEBPS conversion tools can and do make mistakes. The search for the Ultimate Push-Button is likely to be in vain.

That said, conversion tools can do a lot of the drudgework, leaving the finishing touches for humans. Since Basic OEBPS Documents are based on XHTML 1.1, conversion to HTML can be one way to get a head start on OEBPS conversion. Many word-processing and page-layout programs convert to HTML, or have available plugins that do. As with HTML authoring tools, conversion tools will produce HTML that is guaranteed to need cleanup, but at least much of the most repetitive work will be done.

What does it mean when a Basic OEBPS tag or attribute is “deprecated” in the OEBPS?

First, a bit of Web history: When HTML was first developed, it was intended for easy exchange of information rather than attractive text display. When the World Wide Web caught on, though, HTML added many tags related to display and layout. This annoyed many people who preferred that HTML be used, in the tradition of its parent SGML, to delineate the logical structure of a text rather than its appearance.

These people got busy designing stylesheet languages. Stylesheet languages, such as CSS and XSL, use the structure of the document (as defined by logical markup) to decide how to display it attractively, without cluttering up the markup itself with design issues (which tend to be much less stable than structure, as anyone who designs and redesigns Web pages can attest).

As stylesheets caught on with Web browsers and designers (and they are still catching on), purely appearance-oriented features of HTML (such as the align attribute and tags like <CENTER>) were “deprecated,” meaning that while they probably worked in Web browsers, using them was not the best possible idea, since their functionality was now being replicated and improved upon by stylesheets.

The OEBPS deprecates or refuses to support everything that is deprecated in the HTML 4.0 specification (in addition to features of HTML that are irrelevant to eBooks, such as forms and programming hooks). Moreover, the Publication Structure states that deprecated features may not be available at all in future versions. If you use a deprecated feature, you use it at the risk that future reading systems may not be able to handle it.
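A simple script can flag deprecated markup before it becomes a problem. The Python sketch below scans HTML with the standard library's `HTMLParser` and reports a few appearance-oriented features. The two sets it checks are a small illustrative subset (the `align` attribute and `<CENTER>` from the discussion above, plus `<FONT>`, which HTML 4 also deprecates), not the authoritative list from the Publication Structure; the names `DeprecationFinder` and `find_deprecated` are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative subset only; consult the specification for the real list.
DEPRECATED_TAGS = {"center", "font"}
DEPRECATED_ATTRS = {"align"}


class DeprecationFinder(HTMLParser):
    """Collect deprecated tags and attributes in the order they appear."""

    def __init__(self):
        super().__init__()
        self.findings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in DEPRECATED_TAGS:
            self.findings.append(f"<{tag}>")
        for name, _value in attrs:
            if name in DEPRECATED_ATTRS:
                self.findings.append(f"{tag}/@{name}")


def find_deprecated(markup: str) -> list:
    finder = DeprecationFinder()
    finder.feed(markup)
    finder.close()
    return finder.findings


print(find_deprecated('<CENTER><p align="right">Hi</p></CENTER>'))
# prints: ['<center>', 'p/@align']
```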

Why should I use Extended OEBPS Documents at all? Can’t I just stick with Basic OEBPS Documents?

If you are satisfied with the look and behavior of HTML-esque Basic OEBPS Documents, there is no pressing reason to go further. The single exception to this is if you are using HTML tags and attributes that are deprecated in the Open eBook Publication Structure. If you want to keep such functionality, you really should replicate it using stylesheets, because features that are deprecated now may become unusable later.

The “class” attribute, which is available to nearly every Basic OEBPS tag, can serve as a sort of “stealth” XML when used liberally. There isn’t a great deal of difference between <p class="epigraph"> and <epigraph>. Both would allow the same manipulation via CSS.

Extended OEBPS Documents might offer some advantage. HTML was never designed to represent the structure of books. XML, however, can be tailored to be representative and descriptive of your specific book. This can be significant if you have other uses in mind for your book (e.g. typesetting, content management, programmed transformations).

Of course, if you have content already in XML that you wish to use as an eBook, then creating a CSS for it may be easier than transforming the markup to a Basic OEBPS Document. Then again, given the serious limits on hierarchy enforced by the CSS limitations of the Publication Structure, a transformation may prove the wiser course.
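To give a sense of what such a transformation involves, the Python sketch below renames every element outside a known set to a generic block tag and preserves the original name in a “class” attribute, in the spirit of the “stealth” XML described above. The `BASIC_TAGS` set is a tiny illustrative subset, not the Basic OEBPS tag list, and the function name `to_basic` is hypothetical. Real content would also need decisions about inline versus block elements, existing attributes, and nesting.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; the real Basic OEBPS tag set is defined by the spec.
BASIC_TAGS = {"body", "div", "p", "h1", "h2", "em"}


def to_basic(root, block_tag="div"):
    """Rename unfamiliar elements to block_tag, keeping the old name as class."""
    for element in root.iter():
        if element.tag not in BASIC_TAGS:
            element.set("class", element.tag)
            element.tag = block_tag
    return root


source = "<body><epigraph>Call me Ishmael.</epigraph><p>Text</p></body>"
converted = to_basic(ET.fromstring(source))
print(ET.tostring(converted, encoding="unicode"))
# prints: <body><div class="epigraph">Call me Ishmael.</div><p>Text</p></body>
```

Passing `block_tag="p"` would produce <p class="epigraph"> instead, which is only safe when the renamed element contains no block-level children.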

Why is a DTD important? Can’t an Extended OEBPS Document be coded without one?

The XML specification allows for documents whose markup does not abide by a DTD, as long as they conform to basic XML rules of structure. Such documents are known as “DTDless” or “well-formed” documents. (XML documents that abide by a DTD in addition to being well-formed are called “valid.”) In other words, XML does not require that the structure of a document be determined in advance of tagging it. Any OEBPS-conformant device will display a well-formed but not valid XML document as long as a CSS stylesheet that specifies display of unfamiliar tagging is included.
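When working without a DTD, it helps to know exactly which element names a document uses, since each unfamiliar one needs a rule in the accompanying CSS stylesheet. The Python sketch below parses a well-formed document and lists its distinct element names. It is only an inventory aid under the assumption that the document has no namespaces; the function name `element_names` and the sample markup are invented for this example.

```python
import xml.etree.ElementTree as ET


def element_names(markup: str) -> list:
    """Return the sorted, distinct element names used in a well-formed document."""
    root = ET.fromstring(markup)  # raises ParseError if not well-formed
    return sorted({element.tag for element in root.iter()})


sample = (
    "<book><epigraph>Call me Ishmael.</epigraph>"
    "<chapter><para>One.</para><para>Two.</para></chapter></book>"
)
print(element_names(sample))
# prints: ['book', 'chapter', 'epigraph', 'para']
```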

That said, abiding by a DTD, whether the Basic OEBPS DTD or another DTD, confers several advantages. One, already mentioned, is that documents tagged to conform to a DTD can be checked with a parser, ensuring that the tagging is accurate and conformant to the DTD. This can eliminate some tedious editorial tasks (such as ensuring that different levels of heads are nested correctly). Another is that once a document conforms to a DTD, computer programs (such as XML editors, XML browsers, and eBook reading systems) can be designed to understand that particular DTD and deal with conformant documents quickly and easily.

Should I use one-step conversion tools from popular page-layout and word-processing formats?

Only if:

A quality eBook cannot be produced entirely mechanically, any more than a quality print book can. Automated tools can speed the process of conversion, but they cannot replace human judgment. Decent eBooks require human work.

The minutiae of eBook production are beyond the scope of this FAQ; however, my article on conversion houses should provide food for thought.