You consistently view things from only one viewpoint.
Wait wait wait, did I just read that, for real?

You're accusing me of seeing only one viewpoint? Oh man, now I've heard it all!
Look at your reply to me. Every. Single. Bit. of the HTML issue is focused solely on faithful, rigid layout representation, e.g. in historical archiving. None of it considers new document creation, or the fact that in many cases now, there need not exist an authoritative physical layout to begin with! This forum post, for example, has no inherent physical layout. Its rendered layout depends on the user's selected theme, whether they're on a desktop or mobile browser, or even using an app like Tapatalk. But by using semantic markup (like paragraphs, quote tags, etc.), it renders correctly on each. So what's the "true" layout of this?
There is none. It has no page dimensions or page numbers.
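To make that concrete, here's a minimal sketch in Python. It is not how any forum software actually renders posts; the `render` function, the tag names and the widths are all made up for illustration. The same semantically marked-up content gets laid out at two different widths, and neither result is the "true" one:

```python
import textwrap

# Hypothetical semantic document: (tag, text) pairs with no layout information.
POST = [
    ("p", "This forum post has no inherent physical layout."),
    ("quote", "So what's the true layout of this?"),
    ("p", "There is none."),
]

def render(doc, width):
    """Lay out the same semantic content for a display `width` characters wide."""
    lines = []
    for tag, text in doc:
        # The renderer, not the author, decides how each tag looks.
        prefix = "> " if tag == "quote" else ""
        lines.extend(
            textwrap.wrap(
                text,
                width=width,
                initial_indent=prefix,
                subsequent_indent=prefix,
            )
        )
    return lines

desktop = render(POST, 60)  # wide window (assumed width)
mobile = render(POST, 20)   # narrow window (assumed width)
```

Both outputs carry the same words and the same structure; only the line breaks differ, and those were never part of the document.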
What I see in your replies is someone who is judging HTML by the criteria of something it never wanted or tried to be (and deliberately so), seeing how it sucks for your application which is 180˚ diametrically opposed in its requirements, and then whining that it doesn't do it well. PDF (especially PDF/A) is designed expressly for faithful layout representation, which is part of why they removed the programmatic aspects of PostScript (only to later add in JavaScript...), and why it does support page dimensions, numbering, etc.
Other rants are based on fundamentally wrong assumptions about the situation (PDF is not PostScript with document security bolted on, PDF is actually a declarative-only small subset of PostScript, whereas actual PostScript is a full programmatic language).
I do know a lot more about PDF internals now than when I wrote that. I know PDF uses a functional subset (and minified) version of PostScript. Essentially crippled and obfuscated PostScript, with no option for non-obfuscated format, if one didn't care about file size. But still fundamentally the same for document layout purposes, plus the object structuring and security features of PDF. So, that rant is still predominantly right, and PDF is still horrible. One of the most astonishing things I found is that the PDF standard still doesn't allow for newer graphics formats. Particularly no PNG, which is why I find I can produce better quality and smaller file scanned docs using html and PNG, than a PDF file can.
It would probably behoove the PDF standard to add support for some newer formats. Who knows, maybe that's in the works for a future version of the standard. The flip side is breaking compatibility with the millions of devices and workflows that currently support PDF, and that's nothing to take lightly.
I'm currently working towards being able to write a PDF forensic dissector utility (long story, no point going into that), and the more I delve, the more I hate PDF.
Non-PDF/A PDFs weren't really designed for that...
Others (like the rant on XML) show ignorance of the design goals. XML isn’t really intended for web documents, it’s for structured data. You can press it into service as an extended HTML, but it’s really mostly used for hierarchies of key-value pairs.
I knew that when I wrote that text. I didn't say XML was intended for web document presentation, though it is a full markup language (thus making it overlap html in function.) Perhaps I didn't express it well, but I mean the XML creators extended and generalized the html syntax to encompass structured data. But they did a horrible, overly complex job. And all the XML texts I've found suck. Suffer from 'excessive abstraction' disease.
This is probably why json is more popular than XML for actually passing data. And why XML playlists, bookmark files, etc are ridiculous.
What's ridiculous about them? They're easy to parse and edit and do the job fine. How is XML "horribly complex"? It strikes me as being quite simple indeed.
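For what it's worth, here's a rough sketch of what I mean by "easy to parse and edit", using Python's standard `xml.etree.ElementTree`. The playlist format is invented for the example; it isn't XSPF or any real player's schema:

```python
import xml.etree.ElementTree as ET

# Made-up playlist format, purely for illustration.
PLAYLIST_XML = """
<playlist name="Road trip">
  <track><title>Song A</title><artist>Band One</artist></track>
  <track><title>Song B</title><artist>Band Two</artist></track>
</playlist>
""".strip()

root = ET.fromstring(PLAYLIST_XML)

# Reading: each <track> is just a small bag of key-value pairs.
tracks = [
    {child.tag: child.text for child in track}
    for track in root.findall("track")
]

# Editing: append a track and serialize the tree back to text.
new_track = ET.SubElement(root, "track")
ET.SubElement(new_track, "title").text = "Song C"
ET.SubElement(new_track, "artist").text = "Band Three"
updated_xml = ET.tostring(root, encoding="unicode")
```

That's the whole round trip: read, change, write back out.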
And others (like the “HTML doesn’t allow page layout!” rant) show both a fundamental lack of understanding of the design goals as well as not understanding that some decisions that come back to bite you in the ass later down the line were actually the correct and best decisions at the time. HTML was never originally envisioned for the things it ended up being used for. If you’re setting out to design a compact sedan, you’re not going to be thinking about “well what limitations will this have when used as a delivery vehicle for construction materials”.
You fundamentally missed my point. I know what html was intended to do. I'm saying that a design goal of creating a syntax that specifically excluded extensions to other objectives (such as representing fixed layout printed works) was insane. Because it meant millions of development man-hours and the entire infrastructure of the web, are incapable of representing books properly. And this was entirely predictable and stupid.
No, I get your point. You're just wrong about it. It wasn't "entirely predictable" nor "insane" or "stupid". There already existed page layout standards at the time, for fixed layout printed works. HTML wasn't invented to format documents (never mind books!), it was for structured, hyperlinked content. And to a very large extent, this has remained the case still. And I'm not sure these two worlds ever will (or should) merge, since they have conflicting requirements and objectives.
It's crazy for you to be whining that HTML is a lousy layout markup, when in fact HTML was never intended for physical documents.
As for it being "entirely predictable", no, it wasn't. Do you know anything about the contexts and environments of its creation? By the time it became clear that the web was going to be huge, HTML had already existed for years. (And its ancestor markup language, SGML, and its ancestors, had existed for far longer still.) And it wasn't until many more years that the web began to morph into its modern AJAXey flavor. If we were inventing HTML now, we'd do it differently, because we now have radically different requirements for it than back then.
And frankly, as someone who has worked extensively as a technical writer, structured documents with semantic markup are WILDLY underused. Decoupling meaning from display is a huge advantage in many situations (I’d argue “most situations”, actually), since it makes it far, far easier to manage consistency in appearance, reuse content for other display formats, and make changes in visual appearance later on. Most people do not understand how or why to use semantic markup, which is why they whine about Microsoft Word (which is essentially based on this principle) being complex, or HTML not allowing absolute layout, etc.
Again, you're fixating on one single viewpoint, and refusing to allow that there are other viewpoints and objectives.
Fine, semantic markup is great for creating works intended to appear reasonably consistently on differing media. But you entirely fail to comprehend the need to accurately (and therefore absolutely inflexibly) represent the appearance of a historical artifact - particularly books. No, the display device does not get a say. It simply has to be capable of an accurate visual reproduction, or it's excluded. A true document encoding scheme MUST allow for exact, fixed, inflexible specification of a document's appearance, pages and all.
Of course I understand the need to accurately and inflexibly display historical artifacts! I just don't understand why anyone would want to do that in HTML. That makes as much sense as trying to use a fish net to carry water. Use the right tool for the job, man! It's lunacy to use the wrong tool and then whine that it doesn't work well!
HTML abjectly fails to provide that facility. For no reason at all, other than people like you who think 'semantic markup' and abstraction is the only acceptable way to do anything.
I never said that it's the only acceptable way. For some tasks it is, for others it's not. IMHO, many documents get created as fixed layouts that would actually be better made as semantic markup. But it's entirely situational, and even this is only referring to newly created stuff. Archiving historical printed matter is a completely, totally, entirely unrelated task from document creation.
So of course HTML "fails" to provide faithful, inflexible document layout because it was designed for a completely different task, with the express goal of leaving final layout up to the viewer, not the document creator. Calling that a "fail" is like saying that a clothes dryer "fails" to adequately wet your clothes, when in fact its goal is to remove water...
I think it's quite delusional. Because the result is we still have no open, capable and universal format for capturing documents.
That in no way means that HTML should be pressed into service for that task. If we need a new format for document capture, that's fine, but we shouldn't destroy another, existing format to do so.
And the longer this goes on, the more cultural history is lost. For example, all those people scanning technical works into PDF, often destroying the original document in the process. Spine chopping, scanning in two-tone (fax mode), then binning the paper afterwards. Result: an extremely poor digital copy with all the illustrations ruined, and the text all jaggy. The spirit and style of the document lost, irrecoverably. An insult to the effort the authors and publishers put into the work. Plus the original (maybe a rare, irreplaceable copy) gone. This is really very criminal and stupid.
Agreed.
(Like you saying how HTML lacks page sizes and numbers... OMG, just no! How do you define a “page”?
You. just. don't. get. it.
Go pick up a book. Open it. Look at it. That is what a page is. Why do I have to say this?
Because you don't understand the concept of a non-physical work. Not all document handling has to do with capturing archival printed matter. A lot is about creating new documents. And not all new document creation has to do with creating a printed document. These are all different tasks with different requirements.
And btw, all decent graphics utilities (eg photoshop) typically know exactly what the physical dimensions of an image are, in addition to the abstract pixel dimensions. So does html even, if dimensions are specified in em, etc. Which are physical units.
No, they're not. The em is an abstract "unit" that is a relationship to a physical dimension specified elsewhere. And HTML didn't even support the em until CSS came along years later. It was added specifically to reduce absolute dimensions and replace them with dimensions that are, in essence, percentages of each other. It's closer to semantic markup than to absolute layout markup.
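A tiny sketch in Python of why that makes the em relative rather than physical. The `em_to_px` helper and the font sizes are assumptions for illustration, and real CSS resolution has more rules than this:

```python
def em_to_px(em, font_size_px):
    """Resolve an em length against the font size of the surrounding context."""
    return em * font_size_px

# The same "2em" margin, resolved in two contexts with assumed font sizes.
margin_default = em_to_px(2, 16)     # context font size of 16px
margin_large_text = em_to_px(2, 24)  # reader bumped the text size to 24px
```

The author wrote "2em" once; the actual size only exists after the viewer's font size is known.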
Too bad the idiots didn't allow for metric/imperial units too.
But allow an actual page break and page dimensions? Noooo....
So what would specified page dimensions mean when viewing them in a browser whose window doesn't match those dimensions? Or if the document isn't layout-based to begin with?
What happens when the document page size doesn’t match the output page size?)
Depends on what you want. If you want a true-size reproduction, then you don't get anything, and have to hunt a device that can produce it. Or... gasp.... it could scale. Just like every single printer on the planet is perfectly capable of. (Excluding a few ancient dot matrix, etc.)
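The scaling case is simple arithmetic. A rough sketch in Python, where the `fit_scale` helper is hypothetical and page sizes are given in millimetres (A4 is 210 x 297, US Letter is 215.9 x 279.4):

```python
def fit_scale(doc_w, doc_h, out_w, out_h):
    """Largest uniform scale at which the document page fits on the output page."""
    return min(out_w / doc_w, out_h / doc_h)

A4 = (210.0, 297.0)       # document page, mm
LETTER = (215.9, 279.4)   # output page, mm

scale = fit_scale(*A4, *LETTER)
scaled = (A4[0] * scale, A4[1] * scale)  # size of the A4 page once fitted
```

Here the height is the limiting dimension, so the page shrinks slightly and leaves some spare width.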
Again, you're operating under the assumption that it's a representation of a physical document.
I do agree that a page or section break might be a useful tag, though!
"might be" ha ha ha! But it wouldn't make sense to include it, without a lot of other extensions to provide full and easy representation of physical layouts.
The thing is, HTML doesn't (and shouldn't) provide "full" representation of a physical layout. A page break would simply be there to aid in the creation of a physical layout from the HTML, like when you print a web page. It's a hint to the renderer on what to do in certain situations. (Much like how hinting works in fonts.)
HTML and its family of bolted-on standards are indeed a mess, no doubt. I’d love to see a new reinvention of them as a sort of “clean slate” implementation. But it got many fundamental principles right, and the last thing we need is to have it become a document layout page description language, which is what you seem to want. Just use PDF for that...
You're very big on false dichotomies. Like there could never be an html equivalent that allowed BOTH abstract markup for flexible adaptation to display devices, AND rigid page/document representation. Allowing use of whichever type was appropriate to the intended target.
It's not a false dichotomy, it's a very real one based on the fact that they have diametrically opposed requirements. What is gained by creating one uber-standard that has two wildly different flavors? We already HAVE two formats that handle each set of requirements well: PDF for accurate physical layouts, and HTML for structured logical documents.
So we have two retarded standards (each a mess): html/css/js/(and the rest) for the web, and PDF for 'documents' (anything that needs to be encapsulated in a single file.) Where there should be just one (and much cleaner.)
Because things like that mean compromises for every document, rather than individual standards doing each kind of document well.
There's a reason why in desktop publishing, page layout programs (like InDesign) and word processing programs and text editors (like Word) are completely separate: one is for creating a precise layout, the other is for creating and structuring the content. It's a continuum of sorts, with some programs (like FrameMaker) being somewhere in between. But it's always a give-and-take, where strengthening one aspect (say, layout freedom) means weakening another (like structural consistency) or vice versa, which is why different programs exist to accomplish different tasks.
And indeed, when people try to force Word into service for strict layouts, they get frustrated. Just as a writer gets frustrated if they try to force InDesign into service as a word processor.
The only thing on which we agree, is the need for a clean slate.
Well, you want one clean slate for everything. I'd say we need one clean slate for HTML, and a separate clean slate for PDF, if PDF is proving inadequate.
Well, some of those are genuine stupidity (like those bad JBIG2 implementations and the Windows registry, which is quite possibly one of the worst design decisions MS ever made).
You consistently view things from only one viewpoint. The Registry design is evil, but from Microsoft's perspective it was probably one of their best decisions. It achieved exactly what they wanted - an OS that was forever doomed to 'creeping installation senescence' greatly raising the probability that most users would regularly buy new systems.
That certainly wasn't the design goal. They wanted to eliminate tons of little config files and make it easy to save key-value pairs programmatically. But instead they ended up creating a single point of failure, which they're now separating back out via the rather complex (but necessary) method of registry virtualization.