Web of Information: Well Structured HTML

In my last post I started to get into the meat of the message I’m trying to get across. I’ll recap that message here, in the hope that over the course of these posts I can get it down to something brief and very easy to understand.

The Idea

The web, at its core, is all about the organization of information. It is organization that determines accessibility of information. Accessibility determines a client’s ability to consume information. The client’s ability to consume information determines the usefulness of the web (our work). Our ultimate goal, as web developers, is to make our work useful.

Organization begins with the structure of the individual document and flows out to how a group of documents relate and link to each other, which further flows out to how groups of documents relate and link to one another, to sub-sites or sub-domains, to whole web-sites, to the web itself.

Everything begins with document structure. We use HTML as the (primary) means to structure individual documents. How we structure HTML documents is the key to how well the web works. That is why, in this post, I will be focusing on well structured HTML.

A Bit Ambiguous

I was purposely being ambiguous in the last section. We need to recognize that a client is not necessarily a user accessing information via a graphical web browser. There will certainly be users who operate using a text-only browser such as Lynx. Others may be using a screen-reader such as JAWS. And others still might not be human at all, but a computer application designed to pick out the important parts of documents on the web. One such example would be a search engine’s webcrawler or bot. If the bot can understand the structure of your document, it will be better equipped to index your information and make it easier for others to locate and access it.

Likewise, documents could be more than just HTML structured documents. We might also find text documents, spreadsheets, presentation slides, images, and any other format that stores information. For the purposes of this post, I’m going to focus on HTML, but we must keep in mind that HTML is not the only way information is presented on the web.

What Is Well Structured HTML?

Well structured HTML is markup used to provide meaning and organization to the information contained within an HTML document. Well structured HTML is free of superfluous markup such as embedded presentation logic, which I discussed in my previous post. If a piece of markup exists within an HTML document, it has reason and purpose and provides added meaning or insight to the information it is acting upon.

So what the hell does that really mean? I’ll get into that in just a second, but first I want to point out that when I talk about well structured HTML I am not focusing on correct HTML, where each opening tag has its corresponding closing tag, where nested tags open and close in the correct order, and so on and so forth. Syntax is not what I’m talking about here. I’m talking about purpose and meaning. These are slightly abstract concepts, but they have a greater importance and are the subject that got me going on putting together this Web of Information series in the first place.

Let’s look at some HTML:

This is a line of text.<br>
And here we are on a second line of text.<br>Or is it the
second item in a list?<br>
Or is it a new paragraph?

So which is it? What is the relationship between the first and second blocks of text? There’s an empty line separating the two text blocks. Why? What is the intent or purpose of that separation? To a graphical browser, it doesn’t matter much, does it? Whether I’m using line breaks or paragraph tags or a list with no style, anyone viewing this through a browser will probably interpret the information as two paragraphs of text.

And this is what’s plagued the web: too much focus on the graphical representation of the information, and not enough on the underlying structure of the information.

Detour In Philosophy

In the HTML sample above, the intent of the author is ambiguous. We humans using a graphical browser can make assumptions easily enough, but other applications or methods used to consume this information may not be able to. Not important? Perhaps. This is more academic than practical, isn’t it? If 99% of your information consumers are humans on a graphical interface, who cares? This is where I tie into the unfinished thoughts in the philosophy section of the Philosophy In A Broken Web article. We don’t know what the future may hold, but we can be fairly certain a lot of the HTML documents out there will be around for some time. Down the road, the need to understand the purpose and intent of a block of text could become very important. Advances in search engine technology might occur in which the position of text within the structure of an HTML document adds to or detracts from the ‘score’ given for a certain search term. A term appearing in a list might carry more weight than one found in a paragraph because the list inherently carries more importance or is more closely related to the true subject matter of the document. This is one example of what might be. There are millions more I can’t begin to imagine. It is because of those millions that I want to develop well structured HTML. It’s not only an investment in the now (which I’ll get into later) but also an investment in the future.

Back To The Show

Given the HTML example above, well structured HTML means adding meaning, purpose, intent, etc. to the information. We do this by being explicit in the document’s structure. The well structured HTML version of the previous example would look something like this:

<p>This is a line of text.</p>
<p>And here we are on a second line of text.</p>
<p>Or is it the second item in a list?</p>
<p>Or is it a new paragraph?</p>

So now we know that each block of text is a paragraph. Well structured HTML is markup that keeps to the spirit, or original intended purpose, of a given HTML tag. Tables contain tabular data, and only tabular data. H2 and the other heading tags mark the headings of different sections within the page. They form a type of hierarchy that a client can use to get a better handle on the structure and organization of the document. A quick scan of just the headings will tell the client what each section is about without having to parse individual paragraphs. Strong and em wrap information that needs to be elevated in importance above the surrounding information (such as text in a paragraph). Lists contain lists of information, blockquotes contain quoted text from some other source… and so on.
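To make that concrete, here is a minimal sketch of a page fragment in which each tag is used only for its intended purpose. The content (a made-up recipe page, its headings, and the quoted remark) is invented purely for illustration; only the choice of tags matters.

```html
<h1>Weeknight Chili</h1>

<!-- A paragraph, with one phrase raised in importance -->
<p>This recipe feeds four and takes about an hour.
   <strong>Do not skip browning the meat.</strong></p>

<h2>Ingredients</h2>
<!-- A list holds a list of things -->
<ul>
  <li>Ground beef</li>
  <li>Kidney beans</li>
  <li>Crushed tomatoes</li>
</ul>

<h2>Cooking Times</h2>
<!-- A table holds tabular data, and only tabular data -->
<table>
  <tr><th>Step</th><th>Minutes</th></tr>
  <tr><td>Brown the meat</td><td>10</td></tr>
  <tr><td>Simmer</td><td>45</td></tr>
</table>

<h2>What Others Say</h2>
<!-- A blockquote holds text quoted from some other source -->
<blockquote>
  <p>Best chili I have had all winter.</p>
</blockquote>
```

A client that reads nothing but the h1 and h2 tags here already has an outline of the page: Weeknight Chili, then Ingredients, Cooking Times, and What Others Say.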

You soon find yourself using br tags a lot less, to the point that a majority of documents will probably not have a single one. In fact the XHTML2 specs replace br with a less ambiguous l (line) element used to wrap individual lines. It explicitly defines a string of text as being part of a single line. It adds meaning, whereas br has virtually no meaning to it and is really a cleverly hidden bit of embedded presentation logic.

Hr tags might be in the same boat, although hr does have meaning and purpose: it makes the separation of information more pronounced. Headings ought to provide enough functionality for that, but perhaps there will be times when it isn’t logical to use headings yet an explicit separation of information is needed. Maybe it’s a gray area. And this is where you really get to exercise your own philosophy. You make the choice. Is hr only a form of embedded presentation logic or is there a purpose to it? That’s your call.

I’m not going to go over the purpose of each and every HTML tag here. You can consult the HTML spec yourself and determine the intent, purpose, and meaning that each tag provides.

What’s important here is that you break down your content into logical blocks that are defined by whatever markup you feel is appropriate. Whenever possible, be explicit about the intent of the information. Paragraphs are wrapped in p tags, tabular data is placed inside tables, and so on.

At the end of the day, your pages will be primarily headings, paragraphs and lists. Tables, certainly, but you don’t see tabular data as often, especially inside blogs or personal websites. And certainly you’ll have more than just what I’ve listed, but you should start to feel a great simplification in how your information is structured, and it should feel correct, clean, and good.

But Your Approach Gives Me Plain Pages. ICK!

As I said before, 99% of your clients will be humans operating a graphical browser. Color and other presentational elements will certainly come in handy, and actually aid in the visual breakdown of the information within the HTML document. And you most certainly can (and should) use them, even with the approach I’ve outlined here.

How? CSS! Throw an “underline” class into a stylesheet, link it from your HTML document, and apply it to your headings. Now you can get your hr effect without hr tags. Change colors for different heading levels to help guide a user’s eyes through your information at the depth they want to scan at. Alternating background colors for table rows certainly helps the eye when trying to follow a row across the screen. All done by applying a simple class or id attribute to your HTML document. That’s perfectly acceptable, although I do recommend using class and id values that have some meaning to them. Class names such as “blue” and “dots” work fine, but “worksheetTable” and “importantHeading” carry much more meaning in their names. Plus it saves you from having to apply red colors to a class named “blue” when you decide you’re no longer happy with blue headings.
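As a rough sketch of what that looks like in practice, here is a small page whose markup carries only structure plus meaningful class names. The file name site.css, the alternateRow class, the sample figures, and the rules shown in the comment are all assumptions made up for this example; the worksheetTable and importantHeading names come from the paragraph above.

```html
<!DOCTYPE html>
<html>
<head>
  <title>Quarterly Worksheet</title>
  <!-- All presentation lives in one external stylesheet (hypothetical file name) -->
  <link rel="stylesheet" type="text/css" href="site.css">
  <!--
    Hypothetical contents of site.css, shown here only as a comment:

    h2.importantHeading { border-bottom: 1px solid #999999; color: #003366; }
    table.worksheetTable tr.alternateRow { background-color: #eeeeee; }
  -->
</head>
<body>
  <!-- The class says what the heading is, not what it looks like -->
  <h2 class="importantHeading">Quarterly Totals</h2>

  <table class="worksheetTable">
    <tr><th>Quarter</th><th>Total</th></tr>
    <tr><td>Q1</td><td>120</td></tr>
    <tr class="alternateRow"><td>Q2</td><td>135</td></tr>
  </table>
</body>
</html>
```

The bottom border on the heading stands in for an hr, and if the heading color ever changes from blue to red, the class name importantHeading is still accurate.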

If you’re reading this, you probably have some understanding of the power of CSS. I won’t go into it here. But suffice it to say that external stylesheets should provide you with all you need to add presentation logic to your HTML documents.

Practical Benefits

Screw those academic and philosophical approaches. I need something tangible; a real reason to actually care about well structured HTML.

Well, you’ve got it. Well structured HTML documents, as I’ve discussed here, will almost always be smaller in filesize than ones with lots of embedded presentation logic and br tags. Furthermore, they are much easier to understand when looking at the raw HTML. How often have you gone to edit a page originally created by Word or FrontPage or DreamWeaver, only to cringe at the sight of the cluttered and confused structure? With this approach, even documents produced by FrontPage and DreamWeaver will be much easier on the eyes and brain to edit. No more seemingly random font tags or empty strong or em tags. Things become much cleaner and easier to manage. When was the last time you created an HTML document in FrontPage that you could say had clean HTML?
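Here is an illustrative before-and-after of the kind of cleanup I mean. Both fragments are invented for this example rather than taken from any real editor’s output, but the first is typical of the clutter in question.

```html
<!-- Before: presentation embedded in the markup -->
<font face="Arial" size="4" color="#000080"><b>Store Hours</b></font><br>
<br>
<font face="Arial" size="2">We are open <b><font color="#ff0000">every day</font></b>.</font><br>
<font face="Arial" size="2"><b></b>Closed on holidays.</font><br>

<!-- After: structure only; fonts and colors move to a stylesheet -->
<h2>Store Hours</h2>
<p>We are open <strong>every day</strong>.</p>
<p>Closed on holidays.</p>
```

The second version is shorter, and it also tells you something the first never does: that “Store Hours” is a heading and the two lines beneath it are paragraphs.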

Well structured documents, being easier to follow and edit, are also easier for others to work on. No longer will you have to rely on the single person who understands the arcane purpose of font tags nested 4-deep. Beginners to HTML will find them much easier to follow and understand as well. A less confusing document means less chance of mistakes when making changes.

But the biggest benefit of all comes when you use external stylesheets to handle your presentation logic. With embedded presentation logic you have to edit every single page when you make a change to your site’s color scheme or layout. With external stylesheets, all your documents point to a single source to define colors and other presentation logic. A change to your site’s color scheme is but a few quick edits to a single CSS file and your entire site is updated. I covered this in my previous post but just wanted to double-up on the message.

Well structured HTML documents + CSS = a website that is small, efficient, and easy to manage, with an inherent organization to individual pages that makes it much easier to carry that organization up through the entire site. I’ve even heard of several instances where sites that use well structured HTML found themselves ranked better by search engines, with users of those search engines more likely to find exactly what they’re looking for.

Do As I Say…

So if you’ve viewed the source of this page at all you’ve seen it’s got not so great HTML structure. Yep. Headings within paragraph tags?! C’mon… I know better than that… don’t I? Wait… there’s even some embedded presentation logic that I’ve added to this article! With all my crap about embedded presentation logic, why am I still using it?!

A couple of reasons. First, the headings inside paragraphs seen on this blog are directly related to my laziness. I could insert my own paragraph tags and tell MovableType to not convert newlines into appropriate HTML. At some point I might go back and fix that, but for now… I’m lazy.

What about that embedded presentation logic? Sometimes it’s the means to an end. Sometimes it’s just easier that way. And sometimes… see reason given in previous paragraph.

This is a good philosophy. It is a good guide. Do I expect everyone to follow these ideas to the letter? No. I’m a realist. But keeping these ideas in mind will certainly help you out, and help you to develop your own approach. You may not find yourself making every HTML document “perfect”, but it will be better if you can apply even just a little of what’s here.

And sometimes you just forget yourself completely and go off and do things you know you shouldn’t. It’s exploration. We can’t get stuck into a single paradigm and expect it to be right forever. Kick the tires, make sure this is still a strong idea. Maybe embedded presentation logic has its place. Maybe it’s not so black and white. I’m not going to tell you how it is. I’m going to give you some ideas. You make of them what you will.

Second Star To The Right…

So where to from here? The original message I wanted to get out when I started this has, for the most part, been put down in these articles. But there are certainly other areas of web development to talk about, and that’s what I’m going to do. Each post may not tie in to its predecessors, but I hope to have a few more ideas to share as we go along here.

As always, I’d love to hear from you. Any thoughts at all, even “wow, this sucked”, I’ll take that. Either post a comment or e-mail me directly at: ruthsarian@gmail.com.



One thought on “Web of Information: Well Structured HTML”

  1. Mr. Ruthsarian,

    I have been browsing over your work for the past several hours and find myself more and more thoroughly impressed. I have been a long time advocate of W3C standards and am amazed to see some of the layouts you have put together utilizing purely compliant CSS that still degrade beautifully back down to NS4. My applause to you.

    In regards to your blog articles… they seem very informative and worthwhile. I have seen few comments throughout and just want you to know that your work is greatly appreciated, respected, and helpful. I am sure that your articles, concepts, and layouts will save me countless hours in the future. I run a website design company and am the primary web developer. I find myself constantly struggling to try to stay away from tables, yet rarely finding good solutions. I am hoping that with inspiration from your sites I can finally work towards achieving XHTML 1.1 all the time with only tabular data (i.e. from database queries) within tables.


