Lately I have been generally frustrated with forum-based sites, regardless of subject matter. And yet, I don’t blame them for my frustration. The problem is systemic in nature.

Forum-based web sites function as an archive of data over long stretches of time, like a traditional library, but they are built far more dynamically. Their content has differing value to different people over the course of time. The forums contain a lot of information which, like all information, needs to be evaluated against a number of factors to determine its value to the consumer. As with any research, those factors include the author’s experience and motivation, the source of the original information, logic and empirical evidence, verification, and so on.

One of these critical factors is the time of origination (or posting). Forum software puts a time stamp on every entry made in it, and that time stamp tells you a lot. When researching the Iraq War, an entry made in 2003 is likely to reflect a more speculative perspective than one made seven years after the war’s start. As a developer, seeing an entry about C# dated 2003 tells me I am not looking at a technique that relies on functionality introduced in version 4.0 of the language.

And lately, I go to the header or footer line of a forum entry to see when the author posted it, only to discover that it is… BLANK!!

To understand what is happening, you need to remember that most forum sites are supported by revenue from visitors clicking on the ads displayed on the site’s pages. And the higher a site ranks in a Google (or other engine’s) search, the more visitors it will get, maximizing its ad revenue potential.

The forum sites have a commercial need to get the best rank possible. But Google and the other search engines factor the freshness of a site’s content into its rank: generally, the fresher the content, the more it counts toward the final rank.

How does a search engine determine how “fresh” or “new” a web site’s content is? Obviously it needs some kind of time stamp, derived directly or indirectly. The bots and crawlers that sweep the net looking for this information have at least two simple options. At the protocol level, a GET or HEAD request to a web site returns a date for the page in the response headers. Unfortunately, when the page content is built dynamically by a script or other executable, that time stamp is simply the current time. The date in the headers is only reliable for static pages (*.html, *.txt, *.pdf, etc.).
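To make that concrete, here is a minimal sketch in C# (the language I work in) of what a crawler could check at the protocol level. The URL is made up for illustration; the point is that the Last-Modified header is often missing or meaningless for dynamically generated pages, while the Date header is always just the moment the response was produced.

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Minimal sketch: issue a HEAD request and inspect the date-related headers,
// roughly what a crawler could do at the protocol level. The URL is made up.
class HeaderProbe
{
    static async Task Main()
    {
        using var client = new HttpClient();
        var request = new HttpRequestMessage(HttpMethod.Head, "https://forums.example.com/thread/12345");

        using var response = await client.SendAsync(request);

        // Last-Modified is only trustworthy for static files; dynamic pages
        // typically omit it or stamp it with the time the script ran.
        if (response.Content.Headers.LastModified is DateTimeOffset lastModified)
            Console.WriteLine($"Last-Modified: {lastModified:u}");
        else
            Console.WriteLine("No Last-Modified header - likely a dynamically built page.");

        // Date is always the moment the response was generated, so it says
        // nothing about the age of the content itself.
        if (response.Headers.Date is DateTimeOffset served)
            Console.WriteLine($"Date: {served:u}");
    }
}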

Content crawlers have to ignore what the server tells them about dynamic pages and use option 2: dive into the content itself to determine when the information was created. Not surprisingly, one of the simplest ways to do that is to harvest an existing time stamp rendered right on the page. Forum sites are figuring this out. Since they archive discussions that are not new, time stamps that place their content anywhere other than the very near past (24-48 hours, maybe a week) risk lowering their ranking in the results.
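As a rough illustration of what that harvesting could look like, here is a simplified sketch. The date formats, the regular expression, and the sample snippet are assumptions made purely for illustration; real crawlers use far more robust heuristics.

using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Simplified sketch of "option 2": scan the rendered page text for a visible
// posting date instead of trusting the response headers. The date formats
// below are illustrative assumptions, not what any real crawler uses.
class ContentDateHarvester
{
    static readonly Regex DatePattern = new Regex(
        @"\b\d{1,2}\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\s+\d{4}\b",
        RegexOptions.IgnoreCase);

    static DateTime? ExtractPostDate(string pageText)
    {
        var match = DatePattern.Match(pageText);
        if (!match.Success)
            return null; // no visible time stamp - exactly what some forums now count on

        // Try a couple of common renderings of the matched text.
        string[] formats = { "d MMM yyyy", "d MMMM yyyy" };
        if (DateTime.TryParseExact(match.Value, formats, CultureInfo.InvariantCulture,
                                   DateTimeStyles.None, out var posted))
            return posted;

        return null;
    }

    static void Main()
    {
        string snippet = "Posted by dave42 on 17 Mar 2003 at 09:14";
        Console.WriteLine(ExtractPostDate(snippet)?.ToString("yyyy-MM-dd") ?? "no time stamp found");
    }
}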

So one of the simplest ways for a site to fight back is to deny the bots any time information within the page content: if a time stamp is associated with the content, display only the content to the user. This comes at the expense of the consumers of the information, who need that time stamp to evaluate the information effectively as a whole.

This scenario is, to me, one of the best examples of the classic public-interest-versus-commercial-interest conflict we are seeing on the Net today. In the United States, we use a library system founded on principles laid out by Ben Franklin. The libraries are supported by common public funds allocated for their operation. They don’t need to compete with other libraries for funds and, hence, don’t have to change how they catalog information to suit a third party’s opinion and promotion of their content. Web forums are mostly privately financed and don’t have this luxury.

Google’s advanced search options do include overrides to specify what data to include in the results, among them a time span. But web site operators are listening to optimization experts, who tell them to eliminate any time stamp that can work against their ranking. And that makes sense, since most people aren’t going to override the default or even know how the default behaves. Google’s advanced search defaults the time range to “anytime”. Still, what we don’t know is whether the ranking algorithm uses that setting to dynamically alter the rank. I also do not know what Bing or the other search engines have implemented in their algorithms to address this issue.
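For what it’s worth, the time filter shows up as a URL parameter when you apply it by hand in the results page. As far as I know this “tbs” parameter is unofficial and undocumented, so treat the little sketch below strictly as an illustration of the override versus the default, not as a supported API.

using System;

// Illustration only: when a time range is picked under Google's search tools,
// the results URL gains an unofficial "tbs" parameter (e.g. tbs=qdr:y for
// "past year"). Leaving it off is the "anytime" default discussed above.
class TimeFilteredQuery
{
    static void Main()
    {
        string query = Uri.EscapeDataString("C# 4.0 covariance forum");
        Console.WriteLine($"https://www.google.com/search?q={query}");            // default: anytime
        Console.WriteLine($"https://www.google.com/search?q={query}&tbs=qdr:y");  // past year only
    }
}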

The bottom line… until Google and the others make their handling of time stamps absolutely clear to web site operators, operators will continue to believe that time stamps work against their commercial interest, and a valuable piece of information for evaluating data will be withheld from users.