
Online news, robots.txt, and ACAP

Posted by Martin Stabe on 16 November 2006 at 11:24
Tags: ACAP, Journalism

In the coming year, one of the more technical running debates in online publishing will concern the development of a new standard for automatically informing search engines’ indexing robots about the conditions for accessing online content.

A group of global publishing industry bodies, including the World Association of Newspapers, is proposing a new mechanism for this known as the Automatic Content Access Protocol, or ACAP.

Some time in the next two or three weeks, the consortium will launch a year-long pilot programme to develop the new standard. Six or seven online publishers, from both sides of the Atlantic, along with one — or possibly two — of the three major search engines will be involved, according to Mark Bide of Rightscom, who is coordinating the project.

The project, formally launched at the Frankfurt Book Fair last month, was first revealed in September after Belgian newspapers sued Google for copyright infringement. ACAP, it was claimed, would help prevent similar disputes in the future.

At the time, Google seemed to insist that existing opt-out mechanisms were sufficient. On Google’s official blog, the company’s European Director of Communications and Public Affairs, Rachel Whetstone, wrote: “[I]f publishers don’t want their websites to appear in search results (most do) the robots.txt standard (something that webmasters understand) enables them to prevent automatically the indexing of their content.”

The “robots.txt” protocol, more formally known as the Robots Exclusion Standard, has been around for more than a decade — ancient by online standards. It works by letting webmasters include a simple text file in the top level directory of their web site, which tells robots about any sections of the site they should ignore. Here’s Guardian Unlimited’s robots.txt file, for example:

User-agent: *
Disallow: /sendarticle/
Disallow: /Users/

This means search robots are being given full access to the site, except for the two named directories.
Times Online, by contrast, has rather more extensive rules in its robots.txt file.

There is no mechanism for enforcing compliance, but reputable search engines follow the rules set out in robots.txt files.
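As a rough illustration of what a well-behaved robot does with such a file, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The rules are the Guardian Unlimited ones quoted above; the robot name "ExampleBot", the helper is_allowed and the sample paths are invented for this example and do not describe any real crawler.

```python
from urllib.robotparser import RobotFileParser

# The Guardian Unlimited rules quoted above, as a robot would read them.
GUARDIAN_RULES = """\
User-agent: *
Disallow: /sendarticle/
Disallow: /Users/
"""


def is_allowed(rules: str, user_agent: str, path: str) -> bool:
    """Return True if robots.txt `rules` let `user_agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleBot" and these paths are hypothetical.
    for path in ("/news/story.html", "/sendarticle/123", "/Users/someone"):
        verdict = "allowed" if is_allowed(GUARDIAN_RULES, "ExampleBot", path) else "disallowed"
        print(path, verdict)
```

Nothing here stops a robot from skipping the check altogether, which is the sense in which compliance is voluntary. The answer for each path is also a plain yes or no, with no room for conditions.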

The problem is that the rules that can be set with robots.txt are fairly limited. It can be used to welcome search engines or to tell them to stay away, but that’s about it. ACAP’s supporters suggest that this binary on-off switch is inadequate. Their new standard, they say, would build on robots.txt by allowing publishers to set more detailed terms and conditions of access.

Publishers might, for example, want to allow search engines to index their site but not to make a cached version available to searchers. That becomes important when a libelous article, already removed from the publisher’s web site, remains visible in a search engine’s cache.

Timing might also be important for some publishers. For whatever reason, publishers might not want their content to appear on aggregators for the first hour after publication, or they might want to allow archive material to remain available in a cache only for the first 10 days following publication. None of this can be achieved with the existing standard.

Or publishers might want to insist that a search engine displays a page synopsis of the publisher’s choosing rather than simply drawing on a few lines of copy from the page.
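To make those three examples concrete, here is a purely hypothetical sketch of the kind of terms a publisher might want to express. It is not ACAP syntax, since no specification has been published yet. The class AccessPolicy, its field names and the helper functions are all invented for illustration; only the one-hour delay and the 10-day cache window come from the examples above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical publisher terms. Not ACAP syntax, illustration only."""
    allow_index: bool = True
    allow_cache: bool = True
    embargo: timedelta = timedelta(0)           # wait before aggregators may list an item
    cache_lifetime: Optional[timedelta] = None  # None means cached copies never expire
    synopsis: Optional[str] = None              # publisher-chosen summary, if any


def may_list(policy: AccessPolicy, published: datetime, now: datetime) -> bool:
    """An item may be listed once indexing is allowed and the embargo has passed."""
    return policy.allow_index and (now - published) >= policy.embargo


def may_serve_cache(policy: AccessPolicy, published: datetime, now: datetime) -> bool:
    """A cached copy may be shown only if caching is allowed and not yet expired."""
    if not policy.allow_cache:
        return False
    if policy.cache_lifetime is None:
        return True
    return (now - published) <= policy.cache_lifetime


def snippet(policy: AccessPolicy, page_text: str, length: int = 160) -> str:
    """Prefer the publisher's synopsis over the opening lines of the page."""
    return policy.synopsis if policy.synopsis is not None else page_text[:length]


# Example terms: one-hour embargo, 10-day cache window, publisher-written synopsis.
policy = AccessPolicy(
    embargo=timedelta(hours=1),
    cache_lifetime=timedelta(days=10),
    synopsis="Publisher-written summary.",
)
```

None of these fields has an equivalent in robots.txt, which can only say whether a path is open or closed to a given robot.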

The ACAP consortium is already concerned about how it is perceived, particularly in the blogosphere, Bide said at a briefing at the Association of Online Publishers yesterday afternoon. Bide is concerned, specifically, that the project is being wrongly framed as a case of “Europe vs. America” or “publishers vs. Google”. ACAP has set aside budgets for lobbying and PR.

There are likely to be many questions about ACAP in the coming months. Is it really necessary? What are the details of the specifications? How will web developers implement it in existing sites? Will the search engines cooperate?
See also: Rebuilding Media, Steve Yelvington.
