Factual has analyzed data from 4 million web sites and provided a holiday gift for stats junkies.
Did you know 5% of sites have either a Twitter or Facebook link? Or that 28% of sites run Google Analytics? Or that 12% of them run Google AdSense? Now you do!

The core data comes from CommonCrawl, a non-profit group designed to crawl the web and provide data for anyone to use. Gil Elbaz is a founder of both CommonCrawl and Factual, a start-up that creates tables of structured information from data found on the open web (see Factual: Parting The Curtains Of The Invisible Web).
Factual found stats such as those I cited above after examining 4 million web sites. In particular:

- 28% of sites have Google Analytics on them
- 12% of sites have AdSense
- 5% of sites have EITHER a Twitter or Facebook link but…
- 2% of sites have BOTH a Twitter and a Facebook link

There’s also a chart that shows other interesting stats but without precise percentages.
I’ll estimate as best I can:

- About 20% of sites have Flash
- About 19% of sites have an RSS feed
- About 6% of sites have a sitemaps file
- About 1% of sites have a Google Webmaster Central verification code
- About 1% of sites have Quantcast tracking code
- About 0.5% of sites have a Creative Commons attribution

One thing that's unclear is how the stats break down on a page versus web site basis. A web site might have multiple pages. So when a “web site” is said to have AdSense on it, does that mean each page within the site has AdSense code? Or only some of them? It appears a decision was made on a site-by-site basis, with “site” being defined as all the pages within a set domain or subdomain.
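To make the page-versus-site question concrete, here is a minimal Python sketch of one way page-level detections could be rolled up into a per-site figure. It is not Factual's actual method. It assumes a "site" is the URL's hostname (so a subdomain counts as its own site) and that a site "has" a feature if any one of its crawled pages does. The function names and the sample pages are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Return the hostname (domain or subdomain) used here as the 'site' key."""
    return (urlsplit(url).hostname or "").lower()


def site_feature_share(pages, feature):
    """Share of sites where at least one crawled page carries `feature`.

    `pages` is an iterable of (url, set_of_feature_names) pairs.
    The "any page counts" rule is an assumption, not Factual's stated rule.
    """
    has_feature = defaultdict(bool)
    for url, features in pages:
        site = site_of(url)
        # Reading the key registers the site even if no page has the feature.
        has_feature[site] = has_feature[site] or (feature in features)
    if not has_feature:
        return 0.0
    return sum(has_feature.values()) / len(has_feature)


# Made-up sample pages, for illustration only.
sample_pages = [
    ("http://example.com/", {"adsense"}),
    ("http://example.com/about", set()),
    ("http://blog.example.com/post", set()),
    ("http://example.org/", {"analytics"}),
]

print(site_feature_share(sample_pages, "adsense"))
```

Under these assumptions, one of the three sample sites counts as having AdSense, even though only one of its two pages carries the code. A stricter "every page" rule would give a different number for the same crawl, which is why the definition matters.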
Those interested can play with the data themselves. It’s summarized in this very large table at Factual.

CommonCrawl also gets a bit of publicity from this at an interesting time. Earlier this week, Google released a long internal memo talking about how important it was to the company to be open, except in the areas of search and ads:

"In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users. The search and advertising markets are already highly competitive with very low switching costs, so users and advertisers already have plenty of choice and are not locked in."

I’ll likely do my own follow-up post to that memo in the near future.
In the meantime, a post I wrote back in 2007, Google: As Open As It Wants To Be (i.e., When It’s Convenient), looks at how Google’s claims of being open tend to ring false when open isn’t something it seems to pursue in areas where it is ahead. In part from my post:

"That large index gives Google a huge advantage over rivals. It knows more about what’s on the web than anyone else. So why not share? Why not start an Open Index Alliance where there’s a coordinated effort to crawl and index all the documents in the world, allowing anyone to tap into the raw data?"

That’s the idea behind CommonCrawl.
Maybe as part of being open, Google could get behind the project?

See also Chris Dixon’s post from this week, Google should open source what actually matters: their search ranking algorithm, for related thoughts about Google, search and openness, along with comments from me and others, including the head of Google’s spam fighting team, Matt Cutts.

As for ads, see Schmidt: Someday, AdSense Publishers May Know Google’s Cut Of Ad Revenues, from me earlier this year, which looks at how most AdSense publishers have no idea how much money Google keeps back for itself. It’s hard to find an argument that supports not being open about this, in the face of Google’s declared love of open.