Google and Flash

I came across this interview with Dan Crow of Google. I’ll use that interview as a basis for a discussion on Flash and search engine optimization. My conclusion is that too many people are talking about indexing SWF files, which is a non-problem: it is basically the same as saying that JavaScript files should be indexed.

We do a really efficient job with HTML, but with Flash we could improve. If we hit a web page with a Flash movie on it, we just extract the text out of it and index that text. But a Flash movie is much richer than that. So one of the projects I’m working on is to try and improve our Flash processing – that’s an example of an area where we could be better. But it’s not unique to us. [...] At the moment our advice is that webmasters need to give us a lot of help – a Flash movie is basically a set of virtual pages. If you used HTML for the links, then we’d be able to see the overall structure. The content will still be in Flash, but we can at least get some of it. Ultimately we’d like to be smart enough to look inside the Flash movie. We’re not quite there today.

Dan Crow of Google, source: Guardian Unlimited

Flash applications are stylesheets, not web pages

An HTML page is, ideally, annotated text. This is the basis for search engines: they extract the text and index it along with some extra information they get from the annotations (an <h1> is likely to be a header, etc.) and from the other pages it refers to.

A SWF file should be seen more like a stylesheet. It takes data and applies styles to it, animations, custom fonts, behaviours, etc. The actual content is often separate, usually XML produced by a CMS or just plain files. This is what Google should be trying to index, not the SWF itself, but for some reason the SWF is what people are talking about. To say that “Google should index my .swf files” is the same as to say that “Google Desktop should index my .exe files”.

I’m not sure that Dan Crow really understands Flash as a platform. If Google started to index SWF files they would find very little. The reason for this is that SWF files are more like JavaScript files or CSS files than HTML. Properly designed Flash web sites don’t have any indexable content in their SWFs at all, but even if they did, how on earth would Google make any sense of the data? Only in exceptional cases does a SWF contain information that would make sense to index, partly because the structure isn’t as clear as in HTML (how do you know what is a header, for example?), and partly because much is not decided until the SWF is actually running.

Consider this piece of code. It’s a bit contrived, but it makes my point:

var message = "Hello";

message += Math.random() < .5 ? " world" : " goodbye";

// show only on Sundays
if ( new Date().getDay() == 0 ) {
    myTextField.text = message;
} else {
    myTextField.text = "";
}

How could Google even begin to make sense of this, even if it could run the code? How would it know that this code actually displays the text to the user? And if it figured out that it was Sunday, so that the message would be displayed, what would it display?
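To make the point concrete, here is a minimal sketch in plain JavaScript that mimics the snippet above. The function names and the idea of passing in the clock and the random source are my own assumptions for illustration; the original code reads both directly, which is exactly what makes it opaque to a crawler.

```javascript
// Hypothetical stand-in for the ActionScript snippet: the date and the
// random source are parameters so that every outcome can be enumerated.
function renderedText(date, random) {
  let message = "Hello";
  message += random() < 0.5 ? " world" : " goodbye";
  // getDay() returns 0 for Sunday; on any other day nothing is shown.
  return date.getDay() === 0 ? message : "";
}

// Collect every distinct text that a visitor (or a crawler) could see.
function possibleOutcomes() {
  const outcomes = new Set();
  // Seven consecutive days cover each weekday exactly once.
  for (let offset = 0; offset < 7; offset++) {
    const date = new Date(2007, 5, 3 + offset);
    // One sample below 0.5 and one above it exercise both branches.
    for (const sample of [0.25, 0.75]) {
      outcomes.add(renderedText(date, () => sample));
    }
  }
  return [...outcomes];
}
```

Even this tiny program has three possible results ("Hello world", "Hello goodbye" and nothing at all), and which one a crawler sees depends on when it happens to run the code.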

That was a simple example. Consider things that are triggered by timers, things that happen on a specific frame during an animation, or would only be shown if the user moved the mouse, etc.

Crow’s idea of “Flash movie[s being] basically a set of virtual pages” gives some context to the idea of indexing Flash sites. The problem is that it isn’t relevant even when it is true. Many Flash sites are nothing but pages with transitions, and so far he is right, but as we have seen above, there is no way for Google to analyse the SWF and work out what these pages look like, in which order the text comes, and so on. For sites that do not fit this description it is harder still.

(Proper) Solutions

There are already solutions to the problem. Many Flash sites are already indexed by Google. Their SWF files are not indexed, but the contents of the site are indexed. Here I present how this can be done now, and how it could be done in the future (meaning, what Google should be pursuing instead of the folly of trying to index SWF files).

Graceful Degradation / Progressive Enhancement

These two are more or less the same thing: provide alternate versions of your site. If the user agent supports Flash, give it the Flash version; otherwise give it the HTML version. The difference is the order in which it is done. Graceful degradation starts from the rich content and provides alternate content if the user agent doesn’t support it. Progressive enhancement gives everyone the simplest version of the content and upgrades it if the user agent has what the rich content requires.

I use progressive enhancement for every Flash site I make, but depending on the requirements of the site, I do it a bit differently. If the point of the site is that the user should see it (Whipping Floyd, for example) then the lo-fi contents are minimal, just enough so that Google will find it for the most common searches.

If, however, the site requires that the visitor should be able to find any page within the site with a Google search, or if the textual content is the focus, then I create a simple parallel site in HTML (Sandberg Trygg, for example). If the visitor lands on an HTML page but has Flash installed, a JavaScript snippet redirects her to the page containing the Flash site, with a parameter that makes the Flash application display the correct page instead of the start page. To the user it looks as if she just landed on the right page inside the Flash site.
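A redirect of this kind could look roughly like the sketch below. It is not the code used on the sites mentioned; the entry page `/flash.html`, the parameter name `page` and both function names are made up for the example. The plugin check uses `navigator.mimeTypes`, which older versions of Internet Explorer do not fill in for Flash, so they would need a separate ActiveX-based check.

```javascript
// Returns true if the browser reports an enabled Flash plugin. `nav` is
// expected to look like window.navigator; it is passed in as a parameter
// so that the function can be exercised outside a browser.
function hasFlash(nav) {
  const mime = nav && nav.mimeTypes && nav.mimeTypes["application/x-shockwave-flash"];
  return Boolean(mime && mime.enabledPlugin);
}

// Maps the path of a lo-fi HTML page to the URL of the Flash site, passing
// the page along as a query parameter (hypothetical names and defaults).
function flashUrlFor(htmlPath, flashEntry = "/flash.html", paramName = "page") {
  // Drop leading slashes and the .htm/.html extension to get a page id.
  const page = htmlPath.replace(/^\/+/, "").replace(/\.html?$/, "");
  return flashEntry + "?" + paramName + "=" + encodeURIComponent(page);
}

// In the HTML page itself the two would be combined like this:
// if (hasFlash(window.navigator)) {
//   window.location.replace(flashUrlFor(window.location.pathname));
// }
```

The Flash application then reads the parameter at startup and shows the matching page. Using `location.replace` rather than assigning to `location.href` keeps the lo-fi page out of the browser history, so the back button does not bounce the visitor into the redirect again.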

I don’t expect more than a handful of visitors to ever see the lo-fi version. It’s there only to describe the content of the Flash version. The metadata format just happens to be HTML, because that is what Google understands best. Progressive enhancement is good because it works, and it works now. It is not ideal because it requires some work, and HTML is not a very good metadata format.

Future solutions

Google has some tools that could be used as a basis for a future indexing system, for example sitemaps. Sitemaps is a protocol where a site can publish an XML file describing which pages it contains, their relative importance and how often they are updated. This gives Google some ideas of how to optimise its indexing schedule and some more information it can use to go about its job in the best way.
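As an illustration, a sitemap file can be generated with a few lines of code. The sketch below uses the element names and namespace of the sitemaps.org protocol (`urlset`, `url`, `loc`, `changefreq`, `priority`); the function names and the shape of the `pages` argument are assumptions made for the example.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap document from an array of
// { loc: string, changefreq: string, priority: number } objects.
function buildSitemap(pages) {
  const entries = pages.map((page) =>
    [
      "  <url>",
      "    <loc>" + escapeXml(page.loc) + "</loc>",
      "    <changefreq>" + escapeXml(page.changefreq) + "</changefreq>",
      "    <priority>" + page.priority.toFixed(1) + "</priority>",
      "  </url>",
    ].join("\n")
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
  ].join("\n");
}
```

Calling `buildSitemap([{ loc: "http://example.com/", changefreq: "weekly", priority: 1 }])` gives a document with a single `url` entry. Note that the protocol only lists URLs and scheduling hints; it says nothing about what the content at each URL is.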

For most Flash sites I have worked on, I could create something similar. Perhaps not a list of pages, but at least a list of “nodes”, or “views”, or whatever you want to call them. But without using one of the techniques above, there would be no way for me to tell Google what they were or how to reach them.

Google needs to work with us, the Flash and RIA community, to create a proper infrastructure for describing rich web content: Flash websites, Ajax applications, Silverlight sites, video, etc. Instead of Google trying to index the applications, we should be able to point Googlebot to some metadata describing the content in general, the specific nodes (pages, views) within it and their relations. To support nodes the site must of course support deep linking, so that feature is probably for sites rather than applications.

Unfortunately this looks a bit like the old meta-tags of HTML. They were meant to provide metadata about pages, but nowadays search engines do not trust them and look at the actual contents of the site instead. However, there is not much of an alternative.

We have seen that it is virtually impossible for Googlebot to understand a Flash site, so as it is now, the only way to get it indexed is to provide a lo-fi version in HTML. Google happily accepts the HTML (since it’s not aware of the Flash version at all). Using JavaScript we then make sure that the visitor gets the Flash site instead of the HTML indexed by Google. Because of this, Google cannot guarantee that the HTML it indexes is what the visitor will see. Google seems OK with this, but in essence it is meta-tags all over again.

Therefore, I suggest that instead of having to create mirror sites in HTML to help Google index our Flash sites, RIAs, etc., we should be able to provide a metadata file, something like an extended sitemaps file or an RDF document, containing the URIs of all resources (nodes, views, pages, images, whatnot) in the site or application, the textual content, the relations between resources, etc. This would help Google (and other search engines, of course) to display better information to their users, to link inside Flash sites (if the site supports it) and to include images from Flash sites and RIAs in Google Image Search. It would be a simple task for us, the creators of such sites, to create the metadata document, much simpler than creating a mirror site in HTML, and if the site is built on top of a CMS it could easily be automated.
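To give an idea of how little work this would be, here is a sketch of generating such a document from CMS records. The format is entirely hypothetical: no search engine reads anything like it today, and the field names (`resources`, `uri`, `text`, `linksTo`) and the `#/` deep-link scheme are invented for the example.

```javascript
// Build a hypothetical metadata document for a Flash site or RIA.
// `nodes` is an array of { id, title, text, linksTo } records, for
// example exported from the CMS that also feeds the Flash application.
function buildContentMetadata(siteUrl, nodes) {
  const knownIds = new Set(nodes.map((node) => node.id));
  // Assumed deep-link scheme: every node is reachable as <site>#/<id>.
  const uriFor = (id) => siteUrl + "#/" + id;
  return {
    site: siteUrl,
    resources: nodes.map((node) => ({
      uri: uriFor(node.id),
      title: node.title,
      text: node.text,
      // Keep only relations that point at nodes that actually exist.
      linksTo: (node.linksTo || [])
        .filter((id) => knownIds.has(id))
        .map(uriFor),
    })),
  };
}
```

Serialising the result with `JSON.stringify` (or writing the same information as RDF or as extra elements in a sitemap) would give a single file that describes every node, its text and its relations, which is the information the HTML mirror site carries today.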

Concluding remarks

The indexing problem applies to all non-static web content: Ajax applications, Flash sites, RIAs. Even if Ajax, for example, happens to be HTML-based, that doesn’t mean it’s indexable.

I would also like to add that I don’t think that the engineers at Google actually believe that indexing SWF files would do any good. They know better than that. Sadly, what Dan Crow said is just uninformed, and saying such things doesn’t help solve the problem. I found the Guardian article by way of a blog which said something along the lines of “yeah, SWF indexing is really something we need”, and I have seen at least one other blog saying more or less the same thing.

11 Responses to “Google and Flash”

  1. JabbyPanda Says:

    Another great post, cannot agree with you more on this point.

    Please, anyone, show this post to Google engineers

    Some keywords in order for Google Bot to index this post better: “google”, “flash”, “friends”, “content”, “rules”

  2. Theo Says:

    Thank you, I’ve added the words you suggested as keywords. Silly me for not doing that on a post about SEO =)

  3. TomH Says:

    Hi there, I’ve only just discovered your website and it has some excellent articles!

    I find the SEO Flash debate is an underdiscussed and misguided one, it’s good to see someone talking sense.

    I’ve had a look at your examples and they stand up as the way things should be done, however I’ve taken it a little further by focusing on accessibility and the user experience, which naturally helps indexing as a side-effect.

    You can read about it here: http://alastairc.ac/2007/06/cms-editable-flash/

  4. TroyWorks » Blog Archive » SEO with Flash Says:

    [...] Iconara has a great post on SEO with Flash/Flex centric sites. [...]

  5. Iconara » This week’s non-issue: Google indexes SWF files Says:

    [...] but why they have switched. I have commented on Google’s approach to SWF indexing before (Google and Flash) and found that they don’t understand the problem, this is just further proof of [...]

  6. Ahora Google rastrea e incluye en su índice el contenido Flash | Search Engine Land en Español Says:

    [...] to make Flash applications more accessible and friendlier to search engines, using progressive enhancement*, and both sIFR and SWFObject emerged as good methods to ensure this. I think it continues [...]
