SWF indexing is a red herring, and you should all know that by now

So, here we go again: Google has announced that they will index SWF files with a new algorithm, and the whole Flash blogosphere echobox is ringing with the words of the clueless. The announcement shows how little Google understands about Flash websites, and it needlessly diverts attention away from developing a real solution to search engine optimization for Flash websites. The reaction to Google’s announcement also shows how little the Flash bloggers understand about the problem. I’m not sure which of the two is more annoying.

The bottom line is that SWF indexing is a lost cause: it will not make a difference, and the only thing that has changed is that Google is now even better at finding nothing.

To illustrate this, let’s analyse the caveats at the end of the new announcement:

  1. Googlebot does not execute some types of JavaScript. So if your web page loads a Flash file via JavaScript, Google may not be aware of that Flash file, in which case it will not be indexed.
  2. We currently do not attach content from external resources that are loaded by your Flash files. If your Flash file loads an HTML file, an XML file, another SWF file, etc., Google will separately index that resource, but it will not yet be considered to be part of the content in your Flash file.

Source: Webmaster Central Blog, “Improved Flash Indexing”

Translation:

  1. It will not work if you embed your website using JavaScript unless you’re lucky. Google will most likely not even find your Flash site unless they get really, really clever at running JavaScript and finding out how it modifies the page DOM to insert the Flash content. If Google is anything it’s clever, but there are limits.
  2. It will not work if your content is loaded dynamically as XML or if you use a bootstrap SWF. Even if it’s true as they say that they will discover links to external content when they scan a SWF, unless they actually execute the code they will miss most things. Just consider this code:
    var url : String = baseUrl + "/content.xml";
    
    There is no way that Google will be able to figure out what the URL to the content is, or even understand that the variable actually contains a URL; that would require very serious intelligence on the part of their spider.
    As has been pointed out in the comments, executing the code is exactly what Google claim that they do. Google doesn’t have to understand that the variable above contains a URL; they can just run the code and see that it does indeed load something. However, this doesn’t change the fact that figuring out what it means is really, really hard. Why was the file loaded? What significance does it have? Was it loaded as a response to something, or was it just preloaded to be used later? Does one part of the loaded data relate to any other part, or is it a random collection of stuff?
    And even if Google got hold of that XML file, how would the spider know how to parse it correctly? Unlike HTML, an XML document has no given semantic structure. There is no way for Google to make sense of it or to know where it would appear in the actual application, if it ever appears at all; it might just be configuration.
    Moreover, if you use a bootstrap SWF your content will not be indexed correctly, since the relation between the main SWF and the bootstrap will not be maintained. The point of a bootstrap file is that it has to be loaded first, but since Google will not find any content in it, it will not be ranked and will probably never be found. Even if the main SWF is indexed and possible to find, you will not be able to visit the site, since it has to be loaded using the bootstrap, which will not be found… It’s a Catch-22: you need the bootstrap to be what people find, but it can’t be, since the nature of a bootstrap is that it is devoid of content.
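
The point about the runtime-built URL in item 2 can be sketched in a few lines. This is written in TypeScript rather than ActionScript so that it can be run outside the Flash Player, and everything in it is an assumption made for illustration: the list of string constants, the `findAbsoluteUrls` scanner and the example base URL are hypothetical, not anything Google or Adobe have described. It only shows that scanning the constants stored in a file finds no complete URL, while running the code produces one:

```typescript
// Hypothetical string constants that a static scanner might extract from a
// compiled SWF. None of them is a complete URL on its own.
const stringConstants: string[] = ["/content.xml", "Loading...", "baseUrl"];

// Naive static scan: keep only the constants that look like absolute URLs.
function findAbsoluteUrls(constants: string[]): string[] {
  return constants.filter((s) => /^https?:\/\//.test(s));
}

// What the application does at run time, mirroring the single line of code
// in item 2. In this sketch baseUrl is assumed to come from outside the SWF
// (for example from the embedding page), so it is not among the constants.
function resolveContentUrl(baseUrl: string): string {
  return baseUrl + "/content.xml";
}

const staticHits = findAbsoluteUrls(stringConstants);
const runtimeUrl = resolveContentUrl("http://example.com/site");

console.log(staticHits.length); // 0
console.log(runtimeUrl); // http://example.com/site/content.xml
```

Running the code gets the spider the URL, but it still says nothing about why the file was loaded or what the data in it means.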

As you can see, Google will achieve very little with their new indexing algorithms, and they must know it; I cannot believe that the Google engineers are not aware of these issues. It’s also surprising how few in the Flash blogosphere are aware of the problems.

Indexing Flash web sites isn’t easy, but it will never be achieved by indexing SWF files: the problem is in the very nature of the format. SWF files are executable applications, not semantically structured data like HTML.
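
To make that difference concrete, here is a toy indexer sketched in TypeScript. The tag weights, the `HtmlNode` and `TextField` shapes and all of the functions are made up for illustration; they do not describe how Google actually ranks anything. With HTML, the markup tells the indexer which words matter. With text pulled out of a running SWF, every word looks the same:

```typescript
// Hypothetical shapes: an HTML element carries a tag, while a Flash text
// field carries only its text and a position on the stage.
type HtmlNode = { tag: string; text: string };
type TextField = { x: number; y: number; text: string };

// Made-up weights: words in headings count for more than words in body text.
const TAG_WEIGHTS: Record<string, number> = { h1: 5, h2: 3, em: 2, p: 1 };

// Read a value from a table, falling back when the key is not present.
function lookup(table: Record<string, number>, key: string, fallback: number): number {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : fallback;
}

// Add every word of the text to the index with the given weight.
function addWords(index: Record<string, number>, text: string, weight: number): void {
  for (const word of text.toLowerCase().split(/\s+/)) {
    if (word.length === 0) continue;
    index[word] = lookup(index, word, 0) + weight;
  }
}

// HTML: the tag tells the indexer how important the text is.
function indexHtml(nodes: HtmlNode[]): Record<string, number> {
  const index: Record<string, number> = {};
  for (const node of nodes) {
    addWords(index, node.text, lookup(TAG_WEIGHTS, node.tag, 1));
  }
  return index;
}

// SWF: there is nothing to weigh by, so every word counts the same.
function indexTextFields(fields: TextField[]): Record<string, number> {
  const index: Record<string, number> = {};
  for (const field of fields) {
    addWords(index, field.text, 1);
  }
  return index;
}

const html = indexHtml([
  { tag: "h1", text: "Flash indexing" },
  { tag: "p", text: "Why indexing is hard" },
]);
const swf = indexTextFields([
  { x: 400, y: 20, text: "Flash indexing" },
  { x: 10, y: 300, text: "Why indexing is hard" },
]);

console.log(html["indexing"], html["hard"]); // 6 1
console.log(swf["indexing"], swf["hard"]); // 2 1
```

The coordinates are carried along on purpose: they are all the SWF side has, and the sketch has no sensible way to turn them into importance or reading order.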

What we need is a constructive dialogue with Google about how to solve the problem for real Flash sites, but first Google has to get rid of their extremely naive idea of what Flash sites are. The last few announcements from Google about SWF indexing have not been helpful and have only served to divert the attention away from solving the real problem.

23 Responses to “SWF indexing is a red herring, and you should all know that by now”

  1. Maz Says:

    At least someone knows it! And you are definitely right, it’s scary to see top-notch Flex developers ignoring those issues.

    Indexing Flash sites is as difficult as indexing desktop applications… Why doesn’t Google Desktop index my browser content? Or my MSN content? It is just impossible unless GDesktop knows those applications (location of settings, data patterns…).

    The only way would be to let the SWFs tell Google what to index. But this is not what Google wants. In the HTML world they build their page rank following their own algorithms. They won’t let us tell them that…

    SWF SEO will stay a huge problem for our technology…

    {Maz}

  2. paddy Says:

    Thanks, you’ve raised the real issues. The last time we made a Flash piece that didn’t use a combo of JavaScript, XML or remoting was around 1999. I can’t see how this is going to help.

  3. senocular Says:

    I think you’re not understanding what this new approach is really doing.

    First concerning the caveats:

    For 1, most embed scripts have a noscript fallback for embedding SWFs – this includes all HTML/JS generated by Adobe software, and really something everyone should be doing anyway.

    Secondly, if you read point 2 carefully, you’ll notice multiple uses of the term “yet”. This technology is still early, and google/yahoo need time to perfect it and to get it up to speed.

    Concerning the technology itself, what it’s doing is actually loading and playing the SWF in a version of the Flash Player that allows google to read the display contents as though it were a web page. They have access to the DOM and see any text displayed in whatever way you found it necessary to present after you’ve parsed and done whatever to your XML file. This also pertains to dynamically generated URLs attached to buttons since the google SWF spider can simply click a button during playback and discover the URL being connected to as a result.

    This is a huge step forward since in terms of SWF, the spider is doing and seeing exactly what the user is seeing and [potentially] doing – though that all depends on how thorough and “smart” the SWF spider really is…

  4. Chris H Says:

    actually point #1 is obsolete, since the newest swfobject script (which i’m guessing over 90% will use when this new crawling method is actually integrated into google/yahoo) uses standards compliant integration with the tag.

    but yeah, my beef is also with point #2. no medium to large site actually has all the content sitting in the swf file, it’s either loaded via XML or other methods. but then that is also a problem of AJAX driven sites.

  5. Search Engine War Says:

    Flash – Ahhha! Saviour of the indexable!…

    Well finally its here. Drumroll please! Google and Adobe have announced improved Flash indexing. Google’s Flash indexing algorithm will now detect textual content and crawl URL’s in your Flash movies, banners and websites. We’ve improved our ability …

  6. Theo Says:

    senocular & Chris:

    about #1: yes and no, the newest version of swfobject has one variant which uses html tags to embed the content, but in reality I doubt it is very much used because of the “click to activate” bug (which will become less of an issue, but is still a problem). you may be right in that this announcement from google will see movement in the direction of using some kind of embed/object method.

    senocular:

    about #2

    I think it is naive to believe in google’s description. it might be true as a high level description, but the spider will in no way be “seeing exactly what the user is seeing”. this is not true when it comes to HTML indexing, and even less true for Flash. the spider will see the relations between different parts of the page/application and how they change when certain actions are performed, in HTML this is simple because there is a semantic meaning to the data, but there is no semantics in an executing SWF. the spider would see things but wouldn’t have a real idea of what they mean or what their internal relations are. a basic measure of relation when it comes to HTML is order, but how do you determine order in something displayed in flash? you could look at the coordinates, but do you really know that the things aren’t moving, or that they are even visible at the same time? those are just the most obvious problems.

    there is also the issue of understanding any XML that the spider would discover, that data would have no meaning unless parsed and interpreted. the only thing that can do that is the flash site that loads it and there is no practical way to follow how it interprets the data. you could make the site load the XML and then see how it changed, but how do you know that the change would take effect immediately and not at a later stage in the site?

    what I’m saying is that even if Google can do what they say they can, and I seriously doubt that their description is true for anything but the very simplest flash sites, it is the wrong solution. Google shouldn’t try to make sense out of something that doesn’t make the semantic structure of the data visible, but should instead ask for information in a structured, semantic format.

  7. Peter Elst Says:

    I agree completely, good post — blogged some similar thoughts earlier today.

    http://www.peterelst.com/blog/2008/07/01/thoughts-on-fully-searchable-flash/

  8. senocular Says:

    Certainly there are limitations to how an interactive and dynamic interface can be indexed. And the fact remains that Flash content is not exactly a “structured, semantic format”. You can’t expect Google to just “ask” for that kind of information to be handed to them from Flash. You’re stuck with what Flash is. And given what it is, about the best thing you can do is what they’re doing now: take a version of the Flash Player, run your SWF through it, and extract data from the display list as it prods each of those objects to discover different states of the application.

  9. Danny Miller Says:

    I agree with Senocular.

    But also, there are many clever ways of finding the data. Flash parses the XML, right, and fills text fields with it. I’m sure Google is just monitoring the text fields of a Flash movie while it is loading. It doesn’t have to look at the code to find the XML file. Plus, it could monitor HTTP requests and listen for the XML file to come. How else would they be able to get external SWFs (which they mentioned)?

  10. Theo Says:

    No, you don’t have to be stuck with only the SWF. The best solution at this point is progressive enhancement, it works now, is accepted by Google, and it will work better than SWF indexing unless Google does some serious investment in AI research.

  11. Danny Miller Says:

    Also, just to add:

    Another solution which Google may bring up would be for the Internet COMMUNITY to index Flash websites that its bot can’t index. You view a Flash site and type in the words that you see. If what you type in matches what the community typed in, those will be the keywords, etc…

  12. Jon B Says:

    Personally I acknowledge this to be a step forward, but I also acknowledge the concerns people have with it. Mine are mostly about the loss of ‘context’ and ‘semantics’, which are achieved easily within HTML (when you know what you’re doing) and are hard to even begin to replicate inside Flash content. That said, using a version of Flash Player within the bot does provide some interesting opportunities, albeit ones that are hard to fully buy into (like fully understanding the plethora of ways developers create GUI interactivity in their SWFs). Ideally I’d like to see this evolve along with some ‘best practices for indexable content’, which would possibly even include a way for indexed content to link to the appropriate part of a SWF file. The way I see this being achievable is through the way Google currently (so I believe) passes details of the terms searched for through the headers to any search result clicked on. Doing something similar with some identifiable ‘anchors’ could even provide a way for Google search results to link directly to SWF application states, although again it is something that will need to evolve between Google, Adobe and content developers.

  13. TK Says:

    Wow, what startling news!!! Not.

    All that this “SWF indexing” will do is screw up SEO for Flash RIA’s. Google should also start indexing .exe’s! Good developers know that SEO is for HTML, not for Flash. As Theo has posted in the past, indexing SWF’s would be like Google Desktop indexing and searching Microsoft Word or even AIR applications… there’s no use! Google can’t index dynamic application scripts (middleware) because they can spit out anything they want and can change rapidly and completely dynamically. Flash is the same. Flash can pull info from web-services, PHP, ASP, Java… so it has the ability to spit out anything from generated bitmap data to text to xml, so what’s the point?

    “Communism was just a red herring” (love the Clue reference)

    • TK

  14. links for 2008-07-02 « Brent Sordyl’s Blog Says:

    [...] SWF indexing is a red herring, and you should all know that by now Google will achieve very little with their new indexing algorithms, and they must know so, I cannot believe that the Google engineers are not aware of these issues. It’s also surprising how few in the Flash blogosphere that are aware of the problems. (tags: flash seo) [...]

  15. SWF Searchability finally! « LADG Says:

    [...] And then to get us down to earth again Bjørn has told me to read this article on Iconara [...]

  16. jensa Says:

    I mostly agree here, but about the “unless they actually execute the code”-statement, this is what Adobe just enabled and the core of the news. This is a “headless” (not rendering) version of the player that enables access to any dynamic content loaded (just as in the ordinary player). Google and Yahoo can simulate clicks on buttons – thus enabling parsing of such URLs.

    But as pointed out by several here, the real difficulty is figuring out which part of the loaded text is relevant. Google have some merit here, but there’s no standard way to index a SWF with dynamic content.

    J

  17. Theo Says:

    Jensa:

    You’re right in that they actually do execute the code, that was badly argued by me.

  18. Ryan H Says:

    I think many of you are assuming way too much about the implementation. Google already indexes dynamic content rendered via JavaScript. My guess is that they do this at a fairly high level via DOM inspection, just like an automated test tool does. From that perspective, it’s relatively easy to determine which nodes are visible and which are likely candidates for user interaction (e.g. links, nodes with certain event handlers defined) that are, in turn, likely to change the state of the DOM.

    Take a look at FlexSpy for a visual demonstration of the DOM in a Flash display list. Just like a browser’s DOM, it should be pretty easy to determine what’s visible and what objects are likely candidates for user interaction.

    So where’s the big technical barrier here? A search engine bot doesn’t have to intercept or even care about your externally loaded content (and it shouldn’t, since the goal should be to recreate a user experience). It only has to be able to observe changes to the DOM in response to user interaction, and then determine how those changes affect visible (and I guess we’re only talking about text here) content.

    As far as understanding the semantic context of text content, you can determine quite a bit just by looking at a node’s object type (similar to element type in HTML) and its position in the hierarchy.

    So I think this is indeed a game-changing development and something all Flash/Flex developers should try to learn more about.

  19. Flash & SEO (Part II) « de-Hao! [ How it's done! ] Says:

    [...] attention to a different perspective of the SWF Searchability news, expressed in the article titled SWF indexing is a red herring, and you should all know that by now. I have to agree that there is still some room for improvement, so far as indexing dynamically [...]

  20. willwork Says:

    There have been a number of posts above that complain about Flash and SEO. Not only is SEO and Flash possible, but I have already been implementing my architecture and it works beautifully. Check out two of my sites: http://www.willworkforfilm.com http://www.davincimediaworks.com

    They both give you the benefits from Flash design and complete visibility and index-ability by all search engines.

  21. Theo Says:

    @Willwork

    As far as I can tell, the two websites you link to use progressive enhancement to make them indexable, but I wouldn’t call it “Flash SEO”, because it has little to do with Flash as such. What you have is an HTML website that is replaced by a Flash website which displays the same data. The same technique can be used with Silverlight, Ajax or even Java applets, if you want to.

    The discussion above is about “Flash SEO” as in indexing the actual Flash website. Not the HTML on the embedding page, but the running SWF.

    However, I’m completely with you, I think that progressive enhancement is the best way to make Flash websites indexable. Poking around inside a running SWF will never come near looking at the semantic, structured data of an HTML page.

  22. Theo Says:

    @Ryan H

    Looking at the display list in a running Flash site is nowhere near looking at the DOM of an HTML page. Sure, you can see if a textfield or shape is visible, but you will not see any of the semantics or structure that you will for an HTML page. In Flash a textfield is just a textfield, never an “h1”, “em” or “small”, and there is no way to determine the relation between two display objects. They may appear as siblings in the display list but sit on opposite sides of the screen, or vice versa.

    Semantics, context and hierarchy are the kind of information that takes indexing from the dumb Altavista version (“this word appeared on that page”) to the intelligent, useful Google version (“this word is an important word on that page, because it’s in a header”). A running Flash site doesn’t provide that kind of information, so even if the indexing can be done, it will be inferior to an HTML rendering of the same data.

  23. willwork Says:

    @Theo

    Thanks for the comments. In this case, I’m more of the notion that it’s the results that matter, however you achieve it. When implemented, all my clients care about is that they have a beautiful flash site with complete SEO.
