Presentation: Remixing and distribution of web content made dead easy
Tuesday, June 2nd, 2009 at 6:12 am
My talk at the Future of Web Apps Tour 2009 about remixing the web of data with YQL. I’ll turn this into a slidecast once I am back.
Today I will talk a bit about an evolution that we are all part of, although we might not be aware of it yet.
What is the web?
Whenever people ask me what my job is, I tell them I am a web developer. This raises the question of what I actually develop.
The web as it stands is made up of documents. The technologies that run it, HTTP for transport and HTML for structure, haven’t changed much over the years. Linking documents was a revolution: it made the earth a much smaller place and allowed us to collaborate. However, it got boring quickly.
The web of things has been a running theme for a while. Initially it meant that all kinds of devices can be connected to the web (self-ordering fridges and suchlike). It also means that with RESTful web services we can point directly at the thing we want to reach, which could be a text but also an image, a video or other rich content embedded in web sites.
In essence, the web is data. Data can be anything that is available on the web or referred to. Data is what we look for, data is what we get. And there is much more than meets the eye.
By connecting different data sources we get even more information, and new data emerges. As humans we all learn differently, and having different data sets and various ways of connecting them makes it easier for us to grasp what the data can teach us.
Hunters and Gatherers
The issue is that we overshot the goal. We collect for the sake of collecting and we spend much more time chasing the next big thing to collect than giving the things we already have some love and tag and describe them. As humans we are hard-wired to find things and collect them. It also means that we always want to do everything ourselves and not rely on others. In essence, we collected a solid mass of data and now we don’t know how to plough through it anymore. This is why we try to use technology to clean up the mess for us by injecting landmarks and machine-readable information.
Let’s take this sentence for example. There is much more in there than meets the eye.
My name is Chris. I am a German living in London, England and one of my favourite places to go is Hong Kong. I also enjoyed Brazil a lot.
By using a geolocation service I can analyze the text and add extra information that allows me to make it easy for other systems to understand this sentence. That way I can enrich the information.
My name is Chris. I am a German living in London, England (Name: London,England, GB, Type: Town, Latitude: 51.5063, Longitude: -0.12714) and one of my favourite places to go is Hong Kong. I also enjoyed Brazil (Name: Brazil, Type: Country, Latitude: -14.2429, Longitude: -54.3878) a lot.
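The enrichment step can be sketched in a few lines of Python. This is only an illustration: the lookup table below stands in for a real geolocation service and is hypothetical, it simply reuses the two place records shown in the sentence above, and the function name `enrich` is made up for this example.

```python
# Hypothetical stand-in for a geolocation service: maps a phrase found in
# the text to the place record quoted in the example sentence.
PLACES = {
    "London, England": {"name": "London,England, GB", "type": "Town",
                        "lat": 51.5063, "lon": -0.12714},
    "Brazil": {"name": "Brazil", "type": "Country",
               "lat": -14.2429, "lon": -54.3878},
}

SENTENCE = ("My name is Chris. I am a German living in London, England and "
            "one of my favourite places to go is Hong Kong. "
            "I also enjoyed Brazil a lot.")


def enrich(text, places=PLACES):
    """Append machine-readable place details after each known place name."""
    for label, info in places.items():
        note = (f"{label} (Name: {info['name']}, Type: {info['type']}, "
                f"Latitude: {info['lat']}, Longitude: {info['lon']})")
        text = text.replace(label, note)
    return text


print(enrich(SENTENCE))
```

Places that the lookup does not know about, such as Hong Kong here, are left untouched.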
This makes data much easier to grasp and gives it a richer experience for us all. The question is how we can do this easily.
APIs are the web data publisher’s way to give us access to their data. There are hundreds out there and each of them is different. Which leads to another problem.
Each API uses its own language, ways of authenticating, data entry vocabulary and return value. You are lucky to find good documentation and many examples are hard to grasp as they are not available in the programming language you would like to work in.
Documentation can be confusing. And in most cases you don’t really want to have to dig that deep into the API just to get some information.
What we need is a simple way to access all these wonderful APIs and mix and match the content of them.
The Yahoo Query Language (YQL for short) is a SQL-style language for the data web. Using the YQL console you can easily build even complex queries and get them ready for copy and paste.
The permalink is very important. Click it every time you build a complex query; that way the query is still available to you if you reload the page by accident.
The REST query is a URL ready to copy and paste into a browser or your own script.
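To use such a URL from your own script, a minimal Python sketch could build it from a query string. The endpoint address and the `q` and `format` parameter names are assumptions here (they are not given in this post), and `yql_rest_url` is a made-up helper name. The sketch only builds the URL and does not call the service.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the public YQL endpoint; this address is not stated in the post.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def yql_rest_url(query, fmt="json"):
    """Return a URL that runs `query` and asks for `fmt` output (json or xml)."""
    # urlencode takes care of escaping spaces, quotes and other characters.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


QUERY = 'select * from upcoming.events where location = "cambridge,uk"'
url = yql_rest_url(QUERY)
print(url)
```

Letting `urlencode` do the escaping avoids the usual trouble with quotes and spaces when a query is pasted into a URL by hand.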
The formatted view shows you the XML or JSON; the tree view allows you to drill down into the returned information.
Recent queries are stored, and example queries show you how it is done.
The data tables show all the available data sources. Each table comes with its own description.
What can you do with this?
Say you want to find events in Cambridge. You can query upcoming.org. Sadly enough (and because of people entering bad data) this will not result in anything useful but give you results from London!
select * from upcoming.events where location = "cambridge,uk"
By using the geo.places API you can define Cambridge without a doubt (as a woeid) and then get events.
select * from upcoming.events where woeid in (select woeid from geo.places where text="Cambridge,UK")
The diagnostics part of the resulting data set tells you which URLs were called “under the hood” and how long it took to get them.
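If you ask for JSON, pulling that information out in code is a short loop. The sketch below assumes a response layout with a `diagnostics` object holding `url` entries that carry `execution-time` and `content` fields; this layout is an assumption, and the sample response with its example.com URLs is invented purely for illustration.

```python
import json

# Hypothetical sample response; field names and URLs are assumptions.
SAMPLE_RESPONSE = json.loads("""
{"query": {
  "diagnostics": {"url": [
    {"execution-time": "35", "content": "http://example.com/geo?q=Cambridge"},
    {"execution-time": "120", "content": "http://example.com/events?woeid=1"}
  ]},
  "results": {"event": []}
}}
""")


def called_urls(response):
    """Return (url, milliseconds) pairs from the diagnostics section."""
    entries = response["query"].get("diagnostics", {}).get("url", [])
    if isinstance(entries, dict):
        # A single call may arrive as one object instead of a list.
        entries = [entries]
    return [(entry["content"], int(entry["execution-time"])) for entry in entries]


for address, millis in called_urls(SAMPLE_RESPONSE):
    print(f"{millis:>5} ms  {address}")
```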
The results section has all the events but far too much data for each of them. Say you only want the url, the title, the venue and the description.
You can select only the parts that you want:
select title,url,venue_name, description from upcoming.events where woeid in (select woeid from geo.places where text= "cambridge,uk")
Which cuts down nicely on the resulting data.
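When you write queries like this from code, it can help to assemble them from parts rather than by hand. Here is a small Python sketch of that idea; the `select` helper is made up for this example and does nothing more than string assembly, so it does not validate table or field names.

```python
def select(fields, table, where=None):
    """Assemble a YQL select statement from its parts."""
    columns = ", ".join(fields) if fields else "*"
    query = f"select {columns} from {table}"
    if where:
        query += f" where {where}"
    return query


# The inner query resolves the place name to a woeid ...
place = select(["woeid"], "geo.places", 'text="cambridge,uk"')
# ... and the outer query uses it as a sub-select, keeping only four fields.
events = select(["title", "url", "venue_name", "description"],
                "upcoming.events", f"woeid in ({place})")
print(events)
```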
You can get my latest updates from Twitter…
select title from twitter.user.timeline where id="codepo8"
Or only those in which I replied to somebody…
select title from twitter.user.timeline where id="codepo8" and title like "%@%"
Or check several accounts!
select title from twitter.user.timeline where id="codepo8" or id="ydn" and title like "%@%"
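One thing to watch when mixing or and and is grouping. Assuming YQL follows the usual SQL convention that and binds more tightly than or (the post does not confirm this), the filter above would apply the reply check only to the second account. Python uses the same precedence, so the difference can be shown with a handful of made-up tweets:

```python
# Hypothetical sample data, invented for this illustration.
TWEETS = [
    {"id": "codepo8", "title": "Slides are up"},
    {"id": "codepo8", "title": "@ydn thanks!"},
    {"id": "ydn", "title": "New tables released"},
    {"id": "ydn", "title": "@codepo8 nice talk"},
]


def ungrouped(tweet):
    # Reads as: codepo8 or (ydn and reply), because and binds tighter than or.
    return tweet["id"] == "codepo8" or tweet["id"] == "ydn" and "@" in tweet["title"]


def grouped(tweet):
    # Explicit grouping: (codepo8 or ydn) and reply.
    return (tweet["id"] == "codepo8" or tweet["id"] == "ydn") and "@" in tweet["title"]


ungrouped_titles = [t["title"] for t in TWEETS if ungrouped(t)]
grouped_titles = [t["title"] for t in TWEETS if grouped(t)]
print(ungrouped_titles)
print(grouped_titles)
```

Without grouping, every tweet from the first account comes back, replies or not; with grouping, only the replies from both accounts remain.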
You could also check my tweets for useful keywords:
select * from search.termextract where context in (select title from twitter.user.timeline where id="codepo8")
You can scrape the BBC’s news site for links:
select * from html where url="http://news.bbc.co.uk" and xpath="//td//a"
Or get all the alternative text of their news images:
select alt from html where url="http://news.bbc.co.uk" and xpath="//td//a/img[@alt]"
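To get a feel for what that xpath expression picks out, here is a rough local equivalent in Python using the standard library. The markup is a made-up, well-formed snippet rather than the real BBC page, and ElementTree only supports a subset of XPath, so this is a sketch of the selection logic and not of how YQL processes pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, invented for this illustration.
SAMPLE = """<html><body><table><tr>
<td><a href="/story1"><img src="a.jpg" alt="Flooded street"/></a></td>
<td><div><a href="/story2"><img src="b.jpg"/></a></div></td>
<td><a href="/story3"><img src="c.jpg" alt="Election rally"/></a></td>
</tr></table>
<p><a href="/other"><img src="d.jpg" alt="Outside any table cell"/></a></p>
</body></html>"""


def image_alt_texts(markup):
    """Return the alt text of linked images inside table cells."""
    root = ET.fromstring(markup)
    # Same shape as //td//a/img[@alt]: any link below a table cell,
    # its direct img child, and only if that img has an alt attribute.
    return [img.get("alt") for img in root.findall(".//td//a/img[@alt]")]


print(image_alt_texts(SAMPLE))
```

The image without an alt attribute and the image outside the table cells are both skipped.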
And get better photos from flickr…
select * from flickr.photos.search where text in (select alt from html where url="http://news.bbc.co.uk" and xpath="//td//a/img[@alt]")
- mix and match APIs
- filter results
- simplify authentication
- use in console or from code
- minimal research of documentation
- caching of results
- proxied on Yahoo’s servers
YQL gives you a lot of flexibility when it comes to remixing the web and filtering the results. However, there are some things that cannot be done with it but are possible with other systems, for example Yahoo Pipes.
YQL can be extended by Open Tables. This is a simple XML schema that redirects YQL queries to your web service. That way you can be part of the simple YQL interface without needing to change your architecture. The other benefits are that YQL caches queries, so your servers get hit less, and that it limits every user to 1000 calls per hour to YQL.
One of those open tables belongs to the real estate search engine Nestoria. Using it, I can look for flats in Cambridge:
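Why caching and a per-user limit protect your servers can be shown with a toy proxy in Python. This is a sketch of the general idea only, not of how YQL is implemented: the class name, the 60 second cache lifetime and the fake service are all assumptions, and the limit is lowered to 3 calls in the usage part so the effect is easy to see.

```python
import time


class CachingProxy:
    """Toy proxy: caches results per query and limits calls per user per hour."""

    def __init__(self, fetch, ttl_seconds=60, calls_per_hour=1000, clock=time.monotonic):
        self.fetch = fetch            # function that really hits the web service
        self.ttl = ttl_seconds        # assumed cache lifetime, not from the post
        self.limit = calls_per_hour
        self.clock = clock
        self.cache = {}               # query -> (expires_at, result)
        self.calls = {}               # user -> timestamps of recent calls

    def run(self, user, query):
        now = self.clock()
        recent = [t for t in self.calls.get(user, []) if now - t < 3600]
        if len(recent) >= self.limit:
            raise RuntimeError(f"{user} exceeded {self.limit} calls per hour")
        recent.append(now)
        self.calls[user] = recent
        hit = self.cache.get(query)
        if hit and hit[0] > now:
            return hit[1]             # served from cache, service not touched
        result = self.fetch(query)
        self.cache[query] = (now + self.ttl, result)
        return result


upstream_hits = []


def fake_service(query):
    """Hypothetical web service that records how often it is called."""
    upstream_hits.append(query)
    return f"results for {query}"


proxy = CachingProxy(fake_service, calls_per_hour=3)
for _ in range(3):
    proxy.run("alice", 'select * from nestoria where place_name="Cambridge"')
print(len(upstream_hits))
```

Three identical requests reach the service only once, and a fourth request from the same user within the hour would be refused.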
use "http://www.datatables.org/nestoria/nestoria.search.xml" as nestoria; select * from nestoria where place_name="Cambridge"
Open tables can be added to a repository on GitHub. This will make them available to the YQL community.
Clicking the Show Community Tables link in the console adds all these third-party tables to the interface.