Monday, July 13, 2009

Linked Open Data and SPARQL

After reading about OpenCalais (see previous post), I wanted to know how to get to all the data associated with the Linked Open Data project. I found some SPARQL endpoints, such as the SPARQL explorer from DBpedia (http://dbpedia.org/snorql/?query=PREFIX+dbo:+%0D%0A%0D%0ASELECT+%3Fname+%3Fbirth+%3Fdeath+%3Fperson+WHERE+{%0D%0A+++++%3Fperson+dbpedia2:birthPlace++.%0D%0A+++++%3Fperson+dbo:birthdate+%3Fbirth+.%0D%0A+++++%3Fperson+foaf:name+%3Fname+.%0D%0A+++++%3Fperson+dbo:deathdate+%3Fdeath%0D%0A+++++FILTER+(%3Fbirth+<+"1900-01-01"^^xsd:date)+.%0D%0A}). Pretty sweet, but can we get to data inside AND outside of DBpedia?
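To make the endpoint idea concrete, here is a minimal Python sketch of sending a query like the one above over plain HTTP instead of through the explorer page. It is an illustration under assumptions, not the exact query from the link: the endpoint address https://dbpedia.org/sparql, the property name dbo:birthDate, and the helper names build_request, run_query and rows are all my own choices.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of DBpedia's public SPARQL endpoint.
ENDPOINT = "https://dbpedia.org/sparql"

# Illustrative query: people with a birth date before 1900.
# The property names are assumptions and may differ from the explorer's defaults.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?person ?name ?birth WHERE {
    ?person dbo:birthDate ?birth .
    ?person foaf:name ?name .
    FILTER (?birth < "1900-01-01"^^xsd:date)
} LIMIT 10
"""


def build_request(endpoint, query):
    """Build a GET request that carries the query in the 'query' parameter."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        endpoint + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )


def run_query(endpoint, query):
    """Send the query and parse the JSON result document (needs network access)."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=30) as resp:
        return json.load(resp)


def rows(results):
    """Flatten SPARQL JSON results into one plain dict per result row."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]
```

Calling rows(run_query(ENDPOINT, QUERY)) should give one dict per person, keyed by the variable names in the SELECT clause, provided the endpoint is reachable and still accepts this form of request.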

I found a nice set of slides that talks about SPARQL queries and then goes into Linked Open Data sources (http://www.slideshare.net/fulvio.corno/sparql-and-the-open-linked-data-initiative). It also mentions a tool called Pubby, a Linked Data front end that sits on top of a SPARQL endpoint so that the endpoint's resources can be served as Linked Open Data (http://www4.wiwiss.fu-berlin.de/pubby/).

The slides also had one titled "Tools for consuming Linked Data". It lists "Semantic Web browsers and client Libraries" as well as "Semantic Web Search Engines". We're getting closer, but I don't want to use a search engine unless it is exposed as a SPARQL endpoint. I searched Google for "SPARQL endpoint linked open data" and read the BBC article (http://welcomebackstage.com/2009/06/bbc-backstage-sparql-endpoint/) about how they had several Linked Data sources for all their media and wanted them all to be searchable together. They contacted OpenLink Software (makers of Virtuoso) and Talis (http://api.talis.com/stores/bbc-backstage).
The solution seems to be a classic SPARQL include of all the BBC graphs. However, Virtuoso did have an interesting graphical query thin client (http://bbc.openlinksw.com/isparql/?default-graph-uri=%20&query=select%20*%20from%20%20where%20{%20?s%20?p%20?o%20}).
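As I read it, "including all the graphs" means naming each graph in a FROM clause so that one query runs over their merge. Here is a small Python sketch that builds such a query. The helper name query_over_graphs and the example.org graph URIs in the usage note are hypothetical placeholders, not the BBC's actual graph names.

```python
def query_over_graphs(graph_uris, pattern="?s ?p ?o", limit=10):
    """Build a SELECT query whose default graph is the merge of several named graphs.

    Each URI becomes one FROM clause; the graph URIs themselves are supplied
    by the caller and are not checked here.
    """
    from_clauses = "\n".join(f"FROM <{uri}>" for uri in graph_uris)
    return f"SELECT *\n{from_clauses}\nWHERE {{ {pattern} }}\nLIMIT {limit}"
```

For example, query_over_graphs(["http://example.org/programmes", "http://example.org/music"]) produces a query with two FROM lines followed by WHERE { ?s ?p ?o }, which can then be sent to whichever endpoint hosts those graphs.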

Still searching. More to come.

Taking a look at Open Calais

Thomson Reuters provides services for tagging web content as RDF, called OpenCalais (http://opencalais.com/). Several projects are leveraging their API, including the content management system Drupal (http://drupal.org/project/opencalais), the blog site WordPress.com (http://tagaroo.opencalais.com/), Microsoft, and several others. The services are exposed as SOAP, REST, and HTTP Traffic Compression (http://opencalais.com/documentation/calais-web-service-api/api-invocation). You can download an HTML page for making the REST calls (http://opencalais.com/files/HTMLform.zip). I supplied only my API key and the HTML content to tag, and it returned the data as RDF.

Open Calais is officially part of the Linked Open Data Cloud, so the entities it returns are represented by dereferenceable URIs (http://opencalais.com/documentation/linked-data-entities). When I supplied an article about the Tour de France, one of the Lance Armstrong entities was marked http://d.opencalais.com/pershash-1/050fd058-00ac-3453-a376-45df3198a109.html, so it can be dereferenced. However, if you click it, it probably won't tell you much: initially it is only a stub for the entity "Lance Armstrong". Eventually the entity disambiguation system (http://opencalais.com/documentation/calais-web-service-api/api-metadata/entity-disambiguation) will process it and add what it knows (including a sameAs relationship and, hopefully, a link to the Wikipedia article for Lance Armstrong). At that point it will be marked as disambiguated. The last question in the FAQ gives a good description (http://opencalais.com/documentation/calais-linked-data/linkfaq).
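"Dereferenceable" means a program can fetch the URI and ask for data rather than a web page. Here is a minimal Python sketch of that request using HTTP content negotiation. Two things are assumptions on my part: that dropping the .html suffix gives the generic entity URI, and that the server answers an Accept: application/rdf+xml header with RDF. The helper names are hypothetical as well.

```python
import urllib.request

# Entity URI from the Tour de France example above.
ENTITY_PAGE = "http://d.opencalais.com/pershash-1/050fd058-00ac-3453-a376-45df3198a109.html"


def without_html_suffix(uri):
    """Drop a trailing '.html' (assumption: this yields the generic entity URI)."""
    return uri[: -len(".html")] if uri.endswith(".html") else uri


def rdf_request(uri):
    """Build a request that asks for RDF/XML instead of an HTML page."""
    return urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})


def fetch_rdf(uri):
    """Fetch the RDF description as text (needs network access and a live server)."""
    with urllib.request.urlopen(rdf_request(uri), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

If the server cooperates, fetch_rdf(without_html_suffix(ENTITY_PAGE)) would return the RDF for the entity, stub or disambiguated, ready to hand to an RDF parser.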

So how do we use Open Calais' data? Blog post for that next.