Monday, July 13, 2009

Linked Open Data and SPARQL

After reading about OpenCalais (see previous post), I wanted to know how to get to all the data associated with the Linked Open Data project. I found some SPARQL endpoints, such as the SPARQL explorer from DBpedia (http://dbpedia.org/snorql/?query=PREFIX+dbo:+%0D%0A%0D%0ASELECT+%3Fname+%3Fbirth+%3Fdeath+%3Fperson+WHERE+{%0D%0A+++++%3Fperson+dbpedia2:birthPlace++.%0D%0A+++++%3Fperson+dbo:birthdate+%3Fbirth+.%0D%0A+++++%3Fperson+foaf:name+%3Fname+.%0D%0A+++++%3Fperson+dbo:deathdate+%3Fdeath%0D%0A+++++FILTER+(%3Fbirth+<+"1900-01-01"^^xsd:date)+.%0D%0A}). Pretty sweet, but can we get to data inside AND outside of DBpedia?
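
A query like this can also be sent from a script rather than the browser form. Here is a minimal Python sketch, assuming the endpoint lives at http://dbpedia.org/sparql and takes the query text in a `query` parameter, as the SPARQL protocol describes. The prefix URIs, the property names, and the helper names `build_query_url` and `run_select` are my own assumptions for illustration, not taken from the explorer link above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; swap in whichever SPARQL endpoint you are querying.
ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: people born before 1900. Property names are assumptions.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?name ?birth ?person WHERE {
  ?person dbo:birthDate ?birth .
  ?person foaf:name ?name .
  FILTER (?birth < "1900-01-01"^^xsd:date)
} LIMIT 10
"""


def build_query_url(endpoint, query):
    """Encode the query text as the `query` parameter of a GET URL."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint, query):
    """Send the query and return the result bindings as a list of dicts."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        # Ask for the standard JSON serialization of SELECT results.
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_select(ENDPOINT, QUERY)` needs network access, and whether it returns rows depends on the endpoint actually using those property names.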

I found a nice set of slides that talks about SPARQL queries and then goes into Linked Open Data sources (http://www.slideshare.net/fulvio.corno/sparql-and-the-open-linked-data-initiative). It also mentions a tool called Pubby. It maps SPARQL endpoints so that they can become Linked Open Data Endpoints (http://www4.wiwiss.fu-berlin.de/pubby/).

It also had a slide titled "Tools for consuming Linked Data". It lists "Semantic Web browsers and client Libraries" as well as "Semantic Web Search Engines". We're getting closer, but I don't want to use a search engine unless it is exposed as a SPARQL endpoint. I searched Google for "SPARQL endpoint linked open data" and read the BBC article (http://welcomebackstage.com/2009/06/bbc-backstage-sparql-endpoint/) about how they had several Linked Data sources for all their media and wanted them all to be searchable together. They contacted OpenLink Software (makers of Virtuoso) and Talis (http://api.talis.com/stores/bbc-backstage).
The solution seems to be a classic SPARQL include of all the BBC graphs. However, Virtuoso did have an interesting graphical query thin client (http://bbc.openlinksw.com/isparql/?default-graph-uri=%20&query=select%20*%20from%20%20where%20{%20?s%20?p%20?o%20}).
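
If "include" here means naming each graph in its own FROM clause, so that the query runs over the merge of those graphs, the query text could be put together as in this Python sketch. The graph URIs are hypothetical placeholders, not the real BBC graph names, and `query_over_graphs` is a helper name I made up.

```python
def query_over_graphs(graph_uris, pattern="?s ?p ?o", limit=10):
    """Build a SELECT whose default graph is the merge of the given graphs."""
    # One FROM clause per graph; SPARQL merges them into the default graph.
    from_clauses = "\n".join(f"FROM <{uri}>" for uri in graph_uris)
    return f"SELECT *\n{from_clauses}\nWHERE {{ {pattern} }}\nLIMIT {limit}"


# Hypothetical graph URIs, for illustration only.
graphs = [
    "http://example.org/graph/programmes",
    "http://example.org/graph/music",
]
print(query_over_graphs(graphs))
```

This only builds the query string; it still has to be sent to an endpoint that holds all of the named graphs.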

Still searching. More to come.

Taking a look at Open Calais

Thomson Reuters provides services for tagging web content as RDF, called OpenCalais (http://opencalais.com/). Several projects are leveraging their API, including the content management system Drupal (http://drupal.org/project/opencalais), the blog site WordPress.com (http://tagaroo.opencalais.com/), Microsoft, and several others. The services are exposed as SOAP, REST, and HTTP Traffic Compression (http://opencalais.com/documentation/calais-web-service-api/api-invocation). You can download an HTML page for making the REST calls (http://opencalais.com/files/HTMLform.zip). I supplied only my API key and the HTML to tag, and it returned the data as RDF.

Open Calais is officially part of the Linked Open Data Cloud, so the entities it returns are dereferenceable URIs (http://opencalais.com/documentation/linked-data-entities). When I supplied an article about the Tour de France, one of the Lance Armstrong entities was marked http://d.opencalais.com/pershash-1/050fd058-00ac-3453-a376-45df3198a109.html and so it is now dereferenceable. However, if you click it, it probably won't tell you much. Initially this is just a stub for the entity "Lance Armstrong". Eventually the entity disambiguation system (http://opencalais.com/documentation/calais-web-service-api/api-metadata/entity-disambiguation) will process it and add what it knows (including sameAs relationships and, hopefully, a link to the Wikipedia article for Lance Armstrong). At that point it will be marked as disambiguated. The last question in the FAQ gives a good description (http://opencalais.com/documentation/calais-linked-data/linkfaq).
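
"Dereferenceable" means a program can fetch the URI and get data back. A common Linked Data convention is to ask for RDF through the HTTP Accept header. The Python sketch below assumes the server honors that kind of content negotiation, which I have not confirmed for OpenCalais; `rdf_request` and `dereference` are names I made up.

```python
import urllib.request


def rdf_request(uri, accept="application/rdf+xml"):
    """Build a GET request that asks for an RDF representation of the URI."""
    return urllib.request.Request(uri, headers={"Accept": accept})


def dereference(uri):
    """Fetch the URI and return the response body as text (needs network)."""
    with urllib.request.urlopen(rdf_request(uri)) as response:
        return response.read().decode("utf-8")
```

For a stub entity the response may hold little more than a name and a type until disambiguation has run.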

So how do we use Open Calais' data? Blog post for that next.

Monday, June 29, 2009

The video library grows...

The video library idea is working out so far (see previous post). With a little practice I can get a new video up in about 10 minutes. I think this makes it a viable method for explanation, collaboration, demonstration, etc. with the remote teams. Let's see how it evolves as the library grows.

Monday, June 22, 2009

Video library

I'm a big note taker. I like to keep records of almost everything. I almost believe that if it isn't written down then it doesn't exist. Extreme, yes. But too much falls through the cracks when things aren't written down, agreed? The research projects that I manage need to be able to collaborate, so I have them use a wiki. This is useful. But even reading and searching text can be too much. I'm thinking less text and more video. My current strategy is YouTube (hosting, indexing, playing), Ubuntu (Jaunty makes me happy), recordMyDesktop (screencasting to ogv), DeVeDe (to convert from ogv/ogg to avi), no audio, don't worry about small mistakes, and keep the videos short. Let's see how far we can get.

Friday, June 19, 2009

I was going through some information about deploying to the Maven Central Repository:

1. Maven local and remote repository introduction:
http://maven.apache.org/guides/introduction/introduction-to-repositories.html

2. Putting your properly metadata'd jar in the Maven Central Repository:
http://maven.apache.org/guides/mini/guide-central-repository-upload.html

3. The central repository in all its glory:
http://repo1.maven.org/maven2/

10% of the work takes 90% of the time

And 90% of the effort for that matter. Banging out code is easy. I helped write a workflow system in 6 weeks that "worked". But that word is meaningless without measurements, reviews, testing, documentation, and more. Completing most of those tasks for the workflow system took another 9 months or so. The interesting part is that the demo barely changed between the 6-week-old system and the 10-month-old system. The demo part of "worked" is unimpressive. Demos don't have much to do with changing the life of analysts... where the code meets the road, so to speak.