Lazily stealing... errr... getting data from the Dutch Rijksmuseum

November 10, 2012 at 10:46 pm | Posted in Clojure, Programming

Last year the Dutch Rijksmuseum published an API that allows a developer to retrieve information and images from a collection of more than 110,000 items. You have to register for an API key first.

Unless you ask for a specific item, every request returns 100 items in XML format. The response also contains a resumption token that you can use to query for the next 100 items. I figured it would be useful to abstract this away behind a lazy sequence in Clojure. So let me show you the resulting code with a brief explanation:

(ns rijksmuseum.core
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip]
            [clojure.data.zip.xml :as zf]))

We will have to parse the XML that is returned, so we start by requiring some convenient libraries. If you are not familiar with Clojure zippers, please look them up in the documentation and the numerous blog posts about them. They make navigating XML almost painless.
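To give a rough idea of why zippers help, here is a minimal sketch that navigates a tiny, made-up XML document twice: once by hand with clojure.zip and once with clojure.data.zip.xml. The helper parse-str and the sample document are assumptions for illustration only, and the data.zip library has to be on the classpath.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip]
         '[clojure.data.zip.xml :as zf])

(defn parse-str
  "Hypothetical helper: parses an XML string into a zipper."
  [^String s]
  (zip/xml-zip
   (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8")))))

;; Made-up sample document, not real API output.
(def doc
  (parse-str (str "<museum>"
                  "<work><title>Work A</title></work>"
                  "<work><title>Work B</title></work>"
                  "</museum>")))

;; Manual navigation: step down to the first work, right to the second,
;; down to its title, then read the text from the node.
(-> doc zip/down zip/right zip/down zip/node :content first)
;; => "Work B"

;; With clojure.data.zip.xml we describe the path instead of walking it.
(zf/xml-> doc :work :title zf/text)   ; => ("Work A" "Work B")
(zf/xml1-> doc :work :title zf/text)  ; => "Work A"
```

xml-> returns every match along the path, while xml1-> returns only the first one.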

(def api-key "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")
(def base-url (str "https://www.rijksmuseum.nl/api/oai/" api-key))

You will have to replace the value of api-key with your own API key. base-url defines the base URL that is used for all queries.


(defn- build-query [resumption-token]
  (str base-url "/?verb=ListRecords&"
       (if (nil? resumption-token)
         "set=collectie_online&metadataPrefix=oai_dc"
         (str "resumptiontoken=" resumption-token))))

The function build-query builds up a query URL. If there is no resumption token yet, the resulting query requests the first batch of records. Otherwise, it continues with the next batch. Currently the Rijksmuseum API supports two kinds of queries: you can either ask for a list of records (using verb=ListRecords) or for a specific record (using verb=GetRecord and an identifier). The API documentation has all the details.
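The second kind of query is not used in the rest of this post, but a URL for it could be built along the same lines. The following is only a sketch: build-get-record-query is a hypothetical helper, and the parameter names (identifier, metadataPrefix) follow the generic OAI-PMH convention, which is an assumption here. Check the API documentation before relying on it.

```clojure
(import 'java.net.URLEncoder)

;; Placeholder key, as in the snippet above.
(def api-key "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx")
(def base-url (str "https://www.rijksmuseum.nl/api/oai/" api-key))

(defn build-get-record-query
  "Hypothetical helper: builds a GetRecord URL for a single identifier.
  The parameter names are assumed from the generic OAI-PMH convention."
  [identifier]
  (str base-url "/?verb=GetRecord"
       ;; URL-encode the identifier so special characters survive the query string.
       "&identifier=" (URLEncoder/encode identifier "UTF-8")
       "&metadataPrefix=oai_dc"))

;; "some-identifier" is a placeholder, not a real record identifier.
(build-get-record-query "some-identifier")
```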

We start with a couple of helper routines. Basically, they extract the information we are interested in from an XML response:


(defn- get-records [zipper]
  (zf/xml-> zipper :ListRecords :record))

(defn- get-resumption-token [zipper]
  (zf/xml1-> zipper :ListRecords :resumptionToken zf/text))

get-records extracts the records from the (zipped) XML response. get-resumption-token returns (I’m sure you already guessed) the resumption token for our next query.
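To see the two helpers at work without calling the API, one can feed them a canned response. The sketch below uses made-up XML that only mimics the nesting the helpers expect (a root element containing ListRecords); parse-str is a hypothetical helper and the helpers are repeated as public functions so the snippet runs on its own.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip]
         '[clojure.data.zip.xml :as zf])

(defn parse-str
  "Hypothetical helper: parses an XML string into a zipper."
  [^String s]
  (zip/xml-zip
   (xml/parse (java.io.ByteArrayInputStream. (.getBytes s "UTF-8")))))

;; The two helpers from above, public here so they can be called directly.
(defn get-records [zipper]
  (zf/xml-> zipper :ListRecords :record))

(defn get-resumption-token [zipper]
  (zf/xml1-> zipper :ListRecords :resumptionToken zf/text))

;; Made-up responses, not real API output.
(def batch-with-token
  (parse-str (str "<OAI-PMH><ListRecords>"
                  "<record><header>r1</header></record>"
                  "<record><header>r2</header></record>"
                  "<resumptionToken>token-123</resumptionToken>"
                  "</ListRecords></OAI-PMH>")))

(def batch-without-token
  (parse-str (str "<OAI-PMH><ListRecords>"
                  "<record><header>r3</header></record>"
                  "</ListRecords></OAI-PMH>")))

(count (get-records batch-with-token))       ; => 2
(get-resumption-token batch-with-token)      ; => "token-123"
(get-resumption-token batch-without-token)   ; => nil
```

Note that a response without a resumptionToken element yields nil, which is worth keeping in mind when deciding where the sequence of batches ends.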

Now comes the construction of the lazy sequence:


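(defn- lazy-get-works [resumption-token]
  (lazy-seq
   (let [zipper (zip/xml-zip (xml/parse (build-query resumption-token)))
         works (get-records zipper)
         token (get-resumption-token zipper)]
     ;; A missing or empty token marks the last batch. Recursing with nil
     ;; would make build-query start over from the first batch.
     (if (empty? token)
       works
       (concat works (lazy-get-works token))))))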

(defn get-works
  "Return all works as a lazy sequence"
  []
  (lazy-get-works nil))

lazy-seq is a macro that creates a lazy sequence out of a body of expressions. Inside it, the first binding builds the query, parses the resulting XML and creates a zipper, all in one single expression. All we have to do then is extract the records (bound to works) and the resumption token (bound to token), and call ourselves recursively. Don’t be afraid of stack overflows: lazy-seq defers the recursive call until the rest of the sequence is actually requested, so the calls never pile up on the stack.
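The laziness is easier to observe without a network in the way. The following sketch replaces the API with a hypothetical in-memory map of pages and counts how often a page is fetched. The names pages, fetch-page, fetch-count and lazy-items are all made up for this illustration.

```clojure
;; Hypothetical in-memory "API": each page holds its items and the token
;; of the next page (nil on the last page).
(def pages
  {nil {:items [1 2] :next-token "a"}
   "a" {:items [3 4] :next-token "b"}
   "b" {:items [5]   :next-token nil}})

;; Counts how many pages have been fetched so far.
(def fetch-count (atom 0))

(defn fetch-page [token]
  (swap! fetch-count inc)
  (get pages token))

(defn lazy-items
  "Lazy sequence of all items, fetching a page only when it is needed."
  [token]
  (lazy-seq
   (let [{:keys [items next-token]} (fetch-page token)]
     (if next-token
       (concat items (lazy-items next-token))
       items))))

;; Asking for three items touches only the first two pages.
(def first-three (doall (take 3 (lazy-items nil))))
first-three    ; => (1 2 3)
@fetch-count   ; => 2
```

The recursive call sits inside lazy-seq, so it only produces an unrealized sequence. The next page is fetched when a consumer walks past the end of the current one.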

Now we are ready to use our lazy sequence. The next example creates a list with the image URLs of the first 10 items in the collection:


(defn get-image-url [work]
  (zf/xml1-> work :metadata :oai_dc:dc :dc:format zf/text))

(map get-image-url (take 10 (get-works)))

Don’t go overboard with requesting all items in the collection at once. Retrieving 1000 items takes about 1 minute, so the calls to the API are most probably throttled. Anyhow, have fun with lazily stealing works of art!
