Monday, 28 March 2011

monads in R: sapply and foreach

Monads are a powerful way of structuring functional programs. They are used in functional languages like Haskell or F# to define control flow (for example handling concurrency, continuations, side effects such as input/output, or exceptions). Basically, a monad is defined by a type and two functions, unit and bind. Alternatively, one can define a monad with a data type and two other functions, join and fmap (unit is still needed in this formulation). In the following I want to show some analogies between fmap and join and the functions sapply and c in R.

fmap is defined as a (higher-order) function that fulfills the following relations (written in Haskell, where the . denotes function composition):

fmap id = id
fmap (f . g) = (fmap f) . (fmap g)

The definition in R is then (written in curried form, so that it can be called as fmap(f)(x)):
fmap <- function(f) function(x) sapply(x, f)
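
As a quick sanity check, the two relations can be tried out on a small numeric vector. This is only a sketch: the helper compose and the functions f and g are names I made up for illustration, and fmap is written in curried form so that fmap(f)(x) works.

```r
# fmap in curried form: takes f and returns a function over vectors
fmap <- function(f) function(x) sapply(x, f)

# hypothetical helper for function composition (the "." in Haskell)
compose <- function(f, g) function(v) f(g(v))

x <- c(1, 2, 3)
f <- function(v) v + 1   # illustrative functions, not from any package
g <- function(v) v * 2

# fmap id = id
law_identity <- identical(fmap(identity)(x), x)

# fmap (f . g) = (fmap f) . (fmap g)
law_composition <- identical(fmap(compose(f, g))(x), fmap(f)(fmap(g)(x)))

print(c(identity = law_identity, composition = law_composition))
```

Keep in mind that sapply simplifies its result, so the check is only meaningful for simple vectors like the one used here.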

For the join function the following has to hold (again in Haskell):

join . fmap join = join . join

This translates to the following code identities in R (read === as "is the same as"):
c( fmap(c)(x)) === c(sapply(x,c)) === c(x) === x
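
To make the join relation concrete for nested lists, here is a small sketch. Two things in it are my assumptions rather than part of the identities above: join is written as do.call(c, x), i.e. c applied to the elements of x, because c(x) on its own does not remove a level of nesting from a list, and sapply is called with simplify = FALSE so that the result always stays a list. The sketch also derives bind from fmap and join, with list playing the role of unit.

```r
# list-preserving fmap (sapply with simplify = FALSE behaves like lapply)
fmap <- function(f) function(x) sapply(x, f, simplify = FALSE)

# assumed join: concatenate the elements of x, removing one nesting level
join <- function(x) do.call(c, x)

# unit wraps a value into a one-element list; bind is derived from fmap and join
unit <- function(v) list(v)
bind <- function(m, f) join(fmap(f)(m))

# join . fmap join = join . join, checked on a doubly nested list
xxx <- list(list(c(1, 2), 3), list(4))
law_join <- identical(join(fmap(join)(xxx)), join(join(xxx)))

# left and right identity for bind, with an illustrative function f
m <- list(1, 2, 3)
f <- function(v) list(v, v * 10)
law_left  <- identical(bind(unit(5), f), f(5))
law_right <- identical(bind(m, unit), m)

print(c(join = law_join, left = law_left, right = law_right))
```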

Therefore c and sapply define the list monad in R. Can we go one step further and define in R some notation like the "do" notation in Haskell? I think this has already been done: it is the foreach package.

foreach(x=a) %do% f(x)

is the exact translation of the corresponding monadic "do" notation in Haskell:

fmap f x = do { r <- x; return (f r) }
join x = do { a <- x; a }

and in R:
fmap(f)(x) === foreach(r=x) %do% f(r) === sapply(x,f)
join(x) === foreach(a=x) %do% a === c(x)
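
Here is a small runnable check of these two identities, assuming the foreach package is installed. One detail the identities gloss over: foreach returns a list by default, so the sketch passes .combine = c to get a vector that can be compared with the output of sapply. The input values are made up for illustration, and join is taken to be do.call(c, x), which is my assumption for nested input.

```r
library(foreach)

# fmap(f)(x): foreach with %do% versus sapply
x <- c(1, 4, 9)
via_foreach <- foreach(r = x, .combine = c) %do% sqrt(r)
via_sapply  <- sapply(x, sqrt)
fmap_agrees <- isTRUE(all.equal(via_foreach, via_sapply))

# join(x): foreach that just returns each element, combined with c
nested <- list(c(1, 2), 3)
joined <- foreach(a = nested, .combine = c) %do% a
join_agrees <- isTRUE(all.equal(joined, do.call(c, nested)))

print(c(fmap = fmap_agrees, join = join_agrees))
```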

Now what does this mean for the several foreach extensions (doMC, doMPI, doRedis, doSNOW)?
They are all implementations of monads in R. So it would be very interesting to port all the other monadic types from other functional programming languages to R.

doRedis: redis as dispatcher for parallel R

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl...

B. W. Lewis has developed a parallel backend for the foreach package that allows a cluster of workers to obtain workloads from a Redis server.

These are the basic calls of the Redis binding for R (rredis):

  • redisConnect() # connect to redis store
  • redisSet('x',runif(5)) # store a value
  • redisGet('x') # retrieve value from store
  • redisClose() # close connection
  • redisAuth(pwd) # simple authentication
  • redisConnect()
  • redisLPush('x',1) # push numbers into list
  • redisLPush('x',2)
  • redisLPush('x',3)
  • redisLRange('x',0,2) # retrieve list
Using this Redis interface for R, the fine R library doRedis allows Redis to act as a dispatcher for parallel R commands on a cluster. The usage is fairly easy and closely resembles that of the doSNOW clustering library:
  • First, start a Redis server on one of the machines; this is the cluster master.
  • Then connect as many R workers as you want to the Redis master.
  • Finally, start an R interpreter, connect to the Redis master and submit the parallel computation job using the foreach package.
The workload is then distributed to the workers (which can reside on machines other than the master) and the result is gathered back in the interactive R interpreter. Here is an example of how to use the package:

Start the Redis server:

./redis-server

Start as many workers as you like (the command is piped into R, once per worker):

echo "require('doRedis');redisWorker('jobs')" | R --no-save -q &

Start an R interpreter, connect to the Redis server and submit the job:

library(doRedis)
registerDoRedis('jobs')
foreach(j=1:1000, .combine=sum) %dopar% sum(runif(10000000))

If you want to minimize the communication with the Redis server, use setChunkSize so that each R worker receives several loop iterations at once.
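
As a sketch, the job from above could be submitted in chunks like this. It assumes a Redis server on localhost and workers listening on the 'jobs' queue, as in the steps above; the chunk size of 50 is an arbitrary value I picked for illustration, not a recommendation.

```r
library(doRedis)

# connect to the Redis server and use the 'jobs' queue as dispatcher
registerDoRedis('jobs')

# hand out 50 loop iterations per task instead of one (arbitrary value)
setChunkSize(50)

# same job as before; the workers now talk to Redis far less often
result <- foreach(j = 1:1000, .combine = sum) %dopar% sum(runif(10000000))
print(result)
```

Larger chunks mean fewer round trips to Redis, but also a coarser distribution of the work across the workers, so the best value depends on how long one iteration takes.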