Friday, May 30, 2014

RGolf: rolling window

I have learned a lot from my last RGolf post. Therefore today I have another problem from practice.

You have a data set on values of contracts signed by ten salesmen. It has three columns: person id (p), contract value (v) and time (t).
Here is the code that generates the test data set.

n <- 4000
y <- 10
set.seed(1)
d <- data.frame(p = sample(letters[1:y], n, rep=T),
                v = runif(n),
                t = sample(seq(as.Date("2000-1-1"),
                           by="day", len=n)))

head(d)
#   p          v          t
# 1 c 0.18776846 2001-05-28
# 2 d 0.50475902 2001-05-25
# 3 f 0.02728685 2008-07-06
# 4 j 0.49629785 2004-08-06
# 5 c 0.94735171 2007-10-31
# 6 i 0.38118213 2001-09-04

For each salesmen we want to find the 90-day period in which she generated the highest sum of sales.
The task is to generate a data frame that has three columns: person id (p), maximal sum of sales (v), start of the 90-day period (t) and save it to a variable named r.
In situations where there are multiple such 90-day periods you are to report the date of the earliest contract that was signed during such a period.

The rules of engagement are exactly as last time:
(1) generate data frame r as few keystrokes as possible,
(2) one line of code may not be longer than 80 characters,
(3) the solution must be in base R only (no package loading is allowed).

Here is my attempt consisting of 225 characters:

x=max(d$t);m=matrix(0,x+89,y)
m[cbind(d$t, d$p)]=d$v
s=sapply(1:x,function(z)
colSums(m[z+0:89,]))*t(m>0)[,1:x]
u=apply(s,1,function(z)c(max(z),which.max(z)))
r=data.frame(p=letters[1:y],v=u[1,],t=u[2,]+as.Date("1970-01-01"))

It produces the following output:

r
#    p         v          t
# 1  a 10.198044 2009-10-11
# 2  b  9.265335 2002-08-28
# 3  c  9.735401 2008-02-21
# 4  d  9.479942 2004-09-07
# 5  e 11.036041 2010-09-16
# 6  f 10.602446 2002-04-04
# 7  g  9.927153 2007-08-11
# 8  h 10.917856 2007-04-26
# 9  i 10.027341 2008-10-26
# 10 j  9.965738 2004-12-28

Here is a more verbose version that is easier to read:

d_o <- d[order(d$t),] # order d by time
testr <- NULL
for (p in levels(d_o$p)) {
    d_o_p <- d_o[d_o$p == p,] # subset ordered d by player
    best_v_sum <- -Inf
    best_i <- NA
    for (i in 1:nrow(d_o_p)) {
        d_o_p_t <- d_o_p[d_o_p$t < d_o_p$t[i] + 90 &
                         d_o_p$t >= d_o_p$t[i],]
        # subset ordered d by player and rolling window
        if (sum(d_o_p_t$v) > best_v_sum) {
            best_v_sum <- sum(d_o_p_t$v)
            best_i <- i
        }
    }
    testr <- rbind(testr, data.frame(p = p, v = best_v_sum,
                   t = d_o_p$t[best_i]))
}

all.equal(r,testr)

# [1] TRUE

Any competing solution should pass all.equal test.

As last time - please post your solutions in comments to the post.

8 comments:

  1. Hello!

    Here is my terribly minified solution (approx. 242 characters):

    r <- do.call(rbind, arg=lapply(split(d, d$p), function(d){
    d <- d[order(d$t),, FALSE]
    s <- mapply(function(i,j) sum(d$v[i:j]),
    seq_along(d$t), .bincode(d$t+90, c(d$t,+Inf)))
    data.frame( p=d$p[i<-which.max(s)], v=s[i], t=d$t[i] )
    }))

    It partitions the original data.frame into rows by the ID of the agent and runs the below on each partition:
    1) It sorts the sales table of the current agent by date of sale in increasing order. This is done with the view of applying binary search at a later stage.
    2) It runs base:::.bincode() on the dates of sales shifted by 90 days into the future with original dates serving as right-closed bins and an infinite right end. This finds the last date within the 90 day window from the current one. Also .bincode() correctly handles sequential duplicates in brakes.
    3) It invokes mapply() with the running sum function on the indices of the last and the first date.
    4) It locates the first date of the 90-day window with the highest total sales and constructs the output row.
    In the end, to get the result it builds a call to rbind() with the list-output of the first phase.

    I guess the complexity is somewhere between O( N^2 ), since which.max() is O( N ), order() and .bincode() are O(N log N), while the computation of the running sum is close to O( N^2 ).

    Here is the full version, which also provides some debug information. Note that sorting is can be performed only once outside the main lapply() call.

    ## Since split preserves the original natural order
    d <- d[ order( d$t ), , FALSE ]

    r <- do.call( rbind, arg = lapply( split( d, d$p ), function( d ) {
    ## Find the last date, which falls within the 90 day running window
    jnx <- .bincode( d$t + 90, c( d$t, +Inf ) )

    ## We need a running sum with variable length - context dependent window
    sales <- mapply(
    function( i, j ) sum( d$v[ i : j ] ),
    seq_along( d$t ), jnx )

    ## Fetch the row, which begins the 90-day period with the maximal sales
    inx <- which.max( sales )

    data.frame(
    p = d$p[ inx ]
    , v = sales[ inx ]
    , i = inx
    , t = d$t[ inx ]
    , j = jnx[ inx ]
    , e = d$t[ jnx[ inx ] ] )
    } ) )

    ivan nazarov

    ReplyDelete
    Replies
    1. Here is a faster and shorter version of my solution. It is 157 characters long (with newlines) and instead of running summation, it computes an "extended" cumulative sum of ordered sales, and then gets the difference of the accumulated sales value at the last date of the window and just before the first day of the window (zero if the window starts at the earliest date).

      r<-do.call(rbind,lapply(split(d,d$p),function(d){
      d=d[order(d$t),];d$v=(v=c(0,cumsum(d$v)))[1+.bincode(
      d$t+90,c(d$t,+Inf))]-head(v,-1);d[which.max(d$v),]}))

      However this cumsum() approach should not be used in practice without modification. The problem is with finite precision arithmetic: computing a cumulative sum is prone to growing roundoff errors, which become more severe the longer the input data. As the result, the total sales in the updated d$v might be numerically inexact, thereby adversely affecting the choice of the maximal sales period.

      Delete
  2. Just dropping in to suggest that, perhaps not in the spirit of "code golf," but certainly of interest, would be the results of time trials of all proposed code solutions.

    ReplyDelete
  3. I was thinking about it. Interestingly - up till now - shorter codes usually are more vectorized and thus run faster :).

    In the current example, on my machine I get the following times:
    - verbose version: 1 s.
    - my solution: 0.25 s.
    - Ivan Nazarov: 0.1 s.

    ReplyDelete
  4. That's my solution:

    r=d[c()];sapply(levels(d$p),function(l){s=d[d$p==l,];t=s$t
    s$v=sapply(t,function(x)sum(s$v[t>=x&t r
    p v t
    1 a 10.198044 2009-10-11
    2 b 9.338709 2002-08-24
    3 c 9.735401 2008-02-21
    4 d 9.479942 2004-09-07
    5 e 11.036041 2010-09-16
    6 f 10.602446 2002-04-04
    7 g 9.927153 2007-08-11
    8 h 10.917856 2007-04-26
    9 i 10.027341 2008-10-26
    10 j 9.965738 2004-12-28
    > all.equal(r, testr)
    [1] TRUE

    ReplyDelete
    Replies
    1. Erm, it seems that blogspot has problems with angle brackets. My solution is at http://pastebin.com/UNJp65Kw . 159 bytes, including newlines.

      It's slow as hell, but hey, it needs to be short, not fast ;-)

      Delete
  5. I've managed to do it in 138 characters, including setting rownames which costs 17 characters. Without these rownames all.equal doesn't return TRUE, although all values are identical.

    r=Reduce(rbind,lapply(split(d,d$p),function(x){t=x$t
    x$v=sapply(t,function(b)sum(x$v[t>=b&t<b+90]))
    x[which.max(x$v),]}))
    rownames(r)=1:10


    I use Reduce instead of do.call to save one letter :)

    ReplyDelete