2013-03-24

A Merging Test Bench

As requested, here's the packed data and a test bench where you can try out your own merging function ideas and (hopefully) replicate my results. If you want the plots, you can use the end parts of the scripts in part 1 and part 2.

The data is a bunch of super secret Eve Online killmails. The first part of the script handles the JSON-to-data.frame transformation. After that I introduce the R core merge in a for-loop, a data.table variant, Reduce function variants, and finally fast.merging. There's actually a lot more data than we are using in the bench: 166106 killmails in total, of which we merge only 3000. That is because, as noted in previous posts, the merge for-loop seems to run in exponential time, so going much higher than 3000 lines seems really excessive. Even now it took over an hour to run the whole bench on an Intel i5 3570K (not overclocked). So if you want faster results, choose around 1000 frames and set the rbenchmark replications to 1. With a low frame count (under 100) the results are fairly similar.
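
For reference, this is the shape of the two base-R variants in the bench (a minimal sketch, assuming eve.frames is the list of killmail data.frames the script builds):

# merge in a for-loop: every pass re-merges the growing accumulator
merged <- eve.frames[[1]]
for (i in 2:length(eve.frames)) {
  merged <- merge(merged, eve.frames[[i]], all = TRUE)
}

# the Reduce variant expresses the same pairwise merging functionally
merged2 <- Reduce(function(x, y) merge(x, y, all = TRUE), eve.frames)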

If you are going to check whether the results from the different functions are the same, note that the order of rows and columns will differ. I'm eagerly waiting for your results and ideas.
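
One way to check that (my own helper, not part of the bench) is to sort both results into a canonical row and column order before comparing:

same.contents <- function(a, b) {
  # put columns into alphabetical order
  a <- a[, sort(names(a)), drop = FALSE]
  b <- b[, sort(names(b)), drop = FALSE]
  # sort rows by every column
  a <- a[do.call(order, unname(as.list(a))), , drop = FALSE]
  b <- b[do.call(order, unname(as.list(b))), , drop = FALSE]
  rownames(a) <- rownames(b) <- NULL
  isTRUE(all.equal(a, b, check.attributes = FALSE))
}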

4 comments:

  1. Anonymous, 25/3/13 18:58

    Merge seems like the wrong tool to use in this instance, which is why the operation takes so long. All you are doing is an rbind with NAs filled in for missing columns. There are likely faster methods out there, but I don't think merge is the right tool here.

    Try the following, making sure the elements of eve.frames are actually data.frames:

    library(plyr)

    # make sure every element really is a data.frame before stacking
    eve.frames2 <- lapply(eve.frames, as.data.frame, stringsAsFactors = FALSE)

    # rbind.fill stacks the frames and fills missing columns with NA
    TMP2 <- do.call(rbind.fill, eve.frames2)

  2. Yes, your plyr was faster. I'm just wondering whether it will translate to a parallel solution, because I think I can just parallel-FM my way out. :)
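
    Roughly the kind of divide-and-conquer parallelism I have in mind (just a sketch using the parallel package; mclapply needs a Unix-alike, on Windows parLapply would be the substitute):

    library(parallel)
    # merge each chunk of frames on its own core...
    chunks <- split(eve.frames, cut(seq_along(eve.frames), 4, labels = FALSE))
    partials <- mclapply(chunks, function(ch)
      Reduce(function(x, y) merge(x, y, all = TRUE), ch), mc.cores = 4)
    # ...then merge the handful of per-chunk results serially
    result <- Reduce(function(x, y) merge(x, y, all = TRUE), partials)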

    Also, I'm wondering: if you have to use merge( , all=TRUE), what would be the generally good method? It's one of the most often repeated lines I need at work, so even if it's not actually needed here, I want to be sure. Of course this example is kind of lame, because you can just easily plyr your way out of it.

    After a quick test (not parallel), at 10 000 frames:

    Elapsed (sec)
    plyr 124.03
    fm   177.84
    fm2   10.94 (my newest solution)

    My newest solution is basically: sort all frames by their columns => rbind frames with matching columns => merge the rest. I'll spam the code if it actually works. Currently it's pretty fast, I'm just not 100% sure whether it does it correctly.
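
    In sketch form it's roughly this (untested, and fm2.sketch is just a placeholder name):

    fm2.sketch <- function(frames) {
      # group frames that share exactly the same column set
      keys <- sapply(frames, function(x) paste(sort(names(x)), collapse = "|"))
      groups <- split(frames, keys)
      # inside a group the columns match after sorting, so plain rbind works
      stacked <- lapply(groups, function(g)
        do.call(rbind, lapply(g, function(x) x[, sort(names(x)), drop = FALSE])))
      # only the few stacked frames still need a real merge
      Reduce(function(x, y) merge(x, y, all = TRUE), stacked)
    }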

  3. Anonymous, 25/3/13 21:41

    First refine your algorithm and then think about going parallel. Should a problem of this size take minutes, or even tens of seconds, to complete? Likely not. This matrix version avoids data.frames and will get you there faster than your previous versions. Tweak it to get the output how you want it (and I'm sure it can be improved). Likely another part of your code is now the bottleneck.

    As an aside, I think you are abusing both the base merge and data.table::merge, and you should be careful not to jump to the wrong conclusions about speed.

    After you have created eve.frames, run this code.

    # GET THE UNIQUE NAMES IN ALL THE FILES
    nm <- unique(unlist(lapply(eve.frames, colnames)))

    # PREALLOCATE ONE ROW PER FRAME
    eve.matrix <- matrix(NA_character_, nrow=length(eve.frames), ncol=length(nm))
    colnames(eve.matrix) <- nm

    system.time({
    # LOOP THROUGH EACH ENTRY AND FILL THE MATCHING COLUMNS OF ROW i
    for(i in seq_along(eve.frames)){
      eveTmp <- eve.frames[[i]]
      ind <- nm[nm %in% colnames(eveTmp)]
      # unlist TURNS THE ONE-ROW data.frame INTO A PLAIN CHARACTER VECTOR
      eve.matrix[i, ind] <- unlist(eveTmp[1, ind])
    }
    })
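
    And if you want a data.frame at the end, the filled matrix converts in one step:

    eve.df <- as.data.frame(eve.matrix, stringsAsFactors = FALSE)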

    Jonathan

    Replies
    1. Hi Jonathan,

      True, that's basically the same as the plyr package's rbind.fill, but much faster. It's interesting that the order of the columns doesn't matter, and that it's faster.

      I've already done the parallel solution for the old one, so it doesn't really matter.

      This is hopefully just some preliminary testing. I'm going to fragment the data so that rbind solutions will not work. Then we can truly measure the merging speed and check whether the result is correct.
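
      For instance, fragmenting could split each frame's columns into two pieces that only a real merge can rejoin (an illustration only; killID as the shared key is my assumption):

      fragment <- function(frame, key = "killID") {
        cols <- setdiff(names(frame), key)
        first <- head(cols, ceiling(length(cols) / 2))
        list(frame[, c(key, first), drop = FALSE],
             frame[, c(key, setdiff(cols, first)), drop = FALSE])
      }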

      And I'm assuming that the speed here will correlate with the speed later on, when we do this on the fragmented data. If my assumption is wrong, what does that tell us about the current merging functions?

      - Xachriel
