Jim Drannbauer

Cascalog Made Easier

04 Feb 2011

There’s an awesome tool for processing data on Hadoop with Clojure. It’s called Cascalog. I think it’s pretty great. There’s a buttload of words to read about it on the wiki.

As a Cascalog n00b, I had a hard time wrapping my head around Cascalog’s set of incredibly powerful Custom Operations. They’re kinda dense (in a good, Clojure-y kinda way), so I wrote some tests that illustrate what each one does. Writing them helped me make sense of it all.

Here they are. If you want to run ‘em, do this first. Then, drop the tests in your project. The tests should be pretty self-explanatory but if you need more information, read this and this. BTW… gonna assume you know Clojure.

(ns my-project.test.demos-test
  (:use cascalog.api)
  (:use [clojure.test])
  (:use [cascalog.testing])
  (:require [clojure.contrib.string :as s]))

(defn make-uppercase
  [string]
  (s/upper-case string))

(defn uppercase-sq
  [src]
  (<- [?upper-string]
    (src ?lower-string)
    (make-uppercase ?lower-string :> ?upper-string)))

(deftest understand-defn-as-a-transformer
  (with-tmp-sources [test-src [["lowercase"]]]
    (test?- [["LOWERCASE"]]
      (uppercase-sq test-src))))

(defn truthy? [truthiness] truthiness)

(defn only-the-truth
  [src]
  (<- [?statement]
    (src ?statement ?truthiness)
    (truthy? ?truthiness)))

(deftest understand-defn-as-a-filter
  (with-tmp-sources [test-src [["truth" true]["lie" false]]]
    (test?- [["truth"]]
      (only-the-truth test-src))))

(defmapop my-name-is [x] ["my" "name" "is"])

(defn hello
  [src]
  (<- [?my ?name ?is ?my-name]
    (src ?my-name)
    (my-name-is ?my-name :> ?my ?name ?is)))

(deftest understand-defmapop
  (with-tmp-sources [test-src [["Jim"]["Kerry"]]]
    (test?- [["my" "name" "is" "Jim"]
             ["my" "name" "is" "Kerry"]]
      (hello test-src))))

(deffilterop parallel-truthy? [truthiness] truthiness)

(defn only-the-truth-parallel
  [src]
  (<- [?statement]
    (src ?statement ?truthiness)
    (parallel-truthy? ?truthiness)))

(deftest understand-deffilterop
  (with-tmp-sources [test-src [["truth" true]["lie" false]]]
    (test?- [["truth"]]
      (only-the-truth-parallel test-src))))

;  What defmapcatop does:
;
;           this should be vertical
;             |    |    |     |
; this <------+    |    |     |
; should <---------+    |     |
; be <------------------+     |
; vertical <------------------+

(defmapcatop vert
  [this should be vertical]
  [[this][should][be][vertical]])

(defn make-vertical
  [src]
  (<- [?word]
    (src ?this ?should ?be ?vertical)
    (vert ?this ?should ?be ?vertical :> ?word)))

(deftest understand-defmapcatop
  (with-tmp-sources [test-src [["this" "should" "be" "vertical"]]]
    (test?- [["this"]
             ["should"]
             ["be"]
             ["vertical"]]
      (make-vertical test-src))))
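
; For intuition only (plain Clojure, no Cascalog involved):
; defmapcatop relates to defmapop roughly the way mapcat
; relates to map. vert-plain is a hypothetical stand-in for
; vert, written just for this sketch.

(defn vert-plain
  [this should be vertical]
  [[this] [should] [be] [vertical]])

(deftest understand-mapcat-analogy
  (let [rows [["this" "should" "be" "vertical"]]]
    ; map keeps one (nested) result per input row
    (is (= [[["this"] ["should"] ["be"] ["vertical"]]]
           (map #(apply vert-plain %) rows)))
    ; mapcat splices each result in, so one row becomes four
    (is (= [["this"] ["should"] ["be"] ["vertical"]]
           (mapcat #(apply vert-plain %) rows)))))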

; race-winners works because the rows are
; sorted by place. Here place is a simple
; number string, so the sort is lexicographic:
; fine for single digits, but "10" would
; sort before "2".
;
; defbufferop is doing this:
;        +---+------------------+
;        |   |                  |
; Race Jim   1                  |
; Race Kerry 2                  |
;   |    |   |                  v
;   |    +---+---------------------------+
;   |                                    v
;   +----------------> Race ( Jim 1 | Kerry 2 )
;
;  Then, in this case, we just want the
;  first tuple (Jim 1) since it comes first
;  (because we sorted). Without sorting,
;  there is no way to reliably pick the
;  first. Luckily, sorting takes only one
;  line: (:sort ?place).

(defbufferop get-winner
  [tuples]
  (take 1 tuples))

(defn race-winners
  [src]
  (<- [?race ?winner ?winning-place]
    (src ?race ?runner ?place)
    (:sort ?place)
    (get-winner ?runner ?place :> ?winner ?winning-place)))

(deftest understand-defbufferop
  (with-tmp-sources [test-src [["Race 1" "Jim"   "1"]
                               ["Race 1" "Kerry" "2"]
                               ["Race 2" "Jim"   "5"]
                               ["Race 2" "Kerry" "3"]]]
    (test?- [["Race 1" "Jim"   "1"]
             ["Race 2" "Kerry" "3"]]
      (race-winners test-src))))
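
; For intuition only (plain Clojure, no Cascalog involved):
; per group, a buffer op gets the whole sorted run of tuples
; at once. winners-plain is a hypothetical helper written for
; this sketch; it groups by race, sorts by place and keeps
; the first tuple of each group.

(defn winners-plain
  [rows]
  (for [[race tuples] (sort-by key (group-by first rows))
        [_ runner place] (take 1 (sort-by last tuples))]
    [race runner place]))

(deftest understand-buffer-analogy
  (is (= [["Race 1" "Jim"   "1"]
          ["Race 2" "Kerry" "3"]]
         (winners-plain [["Race 1" "Jim"   "1"]
                         ["Race 1" "Kerry" "2"]
                         ["Race 2" "Jim"   "5"]
                         ["Race 2" "Kerry" "3"]]))))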

;  What defaggregateop does:
;
;           this should be horizontal
;             ^    ^    ^     ^
;             |    |    |     |
; this -------+    |    |     |
; should ----------+    |     |
; be -------------------+     |
; horizontal -----------------+

(defaggregateop conj-words
  ([] [])
  ([word-list word]
    (conj word-list word))
  ([word-list]
    [word-list]))

(defn make-horizontal
  [src]
  (<- [?this ?should ?be ?horizontal]
    (src ?word)
    (conj-words ?word :> ?this ?should ?be ?horizontal)))

(deftest understand-defaggregateop
  (with-tmp-sources [test-src [["this"]
                               ["should"]
                               ["be"]
                               ["horizontal"]]]
    (test?- [["this" "should" "be" "horizontal"]]
      (make-horizontal test-src))))
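
; For intuition only (plain Clojure, no Cascalog involved):
; the three arities behave like init, step and finish wrapped
; around a reduce. conj-words-plain is a hypothetical stand-in
; for conj-words, written just for this sketch.

(defn conj-words-plain
  ([] [])                                    ; init: empty state
  ([word-list word] (conj word-list word))   ; step: fold in one tuple
  ([word-list] [word-list]))                 ; finish: emit one tuple

(deftest understand-aggregate-analogy
  (let [words ["this" "should" "be" "horizontal"]]
    (is (= [["this" "should" "be" "horizontal"]]
           (conj-words-plain
             (reduce conj-words-plain (conj-words-plain) words))))))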

(defn count-if-awesome*
  [x]
  (if (= x "awesome") 1 0))

(defparallelagg count-if-awesome
  :init-var #'count-if-awesome*
  :combine-var #'+)

(defn how-awesome-is-this?
  [src]
  (<- [?awesome-count]
    (src ?word)
    (count-if-awesome ?word :> ?awesome-count)))

(deftest understand-defparallelagg
  (with-tmp-sources [test-src [["this"]["is"]["awesome"]
                               ["awesome"]["awesome"]]]
    (test?- [[3]]
      (how-awesome-is-this? test-src))))
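
; For intuition only (plain Clojure, no Cascalog involved):
; :init-var is applied to every tuple and :combine-var merges
; the partial results, which is roughly a map followed by a
; reduce. awesome->count is a hypothetical copy of
; count-if-awesome*, repeated so this sketch stands alone.

(defn awesome->count
  [x]
  (if (= x "awesome") 1 0))

(deftest understand-parallelagg-analogy
  (let [words ["this" "is" "awesome" "awesome" "awesome"]]
    ; init each word to 0 or 1, then combine with +
    (is (= [0 0 1 1 1] (map awesome->count words)))
    (is (= 3 (reduce + (map awesome->count words))))))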

Yay! Nathan, thanks for all the awesome.
