Getting Started

Installation

To install DataCubes, at the Julia REPL:

Pkg.add("DataCubes")

Using the DataCubes package

To use DataCubes,

using DataCubes

This will introduce the core functions into namespace. A few helper functions have the form dcube.* and they can be introduced into namespace as well by:

using DataCubes.Tools

Below, we assume you already executed using DataCubes.

Creating a multidimensional table

DictArray

A multidimensional table can be represented by either the DictArray or LabeledArray data type. A DictArray is an array of ordered dictionaries with the common keys, and represents a table with no speicial axis information. A LabeledArray is a DictArray with an additional vector for each axis for their labels. The macro @darr is used to create a DictArray:

julia> d = @darr(c1=[1,1,2], c2=["x", "y", "z"])

3 DictArray

c1 c2 
------
1  x  
1  y  
2  z

Note that this is a one-dimensional array. There are 3 elements in the array:

julia> for elem in d
         println(elem)
       end
DataCubes.LDict{Symbol,Nullable{T}}(:c1=>Nullable(1),:c2=>Nullable("x"))
DataCubes.LDict{Symbol,Nullable{T}}(:c1=>Nullable(1),:c2=>Nullable("y"))
DataCubes.LDict{Symbol,Nullable{T}}(:c1=>Nullable(2),:c2=>Nullable("z"))

LDict is an ordered dictionary. That is, it is similar to Dict but keeps track of the order of the insertion of elements. Each element has two keys :c1 and :c2. Their values are all Nullable: the macro @darr wraps values appropriately by Nullable if they are not wrapped already. To choose an element, e.g. y in the DictArray d, you can use d[2][:c2]. To choose one field, a function pick is provided:

julia> pick(d, :c1)
3-element DataCubes.AbstractArrayWrapper{Nullable{Int64},1,Array{Nullable{Int64},1}}:
 Nullable(1)
 Nullable(1)
 Nullable(2)

pick has many types of methods, and the meaning is different depending on the situation. For example, to get a DictArray with only c1 field, use pick(d, [:c1]). An array access expression such as d[2,:c2] cannot be provided because the field name (:c2) can be actually any type. For example:

julia> d1 = @darr(:one=>[1,2,3], 2=>[:x,:y,:z], "three"=>['u','v','w'])
3 DictArray

one 2 three 
------------
1   x u     
2   y v     
3   z w

LabeledArray

A LabeledArray adds axes information to a DictArray. A convenient way to create a LabeledArray by hand is using the macro @larr. For example,

julia> t = @larr(c1=[1 ,1 ,2], c2=["x", "y", "z"], axis1[k1=[10,11,12], k2=[:r1, :r2, :r3]])
3 LabeledArray

k1 k2 |c1 c2 
------+------
10 r1 |1  x  
11 r2 |1  y  
12 r3 |2  z

Note the appearance of labels in the first column whose field names are k1 and k2. You can also construct a LabeledArray from a DictArray and axes labels:

julia> d = @darr(c1=[1 ,1 ,2], c2=["x", "y", "z"])
3 DictArray

c1 c2 
------
1  x  
1  y  
2  z  


julia> t = @larr(d, axis1[k=[:r1, :r2, :r3]])
3 LabeledArray

k  |c1 c2 
---+------
r1 |1  x  
r2 |1  y  
r3 |2  z

peel(t) will return the underlying DictArray, stripping off all axes information. pick(t, :c1) will return the field value array of c1.

Multidimensional Tables

Both DictArray and LabeledArray can be multidimensional. For example,

julia> m = @larr(c1=[1 2;3 4;5 6],
                 c2=['a' 'b';'b' 'a';'a' 'a'],
                 axis1[k1=["x","y","z"]],
                 axis2[r1=[:A,:B]])
3 x 2 LabeledArray

r1 |A     |B     
---+------+------
k1 |c1 c2 |c1 c2 
---+------+------
x  |1  a  |2  b  
y  |3  b  |4  a  
z  |5  a  |6  a

You can choose elements in the array using usual array indexing expressions:

julia> m[2:3,2]
2 LabeledArray

k1 |c1 c2 
---+------
y  |4  a  
z  |6  a

Many operations for multidimensional arrays are also applicable:

julia> transpose(m)
2 x 3 LabeledArray

k1 |x     |y     |z     
---+------+------+------
r1 |c1 c2 |c1 c2 |c1 c2 
---+------+------+------
A  |1  a  |3  b  |5  a  
B  |2  b  |4  a  |6  a  


julia> sub(m, 1:2, 1:2)
2 x 2 LabeledArray

r1 |A     |B     
---+------+------
k1 |c1 c2 |c1 c2 
---+------+------
x  |1  a  |2  b  
y  |3  b  |4  a

Some operations take slightly differnt set of arguments. For example, to sort a LabeledArray m along the first axis using the field c2:

julia> sort(m, 1, :c2)
3 x 2 LabeledArray

r1 |A     |B     
---+------+------
k1 |c1 c2 |c1 c2 
---+------+------
x  |1  a  |2  b  
z  |5  a  |6  a  
y  |3  b  |4  a

Manipulating LabeledArray

The macro @select uses a SQL-like syntax to transform one LabeledArray into another (or into a dictionary). Let's use this LabeledArray m as an example:

julia> m = @larr(c1=[1 2;3 4;5 6],
                 c2=['a' 'b';'b' 'a';'a' 'a'],
                 c3=[10.0 NA;NA 12.0;20.0 20.0],
                 axis1[k1=["x","y","z"]],
                 axis2[r1=[:A,:B]])
3 x 2 LabeledArray

r1 |A          |B          
---+-----------+-----------
k1 |c1 c2 c3   |c1 c2 c3   
---+-----------+-----------
x  |1  a  10.0 |2  b       
y  |3  b       |4  a  12.0 
z  |5  a  20.0 |6  a  20.0

To select only the fields c1 and c2,

julia> @select(m, :c1, :c2)
3 x 2 LabeledArray

r1 |A     |B     
---+------+------
k1 |c1 c2 |c1 c2 
---+------+------
x  |1  a  |2  b  
y  |3  b  |4  a  
z  |5  a  |6  a

An example to create a new column from the existing one:

julia> @select(m, c=_c1 .* 2 .+ _c3)
3 x 2 LabeledArray

r1 |A    |B    
---+-----+-----
k1 |c    |c    
---+-----+-----
x  |12.0 |     
y  |     |20.0 
z  |30.0 |32.0

_c1 refers to the c1 field and .* does the component-wise multiplication.

To choose only relevant elements, use the where[...] expression,

julia> @select(m, :c1, :c2, where[_c2 .== 'b'])
2 x 2 LabeledArray

r1 |A     |B     
---+------+------
k1 |c1 c2 |c1 c2 
---+------+------
x  |      |2  b  
y  |3  b  |

where[...] can have many conditions inside ..., and they will be applied sequentially. Multiple where[...] are also possible, and they will be simply concatenated.

To group the array elements by some fields, use the by[...] expression,

julia> @select(m,c1=sum(_c1), c3=sum(_c3), by[:c2])
2 LabeledArray

c2 |c1 c3   
---+--------
a  |16 62.0 
b  |5  0.0

You can provide multiple fields to group by:

julia> @select(m,c1=sum(_c1), c3=mean(_c3), by[:c2,c4=_c1 .> 5])
3 LabeledArray

c2 c4    |c1 c3   
---------+--------
a  false |10 14.0 
a  true  |6  20.0 
b  false |5

But it is also possible to group by the array in a multidimensional way:

julia> @select(m,c1=sum(_c1), c3=mean(_c3), by[:c2], by[c4=_c1 .> 5])
2 x 2 LabeledArray

c4 |false      |true      
---+-----------+----------
c2 |c1    c3   |c1   c3   
---+-----------+----------
a  |10    14.0 |6    20.0 
b  |5          |