Getting Started
Installation
To install DataCubes, run the following at the Julia REPL:
Pkg.add("DataCubes")
Using the DataCubes package
To use DataCubes, load the package:
using DataCubes
This brings the core functions into the current namespace.
A few helper functions have the form dcube.*, and they can be brought into the namespace as well with:
using DataCubes.Tools
Below, we assume you have already executed using DataCubes.
Creating a multidimensional table
DictArray
A multidimensional table can be represented by either the DictArray or LabeledArray data type.
A DictArray is an array of ordered dictionaries that share common keys, and it represents a table with no special axis information.
A LabeledArray is a DictArray with an additional label vector for each axis.
The macro @darr is used to create a DictArray:
julia> d = @darr(c1=[1,1,2], c2=["x", "y", "z"])
3 DictArray
c1 c2
------
1 x
1 y
2 z
Note that this is a one-dimensional array. There are 3 elements in the array:
julia> for elem in d
           println(elem)
       end
DataCubes.LDict{Symbol,Nullable{T}}(:c1=>Nullable(1),:c2=>Nullable("x"))
DataCubes.LDict{Symbol,Nullable{T}}(:c1=>Nullable(1),:c2=>Nullable("y"))
DataCubes.LDict{Symbol,Nullable{T}}(:c1=>Nullable(2),:c2=>Nullable("z"))
LDict is an ordered dictionary: it is similar to Dict but preserves the insertion order of its elements. Each element here has two keys, :c1 and :c2. The values are all Nullable: the macro @darr wraps values in Nullable if they are not wrapped already.
To choose an element, e.g. y in the DictArray d, you can use d[2][:c2].
To choose one field, a function pick is provided:
julia> pick(d, :c1)
3-element DataCubes.AbstractArrayWrapper{Nullable{Int64},1,Array{Nullable{Int64},1}}:
Nullable(1)
Nullable(1)
Nullable(2)
pick has several methods, and its behavior depends on the types of its arguments.
For example, to get a DictArray with only the c1 field, use pick(d, [:c1]).
An array access expression such as d[2,:c2] is not provided because a field name (here :c2) can actually be of any type. For example:
julia> d1 = @darr(:one=>[1,2,3], 2=>[:x,:y,:z], "three"=>['u','v','w'])
3 DictArray
one 2 three
------------
1 x u
2 y v
3 z w
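The access patterns described in this section can be collected into one short sketch. It uses only calls shown above (@darr, indexing, and pick) plus get from Base to unwrap a Nullable; the variable names are illustrative, and the snippet assumes the same Julia version as the transcripts above.

```julia
using DataCubes

# Same table as in the example above.
d = @darr(c1=[1,1,2], c2=["x", "y", "z"])

# Index the array first to get one element (an LDict), then look up the field.
elem = d[2]
val = elem[:c2]          # a Nullable wrapping "y"
println(get(val))        # get (from Base) unwraps a non-null Nullable

# pick with a single field name returns that field's values as an array.
c1_values = pick(d, :c1)

# pick with a vector of field names returns a DictArray restricted to those fields.
c1_only = pick(d, [:c1])
```

Indexing first and looking up the field second keeps the position and the field name in separate steps, which is why field names of any type cause no ambiguity.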
LabeledArray
A LabeledArray adds axis information to a DictArray.
A convenient way to create a LabeledArray by hand is to use the macro @larr.
For example,
julia> t = @larr(c1=[1,1,2], c2=["x", "y", "z"], axis1[k1=[10,11,12], k2=[:r1, :r2, :r3]])
3 LabeledArray
k1 k2 |c1 c2
------+------
10 r1 |1 x
11 r2 |1 y
12 r3 |2 z
Note the labels in the leftmost columns, whose field names are k1 and k2.
You can also construct a LabeledArray from a DictArray and axes labels:
julia> d = @darr(c1=[1,1,2], c2=["x", "y", "z"])
3 DictArray
c1 c2
------
1 x
1 y
2 z
julia> t = @larr(d, axis1[k=[:r1, :r2, :r3]])
3 LabeledArray
k |c1 c2
---+------
r1 |1 x
r2 |1 y
r3 |2 z
peel(t) returns the underlying DictArray, stripping off all axis information.
pick(t, :c1) will return the field value array of c1.
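As a minimal sketch of the two calls just mentioned, the following rebuilds t from the example above and applies peel and pick to it. Only functions named in this section are used; the variable names inner and c1_values are illustrative.

```julia
using DataCubes

# Build a LabeledArray from a DictArray plus labels for the first axis.
d = @darr(c1=[1,1,2], c2=["x", "y", "z"])
t = @larr(d, axis1[k=[:r1, :r2, :r3]])

# peel drops the axis labels and returns the underlying DictArray.
inner = peel(t)

# pick on a LabeledArray returns the values of a single field.
c1_values = pick(t, :c1)

println(inner)
println(c1_values)
```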
Multidimensional Tables
Both DictArray and LabeledArray can be multidimensional.
For example,
julia> m = @larr(c1=[1 2;3 4;5 6],
                 c2=['a' 'b';'b' 'a';'a' 'a'],
                 axis1[k1=["x","y","z"]],
                 axis2[r1=[:A,:B]])
3 x 2 LabeledArray
r1 |A |B
---+------+------
k1 |c1 c2 |c1 c2
---+------+------
x |1 a |2 b
y |3 b |4 a
z |5 a |6 a
You can choose elements in the array using the usual array indexing expressions:
julia> m[2:3,2]
2 LabeledArray
k1 |c1 c2
---+------
y |4 a
z |6 a
Many operations for multidimensional arrays are also applicable:
julia> transpose(m)
2 x 3 LabeledArray
k1 |x |y |z
---+------+------+------
r1 |c1 c2 |c1 c2 |c1 c2
---+------+------+------
A |1 a |3 b |5 a
B |2 b |4 a |6 a
julia> sub(m, 1:2, 1:2)
2 x 2 LabeledArray
r1 |A |B
---+------+------
k1 |c1 c2 |c1 c2
---+------+------
x |1 a |2 b
y |3 b |4 a
Some operations take a slightly different set of arguments. For example, to sort a LabeledArray m along the first axis using the field c2:
julia> sort(m, 1, :c2)
3 x 2 LabeledArray
r1 |A |B
---+------+------
k1 |c1 c2 |c1 c2
---+------+------
x |1 a |2 b
z |5 a |6 a
y |3 b |4 a
Manipulating LabeledArray
The macro @select uses a SQL-like syntax to transform one LabeledArray into another (or into a dictionary).
Let's use this LabeledArray m as an example:
julia> m = @larr(c1=[1 2;3 4;5 6],
                 c2=['a' 'b';'b' 'a';'a' 'a'],
                 c3=[10.0 NA;NA 12.0;20.0 20.0],
                 axis1[k1=["x","y","z"]],
                 axis2[r1=[:A,:B]])
3 x 2 LabeledArray
r1 |A          |B
---+-----------+-----------
k1 |c1 c2 c3   |c1 c2 c3
---+-----------+-----------
x  |1  a  10.0 |2  b
y  |3  b       |4  a  12.0
z  |5  a  20.0 |6  a  20.0
To select only the fields c1 and c2:
julia> @select(m, :c1, :c2)
3 x 2 LabeledArray
r1 |A |B
---+------+------
k1 |c1 c2 |c1 c2
---+------+------
x |1 a |2 b
y |3 b |4 a
z |5 a |6 a
To create a new column from existing ones:
julia> @select(m, c=_c1 .* 2 .+ _c3)
3 x 2 LabeledArray
r1 |A    |B
---+-----+-----
k1 |c    |c
---+-----+-----
x  |12.0 |
y  |     |20.0
z  |30.0 |32.0
_c1 refers to the c1 field, and .* performs element-wise multiplication. Cells where c3 is NA produce NA in the result, which is displayed as blank.
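As a further illustration of the _field notation, the sketch below defines two derived fields in a single @select call. The field names doubled and total are hypothetical, and combining several name=expression arguments without a by[...] clause is an assumption based on the grouped examples in this guide rather than a documented guarantee.

```julia
using DataCubes

m = @larr(c1=[1 2;3 4;5 6],
          c2=['a' 'b';'b' 'a';'a' 'a'],
          c3=[10.0 NA;NA 12.0;20.0 20.0],
          axis1[k1=["x","y","z"]],
          axis2[r1=[:A,:B]])

# Two derived fields in one call (assumed to work like the single-field case).
# doubled and total are illustrative names, not part of the package.
r = @select(m, doubled=_c1 .* 2, total=_c1 .+ _c3)
println(r)
```

Because no where[...] or by[...] clause is given, the result is expected to keep the shape and axis labels of m.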
To keep only the elements that satisfy a condition, use the where[...] expression:
julia> @select(m, :c1, :c2, where[_c2 .== 'b'])
2 x 2 LabeledArray
r1 |A     |B
---+------+------
k1 |c1 c2 |c1 c2
---+------+------
x  |      |2  b
y  |3  b  |
A where[...] expression can contain several conditions, which are applied sequentially.
Multiple where[...] expressions are also allowed; their conditions are simply concatenated.
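The following sketch shows both ways of combining conditions. The separate where[...] clauses follow the description above; writing the conditions comma-separated inside one where[...] is an assumption made by analogy with the by[...] syntax used later in this guide. The variable names separate and joined are illustrative.

```julia
using DataCubes

m = @larr(c1=[1 2;3 4;5 6],
          c2=['a' 'b';'b' 'a';'a' 'a'],
          c3=[10.0 NA;NA 12.0;20.0 20.0],
          axis1[k1=["x","y","z"]],
          axis2[r1=[:A,:B]])

# Two separate where[...] clauses; their conditions are concatenated.
separate = @select(m, :c1, :c2, where[_c2 .== 'a'], where[_c1 .> 1])

# Assumed equivalent form: both conditions inside a single where[...],
# separated by a comma (syntax inferred from by[...], not confirmed here).
joined = @select(m, :c1, :c2, where[_c2 .== 'a', _c1 .> 1])

println(separate)
println(joined)
```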
To group the array elements by some fields, use the by[...] expression:
julia> @select(m, c1=sum(_c1), c3=sum(_c3), by[:c2])
2 LabeledArray
c2 |c1 c3
---+--------
a |16 62.0
b |5 0.0
You can provide multiple fields to group by:
julia> @select(m, c1=sum(_c1), c3=mean(_c3), by[:c2, c4=_c1 .> 5])
3 LabeledArray
c2 c4    |c1 c3
---------+--------
a  false |10 14.0
a  true  |6  20.0
b  false |5
It is also possible to group along several dimensions by giving one by[...] expression per axis:
julia> @select(m, c1=sum(_c1), c3=mean(_c3), by[:c2], by[c4=_c1 .> 5])
2 x 2 LabeledArray
c4 |false      |true
---+-----------+----------
c2 |c1 c3      |c1 c3
---+-----------+----------
a  |10 14.0    |6  20.0
b  |5          |