Pattern Matching DataShapes =========================== DataShape includes type variables, as symbols beginning with a capital letter. For example `A * int32` represents a one-dimensional array of `int32`, where the size or type of the dimension is unspecified. Similarly, `3 * A` represents a size 3 one-dimensional array where the data type is unspecified. The main usage of pattern matching in the DataShape system is for specifying function signatures. To provide a little bit of motivation, let's examine what happens in NumPy ufuncs, and see how we can represent their behaviors via DataShape types. NumPy `ldexp` UFunc Example --------------------------- We're going to use the `ldexp` ufunc, which is for the C/C++ function with overloads `double ldexp(double x, int exp)` and `float ldexp(float x, int exp)`, computing `x * 2^exp` by tweaking the exponent in the floating point format. (We're ignoring the long double for now.) These C++ functions can be represented with the DataShape function signatures:: (float32, int32) -> float32 (float64, int32) -> float64 As a NumPy ufunc, there is an behavior for arrays, where the source arrays are "broadcasted" together, and the function is computed elementwise. In the simplest case, given two arrays which match, the result is an array of the same size. When one array has size one in a dimension, it gets repeated to match the size of the other dimension. When one array has fewer dimensions, it gets repeated to fill in the outer dimensions. The "broadcast" array shape is the result of combining all these repetitions, and is the shape of the output. Represented as DataShape function signatures, some examples are:: (12 * float32, 12 * int32) -> 12 * float32 (10 * float64, 1 * int32) -> 10 * float64 (float32, 3 * 4 * int32) -> 3 * 4 * float32 (3 * float64, 4 * 1 * int64) -> 4 * 3 * float64 Ellipsis for Broadcasting ------------------------- To represent the general broadcasting behavior, DataShape provides ellipsis type variables.:: (A... * float32, A... * int32) -> A... * float32 (A... * float64, A... * int64) -> A... * float64 Coercions/Broadcasting as a System of Equations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's say as input we get two arrays with datashapes `3 * 4 * float64` and `int32`. We can express this as two systems of coercion equations as follows (using ==> as a "coerces to" operator):: # float32 prototype 3 * 4 * float64 ==> A... * float32 int32 ==> A... * int32 # float64 prototype 3 * 4 * float64 ==> A... * float64 int32 ==> A... * int32 To solve these equations, we evaluate the legality of each coercion, and accumulate the set of values the `A...` type variable must take.:: # float32 prototype float64 ==> float32 # ILLEGAL 3 * 4 * ==> A... * # "3 * 4 *" in A... int32 ==> int32 # LEGAL * ==> A... # "*" in A... # float64 prototype float64 ==> float64 # LEGAL 3 * 4 * ==> A... * # "3 * 4 *" in A... int32 ==> int32 # LEGAL * ==> A... # "*" in A... The float32 prototype can be discarded because it requires an illegal coercion. In the float64 prototype, we collect the set of all `A...` values `{"3 * 4 *", "*"}`, broadcast them together to get `"3 * 4 *"`, and substitute this in the output. Doing all the substitutions in the full prototype produces:: (3 * 4 * float64, int32) -> 3 * 4 * float64 as the matched function prototype that results. Disallowing Coercion -------------------- In the particular function we picked, ideally we don't want to allow implicit coercion of the type, because the nature of the function is to "load the exponent" in particular formats of floating point number. Saying `ldexp(True, 3)`, and having it work is kind of weird. One way to tackle this would be to add an `exact` type, both as a dimension and a data type, which indicates that broadcasting should be disallowed. For the discussion, in addition to `ldexp`, lets introduce a vector magnitude function `mag`, where we want to disallow scalar arrays to broadcast into it.:: # ldexp signatures (A... * exact[float32], A... * int32) -> A... * float32 (A... * exact[float64], A... * int64) -> A... * float64 # mag signatures (A... * exact[2] * float32) -> A... * float32 (A... * exact[3] * float32) -> A... * float32 # ufunc but disallowing broadcasting (exact[A...] * int32, exact[A...] * int32) -> A... * int32 A possible syntactic sugar (which I'm not attached to, I think this needs some exploration) for this is:: # ldexp signatures (A... * float32=, A... * int32) -> A... * float32 (A... * float64=, A... * int64) -> A... * float64 # mag signatures (A... * 2= * float32) -> A... * float32 (A... * 3= * float32) -> A... * float32 # ufunc but disallowing broadcasting (A=.. * int32, A=.. * int32) -> A... * int32 Factoring a Set of Signatures ----------------------------- One of the main things the multiple dispatch in DataShape has to do is match input arrays against a set of signatures very efficiently. We need to be able to hide the abstraction we're creating, and provide performance competitive with, but ideally superior to, what NumPy provides in its ufunc system. Factoring the set of signatures into two or more stages which are simpler to solve and can prune the possibilities more quickly is one way to do this abstraction hiding. Let's use the `add` function for our example, with the following subset of signatures. We've included the `datetime` signatures to dispel any notion that the signatures will always match precisely.:: # add signatures (A... * int32, A... * int32) -> A... * int32 (A... * int64, A... * int64) -> A... * int64 (A... * float32, A... * float32) -> A... * float32 (A... * float64, A... * float64) -> A... * float64 (A... * timedelta, A... * timedelta) -> A... * timedelta (A... * datetime, A... * timedelta) -> A... * datetime (A... * timedelta, A... * datetime) -> A... * datetime Because the broadcasting of all these cases is identical, we can transform this set of signatures into two stages as follows:: # broadcasting stage (A... * X, A... * Y) -> A... * Z # data type stage matched against (X, Y) (int32, int32) -> int32 (int64, int64) -> int64 (float32, float32) -> float32 (float64, float64) -> float64 (timedelta, timedelta) -> timedelta (datetime, timedelta) -> datetime (timedelta, datetime) -> datetime Let's work through this example to illustrate how it works.:: # Stage 1: Input arrays "3 * 1 * int32", "4 * float32" # (A... * X, A... * Y) -> A... * Z int32 ==> X # "int32" in X 3 * 1 * ==> A... # "3 * 1 *" in A... float32 ==> Y # "float32" in Y 4 * ==> A... # "4 *" in A... # Solution: A... is "3 * 4 *", X is "int32", and Y is "float32" # Stage 2: Input arrays "int32" and "float32" # (int32, int32) -> int32 int32 ==> int32 # LEGAL float32 ==> int32 # ILLEGAL # (float32, float32) -> float32 int32 ==> float32 # LEGAL float32 ==> float32 # LEGAL # etc. # Assume we picked (float32, float32) -> float32 # so the variables are: # X is "float32" # Y is "float32" # Z is "float32" # giving the solution substituted into stage 1: (3 * 1 * float32, 4 * float32) -> 3 * 4 * float32