The datashape language is a DSL that describes the structure of data independently of any particular implementation in a programming language or file format. Where the Python library NumPy describes an array with a separate shape and dtype, datashape combines the two into a single type and adds a syntax for describing structured data.
Some of the basic features include:
Here are some simple examples to motivate the idea:
# Scalar types
bool
int32
float64
# Arrays
3 * 4 * int32
10 * var * float64
3 * complex[float64]
# Array of Structures
100 * {
name: string,
birthday: date,
address: {
street: string,
city: string,
postalcode: string,
country: string
}
}
# Structure of Arrays
{
x: 100 * 100 * float32,
y: 100 * 100 * float32,
u: 100 * 100 * float32,
v: 100 * 100 * float32,
}
# Structure with strings for field names
{
'field 0': 100 * float32,
'field 1': float32,
'field 2': float32,
}
# Array of Tuples
20 * (int32, float64)
# Function prototype
(3 * int32, float64) -> 3 * float64
# Function prototype with broadcasting dimensions
(A... * int32, A... * int32) -> A... * int32
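To relate the array examples above to NumPy's separate shape and dtype, the following is a minimal sketch that joins a shape tuple and a dtype name into one datashape string. The helper name `to_datashape` is hypothetical and is not part of any library; it only handles fixed integer dimensions and a dtype that is already spelled as a datashape type name.

```python
def to_datashape(shape, dtype_name):
    """Join a NumPy-style shape tuple and a dtype name into a datashape string.

    Hypothetical helper for illustration: every dimension must be a fixed
    integer, and dtype_name is assumed to be a valid datashape type name.
    """
    dims = [str(n) for n in shape]
    return " * ".join(dims + [dtype_name])


print(to_datashape((3, 4), "int32"))  # 3 * 4 * int32
print(to_datashape((), "float64"))    # float64
```

A zero-dimensional shape yields just the dtype, which matches the scalar examples at the top of the listing.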
Many syntax elements in datashape are syntactic sugar for particular type constructors. For dtypes, the mappings are:
{x : int32, y : int16} => struct[['x', 'y'], [int32, int16]]
(int64, float32) => tuple[[int64, float32]]
(int64, float32) -> bool => funcproto[[int64, float32], bool]
DTypeVar => typevar['DTypeVar']
For dims, this is:
3 * int32 => fixed[3] * int32
DimVar * int32 => typevar['DimVar'] * int32
... * int32 => ellipsis * int32
DimVar... * int32 => ellipsis['DimVar'] * int32
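The dimension mappings above can be sketched as a small string rewrite. This is an illustrative approximation only: the function names `desugar_dim` and `desugar_dims` are hypothetical, and the naive split on `*` assumes the dtype is a bare name with no nested datashape inside brackets, braces, or parentheses.

```python
import re

NAME_UPPER = r"[A-Z][a-zA-Z0-9_]*"


def desugar_dim(dim):
    """Rewrite one sugared dimension into its type-constructor form."""
    dim = dim.strip()
    if dim == "...":
        return "ellipsis"
    if re.fullmatch(r"0|[1-9][0-9]*", dim):
        return "fixed[%s]" % dim
    m = re.fullmatch(r"(%s)\.\.\." % NAME_UPPER, dim)
    if m:
        return "ellipsis['%s']" % m.group(1)
    if re.fullmatch(NAME_UPPER, dim):
        return "typevar['%s']" % dim
    return dim  # for example `var`, which has no sugared form


def desugar_dims(datashape):
    """Desugar every dimension of a simple `dim * ... * dtype` string."""
    # Assumption: the dtype contains no `*`, so a plain split is safe.
    *dims, dtype = [part.strip() for part in datashape.split("*")]
    return " * ".join([desugar_dim(d) for d in dims] + [dtype])


for example in ["3 * int32", "DimVar * int32", "... * int32", "DimVar... * int32"]:
    print(example, "=>", desugar_dims(example))
```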
Dimension Type Symbol Table:
# Variable-sized dimension
var
Data Type Symbol Table:
# Numeric
bool
# Two's complement binary integers
int8
int16
int32
int64
int128
# Unsigned binary integers
uint8
uint16
uint32
uint64
uint128
# IEEE 754-2008 binary floating point numbers (binary16 through binary128)
float16
float32
float64
float128
# IEEE 754-2008 decimal floating point numbers (decimal32 through decimal128)
decimal32
decimal64
decimal128
# Arbitrary precision integer
bignum
# Alias for int32
int
# Alias for float64
real
# Alias for complex[float64]
complex
# Alias for int32 or int64 depending on platform
intptr
# Alias for uint32 or uint64 depending on platform
uintptr
# A unicode string
string
# A single unicode code point
char
# A blob of bytes
bytes
# A date
date
# A string containing JSON
json
# No data
void
Data Type Constructor Symbol Table:
# complex[float32], complex[type=float64]
complex
# string['ascii'], string[enc='cp949']
string
# bytes[size=4,align=2]
bytes
# datetime[unit='minutes',tz='CST']
datetime
# categorical[type=string, values=['low', 'medium', 'high']]
categorical
# option[float64]
option
# pointer[target=2 * 3 * int32]
pointer
Tokens:
NAME_LOWER : [a-z][a-zA-Z0-9_]*
NAME_UPPER : [A-Z][a-zA-Z0-9_]*
NAME_OTHER : _[a-zA-Z0-9_]*
ASTERISK : \*
COMMA : ,
EQUAL : =
COLON : :
LBRACKET : \[
RBRACKET : \]
LBRACE : \{
RBRACE : \}
LPAREN : \(
RPAREN : \)
ELLIPSIS : \.\.\.
RARROW : ->
INTEGER : 0(?![0-9])|[1-9][0-9]*
STRING : (?:"(?:[^"\n\r\\]|(?:\\u[0-9a-fA-F]{4})|(?:\\["bfnrt]))*")|(?:'(?:[^'\n\r\\]|(?:\\u[0-9a-fA-F]{4})|(?:\\['bfnrt]))*')
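As a rough illustration of how the token table could drive a lexer, the sketch below combines the patterns into one regular expression using Python's standard `re` module. It is not the datashape library's own lexer. The `SKIP` rule for whitespace is an assumption (the table does not say how whitespace is handled), the STRING pattern is written in a balanced form with one branch per quote style, and the names `TOKEN_SPEC` and `tokenize` are hypothetical.

```python
import re

# String literal patterns, one per quote style.
DQ_STRING = r'"(?:[^"\n\r\\]|\\u[0-9a-fA-F]{4}|\\["bfnrt])*"'
SQ_STRING = r"'(?:[^'\n\r\\]|\\u[0-9a-fA-F]{4}|\\['bfnrt])*'"

TOKEN_SPEC = [
    ("NAME_LOWER", r"[a-z][a-zA-Z0-9_]*"),
    ("NAME_UPPER", r"[A-Z][a-zA-Z0-9_]*"),
    ("NAME_OTHER", r"_[a-zA-Z0-9_]*"),
    ("ELLIPSIS", r"\.\.\."),
    ("RARROW", r"->"),
    ("ASTERISK", r"\*"),
    ("COMMA", r","),
    ("EQUAL", r"="),
    ("COLON", r":"),
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("INTEGER", r"0(?![0-9])|[1-9][0-9]*"),
    ("STRING", DQ_STRING + "|" + SQ_STRING),
    ("SKIP", r"\s+"),  # assumption: whitespace separates tokens and is dropped
]

# One named group per token; inner groups are non-capturing, so
# match.lastgroup always names the token that matched.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs for a datashape string."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("A... * string['ascii']"))
```

The negative lookahead in INTEGER means an input such as `010` is rejected rather than read as two tokens.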
Grammar:
# Asterisk-separated list of dimensions, followed by data type
datashape : dim ASTERISK datashape
| dtype
# Dimension Type (from the dimension type symbol table)
dim : typevar
| ellipsis_typevar
| type
| type_constr
| INTEGER
| ELLIPSIS
# Data Type (from the data type symbol table)
dtype : typevar
| type
| type_constr
| struct_type
| funcproto_or_tuple_type
# A type variable
typevar : NAME_UPPER
# A type variable with ellipsis
ellipsis_typevar : NAME_UPPER ELLIPSIS
# A bare type (from the data type symbol table)
type : NAME_LOWER
# Type Constructor (from the data type constructor symbol table)
type_constr : NAME_LOWER LBRACKET type_arg_list RBRACKET
# Type Constructor: list of arguments
type_arg_list : type_arg COMMA type_arg_list
| type_kwarg_list
| type_arg
# Type Constructor: list of keyword arguments
type_kwarg_list : type_kwarg COMMA type_kwarg_list
| type_kwarg
# Type Constructor : single argument
type_arg : datashape
| INTEGER
| STRING
| list_type_arg
# Type Constructor : single keyword argument
type_kwarg : NAME_LOWER EQUAL type_arg
# Type Constructor : single list argument
list_type_arg : LBRACKET RBRACKET
| LBRACKET datashape_list RBRACKET
| LBRACKET integer_list RBRACKET
| LBRACKET string_list RBRACKET
datashape_list : datashape COMMA datashape_list
| datashape
integer_list : INTEGER COMMA integer_list
| INTEGER
string_list : STRING COMMA string_list
| STRING
# Struct/Record type (allowing for a trailing comma)
struct_type : LBRACE struct_field_list RBRACE
| LBRACE struct_field_list COMMA RBRACE
struct_field_list : struct_field COMMA struct_field_list
| struct_field
struct_field : struct_field_name COLON datashape
struct_field_name : NAME_LOWER
| NAME_UPPER
| NAME_OTHER
| STRING
# Function prototype is a tuple with an arrow to the output type
funcproto_or_tuple_type : tuple_type RARROW datashape
| tuple_type
# Tuple type (allowing for a trailing comma)
tuple_type : LPAREN tuple_item_list RPAREN
| LPAREN tuple_item_list COMMA RPAREN
tuple_item_list : datashape COMMA tuple_item_list
| datashape
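To show how these productions fit together, here is a minimal recursive-descent sketch for a subset of the grammar: dimensions, bare types, type variables, structs, tuples, and function prototypes. It deliberately leaves out type constructors, string literals, and NAME_OTHER field names. The class and function names and the tuple-based tree shape are hypothetical choices made for this illustration, not the datashape library's API.

```python
import re

# Simplified tokens: string literals and NAME_OTHER are left out of this sketch.
TOKEN_RE = re.compile(r"""
    (?P<NAME_LOWER>[a-z][a-zA-Z0-9_]*)
  | (?P<NAME_UPPER>[A-Z][a-zA-Z0-9_]*)
  | (?P<INTEGER>0(?![0-9])|[1-9][0-9]*)
  | (?P<ELLIPSIS>\.\.\.)
  | (?P<RARROW>->)
  | (?P<PUNCT>[*,:{}()])
  | (?P<SKIP>\s+)
""", re.VERBOSE)


def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("bad character at position %d" % pos)
        if m.lastgroup != "SKIP":
            # Punctuation tokens use their own character as the kind.
            kind = m.group() if m.lastgroup == "PUNCT" else m.lastgroup
            tokens.append((kind, m.group()))
        pos = m.end()
    return tokens + [("EOF", "")]


class Parser:
    def __init__(self, text):
        self.toks, self.i = tokenize(text), 0

    def peek(self):
        return self.toks[self.i][0]

    def take(self, kind):
        tok_kind, value = self.toks[self.i]
        if tok_kind != kind:
            raise ValueError("expected %s, got %s" % (kind, tok_kind))
        self.i += 1
        return value

    def datashape(self):
        # datashape : dim ASTERISK datashape | dtype
        start = self.i
        dim = self.try_dim()
        if dim is not None and self.peek() == "*":
            self.take("*")
            return ("dim", dim, self.datashape())
        self.i = start  # not followed by `*`, so re-read it as a dtype
        return self.dtype()

    def try_dim(self):
        kind = self.peek()
        if kind == "INTEGER":
            return ("fixed", int(self.take(kind)))
        if kind == "ELLIPSIS":
            self.take(kind)
            return ("ellipsis",)
        if kind == "NAME_UPPER":
            name = self.take(kind)
            if self.peek() == "ELLIPSIS":
                self.take("ELLIPSIS")
                return ("ellipsis", name)
            return ("typevar", name)
        if kind == "NAME_LOWER":
            return ("type", self.take(kind))
        return None

    def dtype(self):
        kind = self.peek()
        if kind == "NAME_UPPER":
            return ("typevar", self.take(kind))
        if kind == "NAME_LOWER":
            return ("type", self.take(kind))
        if kind == "{":
            return self.sequence("{", "}", self.struct_field, "struct")
        if kind == "(":
            tup = self.sequence("(", ")", self.datashape, "tuple")
            if self.peek() == "RARROW":
                self.take("RARROW")
                return ("funcproto", tup[1], self.datashape())
            return tup
        raise ValueError("expected a dtype, got %s" % kind)

    def struct_field(self):
        kind = self.peek()
        if kind not in ("NAME_LOWER", "NAME_UPPER"):
            raise ValueError("expected a field name, got %s" % kind)
        name = self.take(kind)
        self.take(":")
        return (name, self.datashape())

    def sequence(self, opener, closer, item, label):
        # Non-empty comma-separated list, allowing one trailing comma.
        self.take(opener)
        items = [item()]
        while self.peek() == ",":
            self.take(",")
            if self.peek() == closer:
                break
            items.append(item())
        self.take(closer)
        return (label, items)


def parse(text):
    parser = Parser(text)
    tree = parser.datashape()
    parser.take("EOF")
    return tree


print(parse("20 * (int32, float64)"))
```

Because both `dim` and `dtype` can start with a name, the sketch reads a candidate dimension first and backtracks when no `*` follows; a production parser might instead use one token of lookahead.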