risingwave_expr_macro

Attribute Macro function

#[function]
Defines a RisingWave SQL function from a Rust function.

The following example demonstrates a simple usage:

#[function("add(int32, int32) -> int32")]
fn add(x: i32, y: i32) -> i32 {
    x + y
}

§SQL Function Signature

Each function must have a signature, specified in the function("...") part of the macro invocation. The signature follows this pattern:

name ( [arg_types],* [...] ) [ -> [setof] return_type ]

Where name is the function name in snake_case, which must match the function name (in UPPER_CASE) defined in proto/expr.proto.

arg_types is a comma-separated list of argument types. The allowed data types are listed in the name column of the appendix's type matrix. Wildcards or auto can also be used, as explained below. If the function is variadic, the last argument can be written as three dots (...).

When setof appears before the return type, this indicates that the function is a set-returning function (table function), meaning it can return multiple values instead of just one. For more details, see the section on table functions.

If no return type is specified, the function returns void. However, the void type is not supported in our type system, so such a function currently returns a null value of type int.
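To make the grammar above concrete, here is a minimal sketch of how such a signature string could be split into its parts. It uses only the standard library; the `Signature` struct and `parse_signature` function are hypothetical names for illustration and are not the macro's actual parser.

```rust
/// Parsed form of a signature string (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct Signature {
    name: String,
    args: Vec<String>,
    variadic: bool,
    setof: bool,
    ret: Option<String>,
}

/// Splits `name(args) [-> [setof] ret]` into its parts.
/// Returns `None` if the parentheses are missing or misordered.
fn parse_signature(s: &str) -> Option<Signature> {
    let open = s.find('(')?;
    let close = s.rfind(')')?;
    if close < open {
        return None;
    }
    let name = s[..open].trim().to_string();
    let mut args: Vec<String> = s[open + 1..close]
        .split(',')
        .map(|a| a.trim().to_string())
        .filter(|a| !a.is_empty())
        .collect();
    // A trailing `...` marks the function as variadic.
    let variadic = args.last().map_or(false, |a| a.as_str() == "...");
    if variadic {
        args.pop();
    }
    // Everything after `->` is the optional return type.
    let (setof, ret) = match s[close + 1..].trim().strip_prefix("->") {
        Some(rest) => {
            let rest = rest.trim();
            match rest.strip_prefix("setof ") {
                Some(ty) => (true, Some(ty.trim().to_string())),
                None => (false, Some(rest.to_string())),
            }
        }
        None => (false, None),
    };
    Some(Signature { name, args, variadic, setof, ret })
}
```

With this sketch, "unnest(anyarray) -> setof any" parses as a set-returning function, and "pg_sleep(float64)" parses with no return type.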

§Multiple Function Definitions

Multiple #[function] macros can be applied to a single generic Rust function to define multiple SQL functions of different types. For example:

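use std::ops::Add;

#[function("add(int16, int16) -> int16")]
#[function("add(int32, int32) -> int32")]
#[function("add(int64, int64) -> int64")]
// `Output = T` is required so that `x + y` has the declared return type.
fn add<T: Add<Output = T>>(x: T, y: T) -> T {
    x + y
}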

§Type Expansion with *

Types can be automatically expanded to multiple types using wildcards. Here are some examples:

  • *: expands to all types.
  • *int: expands to int16, int32, int64.
  • *float: expands to float32, float64.

For instance, #[function("cast(varchar) -> *int")] will be expanded to the following three functions:

#[function("cast(varchar) -> int16")]
#[function("cast(varchar) -> int32")]
#[function("cast(varchar) -> int64")]

Note the difference between * and any: * generates one function per type, whereas any generates a single function that operates on the dynamic data type Scalar. This is similar to impl Trait versus dyn Trait in Rust. Functions generated with * perform much better than those using any, but * is not always the right choice: in some cases any is more convenient. For example, in array functions the element type of ListValue is already Scalar(Ref)Impl, so converting it to and from each concrete T is unnecessary.
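The wildcard expansion can be pictured as a simple string rewrite. The sketch below is an illustration only: `expand_type` and `expand_cast` are hypothetical helpers, the bare * wildcard is left out, and the real macro performs this expansion at compile time rather than at runtime.

```rust
/// Expands a type pattern into concrete type names.
/// Only `*int` and `*float` are handled; any other name is returned unchanged.
fn expand_type(pattern: &str) -> Vec<String> {
    let names: &[&str] = match pattern {
        "*int" => &["int16", "int32", "int64"][..],
        "*float" => &["float32", "float64"][..],
        other => return vec![other.to_string()],
    };
    names.iter().map(|n| n.to_string()).collect()
}

/// Expands a one-argument signature such as `cast(varchar) -> *int`
/// into one concrete signature per combination of expanded types.
fn expand_cast(arg: &str, ret: &str) -> Vec<String> {
    let mut out = Vec::new();
    for a in expand_type(arg) {
        for r in expand_type(ret) {
            out.push(format!("cast({a}) -> {r}"));
        }
    }
    out
}
```

Calling `expand_cast("varchar", "*int")` yields the three cast signatures listed above.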

§Automatic Type Inference with auto

Correspondingly, the return type can be denoted as auto to be automatically inferred based on the input types. It will be inferred as the smallest type that can accommodate all input types.

For example, #[function("add(*int, *int) -> auto")] will be expanded to:

#[function("add(int16, int16) -> int16")]
#[function("add(int16, int32) -> int32")]
#[function("add(int16, int64) -> int64")]
#[function("add(int32, int16) -> int32")]
...

In particular, when there is only one input argument, auto is inferred as the type of that argument. For example, #[function("neg(*int) -> auto")] will be expanded to:

#[function("neg(int16) -> int16")]
#[function("neg(int32) -> int32")]
#[function("neg(int64) -> int64")]
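The "smallest type that can accommodate all input types" rule can be sketched for the integer family as picking the widest argument type. This is an assumption-laden illustration: `INT_TYPES`, `infer_auto` and `expand_add` are hypothetical names, and only the *int types are covered.

```rust
/// Integer types from the `*int` wildcard, ordered from narrowest to widest.
const INT_TYPES: [&str; 3] = ["int16", "int32", "int64"];

/// Returns the widest of the given integer type names.
/// Returns `None` if the input is empty or contains an unknown name.
fn infer_auto(args: &[&str]) -> Option<&'static str> {
    let mut widest: Option<usize> = None;
    for arg in args {
        let rank = INT_TYPES.iter().position(|t| t == arg)?;
        widest = Some(widest.map_or(rank, |w| w.max(rank)));
    }
    widest.map(|i| INT_TYPES[i])
}

/// Expands `add(*int, *int) -> auto` into all nine concrete signatures.
fn expand_add() -> Vec<String> {
    let mut out = Vec::new();
    for x in INT_TYPES {
        for y in INT_TYPES {
            let ret = infer_auto(&[x, y]).expect("both names are known");
            out.push(format!("add({x}, {y}) -> {ret}"));
        }
    }
    out
}
```

With a single argument, `infer_auto` returns that argument's own type, which matches the neg example.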

§Custom Type Inference Function with type_infer

A few functions might have a return type that dynamically changes based on the input argument types, such as unnest. This is mainly for composite types like anyarray, struct, and anymap.

In such cases, the type_infer option can be used to specify a function that infers the return type from the input argument types. Its function signature is:

fn(&[DataType]) -> Result<DataType>

For example:

#[function(
    "unnest(anyarray) -> setof any",
    type_infer = "|args| Ok(args[0].unnest_list())"
)]

This type inference function will be invoked at the frontend (infer_type_with_sigmap).
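As a rough illustration of what a type inference function does, the sketch below mimics the fn(&[DataType]) -> Result<DataType> shape with a toy `DataType` enum. The enum, the `String` error type and `infer_unnest` are simplified stand-ins invented for this example, not the real RisingWave types.

```rust
/// Toy stand-in for the real `DataType` (hypothetical, heavily simplified).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Varchar,
    List(Box<DataType>),
}

/// Same shape as a `type_infer` function: argument types in, return type out.
/// For `unnest`, the return type is the element type of the list argument.
fn infer_unnest(args: &[DataType]) -> Result<DataType, String> {
    match args {
        [DataType::List(elem)] => Ok((**elem).clone()),
        [other] => Err(format!("unnest expects a list, got {other:?}")),
        _ => Err(format!("unnest expects 1 argument, got {}", args.len())),
    }
}
```

Here a list of Int32 infers to Int32, while a non-list argument or a wrong argument count produces an error instead of a type.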

§Rust Function Signature

The #[function] macro can handle various types of Rust functions.

Each argument corresponds to the reference type in the type matrix.

The return value type can be the reference type or owned type in the type matrix.

For instance:

#[function("trim_array(anyarray, int32) -> anyarray")]
fn trim_array(array: ListRef<'_>, n: i32) -> ListValue {...}

§Nullable Arguments

The functions above are only called when all arguments are non-null; if any argument is null, the result is null without the function being called. To handle null arguments explicitly, use the Option type:

#[function("trim_array(anyarray, int32) -> anyarray")]
fn trim_array(array: ListRef<'_>, n: Option<i32>) -> ListValue {...}

This function will be called when n is null, but not when array is null.
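The null-handling rule can be sketched as a wrapper around the user function. The code below is a plain-Rust illustration with hypothetical function names; the real wrapper is generated by the macro and operates on arrays rather than single values.

```rust
/// A strict function: written for non-null inputs only.
fn add_strict(x: i32, y: i32) -> i32 {
    x + y
}

/// Sketch of the wrapper for a strict function: null in, null out.
fn call_strict(x: Option<i32>, y: Option<i32>) -> Option<i32> {
    match (x, y) {
        (Some(x), Some(y)) => Some(add_strict(x, y)),
        _ => None, // the user function is never invoked
    }
}

/// A function that opts in to seeing a null second argument.
/// A null count is treated as 1, and negative counts as 0.
fn repeat_or_once(s: &str, n: Option<i32>) -> String {
    s.repeat(n.unwrap_or(1).max(0) as usize)
}

/// Sketch of its wrapper: only the non-`Option` parameter short-circuits.
fn call_nullable(s: Option<&str>, n: Option<i32>) -> Option<String> {
    s.map(|s| repeat_or_once(s, n))
}
```

In this sketch `call_nullable(Some("ab"), None)` still reaches the user function, whereas `call_nullable(None, Some(2))` does not.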

§Return NULLs and Errors

Similarly, the return value type can be one of the following:

  • T: Indicates that a non-null value is always returned (for non-null inputs), and errors will not occur.
  • Option<T>: Indicates that a null value may be returned, but errors will not occur.
  • Result<T>: Indicates that an error may occur, but a null value will not be returned.
  • Result<Option<T>>: Indicates that a null value may be returned, and an error may also occur.
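One way to think about the four return shapes is that each can be widened to the most general one, Result<Option<T>>. The sketch below shows this for i32 with a hypothetical `IntoDatum` trait and a `String` error type; both are assumptions made for illustration, not the macro's actual mechanism.

```rust
/// Hypothetical helper trait: widens each return shape to
/// `Result<Option<i32>, String>` (outer layer = error, inner layer = null).
trait IntoDatum {
    fn into_datum(self) -> Result<Option<i32>, String>;
}

impl IntoDatum for i32 {
    fn into_datum(self) -> Result<Option<i32>, String> {
        Ok(Some(self))
    }
}
impl IntoDatum for Option<i32> {
    fn into_datum(self) -> Result<Option<i32>, String> {
        Ok(self)
    }
}
impl IntoDatum for Result<i32, String> {
    fn into_datum(self) -> Result<Option<i32>, String> {
        self.map(Some)
    }
}
impl IntoDatum for Result<Option<i32>, String> {
    fn into_datum(self) -> Result<Option<i32>, String> {
        self
    }
}

/// `T`: never null, never fails.
fn neg(x: i32) -> i32 {
    x.wrapping_neg()
}

/// `Option<T>`: may be null, never fails.
fn nullif_zero(x: i32) -> Option<i32> {
    if x == 0 { None } else { Some(x) }
}

/// `Result<T>`: may fail, never null.
fn div(x: i32, y: i32) -> Result<i32, String> {
    x.checked_div(y)
        .ok_or_else(|| "division by zero or overflow".to_string())
}
```

Choosing the narrowest shape that fits documents the function's behavior in its type.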

§Optimization

When all input and output types of the function are primitive types (refer to the type matrix) and the signature does not contain any Option or Result, the #[function] macro will automatically generate SIMD vectorized execution code.

Therefore, try to avoid returning Option and Result whenever possible.
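The following sketch contrasts the two loop shapes in plain Rust. It is not the code the macro generates; it only illustrates, under that assumption, why an infallible non-null kernel is a simpler target for vectorization than one that must check every row. Both function names are hypothetical.

```rust
/// Infallible, non-null kernel: a straight loop over two slices with no
/// per-row branching. `wrapping_add` is used here so the loop cannot panic.
fn add_batch(a: &[i32], b: &[i32]) -> Vec<i32> {
    a.iter().zip(b).map(|(x, y)| x.wrapping_add(*y)).collect()
}

/// Fallible kernel: every row is checked, and the loop stops at the
/// first overflow, so rows cannot be processed independently.
fn checked_add_batch(a: &[i32], b: &[i32]) -> Result<Vec<i32>, String> {
    a.iter()
        .zip(b)
        .map(|(x, y)| x.checked_add(*y).ok_or_else(|| "overflow".to_string()))
        .collect()
}
```

Whether a particular loop is actually vectorized depends on the compiler and target, so this should be read as a general tendency rather than a guarantee.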

§Variadic Function

Variadic functions accept an impl Row input that represents the trailing arguments. For example:

#[function("concat_ws(varchar, ...) -> varchar")]
fn concat_ws(sep: &str, vals: impl Row) -> Option<Box<str>> {
    let mut string_iter = vals.iter().flatten();
    // ...
}

See risingwave_common::row::Row for more details.
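For readers without the Row trait at hand, the sketch below completes the idea with a slice of Option<&str> standing in for the trailing impl Row. The slice parameter and the choice to always return Some are assumptions of this example, not the real implementation.

```rust
/// `vals` stands in for the trailing `impl Row`; each entry is a nullable value.
/// This sketch never returns null; it returns an empty string if no value is present.
fn concat_ws(sep: &str, vals: &[Option<&str>]) -> Option<Box<str>> {
    // `flatten` skips null entries, as in the snippet above.
    let mut iter = vals.iter().copied().flatten();
    let mut out = String::new();
    if let Some(first) = iter.next() {
        out.push_str(first);
    }
    for v in iter {
        out.push_str(sep);
        out.push_str(v);
    }
    Some(out.into_boxed_str())
}
```

Null entries are skipped entirely, so no separator is emitted for them.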

§Functions Returning Strings

For functions that return varchar types, you can also use the writer style function signature to avoid memory copying and dynamic memory allocation:

#[function("trim(varchar) -> varchar")]
fn trim(s: &str, writer: &mut impl Write) {
    writer.write_str(s.trim()).unwrap();
}

If errors may be returned, then the return value should be Result<()>:

#[function("trim(varchar) -> varchar")]
fn trim(s: &str, writer: &mut impl Write) -> Result<()> {
    writer.write_str(s.trim()).unwrap();
    Ok(())
}

If null values may be returned, then the return value should be Option<()>:

#[function("trim(varchar) -> varchar")]
fn trim(s: &str, writer: &mut impl Write) -> Option<()> {
    if s.is_empty() {
        None
    } else {
        writer.write_str(s.trim()).unwrap();
        Some(())
    }
}
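To show why the writer style avoids per-row allocation, here is a standalone sketch in which a hypothetical caller (`eval_column`) appends every row's output to one shared String and records where each value ends. The caller and its offset layout are assumptions for illustration; only std::fmt::Write is a real API here.

```rust
use std::fmt::Write;

/// Writer-style function: returns `None` for a null result.
fn trim(s: &str, writer: &mut impl Write) -> Option<()> {
    if s.is_empty() {
        None
    } else {
        writer.write_str(s.trim()).unwrap();
        Some(())
    }
}

/// Hypothetical caller: one shared buffer for all rows, plus the end
/// offset of each non-null value (`None` marks a null row).
fn eval_column(rows: &[&str]) -> (String, Vec<Option<usize>>) {
    let mut buf = String::new();
    let mut ends = Vec::new();
    for row in rows {
        // The function writes straight into `buf`; no per-row String is created.
        ends.push(trim(row, &mut buf).map(|()| buf.len()));
    }
    (buf, ends)
}
```

Because `String` implements `std::fmt::Write`, the same function can also be tested directly against a plain `String`.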

§Preprocessing Constant Arguments

When some input arguments of the function are constants, they can be preprocessed to avoid calculations every time the function is called.

A classic use case is regular expression matching:

#[function(
    "regexp_match(varchar, varchar, varchar) -> varchar[]",
    prebuild = "RegexpContext::from_pattern_flags($1, $2)?"
)]
fn regexp_match(text: &str, regex: &RegexpContext) -> ListValue {
    regex.captures(text).collect()
}

The prebuild option takes a Rust expression of the form Type::method(...) that constructs a value of Type from the function's input arguments. Here $1 and $2 refer to the second and third arguments of the function (indexed from 0), both of type &str. In the Rust function signature, the parameters at these positions are omitted and replaced by one extra parameter at the end.

This macro generates two versions of the function. If all the input parameters that prebuild depends on are constants, it will precompute them during the build function. Otherwise, it will compute them for each input row during evaluation. This way, we support both constant and variable inputs while optimizing performance for constant inputs.
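The two generated versions can be sketched as an enum that either holds a value built once or rebuilds it for each row. Everything in this example is a hypothetical stand-in: `Needle` plays the role of RegexpContext, and `PatternArg` and `eval_row` only illustrate the constant versus per-row split.

```rust
/// Toy stand-in for `RegexpContext`: a precomputed, lowercased search key.
struct Needle(String);

impl Needle {
    fn from_pattern(pattern: &str) -> Needle {
        Needle(pattern.to_lowercase())
    }

    fn matches(&self, text: &str) -> bool {
        text.to_lowercase().contains(self.0.as_str())
    }
}

/// The user-written function only ever sees the prebuilt value.
fn contains_ci(text: &str, needle: &Needle) -> bool {
    needle.matches(text)
}

/// How the pattern argument reaches the function at evaluation time.
enum PatternArg {
    /// The pattern was a constant: built once, when the expression was built.
    Const(Needle),
    /// The pattern is a column: built again for every row.
    PerRow,
}

/// Evaluates one row; `pattern` is only read in the per-row case.
fn eval_row(arg: &PatternArg, text: &str, pattern: &str) -> bool {
    match arg {
        PatternArg::Const(needle) => contains_ci(text, needle),
        PatternArg::PerRow => contains_ci(text, &Needle::from_pattern(pattern)),
    }
}
```

The user function is identical in both cases; only the place where `Needle::from_pattern` runs changes.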

§Context

If a function needs to obtain type information at runtime, you can add an &Context parameter to the function signature. For example:

#[function("foo(int32) -> int64")]
fn foo(a: i32, ctx: &Context) -> i64 {
    assert_eq!(ctx.arg_types[0], DataType::Int32);
    assert_eq!(ctx.return_type, DataType::Int64);
    // ...
}

§Async Function

Functions can be asynchronous.

#[function("pg_sleep(float64)")]
async fn pg_sleep(second: F64) {
    tokio::time::sleep(Duration::from_secs_f64(second.0)).await;
}

Asynchronous functions will be evaluated on rows sequentially.

§Table Function

A table function is a special kind of function that can return multiple values instead of just one. Its function signature must include the setof keyword, and the Rust function should return an iterator of the form impl Iterator<Item = T> or its derived types.

For example:

#[function("generate_series(int32, int32) -> setof int32")]
fn generate_series(start: i32, stop: i32) -> impl Iterator<Item = i32> {
    start..=stop
}

Likewise, the return value Iterator can include Option or Result either internally or externally. For instance:

  • impl Iterator<Item = Result<T>>
  • Result<impl Iterator<Item = T>>
  • Result<impl Iterator<Item = Result<Option<T>>>>

Currently, table function arguments do not support the Option type. That is, the function will only be invoked when all arguments are not null.
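The difference between an external and an internal Result can be shown with two small iterator-returning functions in plain Rust. Both are illustrative assumptions: the three-argument `generate_series` and the `split_to_ints` function are hypothetical, and `String` stands in for the real error type.

```rust
/// External `Result`: the arguments are validated once, before any row
/// is produced. A non-positive step is rejected up front.
fn generate_series(
    start: i32,
    stop: i32,
    step: i32,
) -> Result<impl Iterator<Item = i32>, String> {
    if step <= 0 {
        return Err("step must be positive".to_string());
    }
    Ok((start..=stop).step_by(step as usize))
}

/// Internal `Result`: each produced row can fail on its own, and later
/// rows are still produced after a failing one.
fn split_to_ints(csv: &str) -> impl Iterator<Item = Result<i32, String>> + '_ {
    csv.split(',')
        .map(|part| part.trim().parse::<i32>().map_err(|e| e.to_string()))
}
```

An external error means no rows at all, whereas an internal error affects only the row that produced it.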

§Registration and Invocation

Every function defined by #[function] is automatically registered in the global function table.

You can build expressions through the following functions:

// scalar functions
risingwave_expr::expr::build(...) -> BoxedExpression
risingwave_expr::expr::build_from_prost(...) -> BoxedExpression
// table functions
risingwave_expr::table_function::build(...) -> BoxedTableFunction
risingwave_expr::table_function::build_from_prost(...) -> BoxedTableFunction

Or get their metadata through the following functions:

// scalar functions
risingwave_expr::sig::func::FUNC_SIG_MAP::get(...)
// table functions
risingwave_expr::sig::table_function::FUNC_SIG_MAP::get(...)

§Appendix: Type Matrix

§Base Types

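| name | SQL type | owned type | reference type | primitive? |
|---|---|---|---|---|
| boolean | boolean | bool | bool | yes |
| int2 | smallint | i16 | i16 | yes |
| int4 | integer | i32 | i32 | yes |
| int8 | bigint | i64 | i64 | yes |
| int256 | rw_int256 | Int256 | Int256Ref<'_> | no |
| float4 | real | F32 | F32 | yes |
| float8 | double precision | F64 | F64 | yes |
| decimal | numeric | Decimal | Decimal | yes |
| serial | serial | Serial | Serial | yes |
| date | date | Date | Date | yes |
| time | time | Time | Time | yes |
| timestamp | timestamp | Timestamp | Timestamp | yes |
| timestamptz | timestamptz | Timestamptz | Timestamptz | yes |
| interval | interval | Interval | Interval | yes |
| varchar | varchar | Box<str> | &str | no |
| bytea | bytea | Box<[u8]> | &[u8] | no |
| jsonb | jsonb | JsonbVal | JsonbRef<'_> | no |
| any | any | ScalarImpl | ScalarRefImpl<'_> | no |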
§Composite Types

| name | SQL type | owned type | reference type |
|---|---|---|---|
| anyarray | any[] | ListValue | ListRef<'_> |
| struct | record | StructValue | StructRef<'_> |
| T[] (note 1) | T[] | ListValue | ListRef<'_> |
| struct<name_T, ..> (note 1) | struct<name T, ..> | (T, ..) | (&T, ..) |

Note 1: T can be any base type.