Expand description
The solver for delta join, which determines lookup order of a join plan.
All collection types in this module should be BTree
to ensure determinism between runs.
§Representation of Multi-Way Join
In this module, a*
means lookup executor that looks-up the arrangement a
of the current
epoch. a
means looks-up arrangement a
of the previous epoch.
Delta joins only support inner equal join. The solver is based on the following formula (take 3-way join as an example):
d((A join1 B) join2 C)
= ((A + dA) join1 (B + dB)) join2 (C + dC) - A join1 B join2 C
= (A join1 B + A join1 dB + dA join1 (B + dB)) join2 (C + dC) - A join1 B join2 C
= A join1 B join2 (C + dC) + A join1 dB join2 (C + dC) + dA join1 (B + dB) join2 (C + dC) - A join1 B join2 C
= A join1 B join2 dC + A join1 dB join2 (C + dC) + dA join1 (B + dB) join2 (C + dC)
= dA join1 (B + dB) join2 (C + dC) + dB join1 A join2 (C + dC) + dC join2 B join1 A
join1 means A join B using condition #1,
join2 means B join C using condition #2.
Inner joins satisfy commutative law and associative laws, so we can switch them back and forth between joins.
… which generates the following look-up graph:
a -> b* -> c* -> output3
b -> a -> c* -> output2
c -> b -> a -> output1
And the final output is output1 <concat> output2 <concat> output3
. The concatenation is
ordered: all items from output 1 must appear before output 2.
TODO: support dynamic filter and filter condition in lookups.
§What need the caller do?
After solving the delta join plan, caller will need to do a lot of things.
- Use the correct index for every stream input. By looking at the first lookup fragment in each
row, we can decide whether to use
a.x
ora.y
as index for stream input. - Insert exchanges between lookups of different distribution. Generally, if the whole row is
operating on the same join key, we only need to do 1v1 exchanges between lookups. However, it
would be possible that a row of lookup first join
a.x == b.x
, thena.y == c.y
. In this case, we will need to insert hash exchange between these two lookups. - Ensure the order of union. Always union from the last row to the first row.
- Insert exchange before union. Still the case for
a.x == b.x
, thena.y == c.y
, it is possible that every lookup path produces different distribution. We need to shuffle them before feeding data to union.
Structs§
- Given a multi-way inner join plan, the solver will produces the most efficient way of getting those join done.
- Represents whether
left
andright
can be joined using a condition. - A row in the lookup plan, which includes the stream side arrangement, and all arrangements to be placed in the lookup node.
Enums§
- Decides how to place arrangements over lookup nodes. Given a 3-way join example:
- Decides how to place stream inputs. Given a 3-way join example: