Specifics of JOIN
You can join data through the dataset creation interface by dragging tables to the workspace and configuring links between them using the
JOIN operator. For YTsaurus tables, a join like this causes an error. It is due to table storage structure and the default query logic in YTsaurus.
Storing table data
The table storage schema in YTsaurus differs from that in ClickHouse:
In YTsaurus, the basic data storage primitive is a static table with rows grouped into chunks that are randomly distributed across a cluster. By default, the data is distributed across shards based on a sharding key. Static tables are highly inefficient for point data reads.
In ClickHouse, sharding ensures locality of rows with a single sharding expression value per VM.
YTsaurus supports sorted tables: a table schema may factor in its row sorting by a column prefix. Columns like this are called key columns. This is a more efficient storage primitive: a dynamic table.
In YTsaurus, sorting ensures locality of rows with a single key value per chunk (or in a set of successive chunks).
Simple SELECT query in CHYT
A SELECT with no
JOIN is basically a streaming query. It reads data from the only source and performs row processing. You can apply it to function and aggregation fields.
In CHYT, this query is implemented as follows:
- A query is made to a clique instance. The instance is called a query coordinator, as it determines the subsequent execution plan.
- The coordinator sends a rewritten query to each instance over the internal TCP. In this query, the input table name is replaced with an encoded sequence of
chunk slices, where a
chunk sliceis a certain range within a chunk. (The range is set using border row numbers or keys.)
- Each instance executes the rewritten query and returns its result to the coordinator that performs its final aggregation (if required) and only then sends a response to the user.
Types of queries using JOIN in ClickHouse
Like a simple query, a SELECT that uses a
JOIN is sent to a clique instance that is also called a query coordinator. It determines the subsequent query execution plan, while the load distribution across instances depends on how
lhs JOIN rhs USING/ON is executed:
A distributed local JOIN is run in ClickHouse by default. If tables are sharded in the same way, a matching key pair may not be on different VMs. Therefore, the
rhsare interpreted on each instance as respective local tables. In this case, the query that the coordinator splits in parts can be executed independently on each instance.
This method requires that the tables use the same sharding scheme.
A GLOBAL JOIN is executed if you add the
GLOBALkeyword to a
JOIN. The right-hand argument,
rhs, is fully executed and materialized in the coordinator. Next, its serialized representation is sent along with the query across the instances. They use this representation to retrieve the right-hand side in their memory.
- This method is efficient if the
rhsis relatively small and there are relatively few instances. Otherwise, there may be a shortage of resources when the coordinator is executing the
rhsor sending the query's serialized representation with a subquery across the network.
- This method does not require meeting any additional conditions for table data storage or sharding consistency.
- This method is efficient if the
JOIN via subqueries. ClickHouse allows enclosing the
rhsin brackets, which significantly affects the execution plan:Enclosing the LHS in bracketsEnclosing the RHS in brackets
- The LHS is executed independently in the coordinator.
- The RHS is executed independently.
- The RHS is placed in-memory in a hash table.
JOINis fully executed in the coordinator only.
The LHS is executed independently in the coordinator as if the query looked just like a SELECT
The coordinator sends its queries to instances leaving a
rhs) as is.
Each instance executes the
This may cause the load to increase significantly, since each instance will materialize the RHS on its own. Therefore, the ClickHouse security mechanism disables this behavior by default. As a result, the
Double-distributed IN/JOIN subqueries is deniederror occurs.
Specifics of queries using a JOIN in CHYT
When making a SELECT query that uses a
JOIN in CHYT, the Sorted JOIN logic applies by default. It is based on the same column sorting. A regular
lhs JOIN rhs USING/ON is used.
In this case, additional restrictions are imposed on the
rhsmust be sorted tables.
USING/ONclause must use sorted columns only.
- Let's assume that the
lhsis sorted by columns
l1, l2, ..., lnand the
rhs, by columns
r1, r2, ..., rm.
ONclause should look like a set of
l1 = r1 , ..., lk = rkequations for a
k(the equations can be listed in any order).
- This can be represented as a set of equations in the
ONclause and as a set of general key columns in the
USINGstatement, but not as a set of equations in the
If these conditions are met, the query coordinator generates pairs of matching ranges from the
rhs and distributes them across the instances in subqueries.
Otherwise, an error is returned. In this case, you should either use a
GLOBAL JOIN or enclose the
rhs in a subquery.