Get the most out of DuneSQL by writing efficient queries.
The `ethereum.transactions` table, for example, is stored in parquet files. Schematically, a (very simplified) parquet file looks like this:

*[Figure: schematic view of the very simplified contents of a parquet file]*
The min/max values of each column are used to efficiently skip entire parquet files or row groups within files while scanning through a table, provided that the column contains values that are not random and can be ordered. This is a key difference between DuneSQL and, for example, Snowflake.
If you query a table in DuneSQL, the system will access the data in the following order:

1. Read the table's metadata to locate the parquet files that make up the table.
2. Read each file's footer and compare the min/max values of the filtered columns against the query; files whose ranges cannot match are skipped entirely.
3. Do the same for the row groups within the remaining files, skipping row groups that cannot match.
4. Read the surviving row groups into memory.
`block_time`, `block_date`, and `block_number` are the best candidates for this, as they are sequentially ordered and can be used to efficiently skip entire parquet files or row groups within files while scanning through a table.
Let’s take a look at an example. Say you want to query the `ethereum.transactions` table for a specific transaction hash. The query would usually look like this:
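A minimal sketch of such a lookup (the hash below is a placeholder, not a real transaction):

```sql
-- look up a single transaction by its hash alone;
-- the hash value here is a placeholder
SELECT *
FROM ethereum.transactions
WHERE hash = 0xc0ffee0000000000000000000000000000000000000000000000000000000000
```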
Because transaction hashes are essentially random, the min/max values of the `hash` column rarely help. The engine can only skip a file or row group if its min/max range happens to exclude our hash, for example if a file contains only hashes between 0xd... and 0xf.... In that case, since our hash starts with 0xc..., there is no way the row group or file contains the hash we are looking for, and the engine can skip it. However, this is very unlikely to happen, and in most cases the engine will have to read all the files associated with the table.
Now let’s take a look at a query that uses a column that is not random and can be ordered, such as `block_number`:
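A sketch of the same lookup with an added `block_number` filter (hash and block number are placeholders):

```sql
-- the block_number filter lets the engine prune parquet files and row groups;
-- both literal values here are placeholders
SELECT *
FROM ethereum.transactions
WHERE hash = 0xc0ffee0000000000000000000000000000000000000000000000000000000000
  AND block_number = 14854700
```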
Now the engine can check whether the `block_number` we are searching for falls within the range of min/max values for the `block_number` column recorded in the footer of each parquet file. For example, if the engine starts reading the first parquet file and sees that the min/max values for the `block_number` column are 14854600 and 14854620, it will know that the file does not contain the data needed for the query. It will skip the file and move on to the next one. This significantly reduces the amount of data that needs to be read into memory, which improves query performance.
Whenever possible, include a filter on `block_number`, `block_time`, or `block_date` in your queries.
The above example is fairly theoretical, as you will most likely not be querying the `ethereum.transactions` table for a specific transaction hash. However, the same principle applies to other tables and queries. For example, if we want to query `ethereum.traces` for all calls to a specific smart contract, we can use `block_number`, `block_time`, or `block_date` to skip entire parquet files and row groups within files while scanning through the table.
Usually you would write a query like this:
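A sketch of the naive version, filtering only on the contract address (the address is a placeholder):

```sql
-- filter on the called contract address only; address values are effectively
-- random with respect to the table's ordering, so no files can be skipped
-- (the address below is a placeholder)
SELECT *
FROM ethereum.traces
WHERE "to" = 0x510e1f2d0f4fcc50b1b0c918a8e47d04b178a227
```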
A more efficient version also filters on `block_number`, requiring it to be greater than a specific value. For example, if we know that the smart contract was deployed in block 17580248, we can write the query like this:
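A sketch of the improved version (same placeholder address):

```sql
-- the block_number filter lets the engine skip every parquet file and
-- row group whose range ends before block 17580248
SELECT *
FROM ethereum.traces
WHERE block_number > 17580247
  AND "to" = 0x510e1f2d0f4fcc50b1b0c918a8e47d04b178a227
```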
By including the `block_number` column in the query, the engine can quickly narrow the search to files and row groups that have a `block_number` greater than 17580247. This significantly reduces the amount of data that needs to be read, thereby improving query performance. Additionally, selecting only the columns you actually need will further improve query performance.
We can also apply the same principle to joins. For example, if we want to join `ethereum.transactions` with `uniswap_v3_ethereum.Pair_evt_Swap`, we can write the query like this:
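A sketch of such a join, assuming the decoded event table's standard `evt_tx_hash` and `evt_block_number` columns:

```sql
-- adding block_number to the join condition enables file and row-group
-- pruning on ethereum.transactions
SELECT t.hash, t.block_number
FROM ethereum.transactions t
JOIN uniswap_v3_ethereum.Pair_evt_Swap s
  ON t.hash = s.evt_tx_hash
  AND t.block_number = s.evt_block_number
```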
By including the `block_number` column in the join condition, the engine can quickly narrow the search in `ethereum.transactions` to files and row groups that contain the `block_number` referenced by a given row in `uniswap_v3_ethereum.Pair_evt_Swap`. This significantly reduces the amount of data that needs to be read, thereby improving query performance.
It also makes a difference whether we join `ethereum.transactions` to `uniswap_v3_ethereum.Pair_evt_Swap` or the other way around. The following query will be even faster:
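A sketch with the join order reversed, starting from the smaller table:

```sql
-- start from the smaller event table and join the much larger
-- transactions table onto it
SELECT t.hash, t.block_number
FROM uniswap_v3_ethereum.Pair_evt_Swap s
JOIN ethereum.transactions t
  ON s.evt_tx_hash = t.hash
  AND s.evt_block_number = t.block_number
```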
This is because `ethereum.transactions` is much bigger than `uniswap_v3_ethereum.Pair_evt_Swap`, so we should join `uniswap_v3_ethereum.Pair_evt_Swap` with `ethereum.transactions`, not the other way around.
Inner joins are generally faster than outer joins. If you can use an inner join instead of an outer join, it will improve query performance.
One exception to the ordering described above is Solana data: the `account_activity` table is ordered by `account_keys` rather than `block_time`. This allows queries based on raw Solana data to utilize the min/max values for `account_keys`.
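A hedged sketch, assuming the `solana.account_activity` table and `account_keys` column as described above (the account key is a placeholder):

```sql
-- filter on account_keys, the column the table is ordered by;
-- table name, column shape, and the key below are assumptions
SELECT *
FROM solana.account_activity
WHERE account_keys = 'So11111111111111111111111111111111111111112'
```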
A few additional tips:

- Use `UNION ALL` instead of `UNION`: if you’re combining the results of multiple queries, use `UNION ALL` instead of `UNION` to avoid the overhead of removing duplicate rows.
- Prefer `varbinary` over `varchar`: `varbinary` operations are faster than `varchar` operations, as the data is compressed and encoded. Always use the smallest possible data type for your columns.
- Avoid unnecessary `ORDER BY`: sorting is expensive, so only order results when you actually need them ordered.
- Don’t apply functions to filter columns: for example, avoid `date_trunc('day', block_time) > '2022-01-01'`. Instead, use `block_time > '2022-01-01'`. The first example will not be able to use the min/max values of the `block_time` column to skip entire parquet files or row groups within files while scanning through a table, while the second will. The same goes for other functions, such as `substr`, `lower`, `upper`, etc.
- Use the `EXPLAIN` command: `EXPLAIN` shows how the query engine processes your query. This can help you identify potential performance bottlenecks and optimize your queries accordingly.