Alluxio Product School Webinar
Mar. 23, 2023
For more Alluxio Events: https://www.alluxio.io/events/
Speaker: Beinan Wang (Tech Lead, Alluxio)
In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, will share expert tips for tuning Trino performance. In addition, he will demonstrate how to integrate Trino with Alluxio as a caching layer using connectors for Hive, Iceberg, Hudi, or Delta Lake.
4. Trino Overview
● Distributed SQL Query Engine
○ ANSI SQL on Hive data warehouses, Hudi, Iceberg, Kafka, Druid, etc.
○ Designed to be interactive
○ Access to petabytes of data
● Open-source
○ github.com/trinodb
● Use Cases
○ Ad-hoc
○ BI tools
○ Dashboard
○ A/B testing
○ ETL
6. Hive Tables Overview
● Uses the Hive metastore to serve metadata
○ Backed by an RDBMS (MySQL)
○ Limited scalability
REF: https://tabular.io/blog/iceberg-metadata-indexing/
17. Make Sure There Is Sufficient Memory
○ Set max memory per query
○ Set max memory per node
■ If we could speed up queries, we would be able to reduce
max_concurrent, increase the max memory per node, and still keep the
same throughput
○ Memory allocation deadlock
■ Try query.low-memory-killer.policy
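As a rough illustration of the per-query limit, the sketch below overrides it for a single session. It assumes the Trino session property query_max_memory (the session-level counterpart of query.max-memory); the 20GB value is a placeholder, not a recommendation. The per-node limit and the low-memory killer policy are server-side settings, so they appear only as comments.

```sql
-- List the current memory-related session properties and their defaults.
SHOW SESSION LIKE 'query_max%';

-- Cap the distributed memory a query in this session may use (placeholder value).
SET SESSION query_max_memory = '20GB';

-- Server-side properties, set in the cluster configuration rather than per session:
--   query.max-memory-per-node
--   query.low-memory-killer.policy
```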
18. Optimize File Format & Table Layout
● Consider using a columnar data format for your data files.
● ORC might perform better than Parquet (especially when using
PrestoDB).
● Avoid CSV and JSON formats.
● Compress data files (SNAPPY, LZ4, ZSTD, or GZIP).
● Use partitioning. You can create a partitioned version of a table with a CTAS
(https://prestodb.io/docs/current/sql/create-table-as.html) by adding the
partitioned_by table property to the CREATE TABLE statement.
● Use bucketing. Do this by adding the bucketed_by table property to your CREATE
TABLE statement. You will also need to specify bucket_count.
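A minimal sketch of the partitioning and bucketing advice, assuming a Hive catalog named hive and hypothetical tables sales.orders_raw and sales.orders_by_day (none of these names come from the talk). The table properties go in the WITH clause of the CTAS, and the partition column must come last in the SELECT list.

```sql
-- Rewrite a raw table as ORC, partitioned by day and bucketed by customer.
-- Catalog, schema, table, and column names are hypothetical.
CREATE TABLE hive.sales.orders_by_day
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['order_date'],  -- partition columns must be listed last
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 32                      -- required whenever bucketed_by is set
)
AS
SELECT order_id, customer_id, total_price, order_date
FROM hive.sales.orders_raw;
```

The bucket count of 32 is only a placeholder; a sensible value depends on data volume and cluster size.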
19. Collect Hive Table Stats
● Collect table statistics so that the optimizer can produce the most efficient
query plan and queries run as fast as possible.
● Use the SQL ANALYZE <tablename> statement to do this. Repeat ANALYZE for
all tables involved in queries on a regular basis, typically when data has
substantially changed (e.g. new data has arrived or an ETL cycle has
completed).
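A short sketch of the statistics workflow, assuming a hypothetical Hive table hive.sales.orders_by_day partitioned by a single order_date column. In Trino the statement is written ANALYZE followed by the table name, and SHOW STATS displays what the optimizer will see.

```sql
-- Collect statistics for the whole table (hypothetical table name).
ANALYZE hive.sales.orders_by_day;

-- After a daily ETL run, analyze only the newly loaded partition.
-- Each inner array holds the values of the partition keys, in order.
ANALYZE hive.sales.orders_by_day
WITH (partitions = ARRAY[ARRAY['2023-03-22']]);

-- Inspect the collected row counts, distinct values, and null fractions.
SHOW STATS FOR hive.sales.orders_by_day;
```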
20. Join Optimization
● Enable the cost-based optimizer (CBO); it is already enabled by default.
● “LARGE LEFT”: put the large table on the left side of the join.
● Let Trino do the job if it is using the default settings.
○ SET SESSION join_distribution_type = 'AUTOMATIC';
○ SET SESSION join_reordering_strategy = 'AUTOMATIC';
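The join advice can be sketched as follows, assuming hypothetical tables orders (large fact table) and customers (small dimension table). The large table is written on the left, and EXPLAIN shows which distribution and join order the optimizer actually chose.

```sql
-- Let the cost-based optimizer pick the join distribution and join order.
SET SESSION join_distribution_type = 'AUTOMATIC';
SET SESSION join_reordering_strategy = 'AUTOMATIC';

-- Large table on the left, small table on the right (hypothetical tables).
EXPLAIN
SELECT c.region, sum(o.total_price) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.region;
```

The automatic choices rely on table statistics, so they work best on tables that have been analyzed.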
21. Dynamic Filtering
● Enable Dynamic Filtering when one or more joins are in play, especially
when a smaller dimension table is used to probe a larger fact table.
Dynamic Filtering is pushed down to the ORC and Parquet readers, and
can accelerate queries on partitioned as well as non-partitioned tables.
Dynamic Filtering is a join optimization intended to improve the performance of
hash joins. Enable this with:
○ SET SESSION enable_dynamic_filtering = true;
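A sketch of the kind of query that benefits, assuming hypothetical star-schema tables store_sales (fact) and date_dim (dimension). The selective predicate on the dimension table yields a small set of join keys, which can then be used to skip fact-table data at read time.

```sql
-- Enable dynamic filtering for this session.
SET SESSION enable_dynamic_filtering = true;

-- The filter on date_dim narrows the date_sk values; those values are
-- pushed to the scan of store_sales so non-matching data can be skipped.
-- Table and column names are hypothetical.
SELECT s.item_sk, sum(s.net_paid) AS total_paid
FROM store_sales s
JOIN date_dim d ON s.sold_date_sk = d.date_sk
WHERE d.year = 2022 AND d.moy = 12
GROUP BY s.item_sk;
```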
23. Configurations Recommendation (JVM)
● Avoid using large clusters
○ <= 400 workers
● Use JDK 11 instead of JDK 8
○ JDK 11 provides much better performance for both runtime and GC
● Use G1GC for large heaps (> 10 GB)
● DO NOT over-tune the JVM
○ DO NOT touch MaxNewSize and NewSize
○ DO NOT touch MaxTenuringThreshold
○ DO NOT touch InitiatingHeapOccupancyPercent
● A larger Xmx remediates 95% of GC issues