Alluxio Product School Webinar - Boosting Trino Performance.


Alluxio Product School Webinar Mar. 23, 2023 For more Alluxio Events: https://www.alluxio.io/events/ Speaker: Beinan Wang (Tech Lead, Alluxio) In March’s Product School session, Beinan, an Alluxio tech lead, Presto committer, and Trino contributor, will share expert tips for tuning Trino performance. In addition, he will demonstrate how to integrate Trino with Alluxio as a caching layer using connectors for Hive, Iceberg, Hudi, or Delta Lake.

Agenda
01 Overview & Architecture: Trino, HMS and Iceberg Overview
02 SQL Execution: SQL Execution Introduction
03 Accelerating SQL: Best Practices for Accelerating SQL
04 Q&A: Questions & Answers

Trino Overview
● Distributed SQL Query Engine
○ ANSI SQL on Hive data warehouses, Hudi, Iceberg, Kafka, Druid, etc.
○ Designed to be interactive
○ Access to petabytes of data
● Open-source
○ github.com/trinodb
● Use Cases
○ Ad-hoc
○ BI tools
○ Dashboard
○ A/B testing
○ ETL

Hive Tables Overview
● Uses the Hive Metastore (HMS) to serve table metadata
○ Backed by an RDBMS (MySQL)
○ Limited scalability
REF: https://tabular.io/blog/iceberg-metadata-indexing/

EXPLAIN vs EXPLAIN ANALYZE
EXPLAIN: plan structure + cost estimates
EXPLAIN ANALYZE: plan structure + cost estimates + actual execution statistics

Scheduling of Hive Tables
https://blog.bigdataboutique.com/2022/09/hive-tables-and-whats-next-for-modern-data-platforms-1xts1m

Scan Parquet Files
https://parquet.apache.org/docs/file-format/

S3 getObject API
https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html

Make Sure There Is Sufficient Memory
● Set max memory per query
● Set max memory per node
○ If queries run faster, we can reduce max_concurrent and increase the max memory per node while keeping the same throughput
● Memory allocation deadlock
○ Try query.low-memory-killer.policy
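The trade-off between query speed, concurrency and memory can be checked with simple arithmetic (Little's law: concurrency = throughput × latency). This is a minimal sketch with made-up numbers; the function names and all figures are illustrative assumptions, not Trino settings, and it assumes memory on a node is shared evenly between running queries.

```python
def concurrency_for_throughput(queries_per_minute: float, avg_latency_min: float) -> float:
    """Little's law: concurrent queries needed to sustain a given throughput."""
    return queries_per_minute * avg_latency_min


def memory_per_query_gb(node_query_memory_gb: float, concurrent_queries: float) -> float:
    """Share of a node's query memory each query gets under an even split."""
    return node_query_memory_gb / concurrent_queries


node_memory_gb = 64.0  # hypothetical query memory available on one worker

# Same throughput (20 queries/minute), but queries become twice as fast.
before = concurrency_for_throughput(20, avg_latency_min=1.0)
after = concurrency_for_throughput(20, avg_latency_min=0.5)

print(before, memory_per_query_gb(node_memory_gb, before))  # 20.0 3.2
print(after, memory_per_query_gb(node_memory_gb, after))    # 10.0 6.4
```

Halving latency halves the concurrency needed for the same throughput, which in this simplified model doubles the memory each query can use on a node.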

Optimize File Format & Table Layout
● Consider using a column-based format for your data files.
● ORC might perform better than Parquet (especially when using PrestoDB).
● Avoid CSV and JSON formats.
● Use compression (SNAPPY, LZ4, ZSTD, or GZIP).
● Use partitioning. You can create a partitioned version of a table with CTAS (https://prestodb.io/docs/current/sql/create-table-as.html) by adding the partitioned_by property to the WITH clause of the CREATE TABLE statement.
● Use bucketing. Add the bucketed_by property to the WITH clause of your CREATE TABLE statement. You will also need to specify bucket_count.
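To illustrate why partitioning helps, here is a minimal sketch of partition pruning over a Hive-style directory layout (one key=value directory per partition column). It is a simplified model rather than Trino's implementation; the function names, the bucket path and the ds column are hypothetical.

```python
from typing import Dict, List


def partition_path(table_root: str, partition: Dict[str, str]) -> str:
    """Hive-style layout: one key=value directory level per partition column."""
    parts = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_root.rstrip("/")] + parts)


def prune_partitions(
    partitions: List[Dict[str, str]], predicate: Dict[str, str]
) -> List[Dict[str, str]]:
    """Keep only partitions whose values match every equality predicate."""
    return [
        p for p in partitions
        if all(p.get(col) == val for col, val in predicate.items())
    ]


# Hypothetical table partitioned by a date column "ds".
partitions = [{"ds": "2023-03-21"}, {"ds": "2023-03-22"}, {"ds": "2023-03-23"}]

# WHERE ds = '2023-03-23' only needs to list and read one directory.
to_scan = prune_partitions(partitions, {"ds": "2023-03-23"})
print([partition_path("s3://bucket/warehouse/events", p) for p in to_scan])
# ['s3://bucket/warehouse/events/ds=2023-03-23']
```

A filter on the partition column lets the engine skip whole directories before any file is opened, which is why the partition column should match the most common filter.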

Collect Hive Table Stats
● Collect table statistics so that the optimizer can produce the most efficient query plan.
● Use the SQL ANALYZE <tablename> command to do this (Trino's syntax omits the TABLE keyword that Hive uses). Repeat ANALYZE for all tables involved in queries on a regular basis, typically when the data has changed substantially (e.g. after new data arrives or an ETL cycle completes).
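As a rough illustration of why statistics matter, the sketch below picks the probe and build sides of a join from estimated row counts and falls back to the order written in the query when a count is missing. It is a deliberately simplified stand-in for a cost-based optimizer, not Trino's actual cost model; the function and table names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def choose_join_sides(
    left: str, right: str, row_counts: Dict[str, Optional[int]]
) -> Tuple[str, str]:
    """Return (probe, build). Without stats, keep the order written in the query."""
    left_rows, right_rows = row_counts.get(left), row_counts.get(right)
    if left_rows is None or right_rows is None:
        return left, right
    # The smaller table becomes the build side (held in memory as a hash table).
    return (left, right) if left_rows >= right_rows else (right, left)


# Query written as: customers JOIN orders (small table on the left).
print(choose_join_sides("customers", "orders", {}))
# ('customers', 'orders')  -> no stats, the huge table would be the build side

print(choose_join_sides("customers", "orders",
                        {"customers": 1_000, "orders": 1_000_000_000}))
# ('orders', 'customers')  -> with stats, the small table becomes the build side
```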

Join Optimization
● Enable the CBO (it is already enabled by default).
● "LARGE LEFT": put the large table on the left side of the join.
● Let Trino do the job; these are already the default settings:
○ SET SESSION join_distribution_type = 'AUTOMATIC';
○ SET SESSION join_reordering_strategy = 'AUTOMATIC';
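The "LARGE LEFT" rule follows from how a hash join works: the right side is the build side and is materialized as a hash table, while the left side is the probe side and is streamed. A minimal single-process sketch, assuming rows are (join_key, payload) tuples; the function name and sample data are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]  # (join_key, payload)


def hash_join(
    probe_rows: Iterable[Row], build_rows: Iterable[Row]
) -> Iterator[Tuple[int, str, str]]:
    """Inner hash join: materialize the build (right) side, stream the probe (left) side."""
    table: Dict[int, List[str]] = defaultdict(list)
    for key, payload in build_rows:   # memory use grows with the build side
        table[key].append(payload)
    for key, payload in probe_rows:   # the probe side is never held in memory
        for match in table.get(key, ()):
            yield key, payload, match


orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]  # large table, left
customers = [(1, "alice"), (2, "bob")]                                       # small table, right

print(list(hash_join(orders, customers)))
# [(1, 'order-a', 'alice'), (2, 'order-b', 'bob'), (1, 'order-c', 'alice')]
```

Only the build side has to fit in memory, so keeping the small table on the right keeps the hash table small.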

Dynamic Filtering
● Dynamic Filtering is a join optimization intended to improve the performance of hash joins. Enable it when one or more joins are in play, especially when a smaller dimension table is used to filter a larger fact table. Dynamic filters are pushed down to the ORC and Parquet readers and can accelerate queries on partitioned as well as non-partitioned tables. Enable it with:
○ SET SESSION enable_dynamic_filtering = true;
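Conceptually, dynamic filtering collects the join keys seen on the small (build) side at run time and uses them to skip fact rows during the scan, before they reach the join. The sketch below models this with a plain set of keys; real engines may push down value sets or min/max ranges instead, and the function names and data here are illustrative assumptions.

```python
from typing import Iterable, List, Set, Tuple

Row = Tuple[int, str]  # (join_key, payload)


def collect_dynamic_filter(build_rows: Iterable[Row]) -> Set[int]:
    """Gather the join keys that actually occur on the build (dimension) side."""
    return {key for key, _ in build_rows}


def scan_with_filter(fact_rows: Iterable[Row], allowed_keys: Set[int]) -> List[Row]:
    """Scan the fact table, dropping rows that cannot match any build-side key."""
    return [row for row in fact_rows if row[0] in allowed_keys]


# Dimension rows left after a selective predicate such as region = 'EU'.
dim = [(2, "EU")]
fact = [(1, "sale-a"), (2, "sale-b"), (3, "sale-c"), (2, "sale-d")]

keys = collect_dynamic_filter(dim)
print(scan_with_filter(fact, keys))
# [(2, 'sale-b'), (2, 'sale-d')]
```

The more selective the filter on the dimension table, the fewer fact rows have to be read and joined.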

Configuration Recommendations (JVM)
● Avoid using large clusters
○ <= 400 workers
● Use JDK 11 instead of JDK 8
○ JDK 11 provides much better performance for both runtime and GC
● Use G1GC for large heaps (> 10 GB)
● DO NOT over-tune the JVM
○ DO NOT touch MaxNewSize and NewSize
○ DO NOT touch MaxTenuringThreshold
○ DO NOT touch InitiatingHeapOccupancyPercent
● A larger Xmx remediates 95% of GC issues

Q&A
Complete the Community Survey for a chance to win an Amazon gift card!

Slides shown as figures or section dividers only (titles from the slide transcript):
1. Boosting Trino Performance, Mar 23, 2023
3. Overview & Architecture (section 01)
5. Trino Architecture
7. Trino Tuning Tips (section 02)
8. Plan Generation (from Trino.io)
9. Optimization
11. Plan Generation & Optimization
14. Predicate Pushdown Resource Usage
16. Best Practices (section 03)
22. Trino + Alluxio Multi-Level Cache