Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Reducing large S3 API costs using Alluxio at Datasapiens
Juraj Pohanka & Koen Michiels, Datasapiens
About Alluxio: alluxio.io
Engage with the open source community on Slack: alluxio.io/slack
18. DATA ORCHESTRATION SUMMIT
Test setup
▪dataset
• TPC-DS dataset with scale factor 100
• stored in an S3 bucket
▪query execution
• set of queries:
  • TPC-DS suite excluding query no. 72
• query execution:
  • number of repetitions: 10
  • concurrency level: 1
▪measurements
• Alluxio:
  • logical operations: 'File Infos Got'
  • RPC invocations: 'GetFileInfo'
• S3:
  • total request counts per request type
  • total request costs per request type
Results from the Alluxio-Presto cluster
▪the 10 queries with the most API requests
Query name    File Infos Got (avg)    GetFileInfo operations (avg)
q14_1         159,200.1               127,576.9
q09           137,031.0               109,669.0
q14_2         110,933.8               88,732.6
q75           101,468.4               81,166.3
q64           75,148.3                60,099.4
q88           73,224.0                58,584.0
q23_1         61,313.6                49,054.3
q23_2         60,566.2                48,457.6
q95           56,518.0                45,212.0
q28           54,810.0                43,866.0
Results from the Alluxio-Presto cluster
▪cumulative request counts
Operation type            Cumulative count
File Infos Got            24,089,740
GetFileInfo operations    19,287,627
▪S3 API costs for caching the dataset into Alluxio
Request type    Cumulative count    Cumulative cost ($)
ListBucket      28,324              0.14
GetObject       24,033              0.01
HeadObject      44,581              0.02
Total           96,938              0.17
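The cost column can be checked with a simple per-request price calculation. The sketch below assumes the S3 Standard list prices of $0.005 per 1,000 LIST requests and $0.0004 per 1,000 GET/HEAD requests. These prices are an assumption (they are not stated in the slides and vary by region); the request counts are the ones reported above.

```python
# Assumed S3 Standard request prices in USD per 1,000 requests.
# NOT taken from the slides; actual prices vary by region and over time.
PRICE_PER_1000 = {
    "ListBucket": 0.005,
    "GetObject": 0.0004,
    "HeadObject": 0.0004,
}

# Cumulative request counts reported for caching the dataset into Alluxio.
REQUEST_COUNTS = {
    "ListBucket": 28_324,
    "GetObject": 24_033,
    "HeadObject": 44_581,
}


def request_cost(request_type: str, count: int) -> float:
    """Return the cost in USD of `count` requests of the given type."""
    return count / 1000 * PRICE_PER_1000[request_type]


costs = {name: request_cost(name, n) for name, n in REQUEST_COUNTS.items()}
total_requests = sum(REQUEST_COUNTS.values())
total_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name:<12}{REQUEST_COUNTS[name]:>8,}  ${cost:.2f}")
print(f"{'Total':<12}{total_requests:>8,}  ${total_cost:.2f}")
```

Under these assumed prices the rounded values match the table, which suggests that ListBucket requests dominate the cost even though they are not the most frequent request type.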
Per-query cost estimations
▪the 10 queries with the most API requests
Query name    S3 API cost ($)
q14_1         0.2684
q09           0.2310
q14_2         0.1870
q75           0.1711
q64           0.1267
q88           0.1234
q23_1         0.1034
q23_2         0.1021
q95           0.0953
q28           0.0924
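The slides do not state how the per-query figures were derived. They are numerically consistent with spreading the S3 API cost of the EMR+Presto cluster ($40.61) evenly over the cumulative 'File Infos Got' count (24,089,740) and multiplying by each query's average count. The sketch below treats that proportional allocation as an assumption; the function and variable names are illustrative only.

```python
# Figures reported elsewhere in the slides.
TOTAL_S3_API_COST_USD = 40.61      # S3 API costs of the EMR+Presto cluster
TOTAL_FILE_INFOS_GOT = 24_089_740  # cumulative 'File Infos Got' count

# Average 'File Infos Got' per query, from the Alluxio-Presto results.
AVG_FILE_INFOS_GOT = {
    "q14_1": 159_200.1,
    "q09": 137_031.0,
    "q14_2": 110_933.8,
    "q75": 101_468.4,
    "q64": 75_148.3,
    "q88": 73_224.0,
    "q23_1": 61_313.6,
    "q23_2": 60_566.2,
    "q95": 56_518.0,
    "q28": 54_810.0,
}


def estimate_query_cost(avg_file_infos_got: float) -> float:
    """Assumed model: cost is proportional to the query's share of
    'File Infos Got' operations (an inference, not the authors' stated method)."""
    cost_per_operation = TOTAL_S3_API_COST_USD / TOTAL_FILE_INFOS_GOT
    return avg_file_infos_got * cost_per_operation


estimates = {q: estimate_query_cost(n) for q, n in AVG_FILE_INFOS_GOT.items()}

for query, cost in estimates.items():
    print(f"{query:<8}${cost:.4f}")
```

The repository with the complete test results is the authoritative source for the actual estimation method.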
Cost comparison
▪Infrastructure costs and S3 API costs
Cluster                   Infrastructure costs ($)    S3 API costs ($)
Alluxio+Presto cluster    29.02                       0.17
EMR+Presto cluster        42.55                       40.61
▪S3 API costs form 0.58% of total costs when using Alluxio
▪S3 API costs form 48.83% of total costs when not using Alluxio
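The two percentages follow from the cost table if "total costs" means infrastructure costs plus S3 API costs. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def s3_share_percent(infrastructure_cost: float, s3_api_cost: float) -> float:
    """Return S3 API costs as a percentage of total (infrastructure + S3 API) costs."""
    total_cost = infrastructure_cost + s3_api_cost
    return 100 * s3_api_cost / total_cost


# Costs in USD, taken from the comparison table.
alluxio_presto_share = s3_share_percent(29.02, 0.17)
emr_presto_share = s3_share_percent(42.55, 40.61)

print(f"Alluxio+Presto: {alluxio_presto_share:.2f}% of total costs")
print(f"EMR+Presto:     {emr_presto_share:.2f}% of total costs")
```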
Use an intermediate storage layer
▪storage pricing is similar across cloud providers
▪common analytical workloads are far more storage- and compute-intensive than our example
▪omitting an intermediate data storage layer leads to higher costs
Links
▪GitHub repository with complete test results:
• https://github.com/datasapiens/alluxio-s3-costs-test
▪DZone article:
• https://dzone.com/articles/reducing-large-s3-api-costs-using-alluxio
▪Company website:
• https://www.datasapiens.co.uk