© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pratim Das
20th of September 2017
Deploying your Data Warehouse on
AWS
Deploying your Data Warehouse on AWS
[Architecture diagram: Ingest → Store → Prepare for Analytics → Store → Analyze. Ingestion options depend on the source (Streaming? DBs (OLTP)? Existing DWH? Own code? Parallel? Managed? Complex? Cost?): AWS Lambda and/or AWS Glue for ETL, Kinesis Firehose for streaming, batch loads, and the SCT migration agent for an existing DWH. Data lands in S3, is prepared with Glue so it is query optimized and ready for self-service, and is stored in S3 and Redshift. It is then analyzed with the Athena query service for ad-hoc analysis, Redshift and Redshift Spectrum for the DWH and data marts, BI & visualization, and predictive analytics.]
Redshift Architecture
Managed, Massively Parallel, Petabyte-Scale Data Warehouse
Streaming Backup/Restore to S3
Load data from S3, DynamoDB and EMR
Extensive Security Features
Scale from 160 GB to 2 PB Online
Fast | Simple | Elastic | Secure | Compatible | Cost Efficient
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore
• 2, 16 or 32 slices
[Cluster diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC. The leader node talks to the compute nodes (each e.g. 16 cores, 128 GB RAM, 16 TB disk) over a 10 GigE (HPC) interconnect. Compute nodes perform ingestion, backup and restore against S3 / EMR / DynamoDB / SSH.]
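The node and slice layout described above can be inspected from SQL; a minimal sketch using the STV_SLICES system table (assumes a user with access to system tables):

-- One row per slice, showing which compute node owns it.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;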
Common Redshift Use Cases
Use Case: Traditional Data Warehousing
Business Reporting
Advanced pipelines and queries
Secure and Compliant
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese Mobile Phone Provider
Powering 100 marketplaces in 50 countries
World’s Largest Children’s Book Publisher
Bulk Loads and Updates
Use Case: Log Analysis
Log & Machine / IoT Data
Clickstream Events Data
Time-Series Data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near real-time analytics
Interactive data analysis and recommendation engine
Ride analytics for pricing and product development
Ad prediction and on-demand analytics
Use Case: Business Applications
Multi-Tenant BI Applications
Back-end services
Analytics as a Service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information Platform (IIP)
Analytics-as-a-Service
Product and Consumer Analytics
Redshift Customers
Selected Amazon Redshift customers
NTT Docomo: Japan’s largest mobile service provider
68 million customers
Tens of TBs per day of data across a mobile network
6 PB of total data (uncompressed)
Data science for marketing operations, logistics, and so on
Greenplum on-premises
Scaling challenges
Performance issues
Need same level of security
Need for a hybrid environment
125 node DS2.8XL cluster
4,500 vCPUs, 30 TB RAM
2 PB compressed
10x faster analytic queries
50% reduction in time for new BI application deployment
Significantly less operations overhead
[Architecture diagram: on-premises data sources feed ETL and forwarder clients; data flows over AWS Direct Connect to a loader and state-management layer, landing in S3 and Amazon Redshift, with a sandbox environment alongside.]
NTT Docomo: Japan’s largest mobile service provider
Nasdaq: powering 100 marketplaces in 50 countries
Orders, quotes, trade executions, market “tick” data from 7 exchanges
7 billion rows/day
Analyze market share, client activity, surveillance, billing, and so on
Microsoft SQL Server on-premises
Expensive legacy DW ($1.16 M/yr.)
Limited capacity (1 yr. of data online)
Needed lower TCO
Must satisfy multiple security and regulatory requirements
Similar performance
23 node DS2.8XL cluster
828 vCPUs, 5 TB RAM
368 TB compressed
2.7 T rows, 900 B derived
8 tables with 100 B rows
7 man-month migration
¼ the cost, 2x storage, room to grow
Faster performance, very secure
Nasdaq: powering 100 marketplaces in 50 countries
Tuning Redshift for Performance
Design for Queryability
• Equally on each slice
• Minimum amount of work
• Use just enough cluster resources
Do an Equal Amount of Work on Each Slice
Choose Best Table Distribution Style
[Diagram: three distribution styles shown across Node 1 (slices 1 and 2) and Node 2 (slices 3 and 4).]
All – all data on every node
Key – same key to same location
Even – round-robin distribution
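As a rough illustration of how these styles are declared, here is a sketch in Redshift DDL; the table and column names are hypothetical and not from the deck:

-- KEY: co-locate rows that share a join key on the same slice.
CREATE TABLE sales (
  sale_id     BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL: replicate a small dimension table to every node.
CREATE TABLE calendar_dim (
  date_id  INTEGER,
  cal_date DATE
)
DISTSTYLE ALL;

-- EVEN: round-robin rows across slices when no single join key dominates.
CREATE TABLE web_events (
  event_id BIGINT,
  event_ts TIMESTAMP
)
DISTSTYLE EVEN;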
Do the Minimum Amount of Work on Each Slice
Columnar storage + Large data block sizes + Data compression + Zone maps + Direct-attached storage
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
[Diagram: three sorted 1 MB blocks holding values 10…324, 375…623 and 637…959; the zone map records each block’s min and max (10/324, 375/623, 637/959) so blocks outside a predicate’s range can be skipped.]
Reduced I/O = Enhanced Performance
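The ANALYZE COMPRESSION output above can be applied back to the table’s DDL, and a sort key keeps the zone maps selective. A sketch based on the listing table from the output; the column types and the choice of distribution/sort key are assumptions, not from the deck:

CREATE TABLE listing (
  listid         INTEGER       ENCODE delta,
  sellerid       INTEGER       ENCODE delta32k,
  eventid        INTEGER       ENCODE delta32k,
  dateid         SMALLINT      ENCODE bytedict,
  numtickets     SMALLINT      ENCODE bytedict,
  priceperticket DECIMAL(8,2)  ENCODE delta32k,
  totalprice     DECIMAL(8,2)  ENCODE mostly32,
  listtime       TIMESTAMP     ENCODE raw
)
DISTKEY (listid)
SORTKEY (listtime);   -- range predicates on listtime can now skip whole blocks via zone maps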
Use Cluster Resources Efficiently to Complete as Quickly as Possible
Amazon Redshift Workload Management
[Diagram: BI tools, SQL clients and analytics tools submit queries that wait in WLM queues and then run in slots. Example configuration: a query queue with 80% memory and 4 slots (80/4 = 20% per slot) and an ETL queue with 20% memory and 2 slots (20/2 = 10% per slot).]
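Queue routing itself happens per session or per statement. A minimal sketch, assuming the WLM configuration above defines a queue matched to the query group 'etl':

-- Send the following statements to the ETL queue.
SET query_group TO 'etl';

-- Optionally claim extra slots in that queue for a heavy statement.
SET wlm_query_slot_count TO 2;

VACUUM sales;   -- hypothetical table

RESET wlm_query_slot_count;
RESET query_group;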
Redshift Performance Tuning
Redshift Playbook
Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability
amzn.to/2quChdM
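Parts 2–4 of the playbook boil down to DDL choices. As a small sketch (hypothetical table, not taken from the playbook itself), compound vs. interleaved sort keys look like this:

-- Compound: best when filters usually include the leading column(s).
CREATE TABLE clicks_compound (
  event_date DATE,
  user_id    BIGINT,
  url        VARCHAR(2048)
)
COMPOUND SORTKEY (event_date, user_id);

-- Interleaved: gives each sort column equal weight for more varied filters.
CREATE TABLE clicks_interleaved (
  event_date DATE,
  user_id    BIGINT,
  url        VARCHAR(2048)
)
INTERLEAVED SORTKEY (event_date, user_id);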
Optimizing Amazon Redshift by Using the AWS Schema Conversion Tool
amzn.to/2sTYow1
Amazon Redshift Spectrum
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale | Elastic & highly available | On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support
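Before Spectrum can query data in place, the S3 data is registered as an external schema and table in the Data Catalog. A sketch that mirrors the s3.d_customer_order_item_details table used in the demo queries below; the database name, IAM role, bucket, column types and file format are assumptions:

-- External schema backed by the Data Catalog (Hive Metastore compatible).
CREATE EXTERNAL SCHEMA s3
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over Parquet files, partitioned by order day.
CREATE EXTERNAL TABLE s3.d_customer_order_item_details (
  asin      VARCHAR(16),
  quantity  INTEGER,
  our_price DECIMAL(8,2)
)
PARTITIONED BY (order_day DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/order-item-details/';

-- Partitions are registered individually (or via a Glue crawler), e.g.:
ALTER TABLE s3.d_customer_order_item_details
ADD PARTITION (order_day = '2017-09-01')
LOCATION 's3://my-bucket/order-item-details/order_day=2017-09-01/';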
Life of a query
[Diagram repeated for each step: a SQL client connects to Amazon Redshift over JDBC/ODBC; the cluster works with the Amazon Redshift Spectrum fleet (nodes 1 … N), the Data Catalog / Apache Hive Metastore, and Amazon S3 (exabyte-scale object storage).]
1. Query is submitted, e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …
2. Query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum
3. Query plan is sent to all compute nodes
4. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
5. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer
6. Amazon Redshift Spectrum nodes scan your S3 data
7. Amazon Redshift Spectrum projects, filters, joins and aggregates
8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster
9. Result is sent back to the client
Demo: Running an analytic query over an exabyte in S3
Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's get the prior books she's written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE '%POTTER%' AND
P.AUTHOR = 'J. K. Rowling'
Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first few day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over past 20 years.
190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift’s query optimizer: 40X
Total reduction: 3.5B X
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data format used by Amazon Retail.
Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
Ingestion, ETL & BI
Getting data to Redshift using AWS Database Migration Service (DMS)
Simple to use | Minimal Downtime | Supports most widely used Databases
Low Cost | Fast & Easy to Set-up | Reliable
Source = OLTP
Getting data to Redshift using AWS Schema Conversion Tool (SCT)
Source = OLAP
Loading data from S3
• Splitting Your Data into Multiple Files
• Uploading Files to Amazon S3
• Using the COPY Command to Load from Amazon S3
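Put together, the three bullets above become a single parallel COPY; splitting the input into multiple files (ideally a multiple of the number of slices) lets every slice load in parallel. A sketch with a hypothetical table, bucket and IAM role:

-- Load every file under the prefix in parallel across all slices.
COPY sales
FROM 's3://my-bucket/sales/2017-09-20/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
DELIMITER '|'
GZIP
REGION 'eu-west-1';   -- only needed when the bucket is in another region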
ETL on Redshift
Upsert into Amazon Redshift using AWS Glue and SneaQL
http://amzn.to/2woj3gB
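The linked post shows the pattern with AWS Glue and SneaQL; independently of those tools, an upsert on Redshift is usually a staging-table delete-and-insert inside one transaction. A generic sketch (table names, bucket and role are hypothetical, and this is not the SneaQL syntax):

BEGIN;

-- Stage the incremental data.
CREATE TEMP TABLE stage (LIKE target_table);

COPY stage
FROM 's3://my-bucket/incremental/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV;

-- Replace existing rows that reappear in the staged data...
DELETE FROM target_table
USING stage
WHERE target_table.id = stage.id;

-- ...then insert the new versions.
INSERT INTO target_table
SELECT * FROM stage;

DROP TABLE stage;
COMMIT;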
Partner ETL solutions
Data integration
BI and Visualization
QuickSight for BI on Redshift
Amazon Redshift
Partner BI solutions
Advanced Analytics
• Approximate functions
• User defined functions
• Machine learning
• Data science
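The first two bullets map directly to SQL. A small sketch of an approximate distinct count and a scalar Python UDF; the table, column and function names are hypothetical:

-- Approximate function: HyperLogLog-based distinct count.
SELECT APPROXIMATE COUNT(DISTINCT customer_id)
FROM web_events;

-- Scalar user-defined function written in Python.
CREATE OR REPLACE FUNCTION f_margin (price FLOAT, cost FLOAT)
RETURNS FLOAT
STABLE
AS $$
    if price is None or cost is None:
        return None
    return price - cost
$$ LANGUAGE plpythonu;

SELECT f_margin(totalprice, 25.0) FROM listing LIMIT 10;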
Redshift Partner Ecosystem
4 types of partners
• Load and transform your data with Data Integration Partners
• Analyze data and share insights across your organization with Business Intelligence Partners
• Architect and implement your analytics platform with System Integration and Consulting Partners
• Query, explore and model your data using tools and utilities from Query and Data Modeling Partners
aws.amazon.com/redshift/partners/
AWS Named as a Leader in The Forrester Wave™: Big Data Warehouse, Q2 2017
http://bit.ly/2w1TAEy
On June 15, Forrester published The Forrester Wave™: Big Data Warehouse, Q2 2017, in which AWS is positioned as a Leader. According to Forrester, “With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud.” AWS received the highest score possible, 5/5, for customer base, market awareness, ability to execute, road map, support, and partners. “AWS’s key strengths lie in its dynamic scale, automated administration, flexibility of database offerings, good security, and high availability (HA) capabilities, which make it a preferred choice for customers.”
Migrations
Extending your DWH (or Migrations) to Redshift
http://amzn.to/2vN3UBO
Oracle to Redshift
Extending your DWH (or Migrations) to Redshift
http://amzn.to/2wZy7OA
Teradata to Redshift
Extending your DWH (or Migrations) to Redshift
http://amzn.to/2hbKwYd
Converge Silos to Redshift
London Redshift Meetup
http://bit.ly/2x6ZdGA
Next Session on the 26th of October.
RSVP opens 6th of October
Questions?
Editor's Notes
  1. We launched Redshift Valentine's Day February 2013 so it's been out for a little over four years now and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
  2. You can start small and grow without interruption. It’s cost efficient: the smallest environment is a single-node 160GB cluster with SSD storage for $.025/hour; scale up to the largest cluster (NTT Docomo) with 4PB of usable data in Redshift. It’s Simple > SQL, JDBC/ODBC connectivity to your favorite BI tool. Fast > We optimized for scans against TB/PB data sets by leveraging columnar storage and compression within a shared-nothing, massively parallel architecture. Secure > Security is built-in. You can encrypt data at rest and in transit, isolate your clusters using Amazon VPC and even manage your keys using AWS Key Management Service (KMS) and hardware security modules (HSMs).
  3. Let’s take a closer look at the Redshift architecture. The bit that you work with is what we call the leader node. This is where you connect using your driver; you can use JDBC/ODBC. In addition to the Redshift drivers, which you can download from our site, you can technically connect with any Postgres driver. Behind the leader node are the compute nodes; you can have up to 128 of those. These compute nodes also talk to other services, primarily S3. We usually ingest data from S3 and you can unload data to S3. We are continuously backing up your cluster to S3; all of this happens in the background and in parallel, and that’s really the important takeaway on this slide. As we receive your query we generate C++ code that is compiled and sent down to all the compute nodes; that work is also done by the leader node. Postgres catalog tables also exist in the leader. We’ve also added additional metadata tables to Redshift as well.
  4. We launched Redshift Valentine's Day February 2013 so it's been out for a little over four years now and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
  5. Fully Managed – We provision, backup, patch and monitor so you can focus on your data Fast – Massively Parallel Processing and columnar architecture for fast queries and parallel loads Nasdaq security – Ingests 5B rows/trading day, analyzes orders, trades and quotes to protect traders, report activity and develop their marketplaces NTT Docomo - Redshift is NTT Docomo's primary analytics platform for data science and marketing and logistic analytics. Data is pre-processed on premises and loaded into a massive, multi-petabyte data warehouse on Amazon Redshift, which data scientists use as their primary analytics platform.
  6. Pinterest uses Redshift for interactive data analysis. Redshift is used to store all web event data and uses for KPIs, recommendations and A/B experimentation. Lyft uses Redshift for ride analytics across the world (rides / location data ) - Through analysis, company engineers estimated that up to 90% of rides during peak times had similar routes. This led to the introduction of Lyft Line – a service that allows customers to save up to 60% by carpooling with others who are going in the same direction. Yelp has multiple deployments of RedShift with different data sets in use by product management, sales analytics, ads, SeatMe (Point of sale analytics) and many other teams. Analyzes 0s of millions of ads/day, 250M mobile events/day, ad campaign performance and new feature usage
  7. Accenture Insights Platform (AIP) is a scalable, on-demand, globally available analytics solution running on Amazon Redshift. AIP is Accenture's foundation for its big data offering to deliver analytics applications for healthcare and financial services.
  8. We launched Redshift Valentine's Day February 2013 so it's been out for a little over four years now and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
  9. USE CASES – to use across the next several slides (6-10). Amazon – understand customer behavior; migrated from Oracle; PB-scale workload ingesting 2 TB/day and growing at 67% YoY. They could query across one week in an hour with Oracle; now they can query 15 months in 14 minutes with Redshift. Boingo – 2,000+ commercial Wi-Fi locations, 1 million+ hotspots, 90M+ ad engagements in 100+ countries. They used Oracle: rapid data growth slowed analytics, with heavy admin overhead and high cost (license, hardware, support). After migration they saw a 180x performance improvement and 7x cost savings. FINRA (Financial Industry Regulatory Authority) – one of the largest independent securities regulators in the United States, established to monitor and regulate financial trading practices. It reacts to changing market dynamics while providing its analysts with the tools (Redshift) to interactively query multi-petabyte data sets. It captures, analyzes, and stores ~75 billion records daily, and the company estimates it will save up to $20 million annually by using AWS instead of a physical data center infrastructure. Desk.com – ingests 200K case-related events/hour and runs a user-facing portal on Redshift.
  10. We launched Redshift on Valentine's Day, February 2013, so it's been out for a little over four years now, and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
  11. In the Redshift world, throughput is not about concurrency but about effective use of the MPP architecture. Over the next few slides, let's see how we do that. Don't skew the work to a few slices – choose the right distribution key. Do the minimum – don't pull more blocks off disk than you absolutely have to; use sort keys to eliminate I/O. Assign just enough memory, which is the only resource you control, to the query – any more and you're wasting memory that could have been used by other queries – WLM.
  12. How do we do an equal amount of work on each slice? Slices are a very important concept in Redshift – they are how we achieve parallelism within a node on the cluster. We've taken each compute node and divided it up into what we call slices, each with its own share of compute and storage. Depending on the node type, there are 2, 16 or 32 of these slices within each compute node.
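  If you want to see the slice layout on your own cluster, here is a minimal sketch using the STV_SLICES system table (the query itself is just illustrative):

     SELECT node, slice
     FROM stv_slices
     ORDER BY node, slice;   -- one row per slice, listed under its compute node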
  13. This leads nicely to how we distribute data onto these slices. KEY: distribute on the column participating in the largest join. We run a modulo operation on the key, which figures out which slice the data will land on. Use it on your large fact tables – billions to trillions of rows – and your largest dimension table, and consider using it on columns appearing in your GROUP BYs. Next is EVEN: here we just round-robin the data through the cluster to keep it even. It's the default – if you are not sure what to do, this is the safest to start with – but avoid it on tables whose columns you frequently aggregate on. ALL: this is ideal for small (~5M rows) dimension tables.
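  A minimal DDL sketch of the three distribution styles; the table and column names are illustrative, not from the deck:

     CREATE TABLE order_items (            -- large fact table: distribute on the join column
       order_id     BIGINT,
       customer_id  BIGINT,
       quantity     INTEGER,
       amount       DECIMAL(12,2)
     )
     DISTSTYLE KEY
     DISTKEY (customer_id);

     CREATE TABLE calendar_dim (           -- small dimension (~5M rows or fewer): copied to every node
       cal_date        DATE,
       fiscal_quarter  SMALLINT
     )
     DISTSTYLE ALL;

     CREATE TABLE staging_events (         -- no obvious join key: round-robin across slices
       event_id  BIGINT,
       payload   VARCHAR(1000)
     )
     DISTSTYLE EVEN;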
  14. Now, let's talk about how we do the minimum amount of work on each slice.
  15. Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database load or query. Because it's columnar, data is stored column by column, all of the same data type, and is hence much easier to compress, and there are different compression encoding types based on the data type you have. We added ZSTD early this year: it works for all data types and gives great compression – great for CHAR, VARCHAR and JSON strings. Redshift also keeps an in-memory data structure that contains the MIN/MAX of each block. That lets it effectively prune data when you apply a predicate and skip blocks that don't contain data relevant to your query. Redshift does this for you automatically; you can then use sort keys to further reduce I/O.
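  As a rough sketch of column encodings (the table and columns are made up for illustration; ENCODE ZSTD and ANALYZE COMPRESSION are standard Redshift features):

     CREATE TABLE page_views (
       view_ts     TIMESTAMP     ENCODE raw,    -- keep the leading sort key column lightly encoded
       page_url    VARCHAR(500)  ENCODE zstd,   -- ZSTD compresses long strings and JSON text well
       user_agent  VARCHAR(2000) ENCODE zstd
     )
     SORTKEY (view_ts);

     ANALYZE COMPRESSION page_views;   -- ask Redshift to sample the data and recommend encodings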
  16. Let’s talk about Queues and how you effectively manage the resources in your cluster.
  17. You can use workload management (WLM) to define multiple queues and to route queries at runtime. This is important when you have multiple sessions or users running queries at the same time, each with a different profile. You can configure up to 8 query queues and set the number of queries that can run concurrently in each of those queues. You can set rules to route queries to particular queues based on users or on labels, i.e. query groups. For example, you can have a rule that says: for any long-running query with high I/O skew in this queue – segment execution time (seconds) > 120 and I/O skew (ratio) > 2 – hop it to another queue, log it, or abort it. You can configure the amount of memory allocated to each queue, so that large queries run in queues with more memory than others. You can also configure the WLM timeout property to limit long-running queries.
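  A minimal sketch of routing a session's queries to a queue by query group; the label 'etl' and the table name are hypothetical and the label must match a query group defined in your WLM configuration:

     SET query_group TO 'etl';              -- subsequent queries route to the queue matching this label
     SELECT COUNT(*) FROM staging_events;
     RESET query_group;                     -- fall back to the default routing rules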
  18. Take note of this: it's a five-step guide to Redshift performance tuning.
  19. You can use the AWS Schema Conversion Tool (AWS SCT) to optimize Redshift. You can choose Collect Statistics, and AWS SCT uses the statistics to make suggestions for sort keys and distribution keys. Then you can choose Run Optimization and review the suggestions table by table. You can also create a report in PDF/CSV that contains the optimization suggestions, and you can right-click and choose Apply to database to apply the recommendations. SCT is also your friend when you are planning to migrate another database or data warehouse to Redshift. SCT supports Oracle Data Warehouse, Microsoft SQL Server, Teradata, Netezza, Greenplum, Vertica and more as sources for Redshift. It will not only do an assessment but will also migrate schemas and other objects (DDLs) to Redshift – not the data; for data we have DMS, which I will cover shortly.
  20. Let’s take a quick look at Spectrum
  21. Spectrum gives you the ability to run Redshift SQL queries directly against your S3 data using thousands of nodes. It's fast even at exabyte scale, it's elastic and it's highly available. You pay per query. It's highly concurrent, so you can have multiple Redshift clusters access the same data in S3 using Spectrum. There is no ETL: you can query the data in place in S3, in open data formats, without conversion. And it's ANSI SQL – no differences whatsoever from Redshift.
  22. This is similar to what we saw on the previous slide on Redshift architecture. The only difference is that we now have a Spectrum fleet: a set of managed nodes in a private VPC that are available to you. S3 contains the data; you also need the metadata to understand that data, whether that's the Athena catalog today, the Glue Data Catalog when it launches in a little while, or your own Hive metastore if you happen to have one. Now, let's trace the journey of a query in a Redshift cluster using Spectrum. You start by submitting your query via your client…
  23. The query is optimized at the leader node and compiled there into C++, and we determine what gets run locally and what needs to go to Spectrum.
  24. The code is sent to all of the compute nodes to execute.
  25. The compute nodes figure out the partition information from the data catalog and they're able to dynamically prune the partitions based on the parts of the plan that they've run already.
  26. And they issue multiple requests to Spectrum to grab the files.
  27. Spectrum is going to scan your S3 data.
  28. It runs the projections, filters and aggregates.
  29. And then the final aggregations, GROUP BYs and so forth get pulled back into Redshift, where we do the join processing – because you want to do join processing on smaller amounts of data.
  30. And finally, results are returned to the client.
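  To make this concrete, here is a minimal sketch of the Spectrum setup that the journey above assumes; the schema, table, bucket, IAM role and the local users_dim table are all illustrative names:

     CREATE EXTERNAL SCHEMA spectrum
     FROM DATA CATALOG DATABASE 'spectrumdb'
     IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
     CREATE EXTERNAL DATABASE IF NOT EXISTS;

     CREATE EXTERNAL TABLE spectrum.clicks (
       user_id   BIGINT,
       url       VARCHAR(500),
       click_ts  TIMESTAMP
     )
     STORED AS PARQUET
     LOCATION 's3://my-bucket/clicks/';

     -- The query is plain Redshift SQL; scans, filters and partial aggregates on
     -- spectrum.clicks run in the Spectrum layer, while the join against the local
     -- dimension table runs on the Redshift compute nodes.
     SELECT d.segment, COUNT(*) AS clicks
     FROM spectrum.clicks c
     JOIN users_dim d ON d.user_id = c.user_id
     GROUP BY d.segment;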
  31. Let's imagine that J.K. Rowling is about to issue her eighth book in the Harry Potter series. She sells a lot of books, so I might be a product manager thinking: how many books should I order for the region I'm responsible for? I'm going to pick out the books from my products table where the title contains the word Potter and the author is J.K. Rowling. One table, two filters.
  32. Then you want to compute the sales of those books, so now suddenly you've got two tables, one of them in S3, because customer_orders_Item_Details could be pretty large even if the products table is small. There are two filters, one join, a couple of GROUP BYs, one ORDER BY, a LIMIT and an aggregate.
  33. The next thing you want to be able to say is: I'm not interested in her total book sales, I just want the first three days, because that's my reorder point, and I don't want the book sales for all time – I just want the hardcover first edition. So now we're up to three tables, five filters, two joins, three GROUP BYs, one ORDER BY, a LIMIT, an aggregate function and a couple of casts. It's starting to get complicated, right?
  34. And then you might say: am I actually just interested in the city of Seattle? So I pick up a regions table – here's the country, here's the state, here's the region – and I join against the region.
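  Put together, the query might look roughly like this; the column names are guesses for illustration, customer_orders_item_details is the S3/Spectrum table, and products and regions are local Redshift tables:

     SELECT p.title,
            SUM(o.quantity * o.item_price) AS first_three_day_sales
     FROM spectrum.customer_orders_item_details o
     JOIN products p ON p.asin = o.asin
     JOIN regions  r ON r.region_id = o.region_id
     WHERE p.title LIKE '%Potter%'
       AND p.author = 'J. K. Rowling'
       AND p.format = 'hardcover first edition'
       AND r.city = 'Seattle'
       AND o.order_date BETWEEN p.release_date AND p.release_date + 3
     GROUP BY p.title
     ORDER BY first_three_day_sales DESC
     LIMIT 10;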
  35. Let's assume you are a bookstore at Amazon scale: you generate maybe roughly one hundred and forty terabytes of customer_item_order_detail every day, and you might have been saving that for the last twenty years. That's 190 million files across fifteen thousand partitions – basically a partition per day, plus a partition for US versus rest of the world… You really need a billion-fold reduction in the amount of data you're processing to make that work and have it return in a reasonable time. We estimated that running this query on a 1,000-node Hive cluster would take over 5 years. For this particular data set we get about 5x compression, because most of the data is text. There are 50 columns but we're retrieving only about five from S3, so we get a 10x reduction there – that's 50x total. In this case Spectrum decided to run 2,500 nodes, which gives me another bit of advantage, but I'm still quite a bit short of the billion that I need – barely into the millions. If you want to scale up, you can't just throw bigger hammers at things; somewhere along the way you actually have to be smart about what you apply. Using Redshift there's a 2x static partition elimination, because I'm only asking for data about the US, so I can remove half of it. The query came back in just under three minutes. I really think she should do another book, because as you can see, the last one is the one that shows up in the top ten all the time.
  36. One reason is that your data is going to get bigger. On average, data warehousing volumes grow 10x every 5 years, so they go up by a factor of a thousand every 15 years, and that's industry-wide. On average, a Redshift customer doubles their storage every year, just because the costs are better.
  37. Now, let’s take a closer look at AWS native services that are key to data warehousing workloads.
  38. Why DMS? It's simple: you can set up and get going in minutes, you can migrate databases with little downtime, it supports many sources, and it's free – you only pay for the underlying EC2 instance. Using DMS, you can move data either directly to Redshift from a number of source databases, or to a data lake on S3. Once the data is in S3, you have separated storage from compute, and the data can then be consumed by multiple Redshift clusters via Spectrum, or via Athena or EMR. You can have many architectural patterns, such as 1-to-1, or bringing many data silos into Redshift for its inexpensive petabyte-scale analytics capabilities and the ability to query and join them all together to gain further insights. From the source you can do a one-off load or CDC.
  39. The COPY command leverages Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel from files in an S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.
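  A minimal COPY sketch; the bucket, prefix and IAM role are illustrative, and the 'part_' prefix matches the split files so that every slice gets a share of the load:

     COPY sales
     FROM 's3://my-bucket/sales/part_'
     IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
     GZIP
     DELIMITER '|';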
  40. AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue is integrated with Amazon S3, Amazon RDS, and Amazon Redshift, and can connect to any JDBC-compliant data store. AWS Glue automatically crawls your data sources, identifies data formats, and then suggests schemas and transformations, so you don’t have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git and your favorite integrated development environment (IDE), and share them with other AWS Glue users. Glue schedules your ETL jobs and provisions and scales all the infrastructure required so your ETL jobs run quickly and efficiently at any scale. There are no servers to manage, and you pay only for resources consumed by your ETL jobs.
  41. We launched Redshift on Valentine's Day, February 2013, so it's been out for a little over four years now, and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
  42. Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight natively connects to Redshift. You can also leverage QuickSight's Super-fast, Parallel, In-memory Calculation Engine (SPICE), which uses a combination of columnar storage and in-memory technologies. QuickSight now also supports Redshift Spectrum tables. So now you have a fast BI engine on top of an exabyte-scale data warehouse.
  43. Python 2.7 – do you need support for other languages? Let us know
  44. For Redshift we have a list of certified partners and their solutions to work with Amazon Redshift. To search all AWS technology partners, visit the Partner Solutions Finder. Or use the AWS Marketplace.
  45. Essentially, there are four types of partners: ETL (Informatica, Matillion, Talend); BI and visualization (Looker, Tableau, Qlik); consulting and systems integration (Accenture, Beeva, Northbay and Uturn); and SQL client tools (Aginity, Aqua Data Studio, RazorSQL and Toad). You will also hear from Musoor, who is the Director of Data Science at Crimson Macaw, on how they implemented Manchester Airport's data warehouse using Redshift.
  46. We launched Redshift on Valentine's Day, February 2013, so it's been out for a little over four years now, and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/