Data warehousing is a critical component for analyzing and extracting actionable insights from your data. Amazon Redshift lets you deploy a scalable data warehouse in a matter of minutes and start analyzing your data right away with your existing business intelligence tools.
2. Deploying your Data Warehouse on AWS
[Diagram: ingestion decision tree — batch ETL via AWS Lambda and/or Glue (ETL? Managed? Complex? Cost?); streaming via Firehose; OLTP databases and existing DWHs via the SCT Migration Agent; own code, parallel, or managed options — all landing in S3, query-optimized and ready for self-service]
[Diagram: pipeline stages — Ingest, Store, Prepare for Analytics, Store, Analyze; Athena (query service) for ad-hoc analysis, BI & visualization tools, Redshift Spectrum over the DWH and data marts, and predictive analytics on the Redshift data warehouse]
4. Managed, Massively Parallel, Petabyte-Scale Data Warehouse
• Streaming backup/restore to S3
• Load data from S3, DynamoDB, and EMR
• Extensive security features
• Scale from 160 GB to 2 PB online
Fast | Compatible | Secure | Elastic | Simple | Cost Efficient
7. Use Case: Traditional Data Warehousing
Business reporting | Advanced pipelines and queries | Secure and compliant workloads
• Easy Migration – Point & click using AWS Database Migration Service
• Secure & Compliant – End-to-end encryption; SOC 1/2/3, PCI-DSS, HIPAA, and FedRAMP compliant
• Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Customers: a Japanese mobile phone provider; a marketplace operator powering 100 marketplaces in 50 countries; the world's largest children's book publisher (bulk loads and updates)
8. Use Case: Log Analysis
Log & machine IoT data | Clickstream events data | Time-series data
• Cheap – Analyze large volumes of data cost-effectively
• Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
• Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics
Customers: interactive data analysis and recommendation engines; ride analytics for pricing and product development; ad prediction and on-demand analytics
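The micro-batch loading idea above can be sketched as follows — a toy Python buffer (not the Firehose API) that flushes a batch when either a record-count or an age threshold is reached, the same way Firehose buffers by size or interval before delivering to Redshift:

```python
import time

class MicroBatchBuffer:
    """Toy sketch of Firehose-style micro-batching: events are buffered
    and flushed as one batch when either a size or an age threshold is
    hit. The flush() callback stands in for a COPY into Redshift."""

    def __init__(self, flush, max_records=500, max_age_s=60.0, clock=time.monotonic):
        self.flush = flush              # called with the list of buffered events
        self.max_records = max_records
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer = []
        self.first_event_at = None

    def add(self, event):
        if self.first_event_at is None:
            self.first_event_at = self.clock()
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_records or
                self.clock() - self.first_event_at >= self.max_age_s):
            self._flush()

    def _flush(self):
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
            self.first_event_at = None

batches = []
buf = MicroBatchBuffer(batches.append, max_records=3, max_age_s=60.0)
for e in range(7):
    buf.add(e)
# 7 events with a batch size of 3 -> two full batches, one event still buffered
```

In Firehose itself the equivalent knobs are the buffering size and buffering interval configured on the delivery stream.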
9. Use Case: Business Applications
Multi-tenant BI applications | Back-end services | Analytics as a Service
• Fully Managed – Provisioning, backups, upgrades, security, and compression all come built-in so you can focus on your business applications
• Ease of Chargeback – Pay as you go, add clusters as needed: a few big common clusters, several data marts
• Service-Oriented Architecture – Integrated with other AWS services; easy to plug into your pipeline
Customers: Infosys Information Platform (IIP); Analytics-as-a-Service; product and consumer analytics
12. NTT Docomo: Japan's largest mobile service provider
• 68 million customers
• Tens of TBs per day of data across a mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, and so on
Previously: Greenplum on-premises
• Scaling challenges
• Performance issues
• Need same level of security
• Need for a hybrid environment
13. NTT Docomo results: 125-node DS2.8XL cluster
• 4,500 vCPUs, 30 TB RAM
• 2 PB compressed
• 10x faster analytic queries
• 50% reduction in time for new BI application deployment
• Significantly less operations overhead
[Diagram: data source → ETL → AWS Direct Connect → client forwarder → loader/state management → sandbox → Amazon Redshift, backed by S3]
14. Nasdaq: powering 100 marketplaces in 50 countries
• Orders, quotes, trade executions, market "tick" data from 7 exchanges
• 7 billion rows/day
• Analyze market share, client activity, surveillance, billing, and so on
Previously: Microsoft SQL Server on-premises
• Expensive legacy DW ($1.16M/yr.)
• Limited capacity (1 yr. of data online)
• Needed lower TCO
• Must satisfy multiple security and regulatory requirements
• Similar performance
15. Nasdaq results: 23-node DS2.8XL cluster
• 828 vCPUs, 5 TB RAM
• 368 TB compressed
• 2.7 T rows, 900 B derived
• 8 tables with 100 B rows
• 7 man-month migration
• ¼ the cost, 2x storage, room to grow
• Faster performance, very secure
19. Choose the Best Table Distribution Style
• ALL – all data on every node
• KEY – same key to same location (rows with the same distribution key land on the same slice)
• EVEN – round-robin distribution across slices
[Diagram: each style shown across Node 1 (slices 1–2) and Node 2 (slices 3–4)]
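The three styles can be illustrated with a toy sketch (plain Python, not Redshift internals) that places rows onto four slices; the slice count and the hashing are simplified assumptions:

```python
# Toy sketch of the three distribution styles, placing rows onto
# 4 slices (think: 2 nodes x 2 slices each).
NUM_SLICES = 4

def distribute(rows, style, key=None):
    slices = [[] for _ in range(NUM_SLICES)]
    for i, row in enumerate(rows):
        if style == "ALL":
            # Full copy on every slice here for simplicity; in Redshift,
            # ALL stores one copy per node, not per slice.
            for s in slices:
                s.append(row)
        elif style == "KEY":
            # Same key value always maps to the same slice (collocated joins).
            slices[hash(row[key]) % NUM_SLICES].append(row)
        elif style == "EVEN":
            slices[i % NUM_SLICES].append(row)   # round robin
    return slices

rows = [{"cust_id": c, "amount": 10 * c} for c in (1, 2, 1, 3, 2, 1)]
even = distribute(rows, "EVEN")
keyed = distribute(rows, "KEY", key="cust_id")
alld = distribute(rows, "ALL")
```

With EVEN, the six rows spread evenly; with KEY, all rows for a given `cust_id` land on one slice, which is what makes joins on that key local.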
25. Redshift Playbook
• Part 1: Preamble, Prerequisites, and Prioritization
• Part 2: Distribution Styles and Distribution Keys
• Part 3: Compound and Interleaved Sort Keys
• Part 4: Compression Encodings
• Part 5: Table Data Durability
amzn.to/2quChdM
28. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast at exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
Life of a query (slides 29–37)
[Diagram on each slide: client connects via JDBC/ODBC to the Amazon Redshift leader node; compute nodes 1…N talk to the Amazon Redshift Spectrum layer, Amazon S3 (exabyte-scale object storage), and the Data Catalog (Apache Hive Metastore)]
29. Step 1: The query (e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …) is submitted via JDBC/ODBC.
30. Step 2: The query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum.
31. Step 3: The query plan is sent to all compute nodes.
32. Step 4: Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
33. Step 5: Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
34. Step 6: Amazon Redshift Spectrum nodes scan your S3 data.
35. Step 7: Amazon Redshift Spectrum projects, filters, joins, and aggregates.
36. Step 8: Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
37. Step 9: The result is sent back to the client.
39. Let's build an analytic query – #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's get the prior books she's written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE '%POTTER%' AND
P.AUTHOR = 'J. K. Rowling'
40. Let's build an analytic query – #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values.
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
41. Let's build an analytic query – #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions.
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
42. Let's build an analytic query – #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA.
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
43. Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail
records for each day over past 20 years.
190 million files across 15,000 partitions in S3.
One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster
would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2,500 nodes: 2,500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift's query optimizer: 40X
---------------------------------------------------
Total reduction: ~3.5 billion X
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.
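As a sanity check, the reduction factors listed above multiply out to the claimed ~3.5-billion-fold total:

```python
# Multiply the individual reduction factors claimed on the slide.
factors = {
    "compression": 5,
    "columnar file format": 10,
    "scanning with 2500 nodes": 2500,
    "static partition elimination": 2,
    "dynamic partition elimination": 350,
    "query optimizer": 40,
}
total = 1
for f in factors.values():
    total *= f
# total is 3_500_000_000: the ~3.5 billion-fold reduction
```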
44. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
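A quick back-of-envelope on the growth figures above — 10x every 5 years compounds to 1,000x over 15 years, and a customer doubling data every year grows 32x over 5 years:

```python
# Compounding the slide's growth rates.
industry_15yr = 10 ** (15 // 5)   # 10x per 5 years, over 15 years
customer_5yr = 2 ** 5             # doubling yearly, over 5 years
```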
46. Getting data to Redshift using AWS Database Migration Service (DMS) — Source = OLTP
• Simple to use
• Minimal downtime
• Supports most widely used databases
• Low cost
• Fast & easy to set up
• Reliable
47. Getting data to Redshift using AWS Schema Conversion Tool (SCT) — Source = OLAP
48. Loading data from S3
• Splitting Your Data into Multiple Files
• Uploading Files to Amazon S3
• Using the COPY Command to Load from
Amazon S3
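The first step — splitting your data into multiple files so COPY can load them in parallel — can be sketched like this (a toy round-robin splitter; in practice you would aim for a file count that is a multiple of the cluster's slice count):

```python
def split_rows(rows, num_files):
    """Split rows into num_files roughly equal chunks so a parallel
    COPY from S3 can assign one file per slice."""
    files = [[] for _ in range(num_files)]
    for i, row in enumerate(rows):
        files[i % num_files].append(row)
    return files

rows = [f"row-{i}" for i in range(10)]
parts = split_rows(rows, 4)
# 10 rows over 4 files -> sizes [3, 3, 2, 2]
```

Each chunk would then be uploaded under a common S3 prefix and loaded with a single COPY pointed at that prefix.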
57. 4 types of partners
• Load and transform your data with Data Integration
Partners
• Analyze data and share insights across your
organization with Business Intelligence Partners
• Architect and implement your analytics platform
with System Integration and Consulting Partners
• Query, explore and model your data using tools and
utilities from Query and Data Modeling Partners
aws.amazon.com/redshift/partners/
58. AWS Named as a Leader in The Forrester Wave™: Big Data Warehouse, Q2 2017
http://bit.ly/2w1TAEy
On June 15, Forrester published the Big Data
Warehouse, Q2 2017, in which AWS is
positioned as a Leader. According to Forrester,
“With more than 5,000 deployments, Amazon
Redshift has the largest data warehouse
deployments in the cloud.” AWS received the
highest score possible, 5/5, for customer base,
market awareness, ability to execute, road map,
support, and partners. “AWS’s key strengths lie
in its dynamic scale, automated administration,
flexibility of database offerings, good security,
and high availability (HA) capabilities, which
make it a preferred choice for customers.”
We launched Redshift on Valentine's Day, February 2013, so it's been out for a little over four years now, and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
You can start small and grow without interrupting queries. It's cost efficient: the smallest environment is a single-node 160 GB cluster with SSD storage for $0.25/hour, and you can scale up to the largest cluster (NTT Docomo) with 4 PB of usable data in Redshift. It's simple: SQL, with JDBC/ODBC connectivity to your favorite BI tool.
Fast > We optimized for scans against TBs and PBs of data by leveraging columnar storage and compression within a shared-nothing, massively parallel architecture.
Secure > Security is built-in. You can encrypt data at rest and in transit, isolate your clusters using Amazon VPC and even manage your keys using AWS Key Management Service (KMS) and hardware security modules (HSMs).
Let's take a closer look at the Redshift architecture. The bit that you work with is what we call the leader node. This is what you connect to using your driver, via JDBC/ODBC. In addition to the Redshift drivers, which you can download from our site, you can technically connect with any Postgres driver. Behind the leader node are the compute nodes; you can have up to 128 of those. These compute nodes also talk to other services, primarily S3. We usually ingest data from S3 and you can unload data to S3. We are continuously backing up your cluster to S3; all of this happens in the background and in parallel, and that's really the important takeaway on this slide. As we receive your query we generate C++ code that is compiled, and the leader node sends it down to all the compute nodes. Postgres catalog tables also live on the leader, and we've added two additional metadata tables to Redshift as well.
Fully Managed – We provision, backup, patch and monitor so you can focus on your data
Fast – Massively Parallel Processing and columnar architecture for fast queries and parallel loads
Nasdaq security – Ingests 5B rows/trading day, analyzes orders, trades and quotes to protect traders, report activity and develop their marketplaces
NTT Docomo - Redshift is NTT Docomo's primary analytics platform for data science and marketing and logistic analytics. Data is pre-processed on premises and loaded into a massive, multi-petabyte data warehouse on Amazon Redshift, which data scientists use as their primary analytics platform.
Pinterest uses Redshift for interactive data analysis. Redshift is used to store all web event data and uses for KPIs, recommendations and A/B experimentation.
Lyft uses Redshift for ride analytics across the world (rides / location data ) - Through analysis, company engineers estimated that up to 90% of rides during peak times had similar routes. This led to the introduction of Lyft Line – a service that allows customers to save up to 60% by carpooling with others who are going in the same direction.
Yelp has multiple deployments of RedShift with different data sets in use by product management, sales analytics, ads, SeatMe (Point of sale analytics) and many other teams.
Analyzes 0s of millions of ads/day, 250M mobile events/day, ad campaign performance and new feature usage
Accenture Insights Platform (AIP) is a scalable, on-demand, globally available analytics solution running on Amazon Redshift. AIP is Accenture's foundation for its big data offering to deliver analytics applications for healthcare and financial services.
USE CASES – to use across the next several slides (6-10)
Amazon – Understand customer behavior; migrated from Oracle; PB-scale workload growing at 2 TB/day, 67% YoY. With Oracle they could query across one week of data in one hour; with Redshift they can query 15 months in 14 minutes.
Boingo – 2,000+ commercial Wi-Fi locations, 1 million+ hotspots, 90M+ ad engagements in 100+ countries. Previously used Oracle: rapid data growth slowed analytics, with admin overhead and expensive licenses, hardware, and support. After migration: 180x performance improvement and 7x cost savings.
Finra (Financial Industry Regulatory Authority) – One of the largest independent securities regulators in the United States, established to monitor and regulate financial trading practices. Reacts to changing market dynamics while providing its analysts with the tools (Redshift) to interactively query multi-petabyte data sets. Captures, analyzes, and stores ~75 billion records daily. The company estimates it will save up to $20 million annually by using AWS instead of a physical data center infrastructure.
Desk.com – Ingests 200K case related events/hour and runs a user facing portal on Redshift
In the Redshift world, throughput is not about concurrency but about effective use of the MPP architecture. For the next few slides, let's see how we do that…
Don't skew the work to a few slices – choose the right distribution key.
Minimum - don't pull more blocks off disk than you absolutely have to – Use sort keys to eliminate IO.
Assign just enough memory (the only resource you control) to the query – any more and you're wasting memory that could've been used by other queries – via WLM.
How do we do an equal amount of work on each slice? Slices are a very important concept in Redshift – they're how we achieve parallelism within a node on the cluster. Each compute node is divided into what we call slices, each with its own share of compute and storage. Depending on the node type, there are 2, 16, or 32 of these slices within each compute node.
This nicely leads to how do we distribute data into these slices.
KEY: distribute on the column participating in the largest join. We apply a modulo operation to the hashed key, and that determines which slice the data lands on. Use it on your large fact tables (billions and trillions of rows) and your largest dimension table. Consider using columns that appear in your GROUP BYs.
Next is EVEN: here we just round-robin the data through the cluster to keep it even. It's the default – if you are not sure what to do, this is the safest to start with. Avoid it when you frequently join or aggregate on particular columns.
ALL: this is ideal for small (~5M rows) dimension tables.
Now, let's talk about how we do the minimum amount of work on each slice.
Redshift uses a block size of 1 MB, which is efficient and reduces the number of I/O requests needed to perform any database loading or querying. Because storage is columnar, data is stored column by column, all of the same data type, and hence much easier to compress. There are different compression encodings based on the data type you have. ZSTD, added early this year, works with all data types and gives great compression – great for CHAR, VARCHAR, and JSON strings.
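The "columnar data compresses well" point can be illustrated with run-length encoding, one of the simple encodings that benefits from storing a column's values together (a toy sketch, not Redshift's actual encoder):

```python
def run_length_encode(values):
    """Run-length encoding: collapse consecutive repeats into
    (value, count) runs. Works well on columnar storage, where a
    low-cardinality or sorted column holds long runs of one value."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A sorted, low-cardinality column compresses from 9 values to 3 runs.
column = ["DE", "DE", "DE", "UK", "UK", "US", "US", "US", "US"]
encoded = run_length_encode(column)
```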
Redshift also keeps an in-memory data structure containing the MIN/MAX of each block. That lets it effectively prune data when evaluating a predicate and skip blocks that can't contain rows for your query results.
Redshift does this for you automatically; then use sort keys to further reduce I/O.
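A toy sketch of the MIN/MAX block pruning described above — only blocks whose range overlaps the predicate get scanned (simplified; not Redshift's actual zone-map structure):

```python
# Toy per-block MIN/MAX metadata for a column sorted by order day.
blocks = [
    {"min": 1,  "max": 30, "rows": 30},
    {"min": 31, "max": 60, "rows": 30},
    {"min": 61, "max": 90, "rows": 30},
]

def blocks_to_scan(blocks, lo, hi):
    """Keep only blocks whose [min, max] range overlaps [lo, hi];
    the rest are skipped without any I/O."""
    return [i for i, b in enumerate(blocks) if b["max"] >= lo and b["min"] <= hi]

hit = blocks_to_scan(blocks, 35, 40)   # predicate: 35 <= order_day <= 40
# Only the middle block overlaps, so 2 of 3 blocks are skipped.
```

Sorting on the filtered column is what keeps each block's MIN/MAX range narrow, which is why sort keys make this pruning so effective.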
Let’s talk about Queues and how you effectively manage the resources in your cluster.
You can use workload management (WLM) to define multiple queues and to route queries at runtime. This is important when you have multiple sessions or users running queries at the same time, each with different profiles. You can configure up to 8 query queues and set the number of queries that can run concurrently in each. You can set rules to route queries to particular queues based on users or labels (i.e., query groups). For example, you can have a rule that says: any long-running query with high I/O skew in this queue, where segment execution time (seconds) > 120 and I/O skew (ratio) > 2, should hop to another queue, be logged, or be aborted. You can configure the amount of memory allocated to each queue, so that large queries run in queues with more memory than others. You can also configure the WLM timeout property to limit long-running queries.
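The rule-based routing described above can be sketched as follows; the queue names and thresholds are hypothetical, mirroring the "segment execution time > 120 and I/O skew > 2" example:

```python
def route_query(query, rules, default_queue="default"):
    """Toy sketch of WLM-style rule routing: the first rule whose
    predicate matches the query's runtime profile decides the action
    (hop to another queue, log, abort, ...)."""
    for rule in rules:
        if rule["predicate"](query):
            return rule["action"]
    return ("route", default_queue)

# Hypothetical rule: long segment time + high I/O skew -> hop queues.
rules = [
    {"predicate": lambda q: q["segment_s"] > 120 and q["io_skew"] > 2,
     "action": ("hop", "long_running_queue")},
]

a = route_query({"segment_s": 300, "io_skew": 3.5}, rules)  # matches the rule
b = route_query({"segment_s": 10, "io_skew": 1.0}, rules)   # falls through
```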
Take note of this. This is a 5-step guide to Redshift performance tuning.
You can use the AWS Schema Conversion Tool (AWS SCT) to optimize Redshift. You can choose to Collect Statistics. AWS SCT uses the statistics to make suggestions for sort keys and distribution keys. And then, you can choose to Run Optimization. To review the suggestions, table by table.
You can also create a report in PDF/CSV that contains the optimization suggestions. Right-click and choose Apply to database to apply the recommendations.
SCT is also your friend when you are planning to migrate another database or data warehouse to Redshift. SCT supports Oracle Data Warehouse, Microsoft SQL Server, Teradata, Netezza, Greenplum, Vertica, and more. It will not only do an assessment but will also migrate schemas and other objects (DDLs) to Redshift – not the data; for data we have DMS, which I will cover shortly.
Let’s take a quick look at Spectrum
Spectrum gives you the ability to run Redshift SQL queries directly against your S3 data using thousands of nodes. It's fast even at exabyte scale, it's elastic and highly available, and you pay per query. It's highly concurrent, so multiple Redshift clusters can access the same data in S3 using Spectrum. There is no ETL: you can query the data in place in S3 using open data formats, without conversion. And it's ANSI SQL – no differences whatsoever from Redshift.
This is similar to what we saw on the previous slide on Redshift architecture. The only difference is that we now have a Spectrum fleet – a bunch of privately managed VPC nodes available to you. S3 contains the data; you also need the metadata to understand that data, whether that's the Athena catalog today, the Glue Data Catalog when it launches in a little while, or your own Hive metastore if you happen to have one. Now, let's trace the journey of a query in a Redshift cluster using Spectrum. You start by submitting your query via your client…
The query is optimized at the leader node and compiled there into C++, and we determine what gets run locally and what needs to go to Spectrum.
The code is sent to all of the compute nodes to execute.
The compute nodes figure out the partition information from the data catalog, and they're able to dynamically prune partitions based on the parts of the plan they've already run.
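Dynamic partition pruning can be sketched as a toy filter over a partition list — one partition per day and region, as in the Amazon example later; the partition values here are made up:

```python
# Toy partition catalog: one partition per day and region.
partitions = [
    {"day": d, "region": r}
    for d in ("2017-06-01", "2017-06-02", "2017-06-03")
    for r in ("US", "ROW")          # ROW = rest of world
]

def prune(parts, region, days):
    """Drop partitions that cannot match the query's predicates,
    before any S3 files are listed or scanned."""
    return [p for p in parts if p["region"] == region and p["day"] in days]

survivors = prune(partitions, "US", {"2017-06-02", "2017-06-03"})
# 6 partitions -> 2 after pruning
```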
And they issue multiple requests to Spectrum to grab the files.
Spectrum is going to scan your S3 data.
It runs the projections, filters, and aggregates.
And then the final aggregations, GROUP BYs, and so forth get pulled back into Redshift, where we do join processing – because you want to do join processing on smaller amounts of data.
And finally, results are returned to the client.
Let's imagine that J.K. Rowling is about to issue her eighth book in the Harry Potter series. She sells a lot of books, so I might be a product manager thinking: how many books should I order for the region I'm responsible for?
So I'm going to pick out the books from my products table where the title contains the word Potter and the author is J.K. Rowling. So: one table, two filters.
Then you want to compute the sales of those books. Now suddenly you've got two tables, one of them in S3, because d_customer_order_item_details could be pretty large even if the products table is small. There are two filters, one join, a couple of GROUP BYs, one ORDER BY, a LIMIT, and an aggregate.
The next thing you want to say is: I'm not interested in her total book sales, I just want the first three days, because that's my reorder point. And I don't want sales for all time – I just want the hardcover first edition.
So now we're up to three tables, five filters, two joins, three GROUP BYs, one ORDER BY, a LIMIT, an aggregate, a function, and a couple of casts. So it's starting to get complicated, right?
And then you ask: am I actually just interested in the city of Seattle? So I pick up a regions table and say: here's the country, here's the state, and here's the region – and I want to join against the region.
Let's assume you are a bookstore at Amazon scale: you generate roughly 140 terabytes of customer item order detail every day, and you might have been saving that for the last twenty years. That's 190 million files across 15,000 partitions – basically a partition per day, and then a partition for the US vs. the rest of the world…
You really need a billion-fold reduction in the amount of data you're processing to make that work and have it return in a reasonable time. We estimated that running this query on a 1,000-node Hive cluster would take over 5 years.
For this particular data set we get about 5X compression, because most of the data is text. There are 50 columns but we're retrieving about five from S3, so we get a 10X reduction there – that's 50X total. In this case Spectrum decided to run 2,500 nodes, which gives another big advantage, but that's still quite a bit short of the billion I need – barely into the millions. If you want to scale up, you can't just throw bigger hammers at things; somewhere along the way you have to be smart about what you apply. Using Redshift, there's a 2X static partition elimination, because I'm asking only for data about the US, so I can remove half of it. The query came back in just under three minutes.
I really think she should do another book, because as you can see, the last one is the one that shows up in the top ten all the time.
One reason is that your data is going to get bigger. On average, data warehousing volumes grow 10X every 5 years, so they go up a factor of a thousand every 15 years – and that's industry-wide. On average, a Redshift customer doubles their storage every year, simply because the costs are better.
Now, let’s take a closer look at AWS native services that are key to data warehousing workloads.
Why DMS? It is simple: you can set up and get going in minutes, you can migrate databases with little downtime, it supports many sources, and it's low cost – you only pay for the underlying EC2 instance. Using DMS, you can move data either directly to Redshift from a number of source databases, or to a data lake on S3. Once the data is in S3, you have separated storage from compute, and the data can be consumed by multiple Redshift clusters via Spectrum, or via Athena or EMR. You can have many architectural patterns, such as 1-to-1, or bringing many data silos into Redshift for its inexpensive petabyte-scale analytics, with the ability to query and join them all together in Redshift to gain further insights. From the source you can do a one-off load or CDC.
The COPY command leverages Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel from files in an S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.
AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue is integrated with Amazon S3, Amazon RDS, and Amazon Redshift, and can connect to any JDBC-compliant data store. AWS Glue automatically crawls your data sources, identifies data formats, and then suggests schemas and transformations, so you don’t have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git and your favorite integrated developer environment (IDE), and share them with other AWS Glue users. Glue schedules your ETL jobs and provisions and scales all the infrastructure required so your ETL jobs run quickly and efficiently at any scale. There are no servers to manage, and you pay only for resources consumed by your ETL jobs.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight natively connects to Redshift. You can also leverage QuickSight's Super-fast, Parallel, In-memory Calculation Engine (SPICE), which uses a combination of columnar storage and in-memory technologies. QuickSight now also supports Redshift Spectrum tables, which we will talk about soon. So now you have a fast BI engine on top of an exabyte-scale data warehouse.
Python 2.7 – do you need support for other languages? Let us know
For Redshift we have a list of certified partners and their solutions to work with Amazon Redshift. To search all AWS technology partners, visit the Partner Solutions Finder. Or use the AWS Marketplace.
Essentially, there are four types of partners. Data integration: Informatica, Matillion, Talend.
Looker, Tableau, Qlik
From Accenture, Beeva, Northbay and Uturn… and you will also hear from Musoor, the Director of Data Science at Crimson Macaw, on how they implemented Manchester Airport's data warehouse using Redshift.
Aginity, Aqua Data Studio, RazorSQL and Toad