© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Your First Serverless Data Lake (ANT356-R)
Damon Cortesi
Big Data Architect, AWS
Agenda
Data lakes on AWS
Ingestion
Data prep & optimization
File formats and partitions
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
Amazon Simple Storage Service (Amazon S3) is the core
• Durable – designed for 99.999999999% durability
• Available – designed for 99.99% availability
• Storage – virtually unlimited
• Query in place
• Integrated encryption
• Decoupled storage & compute
(Diagram: Ingest → Amazon S3 → Analysis)
Typical Hadoop environment
• CPU/memory tied directly to disk
• Multiple users competing for resources
• Difficult to upgrade
Hadoop Cluster
Decoupled storage and compute
• Scale specific to each workload
• Optimize for CPU/memory requirements
• Amazon EMR benefits
  • Spin clusters up/down
  • Easily test new versions with the same data
  • Utilize Spot for reduced cost
(Diagram also shows Amazon Athena and AWS Glue as decoupled compute options)
Data lake on AWS
(Diagram: your data — on-premises data, web app data, Amazon RDS, other databases, streaming data — flows through AWS Glue ETL into the data lake, feeding Amazon QuickSight and Amazon SageMaker.)
Real-time ingestion
• May 2018: Amazon Kinesis Data Firehose added support for Parquet and ORC
• One schema per Kinesis Data Firehose delivery stream; integrates with the AWS Glue Data Catalog
• Automatically writes to an Amazon S3 prefix in "YYYY/MM/DD/HH" UTC time format
• Caveat: partitions are not automatically managed in AWS Glue
(Diagram: Kinesis Data Streams → Kinesis Data Firehose → S3 Parquet, cataloged in the AWS Glue Data Catalog)
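Since Firehose lays out objects by UTC hour but does not register those hours as partitions in the Glue Data Catalog, something has to add them. A minimal sketch of generating the hourly prefix and a matching Athena `ALTER TABLE` statement — the `events` table, bucket name, and year/month/day/hour partition keys are all hypothetical:

```python
from datetime import datetime, timezone

def firehose_prefix(ts: datetime) -> str:
    """Build the YYYY/MM/DD/HH prefix Kinesis Data Firehose uses (UTC)."""
    return ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")

def add_partition_ddl(table: str, bucket: str, ts: datetime) -> str:
    """Athena DDL to register one hourly partition in the Glue Data Catalog."""
    ts = ts.astimezone(timezone.utc)
    location = f"s3://{bucket}/{firehose_prefix(ts)}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(year='{ts:%Y}', month='{ts:%m}', day='{ts:%d}', hour='{ts:%H}') "
        f"LOCATION '{location}'"
    )

ts = datetime(2018, 1, 20, 7, tzinfo=timezone.utc)
ddl = add_partition_ddl("events", "my-firehose-bucket", ts)
```

A scheduled job (Lambda, Glue, cron) could run this DDL each hour so queries never miss fresh data.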
Spark on Amazon EMR for splitting generic data

from pyspark import SparkContext, SQLContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

def createContext():
    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, 1)  # Set a batch interval of 1 second
    ssc.checkpoint("s3://<bucket>/<checkpoint>")  # Enable Spark checkpointing in S3
    lines = KinesisUtils.createStream(
        ssc, "Strlogger", "dcortesi-logger",  # Kinesis checkpointing in "Strlogger" DynamoDB table
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 2
    )
    parsedDF = lines.map(lambda v: map_json(v))  # map_json creates standard schema with nested JSON
    parsedDF.foreachRDD(save_rdd)
    return ssc

# For each RDD, write it out to S3 partitioned by the log type ("tag") and date ("dt")
def save_rdd(rdd):
    if rdd.count() > 0:
        sqlContext = SQLContext(rdd.context)
        df = sqlContext.createDataFrame(rdd)
        df.write.partitionBy("tag", "dt").mode("append").parquet("s3://<bucket>/<prefix>/")

ssc = StreamingContext.getOrCreate("s3://<bucket>/<cp>/", lambda: createContext())
ssc.start()
https://spark.apache.org/docs/latest/streaming-kinesis-integration.html
Database sources
• AWS Database Migration Service (AWS DMS)
• Can extract in semi-real-time to S3 in change data capture format
• Export from DB
• AWS Glue JDBC connections or native DB utilities
• Crawl using AWS Glue
(Diagram: Amazon RDS and Amazon DynamoDB → AWS Glue → S3 bucket)
Third-party sources
• Still need some tooling to extract from third-party sources like Segment/Mixpanel
• Some sources provide a JDBC connector (Salesforce)
• Similar to DB sources
• Export → S3 → Catalog with AWS Glue
• If needed, iteratively process using AWS Glue bookmarks
Example structures

Kinesis Data Firehose to Parquet:
  YYYY/MM/DD/HH/name-TS-UUID.parquet
Log data:
  <pre>/<logtype>/dt=YYYY-MM-DD/json.gz
DB exports:
  <pre>/<schema>/<db>/<table>/Y-m-dTH/
  <pre>/<table>/LOAD00[1-9].CSV
  <pre>/<table>/<timestamp>.csv
Third-party exports:
  <pre>/full_export/latest-YMD/csv.gz
  <pre>/incremental/YYYY/MM/DD/csv.gz
Business drivers
• “Hey, can you just add this one data source real quick?”
• Question everything/understand the business drivers
• Do you really need real-time access?
• What business decisions are you making?
• What fields are important to you?
• What questions are you trying to answer?
• The quickest way to a data swamp is by trying to tackle everything at one time. You won’t understand why any data is needed or important, and worse, neither will your customers.
Optimize your data
• Most raw data will not be optimized for querying, and it may also contain sensitive data
• There will be CSVs
• There will be multiple date/time formats
• Carefully grant access to raw
• Architect for idempotency
• Storage tiers: raw → landing → production
1. Optimized file format
2. Partitioned data
3. Masked data
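As a sketch of the "masked data" step: a one-way hash lets the landing tier keep a joinable token in place of a sensitive value. The salt, field names, and 16-character truncation are illustrative choices, not from the talk:

```python
import hashlib

def mask(value: str, salt: str = "replace-with-a-secret-salt") -> str:
    """One-way mask: the same input always yields the same token (so joins
    still work), but the original value cannot be recovered from it."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

# Hypothetical raw record with a sensitive email column
row = {"user_id": 42, "email": "user@example.com", "plan": "pro"}
masked = {**row, "email": mask(row["email"])}
```

In practice this would run as a UDF in the Glue/Spark job that writes the landing tier, so raw access can stay tightly restricted.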
1. Optimized file format
• Compress
• Compact
• Convert
Compaction:
Query                           Number of files   Run time
SELECT count(*) FROM lineitem   5,000 files       8.4 seconds
SELECT count(*) FROM lineitem   1 file            2.31 seconds
Speedup: 3.6x faster

Conversion:
Query                         File format      Number of files   Size     Run time
SELECT count(*) FROM events   json.gz          46,182            176 GB   463.33 seconds
SELECT count(*) FROM events   snappy parquet   11,640            213 GB   6.25 seconds
Speedup: 74x faster

* Tests run using Athena
Row and column formats

ID    Age   State
123   20    CA
345   25    WA
678   40    FL
999   21    WA

Row format:    123 20 CA 345 25 WA 678 40 FL 999 21 WA
Column format: 123 345 678 999 20 25 40 21 CA WA FL WA
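The two layouts above can be reproduced in a few lines — the same twelve values, stored record-by-record versus column-by-column:

```python
rows = [
    {"ID": 123, "Age": 20, "State": "CA"},
    {"ID": 345, "Age": 25, "State": "WA"},
    {"ID": 678, "Age": 40, "State": "FL"},
    {"ID": 999, "Age": 21, "State": "WA"},
]

# Row format: all values of one record stored together
row_layout = [v for r in rows for v in (r["ID"], r["Age"], r["State"])]

# Column format: all values of one column stored together
columns = ["ID", "Age", "State"]
col_layout = [r[c] for c in columns for r in rows]

# Reading just the Age column touches one contiguous slice in the columnar layout
ages = col_layout[4:8]
```

This is why a `count`/`avg` over one column scans far fewer bytes in Parquet or ORC than in a row-oriented format like CSV.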
Parquet file format
Row group metadata allows a Parquet reader to skip portions of a file, or entire files
Parquet challenges
• Schema evolution, particularly with nested data
• https://github.com/prestodb/presto/pull/6675 (2016)
• https://github.com/prestodb/presto/pull/10158 (March 2018)
• Can’t use predicate pushdowns with nested data
• Strings are just binary – can lead to performance issues in extreme cases
• Use parquet-tools to debug and investigate your Parquet files
• github.com/apache/parquet-mr/tree/master/parquet-tools
• Most recent version doesn’t show string min/max stats (patch)
Flatten data using AWS Glue DynamicFrame

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="dcortesi",
    table_name="events_relationalize",
    transformation_ctx="datasource0"
)
datasource1 = Relationalize.apply(
    frame=datasource0,
    staging_path=glue_temp_path,
    name=dfc_root_table_name,
    transformation_ctx="datasource1"
)
flattened_dynf = datasource1.select(dfc_root_table_name)
(
    flattened_dynf.toDF()
    .repartition("date")
    .write
    .partitionBy("date")
    .option("maxRecordsPerFile", OUTPUT_LINES_PER_FILE)
    .mode("append")
    .parquet(EXPORT_PATH)
)
https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/
Flatten data using AWS Glue DynamicFrame

Nested input:
{
  "player": {
    "username": "user1",
    "characteristics": {
      "race": "Human",
      "class": "Warlock",
      "subclass": "Dawnblade",
      "power": 300,
      "playercountry": "USA"
    },
    "arsenal": {
      "kinetic": {
        "name": "Sweet Business",
        "type": "Auto Rifle",
        "power": 300,
        "element": "Kinetic"
      },
      "energy": {
        "name": "MIDA Mini-Tool",
        "type": "Submachine Gun",
        "power": 300,
        "element": "Solar"
      }
    }
  }
}

Flattened output:
{
  "player.username": "user1",
  "player.characteristics.race": "Human",
  "player.characteristics.class": "Warlock",
  "player.characteristics.subclass": "Dawnblade",
  "player.characteristics.power": 300,
  "player.characteristics.playercountry": "USA",
  "player.arsenal.kinetic.name": "Sweet Business",
  "player.arsenal.kinetic.type": "Auto Rifle",
  "player.arsenal.kinetic.power": 300,
  "player.arsenal.kinetic.element": "Kinetic",
  "player.arsenal.energy.name": "MIDA Mini-Tool",
  "player.arsenal.energy.type": "Submachine Gun",
  "player.arsenal.energy.power": 300,
  "player.arsenal.energy.element": "Solar"
}
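The flattening above can be sketched for a single record with a small recursive helper — a simplified stand-in for Relationalize, which additionally pivots arrays out into separate tables:

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, one level of output per leaf value."""
    out = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))  # recurse into nested structs
        else:
            out[path] = value
    return out

nested = {"player": {"username": "user1",
                     "characteristics": {"race": "Human", "power": 300}}}
flat = flatten(nested)
```

With every leaf promoted to a top-level column, Parquet predicate pushdown works on fields that were previously buried in nested structs.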
parquet-tools output

row group 1: RC:177 TS:16692 OFFSET:4
--------------------------------------------------------------------------------
operation:   BINARY SNAPPY DO:0 FPO:4    SZ:250/315/1.26 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: REST.COPY.OBJECT, max: REST.PUT.OBJECT, num_nulls: 0]
bytes_sent:  BINARY SNAPPY DO:0 FPO:254  SZ:436/512/1.17 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1161, max: 8, num_nulls: 34]
object_size: BINARY SNAPPY DO:0 FPO:3882 SZ:134/130/0.97 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 0, max: 79, num_nulls: 134]
remote_ip:   BINARY SNAPPY DO:0 FPO:4016 SZ:323/354/1.10 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 10.1.2.3, max: 54.2.4.5, num_nulls: 0]
bucket:      BINARY SNAPPY DO:0 FPO:7270 SZ:126/122/0.97 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: dcortesi-test-us-west-2, max: dcortesi-test-us-west-2, num_nulls: 0]
time:        INT96 SNAPPY DO:0 FPO:8230 SZ:518/739/1.43 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[no stats for this column]
user_agent:  BINARY SNAPPY DO:0 FPO:8748 SZ:625/771/1.23 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: , aws-sdk-java/1.11.132 Linux/4.9.76-3.78.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.141-b16/1.8.0_141, presto, max: aws-cli/1.14.17 Python/2.7.10 Darwin/16.7.0 botocore/1.8.21, num_nulls: 1]
http_status: BINARY SNAPPY DO:0 FPO:9373 SZ:160/160/1.00 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 200, max: 404, num_nulls: 0]

Slide callouts: RC = row count, TS = total size, SNAPPY = compression, BINARY/INT96 = data type, ST = column stats. Note bytes_sent with min: 1161, max: 8 🤔 — the values are stored as strings, so the min/max stats are lexicographic, not numeric.
Optimizing columnar even more
• 300GB dataset in S3, but only retrieving one row per Athena query
• Used AWS Glue to sort entire dataset before writing to Parquet
• Parquet reader can determine whether to skip an entire file
• Cost optimization
(Diagram: unsorted row groups span wide, overlapping id ranges — 3483719, 1895939, 10, 20, 52030, 10 — so WHERE id = 52030 must read every file. After sorting — 10, 10, 20, 52030, 1895939, 3483719 — each row group covers a narrow range, and the reader can skip every file whose min/max stats exclude 52030.)
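The effect of sorting can be sketched with min/max ranges per row group. The numbers come from the slide; the particular grouping into ranges is made up for illustration:

```python
def matching_groups(row_groups, target):
    """Return indices of row groups whose [min, max] stats could contain target.
    This mirrors the pruning decision a Parquet reader makes from footer stats."""
    return [i for i, (lo, hi) in enumerate(row_groups) if lo <= target <= hi]

# Unsorted data: every group's range is wide and overlapping
unsorted_groups = [(10, 3483719), (10, 1895939)]

# After a global sort: narrow, mostly disjoint ranges
sorted_groups = [(10, 20), (52030, 3483719)]

# WHERE id = 52030
reads_unsorted = matching_groups(unsorted_groups, 52030)  # must read everything
reads_sorted = matching_groups(sorted_groups, 52030)      # skips the first group
```

Sorting before the Parquet write is what turns the stats from useless (every range matches) into a cheap index.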
Compact small files
• Regular “janitor” job that uses Spark to copy and compact data from one prefix into another
• Swap AWS Glue Data Catalog partition location
• If you know your dataset …
  • Spark "repartition" into optimal sizes
• Brute force it!
  • S3 list, group objects based on size
  • repartition(<some_small_number>) + "maxRecordsPerFile"
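The "group objects based on size" step might look like the following greedy sketch. The 128 MB target and the uniform file sizes are illustrative assumptions, not numbers from the talk:

```python
def plan_batches(sizes_mb, target_mb=128):
    """Greedily group file sizes into compaction batches of roughly target_mb each."""
    batches, current, total = [], [], 0.0
    for size in sorted(sizes_mb, reverse=True):  # largest first packs tighter
        if current and total + size > target_mb:
            batches.append(current)
            current, total = [], 0.0
        current.append(size)
        total += size
    if current:
        batches.append(current)
    return batches

# 356 files of ~0.58 MB each (~208 MB total), as in the compaction example
batches = plan_batches([0.58] * 356)
```

Each batch then becomes one `repartition(1)` write, yielding a handful of near-128 MB objects instead of hundreds of tiny ones.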
Compact small files
Before (AWS Glue Data Catalog partition dt=2018-01-20-00):
  raw/2018/01/20/00/file1.parquet
  raw/2018/01/20/00/file2.parquet
  …
  raw/2018/01/20/00/file356.parquet
  Total: 356 files, 208 MB

    ↓ AWS Glue compaction job ↓

After (updated partition dt=2018-01-20-00):
  compacted/2018/01/20/00/file1.parquet
  compacted/2018/01/20/00/file2.parquet
  Total: 2 files, ~200 MB
2. Partitioned data
• “Be careful when you repartition the data so you don’t write out too few partitions and exceed your partition rate limits” 😠
• “Be careful when you repartition the data (in Spark) so you don’t write out too few (Hive-style) partitions and exceed your (S3) partition rate limits”
Partitions, partitions, partitions—Hive-style
• S3 prefix convention
• s3://<bucket>/<prefix>/year=2018/month=01/day=20/
• s3://<bucket>/<prefix>/dt=2018-01-20/
• Cost and performance optimization
• Enables predicate pushdown
• Design based on query patterns
• Don’t go overboard

SELECT remote_ip, COUNT(*)
FROM access_logs
WHERE dt='2018-01-20'
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
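Hive-style partition pruning amounts to filtering prefixes before any data is read. A toy sketch with a hypothetical bucket, mirroring the WHERE dt='2018-01-20' query above:

```python
def prune_partitions(prefixes, dt):
    """Keep only the Hive-style partition prefixes matching a WHERE dt = ... predicate."""
    return [p for p in prefixes if f"dt={dt}/" in p]

prefixes = [
    "s3://my-bucket/access_logs/dt=2018-01-19/",
    "s3://my-bucket/access_logs/dt=2018-01-20/",
    "s3://my-bucket/access_logs/dt=2018-01-21/",
]

# WHERE dt='2018-01-20' -> only one prefix is listed and scanned
scanned = prune_partitions(prefixes, "2018-01-20")
```

Since Athena bills by bytes scanned, pruning two of three partitions here cuts both cost and latency by roughly two-thirds.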
Partitions, partitions, partitions—S3
• Bucket: service-logs
• Partition: service-logs/2
• Request rate limits (per prefix)
• ~3,500 PUT/LIST/DELETE requests/second
• ~5,500 GET requests/second
Object keys:
2018/01/20/data001.csv.gz
2018/01/20/data002.csv.gz
2018/01/20/data003.csv.gz
2018/01/21/data001.csv.gz
2018/01/21/data002.csv.gz
2018/01/22/data001.csv.gz
S3 automatically partitions based on key prefix
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Partitions, partitions, partitions—Spark
• “Logical chunk of a large distributed data set”
• Enables optimal distributed performance
• You can repartition – or redistribute – data in Spark
• Explicit number of partitions
• Partition by one or more columns (defaults to 200 partitions)
• spark.sql.shuffle.partitions
• Spark can read/write Hive-style partitions
write.partitionBy("date").parquet("s3://<bucket>/<prefix>/")
Data and workflow management
• Emerging area – lots of active development
• Apache Atlas (1.0 – June 2018) - https://atlas.apache.org/
• Netflix Metacat (June 2018) - https://github.com/Netflix/metacat
• LinkedIn WhereHows (2016) - https://github.com/linkedin/WhereHows
• AWS Glue (2016) - https://aws.amazon.com/glue/
• FINRA Herd (2015) - http://finraos.github.io/herd/
• Apache Airflow - https://airflow.apache.org/
• Spotify Luigi - https://luigi.readthedocs.io/
• Apache Oozie - http://oozie.apache.org/
Data security
• Limit access to data on S3 using IAM policies
• Encrypt data at rest using S3- or AWS Key Management Service (AWS KMS)-managed keys
• Amazon EMR integrates with LDAP for authentication
• Amazon Macie for discovering sensitive data
• Enable S3 access logs for auditing
• Bonus – Query your S3 access logs on S3 using Athena!
Monitoring
• Track latency on your different data sources/pipelines
• Alarm on data freshness to avoid data drift
• Can use AWS Glue metadata properties or external system
• CloudWatch integrates with AWS Glue/Amazon EMR/Athena
• Build trust
• Notice data latency before your customers
• Data validation – “Is that metric right?”
• When something goes wrong, how will you recover?
Overview
• So far we’ve …
• Crawled and discovered our raw datasets
• Converted data into Parquet
• Organized data into structures that make sense for our business
• Compacted small files into larger ones
• Optimized the heck out of Parquet
• Partitioned our data when it needs to be
• So what does this look like?
Storage tiers—Raw
raw-bucket/
|___ apache-logs/
|___ dt=2018-01-20-00/
|___ log-2018-01-20T00:01:00Z.log
|___ log-2018-01-20T00:01:03Z.log
|___ log-2018-01-20T00:02:10Z.log
|___ log-2018-01-20T00:20:00Z.log
|
|___ syslog-logs/
|___ dt=2018-01-20-02/
|___ syslog-2018-01-20T02:00:00Z.gz
|___ syslog-2018-01-20T02:15:00Z.gz
|
|___ kinesis-events/
|___ 2018
|___ 01
|___ 20
|___ 00
|___ events-01:15.parquet
|___ events-01:16.parquet
raw-bucket/
|___ db-backups/
|___ prod-webapp/
|___ 2018-01-19/
|___ users_20180120.tsv.gz
|___ products_20180120.tsv.gz
|___ plans_20180120.tsv.gz
|___ orgs_20180120.tsv.gz
|___ 2018-01-18/
|___ users_20180119.tsv.gz
|___ products_20180119.tsv.gz
|___ plans_20180119.tsv.gz
|___ orgs_20180119.tsv.gz
Storage tiers—Landing
raw-processed-bucket/
|___ log-data/
|___ apache/
|___ dt=2018-01-20/
|___ part00000.snappy.parquet
|___ part00001.snappy.parquet
|___ syslog/
|___ dt=2018-01-20/
|___ part00000.snappy.parquet
|
|___ product/
|___ orgs/
|___ part00000.snappy.parquet
|___ plans/
|___ part00000.snappy.parquet
|___ events/
|___ dt=2018-01-20/
|___ part00000.snappy.parquet
|___ part00001.snappy.parquet
Storage tiers—Production model
production-marketing-bucket/
|___ product/
|___ users/
|___ part00000.snappy.parquet
|___ events/
|___ dt=2018-01-20/
|___ part00000.snappy.parquet
|___ part00001.snappy.parquet
|___ email-analytics/
|___ email-clicks/
|___ dt=2018-01-20/
|___ part00000.snappy.parquet
production-marketing-bucket/
|___ social-analytics/
|___ reach-dashboard/
|___ aggregates-by-date/
|___ dt=2018-01-20/
|___ part00000.snappy.parquet
|___ aggregates-by-region/
|___ region=na/
|___ dt=2018-01-20/
|___ part00000.parquet
|___ region=apac/
|___ dt=2018-01-20/
|___ part00000.parquet
Optimizing for analytics
• Data prep optimizations apply
• Compress, compact, convert
• Tailor to your customer’s use case
• It’s OK to duplicate data
• Create production data models
• Use case dependent – usually by organization
• Same data is typically represented different ways depending on need
• Always running COUNT/MIN/MAX queries? Create aggregates!
Application-specific tuning and use cases
• “What database should I use for our data lake?”
• “Yes.”
• Athena
• Amazon EMR
• AWS Glue
• Amazon QuickSight
• Amazon SageMaker
• Amazon Redshift Spectrum
Thank you!
Damon Cortesi
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...Amazon Web Services
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon AthenaSungmin Kim
 
Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...
Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...
Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...AWS Germany
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Data Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFData Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFAmazon Web Services
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSAmazon Web Services
 

Semelhante a Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018 (20)

Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with Lab
 
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech TalksHow to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Query your data in S3 with SQL and optimize for cost and performance
Query your data in S3 with SQL and optimize for cost and performanceQuery your data in S3 with SQL and optimize for cost and performance
Query your data in S3 with SQL and optimize for cost and performance
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...
Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...
Analytics Web Day | Query your Data in S3 with SQL and optimize for Cost and ...
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Data Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SFData Warehousing with Amazon Redshift: Data Analytics Week SF
Data Warehousing with Amazon Redshift: Data Analytics Week SF
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018

Data Catalog
• Automatically uses an Amazon S3 prefix in “YYYY/MM/DD/HH” UTC time format
• Caveat: Partitions are not automatically managed in AWS Glue
(Diagram: Kinesis Data Streams → Kinesis Data Firehose → S3 (Parquet) → AWS Glue Data Catalog)
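Because of that caveat, Firehose's date prefixes must be registered as partitions by something else, e.g. a small scheduled job that parses each delivered key and calls the Glue API. A minimal sketch of the key-parsing step, assuming the usual `<prefix>/YYYY/MM/DD/HH/<file>` layout; `partition_values_from_key` is a hypothetical helper, and the actual catalog call (e.g. `glue.batch_create_partition`) is omitted:

```python
import re

def partition_values_from_key(key):
    """Extract Glue partition values from a Kinesis Data Firehose S3 key.

    Firehose delivers objects under '<prefix>/YYYY/MM/DD/HH/<file>' (UTC),
    so the four date components become the partition values you would
    register in the AWS Glue Data Catalog.
    """
    match = re.search(r"(\d{4})/(\d{2})/(\d{2})/(\d{2})/", key)
    if match is None:
        raise ValueError(f"no Firehose date prefix found in key: {key}")
    return list(match.groups())
```

For example, `partition_values_from_key("events/2018/01/20/05/part-0001.parquet")` returns `["2018", "01", "20", "05"]`.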

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Spark on Amazon EMR for splitting generic data

def createContext():
    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, 1)  # Set a batch interval of 1 second
    ssc.checkpoint("s3://<bucket>/<checkpoint>")  # Enable Spark checkpointing in S3
    lines = KinesisUtils.createStream(
        ssc, "StrLogger", "dcortesi-logger",  # Kinesis checkpointing in "StrLogger" DynamoDB table
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 2
    )
    parsedDF = lines.map(lambda v: map_json(v))  # map_json creates standard schema with nested JSON
    parsedDF.foreachRDD(save_rdd)
    return ssc

# For each RDD, write it out to S3 partitioned by the log type ("tag") and date ("dt")
def save_rdd(rdd):
    if rdd.count() > 0:
        sqlContext = SQLContext(rdd.context)
        df = sqlContext.createDataFrame(rdd)
        df.write.partitionBy("tag", "dt").mode("append").parquet("s3://<bucket>/<prefix>/")

ssc = StreamingContext.getOrCreate("s3://<bucket>/<cp>/", lambda: createContext())
ssc.start()

https://spark.apache.org/docs/latest/streaming-kinesis-integration.html

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Database sources
• AWS Database Migration Service (AWS DMS)
• Can extract in near-real-time to S3 in change data capture format
• Export from DB
• AWS Glue JDBC connections or native DB utilities
• Crawl using AWS Glue
(Diagram: Amazon RDS / Amazon DynamoDB → AWS Glue → S3 bucket)

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Third-party sources
• Still need some tooling to extract from third-party sources like Segment/Mixpanel
• Some sources provide a JDBC connector (Salesforce)
• Similar to DB sources
• Export → S3 → Catalog with AWS Glue
• If needed, iteratively process using AWS Glue bookmarks

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Example structures
Kinesis Data Firehose to Parquet:
  YYYY/MM/DD/HH/name-TS-UUID.parquet
Log data:
  <pre>/<logtype>/dt=YYYY-MM-DD/json.gz
DB exports:
  <pre>/<schema>/<db>/<table>/Y-m-dTH/
  <pre>/<table>/LOAD00[1-9].CSV
  <pre>/<table>/<timestamp>.csv
Third-party exports:
  <pre>/full_export/latest-YMD/csv.gz
  <pre>/incremental/YYYY/MM/DD/csv.gz

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Business drivers
• “Hey, can you just add this one data source real quick?”
• Question everything and understand the business drivers
• Do you really need real-time access?
• What business decisions are you making?
• What fields are important to you?
• What questions are you trying to answer?
• The quickest way to a data swamp is trying to tackle everything at once. You won’t understand why any data is needed or important, and worse, neither will your customers.

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimize your data
• Most raw data will not be optimized for querying and might also contain sensitive data
• There will be CSVs
• There will be multiple date/time formats
• Carefully grant access to raw
• Architect for idempotency
• Storage tiers: raw → landing → production
1. Optimized file format
2. Partitioned data
3. Masked data

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
1. Optimized file format
• Compress
• Compact
• Convert

Compaction (tests run using Athena):
  SELECT count(*) FROM lineitem
    5,000 files: 8.4 seconds
    1 file: 2.31 seconds
    Speedup: 3.6x faster

Conversion:
  SELECT count(*) FROM events
    json.gz (46,182 files, 176 GB): 463.33 seconds
    snappy parquet (11,640 files, 213 GB): 6.25 seconds
    Speedup: 74x faster

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Row and column formats
Source table:
  ID   Age  State
  123  20   CA
  345  25   WA
  678  40   FL
  999  21   WA
Row format (values stored row by row):
  123 20 CA | 345 25 WA | 678 40 FL | 999 21 WA
Column format (values stored column by column):
  123 345 678 999 | 20 25 40 21 | CA WA FL WA

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Parquet file format
• Row group metadata allows the Parquet reader to skip portions of files, or entire files

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Parquet challenges
• Schema evolution, particularly with nested data
  • https://github.com/prestodb/presto/pull/6675 (2016)
  • https://github.com/prestodb/presto/pull/10158 (March 2018)
• Can’t use predicate pushdowns with nested data
• Strings are just binary, which can lead to performance issues in extreme cases
• Use parquet-tools to debug and investigate your Parquet files
  • github.com/apache/parquet-mr/tree/master/parquet-tools
  • Most recent version doesn’t show string min/max stats (patch)

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Flatten data using AWS Glue DynamicFrame

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "dcortesi",
    table_name = "events_relationalize",
    transformation_ctx = "datasource0"
)
datasource1 = Relationalize.apply(
    frame = datasource0,
    staging_path = glue_temp_path,
    name = dfc_root_table_name,
    transformation_ctx = "datasource1"
)
flattened_dynf = datasource1.select(dfc_root_table_name)
(
    flattened_dynf.toDF()
    .repartition("date")
    .write
    .partitionBy("date")
    .option("maxRecordsPerFile", OUTPUT_LINES_PER_FILE)
    .mode("append")
    .parquet(EXPORT_PATH)
)

https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Flatten data using AWS Glue DynamicFrame
Original nested record:
{
  "player": {
    "username": "user1",
    "characteristics": {
      "race": "Human",
      "class": "Warlock",
      "subclass": "Dawnblade",
      "power": 300,
      "playercountry": "USA"
    },
    "arsenal": {
      "kinetic": {
        "name": "Sweet Business",
        "type": "Auto Rifle",
        "power": 300,
        "element": "Kinetic"
      },
      "energy": {
        "name": "MIDA Mini-Tool",
        "type": "Submachine Gun",
        "power": 300,
        "element": "Solar"
      }
    }
  }
}
Flattened output:
{
  "player.username": "user1",
  "player.characteristics.race": "Human",
  "player.characteristics.class": "Warlock",
  "player.characteristics.subclass": "Dawnblade",
  "player.characteristics.power": 300,
  "player.characteristics.playercountry": "USA",
  "player.arsenal.kinetic.name": "Sweet Business",
  "player.arsenal.kinetic.type": "Auto Rifle",
  "player.arsenal.kinetic.power": 300,
  "player.arsenal.kinetic.element": "Kinetic",
  "player.arsenal.energy.name": "MIDA Mini-Tool",
  "player.arsenal.energy.type": "Submachine Gun",
  "player.arsenal.energy.power": 300,
  "player.arsenal.energy.element": "Solar"
}
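The dot-separated column naming that Relationalize produces for nested JSON can be illustrated in a few lines of plain Python. This is a hypothetical `flatten` helper for illustration, not the Glue implementation itself (which also handles arrays by splitting them into separate tables):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dot-separated keys, mirroring the
    column names that the Relationalize transform produces."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            # Recurse, extending the dotted path
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"player": {"username": "user1"}})` yields `{"player.username": "user1"}`.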

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
parquet-tools output
row group 1: RC:177 TS:16692 OFFSET:4
--------------------------------------------------------------------------------
operation:   BINARY SNAPPY DO:0 FPO:4 SZ:250/315/1.26 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: REST.COPY.OBJECT, max: REST.PUT.OBJECT, num_nulls: 0]
bytes_sent:  BINARY SNAPPY DO:0 FPO:254 SZ:436/512/1.17 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 1161, max: 8, num_nulls: 34]
object_size: BINARY SNAPPY DO:0 FPO:3882 SZ:134/130/0.97 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 0, max: 79, num_nulls: 134]
remote_ip:   BINARY SNAPPY DO:0 FPO:4016 SZ:323/354/1.10 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 10.1.2.3, max: 54.2.4.5, num_nulls: 0]
bucket:      BINARY SNAPPY DO:0 FPO:7270 SZ:126/122/0.97 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: dcortesi-test-us-west-2, max: dcortesi-test-us-west-2, num_nulls: 0]
time:        INT96 SNAPPY DO:0 FPO:8230 SZ:518/739/1.43 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[no stats for this column]
user_agent:  BINARY SNAPPY DO:0 FPO:8748 SZ:625/771/1.23 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: , aws-sdk-java/1.11.132 Linux/4.9.76-3.78.amzn1.x86_64 OpenJDK_64-Bit_Server_VM/25.141-b16/1.8.0_141, presto, max: aws-cli/1.14.17 Python/2.7.10 Darwin/16.7.0 botocore/1.8.21, num_nulls: 1]
http_status: BINARY SNAPPY DO:0 FPO:9373 SZ:160/160/1.00 VC:177 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED ST:[min: 200, max: 404, num_nulls: 0]
(Slide callouts: row count (RC), column stats (ST), compression (SNAPPY), data type (BINARY/INT96 🤔), total size (TS))

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimizing columnar even more
• 300 GB dataset in S3, but only retrieving one row per Athena query
• Used AWS Glue to sort the entire dataset before writing to Parquet
• Parquet reader can determine whether to skip an entire file
• Cost optimization
(Diagram: unsorted files with interleaved ids 3483719, 1895939, 10, 20, 52030 force every file to be read; after sorting, ids 10 … 52030 … 1895939 … 3483719 land in ordered files, so WHERE id = 52030 reads only one file)
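The skip decision itself is simple: the reader compares the query predicate against each file's (or row group's) min/max statistics and only opens files whose range could match. A toy model of that logic, assuming hypothetical per-file `(name, min_id, max_id)` stats:

```python
def files_to_read(file_stats, target):
    """Return the files whose [min, max] id range could contain target;
    everything else can be skipped without being opened."""
    return [name for name, lo, hi in file_stats if lo <= target <= hi]

# Unsorted data: every file's range spans nearly the whole id space,
# so nothing can be skipped. After a global sort, ranges are disjoint.
unsorted = [("part-0", 10, 3483719), ("part-1", 20, 1895939)]
sorted_parts = [("part-0", 10, 52030), ("part-1", 52031, 3483719)]
```

Here `files_to_read(unsorted, 52030)` must open both files, while `files_to_read(sorted_parts, 52030)` opens only `part-0`.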

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Compact small files
• Regular “janitor” job that uses Spark to copy and compact data from one prefix into another
• Swap the AWS Glue Data Catalog partition location
• If you know your dataset …
  • Spark “repartition” into optimal sizes
• Brute force it!
  • S3 list, group objects based on size
  • repartition(<some_small_number>) + “maxRecordsPerFile”
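The brute-force path (list S3 objects, then group them by cumulative size into roughly target-sized batches) can be sketched in plain Python. `group_by_size` is an illustrative helper, and the 128 MB target below is just a common choice, not a prescribed value:

```python
def group_by_size(objects, target_bytes):
    """Greedily bin (key, size) pairs into groups of roughly target_bytes,
    so each group can be compacted into one optimally-sized output file."""
    groups, current, current_size = [], [], 0
    for key, size in objects:
        # Close the current group once adding this object would overshoot
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Three 60 MB objects with a 128 MB target, for instance, compact into two output files: one from the first two objects, one from the third.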

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Compact small files
Before (AWS Glue Data Catalog partition dt=2018-01-20-00):
  raw/2018/01/20/00/file1.parquet
  raw/2018/01/20/00/file2.parquet
  …
  raw/2018/01/20/00/file356.parquet
  Total: 356 files, 208 MB
AWS Glue compaction job
After (updated partition dt=2018-01-20-00):
  compacted/2018/01/20/00/file1.parquet
  compacted/2018/01/20/00/file2.parquet
  Total: 2 files, ~200 MB

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
2. Partitioned data
• “Be careful when you repartition the data so you don’t write out too few partitions and exceed your partition rate limits” 😠
• “Be careful when you repartition the data (in Spark) so you don’t write out too few (Hive-style) partitions and exceed your (S3) partition rate limits”

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Partitions, partitions, partitions—Hive-style
• S3 prefix convention
  • s3://<bucket>/<prefix>/year=2018/month=01/day=20/
  • s3://<bucket>/<prefix>/dt=2018-01-20/
• Cost and performance optimization
• Enables predicate pushdown
• Design based on query patterns
• Don’t go overboard
SELECT remote_ip, COUNT(*) FROM access_logs WHERE dt='2018-01-20'
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
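Pruning is why this pays off: for a date predicate, the engine only lists the matching dt= prefixes instead of scanning the whole table. A sketch of prefix construction for a date range, using a hypothetical `partition_prefixes` helper and illustrative bucket/prefix names:

```python
from datetime import date, timedelta

def partition_prefixes(bucket, prefix, start, days):
    """Build the Hive-style dt= prefixes a query planner would scan for a
    date-range predicate; only these prefixes are read, not the whole table."""
    return [
        f"s3://{bucket}/{prefix}/dt={(start + timedelta(days=i)).isoformat()}/"
        for i in range(days)
    ]
```

For a single-day predicate like the SELECT above, this yields exactly one prefix to scan.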

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Partitions, partitions, partitions—S3
• S3 automatically partitions based on key prefix
• Bucket: service-logs
• Partition: service-logs/2
Object keys:
  2018/01/20/data001.csv.gz
  2018/01/20/data002.csv.gz
  2018/01/20/data003.csv.gz
  2018/01/21/data001.csv.gz
  2018/01/21/data002.csv.gz
  2018/01/22/data001.csv.gz
• Request-rate limits (per prefix):
  • ~3,500 PUT/LIST/DELETE requests/second
  • ~5,500 GET requests/second
https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Partitions, partitions, partitions—Spark
• “Logical chunk of a large distributed data set”
• Enables optimal distributed performance
• You can repartition – or redistribute – data in Spark
  • Explicit number of partitions
  • Partition by one or more columns (defaults to 200 partitions)
  • spark.sql.shuffle.partitions
• Spark can read/write Hive-style partitions
write.partitionBy("date").parquet("s3://<bucket>/<prefix>/")
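What repartitioning by a column does can be mimicked in a few lines: rows are assigned to partitions by hashing the column value, so all rows that share a value land in the same partition (and, when written with partitionBy, the same output directory). This is a toy sketch, not Spark's actual partitioner:

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row (a dict) to a partition by hashing row[key];
    rows with the same key value always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions
```

Note the hazard from the previous slide: partitioning on a low-cardinality column yields few distinct partitions, no matter how many you request.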

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data and workflow management
• Emerging area – lots of active development
• Apache Atlas (1.0, June 2018): https://atlas.apache.org/
• Netflix Metacat (June 2018): https://github.com/Netflix/metacat
• LinkedIn WhereHows (2016): https://github.com/linkedin/WhereHows
• AWS Glue (2016): https://aws.amazon.com/glue/
• FINRA Herd (2015): http://finraos.github.io/herd/
• Apache Airflow: https://airflow.apache.org/
• Spotify Luigi: https://luigi.readthedocs.io/
• Apache Oozie: http://oozie.apache.org/

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data security
• Limit access to data on S3 using IAM policies
• Encrypt data at rest using S3- or AWS Key Management Service (AWS KMS)-managed keys
• Amazon EMR integrates with LDAP for authentication
• Amazon Macie for discovering sensitive data
• Enable S3 access logs for auditing
• Bonus: query your S3 access logs using Athena!

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Monitoring
• Track latency on your different data sources/pipelines
• Alarm on data freshness to avoid data drift
• Can use AWS Glue metadata properties or an external system
• CloudWatch integrates with AWS Glue/Amazon EMR/Athena
• Build trust
  • Notice data latency before your customers do
  • Data validation: “Is that metric right?”
  • When something goes wrong, how will you recover?
• 36.
Overview
• So far we’ve …
  • Crawled and discovered our raw datasets
  • Converted data into Parquet
  • Organized data into structures that make sense for our business
  • Compacted small files into larger ones
  • Optimized the heck out of Parquet
  • Partitioned our data where it needs to be
• So what does this look like?
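Compacting small files is essentially a bin-packing problem: group inputs until each batch approaches a target output size. A hedged stdlib sketch of one greedy strategy (the function name and ~128 MB target are illustrative choices, not the deck's prescription):

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily group small files into batches that stay near a
    target output size (e.g. ~128 MB, a common Parquet row-group
    and HDFS block-size target)."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

mb = 1024 * 1024
# Thirty 10 MB files pack into batches of twelve (120 MB each),
# with the remainder in a final smaller batch.
batches = plan_compaction([10 * mb] * 30, target_bytes=128 * mb)
```

Each planned batch would then be rewritten as a single larger Parquet file, cutting per-file overhead for readers like Athena and Spark.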
• 37.
Storage tiers—Raw

raw-bucket/
|___ apache-logs/
|    |___ dt=2018-01-20-00/
|         |___ log-2018-01-20T00:01:00Z.log
|         |___ log-2018-01-20T00:01:03Z.log
|         |___ log-2018-01-20T00:02:10Z.log
|         |___ log-2018-01-20T00:20:00Z.log
|
|___ syslog-logs/
|    |___ dt=2018-01-20-02/
|         |___ syslog-2018-01-20T02:00:00Z.gz
|         |___ syslog-2018-01-20T02:15:00Z.gz
|
|___ kinesis-events/
     |___ 2018
          |___ 01
               |___ 20
                    |___ 00
                         |___ events-01:15.parquet
                         |___ events-01:16.parquet

raw-bucket/
|___ db-backups/
     |___ prod-webapp/
          |___ 2018-01-19/
          |    |___ users_20180120.tsv.gz
          |    |___ products_20180120.tsv.gz
          |    |___ plans_20180120.tsv.gz
          |    |___ orgs_20180120.tsv.gz
          |
          |___ 2018-01-18/
               |___ users_20180119.tsv.gz
               |___ products_20180119.tsv.gz
               |___ plans_20180119.tsv.gz
               |___ orgs_20180119.tsv.gz
• 38.
Storage tiers—Landing

raw-processed-bucket/
|___ log-data/
|    |___ apache/
|    |    |___ dt=2018-01-20/
|    |         |___ part00000.snappy.parquet
|    |         |___ part00001.snappy.parquet
|    |
|    |___ syslog/
|         |___ dt=2018-01-20/
|              |___ part00000.snappy.parquet
|
|___ product/
|    |___ orgs/
|    |    |___ part00000.snappy.parquet
|    |
|    |___ plans/
|         |___ part00000.snappy.parquet
|
|___ events/
     |___ dt=2018-01-20/
          |___ part00000.snappy.parquet
          |___ part00001.snappy.parquet
• 39.
Storage tiers—Production model

production-marketing-bucket/
|___ product/
|    |___ users/
|    |    |___ part00000.snappy.parquet
|    |
|    |___ events/
|         |___ dt=2018-01-20/
|              |___ part00000.snappy.parquet
|              |___ part00001.snappy.parquet
|
|___ email-analytics/
     |___ email-clicks/
          |___ dt=2018-01-20/
               |___ part00000.snappy.parquet

production-marketing-bucket/
|___ social-analytics/
     |___ reach-dashboard/
          |___ aggregates-by-date/
          |    |___ dt=2018-01-20/
          |         |___ part00000.snappy.parquet
          |
          |___ aggregates-by-region/
               |___ region=na/
               |    |___ dt=2018-01-20/
               |         |___ part00000.parquet
               |
               |___ region=apac/
                    |___ dt=2018-01-20/
                         |___ part00000.parquet
• 41.
Optimizing for analytics
• Data prep optimizations apply
  • Compress, compact, convert
• Tailor to your customer’s use case
  • It’s OK to duplicate data
• Create production data models
  • Use-case dependent – usually by organization
  • The same data is typically represented in different ways depending on need
• Always running COUNT/MIN/MAX queries? Create aggregates!
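The aggregates tip can be sketched in plain Python: precompute COUNT/MIN/MAX per partition key once, then point dashboards at the small result instead of scanning raw events (the function name and event shapes below are illustrative):

```python
from collections import defaultdict

def daily_aggregates(events):
    """Precompute count/min/max per day so dashboards read a tiny
    aggregate table instead of scanning raw events."""
    agg = defaultdict(lambda: {"count": 0, "min": None, "max": None})
    for e in events:
        a = agg[e["date"]]
        a["count"] += 1
        a["min"] = e["value"] if a["min"] is None else min(a["min"], e["value"])
        a["max"] = e["value"] if a["max"] is None else max(a["max"], e["value"])
    return dict(agg)

events = [
    {"date": "2018-01-20", "value": 5},
    {"date": "2018-01-20", "value": 9},
    {"date": "2018-01-21", "value": 7},
]
agg = daily_aggregates(events)
```

In a data lake this job would run on Amazon EMR or AWS Glue and write its small output to a production-tier prefix, like the `aggregates-by-date/` layout shown earlier.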
• 42.
Application-specific tuning and use cases
• “What database should I use for our data lake?”
  • “Yes.”
• Athena
• Amazon EMR
• AWS Glue
• Amazon QuickSight
• Amazon SageMaker
• Amazon Redshift Spectrum
• 43. Thank you! Damon Cortesi