This document provides an overview and agenda for a presentation on real-time data processing using AWS Lambda. The presentation covers serverless real-time data processing concepts, processing streaming data with Lambda and Kinesis, a streaming data processing demo, a data processing pipeline with Lambda and MapReduce, and a big data processing solution demo. It also discusses a customer story of Fannie Mae using distributed computing with Lambda for financial modeling. Key topics include serverless processing of real-time streaming data, a map-reduce model for serverless distributed computing, benchmarks of serverless distributed computing, and Fannie Mae's journey migrating their high performance computing workloads to AWS Lambda.
2. Agenda
What’s Serverless Real-Time Data Processing?
Processing Streaming Data with Lambda and Kinesis
Streaming Data Processing Demo
Data Processing Pipeline with Lambda and MapReduce
Building a Big Data Processing Solution Demo
What’s Serverless Real-Time Data Processing?
Serverless Processing of Real-Time Streaming Data
Serverless Data Processing with Distributed Computing
Customer Story:
Fannie Mae-Distributed Computing with Lambda
4. AWS Lambda
Efficient performance at scale Easy to author, deploy,
maintain, secure & manage. Focus on business logic
to build back-end services that perform at scale.
Bring Your Own Code: Stateless, event-driven code
with native support for Node.js, Java, Python and C#
languages.
No Infrastructure to manage: Compute without
managing infrastructure like Amazon EC2 instances
and Auto Scaling groups.
Cost-effective: Automatically matches capacity to
request rate. Compute cost 100 ms increments.
Triggered by events: Direct Sync & Async API calls,
AWS service integrations, and 3rd party triggers.
6. Serverless Real-Time Data Processing Is..
Capture Data
Streams
IoT Data
Financial
Data
Log Data
No servers to
provision or
manage
EVENT SOURCE
Node.js
Python
Java
C#
Process Data
Streams
FUNCTION
Clickstream
Data
Output
Data
DATABASE
CLOUD
SERVICES
7. Amazon
DynamoDB
Amazon
Kinesis
Amazon
S3
Amazon
SNS
ASYNCHRONOUS PUSH MODEL
STREAM PULL MODEL
Lambda Real-Time Event Sources
Amazon
Alexa
AWS
IoT
SYNCHRONOUS PUSH MODEL
Mapping owned by Event Source
Mapping owned by Lambda
Invokes Lambda via Event Source API
Lambda function invokes when new
records found on stream
Resource-based policy permissions
Lambda Execution role policy permissions
Concurrent executions
Sync invocation
Async Invocation
Sync invocation
Lambda polls the streams
HOW IT WORKS
9. Amazon Kinesis
Real-Time: Collect real-time data streams and
promptly respond to key business events and
operational triggers. Real-time latencies.
Easy to use: Focus on quickly launching data
streaming applications instead of managing
infrastructure.
Amazon Kinesis Offering: Managed services for
streaming data ingestion and processing.
• Amazon Kinesis Streams: Build applications
that process or analyze streaming data.
• Amazon Kinesis Firehose: Load massive
volumes of streaming data into Amazon S3
and Amazon Redshift.
• Amazon Kinesis Analytics: Analyze data
streams using SQL queries.
10. Processing Real-Time Streams: Lambda + Amazon Kinesis
Streaming data sent to Amazon
Kinesis and stored in shards
Multiple Lambda functions can be
triggered to process same Amazon
Kinesis stream for “fan out”
Lambda can process data and store
results ex. to DynamoDB, S3
Lambda can aggregate data to
services like Amazon Elasticsearch
Service for analytics
Lambda sends event data and
function info to Amazon CloudWatch
for capturing metrics and monitoring
Amazon
Kinesis
AWS
Lambda
Amazon
CloudWatch
Amazn
DynamoDB
AWS
Lambda
Amazon
Elasticsearch Service
Amazon
S3
11. Processing Streams: Set Up Amazon Kinesis Stream
Streams
Made up of Shards
Each Shard ingests/reads data up to 1 MB/sec
Each Shard emits/writes data up to 2 MB/sec
Each shard supports 5 reads/sec
Data
All data is stored and is replayable for 24 hours
Make sure partition key distribution is even to optimize parallel throughput
Partition key used to distribute PUTs across shards, choose key with more groups than
shards
Best Practice
Determine an initial size/shards to plan for expected maximum demand
Leverage “Help me decide how many shards I need” option in Console
Use formula for Number Of Shards:
max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB / 2000)
12. Processing Streams: Create Lambda functions
Memory
CPU allocation proportional to the memory configured
Increasing memory makes your code execute faster (if CPU bound)
Increasing memory allows for larger record sizes processed
Timeout
Increasing timeout allows for longer functions, but longer wait in case of errors
Permission model
Execution role defined for Lambda must have permission to access the stream
Retries
With Amazon Kinesis, Lambda retries until the data expires
(24 hours)
Best Practice
Write Lambda function code to be stateless
Instantiate AWS clients & database clients outside the scope of the function handler
13. Processing Streams: Configure Event Source
Amazon Kinesis mapped as event source in Lambda
Batch size
Max number of records that Lambda will send to one invocation
Not equivalent to effective batch size
Effective batch size is every 250 ms – Calculated as:
MIN(records available, batch size, 6MB)
Increasing batch size allows fewer Lambda function invocations with more
data processed per function
Best Practices
Set to “Trim Horizon” for reading from start of
stream (all data)
Set to “Latest” for reading most recent data (LIFO) (latest data)
14. Processing streams: How It Works
Polling
Concurrent polling and processing per shard
Lambda polls every 250 ms if no records found
Will grab as much data as possible in one GetRecords call (Batch)
Batching
Batches are passed for invocation to Lambda through
function parameters
Batch size may impact duration if the Lambda function
takes longer to process more records
Sub batch in memory for invocation payload
Synchronous invocation
Batches invoked as synchronous RequestResponse type
Lambda honors Amazon Kinesis at least once semantics
Each shard blocks in order of synchronous invocation
15. Processing streams: Tuning throughput
If put / ingestion rate is greater than the theoretical throughput, your
processing is at risk of falling behind
Maximum theoretical throughput
# shards * 2MB / Lambda function duration (s)
Effective theoretical throughput
# shards * batch size (MB) / Lambda function duration (s)
… …
Source
Amazon Kinesis
Destination
1
Lambda
Destination
2
FunctionsShards
Lambda will scale automaticallyScale Amazon Kinesis by splitting or merging shards
Waits for responsePolls a batch
16. Processing streams: Tuning Throughput w/ Retries
Retries
Will retry on execution failures until the record is expired
Throttles and errors impacts duration and directly impacts throughput
Best Practice
Retry with exponential backoff of up to 60s
Effective theoretical throughput with retries
( # shards * batch size (MB) ) / ( function duration (s) * retries until expiry)
… …
Source
Amazon Kinesis
Destination
1
Lambda
Destination
2
FunctionsShards
Lambda will scale automaticallyScale Amazon Kinesis by splitting or merging shards
Receives errorPolls a batch
Receives error
Receives success
17. Processing streams: Common observations
Effective batch size may be less than configured during low throughput
Effective batch size will increase during higher throughput
Increased Lambda duration -> decreased # of invokes and GetRecord calls
Too many consumers of your stream may compete with Amazon Kinesis read
limits and induce ReadProvisionedThroughputExceeded errors and metrics
Amazon
Kinesis
AWS
Lambda
18. Processing streams: Monitoring with Cloudwatch
• GetRecords: (effective throughput)
• PutRecord : bytes, latency, records, etc
• GetRecords.IteratorAgeMilliseconds: how old your
last processed records were
Monitoring Amazon Kinesis Streams
Monitoring Lambda functions
• Invocation count: Time function invoked
• Duration: Execution/processing time
• Error count: Number of Errors
• Throttle count: Number of time function throttled
• Iterator Age: Time elapsed from batch received &
final record written to stream
• Review All Metrics
• Make Custom logs
• View RAM consumed
• Search for log events
Debugging
AWS X-Ray
Coming soon!
20. Serverless Distributed Computing: Map-Reduce Model
Why Serverless Data Processing with Distributed
Computing?
Remove Difficult infrastructure management
Cluster administration
Complex configuration tools
Enable simple, elastic, user-friendly distributed data
processing
Eliminate complexity of state management
Bring Distributed Computing power to the masses
21. Serverless Distributed Computing: Map-Reduce Model
Why Serverless Data Processing with Distributed
Computing?
Eliminate utilization concerns
Makes code simpler by removes complexities of multi-
threading processing to optimize server usage
Cost-effective option to run ad hoc MapReduce jobs
Easier, automatic horizontal scaling
Provide ability to process scientific and analytics
applications
23. Serverless Distributed Computing: PyWren
PyWren Prototype Developed at University of California, Berkeley
Uses Python with AWS Lambda stateless functions for large scale data
analytics
Achieved @ 30-40 MB/s write and read performance per-core to S3
object store
Scaled to 60-80 GB/s across 2800 simultaneous functions
24. Serverless Distributed Computing: Benchmark
Using Amazon MapReduce Reference Architecture Framework
with Lambda
Dataset
Queries:
Scan query (90 M Rows, 6.36 GB of data)
Select query on Page Rankings
Aggregation query on UserVisits ( 775M rows, ~127GB of
data)
Rankings
(rows)
Rankings
(bytes)
UserVisits
(rows)
UserVisits
(bytes)
Documents
(bytes)
90 Million 6.38 GB 775 Million 126.8 GB 136.9 GB
25. Serverless Distributed Computing: Benchmark
Using Amazon MapReduce Reference Architecture Framework
with Lambda
Subset of the Amplab benchmark ran to compare with other data
processing frameworks
Performance Benchmarks: Execution time for each workload in seconds
TECHNOLOGY SCAN 1A SCAN 1B AGGREGATE 2A
Amazon Redshift (HDD) 2.49 2.61 25.46
Serverless MapReduce 39 47 200
Impala - Disk - 1.2.3 12.015 12.015 113.72
Impala - Mem - 1.2.3 2.17 3.01 84.35
Shark - Disk - 0.8.1 6.6 7 151.4
Shark - Mem - 0.8.1 1.7 1.8 83.7
Hive - 0.12 YARN 50.49 59.93 730.62
Tez - 0.2.0 28.22 36.35 377.48
38. Data Processing with AWS: Next steps
Learn more about AWS Serverless at
https://aws.amazon.com/serverless
Explore the AWS Lambda Reference Architecture on GitHub:
Real-Time Streaming:
https://github.com/awslabs/lambda-refarch-
streamprocessing
Distributed Computing Reference Architecture
(serverless MapReduce)
https://github.com/awslabs/lambda-refarch-mapreduce
39. Data Processing with AWS: Next steps
Create an Amazon Kinesis stream. Visit the Amazon Kinesis
Console and configure a stream to receive data Ex. data from
Social media feeds.
Create & test a Lambda function to process streams from Amazon
Kinesis by visiting Lambda console. First 1M requests each month
are on us!
Read the Developer Guide and try the Lambda and Amazon
Kinesis Tutorial:
http://docs.aws.amazon.com/lambda/latest/dg/with-
kinesis.html
Send questions, comments, feedback to the AWS Lambda Forums
TEW Notes: A shard is a uniquely identified group of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity.
Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). The data capacity of your stream is a function of the number of shards that you specify for the stream
You should plan for expected maximum demand when you provision shards. Therefore, Before you create a stream, you need to determine an initial size for the stream. A shard is a unit of throughput capacity. On the Create Stream console page, select the checkbox labelled "Help me decide how many shards I need
To determine the initial size of a stream, you'll need the following input values:
The average size of the data record written to the stream in kilobytes (KB), rounded up to the nearest 1 KB, the data size (average_data_size_in_KB).
The number of data records written to and read from the stream per second (records_per_second).
The number of Amazon Kinesis Streams applications that consume data concurrently and independently from the stream, that is, the consumers (number_of_consumers).
The incoming write bandwidth in KB (incoming_write_bandwidth_in_KB), which is equal to the average_data_size_in_KB multiplied by the records_per_second.
The outgoing read bandwidth in KB (outgoing_read_bandwidth_in_KB), which is equal to the incoming_write_bandwidth_in_KB multiplied by the number_of_consumers.
number_of_shards = max(incoming_write_bandwidth_in_KB/1000, outgoing_read_bandwidth_in_KB/2000)
You can dynamically resize your stream or add and remove shards after you create the stream and while there is an Amazon Kinesis Streams application consuming data from the stream.
You can increase the retention period up to 168 hours (7 days) using the IncreaseStreamRetentionPeriod operation, and decrease the retention period down to a minimum of 24 hours using the DecreaseStreamRetentionPeriod
AT_SEQUENCE_NUMBER - Start reading from the position denoted by a specific sequence number, provided in the value StartingSequenceNumber.
AFTER_SEQUENCE_NUMBER - Start reading right after the position denoted by a specific sequence number, provided in the value StartingSequenceNumber.
AT_TIMESTAMP - Start reading from the position denoted by a specific timestamp, provided in the value Timestamp.
TRIM_HORIZON - Start reading at the last untrimmed record in the shard in the system, which is the oldest data record in the shard.
LATEST - Start reading just after the most recent record in the shard, so that you always read the most recent data in the shard.
TEW
TEW Notes
•Write your Lambda function code in a stateless style, and ensure there is no affinity between your code and the underlying compute infrastructure.
•Instantiate AWS clients and database clients outside the scope of the handler to take advantage of connection re-use.
•Make sure you have set +rx permissions on your files in the uploaded ZIP to ensure Lambda can execute code on your behalf.
•Lower costs and improve performance by minimizing the use of startup code not directly related to processing the current event.
•Use the built-in CloudWatch monitoring of your Lambda functions to view and optimize request latencies.
•Delete old Lambda functions that you are no longer using.
TEW: Always FIFO Trim Horizon (start of stream) latest on reading current data.
TEW: IteratorAge is strictly for stream-based invocations and works by determining the age of the last record for each batch of processed records. Age is quantified by measuring the time elapsed between when Lambda received the batch in question and when the final record within that batch was written to the stream.
TEW : Note this was a pain to reproduce this architecture flow from pic from reference architecture