Project Skyfall
                        
                        
Matt Abrams (@abramsm)
Agenda

 A bit about AddThis
 Why did we need Skyfall?
 Architecture
 Operations/Performance
Introduction
Fun with Numbers
AddThis JavaScript loads > 3 Billion times per day

Edge Network (Skyfall) receives around 4B hits per
day

Either datacenter can handle 100% load (we test this
often) 

Currently using around 1K servers (will double next
year)
Data Center Porn
Why did we need Skyfall?
We couldn’t find anyone else to do it for us
    •  Previous vendors' log aggregation was delayed by a
       minimum of 3 hours and could take up to 5 days

Minimize impact on our publishers
    •    Combining log collection with remote services means we only
         need 1 event instead of n

Support near-real-time applications
Why did we call it Skyfall?
Skyfall Goals and Architecture
Skyfall Goals (Technical)
High availability

Handle server and DC failure gracefully

Low latency

Zero-downtime deployment and configuration

Use for internal and external logging needs

In-session RPC

O(1) reads and writes

Support data filtering at the edge

Smart clients
Why speed and robustness matter
Architecture
[Diagram: Web events reach Global Traffic Management, which sits in front of the Skyfall nodes in DC1 and DC2. A Repeater connects the two data centers, and each data center also contains Consumers and Services.]
1.    Messages are placed on concurrent non-blocking queue
      (CNBQ) to minimize latency impact on producer

2.    Messages are then popped from CNBQ and placed on a
      Disk-Backed queue (DBQ)

3.    DBQ is used to provide temporary storage in case Kafka is
      down or backed up

4.    Messages from DBQ are popped and sent to Kafka where
      they are persisted to file system
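
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the Skyfall implementation: `DiskBackedQueue`, `drain_to_disk`, `drain_to_kafka`, and the `send` callable are hypothetical names, `queue.SimpleQueue` stands in for the CNBQ, and a plain append-only file stands in for the DBQ.

```python
import json
import os
import queue
import tempfile


class DiskBackedQueue:
    """Toy disk-backed FIFO: appends JSON lines to a file and tracks a read offset."""

    def __init__(self, path):
        self.path = path
        self.read_pos = 0  # byte offset of the next unread record
        open(path, "ab").close()  # create the file if it does not exist

    def put(self, message):
        with open(self.path, "ab") as f:
            f.write(json.dumps(message).encode("utf-8") + b"\n")

    def peek(self):
        """Return (message, next_offset), or None when nothing is pending."""
        with open(self.path, "rb") as f:
            f.seek(self.read_pos)
            line = f.readline()
            if not line:
                return None
            return json.loads(line), f.tell()

    def commit(self, next_offset):
        self.read_pos = next_offset


def drain_to_disk(cnbq, dbq):
    """Step 2: move everything currently in the in-memory queue to disk."""
    moved = 0
    while True:
        try:
            message = cnbq.get_nowait()
        except queue.Empty:
            return moved
        dbq.put(message)
        moved += 1


def drain_to_kafka(dbq, send):
    """Step 4: forward pending messages; on failure they stay on disk (step 3)."""
    sent = 0
    while True:
        item = dbq.peek()
        if item is None:
            return sent
        message, next_offset = item
        try:
            send(message)
        except ConnectionError:
            return sent  # downstream is down or backed up, retry later
        dbq.commit(next_offset)
        sent += 1
```

The producer only ever touches the in-memory queue (step 1), so a slow disk or an unavailable broker does not add latency to the request path in this sketch. The read offset is committed only after `send` returns, so a failed send leaves the message on disk for the next attempt.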
Kafka
Kafka treats persistence as a first-class citizen

Focus is on high throughput vs lots of bells and whistles

State about what has been consumed is maintained in the
client rather than the server

Kafka is explicitly distributed

Supports O(1) reads and writes

Pull rather than push


           http://incubator.apache.org/kafka/design.html
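
To make "state is maintained in the client" and "pull rather than push" concrete, here is a toy in-memory model. It is not Kafka's API: `AppendOnlyLog` and `PullConsumer` are hypothetical names, and a Python list stands in for the persisted log.

```python
class AppendOnlyLog:
    """Toy in-memory stand-in for a persisted, append-only topic log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append at the tail and return the offset assigned to the record."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return records starting at `offset`; the log keeps no per-consumer state."""
        return self._records[offset:offset + max_records]


class PullConsumer:
    """Consumer that owns its position and asks for data when it is ready."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # consumption state lives here, in the client

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because the log tracks nothing about its readers, adding a consumer costs the server nothing, and a consumer can replay data simply by resetting its own `offset`. Appending at the tail and locating a record by offset do not depend on how large the log has grown in this model.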
Circuit Breaker for remote Services
Pattern is used to detect failures and encapsulates the logic of
preventing a failure from recurring constantly[1]


If a service instance throws an error, times out, or responds
with a failure message, an error event is marked

If the error rate threshold is exceeded, that service instance is
removed from the pool of available services

Before re-adding a service to the pool, a test request is made
and validated

Internal service failures should not be reflected in the response to
the message originator

          [1] - http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
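
A minimal sketch of the behavior described above, under stated assumptions: `ServiceBreaker`, `call_with_breaker`, the window size, and the threshold value are all hypothetical, and any raised exception (including a timeout raised by the caller's client) counts as an error event. A failure message inside an otherwise successful response is not modeled here.

```python
class ServiceBreaker:
    """Tracks recent call outcomes for one service instance."""

    def __init__(self, error_rate_threshold=0.5, window=10):
        self.error_rate_threshold = error_rate_threshold
        self.window = window
        self.outcomes = []  # True = success, False = error
        self.open = False   # open means removed from the pool

    def record(self, ok):
        # Keep only the most recent `window` outcomes.
        self.outcomes = (self.outcomes + [ok])[-self.window:]
        errors = self.outcomes.count(False)
        full = len(self.outcomes) == self.window
        if full and errors / self.window > self.error_rate_threshold:
            self.open = True

    def try_readmit(self, probe):
        """Send a test request; re-add the instance only if it validates."""
        if self.open and probe():
            self.open = False
            self.outcomes = []
        return not self.open


def call_with_breaker(breaker, service, request, fallback=None):
    """Call the service unless its breaker is open; never surface internal failures."""
    if breaker.open:
        return fallback
    try:
        response = service(request)
    except Exception:
        breaker.record(False)
        return fallback
    breaker.record(True)
    return response
```

Returning `fallback` rather than re-raising is what keeps an internal service failure out of the response to the message originator. A production breaker would usually also schedule the probe on a timer instead of relying on an explicit `try_readmit` call.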
What does a call to our endpoint look like?

"GET /live/t00/250lo.gif&foo=bar" 200 37 "http://s7.addthis.com/static/r07/sh103.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"

Topic: live
Version: t00
Resource: 250lo.gif
URL Parameters: foo=bar
Status Code: 200
Bytes Transferred: 37
CDN Resource: http://s7.addthis.com/static/r07/sh103.html
User Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

The endpoint also receives header and cookie information not shown here.
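
The sample request line can be split into its labeled parts with a short parser. This is a sketch under assumptions: the pairing of labels to fields (topic, version, resource, and so on) is read off the slide's label order, and `LINE_RE`, `parse_event`, and the output key names are hypothetical.

```python
import re
from urllib.parse import parse_qsl

SAMPLE = (
    '"GET /live/t00/250lo.gif&foo=bar" 200 37 '
    '"http://s7.addthis.com/static/r07/sh103.html" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"'
)

LINE_RE = re.compile(
    r'^"(?P<method>\S+) (?P<target>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def parse_event(line):
    """Split one request line into the fields labeled on the slide."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized log line")
    # The sample separates parameters from the path with "&" rather than "?".
    path, _, query = match["target"].partition("&")
    topic, version, resource = path.strip("/").split("/", 2)
    return {
        "topic": topic,
        "version": version,
        "resource": resource,
        "params": dict(parse_qsl(query)),
        "status": int(match["status"]),
        "bytes": int(match["size"]),
        "referer": match["referer"],
        "user_agent": match["user_agent"],
    }
```

Calling `parse_event(SAMPLE)` yields a dictionary whose `topic` is `live` and whose `params` hold `foo=bar`, which is the shape a downstream consumer would need in order to route the event by topic.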
Zero Downtime Deployment and Configuration

[Diagram: two server groups, Group 1 and Group 2, each drawn as servers S1 through S5 with the labels 4, 8, and 16 above them.]
Endpoint Configuration




Each endpoint maps to a ‘topic’

Header elements may be extracted from the HTTP request

Parameters may be mapped to new key names

Variables may be extracted from the URL path
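
One way to picture these four rules is a declarative endpoint definition plus a function that applies it. Everything here is illustrative: the `ENDPOINT` keys, the `build_message` function, and the `foo` to `campaign` renaming are assumptions, not Skyfall's actual configuration format.

```python
import re

# Hypothetical configuration for one endpoint; every key name is illustrative.
ENDPOINT = {
    "topic": "live",
    # Named groups become variables extracted from the URL path.
    "path_pattern": r"^/live/(?P<version>[^/]+)/(?P<resource>[^/]+)$",
    # Header elements to copy from the HTTP request.
    "headers": ["User-Agent", "Referer"],
    # Parameters to rename before the message is logged.
    "param_names": {"foo": "campaign"},
}


def build_message(endpoint, path, params, headers):
    """Apply an endpoint definition to one request; None if the path does not match."""
    match = re.match(endpoint["path_pattern"], path)
    if match is None:
        return None
    message = {"topic": endpoint["topic"]}
    message.update(match.groupdict())
    for name in endpoint["headers"]:
        if name in headers:
            message[name] = headers[name]
    for key, value in params.items():
        message[endpoint["param_names"].get(key, key)] = value
    return message
```

Keeping the rules in data rather than code is one plausible way to change endpoint behavior without redeploying the servers, which fits the zero-downtime configuration goal stated earlier.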
Data Center Repeater

DC Repeater nodes automatically negotiate peering relationships with nodes in the other data center

If a peer node becomes unreachable, the local node will select a new peer

These are special consumers of the Kafka log data created by the local node

[Diagram: repeater nodes labeled N1, N2, and N3.]
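
The peer failover behavior can be sketched as below. The slides do not describe the negotiation protocol, so this sketch assumes the simplest possible rule, "use the first reachable remote node"; `RepeaterNode`, `ensure_peer`, `forward`, and the node names are hypothetical.

```python
class RepeaterNode:
    """Local repeater that relays records to one peer in the other data center."""

    def __init__(self, name, remote_peers, is_reachable):
        self.name = name
        self.remote_peers = list(remote_peers)
        self.is_reachable = is_reachable  # callable: peer name -> bool
        self.peer = None

    def ensure_peer(self):
        """Keep the current peer if it is reachable, otherwise pick another."""
        if self.peer is not None and self.is_reachable(self.peer):
            return self.peer
        self.peer = next(
            (p for p in self.remote_peers if self.is_reachable(p)), None
        )
        return self.peer

    def forward(self, records, send):
        """Relay locally produced records to the current peer; returns count sent."""
        peer = self.ensure_peer()
        if peer is None:
            return 0
        for record in records:
            send(peer, record)
        return len(records)
```

In this model the records passed to `forward` would come from consuming the local node's Kafka log, so nothing is lost while no peer is reachable: the repeater simply resumes from its last consumed position once a peer is available again.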
Skyfall Operations
Requests per second (VA Data Center)
TCP - When do you say goodbye?




      http://upload.wikimedia.org/wikipedia/commons/a/a2/Tcp_state_diagram_fixed.svg
Connection Tracking – what you need to know
Connection information is maintained in memory

The message “ip_conntrack: table full, dropping packet” is
BAD

Chrome – doesn’t close connection on FIN

This means that the connection info remains open until it
times out, drastically increasing the number of connections
your server needs to track

You need some mechanism for timing out the connection in a
reasonable time period
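
A rough way to see why the timeout matters: in steady state, the number of tracked entries is approximately the arrival rate multiplied by how long each entry lives. The sketch below applies that to the 4B hits per day quoted earlier. It assumes one connection per hit and evenly spread traffic, and the 30-second and 600-second lifetimes are illustrative values, not measured ones.

```python
SECONDS_PER_DAY = 86_400


def tracked_connections(hits_per_day, seconds_tracked):
    """Steady-state estimate: arrival rate multiplied by time each entry lives."""
    per_second = hits_per_day / SECONDS_PER_DAY
    return per_second * seconds_tracked


# Illustrative comparison of a short and a long tracking lifetime.
short_lived = tracked_connections(4_000_000_000, 30)
long_lived = tracked_connections(4_000_000_000, 600)
```

Under these assumptions the table size scales linearly with the lifetime, so a connection that lingers twenty times longer needs twenty times as many entries across the fleet.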
HA Proxy
We use a simple round-robin load balancing algorithm with a
liveness check

Default connection timeouts are way too high. Reasonable
values are used to prevent excessive connection tracking

“http-close” and “http-server-close” are enabled to ensure low
latency for clients and fast session reuse for the server

HA Proxy is our solution of choice for our LB needs. We prefer
software solutions on commodity hardware vs expensive
custom LB appliances

They could use a new logo
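
For readers unfamiliar with the algorithm named on this slide, round-robin selection with a liveness check can be sketched in a few lines. This illustrates the idea only and is not how HA Proxy implements it; `RoundRobinBalancer`, `pick`, and the `is_alive` callable are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Rotate through servers in order, skipping any that fail the liveness check."""

    def __init__(self, servers, is_alive):
        self.servers = list(servers)
        self.is_alive = is_alive  # callable: server -> bool
        self._cycle = itertools.cycle(self.servers)

    def pick(self):
        """Return the next live server in rotation, or None if all are down."""
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_alive(server):
                return server
        return None
```

Each call to `pick` inspects at most one full rotation, so a completely dead pool returns `None` instead of looping forever.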
