SlideShare uma empresa Scribd logo
1 de 36
Baixar para ler offline
Unified Data Access with Gimel
Vladimir Bacvanski
Anisha Nainani
Deepak Chandramouli
About us
Vladimir Bacvanski
vbacvanski@paypal.com
Twitter: @OnSoftware
• Principal Architect, Strategic Architecture at PayPal
• In previous life: CTO of a development and
consulting firm
• PhD in Computer Science from RWTH Aachen,
Germany
• O’Reilly author: Courses on Big Data, Kafka
Deepak Chandramouli
dmohanakumarchan@paypal.com
LinkedIn: @deepakmc
• MT2 Software Engineer, Data Platform Services at
PayPal
• Data Enthusiast
• Tech lead
• Gimel (Big Data Framework for Apache Spark)
• Unified Data Catalog – PayPal’s Enterprise Data
Catalog
Anisha Nainani
annainani@paypal.com
LinkedIn: @anishanainani
• Senior Software Engineer
• Big Data
• Data Platform Services
AGENDA
❑ PayPal - Introduction
❑ Why Gimel
❑ Gimel Deep Dive
❑ What’s next?
❑ Questions
PayPal – Key Metrics and Analytics Ecosystem
4
PayPal | Q3-2020 | Key Metrics
5https://investor.pypl.com/home/default.aspx
PayPal | Data Growth
6
160+ PB Data200,000+
YARN jobs/day
One of the largest
Aerospike,
Teradata,
Hortonworks
and Oracle
installations
Compute
supported:
Spark, Hive,
MR, BigQuery
20+ On-Premise
clusters
GPU co-located with
Hadoop
Cloud Migration
Adjacencies
7
Developer Data scientist Analyst Operator
Gimel
SDK
Notebooks
UDC Data API
Infrastructure services leveraged for elasticity and redundancy
Multi-DC Public cloudPredictive resource allocation
Logging
Monitoring
Alerting
Security
Application
Lifecycle
Management
Compute
Frameworkand
APIs
GimelData
Platform
User
Experience
andAccess
R Studio BI tools
PayPal | Data Landscape
Why Gimel?
9
Challenges | Data Access Code | Cumbersome & Fragile
Spark Read From Hbase Spark Read From Elastic
Search
Spark Read From AeroSpike Spark Read From Druid
Illustration Purpose
Not Meant to Read
Spark Read From Hbase
10
Challenges | Data Processing Can Be Multi-Mode & Polyglot
Batch
11
Challenges with Data App Lifecycle
Learn Code Optimize Build Deploy RunOnboarding Big Data Apps
Learn Code Optimize Build Deploy RunCompute Version Upgraded
Learn Code Optimize Build Deploy RunStorage API Changed
Learn Code Optimize Build Deploy RunStorage Connector Upgraded
Learn Code Optimize Build Deploy RunStorage Hosts Migrated
Learn Code Optimize Build Deploy RunStorage Changed
Learn Code Optimize Build Deploy Run*********************
12
Gimel Simplifies Data Application Lifecycle
Data Application Lifecycle - With Data API
Learn Code Optimize Build Deploy RunOnboarding Big Data Apps
Compute Version Upgraded
Storage API Changed
Storage Connector Upgraded
Storage Hosts Migrated
Storage Changed
*********************
Run
Run
Run
Run
Run
Run
13
Challenges | Instrumentation Required at multiple touchpoints
Catalog /
Classification
Platform Centric
Interceptors
id name address
1 XXXX XXXX
2 XXXX XXXX
Visibility
Security
Data User /
App
Data Stores
14
Putting it all together…
id First_name Last_name address
1 XXXX XXXX XXXX
2 XXXX XXXX XXXX
3 XXXX XXXX XXXX
Data
User
Data App
Data Stores
Catalog /
Classification
Alert
Platform Centric
InterceptorsSecurity
Data / SQL API
App App App
App
…
….
15
Query Routing – Concept
15
Spark / Gimel
Application
Notebooks
Developer/Analyst/Data
Scientist
User / App needs transaction data
• NRT (Streaming)
• 7 days (Analytics Cache)
• 2 Years (cold storage)
1. Submits query to
GSQL Kernel
2. Submits
query to GTS Where txn_dt =
last_7_days
Fast Access Via Cache
A
P
P
• Gimel looks at logical dataset
in UDC
• Interpret filter criteria and
route query to appropriate
storage
Where txn_dt =
last_30_mins
Where txn_dt =
last_2_years
16
Query Routing – Concept
16
• Alluxio Caching – fast access to remote data
• Enterprise Catalog Service – provides config for various stores
• Logical Dataset in catalog - abstracts multiple target stores underneath
• Query pattern (filter) based routing – provides ability to serve data dynamically
Code
Code Base
Docs
http://gimel.io
Code
Gimel_Notebook
Github
https://github.com/paypal/gimel
Gitter
https://gitter.im/paypal/gimel_data_api_community
Gimel – Deep Dive
Peek into implementation
19
Unified Data API & SQL Abstraction
20
With Data APISpark Read From Hbase
Spark Read From Elastic
Search
With SQL
Unified Data API & Unified Config
Unified Data API
Unified Connector Config
Set gimel.catalog.provider=UDC
CatalogProvider.getDataSetProperties(“dataSetName”)
Metadata
Services
Set gimel.catalog.provider=USER
CatalogProvider.getDataSetProperties(“dataSetName”)
Set gimel.catalog.provider=HIVE
CatalogProvider.getDataSetProperties(“dataSetName”)
sql> set dataSetProperties={
"key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
"auto.offset.reset":"earliest",
"gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
"gimel.storage.type":"kafka",
"gimel.kafka.whitelist.topics":"kafka_topic",
"datasetName":"test_table1",
"value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer
",
"value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
"gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
"gimel.kafka.avro.schema.source":"CSR",
"gimel.kafka.zookeeper.connection.timeout.ms":"10000",
"gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
"key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
"gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
"gimel.kafka.bootstrap.servers":"localhost:9092"
}
sql> Select * from pcatalog.test_table1.
spark.sql("set gimel.catalog.provider=USER");
val dataSetOptions = DataSetProperties(
"KAFKA",
Array(Field("payload","string",true)) ,
Array(),
Map(
"datasetName" -> "test_table1",
"auto.offset.reset"-> "earliest",
"gimel.kafka.bootstrap.servers"-> "localhost:9092",
"gimel.kafka.avro.schema.source"-> "CSR",
"gimel.kafka.avro.schema.source.url"-> "http://schema_registry:8081",
"gimel.kafka.avro.schema.source.wrapper.key"-> "schema_registry_key",
"gimel.kafka.checkpoint.zookeeper.host"-> "zookeeper:2181",
"gimel.kafka.checkpoint.zookeeper.path"->
"/pcatalog/kafka_consumer/checkpoint",
"gimel.kafka.whitelist.topics"-> "kafka_topic",
"gimel.kafka.zookeeper.connection.timeout.ms"-> "10000",
"gimel.storage.type"-> "kafka",
"key.serializer"-> "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer"-> "org.apache.kafka.common.serialization.ByteArraySerializer"
)
)
dataSet.read(”test_table1",Map("dataSetProperties"->dataSetOptions))
CREATE EXTERNAL TABLE `pcatalog.test_table1`
(payload string)
LOCATION 'hdfs://tmp/'
TBLPROPERTIES (
"datasetName" -> "dummy",
"auto.offset.reset"-> "earliest",
"gimel.kafka.bootstrap.servers"-> "localhost:9092",
"gimel.kafka.avro.schema.source"-> "CSR",
"gimel.kafka.avro.schema.source.url"-> "http://schema_registry:8081",
"gimel.kafka.avro.schema.source.wrapper.key"-> "schema_registry_key",
"gimel.kafka.checkpoint.zookeeper.host"-> "zookeeper:2181",
"gimel.kafka.checkpoint.zookeeper.path"->
"/pcatalog/kafka_consumer/checkpoint",
"gimel.kafka.whitelist.topics"-> "kafka_topic",
"gimel.kafka.zookeeper.connection.timeout.ms"-> "10000",
"gimel.storage.type"-> "kafka",
"key.serializer"-> "org.apache.kafka.common.serialization.StringSerializer",
"value.serializer"->
"org.apache.kafka.common.serialization.ByteArraySerializer"
);
Spark-sql> Select * from pcatalog.test_table1
Scala>
dataSet.read(”test_table1",Map("dataSetProperties"->dataSetOptions))
Anatomy of Catalog Provider
Metadata
Set gimel.catalog.provider=YOUR_CATALOG
CatalogProvider.getDataSetProperties(“dataSetName”)
{
// Implement this !
}
gimel.dataset.factory {
KafkaDataSet
ElasticSearchDataSet
DruidDataSet
HiveDataSet
AerospikeDataSet
HbaseDataSet
CassandraDataSet
JDBCDataSet
}
Metadata
Services
dataSet.read(“dataSetName”,options)
dataSet.write(dataToWrite,”dataSetName”, options)
dataStream.read(“dataSetName”, options)
val storageDataSet = getFromFactory(type=“Hive”)
{
Core Connector Implementation, example – Kafka
Combination of Open Source Connector and
In-house implementations
Open source connector such as DataStax / SHC /
ES-Spark
}
Anatomy of API
gimel.datastream.factory {
KafkaDataStream
}
CatalogProvider.getDataSetProperties(“dataSetName”)
val storageDataStream = getFromStreamFactory(type=“kafka”)
kafkaDataSet.read(“dataSetName”,options)
hiveDataSet.write(dataToWrite,”dataSetName”, options)
storageDataStream.read(“dataSetName”, options)
dataSet.write(”pcatalog.HIVE_dataset”,readDf , options)
val dataSet : gimel.DataSet = DataSet(sparkSession)
val df1 = dataSet.read(“pcatalog.KAFKA_dataset”, options);
df1.createGlobalTempView(“tmp_abc123”)
Val resolvedSelectSQL =
selectSQL.replace(“pcatalog.KAFKA_dataset”,”tmp_abc123”)
Val readDf : DataFrame = sparkSession.sql(resolvedSelectSQL);
select kafka_ds.*,gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
from pcatalog.KAFKA_dataset kafka_ds
join default.geo_lkp lkp
on kafka_ds.zip = geo_lkp.zip
where geo_lkp.region = ‘MIDWEST’
%%gimel
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
-- Establish 10 concurrent connections per Topic-Partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at max - 10 M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10,000,000;
Next Steps
What’s Next?
• Expand Catalog Provider
• Google Data Catalog
• Cloud Support
• BigQuery
• PubSub
• GCS
• AWS Redshift
• Gimel SQL
• Expand to Cloud Stores
• Query / Access
Optimization
• Pre-empt runaway queries
• Graph Support
• Neo4j
• ML/NLP Support
• ML-Lib
• Spark-NLP
Questions?
Code base
http://gimel.io
Gitter
https://gitter.im/paypal/gimel_data_api_community
Thank You!
Appendix
Gimel Thrift Server @ PayPal
29
▪ HiveServer2
service that allows a remote client to submit requests to Hive using a variety of
programming languages (C++, Java, Python) and retrieve results BLOG
▪ Built on Apache Thrift Concepts
▪ Spark Thrift Server
Similar to HiveServer2, executes in spark Engine as compared to Hive (MR
/TEZ)
What is GTS?
• Gimel Thrift Server
Spark Thrift Server
+ Gimel
+ PayPal’s - Unified Data Catalog
+ Security & other PP specific features
Depending upon the cluster
capacity and traffic user has to
wait for the session
31
Why GTS?
31
Needs to read data from Hive
through SQL
PayPal
Notebooks
Developer/Analyst/Data
Scientist
2. Starts Spark
Session on
cluster
3. Spark session
Started
1. Get a Spark
Session
4. Submits the query
Select * from
pymtdba.wtransaction_p2
5. Reads from
Store
CLI
Host
APP
32
How does GTS Work?
32
Gimel Thrift
Server
Paypal
Notebooks
Developer/Analyst/Data
Scientist
Needs to read data from Hive
Select * from
pymtdba.wtransaction_p2
1. Submits query to
GSQL Kernal
2. Submits
query to GTS 3. Read from
Store
A
P
P
Connect via Java
JDBC / Python
Challenges in platform management
33
34
Challenges | Audit & Monitoring | Multifaceted
DBQLog
s
Audit Table
Cloud Audit Logs
***
Lack of Unified View of Data
Processed on Spark
PubSub
Use
r
35
Platform management Complexities Store Specific Interceptors
PubSub
Store
Operator
s
App
Developer
s
App
s
Instrumentation
By App Developer
GTS Key Features
Out-of-box Auditing:
Logging, Monitoring,
Dashboards
Alerting
(beta/internal)
Security
Apache Ranger
Teradata Proxy User
Part of Ecosystem
Notebooks – GSQL
UDC –Datasets
SCAAS – DML/DDL
Low Latency
User Experience
SQL to Any Store
Stores supported by
Gimel
Highly available
architecture
Software & Hardware
Query via REST
(work in progress)
REST
Query Guard
Kills run away
queries

Mais conteúdo relacionado

Mais procurados

Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMAlluxio, Inc.
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Databricks
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryDeep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryAlluxio, Inc.
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
 
Architecture at Scale
Architecture at ScaleArchitecture at Scale
Architecture at ScaleElasticsearch
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
Alluxio + Spark: Accelerating Auto Data Tagging in WeRide
Alluxio + Spark: Accelerating Auto Data Tagging in WeRideAlluxio + Spark: Accelerating Auto Data Tagging in WeRide
Alluxio + Spark: Accelerating Auto Data Tagging in WeRideAlluxio, Inc.
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the CloudDataWorks Summit
 
Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory Sergio Zenatti Filho
 
Logging, Metrics, and APM: The Operations Trifecta
Logging, Metrics, and APM: The Operations TrifectaLogging, Metrics, and APM: The Operations Trifecta
Logging, Metrics, and APM: The Operations TrifectaElasticsearch
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeTorsten Steinbach
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkAlluxio, Inc.
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBlueData, Inc.
 

Mais procurados (20)

Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
 
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
Encryption and Masking for Sensitive Apache Spark Analytics Addressing CCPA a...
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
Deep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration StoryDeep Learning in the Cloud at Scale: A Data Orchestration Story
Deep Learning in the Cloud at Scale: A Data Orchestration Story
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
 
Architecture at Scale
Architecture at ScaleArchitecture at Scale
Architecture at Scale
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Alluxio + Spark: Accelerating Auto Data Tagging in WeRide
Alluxio + Spark: Accelerating Auto Data Tagging in WeRideAlluxio + Spark: Accelerating Auto Data Tagging in WeRide
Alluxio + Spark: Accelerating Auto Data Tagging in WeRide
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory Unleash the power of Azure Data Factory
Unleash the power of Azure Data Factory
 
Logging, Metrics, and APM: The Operations Trifecta
Logging, Metrics, and APM: The Operations TrifectaLogging, Metrics, and APM: The Operations Trifecta
Logging, Metrics, and APM: The Operations Trifecta
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
IBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lakeIBM Cloud Day January 2021 - A well architected data lake
IBM Cloud Day January 2021 - A well architected data lake
 
Snowflake Datawarehouse Architecturing
Snowflake Datawarehouse ArchitecturingSnowflake Datawarehouse Architecturing
Snowflake Datawarehouse Architecturing
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 Telco
 

Semelhante a Unified Data Access with Gimel

QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformDeepak Chandramouli
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...Deepak Chandramouli
 
Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Romit Mehta
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDeepak Chandramouli
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingMark Kerzner
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentCA | Automic Software
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in RealtimeDataWorks Summit
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?confluent
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshSion Smith
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoRomit Mehta
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...James Anderson
 
Neo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform OverviewNeo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform OverviewNeo4j
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsAnant Corporation
 

Semelhante a Unified Data Access with Gimel (20)

Scale By The Bay | 2020 | Gimel
Scale By The Bay | 2020 | GimelScale By The Bay | 2020 | Gimel
Scale By The Bay | 2020 | Gimel
 
QCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic PlatformQCon 2018 | Gimel | PayPal's Analytic Platform
QCon 2018 | Gimel | PayPal's Analytic Platform
 
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
PayPal datalake journey | teradata - edge of next | san diego | 2017 october ...
 
Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018Gimel at Dataworks Summit San Jose 2018
Gimel at Dataworks Summit San Jose 2018
 
Dataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platformDataworks | 2018-06-20 | Gimel data platform
Dataworks | 2018-06-20 | Gimel data platform
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Automating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop AgentAutomating Big Data with the Automic Hadoop Agent
Automating Big Data with the Automic Hadoop Agent
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit OrlandoGimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
Gimel and PayPal Notebooks @ TDWI Leadership Summit Orlando
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
 
Neo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform OverviewNeo4j Database and Graph Platform Overview
Neo4j Database and Graph Platform Overview
 
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data PlatformsData Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms
 

Mais de Alluxio, Inc.

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioAlluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingAlluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionAlluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeAlluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudAlluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderAlluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionAlluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...Alluxio, Inc.
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAlluxio, Inc.
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...Alluxio, Inc.
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...Alluxio, Inc.
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAlluxio, Inc.
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAlluxio, Inc.
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio, Inc.
 

Mais de Alluxio, Inc. (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
AI Infra Day | Hands-on Lab: CV Model Training with PyTorch & Alluxio on Kube...
 
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...AI Infra Day | The Generative AI Market  And Intel AI Strategy and Product Up...
AI Infra Day | The Generative AI Market And Intel AI Strategy and Product Up...
 
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ MetaAI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
AI Infra Day | Composable PyTorch Distributed with PT2 @ Meta
 
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber ScaleAI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
AI Infra Day | Model Lifecycle Management Quality Assurance at Uber Scale
 
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWSAlluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
Alluxio Monthly Webinar | Efficient Data Loading for Model Training on AWS
 

Último

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 

Último (20)

Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 

Unified Data Access with Gimel

  • 1. Unified Data Access with Gimel Vladimir Bacvanski Anisha Nainani Deepak Chandramouli
  • 2. About us Vladimir Bacvanski vbacvanski@paypal.com Twitter: @OnSoftware • Principal Architect, Strategic Architecture at PayPal • In previous life: CTO of a development and consulting firm • PhD in Computer Science from RWTH Aachen, Germany • O’Reilly author: Courses on Big Data, Kafka Deepak Chandramouli dmohanakumarchan@paypal.com LinkedIn: @deepakmc • MT2 Software Engineer, Data Platform Services at PayPal • Data Enthusiast • Tech lead • Gimel (Big Data Framework for Apache Spark) • Unified Data Catalog – PayPal’s Enterprise Data Catalog Anisha Nainani annainani@paypal.com LinkedIn: @anishanainani • Senior Software Engineer • Big Data • Data Platform Services
  • 3. AGENDA ❑ PayPal - Introduction ❑ Why Gimel ❑ Gimel Deep Dive ❑ What’s next? ❑ Questions
  • 4. PayPal – Key Metrics and Analytics Ecosystem 4
  • 5. PayPal | Q3-2020 | Key Metrics 5https://investor.pypl.com/home/default.aspx
  • 6. PayPal | Data Growth 6 160+ PB Data200,000+ YARN jobs/day One of the largest Aerospike, Teradata, Hortonworks and Oracle installations Compute supported: Spark, Hive, MR, BigQuery 20+ On-Premise clusters GPU co-located with Hadoop Cloud Migration Adjacencies
  • 7. 7 Developer Data scientist Analyst Operator Gimel SDK Notebooks UDC Data API Infrastructure services leveraged for elasticity and redundancy Multi-DC Public cloudPredictive resource allocation Logging Monitoring Alerting Security Application Lifecycle Management Compute Frameworkand APIs GimelData Platform User Experience andAccess R Studio BI tools PayPal | Data Landscape
  • 9. 9 Challenges | Data Access Code | Cumbersome & Fragile Spark Read From Hbase Spark Read From Elastic Search Spark Read From AeroSpike Spark Read From Druid Illustration Purpose Not Meant to Read Spark Read From Hbase
  • 10. 10 Challenges | Data Processing Can Be Multi-Mode & Polyglot Batch
  • 11. 11 Challenges with Data App Lifecycle Learn Code Optimize Build Deploy RunOnboarding Big Data Apps Learn Code Optimize Build Deploy RunCompute Version Upgraded Learn Code Optimize Build Deploy RunStorage API Changed Learn Code Optimize Build Deploy RunStorage Connector Upgraded Learn Code Optimize Build Deploy RunStorage Hosts Migrated Learn Code Optimize Build Deploy RunStorage Changed Learn Code Optimize Build Deploy Run*********************
  • 12. 12 Gimel Simplifies Data Application Lifecycle Data Application Lifecycle - With Data API Learn Code Optimize Build Deploy RunOnboarding Big Data Apps Compute Version Upgraded Storage API Changed Storage Connector Upgraded Storage Hosts Migrated Storage Changed ********************* Run Run Run Run Run Run
  • 13. 13 Challenges | Instrumentation Required at multiple touchpoints Catalog / Classification Platform Centric Interceptors id name address 1 XXXX XXXX 2 XXXX XXXX Visibility Security Data User / App Data Stores
  • 14. 14 Putting it all together… id First_name Last_name address 1 XXXX XXXX XXXX 2 XXXX XXXX XXXX 3 XXXX XXXX XXXX Data User Data App Data Stores Catalog / Classification Alert Platform Centric InterceptorsSecurity Data / SQL API App App App App … ….
  • 15. 15 Query Routing – Concept 15 Spark / Gimel Application Notebooks Developer/Analyst/Data Scientist User / App needs transaction data • NRT (Streaming) • 7 days (Analytics Cache) • 2 Years (cold storage) 1. Submits query to GSQL Kernel 2. Submits query to GTS Where txn_dt = last_7_days Fast Access Via Cache A P P • Gimel looks at logical dataset in UDC • Interpret filter criteria and route query to appropriate storage Where txn_dt = last_30_mins Where txn_dt = last_2_years
  • 16. 16 Query Routing – Concept 16 • Alluxio Caching – fast access to remote data • Enterprise Catalog Service – provides config for various stores • Logical Dataset in catalog - abstracts multiple target stores underneath • Query pattern (filter) based routing – provides ability to serve data dynamically
  • 17. Code
  • 19. Gimel – Deep Dive Peek into implementation 19
  • 20. Unified Data API & SQL Abstraction 20 With Data APISpark Read From Hbase Spark Read From Elastic Search With SQL
  • 21. Unified Data API & Unified Config Unified Data API Unified Connector Config
  • 22. Set gimel.catalog.provider=UDC CatalogProvider.getDataSetProperties(“dataSetName”) Metadata Services Set gimel.catalog.provider=USER CatalogProvider.getDataSetProperties(“dataSetName”) Set gimel.catalog.provider=HIVE CatalogProvider.getDataSetProperties(“dataSetName”) sql> set dataSetProperties={ "key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer", "auto.offset.reset":"earliest", "gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181", "gimel.storage.type":"kafka", "gimel.kafka.whitelist.topics":"kafka_topic", "datasetName":"test_table1", "value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer ", "value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer", "gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint", "gimel.kafka.avro.schema.source":"CSR", "gimel.kafka.zookeeper.connection.timeout.ms":"10000", "gimel.kafka.avro.schema.source.url":"http://schema_registry:8081", "key.serializer":"org.apache.kafka.common.serialization.StringSerializer", "gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key", "gimel.kafka.bootstrap.servers":"localhost:9092" } sql> Select * from pcatalog.test_table1. spark.sql("set gimel.catalog.provider=USER"); val dataSetOptions = DataSetProperties( "KAFKA", Array(Field("payload","string",true)) , Array(), Map( "datasetName" -> "test_table1", "auto.offset.reset"-> "earliest", "gimel.kafka.bootstrap.servers"-> "localhost:9092", "gimel.kafka.avro.schema.source"-> "CSR", "gimel.kafka.avro.schema.source.url"-> "http://schema_registry:8081", "gimel.kafka.avro.schema.source.wrapper.key"-> "schema_registry_key", "gimel.kafka.checkpoint.zookeeper.host"-> "zookeeper:2181", "gimel.kafka.checkpoint.zookeeper.path"-> "/pcatalog/kafka_consumer/checkpoint", "gimel.kafka.whitelist.topics"-> "kafka_topic", "gimel.kafka.zookeeper.connection.timeout.ms"-> "10000", "gimel.storage.type"-> "kafka", "key.serializer"-> "org.apache.kafka.common.serialization.StringSerializer", "value.serializer"-> "org.apache.kafka.common.serialization.ByteArraySerializer" ) ) dataSet.read(”test_table1",Map("dataSetProperties"->dataSetOptions)) CREATE EXTERNAL TABLE `pcatalog.test_table1` (payload string) LOCATION 'hdfs://tmp/' TBLPROPERTIES ( "datasetName" -> "dummy", "auto.offset.reset"-> "earliest", "gimel.kafka.bootstrap.servers"-> "localhost:9092", "gimel.kafka.avro.schema.source"-> "CSR", "gimel.kafka.avro.schema.source.url"-> "http://schema_registry:8081", "gimel.kafka.avro.schema.source.wrapper.key"-> "schema_registry_key", "gimel.kafka.checkpoint.zookeeper.host"-> "zookeeper:2181", "gimel.kafka.checkpoint.zookeeper.path"-> "/pcatalog/kafka_consumer/checkpoint", "gimel.kafka.whitelist.topics"-> "kafka_topic", "gimel.kafka.zookeeper.connection.timeout.ms"-> "10000", "gimel.storage.type"-> "kafka", "key.serializer"-> "org.apache.kafka.common.serialization.StringSerializer", "value.serializer"-> "org.apache.kafka.common.serialization.ByteArraySerializer" ); Spark-sql> Select * from pcatalog.test_table1 Scala> dataSet.read(”test_table1",Map("dataSetProperties"->dataSetOptions)) Anatomy of Catalog Provider Metadata Set gimel.catalog.provider=YOUR_CATALOG CatalogProvider.getDataSetProperties(“dataSetName”) { // Implement this ! }
  • 23. gimel.dataset.factory { KafkaDataSet ElasticSearchDataSet DruidDataSet HiveDataSet AerospikeDataSet HbaseDataSet CassandraDataSet JDBCDataSet } Metadata Services dataSet.read(“dataSetName”,options) dataSet.write(dataToWrite,”dataSetName”, options) dataStream.read(“dataSetName”, options) val storageDataSet = getFromFactory(type=“Hive”) { Core Connector Implementation, example – Kafka Combination of Open Source Connector and In-house implementations Open source connector such as DataStax / SHC / ES-Spark } Anatomy of API gimel.datastream.factory { KafkaDataStream } CatalogProvider.getDataSetProperties(“dataSetName”) val storageDataStream = getFromStreamFactory(type=“kafka”) kafkaDataSet.read(“dataSetName”,options) hiveDataSet.write(dataToWrite,”dataSetName”, options) storageDataStream.read(“dataSetName”, options) dataSet.write(”pcatalog.HIVE_dataset”,readDf , options) val dataSet : gimel.DataSet = DataSet(sparkSession) val df1 = dataSet.read(“pcatalog.KAFKA_dataset”, options); df1.createGlobalTempView(“tmp_abc123”) Val resolvedSelectSQL = selectSQL.replace(“pcatalog.KAFKA_dataset”,”tmp_abc123”) Val readDf : DataFrame = sparkSession.sql(resolvedSelectSQL); select kafka_ds.*,gimel_load_id ,substr(commit_timestamp,1,4) as yyyy ,substr(commit_timestamp,6,2) as mm ,substr(commit_timestamp,9,2) as dd ,substr(commit_timestamp,12,2) as hh from pcatalog.KAFKA_dataset kafka_ds join default.geo_lkp lkp on kafka_ds.zip = geo_lkp.zip where geo_lkp.region = ‘MIDWEST’ %%gimel insert into pcatalog.HIVE_dataset partition(yyyy,mm,dd,hh,mi) -- Establish 10 concurrent connections per Topic-Partition set gimel.kafka.throttle.batch.parallelsPerPartition=10; -- Fetch at max - 10 M messages from each partition set gimel.kafka.throttle.batch.maxRecordsPerPartition=10,000,000;
  • 25. What’s Next? • Expand Catalog Provider • Google Data Catalog • Cloud Support • BigQuery • PubSub • GCS • AWS Redshift • Gimel SQL • Expand to Cloud Stores • Query / Access Optimization • Pre-empt runaway queries • Graph Support • Neo4j • ML/NLP Support • ML-Lib • Spark-NLP
  • 29. Gimel Thrift Server @ PayPal 29
  • 30. ▪ HiveServer2 service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results BLOG ▪ Built on Apache Thrift Concepts ▪ Spark Thrift Server Similar to HiveServer2, executes in spark Engine as compared to Hive (MR /TEZ) What is GTS? • Gimel Thrift Server Spark Thrift Server + Gimel + PayPal’s - Unified Data Catalog + Security & other PP specific features
  • 31. Depending upon the cluster capacity and traffic user has to wait for the session 31 Why GTS? 31 Needs to read data from Hive through SQL PayPal Notebooks Developer/Analyst/Data Scientist 2. Starts Spark Session on cluster 3. Spark session Started 1. Get a Spark Session 4. Submits the query Select * from pymtdba.wtransaction_p2 5. Reads from Store CLI Host APP
  • 32. 32 How does GTS Work? 32 Gimel Thrift Server Paypal Notebooks Developer/Analyst/Data Scientist Needs to read data from Hive Select * from pymtdba.wtransaction_p2 1. Submits query to GSQL Kernal 2. Submits query to GTS 3. Read from Store A P P Connect via Java JDBC / Python
  • 33. Challenges in platform management 33
  • 34. 34 Challenges | Audit & Monitoring | Multifaceted DBQLog s Audit Table Cloud Audit Logs *** Lack of Unified View of Data Processed on Spark PubSub Use r
  • 35. 35 Platform management Complexities Store Specific Interceptors PubSub Store Operator s App Developer s App s Instrumentation By App Developer
  • 36. GTS Key Features Out-of-box Auditing: Logging, Monitoring, Dashboards Alerting (beta/internal) Security Apache Ranger Teradata Proxy User Part of Ecosystem Notebooks – GSQL UDC –Datasets SCAAS – DML/DDL Low Latency User Experience SQL to Any Store Stores supported by Gimel Highly available architecture Software & Hardware Query via REST (work in progress) REST Query Guard Kills run away queries