Improving Python and Spark
Performance and Interoperability
with Apache Arrow
Julien Le Dem
Principal Architect
Dremio
Li Jin
Software Engineer
Two Sigma Investments
© 2017 Dremio Corporation, Two Sigma Investments, LP
About Us
• Architect at @DremioHQ
• Formerly Tech Lead at Twitter on Data
Platforms
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Incubator,
Pig, Parquet
Julien Le Dem
@J_
Li Jin
@icexelloss
• Software Engineer at Two Sigma
Investments
• Building a python-based analytics
platform with PySpark
• Other open source projects:
– Flint: A Time Series Library on Spark
– Cook: A Fair Share Scheduler on
Mesos
Agenda
• Current state and limitations of PySpark UDFs
• Apache Arrow overview
• Improvements realized
• Future roadmap
Current state
and limitations
of PySpark UDFs
Why do we need User Defined Functions?
• Some computation is more easily expressed with Python than with Spark's
built-in functions.
• Examples:
– weighted mean
– weighted correlation
– exponential moving average
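The weighted mean, for instance, is a one-liner with NumPy but awkward to express with built-in functions alone. An illustrative sketch (not code from the slides):

```python
import numpy as np

def weighted_mean(values, weights):
    # Weighted mean: sum(v * w) / sum(w)
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((values * weights).sum() / weights.sum())

print(weighted_mean([1.0, 3.0], [3.0, 1.0]))  # (1*3 + 3*1) / 4 = 1.5
```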
What is a PySpark UDF?
• A PySpark UDF is a user-defined function executed in the
Python runtime.
• Two types:
– Row UDF:
• lambda x: x + 1
• lambda date1, date2: (date1 - date2).years
– Group UDF (subject of this presentation):
• lambda values: np.mean(np.array(values))
Row UDF
• Operates on a row by row basis
– Similar to `map` operator
• Example:
df.withColumn(
  'v2',
  udf(lambda x: x + 1, DoubleType())(df.v1)
)
• Performance:
– ~60x slower than built-in functions for a simple case
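The gap comes from invoking the Python function once per row, paying interpreter and boxing overhead each time. A toy pandas illustration of the same computation done per row versus in one vectorized call (hypothetical data, not from the slides):

```python
import pandas as pd

df = pd.DataFrame({"v1": [1.0, 2.0, 3.0]})

# Row-at-a-time: the UDF is a Python call per value
row_udf = lambda x: x + 1
v2_scalar = [row_udf(x) for x in df["v1"]]

# Vectorized: one call operating on the whole column at once
v2_vector = (df["v1"] + 1).tolist()

assert v2_scalar == v2_vector  # same result, very different cost at scale
```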
Group UDF
• UDF that operates on more than one row
– Similar to `groupBy` followed by `map` operator
• Example:
– Compute weighted mean by month
Group UDF
• Not supported out of the box:
– Needs boilerplate code to pack/unpack multiple rows into a nested row
• Poor performance
– Groups are materialized and then converted to Python data structures
Example: Data Normalization
(values - values.mean()) / values.std()
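In pandas terms the normalization is a short function over a Series; an illustrative sketch (not the deck's actual code):

```python
import pandas as pd

def normalize(values: pd.Series) -> pd.Series:
    # z-score: subtract the mean, divide by the sample standard deviation
    return (values - values.mean()) / values.std()

print(normalize(pd.Series([1.0, 2.0, 3.0])).tolist())  # [-1.0, 0.0, 1.0]
```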
Example: Monthly Data Normalization
[code screenshot; callout: useful bits]
Example: Monthly Data Normalization
[code screenshot; callouts: boilerplate]
Example: Monthly Data Normalization
• Poor performance: 16x slower than the baseline
groupBy().agg(collect_list())
Problems
• Packing / unpacking nested rows
• Inefficient data movement (Serialization / Deserialization)
• Scalar computation model: object boxing and interpreter overhead
Apache
Arrow
Arrow: An open source standard
• Common need for an in-memory columnar format
• Building on the success of Parquet.
• Top-level Apache project
• Standard from the start
– Developers from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Arrow goals
• Well-documented and cross language compatible
• Designed to take advantage of modern CPUs
• Embeddable
- In execution engines, storage layers, etc.
• Interoperable
High Performance Sharing & Interchange
Before:
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions

With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [
    '555-111-1111',
    '555-222-2222'
  ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]
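The same two records, shown row-oriented versus column-oriented in plain Python (an illustration of the layout idea, not Arrow's actual buffers):

```python
# Row-oriented: one record per object; reading a single field
# means touching every record.
rows = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]

# Column-oriented: one array per field; all values of a field
# sit together, which is what Arrow lays out as contiguous buffers.
columns = {
    "name": ["Joe", "Jack"],
    "age": [18, 37],
    "phones": [["555-111-1111", "555-222-2222"], ["555-333-3333"]],
}

assert [r["age"] for r in rows] == columns["age"]
```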
Record Batch Construction
[stream diagram: schema negotiation, then a dictionary batch, then a sequence of record batches]
For the record {name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222']},
a record batch holds a data header (describing offsets into the data) plus one
set of buffers per field:
- name: bitmap, offset, data
- age: bitmap, data
- phones: bitmap, list offset, offset, data
Each box (vector) is contiguous memory;
the entire record batch is contiguous on the wire.
In memory columnar format for speed
• Maximize CPU throughput
- Pipelining
- SIMD
- cache locality
• Scatter/gather I/O
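A small NumPy illustration of why a contiguous column vector is fast to process (hypothetical data; NumPy here stands in for any engine operating on Arrow-style buffers):

```python
import numpy as np

# A columnar value vector is one contiguous buffer, so a whole-column
# operation is a single vectorized call instead of a per-row loop.
age = np.array([18, 37, 25, 41], dtype=np.int32)
assert age.flags["C_CONTIGUOUS"]

doubled = age * 2  # SIMD-friendly, cache-local, pipelined
print(doubled.tolist())  # [36, 74, 50, 82]
```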
Results
- PySpark Integration:
53x speedup (IBM spark work on SPARK-13534)
http://s.apache.org/arrowresult1
- Streaming Arrow Performance
7.75GB/s data movement
http://s.apache.org/arrowresult2
- Arrow Parquet C++ Integration
4GB/s reads
http://s.apache.org/arrowresult3
- Pandas Integration
9.71GB/s
http://s.apache.org/arrowresult4
Arrow Releases

Version  Date               Changes  Days
0.1.0    October 10, 2016   178      237
0.2.0    February 18, 2017  195      131
0.3.0    May 5, 2017        311      76
0.4.0    May 22, 2017       85       17
Improvements
to PySpark
with Arrow
How PySpark UDF works
[diagram: batched rows are serialized from the Executor to a Python worker,
the UDF (scalar -> scalar) is applied row by row, and batched rows are
serialized back]
Current Issues with UDF
• Serialize / Deserialize in Python
• Scalar computation model (Python for loop)
Profile: lambda x: x + 1 (actual runtime is 2s without profiling)
[profiler screenshot: 8 Mb/s; 91.8%]
Vectorize Row UDF
[diagram: the Executor converts rows to Arrow record batches (Rows -> RB),
the Python worker applies the UDF (pd.DataFrame -> pd.DataFrame), and record
batches are converted back to rows (RB -> Rows)]
Why pandas.DataFrame
• Fast, feature-rich, widely used by Python users
• Already exists in PySpark (toPandas)
• Compatible with popular Python libraries:
- NumPy, StatsModels, SciPy, scikit-learn…
• Zero copy to/from Arrow
Scalar vs Vectorized UDF
20x speed up (actual runtime is 2s without profiling)
Scalar vs Vectorized UDF
Overhead removed
Scalar vs Vectorized UDF
Fewer system calls, faster I/O
Scalar vs Vectorized UDF
4.5x Speed Up
Support Group UDF
• Split-apply-combine:
- Break a problem into smaller pieces
- Operate on each piece independently
- Put all pieces back together
• Common pattern supported in SQL, Spark, Pandas, R …
Split-Apply-Combine (Current)
• Split: groupBy, window, …
• Apply: mean, stddev, collect_list, rank …
• Combine: Inherently done by Spark
Split-Apply-Combine (with Group UDF)
• Split: groupBy, window, …
• Apply: UDF
• Combine: Inherently done by Spark
Introduce groupBy().apply()
• UDF: pd.DataFrame -> pd.DataFrame
– Treat each group as a pandas DataFrame
– Apply UDF on each group
– Assemble as PySpark DataFrame
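The semantics mirror pandas' own groupby/apply. A pure-pandas sketch of per-month normalization (hypothetical data; illustrates the group UDF contract of pd.DataFrame -> pd.DataFrame, not the PySpark API itself):

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "v": [1.0, 3.0, 10.0, 30.0],
})

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Treat each group as an ordinary pandas DataFrame
    pdf = pdf.copy()
    v = pdf["v"]
    pdf["v"] = (v - v.mean()) / v.std()
    return pdf

# Split by month, apply the UDF per group, combine the results
out = df.groupby("month", group_keys=False).apply(normalize)
```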
Introduce groupBy().apply()
[diagram: groupBy splits rows into groups; the UDF
(pd.DataFrame -> pd.DataFrame) is applied to each group;
the results are assembled back into a DataFrame]
Previous Example: Data Normalization
(values - values.mean()) / values.std()
Previous Example: Data Normalization
5x speed up
[benchmark chart: current approach vs Group UDF]
Limitations
• Requires Spark Row <-> Arrow RecordBatch conversion
– Incompatible memory layout (row vs column)
• (groupBy) No local aggregation
– Difficult due to how PySpark works. See
https://issues.apache.org/jira/browse/SPARK-10915
Future
Roadmap
What’s Next (Arrow)
• Arrow RPC/REST
• Arrow IPC
• Apache {Spark, Drill, Kudu} to Arrow Integration
– Faster UDFs, Storage interfaces
What’s Next (PySpark UDF)
• Continue working on SPARK-20396
• Support Pandas UDF with more PySpark functions:
– groupBy().agg()
– window
Get Involved
• Watch SPARK-20396
• Join the Arrow community
– dev@arrow.apache.org
– Slack:
• https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– Follow @ApacheArrow
Thank you
• Bryan Cutler (IBM), Wes McKinney (Two Sigma Investments) for
helping build this feature
• Apache Arrow community
• Spark Summit organizers
• Two Sigma and Dremio for supporting this work
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Fact vs. Fiction: Autodetecting Hallucinations in LLMs
Fact vs. Fiction: Autodetecting Hallucinations in LLMsFact vs. Fiction: Autodetecting Hallucinations in LLMs
Fact vs. Fiction: Autodetecting Hallucinations in LLMsZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 

Último (20)

A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Fact vs. Fiction: Autodetecting Hallucinations in LLMs
Fact vs. Fiction: Autodetecting Hallucinations in LLMsFact vs. Fiction: Autodetecting Hallucinations in LLMs
Fact vs. Fiction: Autodetecting Hallucinations in LLMs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

• Not supported out of the box:
– Need boilerplate code to pack/unpack multiple rows into a nested row
• Poor performance
– Groups are materialized and then converted to Python data structures
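The pack/unpack boilerplate can be sketched in plain Python. This is a hedged illustration, not Spark code: the `packed` rows below mimic what groupBy().agg(collect_list()) hands to the UDF, and the names are invented for the example.

```python
import numpy as np

# Each row mimics (key, collected values), as produced by
# groupBy().agg(collect_list()) in the current approach
packed = [("2017-01", [1.0, 2.0, 3.0]), ("2017-02", [10.0, 30.0])]

def normalize(values):
    # The actual computation: standardize the group's values
    arr = np.array(values)
    return ((arr - arr.mean()) / arr.std(ddof=1)).tolist()

# Boilerplate: unpack each nested row, apply the function, repack
result = [(key, normalize(values)) for key, values in packed]
```

The useful logic is one line inside `normalize`; everything else is packing and unpacking.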
© 2017 Dremio Corporation, Two Sigma Investments, LP
Example: Data Normalization
(values - values.mean()) / values.std()
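In pandas this normalization is a one-liner over a column. A minimal sketch of the computation itself (not the PySpark code on the slides):

```python
import pandas as pd

def normalize(values: pd.Series) -> pd.Series:
    # Subtract the mean and divide by the (sample) standard deviation
    return (values - values.mean()) / values.std()

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
normalized = normalize(values)
# After normalization the column has mean 0 and standard deviation 1
```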
© 2017 Dremio Corporation, Two Sigma Investments, LP
Example: Data Normalization

© 2017 Dremio Corporation, Two Sigma Investments, LP
Example: Monthly Data Normalization
Useful bits

© 2017 Dremio Corporation, Two Sigma Investments, LP
Example: Monthly Data Normalization
Boilerplate
Boilerplate

© 2017 Dremio Corporation, Two Sigma Investments, LP
Example: Monthly Data Normalization
• Poor performance - 16x slower than baseline groupBy().agg(collect_list())

© 2017 Dremio Corporation, Two Sigma Investments, LP
Problems
• Packing / unpacking nested rows
• Inefficient data movement (serialization / deserialization)
• Scalar computation model: object boxing and interpreter overhead
© 2017 Dremio Corporation, Two Sigma Investments, LP
Arrow: An open source standard
• Common need for in-memory columnar
• Building on the success of Parquet
• Top-level Apache project
• Standard from the start
– Developers from 13+ major open source projects involved
• Benefits:
– Share the effort
– Create an ecosystem
(projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Arrow goals
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPUs
• Embeddable - in execution engines, storage layers, etc.
• Interoperable

© 2017 Dremio Corporation, Two Sigma Investments, LP
High Performance Sharing & Interchange
Before:
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g. a Parquet-to-Arrow reader)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Columnar data
persons = [{
  name: 'Joe',
  age: 18,
  phones: [ '555-111-1111', '555-222-2222' ]
}, {
  name: 'Jack',
  age: 37,
  phones: [ '555-333-3333' ]
}]

© 2017 Dremio Corporation, Two Sigma Investments, LP
Record Batch Construction
(diagram) Stream: Schema Negotiation, Dictionary Batch, Record Batch, Record Batch, Record Batch.
Each record batch: a data header (describes offsets into data), followed by the vectors:
name (bitmap), name (offset), name (data)
age (bitmap), age (data)
phones (bitmap), phones (list offset), phones (offset), phones (data)
Example record: { name: 'Joe', age: 18, phones: [ '555-111-1111', '555-222-2222' ] }
Each box (vector) is contiguous memory
The entire record batch is contiguous on wire
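The list-offset layout on this slide can be mimicked in plain Python. The names below (`phones_values`, `phones_offsets`, etc.) are illustrative, not Arrow API; they show how variable-length lists flatten into a values array plus an offsets array:

```python
# Row-oriented records, as on the "Columnar data" slide
persons = [
    {"name": "Joe", "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
    {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
]

# Columnar layout: one contiguous array per field. Variable-length
# lists become a flat values array plus an offsets array, Arrow-style.
name_data = [p["name"] for p in persons]
age_data = [p["age"] for p in persons]
phones_values = [ph for p in persons for ph in p["phones"]]
phones_offsets = [0]
for p in persons:
    phones_offsets.append(phones_offsets[-1] + len(p["phones"]))

# Row i's phones are phones_values[offsets[i]:offsets[i+1]]
joe_phones = phones_values[phones_offsets[0]:phones_offsets[1]]
```

Because every array is contiguous, a record batch can be shipped as-is, with the data header describing where each vector starts.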
© 2017 Dremio Corporation, Two Sigma Investments, LP
In-memory columnar format for speed
• Maximize CPU throughput
- Pipelining
- SIMD
- Cache locality
• Scatter/gather I/O

© 2017 Dremio Corporation, Two Sigma Investments, LP
Results
- PySpark Integration: 53x speedup (IBM Spark work on SPARK-13534) http://s.apache.org/arrowresult1
- Streaming Arrow Performance: 7.75 GB/s data movement http://s.apache.org/arrowresult2
- Arrow Parquet C++ Integration: 4 GB/s reads http://s.apache.org/arrowresult3
- Pandas Integration: 9.71 GB/s http://s.apache.org/arrowresult4

© 2017 Dremio Corporation, Two Sigma Investments, LP
Arrow Releases
0.1.0: October 10, 2016 (178 changes, 237 days)
0.2.0: February 18, 2017 (195 changes, 131 days)
0.3.0: May 5, 2017 (311 changes, 76 days)
0.4.0: May 22, 2017 (85 changes, 17 days)
© 2017 Dremio Corporation, Two Sigma Investments, LP
How PySpark UDF works
(diagram) The Executor sends Batched Rows to a Python Worker; the Python Worker evaluates the UDF (scalar -> scalar) and sends Batched Rows back.

© 2017 Dremio Corporation, Two Sigma Investments, LP
Current Issues with UDF
• Serialize / Deserialize in Python
• Scalar computation model (Python for loop)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Profile: lambda x: x+1
Actual runtime is 2s without profiling.
(profiler screenshot: 8 Mb/s, 91.8%)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Vectorize Row UDF
(diagram) The Executor converts Rows -> RecordBatch; the Python Worker evaluates the UDF (pd.DataFrame -> pd.DataFrame); results are converted RecordBatch -> Rows.
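The difference between the two evaluation models can be sketched with plain pandas. This illustrates the scalar-vs-vectorized distinction only; it is not the PySpark worker implementation:

```python
import numpy as np
import pandas as pd

v = pd.Series(np.arange(1_000_000, dtype="float64"))

def scalar_add_one(series: pd.Series) -> pd.Series:
    # Scalar model: a Python-level loop, boxing one float at a time
    return pd.Series([x + 1 for x in series])

def vectorized_add_one(series: pd.Series) -> pd.Series:
    # Vectorized model: a single NumPy operation over the whole column
    return series + 1
```

Both produce the same result; the vectorized version avoids the per-element interpreter and boxing overhead, which is where the speedups on the following slides come from.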
© 2017 Dremio Corporation, Two Sigma Investments, LP
Why pandas.DataFrame
• Fast, feature-rich, widely used by Python users
• Already exists in PySpark (toPandas)
• Compatible with popular Python libraries:
- NumPy, StatsModels, SciPy, scikit-learn…
• Zero copy to/from Arrow
© 2017 Dremio Corporation, Two Sigma Investments, LP
Scalar vs Vectorized UDF
20x Speed Up (actual runtime is 2s without profiling)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Scalar vs Vectorized UDF
Overhead Removed

© 2017 Dremio Corporation, Two Sigma Investments, LP
Scalar vs Vectorized UDF
Fewer System Calls, Faster I/O

© 2017 Dremio Corporation, Two Sigma Investments, LP
Scalar vs Vectorized UDF
4.5x Speed Up
© 2017 Dremio Corporation, Two Sigma Investments, LP
Support Group UDF
• Split-apply-combine:
- Break a problem into smaller pieces
- Operate on each piece independently
- Put all pieces back together
• Common pattern supported in SQL, Spark, Pandas, R …

© 2017 Dremio Corporation, Two Sigma Investments, LP
Split-Apply-Combine (Current)
• Split: groupBy, window, …
• Apply: mean, stddev, collect_list, rank …
• Combine: Inherently done by Spark

© 2017 Dremio Corporation, Two Sigma Investments, LP
Split-Apply-Combine (with Group UDF)
• Split: groupBy, window, …
• Apply: UDF
• Combine: Inherently done by Spark

© 2017 Dremio Corporation, Two Sigma Investments, LP
Introduce groupBy().apply()
• UDF: pd.DataFrame -> pd.DataFrame
– Treat each group as a pandas DataFrame
– Apply UDF on each group
– Assemble as PySpark DataFrame

© 2017 Dremio Corporation, Two Sigma Investments, LP
Introduce groupBy().apply()
(diagram) Rows are split by groupBy into Groups; each group is processed as pd.DataFrame -> pd.DataFrame.
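The semantics can be mimicked with pandas' own groupby: the per-group function has the same pd.DataFrame -> pd.DataFrame shape the slide describes. This is an illustration of the split-apply-combine contract, not the PySpark API:

```python
import pandas as pd

df = pd.DataFrame({
    "month": [1, 1, 1, 2, 2],
    "v": [1.0, 2.0, 3.0, 10.0, 30.0],
})

def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Treat each group as a pandas DataFrame and normalize within it
    v = pdf["v"]
    return pdf.assign(v=(v - v.mean()) / v.std())

# Split by month, apply the function per group, combine the pieces
result = df.groupby("month", group_keys=False).apply(normalize)
```

After the apply, each month's values are normalized independently: every group has mean 0.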
© 2017 Dremio Corporation, Two Sigma Investments, LP
Previous Example: Data Normalization
(values - values.mean()) / values.std()

© 2017 Dremio Corporation, Two Sigma Investments, LP
Previous Example: Data Normalization
5x Speed Up (Current vs. Group UDF)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Limitations
• Requires Spark Row <-> Arrow RecordBatch conversion
– Incompatible memory layout (row vs column)
• (groupBy) No local aggregation
– Difficult due to how PySpark works. See https://issues.apache.org/jira/browse/SPARK-10915

© 2017 Dremio Corporation, Two Sigma Investments, LP
What’s Next (Arrow)
• Arrow RPC/REST
• Arrow IPC
• Apache {Spark, Drill, Kudu} to Arrow Integration
– Faster UDFs, Storage interfaces

© 2017 Dremio Corporation, Two Sigma Investments, LP
What’s Next (PySpark UDF)
• Continue working on SPARK-20396
• Support Pandas UDF with more PySpark functions:
– groupBy().agg()
– window

© 2017 Dremio Corporation, Two Sigma Investments, LP
What’s Next (PySpark UDF)

© 2017 Dremio Corporation, Two Sigma Investments, LP
Get Involved
• Watch SPARK-20396
• Join the Arrow community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– Follow @ApacheArrow

© 2017 Dremio Corporation, Two Sigma Investments, LP
Thank you
• Bryan Cutler (IBM) and Wes McKinney (Two Sigma Investments) for helping build this feature
• Apache Arrow community
• Spark Summit organizers
• Two Sigma and Dremio for supporting this work

Editor's Notes

  1. We are a quantitative hedge fund based in New York City.
  2. Spark built-in functions cover basic math operators and functions, such as ‘mean’, ‘stddev’, ‘sum’.
  3. This is in comparison with built-in Spark functions such as mean and sum, where the Python code gets translated into Java code and executed in the JVM. Group-based UDFs don’t exist now. Not sure about calling lambda values: np.mean(np.array(values)) a “group-based UDF”.
  4. 60x is computed by: T1 = df.agg(F.sum(df.v1)); T2 = df.withColumn(‘v2’, when(df.v1 > 0, 1.0).otherwise(-1.0)).agg(F.sum(‘v2’)); T3 = df.withColumn(‘v3’, F.udf(lambda x: 1.0 if x > 0 else -1.0, DoubleType())(df.v1)).agg(F.sum(“v3”)); ratio = (T3 - T1) / (T2 - T1)
  5. Grouping here can be by some id, category, or time-based grouping, e.g. a daily average.
  6. Group materialization is one of the performance issues. However, it is not the one we focus on here. For instance, groupBy().agg(collect_list()) fully materializes the group. It takes about 1.5 seconds to do that, while the normalization example takes more than 25 seconds.
  7. Group materialization is one of the performance issues. However, it is not the one we focus on here. For instance, groupBy().agg(collect_list()) fully materializes the group. It takes about 1.5 seconds to do that, while the normalization example takes more than 25 seconds.
  8. (Complicated code; the audience should not read the code. The presenter contrasts the boilerplate code with the actual code. The point is to show the audience how difficult it is to do such things.)
  9. (Complicated code; the audience should not read the code. The presenter contrasts the boilerplate code with the actual code. The point is to show the audience how difficult it is to do such things.)
  10. (Complicated code; the audience should not read the code. The presenter contrasts the boilerplate code with the actual code. The point is to show the audience how difficult it is to do such things.)
  11. Scatter/gather I/O, also known as vectored I/O, is a method where a single procedure call reads data from a stream and writes it to multiple buffers.
  12. To illustrate further, here is the profiling result of the Python worker during one UDF evaluation. Without profiling, it took about 2 seconds to process 2M doubles, and most of the time is spent in serialization and UDF evaluation.
  13. So we made some changes to PySpark.
  14. Note: without profiling, 4 seconds -> 2 seconds. Let’s now compare scalar vs. vectorized UDF. For 2M doubles, the runtime of the Python worker goes from 2 seconds to about 0.14 seconds; that’s a 15x speed up.
  15. Note: without profiling, 4 seconds -> 2 seconds. Let’s now compare scalar vs. vectorized UDF. For 2M doubles, the runtime of the Python worker goes from 2 seconds to about 0.14 seconds; that’s a 15x speed up.
  16. Note: without profiling, 4 seconds -> 2 seconds. Let’s now compare scalar vs. vectorized UDF. For 2M doubles, the runtime of the Python worker goes from 2 seconds to about 0.14 seconds; that’s a 15x speed up.
  17. Group materialization is one of the performance issues. However, it is not the one we focus on here. For instance, groupBy().agg(collect_list()) fully materializes the group. It takes about 1.5 seconds to do that, while the normalization example takes more than 25 seconds.
  18. Make the before and after clear.
  19. Less time on this slide. RPC: generic way to exchange data. IPC: shared memory. Integration: