SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
Embulk - 進化するバルク

データローダ
Sadayuki Furuhashi

Founder & Software Architect
Embulk Meetup Tokyo #2
A little about me…
Sadayuki Furuhashi
github: @frsyuki
Fluentd - Unifid log collection infrastracture
Embulk - Plugin-based parallel ETL Founder & Software Architect
What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration easy.
> which was very painful…
Storage, RDBMS,
NoSQL, Cloud Service,
etc.
broken records,

transactions (idempotency),

performance, …
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to make the records cleaned
• Convert ”2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “N" → “”
• many cleanings…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it to cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most of scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files takes 1 month (!?)
A lot of integration efforts for each storages:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
The problems:
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicated loading?
> Performance optimization
> How to optimize the code or parallelize?
HDFS
MySQL
Amazon S3
Embulk
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming
Plugins Plugins
bulk load
Input Output
Embulk’s Plugin Architecture
Embulk Core
Executor Plugin
Filter Filter
Guess
Output
Embulk’s Plugin Architecture
Embulk Core
Executor Plugin
Filter Filter
GuessFileInput
Parser
Decoder
Guess
Embulk’s Plugin Architecture
Embulk Core
FileInput
Executor Plugin
Parser
Decoder
FileOutput
Formatter
Encoder
Filter Filter
Execution overview
Task
Transaction Task
Task
taskCount
{
taskIndex: 0,
task: {…}
}
{
taskIndex: 2,
task: {…}
}
runs on a single thread runs on multiple threads

(or machines)
Parallel execution
Task
Task
Task
Task
Threads
Task queue
run tasks in parallel
(embulk-executor-local-thread)
Distributed execution
Task
Task
Task
Task
Map tasks
Task queue
run tasks on Hadoop
(embulk-executor-mapreduce)
Distributed execution (w/ partitioning)
Task
Task
Task
Task
Map - Shuffle - Reduce
Task queue
run tasks on Hadoop
(embulk-executor-mapreduce)
Transaction control
fileInput.transaction {
parser.transaction {
filters.transaction {
formatter.transaction {
fileOutput.transaction {
executor.transaction {
…
}
}
}
}
}
}
file input plugin
parser plugin
filter plugins
formatter plugin
file output plugin
executor plugin
Task Task
Task configuration
fileInput.transaction { fileInputTask, taskCount →
parser.transaction { parserTask, schema →
filters.transaction { filterTasks, schema →
formatter.transaction { formatterTask →
fileOutput.transaction { fileOutputTask →
executor.transaction { →
task = {
fileInputTask,
parserTask,
filterTasks,
formatterTask,
fileOutputTask,
}
taskCount.times.inParallel { taskIndex → run(taskIndex, task)
taskCount is
decided by input
schema is decided
by input, and may be
modified by filters
Task execution
parser.run(fileInput, pageOutput)
fileInput.open() formatter.open(fileOutput)
fileOutput.open()
parser plugin
file input plugin filter plugins
file output plugin
formatter plugin …Task Task …
Type conversion
Embulk type systemInput type system Output type system
boolean
long
double
string
timestamp
boolean
integer
bigint
double precision
text
varchar
date
timestamp
timestamp with zone
…
(e.g. PostgreSQL)
boolean
integer
long
float
double
string
array
geo point
geo shape
… (e.g. Elasticsearch)
Input plugin

(parser plugin if input is file-based)
Output plugin

(formatter plugin if output is file-based)
What’s added since the first release?
• v0.3
• Resuming
• Filter plugin type
• v0.4
• Plugin template generator
• Incremental execution (ConfigDiff)
• Isolated ClassLoaders for Java plugins
• Polyglot command launcher
What’s added since the first release?
• v0.6
• Executor plugin type
• Liquid template engine
• v0.7
• EmbulkEmbed & Embulk::Runner
• Plugin bundle (embulk-mkbundle)
• JRuby 9000
• Gradle v2.6
Resuming
• Retries a failed transaction without retrying
everything.
• Skips successful tasks by using information stored in
a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
Filter plugin type
• Filtering rows out, filtering columns out, or enrich
the data. 18 plugins released.
Plugin template generator
• Generates template of a plugin.
• Generated code is already ready to compile.
> You modify & compile it to do your work.
• embulk new <category> <new>
Incremental execution
• Store last file name or row in a file, and next
execution starts from there.
• Usecase:

sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
Isolated ClassLoaders for Java plugins
• Embulk can load multiple versions of java plugins.
Plugin Version Conflicts
Embulk Core
Java Runtime
aws-sdk.jar v1.9
embulk-input-s3.jar
Version conflicts!
aws-sdk.jar v1.10
embulk-output-redshift.jar
Multiple Classloaders in JVM
Embulk Core
Java Runtime
aws-sdk.jar v1.9
embulk-input-s3.jar
Isolated
environments
aws-sdk.jar v1.10
embulk-output-redshift.jar
Class Loader 1
Class Loader 2
Polyglot launcher script
• embulk .jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc
Executor plugin type
• embulk-executor-mapreduce executes tasks on
distributed environment.
Liquid template engine
• A config file can include variables.
EmbulkEmbed & Embulk::Runner
• Embed embulk in an application.
Plugin bundle
• Uses fixed version of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml
Gradle v2.6
• Continous compiling.
• “embulk migrate .” upgrades gradle versio of your
plugin project.
• ./gradlew -t build
Future plan
• v0.8
• JSON type (issue #306)
• Error plugin type (#27, #124)
• More (or less) concurrency for output (#231)
• v0.9
• More Guess (#242, #235)
• Multiple jobs using a single config file (#167)

Mais conteúdo relacionado

Mais procurados

次世代Webコンテナ Undertowについて
次世代Webコンテナ Undertowについて次世代Webコンテナ Undertowについて
次世代Webコンテナ UndertowについてYoshimasa Tanabe
 
実環境にTerraform導入したら驚いた
実環境にTerraform導入したら驚いた実環境にTerraform導入したら驚いた
実環境にTerraform導入したら驚いたAkihiro Kuwano
 
スマートフォン向けサービスにおけるサーバサイド設計入門
スマートフォン向けサービスにおけるサーバサイド設計入門スマートフォン向けサービスにおけるサーバサイド設計入門
スマートフォン向けサービスにおけるサーバサイド設計入門Hisashi HATAKEYAMA
 
大規模分散システムの現在 -- Twitter
大規模分散システムの現在 -- Twitter大規模分散システムの現在 -- Twitter
大規模分散システムの現在 -- Twittermaruyama097
 
Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話
Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話
Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話JustSystems Corporation
 
Cassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sqlCassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sqlYutuki r
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Noritaka Sekiyama
 
A critique of ansi sql isolation levels 解説公開用
A critique of ansi sql isolation levels 解説公開用A critique of ansi sql isolation levels 解説公開用
A critique of ansi sql isolation levels 解説公開用Takashi Kambayashi
 
DPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキングDPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキングTomoya Hibi
 
Where狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキーWhere狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキーyoku0825
 
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところY Watanabe
 
PostgreSQLアンチパターン
PostgreSQLアンチパターンPostgreSQLアンチパターン
PostgreSQLアンチパターンSoudai Sone
 
SQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかSQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかShogo Wakayama
 
9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...
9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...
9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...NTT DATA Technology & Innovation
 
コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)
コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)
コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)NTT DATA Technology & Innovation
 
Dockerfile を書くためのベストプラクティス解説編
Dockerfile を書くためのベストプラクティス解説編Dockerfile を書くためのベストプラクティス解説編
Dockerfile を書くためのベストプラクティス解説編Masahito Zembutsu
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)NTT DATA OSS Professional Services
 
Akkaとは。アクターモデル とは。
Akkaとは。アクターモデル とは。Akkaとは。アクターモデル とは。
Akkaとは。アクターモデル とは。Kenjiro Kubota
 

Mais procurados (20)

次世代Webコンテナ Undertowについて
次世代Webコンテナ Undertowについて次世代Webコンテナ Undertowについて
次世代Webコンテナ Undertowについて
 
実環境にTerraform導入したら驚いた
実環境にTerraform導入したら驚いた実環境にTerraform導入したら驚いた
実環境にTerraform導入したら驚いた
 
スマートフォン向けサービスにおけるサーバサイド設計入門
スマートフォン向けサービスにおけるサーバサイド設計入門スマートフォン向けサービスにおけるサーバサイド設計入門
スマートフォン向けサービスにおけるサーバサイド設計入門
 
大規模分散システムの現在 -- Twitter
大規模分散システムの現在 -- Twitter大規模分散システムの現在 -- Twitter
大規模分散システムの現在 -- Twitter
 
Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話
Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話
Spring Boot の Web アプリケーションを Docker に載せて AWS ECS で動かしている話
 
Cassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sqlCassandraとh baseの比較して入門するno sql
Cassandraとh baseの比較して入門するno sql
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
 
A critique of ansi sql isolation levels 解説公開用
A critique of ansi sql isolation levels 解説公開用A critique of ansi sql isolation levels 解説公開用
A critique of ansi sql isolation levels 解説公開用
 
DPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキングDPDKによる高速コンテナネットワーキング
DPDKによる高速コンテナネットワーキング
 
Kafka Security
Kafka SecurityKafka Security
Kafka Security
 
Where狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキーWhere狙いのキー、order by狙いのキー
Where狙いのキー、order by狙いのキー
 
At least onceってぶっちゃけ問題の先送りだったよね #kafkajp
At least onceってぶっちゃけ問題の先送りだったよね #kafkajpAt least onceってぶっちゃけ問題の先送りだったよね #kafkajp
At least onceってぶっちゃけ問題の先送りだったよね #kafkajp
 
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
ツール比較しながら語る O/RマッパーとDBマイグレーションの実際のところ
 
PostgreSQLアンチパターン
PostgreSQLアンチパターンPostgreSQLアンチパターン
PostgreSQLアンチパターン
 
SQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するかSQL大量発行処理をいかにして高速化するか
SQL大量発行処理をいかにして高速化するか
 
9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...
9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...
9/14にリリースされたばかりの新LTS版Java 17、ここ3年間のJavaの変化を知ろう!(Open Source Conference 2021 O...
 
コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)
コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)
コンテナを突き破れ!! ~コンテナセキュリティ入門基礎の基礎~(Kubernetes Novice Tokyo #11 発表資料)
 
Dockerfile を書くためのベストプラクティス解説編
Dockerfile を書くためのベストプラクティス解説編Dockerfile を書くためのベストプラクティス解説編
Dockerfile を書くためのベストプラクティス解説編
 
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
Apache Sparkに手を出してヤケドしないための基本 ~「Apache Spark入門より」~ (デブサミ 2016 講演資料)
 
Akkaとは。アクターモデル とは。
Akkaとは。アクターモデル とは。Akkaとは。アクターモデル とは。
Akkaとは。アクターモデル とは。
 

Destaque

Embulkを活用したログ管理システム
Embulkを活用したログ管理システムEmbulkを活用したログ管理システム
Embulkを活用したログ管理システムAkihiro Ikezoe
 
Embulk at Treasure Data
Embulk at Treasure DataEmbulk at Treasure Data
Embulk at Treasure DataSatoshi Akama
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とToru Takahashi
 
20150219 初めての「embulk」
20150219 初めての「embulk」20150219 初めての「embulk」
20150219 初めての「embulk」Hideto Masuoka
 
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』Yuya Niimi
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11Sadayuki Furuhashi
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -N Masahiro
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?Sadayuki Furuhashi
 
Fluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2studyFluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2studySATOSHI TAGOMORI
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkSadayuki Furuhashi
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container EraSadayuki Furuhashi
 
Open Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceOpen Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceSATOSHI TAGOMORI
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupSadayuki Furuhashi
 
Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門Keisuke Takahashi
 
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In RubySATOSHI TAGOMORI
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたRyoji Kurosawa
 
Lucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポートLucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポートShinpei Nakata
 
Flumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システムFlumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システムSatoshi Iijima
 

Destaque (20)

Embulkを活用したログ管理システム
Embulkを活用したログ管理システムEmbulkを活用したログ管理システム
Embulkを活用したログ管理システム
 
Embulk at Treasure Data
Embulk at Treasure DataEmbulk at Treasure Data
Embulk at Treasure Data
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
 
20150219 初めての「embulk」
20150219 初めての「embulk」20150219 初めての「embulk」
20150219 初めての「embulk」
 
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』20151215 embulk  『新人がEmbulk mBaaSプラグインを開発した話』
20151215 embulk 『新人がEmbulk mBaaSプラグインを開発した話』
 
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
分散ワークフローエンジン『Digdag』の実装 at Tokyo RubyKaigi #11
 
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
 
Embulk 20150411
Embulk 20150411Embulk 20150411
Embulk 20150411
 
DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?DigdagはなぜYAMLなのか?
DigdagはなぜYAMLなのか?
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
 
Fluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2studyFluentdでログ収集「だけ」やる話 #study2study
Fluentdでログ収集「だけ」やる話 #study2study
 
Fighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with EmbulkFighting Against Chaotically Separated Values with Embulk
Fighting Against Chaotically Separated Values with Embulk
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Open Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud ServiceOpen Source Software, Distributed Systems, Database as a Cloud Service
Open Source Software, Distributed Systems, Database as a Cloud Service
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes MeetupFluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
 
Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門Big Data入門に見せかけたFluentd入門
Big Data入門に見せかけたFluentd入門
 
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In Ruby
 
Elasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみたElasticsearchインデクシングのパフォーマンスを測ってみた
Elasticsearchインデクシングのパフォーマンスを測ってみた
 
Lucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポートLucene/Solr Revolution 2016 参加レポート
Lucene/Solr Revolution 2016 参加レポート
 
Flumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システムFlumeを活用したAmebaにおける大規模ログ収集システム
Flumeを活用したAmebaにおける大規模ログ収集システム
 

Semelhante a Embulk - 進化するバルクデータローダ

Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.comRenzo Tomà
 
Retrofitting Continuous Delivery
Retrofitting Continuous Delivery Retrofitting Continuous Delivery
Retrofitting Continuous Delivery Alan Norton
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵Amazon Web Services Korea
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersSpeck&Tech
 
Evolutionary Database Design
Evolutionary Database DesignEvolutionary Database Design
Evolutionary Database DesignAndrei Solntsev
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesEduardo Castro
 
Rust kafka-5-2019-unskip
Rust kafka-5-2019-unskipRust kafka-5-2019-unskip
Rust kafka-5-2019-unskipGerard Klijs
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit
 
End to end testing a web application with Clojure
End to end testing a web application with ClojureEnd to end testing a web application with Clojure
End to end testing a web application with ClojureGerard Klijs
 
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS Erik Osterman
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Sadayuki Furuhashi
 
CI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the TimeCI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the TimeAmazon Web Services
 
Sql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ramSql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ramChris Adkin
 
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Claus Ibsen
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftAmazon Web Services LATAM
 
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014Amazon Web Services
 
Novices guide to docker
Novices guide to dockerNovices guide to docker
Novices guide to dockerAlec Clews
 

Semelhante a Embulk - 進化するバルクデータローダ (20)

Scaling an ELK stack at bol.com
Scaling an ELK stack at bol.comScaling an ELK stack at bol.com
Scaling an ELK stack at bol.com
 
Retrofitting Continuous Delivery
Retrofitting Continuous Delivery Retrofitting Continuous Delivery
Retrofitting Continuous Delivery
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 
Scaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customersScaling an invoicing SaaS from zero to over 350k customers
Scaling an invoicing SaaS from zero to over 350k customers
 
Evolutionary Database Design
Evolutionary Database DesignEvolutionary Database Design
Evolutionary Database Design
 
SQL Server 2008 Integration Services
SQL Server 2008 Integration ServicesSQL Server 2008 Integration Services
SQL Server 2008 Integration Services
 
Rust kafka-5-2019-unskip
Rust kafka-5-2019-unskipRust kafka-5-2019-unskip
Rust kafka-5-2019-unskip
 
[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue[AWS Builders] Effective AWS Glue
[AWS Builders] Effective AWS Glue
 
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac...
 
End to end testing a web application with Clojure
End to end testing a web application with ClojureEnd to end testing a web application with Clojure
End to end testing a web application with Clojure
 
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS An Ensemble Core with Docker - Solving a Real Pain in the PaaS
An Ensemble Core with Docker - Solving a Real Pain in the PaaS
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
CI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the TimeCI/CD on AWS Deploy Everything All the Time
CI/CD on AWS Deploy Everything All the Time
 
Sql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ramSql server engine cpu cache as the new ram
Sql server engine cpu cache as the new ram
 
Using Embulk at Treasure Data
Using Embulk at Treasure DataUsing Embulk at Treasure Data
Using Embulk at Treasure Data
 
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
Cloud-Native Integration with Apache Camel on Kubernetes (Copenhagen October ...
 
Melhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon RedshiftMelhores práticas de data warehouse no Amazon Redshift
Melhores práticas de data warehouse no Amazon Redshift
 
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
(APP201) Going Zero to Sixty with AWS Elastic Beanstalk | AWS re:Invent 2014
 
Novices guide to docker
Novices guide to dockerNovices guide to docker
Novices guide to docker
 
Celery with python
Celery with pythonCelery with python
Celery with python
 

Mais de Sadayuki Furuhashi

Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Sadayuki Furuhashi
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsSadayuki Furuhashi
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Sadayuki Furuhashi
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Sadayuki Furuhashi
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreSadayuki Furuhashi
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoSadayuki Furuhashi
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualSadayuki Furuhashi
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataSadayuki Furuhashi
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into HadoopSadayuki Furuhashi
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectSadayuki Furuhashi
 
gumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectgumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectSadayuki Furuhashi
 

Mais de Sadayuki Furuhashi (20)

Scripting Embulk Plugins
Scripting Embulk PluginsScripting Embulk Plugins
Scripting Embulk Plugins
 
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
Performance Optimization Techniques of MessagePack-Ruby - RubyKaigi 2019
 
Making KVS 10x Scalable
Making KVS 10x ScalableMaking KVS 10x Scalable
Making KVS 10x Scalable
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 
Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1Understanding Presto - Presto meetup @ Tokyo #1
Understanding Presto - Presto meetup @ Tokyo #1
 
Prestogres internals
Prestogres internalsPrestogres internals
Prestogres internals
 
Presto+MySQLで分散SQL
Presto+MySQLで分散SQLPresto+MySQLで分散SQL
Presto+MySQLで分散SQL
 
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
 
Fluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect MoreFluentd - Set Up Once, Collect More
Fluentd - Set Up Once, Collect More
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasualWhat's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
What's new in v11 - Fluentd Casual Talks #3 #fluentdcasual
 
How we use Fluentd in Treasure Data
How we use Fluentd in Treasure DataHow we use Fluentd in Treasure Data
How we use Fluentd in Treasure Data
 
Fluentd meetup at Slideshare
Fluentd meetup at SlideshareFluentd meetup at Slideshare
Fluentd meetup at Slideshare
 
How to collect Big Data into Hadoop
How to collect Big Data into HadoopHow to collect Big Data into Hadoop
How to collect Big Data into Hadoop
 
Fluentd meetup
Fluentd meetupFluentd meetup
Fluentd meetup
 
upload test 1
upload test 1upload test 1
upload test 1
 
Programming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack ProjectProgramming Tools and Techniques #369 - The MessagePack Project
Programming Tools and Techniques #369 - The MessagePack Project
 
Gumi study7 messagepack
Gumi study7 messagepackGumi study7 messagepack
Gumi study7 messagepack
 
gumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack ProjectgumiStudy#7 The MessagePack Project
gumiStudy#7 The MessagePack Project
 

Último

US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEroselinkalist12
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxsomshekarkn64
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 

Último (20)

🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 

Embulk - 進化するバルクデータローダ

  • 1. Embulk - 進化するバルク
 データローダ Sadayuki Furuhashi
 Founder & Software Architect Embulk Meetup Tokyo #2
  • 2. A little about me… Sadayuki Furuhashi github: @frsyuki Fluentd - Unifid log collection infrastracture Embulk - Plugin-based parallel ETL Founder & Software Architect
  • 3. What’s Embulk? > An open-source parallel bulk data loader > loads records from “A” to “B” > using plugins > for various kinds of “A” and “B” > to make data integration easy. > which was very painful… Storage, RDBMS, NoSQL, Cloud Service, etc. broken records,
 transactions (idempotency),
 performance, …
  • 4. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 1. First attempt → fails > 2. Write a script to make the records cleaned • Convert ”2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC” • Convert “N" → “” • many cleanings… > 3. Second attempt → another error • Convert “Inf” → “Infinity” > 4. Fix the script, retry, retry, retry… > 5. Oh, some data got loaded twice!?
  • 5. The pains of bulk data loading Example: load a 10GB CSV file to PostgreSQL > 6. Ok, the script worked. > 7. Register it to cron to sync data every day. > 8. One day… it fails with another error • Convert invalid UTF-8 byte sequence to U+FFFD
  • 6. The pains of bulk data loading Example: load 10GB CSV × 720 files > Most of scripts are slow. • People have little time to optimize bulk load scripts > One file takes 1 hour → 720 files takes 1 month (!?) A lot of integration efforts for each storages: > XML, JSON, Apache log format (+some custom), … > SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile… > MongoDB, Elasticsearch, Redshift, Salesforce, …
  • 7. The problems: > Data cleaning (normalization) > How to normalize broken records? > Error handling > How to remove broken records? > Idempotent retrying > How to retry without duplicated loading? > Performance optimization > How to optimize the code or parallelize?
  • 8. HDFS MySQL Amazon S3 Embulk CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Resuming Plugins Plugins bulk load
  • 9. Input Output Embulk’s Plugin Architecture Embulk Core Executor Plugin Filter Filter Guess
  • 10. Output Embulk’s Plugin Architecture Embulk Core Executor Plugin Filter Filter GuessFileInput Parser Decoder
  • 11. Guess Embulk’s Plugin Architecture Embulk Core FileInput Executor Plugin Parser Decoder FileOutput Formatter Encoder Filter Filter
  • 12. Execution overview Task Transaction Task Task taskCount { taskIndex: 0, task: {…} } { taskIndex: 2, task: {…} } runs on a single thread runs on multiple threads
 (or machines)
  • 13. Parallel execution Task Task Task Task Threads Task queue run tasks in parallel (embulk-executor-local-thread)
  • 14. Distributed execution Task Task Task Task Map tasks Task queue run tasks on Hadoop (embulk-executor-mapreduce)
  • 15. Distributed execution (w/ partitioning) Task Task Task Task Map - Shuffle - Reduce Task queue run tasks on Hadoop (embulk-executor-mapreduce)
  • 16. Transaction control fileInput.transaction { parser.transaction { filters.transaction { formatter.transaction { fileOutput.transaction { executor.transaction { … } } } } } } file input plugin parser plugin filter plugins formatter plugin file output plugin executor plugin Task Task
  • 17. Task configuration fileInput.transaction { fileInputTask, taskCount → parser.transaction { parserTask, schema → filters.transaction { filterTasks, schema → formatter.transaction { formatterTask → fileOutput.transaction { fileOutputTask → executor.transaction { → task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask, } taskCount.times.inParallel { taskIndex → run(taskIndex, task) taskCount is decided by input schema is decided by input, and may be modified by filters
  • 18. Task execution parser.run(fileInput, pageOutput) fileInput.open() formatter.open(fileOutput) fileOutput.open() parser plugin file input plugin filter plugins file output plugin formatter plugin …Task Task …
  • 19. Type conversion Embulk type systemInput type system Output type system boolean long double string timestamp boolean integer bigint double precision text varchar date timestamp timestamp with zone … (e.g. PostgreSQL) boolean integer long float double string array geo point geo shape … (e.g. Elasticsearch) Input plugin
 (parser plugin if input is file-based) Output plugin
 (formatter plugin if output is file-based)
  • 20. What’s added since the first release? • v0.3 • Resuming • Filter plugin type • v0.4 • Plugin template generator • Incremental execution (ConfigDiff) • Isolated ClassLoaders for Java plugins • Polyglot command launcher
  • 21. What’s added since the first release? • v0.6 • Executor plugin type • Liquid template engine • v0.7 • EmbulkEmbed & Embulk::Runner • Plugin bundle (embulk-mkbundle) • JRuby 9000 • Gradle v2.6
  • 22. Resuming • Retries a failed transaction without retrying everything. • Skips successful tasks by using information stored in a file by the previous transaction. • embulk run config.yml -r resume-state.yml
  • 23. Filter plugin type • Filtering rows out, filtering columns out, or enrich the data. 18 plugins released.
  • 24. Plugin template generator • Generates template of a plugin. • Generated code is already ready to compile. > You modify & compile it to do your work. • embulk new <category> <new>
  • 25. Incremental execution • Store last file name or row in a file, and next execution starts from there. • Usecase:
 sync new files on S3 to Elasticsearch every day. • embulk run config.yml -o next-config.yml
  • 26. Isolated ClassLoaders for Java plugins • Embulk can load multiple versions of java plugins.
  • 27. Plugin Version Conflicts Embulk Core Java Runtime aws-sdk.jar v1.9 embulk-input-s3.jar Version conflicts! aws-sdk.jar v1.10 embulk-output-redshift.jar
  • 28. Multiple Classloaders in JVM Embulk Core Java Runtime aws-sdk.jar v1.9 embulk-input-s3.jar Isolated environments aws-sdk.jar v1.10 embulk-output-redshift.jar Class Loader 1 Class Loader 2
  • 29. Polyglot launcher script • embulk .jar is a jar file. • embulk.jar is a shell script. • embulk.jar is a bat script. • It sets JVM options to improve performance. • ./embulk run abc
  • 30. Executor plugin type • embulk-executor-mapreduce executes tasks on distributed environment.
  • 31. Liquid template engine • A config file can include variables.
  • 32. EmbulkEmbed & Embulk::Runner • Embed embulk in an application.
  • 33. Plugin bundle • Uses fixed version of plugins. • embulk mkbundle my-project • embulk run -b my-project config.yml
  • 34. Gradle v2.6 • Continous compiling. • “embulk migrate .” upgrades gradle versio of your plugin project. • ./gradlew -t build
  • 35. Future plan • v0.8 • JSON type (issue #306) • Error plugin type (#27, #124) • More (or less) concurrency for output (#231) • v0.9 • More Guess (#242, #235) • Multiple jobs using a single config file (#167)