2. A little about me…
Sadayuki Furuhashi
github: @frsyuki
Fluentd - Unified log collection infrastructure
Embulk - Plugin-based parallel ETL
Founder & Software Architect
3. What’s Embulk?
> An open-source parallel bulk data loader
> loads records from “A” to “B”
> using plugins
> for various kinds of “A” and “B”
> to make data integration easy.
> which was very painful…
“A” and “B”: storage, RDBMS, NoSQL, cloud services, etc.
Pain points: broken records, transactions (idempotency), performance, …
4. The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
• Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
• Convert “\N” → “”
• many more cleanups…
> 3. Second attempt → another error
• Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
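A minimal sketch of the kind of cleaning script steps 2 to 4 describe, covering only the three conversions named above. The function names are hypothetical, and the null marker is assumed to be the literal `\N`; a real script would need many more rules.

```python
import csv
import io
from datetime import datetime

NULL_MARKER = r"\N"  # assumption: the source file marks NULL as a literal \N


def clean_value(value: str) -> str:
    """Apply the three conversions from the slide to one CSV field."""
    if value == NULL_MARKER:
        return ""
    if value == "Inf":
        return "Infinity"
    try:
        # "2015-01-27T19:05:00Z" -> "2015-01-27 19:05:00 UTC"
        ts = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return value  # not a timestamp in that format: leave untouched
    return ts.strftime("%Y-%m-%d %H:%M:%S UTC")


def clean_csv(text: str) -> str:
    """Clean every field of every row and return the rewritten CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([clean_value(v) for v in row])
    return out.getvalue()
```

Every new kind of broken record means another branch in `clean_value`, which is why the retry loop in step 4 tends to go on for a while.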
5. The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. Ok, the script worked.
> 7. Register it with cron to sync data every day.
> 8. One day… it fails with another error
• Convert invalid UTF-8 byte sequence to U+FFFD
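The UTF-8 fix in step 8 can be sketched in a couple of lines. This uses Python's built-in `errors="replace"` decoding mode, which substitutes U+FFFD for bytes that are not valid UTF-8; the function name is hypothetical.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# 0xFF can never appear in valid UTF-8, so it is replaced.
cleaned = decode_lenient(b"abc\xffdef")
```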
6. The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
• People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
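The "1 month" figure is simple arithmetic: 720 files × 1 hour = 720 hours = 30 days. The sketch below reproduces it and shows what parallel loading would buy under an idealized assumption of perfect scaling; the worker count of 8 is a hypothetical value, not a number from the talk.

```python
import math


def total_hours(n_files: int, hours_per_file: float, workers: int = 1) -> float:
    """Wall-clock hours if files are loaded in batches of `workers` at a time."""
    batches = math.ceil(n_files / workers)
    return batches * hours_per_file


sequential_days = total_hours(720, 1.0) / 24           # one file at a time
parallel_days = total_hours(720, 1.0, workers=8) / 24  # 8 files at a time (assumed)
```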
A lot of integration effort for each storage system and format:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
7. The problems:
> Data cleaning (normalization)
• How to normalize broken records?
> Error handling
• How to remove broken records?
> Idempotent retrying
• How to retry without duplicated loading?
> Performance optimization
• How to optimize the code or parallelize?
19. Type conversion
Input type system (e.g. PostgreSQL):
boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
→ Input plugin (parser plugin if input is file-based) →
Embulk type system:
boolean, long, double, string, timestamp
→ Output plugin (formatter plugin if output is file-based) →
Output type system (e.g. Elasticsearch):
boolean, integer, long, float, double, string, array, geo point, geo shape, …
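To make the input side of this conversion concrete, here is an illustrative mapping from the PostgreSQL types listed above onto the five Embulk types. The specific pairings and the `to_embulk_type` helper are assumptions made for this sketch, not the conversion table any actual plugin uses.

```python
# Assumed pairings, for illustration only: each source type collapses
# onto one of Embulk's five types.
POSTGRES_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with zone": "timestamp",
}


def to_embulk_type(source_type: str) -> str:
    """Look up the Embulk type for a source column type (hypothetical helper)."""
    key = source_type.strip().lower()
    if key not in POSTGRES_TO_EMBULK:
        raise ValueError(f"no mapping for source type: {source_type!r}")
    return POSTGRES_TO_EMBULK[key]
```

An output plugin would do the reverse step, choosing a destination type for each of the five Embulk types.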
20. What’s added since the first release?
• v0.3
• Resuming
• Filter plugin type
• v0.4
• Plugin template generator
• Incremental execution (ConfigDiff)
• Isolated ClassLoaders for Java plugins
• Polyglot command launcher
22. Resuming
• Retries a failed transaction without retrying
everything.
• Skips successful tasks by using information stored in
a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
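The idea behind resuming can be sketched as follows: record each finished task in a state file, and on the next run skip whatever is already recorded. This is a simplified illustration, not Embulk's implementation; the JSON layout and the `run_with_resume` name are assumptions.

```python
import json
from pathlib import Path


def run_with_resume(tasks, run_task, state_path):
    """Run tasks in order, skipping those a previous run already finished.

    The state file (hypothetical format: a JSON list of task names) is
    rewritten after every successful task, so a failure loses no progress.
    """
    state_file = Path(state_path)
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    for name in tasks:
        if name in done:
            continue  # succeeded in an earlier run
        run_task(name)  # may raise; the state file keeps earlier successes
        done.add(name)
        state_file.write_text(json.dumps(sorted(done)))
```

If `run_task` raises on the second of three tasks, calling `run_with_resume` again with the same state path retries only the second and third.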
23. Filter plugin type
• Filters rows out, filters columns out, or enriches the data.
• 18 plugins released.
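The three kinds of filtering can be illustrated with plain functions over rows represented as dictionaries. This is only a conceptual sketch with hypothetical names; real filter plugins are written against Embulk's plugin API.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def drop_columns(rows, columns):
    """Remove the named columns from every row."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def enrich(rows, extra):
    """Add the columns computed by extra(row) to every row."""
    return [{**row, **extra(row)} for row in rows]


rows = [
    {"id": 1, "status": "ok", "debug": "x"},
    {"id": 2, "status": "error", "debug": "y"},
]
result = enrich(
    drop_columns(filter_rows(rows, lambda r: r["status"] == "ok"), {"debug"}),
    lambda r: {"source": "example"},
)
```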
24. Plugin template generator
• Generates a template for a plugin.
• The generated code is ready to compile.
> You modify & compile it to do your work.
• embulk new <category> <name>
25. Incremental execution
• Stores the last file name or row in a file, and the next
execution starts from there.
• Use case:
sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
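A rough sketch of the incremental idea: each run loads only the files that sort after the last one recorded, then returns an updated config for the next run. The `last_file` key and the function names are hypothetical, and the sketch assumes file names sort in arrival order (for example, date-stamped names).

```python
def next_files(all_files, last_file=None):
    """Return the files that sort after last_file (all files on a first run)."""
    ordered = sorted(all_files)
    if last_file is None:
        return ordered
    return [name for name in ordered if name > last_file]


def run_incremental(all_files, load, config):
    """Load only new files and return the config for the next execution."""
    todo = next_files(all_files, config.get("last_file"))
    for name in todo:
        load(name)
    next_config = dict(config)
    if todo:
        next_config["last_file"] = todo[-1]  # where the next run starts
    return next_config
```

Feeding the returned config into the following run plays the role that next-config.yml plays in the command above.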
27. Plugin Version Conflicts
Embulk Core (on the Java Runtime)
embulk-input-s3.jar → aws-sdk.jar v1.9
embulk-output-redshift.jar → aws-sdk.jar v1.10
Version conflicts!
28. Multiple Classloaders in JVM
Embulk Core (on the Java Runtime)
Class Loader 1: embulk-input-s3.jar → aws-sdk.jar v1.9
Class Loader 2: embulk-output-redshift.jar → aws-sdk.jar v1.10
Isolated environments
29. Polyglot launcher script
• embulk.jar is a jar file.
• embulk.jar is also a shell script.
• embulk.jar is also a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc
30. Executor plugin type
• embulk-executor-mapreduce executes tasks in a
distributed environment.
33. Plugin bundle
• Uses fixed versions of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml
34. Gradle v2.6
• Continuous compiling.
• “embulk migrate .” upgrades the Gradle version of your
plugin project.
• ./gradlew -t build
35. Future plan
• v0.8
• JSON type (issue #306)
• Error plugin type (#27, #124)
• More (or less) concurrency for output (#231)
• v0.9
• More Guess (#242, #235)
• Multiple jobs using a single config file (#167)