SlideShare uma empresa Scribd logo
1 de 44
Data processing into ClickHouse
Nikolai Kochetov, ClickHouse developer
Agenda
› Data layout and compression
› In-memory layout and data processing
› Pipelining and parallelism
› Specialized data structures
3 / 44
Data layout and compression
Column-Oriented DBMS
General ideas
› Separate column is stored in separate file
(or several files)
› Only affected columns are read
› Columnar data representation in memory
Additional concepts
› Sparse index
› Per-column compression
5 / 44
Compression
Highly customizable in CREATE TABLE statement
CREATE TABLE codec_example
(
`dt` DateTime, -- default CODEC is LZ4
`dt_none` DateTime CODEC(NONE),
`dt_lz4_4` DateTime CODEC(LZ4HC(4)),
`dt_zstd` DateTime CODEC(ZSTD),
`dt_dd_lz4` DateTime CODEC(DoubleDelta, LZ4HC) -- combined
)
ENGINE = MergeTree
ORDER BY dt
6 / 44
Compression ratio vs decompression speed
Intel Xeon E3-1225V3, enwik8 https://quixdb.github.io/squash-benchmark
7 / 44
Compression ratio
SELECT
column,
formatReadableSize(column_data_compressed_bytes) AS compressed,
formatReadableSize(column_data_uncompressed_bytes) AS uncompressed,
column_data_uncompressed_bytes / column_data_compressed_bytes AS r
FROM system.parts_columns
WHERE (table = 'codec_example') AND active ORDER BY r ASC
┌─column────┬─compressed─┬─uncompressed─┬──────────────────r─┐
│ dt_none │ 67.73 MiB │ 67.70 MiB │ 0.999618408127124 │
│ dt │ 3.06 MiB │ 67.70 MiB │ 22.156958788868835 │
│ dt_lz4_4 │ 3.06 MiB │ 67.70 MiB │ 22.156958788868835 │
│ dt_zstd │ 1.08 MiB │ 67.70 MiB │ 62.91648262048673 │
│ dt_dd_lz4 │ 938.17 KiB │ 67.70 MiB │ 73.89642182401099 │
└───────────┴────────────┴──────────────┴────────────────────┘
8 / 44
Data transformation chain
Time series data
9 / 44
Data transformation chain
Time series data -> Delta
10 / 44
Data transformation chain
Time series data -> Delta -> Delta
11 / 44
Data transformation chain
Time series data -> Delta -> Delta -> variable length encoding
Time series data -> DoubleDelta
Time series data -> DoubleDelta -> LZ4HC
12 / 44
Read time
Test query
SELECT dt_dd_lz4 FROM codec_example FORMAT Null
Enable system.query_log
SET log_queries = 1
xml config:
clickhouse.tech/docs/en/operations/server_settings/settings/#server_settings-query-log
Drop FS cache
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
13 / 44
Read time
Profile events are in system.query_log
SELECT
pe.Names,
pe.Values
FROM system.query_log
ARRAY JOIN ProfileEvents AS pe
WHERE event_date = today() AND type = 'QueryFinish'
AND query_id = '...'
┌─pe.Names──────────────────────────────┬─pe.Values─┐
│ DiskReadElapsedMicroseconds │ 123970 │
│ RealTimeMicroseconds │ 596084 │
...
14 / 44
Read time
Higher compression rate means
› less IO and more CPU time
› less real time for IO-bounded queries
15 / 44
Read time
select sum(halfMD5(halfMD5(dt))) from codec_example
For CPU-bounded queries decompression time is usually insignificant
16 / 44
In-memory layout
and data processing
Data processing
› Data is processed by blocks
› Block stores slices of columns
› Column is represented in one or several buffers
18 / 44
Integers
› Single buffer
› Stores zero at position -1
› Extra 15 bytes are allocated at array’s tail
19 / 44
Strings
› Buffers with data and offsets
› Offsets are prefix sums of sizes
› Store 0 at string’s end
20 / 44
Arrays
› As well as Strings
› Offsets are stored in a
separate file on FS
21 / 44
N-dimensional Arrays
› N-dimensional Array
is an Array of
(N-1)-dimensional Arrays
› N-dimensional Offsets
are Offsets for
(N-1)-dimensional offsets
› Natural generalization of
1-dimensional Arrays
22 / 44
Functions
Concepts
› Pure (with some exceptions)
› Strong typing
› Multiple overloads
Per-columns execution
› Less virtual calls
› SIMD optimizations
› Complication for UDF
23 / 44
SIMD operations
int memcmpSmallAllowOverflow15(const Char * a, size_t a_size,
const Char * b, size_t b_size)
{
size_t min_size = std::min(a_size, b_size);
for (size_t offset = 0; offset < min_size; offset += 16)
{
/// Compare 16 bytes at once
uint16_t mask = _mm_movemask_epi8(_mm_cmpeq_epi8(
_mm_loadu_si128(reinterpret_cast<const __m128i *>(a + offset)),
_mm_loadu_si128(reinterpret_cast<const __m128i *>(b + offset))));
if (~mask) /// if mask has zero bit (some bytes are different)
{
/// Find and compare first different bytes
...
}
}
return detail::cmp(a_size, b_size);
}
24 / 44
SIMD operations
Main loop for memcmpSmallAllowOverflow15
0xa187fc0 : add $0x10,%r8 ; offset += 16
0xa187fc4 : cmp %r9,%r8 ; if (offset >= min_size)
0xa187fc7 : jae 0xa188008 ; exit loop
0xa187fc9 : movdqu (%rdx,%r8,1),%xmm0 ; xmm0 = a[offset] (16 bytes)
0xa187fcf : movdqu (%rdi,%r8,1),%xmm1 ; xmm1 = b[offset] (16 bytes)
0xa187fd5 : pcmpeqb %xmm1,%xmm0 ; xmm0 = (xmm0 == xmm1)
; (16 bytes at once)
0xa187fd9 : pmovmskb %xmm0,%eax ; mask = `bit mask from xmm0`
0xa187fdd : xor $0xffff,%ax ; mask = ~mask
0xa187fe1 : je 0xa187fc0 ; if (mask == 0)
; continue loop
25 / 44
Pipelining and parallelism
Query Pipeline
SELECT avg(length(URL)) FROM hits WHERE URL != ''
Independent execution steps
› Read column URL
› Calculate expression URL != ''
› Filter column URL
› Calculate function length(URL)
› Calculate aggregate function avg
27 / 44
Query Pipeline
SELECT avg(length(URL)) FROM hits WHERE URL != ''
Properties
› Arbitrary graph
› Support parallel execution
› Dynamically changeable
28 / 44
Parallel Execution
SELECT avg(length(URL)) FROM hits WHERE URL != ''
Parallelism by data
29 / 44
Parallel Execution
SELECT hex(SHA256(*)) FROM (
SELECT hex(SHA256(*)) FROM (
SELECT hex(SHA256(*)) FROM (
SELECT URL FROM hits ORDER BY URL ASC)))
Vertical parallelism
30 / 44
Dynamic pipeline modification
Sometimes we need to change pipeline during execution
Sort stores all query data in memory
Set max_bytes_before_external_sort = <some limit>
31 / 44
Dynamic pipeline modification
Sometimes we need to change pipeline during execution
Sort stores all query data in memory
Set max_bytes_before_external_sort = <some limit>
32 / 44
Dynamic pipeline modification
Sometimes we need to change pipeline during execution
Sort stores all query data in memory
Set max_bytes_before_external_sort = <some limit>
33 / 44
Dynamic pipeline modification
Sometimes we need to change pipeline during execution
Sort stores all query data in memory
Set max_bytes_before_external_sort = <some limit>
34 / 44
Dynamic pipeline modification
Sometimes we need to change pipeline during execution
Sort stores all query data in memory
Set max_bytes_before_external_sort = <some limit>
35 / 44
Query Pipeline
SELECT avg(length(URL)) + 1
FROM hits WHERE URL != ''
WITH TOTALS SETTINGS extremes = 1
┌─plus(avg(length(URL)), 1)─┐
│ 85.3475007793562 │
└───────────────────────────┘
Totals:
┌─plus(avg(length(URL)), 1)─┐
│ 85.3475007793562 │
└───────────────────────────┘
Extremes:
┌─plus(avg(length(URL)), 1)─┐
│ 85.3475007793562 │
│ 85.3475007793562 │
└───────────────────────────┘
36 / 44
Specialized data structures
Task analysis
Task example: string search.
Possible aspects of a task
› Approximate or exact search
› Substring or regexp
› Single or multiple needles
› Single or multiple haystacks
› Short or long strings
› Bytes, unicode code points, real words
For every option can be created specialized algorithm
38 / 44
Concepts
› Take the best implementations
Example: simdjson, pdqsort
› Improve existent algorithms
Volnitsky -> MultiVolnitsky
memcpy -> memcpySmallAllowReadWriteOverflow15
› Use more optimal specializations
40 hash table implementations for GROUP BY
› Test performance on real data
Per-commit tests on real (obfuscated) dataset with page hits
› Profiling
39 / 44
GROUP BY
› Hash table
› Parallel
› Merging in
single thread
40 / 44
GROUP BY
Two level
› Split data to
256 buckets
› Merging in
multiple threads
› More efficient for
remote queries
41 / 44
Hash table specializations
› 8-bit or 16 bit key
lookup table
› 32, 64, 128, 256 bit key
32-bit hash for aggregating, 64-bit hash for merging
› several fixed size keys
represented as single integer if possible
› string key
store pre-calculated hash in hash table
small string optimization
› LowCardinality key
pre-calculated hash for dictionaries
pre-calculated bucket for consecutively repeated dictionaries
42 / 44
Conclusion
› Specialized algorithms and data structures
are necessary for the best performance
› Use the same ideas in your projects
› Contribute: https://github.com/ClickHouse/ClickHouse
43 / 44
Thank you!
QA
44 / 44

Mais conteúdo relacionado

Mais procurados

Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDuyhai Doan
 
Oracle 10g Performance: chapter 09 enqueues
Oracle 10g Performance: chapter 09 enqueuesOracle 10g Performance: chapter 09 enqueues
Oracle 10g Performance: chapter 09 enqueuesKyle Hailey
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenPostgresOpen
 
Fosdem2012 mariadb-5.3-query-optimizer-r2
Fosdem2012 mariadb-5.3-query-optimizer-r2Fosdem2012 mariadb-5.3-query-optimizer-r2
Fosdem2012 mariadb-5.3-query-optimizer-r2Sergey Petrunya
 
PostgreSQL, performance for queries with grouping
PostgreSQL, performance for queries with groupingPostgreSQL, performance for queries with grouping
PostgreSQL, performance for queries with groupingAlexey Bashtanov
 
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
SCALE 15x Minimizing PostgreSQL Major Version Upgrade DowntimeSCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
SCALE 15x Minimizing PostgreSQL Major Version Upgrade DowntimeJeff Frost
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangaspgdayrussia
 
Cassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelonaCassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelonaDuyhai Doan
 
Automating Disaster Recovery PostgreSQL
Automating Disaster Recovery PostgreSQLAutomating Disaster Recovery PostgreSQL
Automating Disaster Recovery PostgreSQLNina Kaufman
 
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오PgDay.Seoul
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Alexey Lesovsky
 
OSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyOSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyNETWAYS
 
My sql fabric ha and sharding solutions
My sql fabric ha and sharding solutionsMy sql fabric ha and sharding solutions
My sql fabric ha and sharding solutionsLouis liu
 
Extra performance out of thin air
Extra performance out of thin airExtra performance out of thin air
Extra performance out of thin airKonstantine Krutiy
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File SystemAdrian Huang
 
Native erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentationNative erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentationlin bao
 
Recent my sql_performance Test detail
Recent my sql_performance Test detailRecent my sql_performance Test detail
Recent my sql_performance Test detailLouis liu
 
Wait Events 10g
Wait Events 10gWait Events 10g
Wait Events 10gsagai
 

Mais procurados (20)

Distributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeConDistributed algorithms for big data @ GeeCon
Distributed algorithms for big data @ GeeCon
 
Oracle 10g Performance: chapter 09 enqueues
Oracle 10g Performance: chapter 09 enqueuesOracle 10g Performance: chapter 09 enqueues
Oracle 10g Performance: chapter 09 enqueues
 
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres OpenJohn Melesky - Federating Queries Using Postgres FDW @ Postgres Open
John Melesky - Federating Queries Using Postgres FDW @ Postgres Open
 
Fosdem2012 mariadb-5.3-query-optimizer-r2
Fosdem2012 mariadb-5.3-query-optimizer-r2Fosdem2012 mariadb-5.3-query-optimizer-r2
Fosdem2012 mariadb-5.3-query-optimizer-r2
 
PostgreSQL, performance for queries with grouping
PostgreSQL, performance for queries with groupingPostgreSQL, performance for queries with grouping
PostgreSQL, performance for queries with grouping
 
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
SCALE 15x Minimizing PostgreSQL Major Version Upgrade DowntimeSCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
SCALE 15x Minimizing PostgreSQL Major Version Upgrade Downtime
 
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki LinnakangasPG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
PG Day'14 Russia, PostgreSQL System Architecture, Heikki Linnakangas
 
Cassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelonaCassandra nice use cases and worst anti patterns no sql-matters barcelona
Cassandra nice use cases and worst anti patterns no sql-matters barcelona
 
Automating Disaster Recovery PostgreSQL
Automating Disaster Recovery PostgreSQLAutomating Disaster Recovery PostgreSQL
Automating Disaster Recovery PostgreSQL
 
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 
OSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyOSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross Lawley
 
My sql fabric ha and sharding solutions
My sql fabric ha and sharding solutionsMy sql fabric ha and sharding solutions
My sql fabric ha and sharding solutions
 
Extra performance out of thin air
Extra performance out of thin airExtra performance out of thin air
Extra performance out of thin air
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
MariaDB Temporal Tables
MariaDB Temporal TablesMariaDB Temporal Tables
MariaDB Temporal Tables
 
Linux Kernel - Virtual File System
Linux Kernel - Virtual File SystemLinux Kernel - Virtual File System
Linux Kernel - Virtual File System
 
Native erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentationNative erasure coding support inside hdfs presentation
Native erasure coding support inside hdfs presentation
 
Recent my sql_performance Test detail
Recent my sql_performance Test detailRecent my sql_performance Test detail
Recent my sql_performance Test detail
 
Wait Events 10g
Wait Events 10gWait Events 10g
Wait Events 10g
 

Semelhante a Data processing into ClickHouse: Columnar layouts, compression, and specialized algorithms

Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerPlatonov Sergey
 
Servers and Processes: Behavior and Analysis
Servers and Processes: Behavior and AnalysisServers and Processes: Behavior and Analysis
Servers and Processes: Behavior and Analysisdreamwidth
 
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...confluent
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12Tim Bunce
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performancesource{d}
 
Neo4j after 1 year in production
Neo4j after 1 year in productionNeo4j after 1 year in production
Neo4j after 1 year in productionAndrew Nikishaev
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby SystemsEngine Yard
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Alexey Lesovsky
 
Lecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports DevelopmentLecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports DevelopmentMohammed Farrag
 
Replication MongoDB Days 2013
Replication MongoDB Days 2013Replication MongoDB Days 2013
Replication MongoDB Days 2013Randall Hunt
 
Poker, packets, pipes and Python
Poker, packets, pipes and PythonPoker, packets, pipes and Python
Poker, packets, pipes and PythonRoger Barnes
 
Scale17x buffer overflows
Scale17x buffer overflowsScale17x buffer overflows
Scale17x buffer overflowsjohseg
 
Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.Ontico
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksMarian Marinov
 
Fatkulin presentation
Fatkulin presentationFatkulin presentation
Fatkulin presentationEnkitec
 

Semelhante a Data processing into ClickHouse: Columnar layouts, compression, and specialized algorithms (20)

Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory Sanitizer
 
Servers and Processes: Behavior and Analysis
Servers and Processes: Behavior and AnalysisServers and Processes: Behavior and Analysis
Servers and Processes: Behavior and Analysis
 
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
Kafka Summit SF 2017 - One Day, One Data Hub, 100 Billion Messages: Kafka at ...
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
Perl at SkyCon'12
Perl at SkyCon'12Perl at SkyCon'12
Perl at SkyCon'12
 
Improving go-git performance
Improving go-git performanceImproving go-git performance
Improving go-git performance
 
Neo4j after 1 year in production
Neo4j after 1 year in productionNeo4j after 1 year in production
Neo4j after 1 year in production
 
Meltdown & spectre
Meltdown & spectreMeltdown & spectre
Meltdown & spectre
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby Systems
 
Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
 
Lecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports DevelopmentLecture 6 Kernel Debugging + Ports Development
Lecture 6 Kernel Debugging + Ports Development
 
Meltdown & Spectre
Meltdown & Spectre Meltdown & Spectre
Meltdown & Spectre
 
Replication MongoDB Days 2013
Replication MongoDB Days 2013Replication MongoDB Days 2013
Replication MongoDB Days 2013
 
Poker, packets, pipes and Python
Poker, packets, pipes and PythonPoker, packets, pipes and Python
Poker, packets, pipes and Python
 
Scale17x buffer overflows
Scale17x buffer overflowsScale17x buffer overflows
Scale17x buffer overflows
 
Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
 
Fatkulin presentation
Fatkulin presentationFatkulin presentation
Fatkulin presentation
 
Tek tutorial
Tek tutorialTek tutorial
Tek tutorial
 

Mais de Athens Big Data

22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
 
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...Athens Big Data
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...Athens Big Data
 
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
20th Athens Big Data Meetup - 2nd Talk - Druid: under the coversAthens Big Data
 
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: VeltiAthens Big Data
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...Athens Big Data
 
19th Athens Big Data Meetup - 1st Talk - NLP understanding
19th Athens Big Data Meetup - 1st Talk - NLP understanding19th Athens Big Data Meetup - 1st Talk - NLP understanding
19th Athens Big Data Meetup - 1st Talk - NLP understandingAthens Big Data
 
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on KubernetesAthens Big Data
 
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a ServiceAthens Big Data
 
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...Athens Big Data
 
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...Athens Big Data
 
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...Athens Big Data
 
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...Athens Big Data
 
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
15th Athens Big Data Meetup - 1st Talk - Running Spark On MesosAthens Big Data
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...Athens Big Data
 
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...Athens Big Data
 
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...Athens Big Data
 
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And GradingAthens Big Data
 
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY StyleAthens Big Data
 

Mais de Athens Big Data (20)

22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
 
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
19th Athens Big Data Meetup - 2nd Talk - NLP: From news recommendation to wor...
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
 
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
20th Athens Big Data Meetup - 2nd Talk - Druid: under the covers
 
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
20th Athens Big Data Meetup - 3rd Talk - Message from our sponsor: Velti
 
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ...
 
19th Athens Big Data Meetup - 1st Talk - NLP understanding
19th Athens Big Data Meetup - 1st Talk - NLP understanding19th Athens Big Data Meetup - 1st Talk - NLP understanding
19th Athens Big Data Meetup - 1st Talk - NLP understanding
 
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
18th Athens Big Data Meetup - 2nd Talk - Run Spark and Flink Jobs on Kubernetes
 
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
18th Athens Big Data Meetup - 1st Talk - Timeseries Forecasting as a Service
 
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
17th Athens Big Data Meetup - 2nd Talk - Data Flow Building and Calculation P...
 
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
17th Athens Big Data Meetup - 1st Talk - Speedup Machine Application Learning...
 
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
16th Athens Big Data Meetup - 2nd Talk - A Focus on Building and Optimizing M...
 
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
16th Athens Big Data Meetup - 1st Talk - An Introduction to Machine Learning ...
 
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
15th Athens Big Data Meetup - 1st Talk - Running Spark On Mesos
 
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...
 
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
14th Athens Big Data Meetup - Landoop Workshop - Apache Kafka Entering The St...
 
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
 
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
11th Athens Big Data Meetup - 2nd Talk - Beyond Bitcoin; Blockchain Technolog...
 
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
9th Athens Big Data Meetup - 2nd Talk - Lead Scoring And Grading
 
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
8th Athens Big Data Meetup - 1st Talk - Riding The Streaming Wave DIY Style
 

Último

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Último (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Data processing into ClickHouse: Columnar layouts, compression, and specialized algorithms

  • 1.
  • 2. Data processing into ClickHouse Nikolai Kochetov, ClickHouse developer
  • 3. Agenda › Data layout and compression › In-memory layout and data processing › Pipelining and parallelism › Specialized data structures 3 / 44
  • 4. Data layout and compression
  • 5. Column-Oriented DBMS General ideas › Separate column is stored in separate file (or several files) › Only affected columns are read › Columnar data representation in memory Additional concepts › Sparse index › Per-column compression 5 / 44
  • 6. Compression Highly customizable in CREATE TABLE statement CREATE TABLE codec_example ( `dt` DateTime, -- default CODEC is LZ4 `dt_none` DateTime CODEC(NONE), `dt_lz4_4` DateTime CODEC(LZ4HC(4)), `dt_zstd` DateTime CODEC(ZSTD), `dt_dd_lz4` DateTime CODEC(DoubleDelta, LZ4HC) -- combined ) ENGINE = MergeTree ORDER BY dt 6 / 44
  • 7. Compression ratio vs decompression speed Intel Xeon E3-1225V3, enwik8 https://quixdb.github.io/squash-benchmark 7 / 44
  • 8. Compression ratio SELECT column, formatReadableSize(column_data_compressed_bytes) AS compressed, formatReadableSize(column_data_uncompressed_bytes) AS uncompressed, column_data_uncompressed_bytes / column_data_compressed_bytes AS r FROM system.parts_columns WHERE (table = 'codec_example') AND active ORDER BY r ASC ┌─column────┬─compressed─┬─uncompressed─┬──────────────────r─┐ │ dt_none │ 67.73 MiB │ 67.70 MiB │ 0.999618408127124 │ │ dt │ 3.06 MiB │ 67.70 MiB │ 22.156958788868835 │ │ dt_lz4_4 │ 3.06 MiB │ 67.70 MiB │ 22.156958788868835 │ │ dt_zstd │ 1.08 MiB │ 67.70 MiB │ 62.91648262048673 │ │ dt_dd_lz4 │ 938.17 KiB │ 67.70 MiB │ 73.89642182401099 │ └───────────┴────────────┴──────────────┴────────────────────┘ 8 / 44
  • 9. Data transformation chain Time series data 9 / 44
  • 10. Data transformation chain Time series data -> Delta 10 / 44
  • 11. Data transformation chain Time series data -> Delta -> Delta 11 / 44
  • 12. Data transformation chain Time series data -> Delta -> Delta -> variable length encoding Time series data -> DoubleDelta Time series data -> DoubleDelta -> LZ4HC 12 / 44
  • 13. Read time Test query SELECT dt_dd_lz4 FROM codec_example FORMAT Null Enable system.query_log SET log_queries = 1 xml config: clickhouse.tech/docs/en/operations/server_settings/settings/#server_settings-query-log Drop FS cache $ echo 3 | sudo tee /proc/sys/vm/drop_caches 13 / 44
  • 14. Read time Profile events are in system.query_log SELECT pe.Names, pe.Values FROM system.query_log ARRAY JOIN ProfileEvents AS pe WHERE event_date = today() AND type = 'QueryFinish' AND query_id = '...' ┌─pe.Names──────────────────────────────┬─pe.Values─┐ │ DiskReadElapsedMicroseconds │ 123970 │ │ RealTimeMicroseconds │ 596084 │ ... 14 / 44
  • 15. Read time Higher compression rate means › less IO and more CPU time › less real time for IO-bounded queries 15 / 44
  • 16. Read time select sum(halfMD5(halfMD5(dt))) from codec_example For CPU-bounded queries decompression time is usually insignificant 16 / 44
  • 18. Data processing › Data is processed by blocks › Block stores slices of columns › Column is represented in one or several buffers 18 / 44
  • 19. Integers › Single buffer › Stores zero at position -1 › Extra 15 bytes are allocated at array’s tail 19 / 44
  • 20. Strings › Buffers with data and offsets › Offsets are prefix sums of sizes › Store 0 at string’s end 20 / 44
  • 21. Arrays › As well as Strings › Offsets are stored in a separate file on FS 21 / 44
  • 22. N-dimensional Arrays › N-dimensional Array is an Array of (N-1)-dimensional Arrays › N-dimensional Offsets are Offsets for (N-1)-dimensional offsets › Natural generalization of 1-dimensional Arrays 22 / 44
  • 23. Functions Concepts › Pure (with some exceptions) › Strong typing › Multiple overloads Per-columns execution › Less virtual calls › SIMD optimizations › Complication for UDF 23 / 44
  • 24. SIMD operations int memcmpSmallAllowOverflow15(const Char * a, size_t a_size, const Char * b, size_t b_size) { size_t min_size = std::min(a_size, b_size); for (size_t offset = 0; offset < min_size; offset += 16) { /// Compare 16 bytes at once uint16_t mask = _mm_movemask_epi8(_mm_cmpeq_epi8( _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + offset)), _mm_loadu_si128(reinterpret_cast<const __m128i *>(b + offset)))); if (~mask) /// if mask has zero bit (some bytes are different) { /// Find and compare first different bytes ... } } return detail::cmp(a_size, b_size); } 24 / 44
  • 25. SIMD operations Main loop for memcmpSmallAllowOverflow15 0xa187fc0 : add $0x10,%r8 ; offset += 16 0xa187fc4 : cmp %r9,%r8 ; if (offset >= min_size) 0xa187fc7 : jae 0xa188008 ; exit loop 0xa187fc9 : movdqu (%rdx,%r8,1),%xmm0 ; xmm0 = a[offset] (16 bytes) 0xa187fcf : movdqu (%rdi,%r8,1),%xmm1 ; xmm1 = b[offset] (16 bytes) 0xa187fd5 : pcmpeqb %xmm1,%xmm0 ; xmm0 = (xmm0 == xmm1) ; (16 bytes at once) 0xa187fd9 : pmovmskb %xmm0,%eax ; mask = `bit mask from xmm0` 0xa187fdd : xor $0xffff,%ax ; mask = ~mask 0xa187fe1 : je 0xa187fc0 ; if (mask == 0) ; continue loop 25 / 44
  • 27. Query Pipeline SELECT avg(length(URL)) FROM hits WHERE URL != '' Independent execution steps › Read column URL › Calculate expression URL != '' › Filter column URL › Calculate function length(URL) › Calculate aggregate function avg 27 / 44
  • 28. Query Pipeline SELECT avg(length(URL)) FROM hits WHERE URL != '' Properties › Arbitrary graph › Support parallel execution › Dynamically changeable 28 / 44
  • 29. Parallel Execution SELECT avg(length(URL)) FROM hits WHERE URL != '' Parallelism by data 29 / 44
  • 30. Parallel Execution SELECT hex(SHA256(*)) FROM ( SELECT hex(SHA256(*)) FROM ( SELECT hex(SHA256(*)) FROM ( SELECT URL FROM hits ORDER BY URL ASC))) Vertical parallelism 30 / 44
  • 31. Dynamic pipeline modification Sometimes we need to change pipeline during execution Sort stores all query data in memory Set max_bytes_before_external_sort = <some limit> 31 / 44
  • 32. Dynamic pipeline modification Sometimes we need to change pipeline during execution Sort stores all query data in memory Set max_bytes_before_external_sort = <some limit> 32 / 44
  • 33. Dynamic pipeline modification Sometimes we need to change pipeline during execution Sort stores all query data in memory Set max_bytes_before_external_sort = <some limit> 33 / 44
  • 34. Dynamic pipeline modification Sometimes we need to change pipeline during execution Sort stores all query data in memory Set max_bytes_before_external_sort = <some limit> 34 / 44
  • 35. Dynamic pipeline modification Sometimes we need to change pipeline during execution Sort stores all query data in memory Set max_bytes_before_external_sort = <some limit> 35 / 44
  • 36. Query Pipeline SELECT avg(length(URL)) + 1 FROM hits WHERE URL != '' WITH TOTALS SETTINGS extremes = 1 ┌─plus(avg(length(URL)), 1)─┐ │ 85.3475007793562 │ └───────────────────────────┘ Totals: ┌─plus(avg(length(URL)), 1)─┐ │ 85.3475007793562 │ └───────────────────────────┘ Extremes: ┌─plus(avg(length(URL)), 1)─┐ │ 85.3475007793562 │ │ 85.3475007793562 │ └───────────────────────────┘ 36 / 44
  • 38. Task analysis Task example: string search. Possible aspects of a task › Approximate or exact search › Substring or regexp › Single or multiple needles › Single or multiple haystacks › Short or long strings › Bytes, unicode code points, real words For every option can be created specialized algorithm 38 / 44
  • 39. Concepts › Take the best implementations Example: simdjson, pdqsort › Improve existent algorithms Volnitsky -> MultiVolnitsky memcpy -> memcpySmallAllowReadWriteOverflow15 › Use more optimal specializations 40 hash table implementations for GROUP BY › Test performance on real data Per-commit tests on real (obfuscated) dataset with page hits › Profiling 39 / 44
  • 40. GROUP BY › Hash table › Parallel › Merging in single thread 40 / 44
  • 41. GROUP BY Two level › Split data to 256 buckets › Merging in multiple threads › More efficient for remote queries 41 / 44
  • 42. Hash table specializations › 8-bit or 16 bit key lookup table › 32, 64, 128, 256 bit key 32-bit hash for aggregating, 64-bit hash for merging › several fixed size keys represented as single integer if possible › string key store pre-calculated hash in hash table small string optimization › LowCardinality key pre-calculated hash for dictionaries pre-calculated bucket for consecutively repeated dictionaries 42 / 44
  • 43. Conclusion › Specialized algorithms and data structures are necessary for the best performance › Use the same ideas in your projects › Contribute: https://github.com/ClickHouse/ClickHouse 43 / 44