Databases: Column-Oriented DBs revolution!

Column Storage


Summary from above video:
"In SAP HANA we store data in Column Storage(layout) instead of Rows (traditional). This is because it allow us to pipe one column into a CPU Core VERY effectively (and L1,L2,L3 will be used on 100%).
 
This mean we may perform searching, filtering and aggregation (OLAP)  "On FLY"!!!
We don't need to store indexes or aggregated data even in RAM.

Because CPU have a lot of Cores - we perform all above operations highly parallel.

Scanning a numeric column: 5Bn/core/s
Aggregation a numeric column: 12m/core/s"

Diving deeper 

In fact, Column-Oriented DBs exist for a while. Currently, most of industry DBs have some-sort of this:
All of them considered as In-Memory optimized or Analytic Workload (OLAP) optimized, as opposite to common, OLTP optimized Row-oriented DBs.
In summary:
  • Row-Oriented - good for OLTP workloads - like CRUD
  • Column-Oriented - good for OLAP workloads - reporting, aggregation.
So, SAP states that CPUs and data-centers are changes rapidly, so, by using many tricks they was able to COMBINE  OLTP+OLAP ready cases in single DB engine - SAP HANA.

Checking some numbers

I was very curios about underlying algorithms behind Column-Oriented DBs. 
The idea that FULL scanning 1milion array (ArrayList<Integer>) by CPU is compatible with BTree indexes - is unbelievable
Simple POC programs on .Net and Java, mostly proved SAP provided numbers:

Scanning a numeric column: 5Bn/core/s
Aggregation a numeric column: 12m/core/s

In fact, by POC showed lower performance, but using C++ and SIMD(SEE (1-4), AVX) CPU instructions which may compare about 60 integers in single CPU instruction numbers are fair.

Yes, BTree is still blast fast (~5x-50x?), but Full Сolumn Scan now is feasible alternative in real scenarios, and providing a lot! of useful possibilities.

And I really agree that 3Ghz CPU limit it is the real wall for next decade, so we don't have other way except going forward with parallel calculations and speeding CPU-Memory integration (like 3D Memory). And, while this road we may faced with many wild and unexpected things, like:
"FULL scanning is compatible with BTree indexes"
:)

Some links:

Comments