Before we dive into a new decade, we are looking back on all that we have accomplished together as a community in 2019. We are excited to be part of a community that has had contributions from more than 160 members all over the world, and we would like to take this opportunity to thank everyone who contributed to the CarbonData project in one way or another. The project has been more active than ever: in 2019 we completed 5 releases consisting of more than 650 commits. Here's a quick summary.
New Features Added:
1. Index Server to distribute the index cache: If the index cache grows large (70-80% of driver memory), pruning can cause excessive garbage collection in the Spark driver, which slows down queries, and the driver may even fail with OutOfMemory errors. To solve these problems we have introduced a distributed Index Cache Server: a separate, scalable server that stores only index information, to which the Spark driver can connect and prune the data using the cached index. Refer: Index Server
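As a sketch, the index server is enabled through driver-side carbon.properties; the property names below follow the CarbonData configuration reference, while the host and port values are purely illustrative:

```
# carbon.properties (driver side) -- illustrative values
carbon.enable.index.server=true      # route pruning to the index server
carbon.index.server.ip=192.168.1.10  # host running the index cache server
carbon.index.server.port=9998        # RPC port of the index server
```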
2. Writing Apache Flink streaming data to Carbon: CarbonData, an indexed columnar data format, is now integrated with Apache Flink, a fault-tolerant streaming dataflow engine, for fast, efficient analytics on big data platforms. Users can build a Flink streaming job and use the Flink sink to write data as table stage files via the Carbon SDK. Later, the Insert Stage DML can be used to insert data from the stage files into the carbon table. This Flink integration provides high concurrency and high throughput for writing streaming data to carbon.
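Once the Flink sink has produced stage files, moving them into the table is a single statement; a sketch with a hypothetical table name:

```sql
-- Load the stage files written by the Flink sink into the carbon table.
INSERT INTO sales STAGE;
```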
3. Reading data from Apache Hive made easy: CarbonData files can now be read from Hive. This helps users on existing Hive deployments that use other file formats, such as ORC or Parquet, to migrate easily to the CarbonData format. Users can run both full-scan queries and filter queries on the Hive table.
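As a sketch, an existing CarbonData table could be exposed to Hive roughly as below; the storage-handler class comes from the carbondata-hive module, and the table schema and location are hypothetical:

```sql
-- Hive DDL: map an existing CarbonData table so Hive can query it.
CREATE EXTERNAL TABLE hive_sales (id INT, amount DOUBLE)
STORED BY 'org.apache.carbondata.hive.CarbonStorageHandler'
LOCATION 'hdfs:///user/hive/warehouse/carbon.store/default/sales';
```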
4. Materialized Views: Query processing can be accelerated in Carbondata with the use of materialized views. Users can create any number of materialized views for repetitive queries to precompute and store the results of expensive operations like joins and aggregations. Multiple materialized views can be created on the same table to serve different queries. Materialized views can also be created over timeseries data aggregated by time-based granularity, which can be analyzed to make better business decisions.
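A minimal sketch with hypothetical table and column names (in the 1.x releases this capability was exposed through the MV datamap DDL; the statement below follows the newer MATERIALIZED VIEW syntax):

```sql
-- Precompute an expensive aggregation; queries matching this shape can
-- be rewritten by the optimizer to read the view instead of the table.
CREATE MATERIALIZED VIEW sales_by_region AS
  SELECT region, SUM(amount) AS total
  FROM sales
  GROUP BY region;
```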
5. Support Merge for SCD and CDC scenarios: In the current data warehouse world, slowly changing dimensions (SCD) and change data capture (CDC) are very common scenarios. Legacy systems like RDBMSes handle these scenarios very well thanks to their support for transactions and merge/update/delete operations. Merge syntax is now supported in carbondata to improve SCD and CDC performance.
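As an illustration of the SCD/CDC semantics, here is a standard-style MERGE statement over hypothetical tables; CarbonData initially exposed this capability through a Scala dataset merge API, so treat the SQL form below as a sketch of the behavior rather than the exact surface syntax:

```sql
-- Upsert changes from a CDC feed into a dimension table.
MERGE INTO customers t
USING customer_updates s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.address = s.address
WHEN NOT MATCHED THEN
  INSERT (customer_id, address) VALUES (s.customer_id, s.address);
```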
6. Adapt to SparkSessionExtensions: Spark now provides SparkSessionExtensions, a holder for injection points into the SparkSession that enables extended capabilities. Carbon now uses this to avoid the tight coupling caused by CarbonSession in the Spark environment, which also allows Carbon to support Apache Zeppelin directly.
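With the extensions approach, a plain SparkSession picks up Carbon's parser and optimizer rules via configuration instead of requiring a CarbonSession; a sketch assuming a standard spark-submit deployment:

```
# spark-defaults.conf (or pass with --conf on spark-submit)
spark.sql.extensions=org.apache.spark.sql.CarbonExtensions
```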
7. Binary Data Type: Carbondata now supports storing binary data such as images, blobs, and huge text as binary (byte array). In AI/ML scenarios, storing images in a binary column during model training instead of using raw image files improves query performance due to reduced IO calls. Refer: Datatypes
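For example (hypothetical table; the binary column holds the raw image bytes inline):

```sql
CREATE TABLE training_images (
  id BIGINT,
  label STRING,
  image BINARY   -- raw image bytes stored in the table, avoiding per-file IO
)
STORED AS carbondata;
```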
8. Renaming columns in carbondata tables: Columns in a carbondata table can now be renamed to reflect changes in scenarios or conventions. Refer: Alter DDL
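A sketch of the rename DDL, with hypothetical table and column names:

```sql
-- Rename column `cnt` to `order_count`, keeping its INT type.
ALTER TABLE sales CHANGE cnt order_count INT;
```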
9. Support GZIP compressor for CarbonData files: GZIP compression is now supported for compressing each page of a CarbonData file. GZIP offers a better compression ratio, thereby reducing the store size. On average, GZIP compression reduces store size by 20-30% compared to Snappy compression. Gzip is a preferred choice of compression as it can take advantage of hardware optimizations.
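The compressor can be chosen per table; a sketch using the carbon.column.compressor table property (table name hypothetical):

```sql
CREATE TABLE sales_gz (id INT, amount DOUBLE)
STORED AS carbondata
TBLPROPERTIES ('carbon.column.compressor'='gzip');  -- default is snappy
```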
10. Carbondata as a unified format for AI engines: By integrating carbondata with Uber's Petastorm, carbondata can serve as unified storage for multiple engines such as PyTorch, PySpark, and TensorFlow. To handle AI scenarios:
- Carbondata provides a Python SDK to read and write files.
- Carbondata provides Arrow integration in the SDK: the SDK can fill an Arrow vector by reading carbondata files and return a byte array, which any AI engine can use to read the data.
- Carbondata has been integrated with Uber's open-source Petastorm library.
11. Carbondata now supports adding segments by path: Data generated outside the carbondata servers, for example via the SDK, can now be added to an existing transactional table.
12. Reading segments from other file formats made easy in Carbondata: Data written in other file formats, such as Parquet or ORC, can now be added to a carbondata table without converting it to the carbondata format. Segments can be added from their path along with their format.
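A sketch covering the two items above, registering externally generated data as a new segment by path and format (path and table name hypothetical):

```sql
-- Register an externally written Parquet folder as a new segment.
ALTER TABLE sales ADD SEGMENT
OPTIONS ('path'='hdfs:///data/external/sales_2019', 'format'='parquet');
```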
Functional Improvements:
1. DDL support for the CarbonData LRU cache: DDL on the CarbonData LRU cache allows operations such as showing and clearing the current cache for a specific table. Refer: Cache DDL
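The cache DDL looks roughly like this (table name hypothetical):

```sql
SHOW METACACHE;                 -- cache usage across all tables
SHOW METACACHE ON TABLE sales;  -- cache usage for one table
DROP METACACHE ON TABLE sales;  -- evict the cached entries for the table
```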
2. Support altering the SORT_COLUMNS property on a table: Previously, sort columns could be configured only at table creation, which forced users to load data with the same sort columns even when their query scenarios changed. This feature supports altering the sort columns even after the table is created. Refer: Alter DDL
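For instance, to sort future loads on different columns (table and column names hypothetical):

```sql
-- Subsequent loads will be sorted on (region, amount).
ALTER TABLE sales SET TBLPROPERTIES ('SORT_COLUMNS'='region,amount');
```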
3. Support configurable page size: This feature allows users to configure the page size, giving control over memory utilization while reading and loading data, especially for complex, varchar, and binary datatypes.
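A sketch using the table_page_size_inmb property (the value is illustrative and the table hypothetical):

```sql
CREATE TABLE docs (id INT, body STRING)
STORED AS carbondata
TBLPROPERTIES ('table_page_size_inmb'='1');  -- cap each page at 1 MB
```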
4. Support compaction on range-sorted segments: Segments loaded with the range sort scope will now be compacted using range compaction. Refer: Compaction
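Compaction itself is triggered with the existing DDL; range compaction is applied automatically when the segments were loaded with the range sort scope (table name hypothetical):

```sql
-- Merge segments; range-sorted segments are compacted with range compaction.
ALTER TABLE sales COMPACT 'MAJOR';
```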
5. Support adding segments to a partition table: CarbonData already supports ADD SEGMENT for non-partitioned tables. This feature adds the same support for Hive partition tables.
6. Support global sort compaction: Segments loaded with global sort will now be compacted using global sort compaction instead of local sort, improving query performance.
7. Support SDK carbon files in Presto: Presto has been enhanced to read SDK output files, and support has been added for Presto to read stream segment data.
Performance Improvements:
1. Optimized sort performance: Improved sorting performance by nearly 5 percent, thus improving load time.
2. Improved single/concurrent query performance: Nearly 40 percent improvement for concurrent queries on a large number of segments (>10K), and 6 percent improvement for single sequential queries.
3. CarbonData query performance improvement over Parquet: Improved Carbondata's query performance in comparison with Parquet, measured by time taken to query. Refer: TPCH Report (to be updated)
Coming Up:
1. Apache Hive write support: Carbondata files can currently only be read from Hive. We will soon provide support for writing from Hive.
2. Geospatial support: A spatial index allows efficient access to spatial objects. CarbonData rasterizes the user data into segments during data load, so users can query for the coordinates that fall within a given polygon predicate.
3. Carbondata write support for Presto: Carbondata files can currently only be read from Presto. We will soon provide support for writing from Presto.
4. Secondary index: Users will be able to create a secondary index on a carbon table on columns that require quick lookup. The secondary index stores the blockletId as the index, which enables better pruning during queries.
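The planned DDL could look like the following sketch (syntax subject to change until released; table and column names hypothetical):

```sql
-- Build a secondary index on a lookup column of a carbon table.
CREATE INDEX customer_idx ON TABLE sales (customer_id) AS 'carbondata';
```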
5. Optimized insert-into flow: The performance of the insert flow will be improved and its memory usage reduced.
References:
0. Apache Carbondata: https://carbondata.apache.org/
1. Apache Spark: https://spark.apache.org/
2. Apache Hive: https://hive.apache.org/
3. Apache ORC: https://orc.apache.org/
4. Apache Parquet: https://parquet.apache.org/
5. Gzip: https://en.wikipedia.org/wiki/Gzip
6. Snappy: https://en.wikipedia.org/wiki/Snappy_(compression)
7. Uber Petastorm: https://github.com/uber/petastorm
8. Presto: https://prestodb.io/
9. Apache Flink: https://flink.apache.org/
10. Apache Zeppelin: https://zeppelin.apache.org/