Best practices for multi-engine metrics management based on Apache Calcite

谢佳君

Chinese Session 2023-08-18 16:45 GMT+8  #olap

There are a variety of indicators in data analysis, and when maintaining a large number of indicators, there are often the following pain points:

  • Repeat segments cannot be reused.
  • Different engines need to write different SQL.
  • Aperture changes are difficult to synchronize to all downstream.

To solve these problems, ByteDance tried to design a solution using existing technical capabilities:

  • Storing indicators in Hive tables as much as possible: This greatly increases the storage cost and backtracking cost, which is not feasible.
  • Encapsulating indicators into views: Additional table information is generated in Hive, which doubles the number of tables, and partition support is not friendly. The query experience is poor, so it is difficult to promote.

Because the existing technology is not enough to solve the above problems, Bytedance has designed and implemented two new sets of syntax capabilities based on Apache Calcite:

  • Virtual column: Column level view, reuse table column permissions, simple promotion.
  • SQL Define Function: Uses SQL to define functions, facilitating the reuse of SQL fragments.

The combination of these two capabilities can effectively reduce the cost of indicator management such as:

  • Indicators need to be modified only once and do not need to be synchronized downstream.
  • Fields in collection types such as MAP and JSON can be defined as virtual columns, which is more logical and convenient to use.

Specific typical cases and implementation principles will be introduced in the presentation PPT.

Speakers:


Xie Jiajun: bytedance, Bytedance Senior R&D Engineer, Bytedance Senior R&D Engineer, has participated in the 2022 Apache Asia Con presentation. Love open source, often involved in community work, is now an Apache Calcite active committer and Linkedin Coral Contributor.