A few observations from experience with handling of TPC-H queries. See tpch demo folder, query.ijs file.

Here's a working prototype of aggregation prototype. It allowed to significantly simplify the aggregation plan below.

Definition of Aggregation

Given a sets of by-columns and residual columns, the latter each having a scalar-valued aggregator (count, sum, avg, min, max, etc.), aggregation replaces by-columns with their compound key (or nub), and their grouping map i.~ reduces each residual column with its aggregator.

Notes.

Aggregation as Query Stage

In the trivial case of one by-column and one residual
   sum(rc) by bc  where  ...
and rv,bv being filtered values, the result is obtained as
   (~.bv) ; bv sum/. rv 

However, typically there are multiple heterogeneous by-columns and/or multiple residuals. So the following aggregation plan will use index mapping.

Query Notation

The new syntax affects the Select Clause:

It should be easy to parse first by splitting with reserved word by and next splitting by comma, then isolating aggregator from residuals and finally cutting alias with : in resuduals.

If alias is given, it is used as column name in result set.

  Reads__d 'Q1:sum R1,Q2:avg R2 by B1,B2 from t where Gender=Male'
--+--+--+--
B1|B2|Q1|Q2
--+--+--+--
V1|V3|12|67
V2|V4|45|89

If alias is not given, and columns are different, their orginal names are used.

  Reads__d 'min R1,max R2 by B1,B2 from t'
--+--+--+--
B1|B2|R1|R2
--+--+--+--
V1|V3|12|67
V2|V4|45|89

If alias is not given, the same columns are prefixed with name of aggregator.

  Reads__d 'min R,max R by B1,B2 from t where Gender=Male'
--+--+----+----
B1|B2|minR|maxR
--+--+----+----
V1|V3|12  |67
V2|V4|45  |89

More examples.

  Read__d 'sum Salary by Department from t where Gender=Male'
----------+-------
Department|Salary
----------+-------
DEPT1     |1000000
DEPT2     |1500000

  Read__d 'SalTotal:sum Salary by Department from t where Gender=Male'
----------+--------
Department|SalTotal
----------+--------
DEPT1     |1000000
DEPT2     |1500000

Note. In the future, column expressions can be used in place of input columns (Bj and Rk), which are applied after where-filter. For example, year_of(dt), uppercase(name), or (price*discount) etc. Result set names of same columns should be prefixed or alias required.

Hierarchical Aggregation

Each by-column can be thought of as a dimension and aggregation result is a sparse cube, each cell containg a residual tuple addressed by the "coordinates" formed with values of by-columns.

The definition above can be called simple or one-level aggregation. In contrast, multi-level (hierarchical, pivot-table, OLAP cube) aggregation, is when more than one set of by-columns is given. It can be executed as a sequence of simple aggregations for each level, where the by-sets above are added to the filter, current by-set is used as simple by-columns and lower levels ignored, which corresponds to NULLs in the dimensions coordinate. The process can be optimized by re-using the aggregations from lower levels.

JDB/Aggregation/Formal (last edited 2008-12-31 04:37:32 by ChrisBurke)