PQL GROUPBY()
GroupBy()
can be used similar to SELECT-FROM-WHERE-GROUP BY
… in SQL
. See below for the current functionality.
GroupBy()
can be a very memory intensive opperation in FeatureBase. Ensure you have adequate memory to run large GroupBy()
queries or use the max-query-memory FeatureBase configuration option to prevent out of memory issues.
Call Definition
GroupBy(
ROWS_CALL,
... ,
filter=ROW_CALL,
having=CONDITION,
aggregate=AGG_CALL,
sort=SORT_DIR,
limit=UINT,
offset=UINT
)
Mandatory Arguments
ROWS_CALL
: This argument tellsGroupBy()
which fields / field values to group by. Note that Rows() usually only accepts Set, Mutex, and Time fields as arguments. When passed as an argument toGroupBy()
,Rows()
also accept Int and Timestamp fields.- Paging through results is supported by passing the
previous
argument to each of the Rows() calls in theGroupBy()
. Take the last result from your previousGroupBy()
query, and pass each row ID in that result as the previous argument to each of the respective Rows() queries in your nextGroupBy()
query.
- Paging through results is supported by passing the
Optional Arguments
...
: Any number of additional rows calls used for grouping records by – seeROWS_CALL
above.filter
:ROW_CALL
is a row call used to filter records before grouping – analagous to theWHERE
clause in SQL.having
: a condition that can be used to filter groups returned byGroupBy()
based on the value of their aggragate statisticsCONDITION
: the value of thehaving
argument. This value should be structured in one of two ways:- count condition : this means we keeps groups with counts that meet a certain condition
- structure :
Condition(VALUE OPERATOR count OPERATOR VALUE)
orCondition(count OPERATOR VALUE)
- example :
Condition(10 < count < 20)
- the example above only outputs groups with a count between 10 and 20
- structure :
- sum condition : this means we keeps groups with sums that meet a certain condition
- structure :
Condition(VALUE OPERATOR sum OPERATOR VALUE)
orCondition(sum OPERATOR VALUE)
- example :
Condition(100 < sum < 200)
- the example above only outputs groups where the value summed is between 100 and 200
- structure :
- count condition : this means we keeps groups with counts that meet a certain condition
aggregate
: an aggregate statistic applied over groups – similar functionality toSUM()
in theSELECT
clause of a SQLGROUP BY
query.AGG_CALL
: one of the following:- Sum(): will return the sum and count of each group – note the count here does not include the count of records with null values in the field being summed.
- a
Count(Distinct())
query
sort
: this argument can be used to sort the output.SORT_DIR
: the value of the sort argument. Below are the available sort directives for the sort argument."count asc"
or"count desc"
: this will sort the output ascending or descending by the count or records in each group"aggregate asc"
or"aggregate desc"
: this will sort the output ascending or descending by the aggregate statistic provided in the aggregate argument. Again asc for ascending and desc for descending."count asc, aggregate desc"
: finally, we can sort by count and then aggregate doing something like this. Note we can use any combination of the two bullets above. This example will first sort by count in ascending order. Then, groups with the same count will be secondarily sorted by the aggregate statistic in descending order.
limit
: this argument can be used to limit the results to a specific number of groups (i.e. we can return the first two groups by supplyinglimit=2
)offset
: this argument can be used to offset the results to a specific number (i.e. we can skip the first two groups by supplyingoffset=2
)
Returns
- List of “groups”. Each group is an object with a group key and a count key. The count is an integer, and the group is an array of objects which specify the field and row for each row that was intersected to get that result.
Examples
Data:
Index: customer (non keyed index)
_id | age (Int) | has_purchased (Set) | last_purchase (Timestamp)
-----+-----------+---------------------+---------------------------
0 | 23 | ["brand1","brand2"] | 2021-01-05T08:30:00Z
1 | 31 | ["brand1","brand3"] | 2020-09-12T12:30:00Z
2 | 23 | ["brand1","brand3"] | 2021-08-06T16:15:00Z
3 | 19 | [] | null
4 | 25 | ["brand1","brand4"] | 2021-10-01T20:45:00Z
5 | 40 | ["brand4"] | 2022-01-13T11:00:00Z
Example 1
Count the number of customers that have purchased from each brand - sort by that count.
Query
[customer]GroupBy(
Rows(has_purchased),
sort="count desc"
)
Tabular Response
has_purchased | count
---------------+-------
brand1 | 4
brand3 | 2
brand4 | 2
brand2 | 1
HTTP Response
{
"results": [
[
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand1"
}
],
"count": 4
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand3"
}
],
"count": 2
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand4"
}
],
"count": 2
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand2"
}
],
"count": 1
}
]
]
}
Explanation
4 customers purchased from brand 1, 2 customer purchased from brand 3 and 4, and 1 customer purchased from brand 2.
Example 2
What are the counts and sum of ages for each brand? Sorted by sum.
Query
[customer]GroupBy(
Rows(has_purchased),
aggregated=Sum(field=age)
)
Tabular Response
has_purchased | count | sum
---------------+-------+-----
brand1 | 4 | 107
brand2 | 1 | 23
brand3 | 2 | 59
brand4 | 2 | 65
HTTP Response
{
"results": [
[
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand1"
}
],
"count": 4,
"sum": 107
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand2"
}
],
"count": 1,
"sum": 23
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand3"
}
],
"count": 2,
"sum": 59
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand4"
}
],
"count": 2,
"sum": 65
}
]
]
}
Explanation
Our groups are the the values in the has_purchased field due to the Rows(has_purchased)
argument. Then, for each customer, add their age to the sum. The output is above.
Example 3
Group by has_purchased and age and return groups with counts of more than one.
Query
[customer]GroupBy(
Rows(has_purchased),
Rows(age),
having=Condition(count > 1)
)
Tabular Response
has_purchased | age | count
---------------+-----+-------
brand1 | 23 | 2
HTTP Response
{
"results": [
[
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand1"
},
{
"field": "age",
"value": 23
}
],
"count": 2
}
]
]
}
Explanation
There is only one group with a count over 1 - brand1 and age 23.
Example 4
For customers younger than 35, what are the counts of customer for each brand those customers have purchased from?
Query
[customer]GroupBy(
Rows(has_purchased),
filter=Row(age < 35)
)
Tabular Response
has_purchased | count
---------------+-------
brand1 | 4
brand2 | 1
brand3 | 2
brand4 | 1
HTTP Response
{
"results": [
[
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand1"
}
],
"count": 4
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand2"
}
],
"count": 1
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand3"
}
],
"count": 2
},
{
"group": [
{
"field": "has_purchased",
"rowKey": "brand4"
}
],
"count": 1
}
]
]
}
Explanation
The filter Row(age < 35)
is applied before counts are computed for each has_purchased
group. Thus, customer 5 isn’t included and the count is 1 for brand4.