PQL TOPK()
TopK()
returns the count of records associated with field values in a given field. The top K most common (i.e. highest count) values are returned. It is equivalent to:
GroupBy(
Rows(FIELD),
filter=ROW_CALL,
limit=UINT,
sort="count desc"
)
Differences from TopN:
- TopN returns approximate results, and TopK returns exact results
- TopK supports time ranges, and TopN does not
- TopN requires a cache (ranked/lru) and TopK does not
- TopK computes total counts for all rows, and TopN does not
- TopK is deterministic, and TopN is not
- TopK does not currently support Tanimoto
Usage notes:
- TopK may take a second or longer to run on high-cardinality set fields
- TopK is fast when a sparse filter is applied, as this allows a large portion of work to be skipped
- When applying a filter which is correlated with row values, TopN and TopK may return dramatically different results
Call Definition
TopK(FIELD, k=UINT, filter=ROW_CALL, from=TIMESTAMP, to=TIMESTAMP)
Mandatory Arguments
FIELD
: the name of the field to group by (i.e. count the records that have a relationship with each value in the field)
Optional Arguments
filter
/ROW_CALL
: the row call used to filter records included in the count.k
: the number of field values to return (i.e. return the topUINT
most common field values).from
: the start date and time when querying Time fields (TIMESTAMP
formatted like'2006-01-02T00:00:00Z'
).TIMESTAMP
here is an inclusive value - i.e. records with relationships made on or after this time will be included in the result – for Time fields only.to
: the end date and time when querying Time fields (TIMESTAMP
formatted like'2006-01-02T00:00:00Z'
).TIMESTAMP
here is an exclusive value - i.e. records with relationships made before this time (but not on this time) will be included in the result – for Time fields only.
Returns
- list of (key,count) pairs sorted in descending order
Examples
Data:
Index: customer (non keyed index)
_id | age (Int) | has_purchased (Set) | last_purchase (Timestamp)
-----+-----------+---------------------+---------------------------
0 | 23 | ["brand1","brand2"] | 2021-01-05T08:30:00Z
1 | 31 | ["brand1","brand3"] | 2020-09-12T12:30:00Z
2 | 28 | ["brand1","brand3"] | 2021-08-06T16:15:00Z
3 | 19 | [] | null
4 | 25 | ["brand1","brand4"] | 2021-10-01T20:45:00Z
5 | 40 | ["brand4"] | 2022-01-13T11:00:00Z
Example 1
What are the top brands that have been purchased from?
Query
[customer]TopK(has_purchased)
Tabular Response
has_purchased | count
---------------+-------
brand1 | 4
brand4 | 2
brand3 | 2
brand2 | 1
HTTP Response
{
"results": [
[
{
"id": 0,
"key": "brand1",
"count": 4
},
{
"id": 0,
"key": "brand4",
"count": 2
},
{
"id": 0,
"key": "brand3",
"count": 2
},
{
"id": 0,
"key": "brand2",
"count": 1
}
]
]
}
Explanation
4 customers have purchased from brand1, 2 customers have purchased from brand3 and brand4, and 1 customer has purchased from brand2.
Example 2
What is the top brand that have been purchased from?
[customer]TopK(has_purchased, k=1)
Tabular Response
has_purchased | count
---------------+-------
brand1 | 4
HTTP Response
{
"results": [
[
{
"id": 0,
"key": "brand1",
"count": 4
}
]
]
}
Explanation
4 customers have purchased from brand1 making it the top brand. The k
arguments limits the return set to the most common brand.
Example 3
What are the top 2 brand from customers over 25?
[customer]TopK(has_purchased, filter=Row(age > 25), k=2)
Tabular Response
has_purchased | count
---------------+-------
brand3 | 2
brand1 | 2
HTTP Response
{
"results": [
[
{
"id": 0,
"key": "brand1",
"count": 2
},
{
"id": 0,
"key": "brand3",
"count": 2
}
]
]
}
Explanation
The Row()
call limits the records to users over 25 years old - i.e. customers 1, 2, and 5. brand3 and brand1 where both purchased the most at two times.