PQL TOPN()
TopN()
returns the count of records associated with field values in a given field. The top N most common (i.e. highest count) values are returned. It is equivalent to:
GroupBy(
Rows(FIELD),
filter=ROW_CALL,
limit=UINT,
sort="count desc"
)
However, TopN()
is not guaranteed to compute the correct result. See the Caveats
section below.
Call Definition
TopN(FIELD, ROW_CALL, n=UINT)
Mandatory Arguments
FIELD
: the name of the field to group by (i.e. count the records that have a relationship with each value in FIELD)
Optional Arguments
ROW_CALL
: the row call used to filter records included in the count.n
: the number of field values to return (i.e. return the topUINT
most common field values).
Returns
- list of (key,count) pairs sorted in descending order
Caveats:
In general, the order of the resulting row keys is not guaranteed to reflect the true order of bit counts across an index. The exact solution to the problem of computing the TopN counts is prohibitively expensive, so TopN is instead implemented as a heuristic. This provides a significant performance improvement, at the cost of uncertainty in the result order.
For guaranteed exact results, use TopK, though be advised that it may be significantly slower, particularly for high cardinality fields. It can also be faster in some cases, especially when n is large, so experiment with your particular data set to see what works best.
The implementation is based on a per-shard cache. The accuracy of the results depends on how well the counts for the overall index are reflected in the individual shards (so TopN queries on a single-shard index are exact). If the distribution of bits across shards is uniform, shard counts are representative. This is often a reasonable assumption, especially for the top results for large data sets, in which counts might follow Zipfian, exponential, or other long-tail distributions. However, this assumption may not hold for some applications.
Additional implementation details:
- The field’s cache size determines the number of sorted rows to maintain in the cache for purposes of TopN queries. There is a tradeoff between performance and accuracy; increasing the cache size will improve accuracy of results at the cost of performance. Note that this per-shard tradeoff is independent of the per-index performance/accuracy tradeoff mentioned above.
- Fields with cache type ranked will return the top rows sorted by count in descending order.
- Fields with cache type lru will maintain an LRU (Least Recently Used replacement policy) cache, thus a TopN query on this type of field will return rows sorted in order of most recently set bit.
- Once full, the cache will truncate the set of rows according to the field option CacheSize. Rows that straddle the limit and have the same count will be truncated in no particular order.
See field creation for more information about the cache.
Examples
Data:
Index: customer (non keyed index)
_id | age (Int) | has_purchased (Set) | last_purchase (Timestamp)
-----+-----------+---------------------+---------------------------
0 | 23 | ["brand1","brand2"] | 2021-01-05T08:30:00Z
1 | 31 | ["brand1","brand3"] | 2020-09-12T12:30:00Z
2 | 28 | ["brand1","brand3"] | 2021-08-06T16:15:00Z
3 | 19 | [] | null
4 | 25 | ["brand1","brand4"] | 2021-10-01T20:45:00Z
5 | 40 | ["brand4"] | 2022-01-13T11:00:00Z
Example 1
What are the top brands that have been purchased from?
Query
[customer]TopN(has_purchased)
Tabular Response
has_purchased | count
---------------+-------
brand1 | 4
brand4 | 2
brand3 | 2
brand2 | 1
HTTP Response
{
"results": [
[
{
"id": 0,
"key": "brand1",
"count": 4
},
{
"id": 0,
"key": "brand4",
"count": 2
},
{
"id": 0,
"key": "brand3",
"count": 2
},
{
"id": 0,
"key": "brand2",
"count": 1
}
]
]
}
Explanation
4 customers have purchased from brand1, 2 customers have purchased from brand3 and brand4, and 1 customer has purchased from brand2.
Example 2
What is the top brand that have been purchased from?
[customer]TopN(has_purchased, n=1)
Tabular Response
has_purchased | count
---------------+-------
brand1 | 4
HTTP Response
{
"results": [
[
{
"id": 0,
"key": "brand1",
"count": 4
}
]
]
}
Explanation
4 customers have purchased from brand1 making it the top brand. The n
arguments limits the return set to the most common single brand.
Example 3
What are the top 2 brand from customers over 25?
[customer]TopN(has_purchased, Row(age > 25), n=2)
Tabular Response
has_purchased | count
---------------+-------
brand3 | 2
brand1 | 2
HTTP Response
{
"results": [
[
{
"id": 0,
"key": "brand1",
"count": 2
},
{
"id": 0,
"key": "brand3",
"count": 2
}
]
]
}
Explanation
The Row()
call limits the records to users over 25 years old - i.e. customers 1, 2, and 5. brand3 and brand1 where both purchased the most at two times.