PQL DISTINCT()
The Distinct()
query returns the distinct values in a field for a set of records.
Distinct()
can always be used as a row call. This, in conjunction with the index argument, gives users the ability to query across indexes. Use caution to ensure you get the expected behavior. The examples below outline some considerations.
Call Definitioin
Distinct(ROW_CALL, field=FIELD, index=INDEX)
Mandatory Arguments
field
: a string, FIELD, which is the name of some field in INDEX or a string which is the name of some field in the primary index (i.e. the index in square brackets before the query) – the latter is the default when theindex
argument isn’t supplied. The distinct values returned will be the distinct values from this field.
Optional Arguments
ROW_CALL
: the row call use to filter records. Only the values of records returned by this row call will be used in the computation ofDistinct()
. Note this row call is run against the primary index (i.e. the index in square brackets) when there is not anindex
argument supplied. It is run againstINDEX
when theindex
argument is supplied.index
: INDEX is the name of the index that FIELD exists in. When this argument is supplied, it is the index ROW_CALL is executed against. When it isn’t supplied, FIELD should exist in the primary index andROW_CALL
will be executed on the primary index (i.e. the index in square brackets)
Returns
- if FIELD is an unkeyed Set, Mutex, or Time field:
- a list of field values (ID or unsigned integers) are return under the “columns” key
- if FIELD is a keyed Set, Mutex, of Time field:
- a list of field values (strings) are returned under the “keys” key
- a list of ID / unsigned integers are returned uner the “columns” key
- the value of the “columns” key is essentially erroneous data; but note, these are the values that will be used if you’re attempting you use
Distinct()
as the row call. In this case, you might get some unexpected behavior.
- if FIELD is an Int, Decimal, or Timestamp field:
- similar to the result of a row call,
Distinct()
including a list of “columns”. Dissimilarly, these results are separated into two groups – one for negative values and one for positive values. This structure is necessary to accommodate signed values that are represented with bitsets.
- similar to the result of a row call,
Examples
Data:
Index: users (non-keyed index)
_id | age (Int) | zipcode (Set)
-----+-----------+---------
0 | 18 | 44444
1 | 55 | 55555
2 | 30 | 11111
3 | 35 | 22222
4 | 22 | 33333
Index: transactions (non-keyed index)
_id | amount (Decimal) | category (Keyed Set) | userid (Non-keyed Set)
-----+------------------+----------------------+------------------------
123 | 99.99 | Tools | 0
124 | 34.99 | Hardware | 1
125 | 129.99 | Tools | 1
126 | 29.99 | Plants | 0
127 | 12.99 | Food | 0
128 | 399.99 | Drugs | 4
129 | 29.99 | Plants | 2
130 | 34.99 | Hardware | 3
Note: In this case, the userid
field in transactions
was ingested as an ID field (non-keyed mutex in this case).
Example 1
What are the distinct ages in the users index?
Query
[users]Distinct(field=age)
Tabular Response
----
18
22
30
35
55
HTTP Response
{
"results": [
{
"neg": {
"columns": []
},
"pos": {
"columns": [
18,
22,
30,
35,
55
]
}
}
]
}
Explanation
Distinct()
returns the unique values in the age
field of the users
index. Note these distinct values are separated into negative and positive becuase age
is an Int
field.
Example 2
What are the distinct prices of transactions?
Query
[transactions]Distinct(field=amount)
Tabular Response
-------
1299
2999
3499
9999
12999
39999
HTTP Response
{
"results": [
{
"neg": {
"columns": []
},
"pos": {
"columns": [
1299,
2999,
3499,
9999,
12999,
39999
]
}
}
]
}
Explanation
Distinct()
returns the unique values in the amount
field of the transactions
index. Note these distinct values are separated into negative and positive because age is an Decimal field.
Example 3
What are the distinct field values in the category field in the transactions index?
Query
[transactions]Distinct(field=category)
Tabular Response
_id
----------
Tools
Hardware
Plants
Food
Drugs
HTTP Response
{
"results": [
{
"columns": [
1,
2,
3,
4,
5
],
"keys": [
"Tools",
"Hardware",
"Plants",
"Food",
"Drugs"
]
}
]
}
Explanation
Distinct()
returns the unique values in the category
field of the transactions
index. Note the “columns” key returned by Distinct() in the HTTP response. In some sense, this is erroneous data. If this query is used as a row call the “columns” key values will be interpreted as record IDs. However, in this case 1, 2, 3, 4, 5 are not referencing record IDs.
Example 4
How many distinct field value are there in the category field in the transactions index?
Query
[transaction]Count(
Distinct(field=category)
)
Tabular Response
count
-------
5
{
"results": [
5
]
}
Explanation
Count(Distinct())
can be used to count the number of unique field values in a field. In this case, there are 5.
Example 5
Which users have bought tools and is older than 25?
Query
[users]Intersect(
Row(age > 25),
Distinct(Row(category=Tools), field=userid, index=transactions)
)
Tabular Response
_id
-----
1
HTTP Response
{
"results": [
{
"columns": [
1
]
}
]
}
Explanation
Row(age > 25)
return the users that are older than 25 - [1, 2, 3]
. Distinct()
is returning unique userid
s from the transactions
index where the transaction also had category=Tools
– [0, 1]
. The two sets are intersected which gives [1]
. This is an example of a query that spans multiple indexes. It works because the userid
field in the transactions
index have a soft reference to the records IDs in the users
index. “Soft reference” here means there is not an actual contraint ensuring all the userid
values in the transactions
are actually record IDs in the users
index. That would be enforced outside of FeatureBase in this example.