PQL DISTINCT()

The Distinct() query returns the distinct values in a field for a set of records.

Distinct() can always be used as a row call. This, in conjunction with the index argument, gives users the ability to query across indexes. Use caution to ensure you get the expected behavior. The examples below outline some considerations.

Call Definitioin

Distinct(ROW_CALL, field=FIELD, index=INDEX)

Mandatory Arguments

field: a string, FIELD, which is the name of some field in INDEX or a string which is the name of some field in the primary index (i.e. the index in square brackets before the query) – the latter is the default when the index argument isn’t supplied. The distinct values returned will be the distinct values from this field.

Optional Arguments

ROW_CALL : the row call use to filter records. Only the values of records returned by this row call will be used in the computation of Distinct(). Note this row call is run against the primary index (i.e. the index in square brackets) when there is not an index argument supplied. It is run against INDEX when the index argument is supplied.
index : INDEX is the name of the index that FIELD exists in. When this argument is supplied, it is the index ROW_CALL is executed against. When it isn’t supplied, FIELD should exist in the primary index and ROW_CALL will be executed on the primary index (i.e. the index in square brackets)

Returns

if FIELD is an unkeyed Set, Mutex, or Time field:
- a list of field values (ID or unsigned integers) are return under the “columns” key
if FIELD is a keyed Set, Mutex, of Time field:
- a list of field values (strings) are returned under the “keys” key
- a list of ID / unsigned integers are returned uner the “columns” key
- the value of the “columns” key is essentially erroneous data; but note, these are the values that will be used if you’re attempting you use Distinct() as the row call. In this case, you might get some unexpected behavior.
if FIELD is an Int, Decimal, or Timestamp field:
- similar to the result of a row call, Distinct() including a list of “columns”. Dissimilarly, these results are separated into two groups – one for negative values and one for positive values. This structure is necessary to accommodate signed values that are represented with bitsets.

Examples

Data:

Index: users (non-keyed index)

 _id | age (Int) | zipcode (Set)
-----+-----------+---------
 0   | 18        | 44444
 1   | 55        | 55555
 2   | 30        | 11111
 3   | 35        | 22222
 4   | 22        | 33333

Index: transactions (non-keyed index)

 _id | amount (Decimal) | category (Keyed Set) | userid (Non-keyed Set)
-----+------------------+----------------------+------------------------
 123 | 99.99            | Tools                | 0
 124 | 34.99            | Hardware             | 1
 125 | 129.99           | Tools                | 1
 126 | 29.99            | Plants               | 0
 127 | 12.99            | Food                 | 0
 128 | 399.99           | Drugs                | 4
 129 | 29.99            | Plants               | 2
 130 | 34.99            | Hardware             | 3

Note: In this case, the userid field in transactions was ingested as an ID field (non-keyed mutex in this case).

Example 1

What are the distinct ages in the users index?

Query

[users]Distinct(field=age)

Tabular Response

HTTP Response

{
  "results": [
    {
      "neg": {
        "columns": []
      },
      "pos": {
        "columns": [
          18,
          22,
          30,
          35,
          55
        ]
      }
    }
  ]
}

Explanation

Distinct() returns the unique values in the age field of the users index. Note these distinct values are separated into negative and positive becuase age is an Int field.

Example 2

What are the distinct prices of transactions?

Query

[transactions]Distinct(field=amount)

Tabular Response

HTTP Response

{
  "results": [
    {
      "neg": {
        "columns": []
      },
      "pos": {
        "columns": [
          1299,
          2999,
          3499,
          9999,
          12999,
          39999
        ]
      }
    }
  ]
}

Explanation

Distinct() returns the unique values in the amount field of the transactions index. Note these distinct values are separated into negative and positive because age is an Decimal field.

Example 3

What are the distinct field values in the category field in the transactions index?

Query

[transactions]Distinct(field=category)

Tabular Response

   _id
----------
 Tools
 Hardware
 Plants
 Food
 Drugs

HTTP Response

{
  "results": [
    {
      "columns": [
        1,
        2,
        3,
        4,
        5
      ],
      "keys": [
        "Tools",
        "Hardware",
        "Plants",
        "Food",
        "Drugs"
      ]
    }
  ]
}

Explanation

Distinct() returns the unique values in the category field of the transactions index. Note the “columns” key returned by Distinct() in the HTTP response. In some sense, this is erroneous data. If this query is used as a row call the “columns” key values will be interpreted as record IDs. However, in this case 1, 2, 3, 4, 5 are not referencing record IDs.

Example 4

How many distinct field value are there in the category field in the transactions index?

Query

[transaction]Count(
  Distinct(field=category)
)

Tabular Response

 count
-------
 5

{
  "results": [
    5
  ]
}

Explanation

Count(Distinct()) can be used to count the number of unique field values in a field. In this case, there are 5.

Example 5

Which users have bought tools and is older than 25?

Query

[users]Intersect(
           Row(age > 25),
           Distinct(Row(category=Tools), field=userid, index=transactions)
       )

Tabular Response

 _id
-----
 1

HTTP Response

{
  "results": [
    {
      "columns": [
        1
      ]
    }
  ]
}

Explanation

Row(age > 25) return the users that are older than 25 - [1, 2, 3]. Distinct() is returning unique userids from the transactions index where the transaction also had category=Tools – [0, 1]. The two sets are intersected which gives [1]. This is an example of a query that spans multiple indexes. It works because the userid field in the transactions index have a soft reference to the records IDs in the users index. “Soft reference” here means there is not an actual contraint ensuring all the userid values in the transactions are actually record IDs in the users index. That would be enforced outside of FeatureBase in this example.