SQL ingest tool reference
The SQL ingester tool selects data from an MSSQL, MySQL, or Postgres database table. Source column names determine the field names for each ingested record.
Syntax
```shell
molecula-consumer-sql \
  <source-and-target-flags> \
  <sql-flags> \
  <id-flags> \
  <batch-flags> \
  <error-flags> \
  <log-stat-flags> \
  <testing-flags> \
  <auth-token-flags> \
  <tls-authentication-flags>
```
Common flags
Flag | Data type | Description | Default | Required | Additional |
---|---|---|---|---|---|
--batch-size | int | Number of records to read before indexing them as a batch. Recommended: 1,048,576 | 1 | | A larger value improves throughput but increases memory usage. |
--concurrency | int | Number of concurrent sources and indexing routines to launch. | 1 | | Useful when ingesting multiple CSV files. Does not support SQL ingestion or `--auto-generate`. |
--featurebase-hosts | string | Comma-separated list of host:port pairs to use as FeatureBase bind points. | [localhost:10101] | | |
--index | string | Name of target FeatureBase index. | | Yes | |
--string-array-separator | string | Character used to delineate values in a string array. | , | | |
--use-shard-transactional-endpoint | | Use alternate import endpoint that ingests data for all fields in a shard in a single atomic request. | | | Recommended: provides better consistency at some cost to performance. |
SQL ingest flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--driver | string | Key used to find the Go SQL database driver. | postgres | |
--connection-string | string | SQL endpoint credentials. | postgres://user:password@localhost:5432/defaultindex?sslmode=disable | |
--row-expr | string | SQL query plus a type description for each selected column. | | |
--string-array-separator | string | Character used to delineate values in a string array. | , | |
Generate ID flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--auto-generate | | Automatically generate IDs. Used for testing purposes. Cannot be used with `--concurrency`. | | When `--id-field` or `--primary-key-fields` is not defined |
--external-generate | | Allocate `_id` using the FeatureBase ID allocator. Supports `--offset-mode`. Requires `--auto-generate`. | | |
--id-alloc-key-prefix | string | Prefix for ID allocator keys when using `--external-generate`. Requires a different value for each concurrent ingester. | ingest | |
--id-field | string | Name of a field containing positive integers that uniquely identify each record. Use instead of `--primary-key-fields`. | | If `--auto-generate` or `--primary-key-fields` is not defined |
--primary-key-fields | string | Field values are converted to strings for use as the unique `_id`. A single field is used as-is; multiple fields are concatenated using `/` before being added to the target. Use instead of `--id-field`. | [] | If `--auto-generate` or `--id-field` is not defined |
--offset-mode | | Use offset-mode-based autogenerated IDs. Requires `--auto-generate` and `--external-generate`. | | When ingesting from an offset-based data source |
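The `_id` construction described for `--primary-key-fields` can be sketched in a few lines of Python (an illustration of the concatenation rule, not the ingester's actual code; the field names here are hypothetical):

```python
def build_id(record, key_fields):
    # Mimic --primary-key-fields: stringify the chosen fields and
    # join multiple fields with '/' to form the record's _id.
    return "/".join(str(record[f]) for f in key_fields)

record = {"region": "EU", "sku": 1042}
print(build_id(record, ["region"]))         # EU
print(build_id(record, ["region", "sku"]))  # EU/1042
```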
Batch processing flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--batch-size | int | Number of records to read before indexing them as a batch. A larger value improves throughput but increases memory usage. Recommended: 1,048,576 | 1 | |
Error handling flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--allow-decimal-out-of-range | | Allow ingest to continue when it encounters out-of-range decimals in Decimal fields. | false | |
--allow-int-out-of-range | | Allow ingest to continue when it encounters out-of-range integers in Int fields. | false | |
--allow-timestamp-out-of-range | | Allow ingest to continue when it encounters out-of-range timestamps in Timestamp fields. | false | |
--batch-max-staleness | duration | Maximum length of time the oldest record in a batch can exist before the batch is flushed. May result in timeouts while waiting for the source. | | |
--commit-timeout | duration | Maximum time before the commit process is cancelled. A commit informs the data source that the current batch of records has been ingested. May not function for the CSV ingest process. | | |
--skip-bad-rows | int | Skip up to n rows that cannot be processed before failing the ingest. | | |
Logging & statistics flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--log-path | string | Log file to write to. Empty means stderr. | | |
--pprof | string | host:port on which to listen for the pprof Go package. | localhost:6062 | |
--stats | string | host:port on which to host metrics. | localhost:9093 | |
--track-progress | | Periodically print status updates on how many records have been sourced. | | |
--verbose | | Enable verbose logging. | | |
--write-csv | string | Write ingested data to the named CSV file. | | |
Testing flags
Flag | Description | Default | Required |
---|---|---|---|
--delete-index | Delete the existing index specified by `--index` before starting ingest. USE WITH CAUTION. | | |
--dry-run | Parse flags without starting an ingest process. | | |
Authentication token flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--auth-token | string | Authentication token for FeatureBase. | | |
TLS authentication flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--tls.ca-certificate | string | Path to CA certificate file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.certificate | string | Path to certificate file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.enable-client-verification | | Enable verification of client certificates. | | Yes |
--tls.key | string | Path to certificate key file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.skip-verify | | Disable verification of server certificates. Use for self-signed certificates. | | Optional |
Additional information
batch-size

There is no one-size-fits-all `batch-size` value because memory usage per record varies between workloads. During ingestion there is a fixed overhead:
- from setting up an ingester transaction
- for each row

Setting large `batch-size` values will:
- average out the overheads
- proportionally increase memory usage
- improve performance (in general terms)
For example:
Workload includes | Batch size | Typical memory usage (MB) |
---|---|---|
High number of sparse keys | 20,000 | 100+ |
High-frequency keys | 1,000,000+ | |
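The amortization trade-off described above can be sketched numerically. The overhead and per-row cost figures below are made-up placeholders for illustration, not measurements of the ingester:

```python
def per_record_cost(batch_size, batch_overhead_us=50_000, row_cost_us=5):
    """Amortized cost per record: a fixed per-batch setup overhead
    spread across the rows in the batch, plus the per-row cost."""
    return row_cost_us + batch_overhead_us / batch_size

# Larger batches spread the fixed overhead over more rows.
for n in (1, 1_000, 1_048_576):
    print(f"batch-size {n:>9}: {per_record_cost(n):.3f} us/record")
```

The per-record cost falls steeply at first and then flattens, which is why very large batch sizes mainly trade memory for diminishing throughput gains.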
concurrency

The `concurrency` ingest flag runs ingesters in parallel, which can:
- improve utilization on multi-core systems
- allow for redundancy

Alternatively, ingest processes can be launched individually in different environments.
SQL connection strings
--driver | Example --connection-string | Additional information |
---|---|---|
postgres | 'postgresql://postgres:password@localhost:5432/molecula?sslmode=disable' OR 'user=postgres password=password host=localhost port=5432 dbname=molecula sslmode=disable' | Postgres connection strings |
mysql | 'myusername:password@(10.0.0.1:3306)/mydb' OR 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' | MySQL connection strings |
mssql | 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' | SQL Server connection strings |
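For URL-style Postgres connection strings, the parts can be sanity-checked with Python's standard library before passing the string to the ingester. This is a convenience check, not part of the tool; the credentials and host are the placeholders from the table above:

```python
from urllib.parse import urlsplit, parse_qs

conn = "postgresql://postgres:password@localhost:5432/molecula?sslmode=disable"
u = urlsplit(conn)

print(u.username, u.hostname, u.port)  # postgres localhost 5432
print(u.path.lstrip("/"))              # molecula (the database name)
print(parse_qs(u.query))               # {'sslmode': ['disable']}
```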
Missing value processing
Missing and empty string values are handled the same.
Field data type | Expected behaviour |
---|---|
"ID" | Error if "ID" selected for id-field. Otherwise, do not update value in index. |
"DateInt" | Raise error during ingestion - timestamp must have a valid value. |
"Timestamp" | Raise error during ingestion - input is not time. |
"RecordTime" | Do not update value in index. |
"Int" | Do not update value in index. |
"Decimal" | Do not update value in index. |
"String" | Error if "String" select for primary-key field. Otherwise do not update value in index. |
"Bool" | Do not update value in index. |
"StringArray" | Do not update value in index. |
"IDArray" | Do not update value in index. |
"ForeignKey" | Do not update value in index. |
Quoting values
Use double quotes "…" to enclose fields containing:
- Line breaks (CRLF)
- Commas
- Double quotes
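This is the standard RFC 4180-style CSV quoting convention. Python's csv module applies the same rules, so it can be used to preview how a given value will be enclosed (an illustrative check, separate from the ingest tool):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL quotes only fields containing commas, quotes, or line breaks,
# and doubles any embedded double quotes.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(
    ["plain", "has,comma", 'has "quotes"', "line\r\nbreak"]
)
print(buf.getvalue())
```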
Value Path Selection
The `path` argument is an array of JSON object keys which are applied in order.
For example:
Source data | Path selection | Value selected |
---|---|---|
{"a":{"b":{"c":1}}} | ["a","b","c"] | 1 |
Use `allow-missing-fields` to avoid `path` errors where source data is missing.
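The ordered key lookup described above can be sketched as successive dictionary accesses (a minimal illustration of the behaviour, not the ingester's implementation):

```python
from functools import reduce

def select_value(data, path):
    # Apply each JSON object key in order, as the path argument does.
    return reduce(lambda obj, key: obj[key], path, data)

print(select_value({"a": {"b": {"c": 1}}}, ["a", "b", "c"]))  # 1
```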
config options for data types

- Use the `config` flag when changing flags from default values.
Examples
Ingest data from a SQL Server database
The `SELECT` statement queries the SQL Server table, then the ingester:
- converts the data to Roaring Bitmap format
- imports the records to the `my_data` FeatureBase index.
```shell
molecula-consumer-sql \
  --connection-string 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' \
  --featurebase-hosts 10.0.0.1:10101 \
  --batch-size 1000000 \
  --driver=mssql \
  --index=my_data \
  --id-field=id \
  --row-expr 'SELECT tableID as id__ID, zipcode as zipcode__String limit 10'
```
SQL ingest flags to import from assets table
The `SELECT` statement queries the MySQL table, then the ingester:
- converts the records to Roaring Bitmap format
- imports the records to the `asset_list` FeatureBase index.
```shell
molecula-consumer-sql \
  --driver mysql \
  --connection-string 'username:password@(127.0.0.1:3306)/dbname' \
  --featurebase-hosts localhost:10101 \
  --batch-size 10000 \
  --index=asset_list \
  --primary-key-fields 'asset_tag' \
  --row-expr 'SELECT asset_tag as asset_tag__String, weight as weight__Int, warehouse as warehouse__String FROM assets'
```
Join Assets and Events tables into a single FeatureBase index
The `SELECT` statement queries the MySQL `events` table to:
- return `event` data along with the `weight` of the related `asset_tag`
- create a `locale` field based on the first three characters of the `events.pk` field
- join `assets.asset_tag` and `events.asset_tag`.
The ingester then:
- converts the records to Roaring Bitmap format
- imports the records to the `events_plus_weight` FeatureBase index.
```shell
molecula-consumer-sql \
  --driver mysql \
  --connection-string 'username:password@(127.0.0.1:3306)/dbname' \
  --featurebase-hosts localhost:10101 \
  --batch-size 10000 \
  --index=events_plus_weight \
  --primary-key-fields 'pk' \
  --row-expr 'SELECT events.pk as pk__String, events.asset_tag as asset_tag__String, assets.weight as weight__Int, SUBSTRING(events.pk, 1, 3) as locale__String FROM events INNER JOIN assets on assets.asset_tag = events.asset_tag'
```