SQL ingest tool reference
The SQL ingester tool selects data from an MSSQL, MySQL, or Postgres database table. Source column names determine the field names for each ingested record.
Syntax
```shell
molecula-consumer-sql \
  <source-and-target-flags> \
  <sql-flags> \
  <id-flags> \
  <batch-flags> \
  <error-flags> \
  <log-stat-flags> \
  <testing-flags> \
  <auth-token-flags> \
  <tls-authentication-flags>
```
Common flags
Flag | Data type | Description | Default | Required | Additional |
---|---|---|---|---|---|
--batch-size | int | Number of records to read before indexing them as a batch. Recommended: 1,048,576 | 1 | | A larger value improves throughput but increases memory usage. |
--concurrency | int | Number of concurrent sources and indexing routines to launch. | 1 | | Useful when ingesting multiple CSV files. Does not support SQL ingestion or `--auto-generate`. |
--featurebase-hosts | string | Comma-separated list of host:port pairs to use as FeatureBase bind points. | [localhost:10101] | | |
--index | string | Name of target FeatureBase index. | | Yes | |
--string-array-separator | string | Character used to delineate values in a string array. | , | | |
--use-shard-transactional-endpoint | | Use alternate import endpoint that ingests data for all fields in a shard in a single atomic request. | | | Recommended: provides better consistency at some cost to performance. |
SQL ingest flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--driver | string | Key used to find the Go SQL database driver. | postgres | |
--connection-string | string | SQL endpoint credentials. | postgres://user:password@localhost:5432/defaultindex?sslmode=disable | |
--row-expr | string | SQL query plus a type description for each selected column. | | |
--string-array-separator | string | Character used to delineate values in a string array. | , | |
Generate ID flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--auto-generate | | Automatically generate IDs. Used for testing purposes. Cannot be used with `--concurrency`. | | When `--id-field` or `--primary-key-fields` is not defined |
--external-generate | | Allocate `_id` using the FeatureBase ID allocator. Supports `--offset-mode`. Requires `--auto-generate`. | | |
--id-alloc-key-prefix | string | Prefix for ID allocator keys when using `--external-generate`. Requires a different value for each concurrent ingester. | ingest | |
--id-field | string | Name of a field containing positive integers that uniquely identify each record. Use instead of `--primary-key-fields`. | | If `--auto-generate` or `--primary-key-fields` is not defined |
--primary-key-fields | string | Field values are converted to strings for use as the unique `_id`. A single field is used as-is; multiple fields are concatenated using `/` before being added to the target. Use instead of `--id-field`. | [] | If `--auto-generate` or `--id-field` is not defined |
--offset-mode | | Use offset-mode-based autogenerated IDs. Requires `--auto-generate` and `--external-generate`. | | When ingesting from an offset-based data source |
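The `_id` construction described for `--primary-key-fields` can be sketched in a few lines of Python (an illustration of the concatenation rule, not the ingester's actual code; the field names here are hypothetical):

```python
def build_id(record, key_fields):
    # Mimic --primary-key-fields: stringify the chosen fields and
    # join multiple fields with '/' to form the record's _id.
    return "/".join(str(record[f]) for f in key_fields)

record = {"region": "EU", "sku": 1042}
print(build_id(record, ["region"]))         # EU
print(build_id(record, ["region", "sku"]))  # EU/1042
```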
Batch processing flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--batch-size | int | Number of records to read before indexing them as a batch. A larger value improves throughput but increases memory usage. Recommended: 1,048,576 | 1 | |
Error handling flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--allow-decimal-out-of-range | | Allow ingest to continue when it encounters out-of-range decimals in Decimal fields. | false | |
--allow-int-out-of-range | | Allow ingest to continue when it encounters out-of-range integers in Int fields. | false | |
--allow-timestamp-out-of-range | | Allow ingest to continue when it encounters out-of-range timestamps in Timestamp fields. | false | |
--batch-max-staleness | duration | Maximum length of time the oldest record in a batch can exist before the batch is flushed. May result in timeouts while waiting for the source. | | |
--commit-timeout | duration | Maximum time before the commit process is cancelled. A commit informs the data source that the current batch of records has been ingested. May not function for the CSV ingest process. | | |
--skip-bad-rows | int | Skip up to n rows that cannot be processed before failing the ingest. | | |
Logging & statistics flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--log-path | string | Log file to write to. Empty means stderr. | | |
--pprof | string | host:port on which to listen for the pprof Go package. | localhost:6062 | |
--stats | string | host:port on which to host metrics. | localhost:9093 | |
--track-progress | | Periodically print status updates on how many records have been sourced. | | |
--verbose | | Enable verbose logging. | | |
--write-csv | string | Write ingested data to the named CSV file. | | |
Testing flags
Flag | Description | Default | Required |
---|---|---|---|
--delete-index | Delete the existing index specified by `--index` before starting ingest. USE WITH CAUTION. | | |
--dry-run | Parse flags without starting an ingest process. | | |
Authentication token flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--auth-token | string | Authentication token for FeatureBase. | | |
TLS authentication flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--tls.ca-certificate | string | Path to CA certificate file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.certificate | string | Path to certificate file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.enable-client-verification | | Enable verification of client certificates. | | Yes |
--tls.key | string | Path to certificate key file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.skip-verify | | Disable verification of server certificates. Use for self-signed certificates. | | Optional |
Additional information
batch-size

There is no one-size-fits-all `batch-size` value because memory usage per record varies between workloads. During ingestion there is a fixed overhead:
- from setting up an ingester transaction
- for each row

Setting large `batch-size` values will:
- average out the overheads
- proportionally increase memory usage
- improve performance (in general terms)
For example:
Workload includes | Batch size | Typical memory usage (MB) |
---|---|---|
High number of sparse keys | 20,000 | 100+ |
High-frequency keys | 1,000,000+ | |
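The amortization trade-off described above can be sketched numerically. The overhead and per-row cost figures below are made-up placeholders for illustration, not measurements of the ingester:

```python
def per_record_cost(batch_size, batch_overhead_us=50_000, row_cost_us=5):
    """Amortized cost per record: a fixed per-batch setup overhead
    spread across the rows in the batch, plus the per-row cost."""
    return row_cost_us + batch_overhead_us / batch_size

# Larger batches spread the fixed overhead over more rows.
for n in (1, 1_000, 1_048_576):
    print(f"batch-size {n:>9}: {per_record_cost(n):.3f} us/record")
```

The per-record cost falls steeply at first and then flattens, which is why very large batch sizes mainly trade memory for diminishing throughput gains.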
concurrency

The `concurrency` ingest flag runs ingesters in parallel, which can:
- improve utilization on multi-core systems
- allow for redundancy

Alternatively, ingest processes can be launched individually in different environments.
SQL connection strings
--driver | Example --connection-string | Additional information |
---|---|---|
postgres | 'postgresql://postgres:password@localhost:5432/molecula?sslmode=disable' OR 'user=postgres password=password host=localhost port=5432 dbname=molecula sslmode=disable' | Postgres connection strings |
mysql | 'myusername:password@(10.0.0.1:3306)/mydb' OR 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' | MySQL connection strings |
mssql | 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' | SQL Server connection strings |
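For URL-style Postgres connection strings, the parts can be sanity-checked with Python's standard library before passing the string to the ingester. This is a convenience check, not part of the tool; the credentials and host are the placeholders from the table above:

```python
from urllib.parse import urlsplit, parse_qs

conn = "postgresql://postgres:password@localhost:5432/molecula?sslmode=disable"
u = urlsplit(conn)

print(u.username, u.hostname, u.port)  # postgres localhost 5432
print(u.path.lstrip("/"))              # molecula (the database name)
print(parse_qs(u.query))               # {'sslmode': ['disable']}
```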
Missing value processing
Missing and empty string values are handled the same.
Field data type | Expected behaviour |
---|---|
"ID" | Error if "ID" selected for id-field. Otherwise, do not update value in index. |
"DateInt" | Raise error during ingestion - timestamp must have a valid value. |
"Timestamp" | Raise error during ingestion - input is not time. |
"RecordTime" | Do not update value in index. |
"Int" | Do not update value in index. |
"Decimal" | Do not update value in index. |
"String" | Error if "String" select for primary-key field. Otherwise do not update value in index. |
"Bool" | Do not update value in index. |
"StringArray" | Do not update value in index. |
"IDArray" | Do not update value in index. |
"ForeignKey" | Do not update value in index. |
Quoting values
Use double quotes "…" to enclose fields containing:
- Line breaks (CRLF)
- Commas
- Double quotes
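This is the standard RFC 4180-style CSV quoting convention. Python's csv module applies the same rules, so it can be used to preview how a given value will be enclosed (an illustrative check, separate from the ingest tool):

```python
import csv
import io

buf = io.StringIO()
# QUOTE_MINIMAL quotes only fields containing commas, quotes, or line breaks,
# and doubles any embedded double quotes.
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(
    ["plain", "has,comma", 'has "quotes"', "line\r\nbreak"]
)
print(buf.getvalue())
```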
Value Path Selection
The `path` argument is an array of JSON object keys which are applied in order.
For example:
Source data | Path selection | Value selected |
---|---|---|
{"a":{"b":{"c":1}}} | ["a","b","c"] | 1 |
Use `allow-missing-fields` to avoid `path` errors where source data is missing.
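The ordered key lookup described above can be sketched as successive dictionary accesses (a minimal illustration of the behaviour, not the ingester's implementation):

```python
from functools import reduce

def select_value(data, path):
    # Apply each JSON object key in order, as the path argument does.
    return reduce(lambda obj, key: obj[key], path, data)

print(select_value({"a": {"b": {"c": 1}}}, ["a", "b", "c"]))  # 1
```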
config options for data types

- Use the `config` flag when changing flags from default values.
Examples
Ingest data from a SQL Server database
The `SELECT` statement queries the SQL Server table, then the ingester:
- converts the data to Roaring Bitmap format
- imports the records to the `my_data` FeatureBase index.
```shell
molecula-consumer-sql \
  --connection-string 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' \
  --featurebase-hosts 10.0.0.1:10101 \
  --batch-size 1000000 \
  --driver=mssql \
  --index=my_data \
  --id-field=id \
  --row-expr 'SELECT tableID as id__ID, zipcode as zipcode__String limit 10'
```
SQL ingest flags to import from assets table
The `SELECT` statement queries the MySQL table, then the ingester:
- converts the records to Roaring Bitmap format
- imports the records to the `asset_list` FeatureBase index.
```shell
molecula-consumer-sql \
  --driver mysql \
  --connection-string 'username:password@(127.0.0.1:3306)/dbname' \
  --featurebase-hosts localhost:10101 \
  --batch-size 10000 \
  --index=asset_list \
  --primary-key-fields 'asset_tag' \
  --row-expr 'SELECT asset_tag as asset_tag__String, weight as weight__Int, warehouse as warehouse__String FROM assets'
```
Join Assets and Events tables into a single FeatureBase index
The `SELECT` statement queries the MySQL `events` table to:
- return `event` data along with the `weight` of the related `asset_tag`
- create a `locale` field based on the first three characters of the `events.pk` field
- join `assets.asset_tag` and `events.asset_tag`.
The ingester then:
- converts the records to Roaring Bitmap format
- imports the records to the `events_plus_weight` FeatureBase index.
```shell
molecula-consumer-sql \
  --driver mysql \
  --connection-string 'username:password@(127.0.0.1:3306)/dbname' \
  --featurebase-hosts localhost:10101 \
  --batch-size 10000 \
  --index=events_plus_weight \
  --primary-key-fields 'pk' \
  --row-expr 'SELECT events.pk as pk__String, events.asset_tag as asset_tag__String, assets.weight as weight__Int, SUBSTRING(events.pk, 1, 3) as locale__String FROM events INNER JOIN assets on assets.asset_tag = events.asset_tag'
```