
SQL ingest tool reference

The SQL ingester tool selects data from an MSSQL, MySQL, or Postgres database table.

Column names from the source table are used as field names for each record.


Before you begin

Syntax

molecula-consumer-sql        \
  <source-and-target-flags>  \
  <sql-flags>                \
  <id-flags>                 \
  <batch-flags>              \
  <error-flags>              \
  <log-stat-flags>           \
  <testing-flags>            \
  <auth-token-flags>         \
  <tls-authentication-flags>

Common flags

| Flag | Data type | Description | Default | Required | Additional |
|------|-----------|-------------|---------|----------|------------|
| `--batch-size` | int | Number of records to read before indexing them as a batch. Recommended: 1,048,576 | 1 |  | A larger value gives better throughput and higher memory usage. |
| `--concurrency` | int | Number of concurrent sources and indexing routines to launch. | 1 | When ingesting multiple CSV files | Not supported for SQL ingestion or with `--auto-generate`. |
| `--featurebase-hosts` | string | FeatureBase bind points, supplied as a comma-separated list of host:port pairs. | [localhost:10101] |  |  |
| `--index` | string | Name of the target FeatureBase index. |  | Yes |  |
| `--string-array-separator` | string | Character used to delineate values in a string array. | , |  |  |
| `--use-shard-transactional-endpoint` |  | Use the alternate import endpoint that ingests data for all fields in a shard in a single atomic request. |  |  | Recommended. Slightly reduces performance but improves consistency. |

SQL ingest flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--driver` | string | Key used to find the Go SQL database driver. | postgres |  |
| `--connection-string` | string | SQL endpoint credentials. | postgres://user:password@localhost:5432/defaultindex?sslmode=disable |  |
| `--row-expr` | string | SQL query plus type descriptions for the input. |  |  |
| `--string-array-separator` | string | Character used to delineate values in a string array. | , |  |

Generate ID flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--auto-generate` |  | Automatically generate IDs. Used for testing purposes. Cannot be used with `--concurrency`. |  | When `--id-field` and `--primary-key-fields` are not defined |
| `--external-generate` |  | Allocate `_id` using the FeatureBase ID allocator. Supports `--offset-mode`. Requires `--auto-generate`. |  |  |
| `--id-alloc-key-prefix` | string | Prefix for ID allocator keys when using `--external-generate`. Requires a different value for each concurrent ingester. | ingest |  |
| `--id-field` | string | A field containing a sequence of positive integers that uniquely identifies each record. Use instead of `--primary-key-fields`. |  | If `--auto-generate` and `--primary-key-fields` are not defined |
| `--primary-key-fields` | string | Convert the named field values to strings for use as the unique `_id`. A single field is used as-is; multiple fields are concatenated with `/`. Use instead of `--id-field`. | [] | If `--auto-generate` and `--id-field` are not defined |
| `--offset-mode` |  | Use offset-based autogenerated IDs. Requires `--auto-generate` and `--external-generate`. |  | When ingesting from an offset-based data source |
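As a sketch of how multiple `--primary-key-fields` values combine into a single `_id` (the concatenation rule described above, assuming a simple string conversion; this is illustrative, not the ingester's actual code):

```python
def make_record_id(record: dict, primary_key_fields: list[str]) -> str:
    """Concatenate primary-key field values with '/' to form the record _id.

    Illustrative only -- the real ingester's conversion rules may differ.
    """
    return "/".join(str(record[f]) for f in primary_key_fields)

record = {"region": "us-east", "sku": 1234, "day": "2023-01-05"}
print(make_record_id(record, ["region", "sku"]))  # -> us-east/1234
print(make_record_id(record, ["region"]))         # -> us-east
```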

Batch processing flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--batch-size` | int | Number of records to read before indexing them as a batch. A larger value gives better throughput and higher memory usage. Recommended: 1,048,576 | 1 |  |

Error handling flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--allow-decimal-out-of-range` |  | Allow ingest to continue when it encounters out-of-range decimals in Decimal fields. | false |  |
| `--allow-int-out-of-range` |  | Allow ingest to continue when it encounters out-of-range integers in Int fields. | false |  |
| `--allow-timestamp-out-of-range` |  | Allow ingest to continue when it encounters out-of-range timestamps in Timestamp fields. | false |  |
| `--batch-max-staleness` | duration | Maximum length of time the oldest record in a batch can exist before the batch is flushed. May result in timeouts while waiting for the source. |  |  |
| `--commit-timeout` | duration | Maximum time before the commit process is cancelled. A commit informs the data source that the current batch of records has been ingested. May not function for the CSV ingest process. |  |  |
| `--skip-bad-rows` | int | Fail the ingest process if more than n rows cannot be processed. |  |  |

Logging & statistics flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--log-path` | string | Log file to write to. Empty means stderr. |  |  |
| `--pprof` | string | host:port on which to listen for the pprof Go package. | "localhost:6062" |  |
| `--stats` | string | host:port on which to host metrics. | "localhost:9093" |  |
| `--track-progress` |  | Periodically print status updates on how many records have been sourced. |  |  |
| `--verbose` |  | Enable verbose logging. |  |  |
| `--write-csv` | string | Write ingested data to the named CSV file. |  |  |

Testing flags

| Flag | Description | Default | Required |
|------|-------------|---------|----------|
| `--delete-index` | Delete the existing index specified by `--index` before starting ingest. USE WITH CAUTION. |  |  |
| `--dry-run` | Parse flags without starting an ingest process. |  |  |

Authentication token flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--auth-token` | string | Authentication token for FeatureBase. |  |  |

TLS authentication flags

| Flag | Data type | Description | Default | Required |
|------|-----------|-------------|---------|----------|
| `--tls.ca-certificate` | string | Path to the CA certificate file on the target FeatureBase instance, or literal PEM data. |  | Yes |
| `--tls.certificate` | string | Path to the certificate file on the target FeatureBase instance, or literal PEM data. |  | Yes |
| `--tls.enable-client-verification` |  | Enable verification of client certificates. |  | Yes |
| `--tls.key` | string | Path to the certificate key file on the target FeatureBase instance, or literal PEM data. |  | Yes |
| `--tls.skip-verify` |  | Disable verification of server certificates. Use for self-signed certificates. |  | Optional |

Additional information

Batch size

There is no one-size-fits-all batch-size because memory usage per record varies between workloads.

During ingestion processing, there is a fixed overhead:

  • from setting up an ingester transaction
  • for each row

Setting large batch-size values will:

  • average-out the overheads
  • proportionally increase memory usage
  • improve performance (in general terms)
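The amortization effect can be sketched with a simple cost model. The two cost constants below are hypothetical numbers chosen only to show the shape of the trade-off, not measurements of the actual ingester:

```python
# Illustrative model: per-record cost = (fixed setup cost) / batch_size + per-row cost.
SETUP_COST_MS = 50.0    # fixed overhead per ingester transaction (assumed)
PER_ROW_COST_MS = 0.01  # overhead per row (assumed)

def per_record_cost_ms(batch_size: int) -> float:
    """Average cost per record: fixed setup amortized over the batch, plus per-row cost."""
    return SETUP_COST_MS / batch_size + PER_ROW_COST_MS

for batch_size in (1_000, 100_000, 1_048_576):
    print(f"batch-size {batch_size:>9}: {per_record_cost_ms(batch_size):.4f} ms/record")
```

Larger batches drive the amortized setup cost toward zero, at the cost of holding proportionally more records in memory.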

For example:

| Workload includes | Batch size | Typical memory usage (MB) |
|-------------------|------------|---------------------------|
| High number of sparse keys | 20,000 | 100+ |
| High-frequency keys | 1,000,000+ |  |

Concurrency

The `--concurrency` ingest flag runs ingesters in parallel, which can:

  • improve utilization on multi-core systems
  • allow for redundancy

Alternatively, ingest processes can be launched individually on different environments.

SQL connection strings

| `--driver` | Example `--connection-string` | Additional information |
|------------|-------------------------------|------------------------|
| postgres | `'postgresql://postgres:password@localhost:5432/molecula?sslmode=disable'` or `'user=postgres password=password host=localhost port=5432 dbname=molecula sslmode=disable'` | Postgres connection strings |
| mysql | `'myusername:password@(10.0.0.1:3306)/mydb'` or `'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname'` | MySQL connection strings |
| mssql | `'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname'` | SQL Server connection strings |
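The two Postgres forms carry the same information. As a quick sketch, the URL form can be decomposed into the keyword/value form using only the Python standard library:

```python
from urllib.parse import urlparse, parse_qs

dsn = "postgresql://postgres:password@localhost:5432/molecula?sslmode=disable"
u = urlparse(dsn)
params = parse_qs(u.query)

# Rebuild the equivalent keyword/value connection string
keyword_form = (
    f"user={u.username} password={u.password} "
    f"host={u.hostname} port={u.port} "
    f"dbname={u.path.lstrip('/')} sslmode={params['sslmode'][0]}"
)
print(keyword_form)
# -> user=postgres password=password host=localhost port=5432 dbname=molecula sslmode=disable
```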

Missing value processing

Missing and empty string values are handled the same.

| Field data type | Expected behaviour |
|-----------------|--------------------|
| "ID" | Error if "ID" is selected for `--id-field`. Otherwise, do not update the value in the index. |
| "DateInt" | Raise an error during ingestion: a timestamp must have a valid value. |
| "Timestamp" | Raise an error during ingestion: input is not a time. |
| "RecordTime" | Do not update the value in the index. |
| "Int" | Do not update the value in the index. |
| "Decimal" | Do not update the value in the index. |
| "String" | Error if "String" is selected as a primary-key field. Otherwise, do not update the value in the index. |
| "Bool" | Do not update the value in the index. |
| "StringArray" | Do not update the value in the index. |
| "IDArray" | Do not update the value in the index. |
| "ForeignKey" | Do not update the value in the index. |
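A rough sketch of the behaviour in the table above for a few field types (illustrative only, not the ingester's actual code):

```python
def handle_missing(field_type: str, value):
    """Return None to mean 'do not update the value in the index'.

    Missing (None) and empty-string values are handled the same.
    Raises ValueError for field types that require a valid value.
    """
    if value not in (None, ""):
        return value  # a real value: ingest it unchanged
    if field_type in ("DateInt", "Timestamp"):
        raise ValueError(f"{field_type} fields must have a valid value")
    return None  # Int, Decimal, Bool, StringArray, ...: skip the update

print(handle_missing("Int", 42))  # -> 42
print(handle_missing("Int", ""))  # -> None
```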

Quoting values

Use double quotes "…" to enclose fields containing:

  • line breaks (CRLF)
  • commas
  • double quotes
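Python's `csv` module follows the same quoting rule; a quick demonstration that only fields containing commas, double quotes, or line breaks get enclosed in double quotes (and that an embedded double quote is doubled):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL: quote only when necessary
writer.writerow(["plain", "has,comma", 'has"quote', "line\r\nbreak"])
print(buf.getvalue())
# plain,"has,comma","has""quote","line
# break"
```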

Value Path Selection

The path argument is an array of JSON object keys which are applied in order.

For example:

| Source data | Path selection | Value selected |
|-------------|----------------|----------------|
| `{"a":{"b":{"c":1}}}` | `["a","b","c"]` | 1 |

Use `allow-missing-fields` to avoid path errors where source data is missing.
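The path selection described above can be sketched as applying each key in turn to a dict-shaped source record (a minimal illustration, not the ingester's actual implementation):

```python
def select_path(source: dict, path: list[str]):
    """Apply each key in `path` in order to drill into nested JSON objects."""
    value = source
    for key in path:
        value = value[key]  # raises KeyError when the path is missing
    return value

print(select_path({"a": {"b": {"c": 1}}}, ["a", "b", "c"]))  # -> 1
```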

Config options for data types

  • Use the config flag when changing flags from default values.

Examples

Ingest data from SQL server database

The SELECT statement queries the SQL Server table, then the ingester:

  • converts the data to Roaring Bitmap format
  • imports the records to the my_data FeatureBase index.
molecula-consumer-sql \
	--connection-string 'server=sqldb.myserver.com;userid=mysqlusername;password=secret;database=mydbname' \
	--featurebase-hosts 10.0.0.1:10101 \
	--batch-size 1000000 \
	--driver=mssql \
	--index=my_data \
	--id-field=id \
	--row-expr 'SELECT tableID as id__ID, zipcode as zipcode__String limit 10'

SQL ingest flags to import from assets table

SELECT queries the MySQL table, then the ingester:

  • converts the records to Roaring Bitmap format
  • imports the records to the asset_list FeatureBase index.
molecula-consumer-sql \
    --driver mysql \
    --connection-string 'username:password@(127.0.0.1:3306)/dbname' \
    --featurebase-hosts localhost:10101 \
    --batch-size 10000 \
    --index=asset_list \
    --primary-key-fields 'asset_tag' \
    --row-expr 'SELECT asset_tag as asset_tag__String, weight as weight__Int, warehouse as warehouse__String FROM assets'

Join Assets and Events tables into a single FeatureBase index

The SELECT statement queries the MySQL events table to:

  • return event data along with the weight of the related asset_tag
  • create locale field based on the first three characters from the events.pk field
  • join assets.asset_tag and events.asset_tag.

The ingester then:

  • converts the records to Roaring Bitmap format
  • imports the records to the events_plus_weight FeatureBase index.
molecula-consumer-sql \
    --driver mysql \
    --connection-string 'username:password@(127.0.0.1:3306)/dbname' \
    --featurebase-hosts localhost:10101 \
    --batch-size 10000 \
    --index=events_plus_weight \
    --primary-key-fields 'pk' \
    --row-expr 'SELECT events.pk as pk__String, events.asset_tag as asset_tag__String, assets.weight as weight__Int, SUBSTRING(events.pk, 1, 3) as locale__String FROM events INNER JOIN assets on assets.asset_tag = events.asset_tag'

Further information