CSV ingest flags reference
Once your CSV file(s) are constructed, they can be ingested into FeatureBase using the ./molecula-consumer-csv ingester.
Syntax
molecula-consumer-csv \
<common-flags> \
<csv-flags> \
<id-flags> \
<error-flags> \
<log-stat-flags> \
<testing-flags> \
<auth-token-flags> \
<tls-authentication-flags>
Common flags
Flag | Data type | Description | Default | Required | Additional |
---|---|---|---|---|---|
--batch-size | int | Number of records to read before indexing them as a batch. Recommended: 1,048,576 | 1 | | A larger value gives better throughput at the cost of higher memory usage. |
--concurrency | int | Number of concurrent sources and indexing routines to launch. | 1 | When ingesting multiple CSV files | Does not support SQL ingestion or --auto-generate |
--featurebase-hosts | string | Supply FeatureBase default bind points as a comma-separated list of host:port pairs. | [localhost:10101] | | |
--index | string | Name of target FeatureBase index. | | Yes | |
--string-array-separator | string | Character used to delineate values in a string array. | , | | |
--use-shard-transactional-endpoint | | Use alternate import endpoint that ingests data for all fields in a shard in a single atomic request. | | | Recommended. Has a negative impact on performance but provides better consistency. |
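For example, a minimal sketch that combines these common flags with the file and ID flags documented below might look like this (the index, field, and file names are hypothetical):
# index, field, and file names below are hypothetical
./molecula-consumer-csv \
--featurebase-hosts=localhost:10101 \
--index=users \
--batch-size=1048576 \
--use-shard-transactional-endpoint \
--primary-key-fields=username \
--files=users.csv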
CSV ingest flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--files | string | List of files, URLs, or directories to ingest. | [] | Yes |
--header | string | Defined as {source_column_name}[__data_type[_constraint-value...]],... | [] | If data_type and constraint-value are not defined in the data file. |
--ignore-header | | Ignore the header in the file and use the --header flag to define column names and data types. | | When using the --header flag |
Generate ID flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--auto-generate | | Automatically generate IDs. Used for testing purposes. Cannot be used with --concurrency. | | When --id-field or --primary-key-fields are not defined |
--external-generate | | Allocate _id using the FeatureBase ID allocator. Supports --offset-mode. Requires --auto-generate. | | |
--id-alloc-key-prefix | string | Prefix for ID allocator keys when using --external-generate. Requires a different value for each concurrent ingester. | ingest | |
--id-field | string | A sequence of positive integers that uniquely identifies each record. Use instead of --primary-key-fields. | | If --auto-generate or --primary-key-fields are not defined |
--primary-key-fields | string | Convert the named fields to strings for use as the unique _id. A single field is not added to the target as a field; multiple fields are concatenated using / and added to the target. Use instead of --id-field. | [] | If --auto-generate or --id-field are not defined |
--offset-mode | | Set offset-mode based autogenerated IDs. Requires --auto-generate and --external-generate. | | When ingesting from an offset-based data source |
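As a sketch (the field, index, and file names are hypothetical), a source whose rows already carry a positive-integer identifier column can use --id-field rather than --primary-key-fields:
# user_id, users, and users.csv are hypothetical names
./molecula-consumer-csv \
--id-field=user_id \
--index=users \
--files=users.csv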
Error handling flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--allow-decimal-out-of-range | | Allow ingest to continue when it encounters out-of-range decimals in Decimal fields. | false | |
--allow-int-out-of-range | | Allow ingest to continue when it encounters out-of-range integers in Int fields. | false | |
--allow-timestamp-out-of-range | | Allow ingest to continue when it encounters out-of-range timestamps in Timestamp fields. | false | |
--batch-max-staleness | duration | Maximum length of time the oldest record in a batch can exist before the batch is flushed. This may result in timeouts while waiting for the source. | | |
--commit-timeout | duration | A commit is the process of informing the data source that the current batch of records has been ingested. --commit-timeout is the maximum time before the commit process is cancelled. May not function for the CSV ingest process. | | |
--skip-bad-rows | int | Fail the ingest process if n rows are not processed. | | |
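For example, the following sketch (index, field, and file names are hypothetical) allows out-of-range integers to pass and sets a bad-row threshold via --skip-bad-rows:
# index, field, and file names below are hypothetical
./molecula-consumer-csv \
--allow-int-out-of-range \
--skip-bad-rows=100 \
--primary-key-fields=username \
--index=users \
--files=users.csv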
Logging & statistics flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--log-path | string | Log file to write to. Empty means stderr. | | |
--pprof | string | host:port on which to listen for the Go pprof package. | localhost:6062 | |
--stats | string | host:port on which to host metrics. | localhost:9093 | |
--track-progress | | Periodically print status updates on how many records have been sourced. | | |
--verbose | | Enable verbose logging. | | |
--write-csv | string | Write ingested data to the named CSV file. | | |
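For example, this sketch (the log path, index, field, and file names are hypothetical) writes verbose logs to a file and periodically reports progress:
# log path, index, field, and file names below are hypothetical
./molecula-consumer-csv \
--verbose \
--track-progress \
--log-path=/var/log/csv-ingest.log \
--primary-key-fields=username \
--index=users \
--files=users.csv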
Testing flags
Flag | Description | Default | Required |
---|---|---|---|
--delete-index | Delete an existing index specified by --index before starting ingest. USE WITH CAUTION. | | |
--dry-run | Parse flags without starting an ingest process. | | |
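For example, a dry run can be used to validate a flag combination before a real ingest (index, field, and file names are hypothetical):
# index, field, and file names below are hypothetical
./molecula-consumer-csv \
--dry-run \
--primary-key-fields=username \
--index=users \
--files=users.csv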
Authentication token flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--auth-token | string | Authentication token for FeatureBase. | | |
TLS authentication flags
Flag | Data type | Description | Default | Required |
---|---|---|---|---|
--tls.ca-certificate | string | Path to CA certificate file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.certificate | string | Path to certificate file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.enable-client-verification | | Enable verification of client certificates. | | Yes |
--tls.key | string | Path to certificate key file on the target FeatureBase instance, or literal PEM data. | | Yes |
--tls.skip-verify | | Disable verification of server certificates. Use for self-signed certificates. | | Optional |
TLS connections require an appropriate protocol set with --featurebase-hosts (e.g., https://featurebase0.local:10101).
Additional information
batch-size additional
There is no default batch-size because memory usage per record varies between workloads.
During ingestion processing, there is a fixed overhead:
- from setting up an ingester transaction
- for each row
Setting large batch-size values will:
- average out the overheads
- proportionally increase memory usage
- improve performance (in general terms)
For example:
Workload includes | Batch size | Typical memory usage (MB) |
---|---|---|
High number of sparse keys | 20,000 | 100+ |
High-frequency keys | 1,000,000+ | |
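As an illustration of the table above (index, field, and file names are hypothetical), a workload dominated by high-frequency keys might use a much larger batch:
# index, field, and file names below are hypothetical
./molecula-consumer-csv \
--batch-size=1048576 \
--primary-key-fields=event_id \
--index=events \
--files=events.csv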
concurrency additional
The concurrency ingest flag is used to run ingesters in parallel, which can:
- improve utilization on multi-core systems
- allow for redundancy
Alternatively, ingest processes can be launched individually on different environments.
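For example, the following sketch (index, field, and file names are hypothetical) launches four concurrent sources and indexing routines over four CSV files:
# index, field, and file names below are hypothetical
./molecula-consumer-csv \
--concurrency=4 \
--primary-key-fields=username \
--index=users \
--files=users-0.csv,users-1.csv,users-2.csv,users-3.csv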
Using environment variables to set flags
All command line flags can be set via environment variables by:
- removing leading dashes
- adding CONSUMER_ as a prefix
- writing the flag in UPPER CASE
- converting dashes or dots to underscores
For example:
Original flag | Equivalent for use with environment variables |
---|---|
--tls.ca-certificate | CONSUMER_TLS_CA_CERTIFICATE |
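As a sketch, applying the same rules to --batch-size and --index gives the following environment-variable equivalents (the index, field, and file names are hypothetical):
# CONSUMER_BATCH_SIZE and CONSUMER_INDEX follow the naming rules above; other names are hypothetical
export CONSUMER_BATCH_SIZE=1048576
export CONSUMER_INDEX=users
./molecula-consumer-csv \
--primary-key-fields=username \
--files=users.csv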
Missing value processing
Missing and empty string values are handled the same.
Field data type | Expected behaviour |
---|---|
"ID" | Error if "ID" selected for id-field. Otherwise, do not update value in index. |
"DateInt" | Raise error during ingestion - timestamp must have a valid value. |
"Timestamp" | Raise error during ingestion - input is not time. |
"RecordTime" | Do not update value in index. |
"Int" | Do not update value in index. |
"Decimal" | Do not update value in index. |
"String" | Error if "String" select for primary-key field. Otherwise do not update value in index. |
"Bool" | Do not update value in index. |
"StringArray" | Do not update value in index. |
"IDArray" | Do not update value in index. |
"ForeignKey" | Do not update value in index. |
Quoting values
Use double quotes "..." to enclose fields containing:
- Line breaks (CRLF)
- Commas
- Double quotes
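As an illustration (the file, column names, and values are hypothetical), the second field in the data row below is quoted because it contains commas, and the embedded double quotes are escaped by doubling them, per standard CSV convention:
# quoted-example.csv and its contents are hypothetical
cat > quoted-example.csv <<'EOF'
username,comment
user1,"Likes ""fresh"" apples, oranges, and pears"
EOF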
Value Path Selection
The path argument is an array of JSON object keys which are applied in order.
For example:
Source data | Path selection | Value selected |
---|---|---|
{"a":{"b":{"c":1}}} | ["a","b","c"] | 1 |
Use allow-missing-fields to avoid path errors where source data is missing.
config options for data types
- Use the config flag when changing flags from default values.
Examples
CSV ingest tool flags for header-defined.csv
The required header is defined in the source file.
./molecula-consumer-csv \
--batch-size=10000 \
--primary-key-fields=username \
--index=users \
--files=header-defined.csv
Connect securely over TLS and define header flags
Use this method to:
- ignore CSV headers and define them at the command line
- define FeatureBase server TLS certificates to securely connect to a remote server.
./molecula-consumer-csv \
--featurebase-hosts=https://localhost:10101 \
--tls.certificate=featurebase.local.crt \
--tls.key=featurebase.local.key \
--tls.skip-verify \
--batch-size=10000 \
--auto-generate \
--header=asset_tag__String,fan_time__RecordTime_2006-01-02,fan_val__String_F_YMD \
--ignore-header \
--index=csv-ingest-tls \
--files=header-defined.csv,header-undefined.csv