Supported file formats

BEEM's data lake ingestion supports several file formats for source data. The four formats below are recommended for most use cases; additional formats can be supported on request.

Compatibility matrix

Format  | Full-load | Direct (full-load) | Incremental | Add/Remove column
CSV     | Yes       | Yes                | Yes         | Yes (manual)
JSON    | Yes       | Yes                | Yes         | Yes
Parquet | Yes       | Yes                | Yes         | Yes
Avro    | Yes       | Yes                | Yes         |

Format details

CSV

Standard delimited text. Two variants are supported: values wrapped in quotes and values left unquoted. Adding or removing a column is supported but requires a manual process.
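
To illustrate the two variants, here is a minimal Python sketch that reads the same records in quoted and unquoted form. The comma delimiter, the double-quote character, and the sample data are assumptions for illustration; the exact delimiter and quoting rules BEEM expects are not stated here.

```python
import csv
import io

# Hypothetical sample data (not from BEEM): two records in each variant.
unquoted = "id,name\n1,Alice\n2,Bob\n"
quoted = '"id","name"\n"1","Alice"\n"2","Smith, Bob"\n'


def read_rows(text):
    """Parse delimited text into a list of rows.

    csv.reader uses '"' as its default quote character, so it accepts
    both quoted and unquoted values.
    """
    return list(csv.reader(io.StringIO(text)))


unquoted_rows = read_rows(unquoted)
quoted_rows = read_rows(quoted)
```

Quoting matters mainly when a value contains the delimiter itself, as in "Smith, Bob" above: without quotes that value would be split into two fields.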

JSON

JSON documents containing one outer array of records (e.g. [ {...}, {...} ]).
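
As a sketch of what this shape means in practice, the Python snippet below checks that a document is one outer array whose elements are objects before it is handed to ingestion. The function name load_records and the sample data are hypothetical, and treating each record as a JSON object is an assumption based on the [ {...}, {...} ] example.

```python
import json


def load_records(text):
    """Parse text and check that it is a single outer array of objects.

    Hypothetical helper for illustration; not part of BEEM.
    """
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected one outer JSON array")
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            raise ValueError(f"element {i} is not a JSON object")
    return data


records = load_records('[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]')
```

A file holding a single top-level object, or one object per line, would be rejected by this check.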

Parquet

Columnar binary format. Columns are matched by position rather than by name, so renaming a column at the source does not affect ingestion.
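
The effect of positional matching can be sketched in plain Python, without a Parquet library. The column names, the helper to_target_row, and the sample values are all hypothetical; this only illustrates the idea and is not how BEEM implements it.

```python
# Hypothetical target schema, in column order.
target_columns = ["customer_id", "full_name"]


def to_target_row(values, target_columns):
    """Assign source values to target columns purely by position."""
    if len(values) != len(target_columns):
        raise ValueError("column count does not match the target schema")
    return dict(zip(target_columns, values))


# The same record before and after the first source column is renamed.
before = {"columns": ["customer_id", "full_name"], "row": [1, "Alice"]}
after = {"columns": ["cust_id", "full_name"], "row": [1, "Alice"]}

# Source column names are never consulted, so both produce the same result.
row_before = to_target_row(before["row"], target_columns)
row_after = to_target_row(after["row"], target_columns)
```

The flip side of this design is that column order matters: reordering columns at the source would change which target column each value lands in.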

Avro

Binary row-based format with a self-describing schema embedded in the file header. Avro is name-based (column names in the file header must match the target schema), and the ingestion automatically sanitizes special characters in column names (e.g. $, /, spaces) for downstream compatibility.
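
For a sense of what sanitizing might look like, here is a minimal Python sketch. The replacement rule (every character other than a letter, digit, or underscore becomes an underscore) and the function name sanitize_column_name are assumptions; the rule the ingestion actually applies is not documented here and may differ.

```python
import re


def sanitize_column_name(name):
    """Replace special characters in a column name.

    Assumed rule for illustration only: anything that is not a letter,
    digit, or underscore is replaced with "_".
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


# Hypothetical column names containing $, / and a space.
sanitized = [sanitize_column_name(n) for n in ["price$", "in/out", "first name"]]
```

Because Avro is name-based, the target schema would need to use the sanitized names rather than the original ones.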

Notes