Supported file formats

File formats supported by BEEM data lake ingestion.

BEEM's data lake ingestion supports several file formats for source data. The four formats below are recommended for most use cases; additional formats can be supported on request.

Compatibility matrix

Format    Full-load   Direct (full-load)   Incremental   Add/Remove column
CSV       Yes         Yes                  Yes           Yes (manual)
JSON      Yes         Yes                  Yes           Yes
Parquet   Yes         Yes                  Yes           Yes
Avro      Yes         Yes                  Yes

Format details

CSV

Standard delimited text. Two variants are supported: one where values are wrapped in quotes and one where they are not. Adding or removing a column is supported but requires a manual process.
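To make the two CSV variants concrete, the sketch below writes the same records once with every value quoted and once without quotes, using Python's standard csv module. The sample records and the helper names (to_csv, parse) are illustrative assumptions, not part of BEEM; the delimiter and quote character the ingestion expects are not stated here, so the defaults (comma and double quote) are assumed.

```python
import csv
import io

# Illustrative sample data (assumption): a header row plus one record.
records = [["id", "name", "city"], ["1", "Ada Lovelace", "London"]]


def to_csv(rows, quoting):
    """Serialize rows to CSV text using the given csv quoting mode."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=quoting, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def parse(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Variant 1: every value wrapped in double quotes.
quoted = to_csv(records, csv.QUOTE_ALL)

# Variant 2: no quotes (QUOTE_MINIMAL only quotes values that need it,
# and none of these sample values contain a delimiter or a quote).
unquoted = to_csv(records, csv.QUOTE_MINIMAL)

print(quoted)
print(unquoted)
```

Both variants parse back to the same rows, so the choice mainly matters when values can contain the delimiter, in which case the quoted variant is the safer one.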

JSON

JSON documents containing one outer array of records (e.g. [ {...}, {...} ]).
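As an illustration of the expected shape, the following sketch serializes a list of records as one outer JSON array and includes a small check that could be run on a file before handing it to the ingestion. The function names and sample records are hypothetical; this is a possible pre-flight check under those assumptions, not BEEM's own validation.

```python
import json

# Illustrative sample records (assumption).
records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]


def to_outer_array(rows):
    """Serialize records as a single JSON array: [ {...}, {...} ]."""
    return json.dumps(rows)


def is_outer_array_of_records(text):
    """Return True if text is one JSON array whose items are all objects."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # Not a single valid JSON document (e.g. one object per line).
        return False
    return isinstance(doc, list) and all(isinstance(r, dict) for r in doc)


array_text = to_outer_array(records)
line_delimited_text = '{"id": 1}\n{"id": 2}'

print(is_outer_array_of_records(array_text))
print(is_outer_array_of_records(line_delimited_text))
```

The second input is line-delimited JSON, which is not a single outer array and therefore does not match the shape described above.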

Parquet

Columnar binary format. Columns are matched by position, so renaming a column at the source is transparent to the ingestion.
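The effect of position-based matching can be sketched in plain Python, without any Parquet library. The column names, the sample row, and both mapping functions are hypothetical and only model the idea; they do not reflect how BEEM is implemented internally.

```python
# Hypothetical target schema, in column order.
target_columns = ["customer_id", "full_name", "signup_date"]

# Hypothetical source file where the first column was renamed.
renamed_source_columns = ["cust_id", "full_name", "signup_date"]
row = [42, "Ada Lovelace", "2024-01-15"]


def map_by_position(row, target_columns):
    """Pair the i-th source value with the i-th target column."""
    if len(row) != len(target_columns):
        raise ValueError("source and target have different column counts")
    return dict(zip(target_columns, row))


def map_by_name(source_columns, row, target_columns):
    """Look up each target column by its name in the source."""
    by_name = dict(zip(source_columns, row))
    # Raises KeyError when a target column name is absent from the source.
    return {col: by_name[col] for col in target_columns}


# Position-based mapping ignores source names, so the rename has no effect.
print(map_by_position(row, target_columns))

# Name-based mapping fails on the renamed column.
try:
    map_by_name(renamed_source_columns, row, target_columns)
except KeyError as missing:
    print("missing column:", missing)
```

The trade-off is that position-based matching tolerates renames but depends on column order staying stable, while name-based matching is the reverse.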

Avro

Binary row-based format with a self-describing schema embedded in the file header. Avro is name-based (column names in the file header must match the target schema), and the ingestion automatically sanitizes special characters in column names (e.g. $, /, spaces) for downstream compatibility.
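To give a rough idea of what column-name sanitization can look like, here is a minimal sketch. The exact rule BEEM applies is not documented in this section, so the function name, the set of allowed characters, and the underscore replacement are all assumptions made for illustration.

```python
import re


def sanitize_column_name(name: str, replacement: str = "_") -> str:
    """Replace every character that is not a letter, digit, or underscore.

    Assumption: the replacement character and the allowed character set
    are illustrative; the real ingestion may apply a different rule.
    """
    return re.sub(r"[^0-9A-Za-z_]", replacement, name)


# Hypothetical source column names containing $, /, and spaces.
for original in ["price$usd", "in/out", "first name"]:
    print(original, "->", sanitize_column_name(original))
```

Because sanitization changes names, a target schema for a name-based format would need to use the sanitized form of each column name rather than the raw source name.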

Notes