IonToParquet
type: "io.kestra.plugin.serdes.parquet.IonToParquet"
Read a provided file containing Ion-serialized data and convert it to Parquet.
Examples
Read a CSV file, transform it and store the transformed data as a parquet file.
id: ion_to_parquet
namespace: company.team
tasks:
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    description: salaries of data professionals from 2020 to 2023 (source ai-jobs.net)
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/salaries.csv
  - id: avg_salary_by_job_title
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.download_csv.uri }}"
    sql: |
      SELECT
        job_title,
        ROUND(AVG(salary),2) AS avg_salary
      FROM read_csv_auto('{{ workingDir }}/data.csv', header=True)
      GROUP BY job_title
      HAVING COUNT(job_title) > 10
      ORDER BY avg_salary DESC;
    store: true
  - id: result
    type: io.kestra.plugin.serdes.parquet.IonToParquet
    from: "{{ outputs.avg_salary_by_job_title.uri }}"
    schema: |
      {
        "type": "record",
        "name": "Salary",
        "namespace": "com.example.salary",
        "fields": [
          {"name": "job_title", "type": "string"},
          {"name": "avg_salary", "type": "double"}
        ]
      }
Properties
from
- Type: string
- Dynamic: ✔️
- Required: ✔️
Source file URI
schema
- Type: string
- Dynamic: ✔️
- Required: ✔️
The Avro schema associated with the data
compressionCodec
- Type: string
- Dynamic: ❌
- Required: ❌
- Default:
GZIP
- Possible Values:
UNCOMPRESSED
SNAPPY
GZIP
ZSTD
The compression codec to use
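As an illustration, a task fragment that switches the codec from the default GZIP to ZSTD might look like the following. The task id `to_parquet`, the upstream task `previous_task`, and the one-field schema are hypothetical placeholders.

```yaml
# Sketch only: `to_parquet` and `previous_task` are hypothetical names.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  compressionCodec: ZSTD  # one of UNCOMPRESSED, SNAPPY, GZIP, ZSTD
  schema: |
    {
      "type": "record",
      "name": "Row",
      "fields": [
        {"name": "job_title", "type": "string"}
      ]
    }
```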
dateFormat
- Type: string
- Dynamic: ✔️
- Required: ❌
- Default:
yyyy-MM-dd[XXX]
Format to use when parsing date
datetimeFormat
- Type: string
- Dynamic: ✔️
- Required: ❌
- Default:
yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]
Format to use when parsing datetime
Default value is yyyy-MM-dd'T'HH:mm[:ss][.SSSSSS][XXX]
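A hedged sketch of overriding both formats follows. It assumes the source stores dates and datetimes as strings such as `31/12/2023` and `31/12/2023 18:30:00`, and that the patterns follow Java `DateTimeFormatter` syntax, as the bracketed optional sections in the defaults suggest. The task ids and field names are hypothetical.

```yaml
# Sketch only: ids and field names are hypothetical.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  dateFormat: "dd/MM/yyyy"               # replaces yyyy-MM-dd[XXX]
  datetimeFormat: "dd/MM/yyyy HH:mm:ss"  # replaces the ISO-like default
  schema: |
    {
      "type": "record",
      "name": "Event",
      "fields": [
        {"name": "day", "type": {"type": "int", "logicalType": "date"}},
        {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }
```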
decimalSeparator
- Type: string
- Dynamic: ✔️
- Required: ❌
- Default:
.
Character to recognize as decimal point (e.g. use ‘,’ for European data).
Default value is '.'
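For example, if amounts arrive as text like `1234,56`, a fragment along these lines could be used. This is a sketch under that assumption; the task ids and the `amount` field are hypothetical.

```yaml
# Sketch only: assumes numeric values are serialized with a comma decimal point.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  decimalSeparator: ","
  schema: |
    {
      "type": "record",
      "name": "Invoice",
      "fields": [
        {"name": "amount", "type": "double"}
      ]
    }
```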
dictionaryPageSize
- Type: integer
- Dynamic: ❌
- Required: ❌
- Default:
1048576
Max dictionary page size
falseValues
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default:
[ "f", "false", "disabled", "0", "off", "no", "" ]
Values to consider as False
inferAllFields
- Type: boolean
- Dynamic: ❌
- Required: ❌
- Default:
false
Try to infer all fields
If true, we try to infer all fields with trueValues, falseValues & nullValues. If false, we will infer bool & null only on fields declared in the schema as null and bool.
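The interaction between inference and the value lists can be sketched as follows. The custom markers `Y`, `N`, and `unknown`, the task ids, and the `active` field are hypothetical. With inferAllFields left at false, only fields declared as boolean or null in the schema would be matched against these lists.

```yaml
# Sketch only: custom markers and field names are hypothetical.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  inferAllFields: false          # match only fields declared as boolean/null
  trueValues: ["Y", "yes"]
  falseValues: ["N", "no"]
  nullValues: ["", "unknown"]
  schema: |
    {
      "type": "record",
      "name": "Account",
      "fields": [
        {"name": "active", "type": ["null", "boolean"]}
      ]
    }
```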
nullValues
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default:
[ "", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "1.#IND", "1.#QNAN", "NA", "n/a", "nan", "null" ]
Values to consider as null
pageSize
- Type: integer
- Dynamic: ❌
- Required: ❌
- Default:
1048576
Target page size
rowGroupSize
- Type: integer
- Dynamic: ❌
- Required: ❌
- Default:
134217728
Target row group size
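The three size settings (dictionaryPageSize, pageSize, rowGroupSize) can be set together. The values below are illustrative rather than recommendations, and they assume the sizes are expressed in bytes, which is the usual Parquet convention but is not stated on this page (the default 134217728 would then be 128 MiB). Task ids and schema are hypothetical.

```yaml
# Sketch only: values are illustrative; units assumed to be bytes.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  rowGroupSize: 67108864        # 64 MiB instead of the 128 MiB default
  pageSize: 1048576             # unchanged default, shown for clarity
  dictionaryPageSize: 1048576   # unchanged default, shown for clarity
  schema: |
    {
      "type": "record",
      "name": "Row",
      "fields": [
        {"name": "job_title", "type": "string"}
      ]
    }
```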
strictSchema
- Type: boolean
- Dynamic: ❌
- Required: ❌
- Default:
false
Whether to consider a field present in the data but not declared in the schema as an error
Default value is false
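A minimal sketch of enabling the check: with the schema below, a record carrying an extra field such as `avg_salary` would be treated as an error rather than being accepted. The task ids are hypothetical.

```yaml
# Sketch only: fail on fields that are present in the data but not in the schema.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  strictSchema: true
  schema: |
    {
      "type": "record",
      "name": "Salary",
      "fields": [
        {"name": "job_title", "type": "string"}
      ]
    }
```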
timeFormat
- Type: string
- Dynamic: ✔️
- Required: ❌
- Default:
HH:mm[:ss][.SSSSSS][XXX]
Format to use when parsing time
timeZoneId
- Type: string
- Dynamic: ❌
- Required: ❌
- Default:
Etc/UTC
Timezone to use when no timezone can be parsed on the source.
If null, the timezone will be UTC, which matches the listed default of Etc/UTC.
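When source timestamps carry no offset, a zone can be supplied explicitly. The sketch below assumes local times recorded in Paris; the task ids and the `created_at` field are hypothetical.

```yaml
# Sketch only: applies to values where no timezone can be parsed.
- id: to_parquet
  type: io.kestra.plugin.serdes.parquet.IonToParquet
  from: "{{ outputs.previous_task.uri }}"
  timeZoneId: Europe/Paris
  schema: |
    {
      "type": "record",
      "name": "Event",
      "fields": [
        {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }
```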
trueValues
- Type: array
- SubType: string
- Dynamic: ✔️
- Required: ❌
- Default:
[ "t", "true", "enabled", "1", "on", "yes" ]
Values to consider as True
version
- Type: string
- Dynamic: ❌
- Required: ❌
- Default:
V2
- Possible Values:
V1
V2
Parquet format version to write
Outputs
uri
- Type: string
- Required: ❌
- Format:
uri
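A downstream task can reference the generated file through this output. The sketch below assumes the conversion task is named `result`, as in the example flow, and uses a hypothetical `log_uri` task that simply logs the URI.

```yaml
# Sketch only: `log_uri` is a hypothetical task placed after the `result` task.
- id: log_uri
  type: io.kestra.plugin.core.log.Log
  message: "Parquet file stored at {{ outputs.result.uri }}"
```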