Extract data from an API and load it to Postgres using Python, Git and Docker (passing custom environment variables to the container)
Source
```yaml
id: postgres-s3-python-git
namespace: company.team

tasks:
  - id: wdir
    type: io.kestra.plugin.core.flow.WorkingDirectory
    tasks:
      - id: clone_repository
        type: io.kestra.plugin.git.Clone
        url: https://github.com/kestra-io/scripts
        branch: main

      - id: get_users
        type: io.kestra.plugin.scripts.python.Commands
        taskRunner:
          type: io.kestra.plugin.scripts.runner.docker.Docker
        containerImage: ghcr.io/kestra-io/pydata:latest
        warningOnStdErr: false
        commands:
          - python etl/get_users_from_api.py

      - id: save_users_pg
        type: io.kestra.plugin.scripts.python.Commands
        beforeCommands:
          - pip install pandas psycopg2 sqlalchemy > /dev/null
        warningOnStdErr: false
        commands:
          - python etl/save_users_pg.py
        env:
          DB_USERNAME: postgres
          DB_PASSWORD: "{{ secret('DB_PASSWORD') }}"
          DB_HOST: host.docker.internal
          DB_PORT: "5432"
```
About this blueprint
Tags: pip, Postgres, Python, Data, Git
This flow clones a Git repository containing two Python scripts:
- The first script extracts data from an API and stores the results in a file called `users.json` (see the extraction sketch below).
- The second script reads that raw data file and loads it into Postgres using Pandas (a sketch of the loading script follows the list of benefits).
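The snippet below is a minimal, hypothetical sketch of what the extraction script could look like; the actual `etl/get_users_from_api.py` in the kestra-io/scripts repository may differ, and the API URL shown here is an assumption used purely for illustration.

```python
# Hypothetical sketch of an extraction script such as etl/get_users_from_api.py.
# It calls a REST API and writes the raw response to users.json in the shared
# working directory, where the next task can pick it up.
import json

import requests

API_URL = "https://gorest.co.in/public/v2/users"  # assumed endpoint for illustration


def main() -> None:
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    users = response.json()

    with open("users.json", "w") as f:
        json.dump(users, f)

    print(f"Extracted {len(users)} users to users.json")


if __name__ == "__main__":
    main()
```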
The benefits of this approach:
- your orchestration logic (YAML) is decoupled from your business logic (here: Python), so different tasks can be maintained by different team members without stepping on each other's toes
- you can update your business logic without touching any of your orchestration code (unless you want to, e.g. to modify the Docker image or pip package versions)
- each task can run in a separate Docker container, which makes dependency management easier; for instance, the second task needs additional libraries for Postgres and Pandas, and those can be installed at runtime using the `beforeCommands` property.
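To illustrate how the environment variables passed to the container and the libraries installed via `beforeCommands` come together, here is a hypothetical sketch of the loading script; the real `etl/save_users_pg.py` may differ, and the target table name and `DB_NAME` fallback below are assumptions.

```python
# Hypothetical sketch of a loading script such as etl/save_users_pg.py.
# It reads the connection details from the environment variables defined in the
# flow (DB_USERNAME, DB_PASSWORD, DB_HOST, DB_PORT) and loads users.json into
# Postgres with Pandas and SQLAlchemy (installed at runtime via beforeCommands).
import os

import pandas as pd
from sqlalchemy import create_engine


def main() -> None:
    username = os.environ["DB_USERNAME"]
    password = os.environ["DB_PASSWORD"]
    host = os.environ["DB_HOST"]
    port = os.environ["DB_PORT"]
    database = os.getenv("DB_NAME", "postgres")  # assumed default database name

    engine = create_engine(
        f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}"
    )

    df = pd.read_json("users.json")
    df.to_sql("users", engine, if_exists="replace", index=False)  # assumed table name
    print(f"Loaded {len(df)} rows into the users table")


if __name__ == "__main__":
    main()
```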