Changelog#

0.7.1 (2024-03-11)#

  • Fix bug when Reading DECIMAL(precision, scale) columns to pandas task (precision was interpreted like for Float where precision <= 24 leads to float32). Beware that isinstance(sa.Float(), sa.Numeric) == True.

0.7.0 (2024-03-10)#

  • Rework TableReference support:

    • Rename TableReference to ExternalTableReference

    • Add support for ExternalTableReference to point to tables in external (i.e. not managed by pipedag) schemas.

    • Remove support for ExternalTableReference that points to table in schema of current stage. I.e. ExternalTableReference can only point to tables in external schemas.

  • Support code based configuration (see create_basic_pipedag_config() in README.md example without config file and without docker-compose)

  • Added NoBlobStore in case you don’t want to provide a directory that is created or needs to exist

  • Fix polars import in pyproject.toml when using OS X with rosetta2

  • Bug fix ibm_db2 backend:

    • input tables for SQL queries were not locked

0.6.10 (2024-02-29)#

  • Fix bug where a Task that was declared lazy but provided a Table without a query string would always be cache valid.

  • Improved documentation

0.6.9 (2024-01-24)#

  • Update dependencies and remove some upper boundaries

  • Polars dependency moved to >= 0.18.12 due to incompatible interface change

  • Workaround for duckdb issue: https://github.com/duckdb/duckdb/issues/10322

  • Workaround for prefect needing pytz dependency without declaring it on pypi

0.6.8 (2023-12-15)#

  • Bug fix ibm_db2 backend:

    • unspecified materialization_details was failing to load configuration

  • Bug fixes for mssql backend:

    • SELECT-INTO was invalid for keyword suffix labels: i.e. SELECT 1 as prefix_FROM

    • Raw SQL statements changing database link of connection via USE was causing pipedag generated commands to fail

0.6.7 (2023-12-05)#

  • Make separator customizable when splitting RawSql into statements.

  • Add DropNickname for DB2 and drop nicknames when dropping schemas.

  • Add debug function materialize_table.

  • Update install instructions and dependencies to enable DB2 and mssql development on OS X with an arm64 architecture.

  • Update PR template

  • Run RUNSTATS on every DB2 table after creation

  • Add materialization_details as an option to IBMDB2TableStore. For now DB2 compression, DB2 table spaces are supported and Postgres unlogged tables are supported.

    • For Postgres unlogged tables this is a breaking change. The unlogged_tables option does not exist anymore. Instead, use materialization_details: __any__: unlogged: true.

Workaround for known Problems:

  • add materialization_details in configuration when using ibm_db2 database connection

0.6.6 (2023-08-17)#

  • Implement support for loading polars dataframes from DuckDB.

  • Accelerate storing of dataframes (pandas and polars) to DuckDB (10-100x speedup).

  • Fix TypeError being raised when using pydiverse transform SQLTableImpl together with a local table cache.

0.6.5 (2023-08-16)#

  • Implemented automatic versioning of tasks by setting task version to AUTO_VERSION. This feature is currently only supported by Polars LazyFrame and by Pandas.

  • Added kroki_url config option.

0.6.4 (2023-08-07)#

  • Allow invocation of undecorated task functions when calling task object outside of flow definition context.

  • Rename ignore_fresh_input to ignore_cache_function.

  • Fix race condition leading to JSONDecodeError in ParquetTableCache when setting store_input: true together with the DaskEngine.

  • Fix running subset of tasks not working due to tables and blobs being retrieved from wrong schema.

0.6.3 (2023-07-25)#

  • Fix crash during config initialization when using DatabaseLockManager together with PostgreSQL.

0.6.2 (2023-07-23)#

  • Switch back to using numpy nullable dtypes for Pandas as default.

  • Ensure that indices get created in same schema as corresponding table (IBM Db2).

  • Fix private method SQLTableStore.get_stage_hash.

0.6.1 (2023-07-19)#

  • Create initial documentation for pipedag.

  • Remove stage argument from RawSql initializer.

  • Add RawSql to public API.

  • Fix PrefectTwoEngine failing on retrieval of results.

  • Added Flow.get_stage(), and Stage.get_task() methods.

  • Added MaterializingTask.get_output_from_store() method to allow retrieval of task output without running the Flow.

  • Created TableReference to simplify complex table loading operations.

  • Allow for easy retrieval of tables generated by RawSql. Passing a RawSql object into a task results in all tables that were generated by the RawSql to be dematerialized. The tables can then be accessed using raw_sql["table_name"]. Alternatively, the same syntax can also be used during flow definition to only pass in a specific table.

  • Fix private method SQLTableStore.get_stage_hash not working for IBM DB2.

0.6.0 (2023-07-07)#

  • Added delete-schemas command to pipedag-manage to help with cleaning up database

  • Remove all support for mssql database swapping. Instead, we now properly support schema swapping.

  • Fix UNLOGGED tables not working with Postgres.

  • Added hook_args section to table_store part of config file to support passing config arguments to table hooks.

  • Added dtype_backend hook argument for PandasTableHook to overriding the default pandas dtype backend to use.

  • Update raw sql metadata table (SQLTableStore).

  • Remove engine_dispatch and replace with SQLTableStore subclasses.

  • Moved local table cache from pydiverse.pipedag.backend.table_cache to pydiverse.pipedag.backend.table.cache namespace.

  • Changed order in which flow / instance config gets resolved.

0.5.0 (2023-06-28)#

  • add support for DuckDB

  • add support for pyarrow backed pandas dataframes

  • support execution of subflow

  • store final state of task in flow result object

  • tasks now have a position_hash associated with them to identify them purely based on their position (e.g. stage, name and input wiring) inside a flow.

  • breaking change to metadata: added position_hash to tasks metadata table and change type of hash columns from String(32) to String(20).

  • Flow, Subflow, and Result objects now provide additional options for visualizing them

  • added unlogged_tables flag to SQLTableStore for creating UNLOGGED tables with Postgres.

  • created pipedag-manage command line utility with clear-metadata command to help with migrating between different pipedag metadata versions.

0.4.1 (2023-06-17)#

  • implement DaskEngine: orchestration engine for running multiple tasks in parallel

  • implement DatabaseLockManager: lock manager based on locking mechanism provided by database

0.4.0 (2023-06-08)#

  • update public interface

  • encrypt IPC communication

  • remove preemptive os.makedirs from ParquetTableCache

  • improve logging and provide structlog utilities

0.3.0 (2023-05-25)#

  • breaking change to pipedag.yaml: introduced args subsections for arguments that are passed to backend classes

  • fix ibm_db_sa bug when copying dataframes from cache: uppercase table names by default

  • nicer readable SQL queries: use automatic aliases for inputs of SQLAlchemy tasks

  • implement option ignore_task_version: disable eager task caching for some instances to reduce overhead from task version bumping

  • implement local table cache: store input/output of dataframe tasks in parquet files and allow using it as cache to avoid rereading from database

0.2.4 (2023-05-05)#

  • fix errors by increasing output_json length in metadata table

  • fix cache invalidation: query normalization before checking for changes

  • add rudimentary support for ibis tasks (postgres + mssql)

  • add rudimentary support for polars + tidypolars tasks

  • implemented pandas type mapping to avoid row wise type checks of object columns

  • support pandas 2.0 (no arrow features used that)

  • support sqlalchemy 2.0 (except for with polars)

0.2.3 (2023-04-17)#

  • fixed python 3.9 compatibility (traceback.format_exception syntax changed)

  • fixed deferred table copy when task is invalid (introduced with 0.2.2)

  • fixed mssql to not reflect full schema while renamings happen

  • fixed clearing of metadata tables for lazy tables and raw sql tables

  • fixed mssql synonym resolution when reading input table for pandas task

  • initial implementation of issue #62: make query canonical before hashing

  • retry some DB calls in case they are aborted as deadlock victim

0.2.2 (2023-03-31)#

  • added option avoid_drop_create_schema to table store configuration

  • improve performance when working with IBM DB2 dialect (i.e. table locking)

  • prevent table copying and schema swapping for 100% cache valid stages

0.2.1 (2023-01-15)#

  • removed contextvars dependency (not needed for python >= 3.7 and broke conda-forge build)

0.2.0 (2023-01-14)#

  • SQLTableStore: support for Microsoft SQL Server and IBM DB2 (Linux) database connection strings

  • Support primary keys and indexes (can be configured with Table object and used in custom RawSql code)

  • RawSql: support additional return type for @materialize tasks which allows to emit raw SQL string including multiple create statements (currently, views/functions/procedures are only supported for dialect mssql). This feature should only be used for getting legacy code running in pipedag before converting it to programatically generated or manual SELECT statements.

  • Support pytsql library for executing raw SQL scripts with dialect=mssql (i.e. supports PRINT capture)

  • Manual Cache Invalidation for source nodes: @materialize(cache=f) parameter can take an arbitrary function that gets the same arguments as the task function and returns a hash. If the hash is different from for the previous run, the task is considered cache invalid.

  • New configuration file format pipedag.yaml can be used to configure multiple pipedag instances: see docs/reference/config.rst

0.1.0 (2022-09-01)#

Initial release.

  • @materialize annotations

  • flow definition with nestable stages

  • zookeeper synchronization

  • postgres database backend

  • Prefect 1.x and 2.x support

  • multi-processing/multi-node support for state exchange between @materialize tasks

  • support materialization for: pandas, sqlalchemy, raw sql text, pydiverse.transform