Changelog
0.12.13 (2026-04-16)
Fix #339: fixed bug when using schema_prefix with ParquetTableStore and separate metadata store
0.12.12 (2026-02-16)
Fix dataframe upload fallbacks for MSSQL and Snowflake backends
Fix crash when prod schema could not be moved to transaction schema (e.g. due to broken views)
Fix: improved google cloud storage (GCS) support
0.12.11 (2026-01-23)
Feat: simplify accessing tables in ParquetTableStore with polars
Fix: Fix mssqlkit bulk upload
Speed up pandas/polars transfer to/from duckdb for SqlTableStore
0.12.10 (2026-01-20)
Feat: Add ParquetTableStore.sync_metadata() function to sync local duckdb file with metadata store without running flow
Move schema creation inside ParquetTableStore.metadata_sync_views()
0.12.9 (2026-01-17)
Fix: DatabaseLockManager uses correct schema_prefix in case of metadata_table_store
0.12.8 (2026-01-12)
Fix #315: DuplicateNameError in case a task returns several ExternalTableReference objects with the same table name
Implement #316: Support top level import pydiverse.pipedag.CacheValidationMode
Fix: pydiverse.transform repr implementation could make exiting a with TaskContext(): block expensive
Fix: metadata_table_store uses correct schema_prefix for pipedag_metadata tables
0.12.7 (2025-12-15)
Fix: cache invalidation works more reliably for pydiverse.transform and ibis
Fix: be more fault-tolerant in duckdb view parsing (ParquetTableStore)
0.12.6 (2025-12-10)
Feat: Automatically check cache-validity of polars and pandas DataFrame tasks marked as lazy
Fix: lazy=True tasks in 100% cache valid stage can still rename outputs. This did not work for stage_commit_technique=READ_VIEWS.
Fix: Fix hang when using many imperative materializations in a single task (the implementation was not designed for this and storage size grew unsustainably)
0.12.5 (2025-11-26)
Workaround snowflake sqlalchemy dialect to enable ExternalTableReference to other database
Flag in CreateTable and DropTable DDL statements allows not quoting schema (needed for multi-part schema)
Support side-channel fresh input in stable pipeline instance (mode=ASSERT_NO_FRESH_INPUT)
0.12.4 (2025-11-17)
Support python 3.14, dropped support for python 3.10
Support pyarrow 22, arrow-odbc 9
Support dataframely >= 2.1 / colspec >= 0.3.1
Fixed pandas retrieval of view that renames columns
Fixed mssql automatic max-string-length adjustment for varchar(max) in index
0.12.3 (2025-11-12)
Allow multiple runs with FileLockManager and filelock >= 3.11 installed
0.12.2 (2025-11-12)
Support datetime.time and timedelta as task input and output (or anywhere where JSON serialization is needed).
Added support for Snowflake Database
Known Issue: ADBC download screws up datetimes with year outside 64 bit ns range. Years 1700..2200 are fine. Workaround: clip date range in query and add a column that rescues the correct year. (typically, year 0 and 9999 are the only special values outside the range 1700..2200 used in practice)
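The workaround for the ADBC known issue can be sketched in plain Python rather than SQL. This is a minimal illustration, not pipedag code: the helper name clip_with_year and the constants MIN_DATE and MAX_DATE are assumptions, chosen to match the 1700..2200 range mentioned above.

```python
from datetime import date

# Assumed safe bounds, taken from the "Years 1700..2200 are fine" note.
MIN_DATE = date(1700, 1, 1)
MAX_DATE = date(2200, 12, 31)


def clip_with_year(value: date) -> tuple[date, int]:
    """Clip a date into the safe range and keep the original year alongside it."""
    clipped = min(max(value, MIN_DATE), MAX_DATE)
    return clipped, value.year


# Sentinel dates survive as (clipped date, original year) pairs.
rows = [date(2024, 5, 1), date(9999, 12, 31), date(1, 1, 1)]
clipped_rows = [clip_with_year(d) for d in rows]
```

In a real pipeline the clipping and the extra year column would be computed by the database query before download. Python's date type cannot represent year 0, so the sketch uses year 1 as the low sentinel.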
0.12.1 (2025-10-10)
Create all metadata tables even if some metadata tables already exist. This fixes problems with conditional need for sync_views table.
Make table hooks work even without ConfigContext (see example_mssql/download_parquet_files.py)
0.12.0 (2025-10-07)
Support pydiverse.common 0.4.1, pydiverse.transform 0.6.0, pydiverse.colspec 0.3.0.
Structlog logger initialization changed to stdlib logger factory to support dynamic loglevel filter in tests.
Switch cache misses from warning to info log level.
0.11.0 (2025-10-01)
Support View as task output to allow multi-parquet fusion in ParquetTableStore or basic column selection/renaming outside consumer task.
Support dataclass lazy field access at wiring time when a task returns a dataclass
Support Google Cloud Storage in ParquetTableStore (despite fsspec/gcsfs, configuration is a mess for s3 and gcs)
ExternalTableReference and View are automatically added to auto_table configuration
Expose optional dependency imports
Change some materialization detail error messages to warnings
Updated repr() and str() representations for some objects like Flow, ConfigContext, DagContext, …
Fix: mssql pyarrow-adbc download to pandas/polars
Fix: S3 example and error messages
0.10.11 (2025-09-08)
Fix: Late initialization of ParquetTableCache instance_id allows use of multi-config @input_stage_versions
0.10.10 (2025-09-05)
support separate metadata_table_store to allow team-synchronization for example for duckdb based ParquetTableStore
Fix: improve handling of missing ADBC/ConnectorX installations and error messages
0.10.9 (2025-08-21)
allow setting S3 endpoint URL in pipedag.yaml
Fix: IBM DB2 works with colspec
0.10.8 (2025-08-18)
Fix: Only cut MSSQL VARCHAR(N) in arrow-odbc download if N is MAX
Feat: Add support for parameter write_local_table_cache in Result.get() and get_output_from_store()
0.10.7 (2025-08-01)
IBM DB2: massive speedup by using ADMIN_CMD(‘LOAD FROM (SELECT…’) instead of INSERT INTO SELECT
0.10.6 (2025-07-30)
Fix: support empty tables in arrow-odbc download (mssql); batched reading failed
Reduce local table cache warnings for pydiverse.transform use
initialize_test_s3_bucket takes optional host, port, and test_bucket arguments
Fix: allow modulo operator in SQL queries (every sa.text(str(query)) duplicates ‘%’)
Set pandas df.attrs[“name”] consistently. By default this copies table names from input to output.
Reduced table name length for temporary colspec tables
0.10.5 (2025-07-29)
Fix: Fix mssqlkit bulk upload
support version=AUTO_VERSION for input_type=pdt.Polars
support dataframely schema class annotations which allows using pydiverse.transform Table together with dataframely (only Polars backend)
0.10.4 (2025-07-14)
Make pydot optional dependency for visualization of flow execution
Make psycopg2/adbc-driver-postgresql optional dependency
0.10.3 (2025-07-10)
Fix: Do not implicitly depend on kazoo
Fix some misleading warnings
0.10.2 (2025-07-08)
Switch from psycopg2 to psycopg2-binary for pypi version (does not affect conda-forge)
0.10.1 (2025-07-08)
Fix pypi dependencies in pyproject.toml which prevented conda-forge build of 0.10.0
0.10.0 (2025-07-04)
Added ParquetTableStore which is based on duckdb SQLTableStore. It stores all tables as parquet files but still references them inside a duckdb database file as views to FROM read_parquet(file). It supports both normal file systems and fsspec supported blob stores like AWS S3, Azure Blob Storage, and Google Cloud Storage.
ColSpec / DataFramely support based on annotations of parameters and return type of tasks
Pandas table hook reads date/datetime columns as ‘datetime64[us]’ and thus does not need clipping and extra year columns any more
Materialize Local Table Cache before actual Table Store
Mssql based SQLTableStore uses mssqlkit or bcpandas for bulk upload and arrow-odbc for download by default
Postgres based SQLTableStore uses ADBC for download by default (already supports bulk upload)
Fix numpy import issue on OS X
CI: Use SQL Server docker image instead of Azure Edge
Improve documentation for materialization details.
Add support for columnstore tables in MSSQLTableStore via MSSqlMaterializationDetails
0.9.10 (2025-03-18)
Fix incompatibility with pydiverse.transform >=0.2.1 (<0.2.0 still supported) (0.2.0 will not be supported):
pydiverse transform had a radical refactoring around this point
NotImplementedError when pydiverse.transform 0.2.0 - 0.2.2 is being used.
Pinned prefect to version “>=2.13.5, <3.0.0”, because future versions are currently not supported
IBM DB2 tests: switch from pypi to conda-forge; support osx-arm64; drop support for osx-64
0.9.9 (2025-02-05)
Fix incompatibility with DuckDB 1.1.
0.9.8 (2024-09-06)
Bugfix for inputs argument for flow.run().
0.9.7 (2024-09-05)
Add support for passing inputs for tasks returning multiple Tables and for RawSql tasks.
0.9.6 (2024-08-29)
Support ExternalTableReference creation at flow wiring time. A pipedag Table(ExternalTableReference(...)) object can be passed as a parameter into any task instead of any other pipedag table reference.
Fixed bug that caused a crash when retrieving a polars dataframe from SQL using polars >= 1
Fix warning about ignore_position_hashes being printed even if the flag was not set.
Added support for inputs argument for flow.run() allowing to pass ExternalTableReference objects to the flow that override the outputs of selected tasks.
0.9.5 (2024-07-22)
Fixed a bug in primary key generation when materializing pandas dataframe to postgres database
0.9.4 (2024-07-18)
Primary key and index identifiers are now automatically truncated to 63 characters to avoid issues with some database systems.
Added ignore_position_hashes option to flow.run() and get_output_from_store(). If True, the position hashes of tasks are not checked when retrieving the inputs of a task from the cache. This can prevent caching errors when evaluating subgraphs. For this to work a task may never be used more than once per stage.
Fixed a bug related to imperative materialization
0.9.3 (2024-06-11)
Added upload_table() and download_table() functions to the PandasTableHook to allow for easy customization of up and download behavior of pandas and polars tables from/to the table store.
More robust way of looking up hooks independent of import order. Subclasses of table stores don’t copy registered hooks at the moment of declaration. When registering a hook, it is now possible to specify the hooks that are replaced by the new registration.
0.9.2 (2024-05-07)
@input_stage_versions decorator allows specifying tasks which compare tables within the current stage transaction schema and another version of that stage. This can be the currently active stage schema of the same pipeline instance or from another instance. See: https://pydiversepipedag.readthedocs.io/en/latest/examples.html
0.9.1 (2024-04-26)
Support Snowflake as a backend for SQLTableStore.
For mssql backend, moved primary key adding after filling complete table.
Make polars dematerialization robust against missing connectorx. Fall back to pandas if connectorx is not available.
Fix some bugs with pandas < 2 and sqlalchemy < 2 compatibility as well as pyarrow handling.
Use pd.StringDtype(“pyarrow”) instead of pd.ArrowDtype(pa.string()) for dtype “string[pyarrow]”
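The connectorx fallback described in this release follows a common optional-dependency pattern. A minimal sketch, assuming a hypothetical helper pick_download_backend that is not part of pipedag:

```python
import importlib.util


def pick_download_backend(preferred: str = "connectorx", fallback: str = "pandas") -> str:
    """Return the preferred backend name if its module is importable, else the fallback."""
    # find_spec returns None for a top-level module that is not installed,
    # without importing the module itself.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


backend = pick_download_backend()
```

Checking availability up front keeps the decision in one place instead of scattering try/except ImportError blocks across the download code.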
0.9.0 (2024-04-17)
Support imperative materialization with tbl_ref = dag.Table(...).materialize(). This is particularly useful for materializing subqueries within a task. It also helps to see the task in the stack trace when materialization fails. There is one downside of using it: when a task returns multiple tables, it is assumed that all tables depend on previously imperatively materialized tables.
Support group nodes with or without barrier effect on task ordering. They can either be added by with GroupNode(): blocks around or within with Stage(): blocks, or they can be added in configuration via visualization: default: group_nodes: group_name: {label: "some_str", tasks: ["task_name"], stages: ["stage_name"]}. Visualization of group nodes can be controlled very flexibly with hide_box, hide_content, box_color_always, …
ExternalTableReference moved module and is now also a member of pydiverse.pipedag module. This is a breaking interface change for pipedag.
PrefectEngine moved to module pydiverse.pipedag.engine.prefect.PrefectEngine because it would otherwise import prefect whenever it is installed in environment which messes with logging library initialization. This is a breaking interface change.
Fixed an edgecase for mssql backend causing queries with columns named “from” to crash. The code to insert an INTO into mssql SELECT statements is still hacky but supports open quote detection. Comments may still confuse the logic.
0.8.0 (2024-04-02)
Significant refactoring of materialization is included. It splits creation of table from filling a table in many cases. This may lead to unexpected changes in log output. For now, the INSERT INTO SELECT statement is only printed in shortened version, because the creation of the table already includes the same statement in full. In the future, this might be made configurable, so your feedback is highly welcome.
pipedag.Table() now supports new parameters nullable and non_nullable. This allows specifying which columns are nullable both as a positive and negative list. If both are specified, they must mention each column in the table and have no overlap. For most dialects, non-nullable statements are issued after creating the empty table. For dialects mssql and ibm_db2, both nullable and non-nullable column alterations are issued because constant literals create non-nullable columns by default. If neither nullable nor non_nullable are specified, the default CREATE TABLE as SELECT is kept unmodified except for primary key columns where some dialects require explicit NOT NULL statements.
Refactored configuration for cache validation options. Now, there is a separate section called cache_validation configurable per instance which includes the following options:
mode: NORMAL, ASSERT_NO_FRESH_INPUT (protect a stable pipeline / fail if tasks with cache function are executed), IGNORE_FRESH_INPUT (same as ignore_cache_function=True before), FORCE_FRESH_INPUT (invalidates all tasks with cache function), FORCE_CACHE_INVALID (rerun all tasks)
disable_cache_function: True disables the call of cache functions. Downside: next mode=NORMAL run will be cache invalid.
ignore_task_version: Option existed before but a level higher
REMOVED option ignore_cache_function: Use cache_validation: mode: IGNORE_FRESH_INPUT in pipedag.yaml or flow.run(cache_validation_mode=CacheValidationMode.IGNORE_FRESH_INPUT) instead.
Set transaction isolation level to READ UNCOMMITTED via SQLAlchemy functionality
Fix that unlogged tables were created as logged tables when they were copied as cache valid
Materialize lazy tasks, when they are executed without stage context.
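The rule for combining nullable and non_nullable in this release can be expressed as a small check. This sketch is illustrative only: check_nullable_spec is a hypothetical helper, not pipedag's implementation, and it covers just the case where both lists are given.

```python
def check_nullable_spec(columns, nullable, non_nullable):
    """Validate that both lists together mention every column, without overlap.

    Returns a mapping column -> True if nullable, False if non-nullable.
    """
    nullable, non_nullable = set(nullable), set(non_nullable)
    overlap = nullable & non_nullable
    if overlap:
        raise ValueError(f"listed as both nullable and non_nullable: {sorted(overlap)}")
    missing = set(columns) - nullable - non_nullable
    if missing:
        raise ValueError(f"not mentioned in nullable or non_nullable: {sorted(missing)}")
    return {col: col in nullable for col in columns}


spec = check_nullable_spec(["id", "name"], nullable=["name"], non_nullable=["id"])
```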
0.7.2 (2024-03-25)
Disable Kroki links by default. New setting disable_kroki=True allows to still default kroki_url to https://kroki.io. Function create_basic_pipedag_config() just has a kroki_url parameter which defaults to None.
Added max_query_print_length parameter to MSSqlTableStore to limit the length of the printed SQL queries. Default is max_query_print_length=500000 characters.
Fix bug when creating a table with the same name as a Table given by ExternalTableReference in the same stage
New config options for SQLTableStore:
max_concurrent_copy_operations to limit the number of concurrent copy operations when copying tables between schemas.
sqlalchemy_pool_size and sqlalchemy_pool_timeout to configure the pool size and timeout for the SQLAlchemy connection pool.
The defaults fix a bug by setting sqlalchemy options to not time out when the first cache invalid task in a stage triggers copying of cache valid tables between schemas and copying takes longer than 30s.
0.7.1 (2024-03-11)
Fix bug when reading DECIMAL(precision, scale) columns to pandas task (precision was interpreted like for Float where precision <= 24 leads to float32). Beware that isinstance(sa.Float(), sa.Numeric) == True.
0.7.0 (2024-03-10)
Rework TableReference support:
Rename TableReference to ExternalTableReference
Add support for ExternalTableReference to point to tables in external (i.e. not managed by pipedag) schemas.
Remove support for ExternalTableReference that points to table in schema of current stage. I.e. ExternalTableReference can only point to tables in external schemas.
Support code based configuration (see create_basic_pipedag_config() in README.md example without config file and without docker-compose)
Added NoBlobStore in case you don’t want to provide a directory that is created or needs to exist
Fix polars import in pyproject.toml when using OS X with rosetta2
Bug fix ibm_db2 backend:
input tables for SQL queries were not locked
0.6.10 (2024-02-29)
Fix bug where a Task that was declared lazy but provided a Table without a query string would always be cache valid.
Improved documentation
0.6.9 (2024-01-24)
Update dependencies and remove some upper boundaries
Polars dependency moved to >= 0.18.12 due to incompatible interface change
Workaround for duckdb issue: https://github.com/duckdb/duckdb/issues/10322
Workaround for prefect needing pytz dependency without declaring it on pypi
0.6.8 (2023-12-15)
Bug fix ibm_db2 backend:
unspecified materialization_details was failing to load configuration
Bug fixes for mssql backend:
SELECT-INTO was invalid for keyword suffix labels: i.e. SELECT 1 as prefix_FROM
Raw SQL statements changing database link of connection via USE was causing pipedag generated commands to fail
0.6.7 (2023-12-05)
increased metadata_version to 0.3.2 => please delete metadata with pipedag-manage when upgrading from <= 0.6.6 to >= 0.6.7
Make separator customizable when splitting RawSql into statements.
Add DropNickname for DB2 and drop nicknames when dropping schemas.
Add debug function materialize_table.
Update install instructions and dependencies to enable DB2 and mssql development on OS X with an arm64 architecture.
Update PR template
Run RUNSTATS on every DB2 table after creation
Add materialization_details as an option to IBMDB2TableStore. For now DB2 compression, DB2 table spaces are supported and Postgres unlogged tables are supported.
For Postgres unlogged tables this is a breaking change. The unlogged_tables option does not exist anymore. Instead, use materialization_details: __any__: unlogged: true.
Workaround for known Problems:
add materialization_details in configuration when using ibm_db2 database connection
0.6.6 (2023-08-17)
Implement support for loading polars dataframes from DuckDB.
Accelerate storing of dataframes (pandas and polars) to DuckDB (10-100x speedup).
Fix TypeError being raised when using pydiverse transform SQLTableImpl together with a local table cache.
0.6.5 (2023-08-16)
Implemented automatic versioning of tasks by setting task version to AUTO_VERSION. This feature is currently only supported by Polars LazyFrame and by Pandas.
Added kroki_url config option.
0.6.4 (2023-08-07)
Allow invocation of undecorated task functions when calling task object outside of flow definition context.
Rename ignore_fresh_input to ignore_cache_function.
Fix race condition leading to JSONDecodeError in ParquetTableCache when setting store_input: true together with the DaskEngine.
Fix running subset of tasks not working due to tables and blobs being retrieved from wrong schema.
0.6.3 (2023-07-25)
Fix crash during config initialization when using DatabaseLockManager together with PostgreSQL.
0.6.2 (2023-07-23)
Switch back to using numpy nullable dtypes for Pandas as default.
Ensure that indices get created in same schema as corresponding table (IBM Db2).
Fix private method SQLTableStore.get_stage_hash.
0.6.1 (2023-07-19)
Create initial documentation for pipedag.
Remove stage argument from RawSql initializer.
Add RawSql to public API.
Fix pydiverse.pipedag.engine.prefect.PrefectTwoEngine failing on retrieval of results.
Added Flow.get_stage() and Stage.get_task() methods.
Added MaterializingTask.get_output_from_store() method to allow retrieval of task output without running the Flow.
Created TableReference to simplify complex table loading operations.
Allow for easy retrieval of tables generated by RawSql. Passing a RawSql object into a task results in all tables that were generated by the RawSql to be dematerialized. The tables can then be accessed using raw_sql["table_name"]. Alternatively, the same syntax can also be used during flow definition to only pass in a specific table.
Fix private method SQLTableStore.get_stage_hash not working for IBM DB2.
0.6.0 (2023-07-07)
Added delete-schemas command to pipedag-manage to help with cleaning up database
Remove all support for mssql database swapping. Instead, we now properly support schema swapping.
Fix UNLOGGED tables not working with Postgres.
Added hook_args section to table_store part of config file to support passing config arguments to table hooks.
Added dtype_backend hook argument for PandasTableHook to override the default pandas dtype backend to use.
Update raw sql metadata table (SQLTableStore).
Remove engine_dispatch and replace with SQLTableStore subclasses.
Moved local table cache from pydiverse.pipedag.backend.table_cache to pydiverse.pipedag.backend.table.cache namespace.
Changed order in which flow / instance config gets resolved.
0.5.0 (2023-06-28)
add support for DuckDB
add support for pyarrow backed pandas dataframes
support execution of subflow
store final state of task in flow result object
tasks now have a position_hash associated with them to identify them purely based on their position (e.g. stage, name and input wiring) inside a flow.
breaking change to metadata: added position_hash to tasks metadata table and change type of hash columns from String(32) to String(20).
Flow, Subflow, and Result objects now provide additional options for visualizing them
added unlogged_tables flag to SQLTableStore for creating UNLOGGED tables with Postgres.
created pipedag-manage command line utility with clear-metadata command to help with migrating between different pipedag metadata versions.
0.4.1 (2023-06-17)
implement DaskEngine: orchestration engine for running multiple tasks in parallel
implement DatabaseLockManager: lock manager based on locking mechanism provided by database
0.4.0 (2023-06-08)
update public interface
encrypt IPC communication
remove preemptive os.makedirs from ParquetTableCache
improve logging and provide structlog utilities
0.3.0 (2023-05-25)
breaking change to pipedag.yaml: introduced args subsections for arguments that are passed to backend classes
fix ibm_db_sa bug when copying dataframes from cache: uppercase table names by default
nicer readable SQL queries: use automatic aliases for inputs of SQLAlchemy tasks
implement option ignore_task_version: disable eager task caching for some instances to reduce overhead from task version bumping
implement local table cache: store input/output of dataframe tasks in parquet files and allow using it as cache to avoid rereading from database
0.2.4 (2023-05-05)
fix errors by increasing output_json length in metadata table
fix cache invalidation: query normalization before checking for changes
add rudimentary support for ibis tasks (postgres + mssql)
add rudimentary support for polars + tidypolars tasks
implemented pandas type mapping to avoid row wise type checks of object columns
support pandas 2.0 (no arrow features used that)
support sqlalchemy 2.0 (except for with polars)
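The cache invalidation fix in this release relies on normalizing a query before checking it for changes. A rough sketch of that idea, assuming whitespace collapsing as the normalization step; the normalization pipedag actually performs is not described here, and normalize_query and query_hash are hypothetical names:

```python
import hashlib
import re


def normalize_query(sql: str) -> str:
    """Collapse runs of whitespace so that pure formatting changes do not matter."""
    return re.sub(r"\s+", " ", sql).strip()


def query_hash(sql: str) -> str:
    """Hash the normalized query text; equal hashes mean 'no relevant change'."""
    return hashlib.sha256(normalize_query(sql).encode("utf-8")).hexdigest()


same = query_hash("SELECT a\n  FROM t") == query_hash("SELECT a FROM t")
different = query_hash("SELECT a FROM t") != query_hash("SELECT b FROM t")
```

This naive version also collapses whitespace inside string literals, which a production normalizer would need to avoid.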
0.2.3 (2023-04-17)
fixed python 3.9 compatibility (traceback.format_exception syntax changed)
fixed deferred table copy when task is invalid (introduced with 0.2.2)
fixed mssql to not reflect full schema while renamings happen
fixed clearing of metadata tables for lazy tables and raw sql tables
fixed mssql synonym resolution when reading input table for pandas task
initial implementation of issue #62: make query canonical before hashing
retry some DB calls in case they are aborted as deadlock victim
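Retrying a call that was aborted as a deadlock victim can be sketched as follows. DeadlockError and retry_on_deadlock are hypothetical stand-ins; real code would match the database driver's specific deadlock exception instead.

```python
import time


class DeadlockError(Exception):
    """Stand-in for a driver-specific 'chosen as deadlock victim' error."""


def retry_on_deadlock(fn, attempts=3, delay=0.0):
    """Call fn(), retrying up to `attempts` times when it raises DeadlockError."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except DeadlockError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)


calls = {"n": 0}


def flaky_query():
    """Fails twice as a deadlock victim, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("deadlock victim")
    return "ok"


result = retry_on_deadlock(flaky_query)
```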
0.2.2 (2023-03-31)
added option avoid_drop_create_schema to table store configuration
improve performance when working with IBM DB2 dialect (i.e. table locking)
prevent table copying and schema swapping for 100% cache valid stages
0.2.1 (2023-01-15)
removed contextvars dependency (not needed for python >= 3.7 and broke conda-forge build)
0.2.0 (2023-01-14)
SQLTableStore: support for Microsoft SQL Server and IBM DB2 (Linux) database connection strings
Support primary keys and indexes (can be configured with Table object and used in custom RawSql code)
RawSql: support additional return type for @materialize tasks which allows to emit raw SQL string including multiple create statements (currently, views/functions/procedures are only supported for dialect mssql). This feature should only be used for getting legacy code running in pipedag before converting it to programmatically generated or manual SELECT statements.
Support pytsql library for executing raw SQL scripts with dialect=mssql (i.e. supports PRINT capture)
Manual Cache Invalidation for source nodes: @materialize(cache=f) parameter can take an arbitrary function that gets the same arguments as the task function and returns a hash. If the hash is different from the previous run, the task is considered cache invalid.
New configuration file format pipedag.yaml can be used to configure multiple pipedag instances: see docs/reference/config.rst
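The manual cache invalidation contract in this release can be illustrated without pipedag itself. In this sketch, source_cache plays the role of the function passed as @materialize(cache=f), and is_cache_valid is a hypothetical stand-in for the comparison against the hash of the previous run:

```python
import hashlib


def source_cache(rows):
    """Cache function: receives the task's arguments and returns a hash.

    A real cache function would typically inspect an external source
    (a file, a table); hashing the argument keeps the sketch self-contained.
    """
    return hashlib.sha256(repr(sorted(rows)).encode("utf-8")).hexdigest()


def is_cache_valid(previous_hash, cache_fn, *args, **kwargs):
    """The task stays cache valid only if the new hash equals the previous one."""
    return previous_hash is not None and cache_fn(*args, **kwargs) == previous_hash


first_run = source_cache([3, 1, 2])
unchanged = is_cache_valid(first_run, source_cache, [1, 2, 3])
changed = is_cache_valid(first_run, source_cache, [1, 2, 4])
```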
0.1.0 (2022-09-01)
Initial release.
@materialize annotations
flow definition with nestable stages
zookeeper synchronization
postgres database backend
Prefect 1.x and 2.x support
multi-processing/multi-node support for state exchange between @materialize tasks
support materialization for: pandas, sqlalchemy, raw sql text, pydiverse.transform