Changelog#
0.7.1 (2024-03-11)#
Fix bug when reading DECIMAL(precision, scale) columns into a pandas task: precision was interpreted as for Float, where precision <= 24 leads to float32. Beware that isinstance(sa.Float(), sa.Numeric) == True.
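The isinstance pitfall can be illustrated without SQLAlchemy. The sketch below uses stand-in classes: Numeric, Float, and the helper pandas_dtype_for are hypothetical and are not pipedag or SQLAlchemy code. They mirror only the subclass relationship stated in the entry, and mapping DECIMAL to float64 is an assumption made for illustration. Because a Float is also a Numeric, the more specific check has to come first.

```python
class Numeric:
    """Stand-in for a DECIMAL(precision, scale) column type (hypothetical)."""

    def __init__(self, precision=None, scale=None):
        self.precision = precision
        self.scale = scale


class Float(Numeric):
    """Stand-in for a float column type; it subclasses Numeric on purpose."""

    def __init__(self, precision=None):
        super().__init__(precision=precision)


def pandas_dtype_for(col_type):
    """Pick a pandas dtype name for a column type (illustrative mapping)."""
    # Test the subclass first: isinstance(Float(), Numeric) is True, so a
    # Numeric check placed first would also catch every Float.
    if isinstance(col_type, Float):
        if col_type.precision is not None and col_type.precision <= 24:
            return "float32"
        return "float64"
    if isinstance(col_type, Numeric):
        # DECIMAL precision counts decimal digits, so it must not be compared
        # against the float32 threshold of 24. float64 is an assumed target.
        return "float64"
    raise TypeError(f"unsupported column type: {type(col_type).__name__}")
```

With this ordering, a DECIMAL(10, 2) column is no longer treated as a low-precision float even though its precision is below 24.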
0.7.0 (2024-03-10)#
Rework TableReference support:
Rename TableReference to ExternalTableReference
Add support for ExternalTableReference to point to tables in external (i.e. not managed by pipedag) schemas.
Remove support for ExternalTableReference that points to table in schema of current stage. I.e. ExternalTableReference can only point to tables in external schemas.
Support code based configuration (see create_basic_pipedag_config() in README.md example without config file and without docker-compose)
Added NoBlobStore in case you don’t want to provide a directory that is created or needs to exist
Fix polars import in pyproject.toml when using OS X with rosetta2
Bug fix ibm_db2 backend:
input tables for SQL queries were not locked
0.6.10 (2024-02-29)#
Fix bug where a Task that was declared lazy but provided a Table without a query string would always be cache valid.
Improved documentation
0.6.9 (2024-01-24)#
Update dependencies and remove some upper boundaries
Polars dependency moved to >= 0.18.12 due to incompatible interface change
Workaround for duckdb issue: https://github.com/duckdb/duckdb/issues/10322
Workaround for prefect needing pytz dependency without declaring it on pypi
0.6.8 (2023-12-15)#
Bug fix ibm_db2 backend:
unspecified materialization_details was failing to load configuration
Bug fixes for mssql backend:
SELECT-INTO was invalid for keyword suffix labels: i.e. SELECT 1 as prefix_FROM
Raw SQL statements changing database link of connection via USE was causing pipedag generated commands to fail
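The keyword-suffix problem is easy to reproduce with naive substring matching. The following sketch is not pipedag's implementation. find_from_keyword and insert_into are hypothetical helpers that show how a word-boundary regular expression avoids mistaking the label prefix_FROM for the FROM keyword when deciding where an INTO clause goes.

```python
import re

# \b treats "_" as a word character, so "prefix_FROM" contains no
# standalone FROM token and is not matched.
_FROM_KEYWORD = re.compile(r"\bFROM\b", re.IGNORECASE)


def find_from_keyword(sql: str) -> int:
    """Return the index of the first standalone FROM keyword, or -1."""
    match = _FROM_KEYWORD.search(sql)
    return match.start() if match else -1


def insert_into(sql: str, target: str) -> str:
    """Rewrite SELECT ... [FROM ...] as SELECT ... INTO target [FROM ...]."""
    pos = find_from_keyword(sql)
    if pos == -1:
        # No FROM clause at all: append INTO at the end of the statement.
        return f"{sql.rstrip()} INTO {target}"
    return f"{sql[:pos]}INTO {target} {sql[pos:]}"
```

A plain str.find("FROM") would locate the match inside prefix_FROM and split the label. The sketch ignores string literals and comments, which a real implementation would also have to skip.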
0.6.7 (2023-12-05)#
Make separator customizable when splitting RawSql into statements.
Add DropNickname for DB2 and drop nicknames when dropping schemas.
Add debug function materialize_table.
Update install instructions and dependencies to enable DB2 and mssql development on OS X with an arm64 architecture.
Update PR template
Run RUNSTATS on every DB2 table after creation
Add materialization_details as an option to IBMDB2TableStore. For now, DB2 compression, DB2 table spaces, and Postgres unlogged tables are supported.
For Postgres unlogged tables this is a breaking change. The unlogged_tables option does not exist anymore. Instead, use materialization_details: __any__: unlogged: true.
Workaround for known Problems:
add materialization_details in configuration when using ibm_db2 database connection
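The breaking change for Postgres unlogged tables amounts to moving one flag into a nested mapping. The sketch below expresses that migration on a plain dictionary. migrate_unlogged_option is a hypothetical helper, and the assumption that both options live in the same argument mapping of the table store is mine. Only the key names unlogged_tables, materialization_details, __any__ and unlogged come from the entry above.

```python
import copy


def migrate_unlogged_option(table_store_args: dict) -> dict:
    """Return a copy with unlogged_tables replaced by materialization_details."""
    args = copy.deepcopy(table_store_args)
    # The old flag is removed in every case, because the option no longer exists.
    unlogged = args.pop("unlogged_tables", None)
    if unlogged:
        # Mirrors the nesting materialization_details: __any__: unlogged: true
        details = args.setdefault("materialization_details", {})
        details.setdefault("__any__", {})["unlogged"] = True
    return args
```

The helper leaves its input untouched and keeps any materialization_details entries that are already present.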
0.6.6 (2023-08-17)#
Implement support for loading polars dataframes from DuckDB.
Accelerate storing of dataframes (pandas and polars) to DuckDB (10-100x speedup).
Fix TypeError being raised when using pydiverse transform SQLTableImpl together with a local table cache.
0.6.5 (2023-08-16)#
Implemented automatic versioning of tasks by setting task version to AUTO_VERSION. This feature is currently only supported by Polars LazyFrame and by Pandas.
Added kroki_url config option.
0.6.4 (2023-08-07)#
Allow invocation of undecorated task functions when calling task object outside of flow definition context.
Rename ignore_fresh_input to ignore_cache_function.
Fix race condition leading to JSONDecodeError in ParquetTableCache when setting store_input: true together with the DaskEngine.
Fix running subset of tasks not working due to tables and blobs being retrieved from wrong schema.
0.6.3 (2023-07-25)#
Fix crash during config initialization when using DatabaseLockManager together with PostgreSQL.
0.6.2 (2023-07-23)#
Switch back to using numpy nullable dtypes for Pandas as default.
Ensure that indices get created in same schema as corresponding table (IBM Db2).
Fix private method SQLTableStore.get_stage_hash.
0.6.1 (2023-07-19)#
Create initial documentation for pipedag.
Remove stage argument from RawSql initializer.
Add RawSql to public API.
Fix PrefectTwoEngine failing on retrieval of results.
Added Flow.get_stage(), and Stage.get_task() methods.
Added MaterializingTask.get_output_from_store() method to allow retrieval of task output without running the Flow.
Created TableReference to simplify complex table loading operations.
Allow for easy retrieval of tables generated by RawSql. Passing a RawSql object into a task results in all tables that were generated by the RawSql to be dematerialized. The tables can then be accessed using raw_sql["table_name"]. Alternatively, the same syntax can also be used during flow definition to only pass in a specific table.
Fix private method SQLTableStore.get_stage_hash not working for IBM DB2.
0.6.0 (2023-07-07)#
Added delete-schemas command to pipedag-manage to help with cleaning up database
Remove all support for mssql database swapping. Instead, we now properly support schema swapping.
Fix UNLOGGED tables not working with Postgres.
Added hook_args section to table_store part of config file to support passing config arguments to table hooks.
Added dtype_backend hook argument for PandasTableHook to override the default pandas dtype backend.
Update raw sql metadata table (SQLTableStore).
Remove engine_dispatch and replace with SQLTableStore subclasses.
Moved local table cache from pydiverse.pipedag.backend.table_cache to pydiverse.pipedag.backend.table.cache namespace.
Changed order in which flow / instance config gets resolved.
0.5.0 (2023-06-28)#
add support for DuckDB
add support for pyarrow backed pandas dataframes
support execution of subflow
store final state of task in flow result object
tasks now have a position_hash associated with them to identify them purely based on their position (e.g. stage, name and input wiring) inside a flow.
breaking change to metadata: added position_hash to tasks metadata table and change type of hash columns from String(32) to String(20).
Flow, Subflow, and Result objects now provide additional options for visualizing them
added unlogged_tables flag to SQLTableStore for creating UNLOGGED tables with Postgres.
created pipedag-manage command line utility with clear-metadata command to help with migrating between different pipedag metadata versions.
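The idea behind position_hash, a hash that depends only on where a task sits in the flow, can be sketched as follows. This is not pipedag's algorithm: the choice of SHA-256, the JSON encoding, and the function signature are assumptions. Only the inputs (stage, name, input wiring) and the 20-character column width come from the entries above.

```python
import hashlib
import json


def position_hash(stage: str, task_name: str, input_wiring: list[str]) -> str:
    """Hash a task's position in the flow; the task's code or version is not included."""
    # A stable, unambiguous serialization of the position description.
    payload = json.dumps([stage, task_name, input_wiring], separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Truncate so the value fits a String(20) metadata column.
    return digest[:20]
```

Two tasks with the same name in the same stage still receive different hashes when their inputs are wired differently.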
0.4.1 (2023-06-17)#
implement DaskEngine: orchestration engine for running multiple tasks in parallel
implement DatabaseLockManager: lock manager based on locking mechanism provided by database
0.4.0 (2023-06-08)#
update public interface
encrypt IPC communication
remove preemptive os.makedirs from ParquetTableCache
improve logging and provide structlog utilities
0.3.0 (2023-05-25)#
breaking change to pipedag.yaml: introduced args subsections for arguments that are passed to backend classes
fix ibm_db_sa bug when copying dataframes from cache: uppercase table names by default
nicer readable SQL queries: use automatic aliases for inputs of SQLAlchemy tasks
implement option ignore_task_version: disable eager task caching for some instances to reduce overhead from task version bumping
implement local table cache: store input/output of dataframe tasks in parquet files and allow using it as cache to avoid rereading from database
0.2.4 (2023-05-05)#
fix errors by increasing output_json length in metadata table
fix cache invalidation: query normalization before checking for changes
add rudimentary support for ibis tasks (postgres + mssql)
add rudimentary support for polars + tidypolars tasks
implemented pandas type mapping to avoid row wise type checks of object columns
support pandas 2.0 (no arrow features used)
support sqlalchemy 2.0 (except for with polars)
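Normalizing a query before checking for changes means that formatting-only edits do not invalidate the cache. A minimal sketch of that idea follows. normalize_query and query_hash are hypothetical names, and whitespace collapsing is only one possible normalization; the changelog does not state which rules pipedag applies.

```python
import hashlib
import re


def normalize_query(sql: str) -> str:
    """Collapse every whitespace run to one space and trim both ends."""
    return re.sub(r"\s+", " ", sql).strip()


def query_hash(sql: str) -> str:
    """Hash the normalized query text, so that re-indenting keeps the hash."""
    return hashlib.sha256(normalize_query(sql).encode("utf-8")).hexdigest()
```

This simple version also collapses whitespace inside string literals, so two queries that differ only there would share a hash. A production normalizer would need to parse the statement.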
0.2.3 (2023-04-17)#
fixed python 3.9 compatibility (traceback.format_exception syntax changed)
fixed deferred table copy when task is invalid (introduced with 0.2.2)
fixed mssql to not reflect full schema while renamings happen
fixed clearing of metadata tables for lazy tables and raw sql tables
fixed mssql synonym resolution when reading input table for pandas task
initial implementation of issue #62: make query canonical before hashing
retry some DB calls in case they are aborted as deadlock victim
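Retrying a call that was aborted as a deadlock victim can be sketched with a small wrapper. DeadlockError and retry_on_deadlock are hypothetical and are not part of pipedag. A real implementation would inspect the driver's error code, and the exponential backoff used here is an assumption.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver error that reports a deadlock victim."""


def retry_on_deadlock(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func() and retry on DeadlockError up to `attempts` total calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except DeadlockError:
            if attempt == attempts:
                # Out of retries: let the caller see the original error.
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so that the wrapper can be exercised in tests without waiting.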
0.2.2 (2023-03-31)#
added option avoid_drop_create_schema to table store configuration
improve performance when working with IBM DB2 dialect (i.e. table locking)
prevent table copying and schema swapping for 100% cache valid stages
0.2.1 (2023-01-15)#
removed contextvars dependency (not needed for python >= 3.7 and broke conda-forge build)
0.2.0 (2023-01-14)#
SQLTableStore: support for Microsoft SQL Server and IBM DB2 (Linux) database connection strings
Support primary keys and indexes (can be configured with Table object and used in custom RawSql code)
RawSql: support additional return type for @materialize tasks which allows emitting a raw SQL string including multiple create statements (currently, views/functions/procedures are only supported for dialect mssql). This feature should only be used for getting legacy code running in pipedag before converting it to programmatically generated or manual SELECT statements.
Support pytsql library for executing raw SQL scripts with dialect=mssql (i.e. supports PRINT capture)
Manual Cache Invalidation for source nodes: @materialize(cache=f) parameter can take an arbitrary function that gets the same arguments as the task function and returns a hash. If the hash is different from that of the previous run, the task is considered cache invalid.
New configuration file format pipedag.yaml can be used to configure multiple pipedag instances: see docs/reference/config.rst
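A cache function for a source node only has to turn the task's arguments into a value that changes whenever the source changes. The sketch below shows one such function for a file-based source. file_content_hash and load_source are hypothetical names, and the decorator usage in the comment follows the description above without being executed here.

```python
import hashlib
from pathlib import Path


def file_content_hash(path: str) -> str:
    """Return a hash of the file contents; it changes whenever the file changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Assumed usage, based on the changelog entry (not executed in this sketch):
#
# @materialize(cache=file_content_hash)
# def load_source(path: str):
#     ...
```

Hashing the full contents is the simplest choice. For large files, a cheaper signal such as size and modification time may be preferable, at the cost of missing some changes.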
0.1.0 (2022-09-01)#
Initial release.
@materialize annotations
flow definition with nestable stages
zookeeper synchronization
postgres database backend
Prefect 1.x and 2.x support
multi-processing/multi-node support for state exchange between @materialize tasks
support materialization for: pandas, sqlalchemy, raw sql text, pydiverse.transform