----------------------------------------------------------------------
This is the API documentation for the pointblank library.
----------------------------------------------------------------------

## Validate

When performing data validation, use the `Validate` class to get the process started. It takes the target table and options for metadata and failure thresholds (using the `Thresholds` class or shorthands). The `Validate` class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.

Validate(data: 'IntoDataFrame', reference: 'IntoFrame | None' = None, tbl_name: 'str | None' = None, label: 'str | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, final_actions: 'FinalActions | None' = None, brief: 'str | bool | None' = None, lang: 'str | None' = None, locale: 'str | None' = None, owner: 'str | None' = None, consumers: 'str | list[str] | None' = None, version: 'str | None' = None) -> None

Workflow for defining a set of validations on a table and interrogating for results.

The `Validate` class is used for defining a set of validation steps on a table and interrogating the table with the *validation plan*. This class is the main entry point for the *data quality reporting* workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.

We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_between()`](`pointblank.Validate.col_vals_between`), etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a *validation plan*.

The validation methods, when called, are merely instructions up to the point the concluding [`interrogate()`](`pointblank.Validate.interrogate`) method is called. That kicks off the process of acting on the *validation plan* by querying the target table and getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. Printing the `Validate` object (or using the [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method) will return a table with the results of the interrogation, and [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) allows for the splitting of the table based on passing and failing rows.

Parameters
----------

data
    The table to validate, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a database connection string. When providing a CSV or Parquet file path (as a string or `pathlib.Path` object), the file will be automatically loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports glob patterns, directories containing `.parquet` files, and Spark-style partitioned datasets. GitHub URLs are automatically transformed to raw content URLs and downloaded. Connection strings enable direct database access via Ibis, with table specification using the `::table_name` suffix.
    Read the *Supported Input Table Types* section for details on the supported table types.

tbl_name
    An optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the tabular report.

label
    An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the tabular report).

thresholds
    Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. The thresholds are set at the global level and can be overridden at the validation step level (each validation step has its own `thresholds=` parameter). The default is `None`, which means that no thresholds will be set. Look at the *Thresholds* section for information on how to set threshold levels.

actions
    The actions to take when validation steps meet or exceed any set threshold levels. These actions are paired with the threshold levels and are executed during the interrogation process when there are exceedances. The actions are executed right after each step is evaluated. Such actions should be provided in the form of an `Actions` object. If `None` then no global actions will be set. View the *Actions* section for information on how to set actions.

final_actions
    The actions to take when the validation process is complete and the final results are available. This is useful for sending notifications or reporting the overall status of the validation process. The final actions are executed after all validation steps have been processed and the results have been collected. The final actions are not tied to any threshold levels; they are executed regardless of the validation results. Such actions should be provided in the form of a `FinalActions` object. If `None` then no finalizing actions will be set. Please see the *Actions* section for information on how to set final actions.

brief
    A global setting for briefs, which are optional brief descriptions for validation steps (they will be displayed in the reporting table). For such a global setting, templating elements like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated brief) are useful. If `True` then each brief will be automatically generated. If `None` (the default) then briefs aren't globally set.

lang
    The language to use for various reporting elements. By default, `None` will select English (`"en"`) as the language, but other options include French (`"fr"`), German (`"de"`), Italian (`"it"`), Spanish (`"es"`), and several more. Have a look at the *Reporting Languages* section for the full list of supported languages and information on how the language setting is utilized.

locale
    An optional locale ID to use for formatting values in the reporting table according to the locale's rules. Examples include `"en-US"` for English (United States) and `"fr-FR"` for French (France). More simply, this can be a language identifier without a designation of territory, like `"es"` for Spanish.

owner
    An optional string identifying the owner of the data being validated. This is useful for governance purposes, indicating who is responsible for the quality and maintenance of the data. For example, `"data-platform-team"` or `"analytics-engineering"`.

consumers
    An optional string or list of strings identifying who depends on or consumes this data. This helps document data dependencies and can be useful for impact analysis when data quality issues are detected. For example, `"ml-team"` or `["ml-team", "analytics"]`.

version
    An optional string representing the version of the validation plan or data contract. This supports semantic versioning (e.g., `"1.0.0"`, `"2.1.0"`) and is useful for tracking changes to validation rules over time and for organizational governance.

Returns
-------

Validate
    A `Validate` object with the table and validations to be performed.

Supported Input Table Types
---------------------------

The `data=` parameter can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- PySpark table (`"pyspark"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Microsoft SQL Server table (`"mssql"`)*
- Snowflake table (`"snowflake"`)*
- Databricks table (`"databricks"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*
- CSV files (string path or `pathlib.Path` object with `.csv` extension)
- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset)
- Database connection strings (URI format with table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, the use of `Validate` with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.

To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback.

Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. Examples include:

```
"duckdb:///path/to/database.ddb::table_name"
"sqlite:///path/to/database.db::table_name"
"postgresql://user:password@localhost:5432/database::table_name"
"mysql://user:password@localhost:3306/database::table_name"
"bigquery://project/dataset::table_name"
"snowflake://user:password@account/database/schema::table_name"
```

When using connection strings, the Ibis library with the appropriate backend driver is required.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for all validation steps. They are set here at the global level but can be overridden at the validation step level (each validation step has its own local `thresholds=` parameter).

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of failing test units among all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units for a validation step exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Actions
-------

The `actions=` and `final_actions=` parameters provide mechanisms to respond to validation results. These actions can be used to notify users of validation failures, log issues, or trigger other processes when problems are detected.

*Step Actions*

The `actions=` parameter allows you to define actions that are triggered when validation steps exceed specific threshold levels (warning, error, or critical). These actions are executed during the interrogation process, right after each step is evaluated.

Step actions should be provided using the [`Actions`](`pointblank.Actions`) class, which lets you specify different actions for different severity levels:

```python
# Define an action that logs a message when the warning threshold is exceeded
def log_warning():
    metadata = pb.get_action_metadata()
    print(f"WARNING: Step {metadata['step']} failed with type {metadata['type']}")

# Define actions for different threshold levels
actions = pb.Actions(
    warning=log_warning,
    error=lambda: send_email("Error in validation"),
    critical="CRITICAL FAILURE DETECTED"
)

# Use in Validate
validation = pb.Validate(
    data=my_data,
    actions=actions  # Global actions for all steps
)
```

You can also provide step-specific actions in individual validation methods:

```python
validation.col_vals_gt(
    columns="revenue",
    value=0,
    actions=pb.Actions(warning=log_warning)  # Only applies to this step
)
```

Step actions have access to step-specific context through the [`get_action_metadata()`](`pointblank.get_action_metadata`) function, which provides details about the current validation step that triggered the action.

*Final Actions*

The `final_actions=` parameter lets you define actions that execute after all validation steps have completed. These are useful for providing summaries, sending notifications based on overall validation status, or performing cleanup operations.

Final actions should be provided using the [`FinalActions`](`pointblank.FinalActions`) class:

```python
def send_report():
    summary = pb.get_validation_summary()
    if summary["status"] == "CRITICAL":
        send_alert_email(
            subject=f"CRITICAL validation failures in {summary['tbl_name']}",
            body=f"{summary['critical_steps']} steps failed with critical severity."
        )

validation = pb.Validate(
    data=my_data,
    final_actions=pb.FinalActions(send_report)
)
```

Final actions have access to validation-wide summary information through the [`get_validation_summary()`](`pointblank.get_validation_summary`) function, which provides a comprehensive overview of the entire validation process.

The combination of step actions and final actions provides a flexible system for responding to data quality issues at both the individual step level and the overall validation level.

Reporting Languages
-------------------

Various pieces of reporting in Pointblank can be localized to a specific language. This is done by setting the `lang=` parameter in `Validate`.
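
For instance, here's a minimal sketch of setting a reporting language (with `tbl` standing in for any supported table):

```python
import pointblank as pb

# Reporting elements (and any auto-generated briefs) will be rendered in French
validation = pb.Validate(data=tbl, lang="fr", brief=True)
```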

Any of the following languages can be used (just provide the language code):

- English (`"en"`)
- French (`"fr"`)
- German (`"de"`)
- Italian (`"it"`)
- Spanish (`"es"`)
- Portuguese (`"pt"`)
- Dutch (`"nl"`)
- Swedish (`"sv"`)
- Danish (`"da"`)
- Norwegian Bokmål (`"nb"`)
- Icelandic (`"is"`)
- Finnish (`"fi"`)
- Polish (`"pl"`)
- Czech (`"cs"`)
- Romanian (`"ro"`)
- Greek (`"el"`)
- Russian (`"ru"`)
- Turkish (`"tr"`)
- Arabic (`"ar"`)
- Hindi (`"hi"`)
- Simplified Chinese (`"zh-Hans"`)
- Traditional Chinese (`"zh-Hant"`)
- Japanese (`"ja"`)
- Korean (`"ko"`)
- Vietnamese (`"vi"`)
- Indonesian (`"id"`)
- Ukrainian (`"uk"`)
- Bulgarian (`"bg"`)
- Croatian (`"hr"`)
- Estonian (`"et"`)
- Hungarian (`"hu"`)
- Irish (`"ga"`)
- Latvian (`"lv"`)
- Lithuanian (`"lt"`)
- Maltese (`"mt"`)
- Slovak (`"sk"`)
- Slovenian (`"sl"`)
- Hebrew (`"he"`)
- Thai (`"th"`)
- Persian (`"fa"`)

Automatically generated briefs (produced by using `brief=True` or `brief="...{auto}..."`) will be written in the selected language. The language setting will also be used when generating the validation report table through [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) (or printing the `Validate` object in a notebook environment).

Examples
--------

### Creating a validation plan and interrogating

Let's walk through a data quality analysis of an extremely small table. It's actually called `"small_table"` and it's accessible through the [`load_dataset()`](`pointblank.load_dataset`) function.

```{python}
import pointblank as pb

# Load the `small_table` dataset
small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

# Preview the table
pb.preview(small_table)
```

We ought to think about what's tolerable in terms of data quality, so let's designate proportional failure thresholds for the 'warning', 'error', and 'critical' states. This can be done by using the [`Thresholds`](`pointblank.Thresholds`) class.

```{python}
thresholds = pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
```

Now, we use the `Validate` class and give it the `thresholds` object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in `thresholds=` will make the reporting a bit more useful. We also need to provide a target table, and we'll use `small_table` for this.

```{python}
validation = (
    pb.Validate(
        data=small_table,
        tbl_name="small_table",
        label="`Validate` example.",
        thresholds=thresholds
    )
)
```

Then, as with any `Validate` object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the [`interrogate()`](`pointblank.Validate.interrogate`) method.

```{python}
validation = (
    validation
    .col_vals_gt(columns="d", value=100)
    .col_vals_le(columns="c", value=5)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)
```

The `validation` object can be printed as a reporting table.

```{python}
validation
```

The report could be further customized by using the [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method, which contains options for modifying the display of the table.

### Adding briefs

Briefs are short descriptions of the validation steps. While they can be set for each step individually, they can also be set globally. The global setting is done by using the `brief=` argument in `Validate`.
The global setting can be as simple as `True` to have automatically generated briefs for each step. Alternatively, we can use templating elements like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated brief). Here's an example of a global setting for briefs:

```{python}
validation_2 = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
        label="Validation example with briefs",
        brief="Step {step}: {auto}",
    )
    .col_vals_gt(columns="d", value=100)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(
        columns="b",
        pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
        brief="Regex check for column {col}"
    )
    .interrogate()
)

validation_2
```

We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore, the global brief's template (`"Step {step}: {auto}"`) is applied to all steps except for the final step, where the step-level `brief=` argument provided an override. If you want to cancel the globally defined brief for one or more validation steps, you can set `brief=False` in those particular steps.

### Post-interrogation methods

The `Validate` class has a number of post-interrogation methods that can be used to extract useful information from the validation results. For example, the [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method can be used to get the data extracts for each validation step.

```{python}
validation_2.get_data_extracts()
```

We can also view step reports for each validation step using the [`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the type of validation step and shows the relevant information for a step's validation.

```{python}
validation_2.get_step_report(i=2)
```

The `Validate` class also has a method for getting the sundered data, which is the data that passed or failed the validation steps. This can be done using the [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method.

```{python}
pb.preview(validation_2.get_sundered_data())
```

The sundered data is a DataFrame that contains the rows that passed or failed the validation. The default behavior is to return the rows that passed the validation, as shown above.

### Working with CSV Files

The `Validate` class can directly accept CSV file paths, making it easy to validate data stored in CSV files without manual loading:

```{python}
# Get a path to a CSV file from the package data
csv_path = pb.get_data_path("global_sales", "csv")

validation_3 = (
    pb.Validate(
        data=csv_path,
        label="CSV validation example"
    )
    .col_exists(["customer_id", "product_id", "revenue"])
    .col_vals_not_null(["customer_id", "product_id"])
    .col_vals_gt(columns="revenue", value=0)
    .interrogate()
)

validation_3
```

You can also use a Path object to specify the CSV file. Here's an example of how to do that:

```{python}
from pathlib import Path

csv_file = Path(pb.get_data_path("game_revenue", "csv"))

validation_4 = (
    pb.Validate(data=csv_file, label="Game Revenue Validation")
    .col_exists(["player_id", "session_id", "item_name"])
    .col_vals_regex(
        columns="session_id",
        pattern=r"[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}"
    )
    .col_vals_gt(columns="item_revenue", value=0, na_pass=True)
    .interrogate()
)

validation_4
```

CSV loading is automatic: when a string or Path with a `.csv` extension is provided, Pointblank will load the file using the best available DataFrame library (Polars preferred, Pandas as fallback).
The loaded data can then be used with all validation methods just like any other supported table type.

### Working with Parquet Files

The `Validate` class can directly accept Parquet files and datasets in various formats. The following examples illustrate how to validate Parquet files:

```{python}
# Single Parquet file from package data
parquet_path = pb.get_data_path("nycflights", "parquet")

validation_5 = (
    pb.Validate(
        data=parquet_path,
        tbl_name="NYC Flights Data"
    )
    .col_vals_not_null(["carrier", "origin", "dest"])
    .col_vals_gt(columns="distance", value=0)
    .interrogate()
)

validation_5
```

You can also use glob patterns and directories. Here are some examples showing how to:

1. load multiple Parquet files
2. load a directory containing Parquet files
3. load a partitioned Parquet dataset

```python
# Multiple Parquet files with glob patterns
validation_6 = pb.Validate(data="data/sales_*.parquet")

# Directory containing Parquet files
validation_7 = pb.Validate(data="parquet_data/")

# Partitioned Parquet dataset
validation_8 = (
    pb.Validate(data="sales_data/")  # Contains year=2023/quarter=Q1/region=US/sales.parquet
    .col_exists(["transaction_id", "amount", "year", "quarter", "region"])
    .interrogate()
)
```

When you point to a directory that contains a partitioned Parquet dataset (with subdirectories like `year=2023/quarter=Q1/region=US/`), Pointblank will automatically:

- discover all Parquet files recursively
- extract partition column values from directory paths
- add partition columns to the final DataFrame
- combine all partitions into a single table for validation

Both Polars and Pandas handle partitioned datasets natively, so this works seamlessly with either DataFrame library. The loading preference is Polars first, then Pandas as a fallback.

### Working with Database Connection Strings

The `Validate` class supports database connection strings for direct validation of database tables. Connection strings must specify a table using the `::table_name` suffix:

```{python}
# Get path to a DuckDB database file from package data
duckdb_path = pb.get_data_path("game_revenue", "duckdb")

validation_9 = (
    pb.Validate(
        data=f"duckdb:///{duckdb_path}::game_revenue",
        label="DuckDB Game Revenue Validation"
    )
    .col_exists(["player_id", "session_id", "item_revenue"])
    .col_vals_gt(columns="item_revenue", value=0)
    .interrogate()
)

validation_9
```

For comprehensive documentation on supported connection string formats, error handling, and installation requirements, see the [`connect_to_table()`](`pointblank.connect_to_table`) function. This function handles all the connection logic and provides helpful error messages when table specifications are missing or backend dependencies are not installed.

Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None

Definition of threshold values.

Thresholds are used to set limits on the number of failing test units at different levels. The levels are 'warning', 'error', and 'critical'. These levels correspond to different levels of severity when a threshold is reached. The threshold values can be set as absolute counts or as fractions of the total number of test units. When a threshold is reached, an action can be taken (e.g., displaying a message or calling a function) if there is an associated action defined for that level (defined through the [`Actions`](`pointblank.Actions`) class).

Parameters
----------

warning
    The threshold for the 'warning' level.
    This can be an absolute count or a fraction of the total. Using `True` will set this threshold value to `1`.

error
    The threshold for the 'error' level. This can be an absolute count or a fraction of the total. Using `True` will set this threshold value to `1`.

critical
    The threshold for the 'critical' level. This can be an absolute count or a fraction of the total. Using `True` will set this threshold value to `1`.

Returns
-------

Thresholds
    A `Thresholds` object. This can be used when using the [`Validate`](`pointblank.Validate`) class (to set thresholds globally) or when defining validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that threshold values are scoped to individual validation steps, overriding any global thresholds).

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

In a data validation workflow, you can set thresholds for the number of failing test units at different levels. For example, you can set a threshold for the 'warning' level when the number of failing test units exceeds 10% of the total number of test units:

```{python}
thresholds_1 = pb.Thresholds(warning=0.1)
```

You can also set thresholds for the 'error' and 'critical' levels:

```{python}
thresholds_2 = pb.Thresholds(warning=0.1, error=0.2, critical=0.3)
```

Thresholds can also be set as absolute counts. Here's an example where the 'warning' level is set to `5` failing test units:

```{python}
thresholds_3 = pb.Thresholds(warning=5)
```

The `thresholds` object can be used to set global thresholds for all validation steps. Or, you can set thresholds for individual validation steps, which will override the global thresholds. Here's a data validation workflow example where we set global thresholds and then override with different thresholds at the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) step:

```{python}
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table"),
        label="Example Validation",
        thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3)
    )
    .col_vals_not_null(columns=["c", "d"])
    .col_vals_gt(columns="a", value=3, thresholds=pb.Thresholds(warning=5))
    .interrogate()
)

validation
```

As can be seen, the last step ([`col_vals_gt()`](`pointblank.Validate.col_vals_gt`)) has its own thresholds, which override the global thresholds set at the beginning of the validation workflow (in the [`Validate`](`pointblank.Validate`) class).

Actions(warning: 'str | Callable | list[str | Callable] | None' = None, error: 'str | Callable | list[str | Callable] | None' = None, critical: 'str | Callable | list[str | Callable] | None' = None, default: 'str | Callable | list[str | Callable] | None' = None, highest_only: 'bool' = True) -> None

Definition of action values.

Actions complement threshold values by defining what action should be taken when a threshold level is reached. The action can be a string or a `Callable`. When a string is used, it is interpreted as a message to be displayed. When a `Callable` is used, it will be invoked at interrogation time if the threshold level is met or exceeded.

There are three threshold levels: 'warning', 'error', and 'critical'. These levels correspond to different levels of severity when a threshold is reached. Those thresholds can be defined using the [`Thresholds`](`pointblank.Thresholds`) class or various shorthand forms.
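
As a quick sketch of those shorthand forms (with `tbl` standing in for any supported table), the first three `thresholds=` specifications below are equivalent, while the last one sets the 'warning' level only:

```python
import pointblank as pb

v1 = pb.Validate(data=tbl, thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3))
v2 = pb.Validate(data=tbl, thresholds=(0.1, 0.2, 0.3))
v3 = pb.Validate(data=tbl, thresholds={"warning": 0.1, "error": 0.2, "critical": 0.3})
v4 = pb.Validate(data=tbl, thresholds=0.1)  # shorthand for the 'warning' level only
```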

Actions don't have to be defined for all threshold levels; if an action is not defined for a level in exceedance, no action will be taken. Likewise, there is no negative consequence (other than a no-op) for defining actions for thresholds that don't exist (e.g., setting an action for the 'critical' level when no corresponding 'critical' threshold has been set).

Parameters
----------

warning
    A string, `Callable`, or list of `Callable`/string values for the 'warning' level. Using `None` means no action should be performed at the 'warning' level.

error
    A string, `Callable`, or list of `Callable`/string values for the 'error' level. Using `None` means no action should be performed at the 'error' level.

critical
    A string, `Callable`, or list of `Callable`/string values for the 'critical' level. Using `None` means no action should be performed at the 'critical' level.

default
    A string, `Callable`, or list of `Callable`/string values for all threshold levels. This parameter can be used to set the same action for all threshold levels. If an action is defined for a specific threshold level, it will override the action set for all levels.

highest_only
    A boolean value that, when set to `True` (the default), results in executing only the action for the highest threshold level that is exceeded. Useful when you want to ensure that only the most severe action is taken when multiple threshold levels are exceeded.

Returns
-------

Actions
    An `Actions` object. This can be used when using the [`Validate`](`pointblank.Validate`) class (to set actions for meeting different threshold levels globally) or when defining validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that actions are scoped to individual validation steps, overriding any globally set actions).

Types of Actions
----------------

Actions can be defined in different ways:

1. **String**: A message to be displayed when the threshold level is met or exceeded.
2. **Callable**: A function that is called when the threshold level is met or exceeded.
3. **List of Strings/Callables**: Multiple messages or functions to be called when the threshold level is met or exceeded.

The actions are executed at interrogation time when the threshold level assigned to the action is exceeded by the number or proportion of failing test units. When providing a string, it will simply be printed to the console. A callable will also be executed at the time of interrogation. If providing a list of strings or callables, each item in the list will be executed in order. Such a list can contain a mix of strings and callables.

String Templating
-----------------

When using a string as an action, you can include placeholders for the following variables:

- `{type}`: The validation step type where the action is executed (e.g., 'col_vals_gt', 'col_vals_lt', etc.)
- `{level}`: The threshold level where the action is executed ('warning', 'error', or 'critical')
- `{step}` or `{i}`: The step number in the validation workflow where the action is executed
- `{col}` or `{column}`: The column name where the action is executed
- `{val}` or `{value}`: An associated value for the validation method (e.g., the value to compare against in a 'col_vals_gt' validation step)
- `{time}`: A datetime value for when the action was executed

The first two placeholders can also be used in uppercase (e.g., `{TYPE}` or `{LEVEL}`) and the corresponding values will be displayed in uppercase. The placeholders are replaced with the actual values during interrogation.

For example, the string `"{LEVEL}: '{type}' threshold exceeded for column {col}."` will be displayed as `"WARNING: 'col_vals_gt' threshold exceeded for column a."` when the 'warning' threshold is exceeded in a 'col_vals_gt' validation step involving column `a`.

Crafting Callables with `get_action_metadata()`
-----------------------------------------------

When creating a callable function to be used as an action, you can use the [`get_action_metadata()`](`pointblank.get_action_metadata`) function to retrieve metadata about the step where the action is executed. This metadata contains information about the validation step, including the step type, level, step number, column name, and associated value. You can use this information to craft your action message or to take specific actions based on the metadata provided.

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

Let's define both threshold values and actions for a data validation workflow. We'll set these thresholds and actions globally for all validation steps. In this specific example, the only actions we'll define are for the 'critical' level:

```{python}
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(critical="Major data quality issue found in step {step}."),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)

validation
```

Because we set the 'critical' action to display `"Major data quality issue found in step {step}."` in the console, this message will be displayed if the number of failing test units exceeds the 'critical' threshold (set to 15% of the total number of test units). In step 3 of the validation workflow, the 'critical' threshold is exceeded, so the message is displayed in the console.

Actions can be defined locally for individual validation steps, which will override any global actions set at the beginning of the validation workflow. Here's a variation of the above example where we set global threshold values but assign an action only for an individual validation step:

```{python}
def dq_issue():
    from datetime import datetime
    print(f"Data quality issue found ({datetime.now()}).")

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(
        columns="session_duration",
        value=15,
        actions=pb.Actions(warning=dq_issue),
    )
    .interrogate()
)

validation
```

In this case, the 'warning' action is set to call the `dq_issue()` function. This action is only executed when the 'warning' threshold is exceeded in the 'session_duration' column. Because all three thresholds are exceeded in step 3, the 'warning' action of executing the function occurs (resulting in a message being printed to the console). If actions were set for the other two threshold levels, they would also be executed.

See Also
--------

The [`get_action_metadata()`](`pointblank.get_action_metadata`) function, which can be used to retrieve metadata about the step where the action is executed.

FinalActions(*actions) -> 'None'

Define actions to be taken after validation is complete.

Final actions are executed after all validation steps have been completed. They provide a mechanism to respond to the overall validation results, such as sending alerts when critical failures are detected or generating summary reports.

Parameters
----------

*actions
    One or more actions to execute after validation. An action can be (1) a callable function that will be executed with no arguments, or (2) a string message that will be printed to the console.

Returns
-------

FinalActions
    A `FinalActions` object. This can be used when using the [`Validate`](`pointblank.Validate`) class (to set final actions for the validation workflow).

Types of Actions
----------------

Final actions can be defined in two different ways:

1. **String**: A message to be displayed when the validation is complete.
2. **Callable**: A function that is called when the validation is complete.

The actions are executed at the end of the validation workflow. When providing a string, it will simply be printed to the console. A callable will also be executed at the time of validation completion. Several strings and callables can be provided to the `FinalActions` class, and they will be executed in the order they are provided.

Crafting Callables with `get_validation_summary()`
--------------------------------------------------

When creating a callable function to be used as a final action, you can use the [`get_validation_summary()`](`pointblank.get_validation_summary`) function to retrieve the summary of the validation results. This summary contains information about the validation workflow, including the number of test units, the number of failing test units, and the threshold levels that were exceeded. You can use this information to craft your final action message or to take specific actions based on the validation results.

Examples
--------

Final actions provide a powerful way to respond to the overall results of a validation workflow. They're especially useful for sending notifications, generating reports, or taking corrective actions based on the complete validation outcome.

The following example shows how to create a final action that checks for critical failures and sends an alert:

```python
import pointblank as pb

def send_alert():
    summary = pb.get_validation_summary()
    if summary["highest_severity"] == "critical":
        print(f"ALERT: Critical validation failures found in {summary['tbl_name']}")

validation = (
    pb.Validate(
        data=my_data,
        final_actions=pb.FinalActions(send_alert)
    )
    .col_vals_gt(columns="revenue", value=0)
    .interrogate()
)
```

In this example, the `send_alert()` function is defined to check the validation summary for critical failures. If any are found, an alert message is printed to the console. The function is passed to the `FinalActions` class, which ensures it will be executed after all validation steps are complete. Note that we used the [`get_validation_summary()`](`pointblank.get_validation_summary`) function to retrieve the summary of the validation results to help craft the alert message.

Multiple final actions can be provided in a sequence.
They will be executed in the order they are specified after all validation steps have completed:

```python
validation = (
    pb.Validate(
        data=my_data,
        final_actions=pb.FinalActions(
            "Validation complete.",  # a string message
            send_alert,              # a callable function
            generate_report          # another callable function
        )
    )
    .col_vals_gt(columns="revenue", value=0)
    .interrogate()
)
```

See Also
--------

The [`get_validation_summary()`](`pointblank.get_validation_summary`) function, which can be used to retrieve the summary of the validation results.

Schema(columns: 'str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None' = None, tbl: 'Any | None' = None, **kwargs) -> 'None'

Definition of a schema object.

The schema object defines the structure of a table. Once it is defined, the object can be used in a validation workflow, using `Validate` and its methods, to ensure that the structure of a table matches the expected schema. The validation method that works with the schema object is called [`col_schema_match()`](`pointblank.Validate.col_schema_match`).

A schema for a table can be constructed with the `Schema` class in a number of ways:

1. providing a list of column names to `columns=` (to check only the column names)
2. using a list of one- or two-element tuples in `columns=` (to check column names and, optionally, dtypes; should be in the form `[(column_name, dtype), ...]`)
3. providing a dictionary to `columns=`, where the keys are column names and the values are dtypes
4. providing individual column arguments in the form of keyword arguments (constructed as `column_name=dtype`)

The schema object can also be constructed by providing a DataFrame or Ibis table object (using the `tbl=` parameter) and the schema will be collected from either type of object. The schema object can be printed to display the column names and dtypes. Note that if `tbl=` is provided then there shouldn't be any other inputs provided through either `columns=` or `**kwargs`.

Parameters
----------

columns
    A list of strings (representing column names), a list of tuples (for column names and column dtypes), or a dictionary containing column and dtype information. If any of these inputs are provided here, it will take precedence over any column arguments provided via `**kwargs`.

tbl
    A DataFrame (Polars or Pandas) or an Ibis table object from which the schema will be collected. Read the *Supported Input Table Types* section for details on the supported table types.

**kwargs
    Individual column arguments that are in the form of `column=dtype` or `column=[dtype1, dtype2, ...]`. These will be ignored if the `columns=` parameter is not `None`.

Returns
-------

Schema
    A schema object.

Supported Input Table Types
---------------------------

The `tbl=` parameter, if used, can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- PySpark table (`"pyspark"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Microsoft SQL Server table (`"mssql"`)*
- Snowflake table (`"snowflake"`)*
- Databricks table (`"databricks"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `Schema(tbl=)` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed.
If the input table is a Polars or Pandas DataFrame, Ibis is not required.

Additional Notes on Schema Construction
---------------------------------------

While there is flexibility in how a schema can be constructed, there is the potential for some confusion. So let's go through each of the methods of constructing a schema in more detail and single out some important points.

When providing a list of column names to `columns=`, a [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation step will only check the column names. Any arguments pertaining to dtypes will be ignored.

When using a list of tuples in `columns=`, the tuples could contain the column name and dtype or just the column name. This construction allows for more flexibility, as some columns will be checked for dtypes and others will not. This method is the only way to have mixed checks of column names and dtypes in [`col_schema_match()`](`pointblank.Validate.col_schema_match`).

When providing a dictionary to `columns=`, the keys are the column names and the values are the dtypes. This method of input is useful in those cases where you might already have a dictionary of column names and dtypes that you want to use as the schema.

If using individual column arguments in the form of keyword arguments, the column names are the keyword arguments and the dtypes are the values. This method emphasizes readability and is perhaps more convenient when manually constructing a schema with a small number of columns.

Finally, multiple dtypes can be provided for a single column by providing a list or tuple of dtypes in place of a scalar string value. Having multiple dtypes for a column allows the dtype check via [`col_schema_match()`](`pointblank.Validate.col_schema_match`) to make multiple attempts at matching the column dtype. Should any of the dtypes match the column dtype, that part of the schema check will pass. Here are some examples of how you could provide single and multiple dtypes for a column:

```python
# list of tuples
schema_1 = pb.Schema(columns=[("name", "String"), ("age", ["Float64", "Int64"])])

# dictionary
schema_2 = pb.Schema(columns={"name": "String", "age": ["Float64", "Int64"]})

# keyword arguments
schema_3 = pb.Schema(name="String", age=["Float64", "Int64"])
```

All of the above examples will construct the same schema object.

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

A schema can be constructed via the `Schema` class in multiple ways.
Let's use the following Polars DataFrame as a basis for constructing a schema:

```{python}
import pointblank as pb
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height": [5.6, 6.0, 5.8]
})
```

You could provide `Schema(columns=)` a list of tuples containing column names and data types:

```{python}
schema = pb.Schema(columns=[("name", "String"), ("age", "Int64"), ("height", "Float64")])
```

Alternatively, a dictionary containing column names and dtypes also works:

```{python}
schema = pb.Schema(columns={"name": "String", "age": "Int64", "height": "Float64"})
```

Another input method involves using individual column arguments in the form of keyword arguments:

```{python}
schema = pb.Schema(name="String", age="Int64", height="Float64")
```

Finally, you could also provide a DataFrame (Polars or Pandas) or an Ibis table object to `tbl=` and the schema will be collected:

```python
schema = pb.Schema(tbl=df)
```

Whichever method you choose, you can verify the schema inputs by printing the `schema` object:

```{python}
print(schema)
```

The `Schema` object can be used to validate the structure of a table against the schema. The relevant `Validate` method for this is [`col_schema_match()`](`pointblank.Validate.col_schema_match`). In a validation workflow, you'll have a target table (defined at the beginning of the workflow) and you might want to ensure that your expectations of the table structure are met. The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) method works with a `Schema` object to validate the structure of the table. Here's an example of how you could use [`col_schema_match()`](`pointblank.Validate.col_schema_match`) in a validation workflow:

```{python}
# Define the schema
schema = pb.Schema(name="String", age="Int64", height="Float64")

# Define a validation that checks the schema against the table (`df`)
validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

# Display the validation results
validation
```

The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation method will validate the structure of the table against the schema during interrogation. If the structure of the table does not match the schema, the single test unit will fail. In this case, the defined schema matched the structure of the table, so the validation passed.

We can also choose to check only the column names of the target table. This can be done by providing a simplified `Schema` object, which is given a list of column names:

```{python}
schema = pb.Schema(columns=["name", "age", "height"])

validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
```

In this case, the schema only checks the column names of the table against the schema during interrogation. If the column names of the table do not match the schema, the single test unit will fail. Here, the defined schema matched the column names of the table, so the validation passed.

If you wanted to check column names and dtypes only for a subset of columns (and just the column names for the rest), you could use a list of mixed one- or two-item tuples in `columns=`:

```{python}
schema = pb.Schema(columns=[("name", "String"), ("age", ), ("height", )])

validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
```

Not specifying a dtype for a column (as is the case for the `age` and `height` columns in the above example) will only check the column name.

There may also be the case where you want to check the column names and specify multiple dtypes for a column to have several attempts at matching the dtype. This can be done by providing a list of dtypes where there would normally be a single dtype:

```{python}
schema = pb.Schema(
    columns=[("name", "String"), ("age", ["Float64", "Int64"]), ("height", "Float64")]
)

validation = (
    pb.Validate(data=df)
    .col_schema_match(schema=schema)
    .interrogate()
)

validation
```

For the `age` column, the schema will check for both `Float64` and `Int64` dtypes. If either of these dtypes is found in the column, that portion of the schema check will succeed.

See Also
--------

The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation method, where a `Schema` object is used in a validation workflow.

DraftValidation(data: 'Any', model: 'str', api_key: 'str | None' = None, verify_ssl: 'bool' = True) -> None

Draft a validation plan for a given table using an LLM.

By using a large language model (LLM) to draft a validation plan, you can quickly generate a starting point for validating a table. This can be useful when you have a new table and you want to get a sense of how to validate it (and adjustments could always be made later). The `DraftValidation` class uses the `chatlas` package to draft a validation plan for a given table using an LLM from the `"anthropic"`, `"openai"`, `"ollama"`, or `"bedrock"` provider. You can install all requirements for the class through an optional 'generate' install of Pointblank via `pip install pointblank[generate]`.

:::{.callout-warning}
The `DraftValidation` class is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
:::

Parameters
----------

data
    The data to be used for drafting a validation plan.

model
    The model to be used. This should be in the form of `provider:model` (e.g., `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`, `"ollama"`, and `"bedrock"`.

api_key
    The API key to be used for the model.

verify_ssl
    Whether to verify SSL certificates when making requests to the LLM provider. Set to `False` to disable SSL verification (e.g., when behind a corporate firewall with self-signed certificates). Defaults to `True`. Use with caution as disabling SSL verification can pose security risks.

Returns
-------

str
    The drafted validation plan.

Constructing the `model` Argument
---------------------------------

The `model=` argument should be constructed using the provider and model name separated by a colon (`provider:model`). The provider text can be any of:

- `"anthropic"` (Anthropic)
- `"openai"` (OpenAI)
- `"ollama"` (Ollama)
- `"bedrock"` (Amazon Bedrock)

The model name should be the specific model to be used from the provider. Model names are subject to change, so consult the provider's documentation for the most up-to-date model names.

Notes on Authentication
-----------------------

Providing a valid API key as a string in the `api_key` argument is adequate for getting started, but you should consider using a more secure method for handling API keys. One way to do this is to load the API key from an environment variable and retrieve it using the `os` module (specifically the `os.getenv()` function). Places to store the API key might include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.

Another solution is to store one or more model provider API keys in an `.env` file (in the root of your project).
If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`) then `DraftValidation` will automatically load the API key from the `.env` file and there's no need to provide the `api_key` argument. An `.env` file might look like this:

```plaintext
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
```

There's no need to have the `python-dotenv` package installed when using `.env` files in this way.

Notes on SSL Certificate Verification
-------------------------------------

By default, SSL certificate verification is enabled for all requests to LLM providers. However, in certain network environments (such as corporate networks with self-signed certificates or firewall proxies), you may encounter SSL certificate verification errors. To disable SSL verification, set the `verify_ssl` parameter to `False`:

```python
import pointblank as pb

data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

# Disable SSL verification for networks with self-signed certificates
pb.DraftValidation(
    data=data,
    model="anthropic:claude-sonnet-4-5",
    verify_ssl=False
)
```

:::{.callout-warning}
Disabling SSL verification (through `verify_ssl=False`) can expose your API keys and data to man-in-the-middle attacks. Only use this option in trusted network environments and when absolutely necessary.
:::

Notes on Data Sent to the Model Provider
----------------------------------------

The data sent to the model provider is a JSON summary of the table. This data summary is generated internally by `DraftValidation` using the `DataScan` class. The summary includes the following information:

- the number of rows and columns in the table
- the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
- the column names and their types
- column-level statistics such as the number of missing values, min, max, mean, and median, etc.
- a short list of data values in each column

The JSON summary is used to provide the model with the necessary information to draft a validation plan. As such, even very large tables can be used with the `DraftValidation` class since the contents of the table are not sent to the model provider.

Amazon Bedrock is a special case since it is a self-hosted model and security controls are in place to ensure that data is kept within the user's AWS environment. If using an Ollama model, all data is handled locally, though only a few models are capable enough to perform the task of drafting a validation plan.

Examples
--------

Let's look at how the `DraftValidation` class can be used to draft a validation plan for a table. The table to be used is `"nycflights"`, which is available via the [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is `"anthropic:claude-sonnet-4-5"` (which performs very well compared to other LLMs). The example assumes that the API key is stored in an `.env` file as `ANTHROPIC_API_KEY`.

```python
import pointblank as pb

# Load the "nycflights" dataset as a DuckDB table
data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

# Draft a validation plan for the "nycflights" table
pb.DraftValidation(data=data, model="anthropic:claude-sonnet-4-5")
```

The output will be a drafted validation plan for the `"nycflights"` table and this will appear in the console.

````plaintext
```python
import pointblank as pb

# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
    ("year", "int64"),
    ("month", "int64"),
    ("day", "int64"),
    ("dep_time", "int64"),
    ("sched_dep_time", "int64"),
    ("dep_delay", "int64"),
    ("arr_time", "int64"),
    ("sched_arr_time", "int64"),
    ("arr_delay", "int64"),
    ("carrier", "string"),
    ("flight", "int64"),
    ("tailnum", "string"),
    ("origin", "string"),
    ("dest", "string"),
    ("air_time", "int64"),
    ("distance", "int64"),
    ("hour", "int64"),
    ("minute", "int64")
])

# The validation plan
validation = (
    pb.Validate(
        data=your_data,  # Replace your_data with the actual data variable
        label="Draft Validation",
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
    )
    .col_schema_match(schema=schema)
    .col_vals_not_null(columns=[
        "year", "month", "day", "sched_dep_time", "carrier", "flight",
        "origin", "dest", "distance", "hour", "minute"
    ])
    .col_vals_between(columns="month", left=1, right=12)
    .col_vals_between(columns="day", left=1, right=31)
    .col_vals_between(columns="sched_dep_time", left=106, right=2359)
    .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
    .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
    .col_vals_between(columns="distance", left=17, right=4983)
    .col_vals_between(columns="hour", left=1, right=23)
    .col_vals_between(columns="minute", left=0, right=59)
    .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
    .col_count_match(count=18)
    .row_count_match(count=336776)
    .rows_distinct()
    .interrogate()
)

validation
```
````

The drafted validation plan can be copied and pasted into a Python script or notebook for further use. In other words, the generated plan can be adjusted as needed to suit the specific requirements of the table being validated.

Note that the output does not know how the data was obtained, so it uses the placeholder `your_data` in the `data=` argument of the `Validate` class. When adapted for use, this should be replaced with the actual data variable.

## Validation Steps

Validation steps are sequential validations on the target data. Call `Validate`'s validation methods to build up a validation plan: a collection of steps that provides good validation coverage.

col_vals_gt(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data greater than a fixed value or data in another column?

The `col_vals_gt()` validation method checks whether column values in a table are *greater than* a specified `value=` (the exact comparison used in this function is `col_val > value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------

columns
    A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.

value
    The value to compare against.
This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section.

na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.

pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.

segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.

thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.

actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.

brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.

active
A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------

Validate
The `Validate` object with the added validation step.

What Can Be Used in `value=`?
-----------------------------

The `value=` argument allows for a variety of input types. The most common are:

- a single numeric value
- a single date or datetime value
- A [`col()`](`pointblank.col`) object that represents a column name

When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be:

- a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.)
- a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.)
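For instance, here's a minimal sketch of a date-based comparison (the table and its column `d` are hypothetical, made up for illustration):

```python
import datetime

import polars as pl
import pointblank as pb

# A hypothetical table with a date column `d`
tbl = pl.DataFrame({"d": [datetime.date(2023, 10, 2), datetime.date(2023, 11, 5)]})

validation = (
    pb.Validate(data=tbl)
    # The string form value="2023-10-01" should work equally well here
    .col_vals_gt(columns="d", value=datetime.date(2023, 10, 1))
    .interrogate()
)
```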
Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. 
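As a minimal sketch of that override behavior (the table and values here are hypothetical, and the tuple and single-value input schemes used are described just below), a step-level `thresholds=` takes precedence over the global setting for that step only:

```python
import polars as pl
import pointblank as pb

tbl = pl.DataFrame({"a": [5, 6, 2, 7, 3, 5]})

validation = (
    pb.Validate(data=tbl, thresholds=(0.10, 0.25))   # global 'warning' and 'error' levels
    .col_vals_gt(columns="a", value=4)                # uses the global thresholds
    .col_vals_gt(columns="a", value=1, thresholds=1)  # overrides: 'warning' at 1 failing unit
    .interrogate()
)
```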
There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 7, 6, 5],
        "b": [1, 2, 1, 2, 2, 2],
        "c": [2, 1, 2, 2, 3, 4],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all greater than the value of `4`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_gt(columns="a", value=4)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_gt()`. All test units passed, and there are no failing test units.

Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_gt()` to check whether the values in column `c` are greater than values in column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_gt(columns="c", value=pb.col("b"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 1: `c` is `1` and `b` is `2`.
- Row 3: `c` is `2` and `b` is `2`.

col_vals_lt(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data less than a fixed value or data in another column?

The `col_vals_lt()` validation method checks whether column values in a table are *less than* a specified `value=` (the exact comparison used in this function is `col_val < value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------

columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.

value
The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section.

na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.

pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.

segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.

thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.

actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.

brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.

active
A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------

Validate
The `Validate` object with the added validation step.

What Can Be Used in `value=`?
-----------------------------

The `value=` argument allows for a variety of input types.
The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. 
This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 9, 7, 5],
        "b": [1, 2, 1, 2, 2, 2],
        "c": [2, 1, 1, 4, 3, 4],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all less than the value of `10`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_lt(columns="a", value=10)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_lt()`. All test units passed, and there are no failing test units.

Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_lt()` to check whether the values in column `b` are less than values in column `c`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_lt(columns="b", value=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 1: `b` is `2` and `c` is `1`.
- Row 2: `b` is `1` and `c` is `1`.

col_vals_ge(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data greater than or equal to a fixed value or data in another column?

The `col_vals_ge()` validation method checks whether column values in a table are *greater than or equal to* a specified `value=` (the exact comparison used in this function is `col_val >= value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------

columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.

value
The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section.

na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.

pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.

segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.

thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.

actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.

brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.

active
A boolean value or callable that determines whether the validation step should be active.
Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. 
Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 9, 7, 5],
        "b": [5, 3, 1, 8, 2, 3],
        "c": [2, 3, 1, 4, 3, 4],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all greater than or equal to the value of `5`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_ge(columns="a", value=5)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment.
The validation table shows the single entry that corresponds to the validation step created by using `col_vals_ge()`. All test units passed, and there are no failing test units.

Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_ge()` to check whether the values in column `b` are greater than or equal to the values in column `c`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_ge(columns="b", value=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 4: `b` is `2` and `c` is `3`.
- Row 5: `b` is `3` and `c` is `4`.

col_vals_le(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data less than or equal to a fixed value or data in another column?

The `col_vals_le()` validation method checks whether column values in a table are *less than or equal to* a specified `value=` (the exact comparison used in this function is `col_val <= value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------

columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.

value
The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section.

na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.

pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.

segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.

thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. 
The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).
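To make those input schemes concrete, here's a hedged sketch of the four equivalent ways to express the same levels (the fractions are arbitrary; any of these can be passed as `thresholds=` globally in `Validate()` or at the step level):

```python
import pointblank as pb

# 1. The Thresholds class (the most direct way)
t1 = pb.Thresholds(warning=0.10, error=0.25, critical=0.35)

# 2. A tuple: (warning, error, critical)
t2 = (0.10, 0.25, 0.35)

# 3. A dictionary with any of the valid keys
t3 = {"warning": 0.10, "error": 0.25, "critical": 0.35}

# 4. A single value sets the 'warning' level only
t4 = 0.10
```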
Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 9, 7, 5],
        "b": [1, 3, 1, 5, 2, 5],
        "c": [2, 1, 1, 4, 3, 4],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all less than or equal to the value of `9`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_le(columns="a", value=9)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_le()`. All test units passed, and there are no failing test units.

Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_le()` to check whether the values in column `c` are less than or equal to the values in column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_le(columns="c", value=pb.col("b"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 0: `c` is `2` and `b` is `1`.
- Row 4: `c` is `3` and `b` is `2`.

col_vals_eq(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data equal to a fixed value or data in another column?

The `col_vals_eq()` validation method checks whether column values in a table are *equal to* a specified `value=` (the exact comparison used in this function is `col_val == value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------

columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.

value
The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section.

na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.

pre
An optional preprocessing function or lambda to apply to the data table during interrogation.
This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.

segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.

thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.

actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.

brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.

active
A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------

Validate
The `Validate` object with the added validation step.

What Can Be Used in `value=`?
-----------------------------

The `value=` argument allows for a variety of input types. The most common are:

- a single numeric value
- a single date or datetime value
- A [`col()`](`pointblank.col`) object that represents a column name

When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be:

- a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.)
- a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.)

Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.).

Preprocessing
-------------

The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table.
This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing.

Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Segmentation
------------

The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 5, 5, 5, 5, 5],
        "b": [5, 5, 5, 6, 5, 4],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all equal to the value of `5`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_eq(columns="a", value=5)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_eq()`. All test units passed, and there are no failing test units.

Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_eq()` to check whether the values in column `a` are equal to the values in column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_eq(columns="a", value=pb.col("b"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 3: `a` is `5` and `b` is `6`.
- Row 5: `a` is `5` and `b` is `4`.

col_vals_ne(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data not equal to a fixed value or data in another column?

The `col_vals_ne()` validation method checks whether column values in a table are *not equal to* a specified `value=` (the exact comparison used in this function is `col_val != value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------

columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns.
If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.

value
The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section.

na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.

pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.

segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.

thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.

actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.

brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.

active
A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------

Validate
The `Validate` object with the added validation step.

What Can Be Used in `value=`?
-----------------------------

The `value=` argument allows for a variety of input types. The most common are:

- a single numeric value
- a single date or datetime value
- A [`col()`](`pointblank.col`) object that represents a column name

When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be:

- a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.)
- a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.)

Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.).

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Segmentation
------------
The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step.
If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 5, 5, 5, 5, 5],
        "b": [5, 6, 3, 6, 5, 8],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are not equal to the value of `3`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_ne(columns="a", value=3)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_ne()`. All test units passed, and there are no failing test units.

Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_ne()` to check whether the values in column `a` aren't equal to the values in column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_ne(columns="a", value=pb.col("b"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are in rows 0 and 4, where `a` is `5` and `b` is `5` in both cases (i.e., they are equal to each other).
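The *Thresholds* section above lists the accepted input schemes but doesn't show one applied to a step, so here is a minimal, illustrative sketch (reusing the `tbl` from these examples; the specific threshold numbers are arbitrary choices for demonstration, not library defaults). The tuple form sets the 'warning' level at one failing test unit and the 'error' level at two:

```
# A sketch: step-level thresholds via the tuple form (values are arbitrary)
validation = (
    pb.Validate(data=tbl)
    .col_vals_ne(columns="a", value=pb.col("b"), thresholds=(1, 2))
    .interrogate()
)

validation
```

Given the two failing test units reported above, both the 'warning' and 'error' levels would be reached under this sketch.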
col_vals_between(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', left: 'float | int | Column', right: 'float | int | Column', inclusive: 'tuple[bool, bool]' = (True, True), na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Do column data lie between two specified values or data in other columns?

The `col_vals_between()` validation method checks whether column values in a table fall within a range. The range is specified with three arguments: `left=`, `right=`, and `inclusive=`. The `left=` and `right=` values specify the lower and upper bounds. These bounds can be specified as literal values or as column names provided within [`col()`](`pointblank.col`). The validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
    A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
left
    The lower bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
right
    The upper bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
inclusive
    A tuple of two boolean values indicating whether the comparison should be inclusive. The positions of the boolean values correspond to the `left=` and `right=` values, respectively. By default, both values are `True`.
na_pass
    Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step(s) meets or exceeds any set threshold levels.
    If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

What Can Be Used in `left=` and `right=`?
-----------------------------------------
The `left=` and `right=` arguments both allow for a variety of input types. The most common are:

- a single numeric value
- a single date or datetime value
- a [`col()`](`pointblank.col`) object that represents a column in the target table

When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value within `left=` and `right=`. There is flexibility in how you provide the date or datetime values for the bounds; they can be:

- string-based dates or datetimes (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.)
- date or datetime objects using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.)

Finally, when supplying a column name in either `left=` or `right=` (or both), it must be specified within [`col()`](`pointblank.col`). This facilitates column-to-column comparisons and, crucially, the columns being compared to either/both of the bounds must be of the same type as the column data (e.g., all numeric, all dates, etc.).

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Note that you can refer to columns via `columns=` and `left=col(...)`/`right=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Segmentation
------------
The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data.
The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).
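To make the four input schemes above concrete, here is an illustrative sketch of ways to pass `thresholds=` (the particular values are arbitrary placeholders, not library defaults); the first three forms set the same three levels, while the fourth sets only the 'warning' level:

```
# 1. The `Thresholds` class
thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)

# 2. A tuple: positions 0, 1, and 2 map to 'warning', 'error', 'critical'
thresholds=(0.10, 0.25, 0.35)

# 3. A dictionary with any of the keys 'warning', 'error', 'critical'
thresholds={"warning": 0.10, "error": 0.25, "critical": 0.35}

# 4. A single value: sets the 'warning' level only
thresholds=0.10
```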
Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [2, 3, 2, 4, 3, 4],
        "b": [5, 6, 1, 6, 8, 5],
        "c": [9, 8, 8, 7, 7, 8],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all between the fixed boundary values of `1` and `5`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_between(columns="a", left=1, right=5)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_between()`. All test units passed, and there are no failing test units.

Aside from checking a column against two literal values representing the lower and upper bounds, we can also provide column names to the `left=` and/or `right=` arguments (by using the helper function [`col()`](`pointblank.col`)). In this way, we can perform three additional comparison types:

1. `left=column`, `right=column`
2. `left=literal`, `right=column`
3. `left=column`, `right=literal`

For the next example, we'll use `col_vals_between()` to check whether the values in column `b` are between the corresponding values in columns `a` (lower bound) and `c` (upper bound).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_between(columns="b", left=pb.col("a"), right=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 2: `b` is `1` but the bounds are `2` (`a`) and `8` (`c`).
- Row 4: `b` is `8` but the bounds are `3` (`a`) and `7` (`c`).

col_vals_outside(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', left: 'float | int | Column', right: 'float | int | Column', inclusive: 'tuple[bool, bool]' = (True, True), na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Do column data lie outside of two specified values or data in other columns?

The `col_vals_outside()` validation method checks whether column values in a table *do not* fall within a certain range. The range is specified with three arguments: `left=`, `right=`, and `inclusive=`. The `left=` and `right=` values specify the lower and upper bounds. These bounds can be specified as literal values or as column names provided within [`col()`](`pointblank.col`). The validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
    A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
left
    The lower bound of the range.
    This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
right
    The upper bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
inclusive
    A tuple of two boolean values indicating whether the comparison should be inclusive. The positions of the boolean values correspond to the `left=` and `right=` values, respectively. By default, both values are `True`.
na_pass
    Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

What Can Be Used in `left=` and `right=`?
-----------------------------------------
The `left=` and `right=` arguments both allow for a variety of input types.
The most common are:

- a single numeric value
- a single date or datetime value
- a [`col()`](`pointblank.col`) object that represents a column in the target table

When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value within `left=` and `right=`. There is flexibility in how you provide the date or datetime values for the bounds; they can be:

- string-based dates or datetimes (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.)
- date or datetime objects using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.)

Finally, when supplying a column name in either `left=` or `right=` (or both), it must be specified within [`col()`](`pointblank.col`). This facilitates column-to-column comparisons and, crucially, the columns being compared to either/both of the bounds must be of the same type as the column data (e.g., all numeric, all dates, etc.).

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Note that you can refer to columns via `columns=` and `left=col(...)`/`right=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Segmentation
------------
The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially.
Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 7, 5, 5],
        "b": [2, 3, 6, 4, 3, 6],
        "c": [9, 8, 8, 9, 9, 7],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all outside the fixed boundary values of `1` and `4`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_outside(columns="a", left=1, right=4)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_outside()`. All test units passed, and there are no failing test units.

Aside from checking a column against two literal values representing the lower and upper bounds, we can also provide column names to the `left=` and/or `right=` arguments (by using the helper function [`col()`](`pointblank.col`)). In this way, we can perform three additional comparison types:

1. `left=column`, `right=column`
2. `left=literal`, `right=column`
3. `left=column`, `right=literal`
For the next example, we'll use `col_vals_outside()` to check whether the values in column `b` are outside of the range formed by the corresponding values in columns `a` (lower bound) and `c` (upper bound).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_outside(columns="b", left=pb.col("a"), right=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 2: `b` is `6` and the bounds are `5` (`a`) and `8` (`c`).
- Row 5: `b` is `6` and the bounds are `5` (`a`) and `7` (`c`).

col_vals_in_set(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', set: 'Collection[Any]', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Validate whether column values are in a set of values.

The `col_vals_in_set()` validation method checks whether column values in a table are part of a specified `set=` of values. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
    A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
set
    A collection of values to compare against. Can be a list of values, a Python Enum class, or a collection containing Enum instances. When an Enum class is provided, all enum values will be used. When a collection contains Enum instances, their values will be extracted automatically.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Segmentation
------------
The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation.
For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 2, 4, 6, 2, 5],
        "b": [5, 8, 2, 6, 5, 1],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all in the set of `[2, 3, 4, 5, 6]`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_in_set(columns="a", set=[2, 3, 4, 5, 6])
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_in_set()`. All test units passed, and there are no failing test units.

Now, let's use that same set of values for a validation on column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_in_set(columns="b", set=[2, 3, 4, 5, 6])
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are for the column `b` values of `8` and `1`, which are not in the set of `[2, 3, 4, 5, 6]`.
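Since `columns=` also accepts a list of column names (with a separate validation step generated for each column), the two checks above could be combined into a single call. A minimal sketch, reusing the same `tbl`:

```
# One call, two validation steps (one per column)
validation = (
    pb.Validate(data=tbl)
    .col_vals_in_set(columns=["a", "b"], set=[2, 3, 4, 5, 6])
    .interrogate()
)

validation
```

This would produce two sequentially numbered validation steps, one per column, with the same passing and failing test units as the separate calls.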
**Using Python Enums**

The `col_vals_in_set()` method also supports Python Enum classes and instances, which can make validations more readable and maintainable:

```{python}
from enum import Enum

class Color(Enum):
    RED = "red"
    GREEN = "green"
    BLUE = "blue"

# Create a table with color data
tbl_colors = pl.DataFrame({
    "product": ["shirt", "pants", "hat", "shoes"],
    "color": ["red", "blue", "green", "yellow"]
})

# Validate using an Enum class (all enum values are allowed)
validation = (
    pb.Validate(data=tbl_colors)
    .col_vals_in_set(columns="color", set=Color)
    .interrogate()
)

validation
```

This validation will fail for the `"yellow"` value since it's not in the `Color` enum. You can also use specific Enum instances or mix them with regular values:

```{python}
# Validate using specific Enum instances
validation = (
    pb.Validate(data=tbl_colors)
    .col_vals_in_set(columns="color", set=[Color.RED, Color.BLUE])
    .interrogate()
)

# Mix Enum instances with regular values
validation = (
    pb.Validate(data=tbl_colors)
    .col_vals_in_set(columns="color", set=[Color.RED, Color.BLUE, "yellow"])
    .interrogate()
)

validation
```

In the final case shown (mixing Enum instances with the regular value `"yellow"`), the `"green"` value will cause a failing test unit since it's not part of the specified set.

col_vals_not_in_set(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', set: 'Collection[Any]', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Validate whether column values are not in a set of values.

The `col_vals_not_in_set()` validation method checks whether column values in a table are *not* part of a specified `set=` of values. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
    A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
set
    A collection of values to compare against. Can be a list of values, a Python Enum class, or a collection containing Enum instances. When an Enum class is provided, all enum values will be used. When a collection contains Enum instances, their values will be extracted automatically.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect.
    Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Segmentation
------------
The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column.

Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region.

Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them).

A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios.
The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [7, 8, 1, 9, 1, 7],
        "b": [1, 8, 2, 6, 9, 1],
    }
)

pb.preview(tbl)
```

Let's validate that none of the values in column `a` are in the set of `[2, 3, 4, 5, 6]`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_not_in_set(columns="a", set=[2, 3, 4, 5, 6])
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_not_in_set()`. All test units passed, and there are no failing test units.

Now, let's use that same set of values for a validation on column `b`.
```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_not_in_set(columns="b", set=[2, 3, 4, 5, 6])
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are for the column `b` values of `2` and `6`, both of which are in the set of `[2, 3, 4, 5, 6]`.

**Using Python Enums**

Like `col_vals_in_set()`, this method also supports Python Enum classes and instances:

```{python}
from enum import Enum

class InvalidStatus(Enum):
    DELETED = "deleted"
    ARCHIVED = "archived"

# Create a table with status data
status_table = pl.DataFrame({
    "product": ["widget", "gadget", "tool", "device"],
    "status": ["active", "pending", "deleted", "active"]
})

# Validate that no values are in the invalid status set
validation = (
    pb.Validate(data=status_table)
    .col_vals_not_in_set(columns="status", set=InvalidStatus)
    .interrogate()
)

validation
```

The `"deleted"` value in the `status` column will produce a failing test unit since it matches one of the invalid statuses in the `InvalidStatus` enum.

col_vals_increasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, decreasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Are column data increasing by row?

The `col_vals_increasing()` validation method checks whether column values in a table are increasing when moving down a table. There are options for allowing missing values in the target column, allowing stationary phases (where consecutive values don't change), and even one for allowing decreasing movements up to a certain threshold. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
    A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
allow_stationary
    An option to allow pauses in increasing values. For example, if the values for the test units are `[80, 82, 82, 85, 88]` then the third unit (`82`, appearing a second time) would be marked as failing when `allow_stationary` is `False`. Using `allow_stationary=True` will result in all the test units in `[80, 82, 82, 85, 88]` being marked as passing.
decreasing_tol
    An optional threshold value that allows for movement of numerical values in the negative direction. By default this is `None` but using a numerical value will set the absolute threshold of negative travel allowed across numerical test units. Note that setting a value here also has the effect of setting `allow_stationary` to `True`.
na_pass
    Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [1, 2, 2, 3, 4, 5],
        "c": [1, 2, 1, 3, 4, 5],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are increasing. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_increasing(columns="a")
    .interrogate()
)

validation
```

The validation passed as all values in column `a` are increasing. Now let's check column `b`, which has a stationary value:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_increasing(columns="b")
    .interrogate()
)

validation
```

This validation fails at the third row because the value `2` is repeated.
If we only want to allow stationary values (without tolerating any decreases), we can use `allow_stationary=True`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_increasing(columns="b", allow_stationary=True)
    .interrogate()
)
validation
```

col_vals_decreasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, increasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Are column data decreasing by row? The `col_vals_decreasing()` validation method checks whether column values in a table are decreasing when moving down a table. There are options for allowing missing values in the target column, allowing stationary phases (where consecutive values don't change), and even one for allowing increasing movements up to a certain threshold. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. allow_stationary An option to allow pauses in decreasing values. For example, if the values for the test units are `[88, 85, 85, 82, 80]` then the third unit (`85`, appearing a second time) would be marked as failing when `allow_stationary` is `False`. Using `allow_stationary=True` will result in all the test units in `[88, 85, 85, 82, 80]` being marked as passing. increasing_tol An optional threshold value that allows for movement of numerical values in the positive direction. By default this is `None` but using a numerical value will set the absolute threshold of positive travel allowed across numerical test units. Note that setting a value here also has the effect of setting `allow_stationary` to `True`. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels.
If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [6, 5, 4, 3, 2, 1],
        "b": [5, 4, 4, 3, 2, 1],
        "c": [5, 4, 5, 3, 2, 1],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are decreasing. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_decreasing(columns="a")
    .interrogate()
)
validation
```

The validation passed as all values in column `a` are decreasing. Now let's check column `b`, which has a stationary value:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_decreasing(columns="b")
    .interrogate()
)
validation
```

This validation fails at the third row because the value `4` is repeated. If we want to allow stationary values, we can use `allow_stationary=True`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_decreasing(columns="b", allow_stationary=True)
    .interrogate()
)
validation
```

col_vals_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether values in a column are Null. The `col_vals_null()` validation method checks whether column values in a table are Null. This validation will operate over the number of test units that is equal to the number of rows in the table. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table.
Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column.
For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`).
The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [None, None, None, None],
        "b": [None, 2, None, 9],
    }
).with_columns(pl.col("a").cast(pl.Int64))

pb.preview(tbl)
```

Let's validate that values in column `a` are all Null values. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_null(columns="a")
    .interrogate()
)
validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_null()`. All test units passed, and there are no failing test units. Now, let's perform the same validation on column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_null(columns="b")
    .interrogate()
)
validation
```

The validation table reports two failing test units. The specific failing cases are for the two non-Null values in column `b`. col_vals_not_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether values in a column are not Null. The `col_vals_not_null()` validation method checks whether column values in a table are not Null. This validation will operate over the number of test units that is equal to the number of rows in the table. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated.
If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.
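Here is a minimal sketch of that pattern (the table `sales_tbl` and its columns are hypothetical, not part of the examples in these docs):

```
# Build a `segment` column during preprocessing, then segment on it
validation = (
    pb.Validate(data=sales_tbl)
    .col_vals_not_null(
        columns="revenue",
        pre=lambda df: df.with_columns(
            pl.when(pl.col("units") > 100)
            .then(pl.lit("high-volume"))
            .otherwise(pl.lit("low-volume"))
            .alias("segment")
        ),
        segments="segment",
    )
    .interrogate()
)
```

Since the transformed table only exists during interrogation, the derived `segment` column never becomes part of the stored data.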
Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [4, 7, 2, 8],
        "b": [5, None, 1, None],
    }
)

pb.preview(tbl)
```

Let's validate that none of the values in column `a` are Null values. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_not_null(columns="a")
    .interrogate()
)
validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_not_null()`. All test units passed, and there are no failing test units. Now, let's perform the same validation on column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_not_null(columns="b")
    .interrogate()
)
validation
```

The validation table reports two failing test units. The specific failing cases are for the two Null values in column `b`. col_vals_regex(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pattern: 'str', na_pass: 'bool' = False, inverse: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether column values match a regular expression pattern.
The `col_vals_regex()` validation method checks whether column values in a table correspond to a `pattern=` matching expression. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. pattern A regular expression pattern to compare against. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. inverse Should the validation step be inverted? If `True`, then the expectation is that column values should *not* match the specified `pattern=` regex. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table.
This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2.
provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with two string columns (`a` and `b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": ["rb-0343", "ra-0232", "ry-0954", "rc-1343"],
        "b": ["ra-0628", "ra-583", "rya-0826", "rb-0735"],
    }
)

pb.preview(tbl)
```

Let's validate that all of the values in column `a` match a particular regex pattern. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_regex(columns="a", pattern=r"r[a-z]-[0-9]{4}")
    .interrogate()
)
validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_regex()`. All test units passed, and there are no failing test units. Now, let's use the same regex for a validation on column `b`.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_regex(columns="b", pattern=r"r[a-z]-[0-9]{4}")
    .interrogate()
)
validation
```

The validation table reports two failing test units. The specific failing cases are the string values of rows 2 and 3 in column `b` (`"ra-583"` has only three digits and `"rya-0826"` has two letters after the leading `r`). col_vals_within_spec(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', spec: 'str', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether column values fit within a specification. The `col_vals_within_spec()` validation method checks whether column values in a table correspond to a specification (`spec=`) type (details of which are available in the *Specifications* section). Specifications include common data types like email addresses, URLs, postal codes, vehicle identification numbers (VINs), International Bank Account Numbers (IBANs), and more. This validation will operate over the number of test units that is equal to the number of rows in the table. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. spec A specification string for defining the specification type.
Examples are `"email"`, `"url"`, and `"postal_code[USA]"`. See the *Specifications* section for all available options. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Specifications -------------- A specification type must be used with the `spec=` argument. This is a string-based keyword that corresponds to the type of data in the specified columns. The following keywords can be used: - `"isbn"`: The International Standard Book Number (ISBN) is a unique numerical identifier for books. This keyword validates both 10-digit and 13-digit ISBNs. - `"vin"`: A vehicle identification number (VIN) is a unique code used by the automotive industry to identify individual motor vehicles. - `"postal_code[]"`: A postal code (also known as postcodes, PIN, or ZIP codes) is a series of letters, digits, or both included in a postal address. Because the coding varies by country, a country code in either the 2-letter (ISO 3166-1 alpha-2) or 3-letter (ISO 3166-1 alpha-3) format needs to be supplied (e.g., `"postal_code[US]"` or `"postal_code[USA]"`). The keyword alias `"zip"` can be used for US ZIP codes.
- `"credit_card"`: A credit card number can be validated across a variety of issuers. The validation uses the Luhn algorithm. - `"iban[]"`: The International Bank Account Number (IBAN) is a system of identifying bank accounts across countries. Because the length and coding varies by country, a country code needs to be supplied (e.g., `"iban[DE]"` or `"iban[DEU]"`). - `"swift"`: Business Identifier Codes (also known as SWIFT-BIC, BIC, or SWIFT code) are unique identifiers for financial and non-financial institutions. - `"phone"`, `"email"`, `"url"`, `"ipv4"`, `"ipv6"`, `"mac"`: Phone numbers, email addresses, Internet URLs, IPv4 or IPv6 addresses, and MAC addresses can be validated with their respective keywords. Only a single `spec=` value should be provided per function call. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.
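As a minimal sketch of combining a specification check with segmentation (the table `users_tbl` and its columns are hypothetical, not part of the examples in these docs):

```
# Validate email format separately for each value of `region`
validation = (
    pb.Validate(data=users_tbl)
    .col_vals_within_spec(columns="email", spec="email", segments="region")
    .interrogate()
)
```

Each unique value in `region` would yield its own sequentially numbered validation step in the report.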
Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with an email column. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "email": [
            "user@example.com",
            "admin@test.org",
            "invalid-email",
            "contact@company.co.uk",
        ],
    }
)

pb.preview(tbl)
```

Let's validate that all of the values in the `email` column are valid email addresses. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_within_spec(columns="email", spec="email")
    .interrogate()
)
validation
```

The validation table shows that one test unit failed (the invalid email address in row 3). col_vals_expr(self, expr: 'Any', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate column values using a custom expression. The `col_vals_expr()` validation method checks whether column values in a table satisfy a custom `expr=` expression. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- expr A column expression that will evaluate each row in the table, returning a boolean value per table row. If the target table is a Polars DataFrame, the expression should either be a Polars column expression or a Narwhals one.
For a Pandas DataFrame, the expression should either be a lambda expression or a Narwhals column expression. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column.
For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`).
The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 1, 7, 8, 6],
        "b": [0, 0, 0, 1, 1, 1],
        "c": [0.5, 0.3, 0.8, 1.4, 1.9, 1.2],
    }
)

pb.preview(tbl)
```

Let's validate that the values in column `a` are all integers. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_vals_expr(expr=pl.col("a") % 1 == 0)
    .interrogate()
)
validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_expr()`. All test units passed, with no failing test units. col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether one or more columns exist in the table. The `col_exists()` method checks whether one or more columns exist in the target table. The only requirement is specification of the column names. Each validation step or expectation will operate over a single test unit, which is whether the column exists or not. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step.
If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples --------

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with a string column (`a`) and a numeric column (`b`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": ["apple", "banana", "cherry", "date"],
        "b": [1, 6, 3, 5],
    }
)

pb.preview(tbl)
```

Let's validate that the columns `a` and `b` actually exist in the table. We'll determine if this validation had any failing test units (each validation will have a single test unit).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_exists(columns=["a", "b"])
    .interrogate()
)
validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows two entries (one validation step per column) generated by using `col_exists()`. Both steps passed since both columns provided in `columns=` are present in the table. Now, let's check for the existence of a different set of columns.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_exists(columns=["b", "c"])
    .interrogate()
)
validation
```

The validation table reports one passing validation step (the check for column `b`) and one failing validation step (the check for column `c`, which doesn't exist). col_pct_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', p: 'float', tol: 'Tolerance' = 0, thresholds: 'int | float | None | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether a column has a specific percentage of Null values. The `col_pct_null()` validation method checks whether the percentage of Null values in a column matches a specified percentage `p=` (within an optional tolerance `tol=`). This validation operates at the column level, generating a single validation step per column that passes or fails based on whether the actual percentage of Null values falls within the acceptable range defined by `p ± tol`.
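For example, with `p=0.25` and `tol=0.05`, the acceptable range `p ± tol` spans `0.20` to `0.30`. A minimal sketch of such a call (here `tbl` and column `x` are placeholders, not tables defined in these docs):

```
# Expect ~25% Nulls in column `x`, allowing the fraction to deviate by `tol`
pb.Validate(data=tbl).col_pct_null(columns="x", p=0.25, tol=0.05).interrogate()
```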
Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. p The expected percentage of Null values in the column, expressed as a decimal between `0.0` and `1.0`. For example, `p=0.5` means 50% of values should be Null. tol The tolerance allowed when comparing the actual percentage of Null values to the expected percentage `p=`. The validation passes if the actual percentage falls within the range `[p - tol, p + tol]`. Default is `0`, meaning an exact match is required. See the *Tolerance* section for details on all supported formats (absolute, relative, symmetric, and asymmetric bounds). thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Tolerance --------- The `tol=` parameter accepts several different formats to specify the acceptable deviation from the expected percentage `p=`. The tolerance can be expressed as: 1. *single integer* (absolute tolerance): the exact number of test units that can deviate. For example, `tol=2` means the actual count can differ from the expected count by up to 2 units in either direction. 2. *single float between 0 and 1* (relative tolerance): a proportion of the expected count. For example, if the expected count is 50 and `tol=0.1`, the acceptable range is 45 to 55 (50 ± 10% of 50 = 50 ± 5). 3. *tuple of two integers* (absolute bounds): explicitly specify the lower and upper bounds as absolute deviations. For example, `tol=(1, 3)` means the actual count can be 1 unit below or 3 units above the expected count. 4. *tuple of two floats between 0 and 1* (relative bounds): explicitly specify the lower and upper bounds as proportional deviations. 
For example, `tol=(0.05, 0.15)` means the lower bound is 5% below and the upper bound is 15% above the expected count.

When using a single value (integer or float), the tolerance is applied symmetrically in both directions. When using a tuple, you can specify asymmetric tolerances where the lower and upper bounds differ.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`) that have different percentages of Null values. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6, 7, 8],
        "b": [1, None, 3, None, 5, None, 7, None],
        "c": [None, None, None, None, None, None, 1, 2],
    }
)

pb.preview(tbl)
```

Let's validate that column `a` has 0% Null values (i.e., no Null values at all).

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="a", p=0.0)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_pct_null()`. The validation passed since column `a` has no Null values.

Now, let's check that column `b` has exactly 50% Null values.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="b", p=0.5)
    .interrogate()
)

validation
```

This validation also passes, as column `b` has exactly 4 out of 8 values as Null (50%).

Finally, let's validate column `c` with a tolerance. Column `c` has 75% Null values, so we'll check if it's approximately 70% Null with a tolerance of 10%.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="c", p=0.70, tol=0.10)
    .interrogate()
)

validation
```

This validation passes because `tol=0.10` is a relative tolerance: the expected count is 0.70 × 8 = 5.6 Nulls, and a 10% deviation gives an acceptable range of roughly 5.0 to 6.2 Nulls. The actual count of 6 Nulls (75%) falls within that range.

The `tol=` parameter supports multiple formats to express tolerance. Let's explore all the different ways to specify tolerance using column `b`, which has exactly 50% Null values (4 out of 8 values).

*Using an absolute tolerance (integer)*: Specify the number of rows by which the count may deviate. With `tol=1`, we allow the count to differ by 1 row in either direction.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="b", p=0.375, tol=1)  # Expect 3 Nulls, allow ±1 (range: 2-4)
    .interrogate()
)

validation
```

This passes because column `b` has 4 Null values, which falls within the acceptable range of 2 to 4 (3 ± 1).

*Using a relative tolerance (float)*: Specify the tolerance as a proportion of the expected count. With `tol=0.25`, we allow a 25% deviation from the expected count.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="b", p=0.375, tol=0.25)  # Expect 3 Nulls, allow ±25% (range: 2.25-3.75)
    .interrogate()
)

validation
```

This passes because the fractional bounds (3 ± 0.75, i.e., 2.25 to 3.75) are widened to whole rows (2 to 4), and the actual count of 4 Null values falls within that range.

*Using asymmetric absolute bounds (tuple of integers)*: Specify different lower and upper bounds as absolute values. With `tol=(0, 2)`, we allow no deviation below but up to 2 rows above the expected count.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="b", p=0.25, tol=(0, 2))  # Expect 2 Nulls, allow -0/+2 (range: 2-4)
    .interrogate()
)

validation
```

This passes because the actual count of 4 Null values falls within the acceptable range of 2 to 4.

*Using asymmetric relative bounds (tuple of floats)*: Specify different lower and upper bounds as proportions. With `tol=(0.1, 0.3)`, we allow 10% below and 30% above the expected count.

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_pct_null(columns="b", p=0.375, tol=(0.1, 0.3))  # Expect 3 Nulls, allow -10%/+30%
    .interrogate()
)

validation
```

This passes because the fractional bounds (3 - 0.3 to 3 + 0.9, i.e., 2.7 to 3.9) are widened to whole rows (2 to 4), and the actual count of 4 Null values falls within that range.

rows_distinct(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Validate whether rows in the table are distinct.

The `rows_distinct()` method checks whether rows in the table are distinct. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns_subset
    A single column or a list of columns to use as a subset for the distinct comparison. If `None`, then all columns in the table will be used for the comparison. If multiple columns are supplied, the distinct comparison will be made over the combination of values in those columns.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns_subset=` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). 
A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments.

Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three string columns (`col_1`, `col_2`, and `col_3`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "col_1": ["a", "b", "c", "d"],
        "col_2": ["a", "a", "c", "d"],
        "col_3": ["a", "a", "d", "e"],
    }
)

pb.preview(tbl)
```

Let's validate that the rows in the table are distinct with `rows_distinct()`. We'll determine if this validation had any failing test units (there are four test units, one for each row). A failing test unit means that a given row is not distinct from every other row.

```{python}
validation = (
    pb.Validate(data=tbl)
    .rows_distinct()
    .interrogate()
)

validation
```

From this validation table we see that there are no failing test units. All rows in the table are distinct from one another.
We can also use a subset of columns to determine distinctness. Let's specify the subset using columns `col_2` and `col_3` for the next validation. ```{python} validation = ( pb.Validate(data=tbl) .rows_distinct(columns_subset=["col_2", "col_3"]) .interrogate() ) validation ``` The validation table reports two failing test units. The first and second rows are duplicated when considering only the values in columns `col_2` and `col_3`. There's only one set of duplicates but there are two failing test units since each row is compared to all others. rows_complete(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether row data are complete by having no missing values. The `rows_complete()` method checks whether rows in the table are complete. Completeness of a row means that there are no missing values within the row. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). A subset of columns can be specified for the completeness check. If no subset is provided, all columns in the table will be used. Parameters ---------- columns_subset A single column or a list of columns to use as a subset for the completeness check. If `None` (the default), then all columns in the table will be used. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. 
The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns_subset=` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. 
If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples here, we'll use a simple Polars DataFrame with three string columns (`col_1`, `col_2`, and `col_3`). The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "col_1": ["a", None, "c", "d"],
        "col_2": ["a", "a", "c", None],
        "col_3": ["a", "a", "d", None],
    }
)

pb.preview(tbl)
```

Let's validate that the rows in the table are complete with `rows_complete()`. We'll determine if this validation had any failing test units (there are four test units, one for each row). A failing test unit means that a given row is not complete (i.e., has at least one missing value).

```{python}
validation = (
    pb.Validate(data=tbl)
    .rows_complete()
    .interrogate()
)

validation
```

From this validation table we see that there are two failing test units. This is because two rows in the table have at least one missing value (the second row and the last row).

We can also use a subset of columns to determine completeness. Let's specify the subset using columns `col_2` and `col_3` for the next validation.

```{python}
validation = (
    pb.Validate(data=tbl)
    .rows_complete(columns_subset=["col_2", "col_3"])
    .interrogate()
)

validation
```

The validation table reports a single failing test unit. The last row contains missing values in both the `col_2` and `col_3` columns.

col_schema_match(self, schema: 'Schema', complete: 'bool' = True, in_order: 'bool' = True, case_sensitive_colnames: 'bool' = True, case_sensitive_dtypes: 'bool' = True, full_match_dtypes: 'bool' = True, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Do columns in the table (and their types) match a predefined schema?

The `col_schema_match()` method works in conjunction with an object generated by the [`Schema`](`pointblank.Schema`) class. That class object is the expectation for the actual schema of the target table.
The validation step operates over a single test unit, which is whether the schema matches that of the table (within the constraints enforced by the `complete=`, and `in_order=` options). Parameters ---------- schema A `Schema` object that represents the expected schema of the table. This object is generated by the [`Schema`](`pointblank.Schema`) class. complete Should the schema match be complete? If `True`, then the target table must have all columns specified in the schema. If `False`, then the table can have additional columns not in the schema (i.e., the schema is a subset of the target table's columns). in_order Should the schema match be in order? If `True`, then the columns in the schema must appear in the same order as they do in the target table. If `False`, then the order of columns in the schema and the target table can differ. case_sensitive_colnames Should the schema match be case-sensitive with regard to column names? If `True`, then the column names in the schema and the target table must match exactly. If `False`, then the column names are compared in a case-insensitive manner. case_sensitive_dtypes Should the schema match be case-sensitive with regard to column data types? If `True`, then the column data types in the schema and the target table must match exactly. If `False`, then the column data types are compared in a case-insensitive manner. full_match_dtypes Should the schema match require a full match of data types? If `True`, then the column data types in the schema and the target table must match exactly. If `False` then substring matches are allowed, so a schema data type of `Int` would match a target table data type of `Int64`. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. 
Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples here, we'll use a simple Polars DataFrame with three columns (string, integer, and float). The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": ["apple", "banana", "cherry", "date"], "b": [1, 6, 3, 5], "c": [1.1, 2.2, 3.3, 4.4], } ) pb.preview(tbl) ``` Let's validate that the columns in the table match a predefined schema. A schema can be defined using the [`Schema`](`pointblank.Schema`) class. ```{python} schema = pb.Schema( columns=[("a", "String"), ("b", "Int64"), ("c", "Float64")] ) ``` You can print the schema object to verify that the expected schema is as intended. ```{python} print(schema) ``` Now, we'll use the `col_schema_match()` method to validate the table against the expected `schema` object. There is a single test unit for this validation step (whether the schema matches the table or not). ```{python} validation = ( pb.Validate(data=tbl) .col_schema_match(schema=schema) .interrogate() ) validation ``` The validation table shows that the schema matches the table. The single test unit passed since the table columns and their types match the schema. 
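The case-sensitivity options are useful when exact naming is stricter than needed. As a minimal sketch (reusing the `tbl` from above, with an intentionally miscased column name and lowercase dtype names for illustration), the following should still pass:

```python
# Miscased column name and lowercase dtypes, accepted via the case-insensitive options
schema_lenient = pb.Schema(
    columns=[("A", "string"), ("b", "int64"), ("c", "float64")]
)

validation = (
    pb.Validate(data=tbl)
    .col_schema_match(
        schema=schema_lenient,
        case_sensitive_colnames=False,  # "A" is accepted for column "a"
        case_sensitive_dtypes=False,    # "string" is accepted for "String"
    )
    .interrogate()
)
```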
row_count_match(self, count: 'int | Any', tol: 'Tolerance' = 0, inverse: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Validate whether the row count of the table matches a specified count.

The `row_count_match()` method checks whether the row count of the target table matches a specified count. This validation will operate over a single test unit, which is whether the row count matches the specified count.

We also have the option to invert the validation step by setting `inverse=True`. This will make the expectation that the row count of the target table *does not* match the specified count.

Parameters
----------
count
    The expected row count of the table. This can be an integer value, a Polars or Pandas DataFrame object, or an Ibis backend table. If a DataFrame/table is provided, the row count of that object will be used as the expected count.
tol
    The tolerance allowable for the row count match. This can be specified as a single numeric value (integer or float) or as a tuple of two integers representing the lower and upper bounds of the tolerance range. If a single integer value (greater than 1) is provided, it represents the absolute bounds of the tolerance, i.e., plus or minus that value. If a float value (between `0` and `1`) is provided, it represents the relative tolerance, i.e., plus or minus that fraction of the target count. If a tuple is provided, it represents the lower and upper absolute bounds of the tolerance range. See the examples for details.
inverse
    Should the validation step be inverted? If `True`, then the expectation is that the row count of the target table should not match the specified `count=` value.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing.
Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units that fail (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False)
```

For the examples here, we'll use the built-in dataset `"small_table"`. The table can be obtained by calling `load_dataset("small_table")`.

```{python}
import pointblank as pb

small_table = pb.load_dataset("small_table")

pb.preview(small_table)
```

Let's validate that the number of rows in the table matches a fixed value. In this case, we will use the value `13` as the expected row count.

```{python}
validation = (
    pb.Validate(data=small_table)
    .row_count_match(count=13)
    .interrogate()
)

validation
```

The validation table shows that the expectation value of `13` matches the actual count of rows in the target table. So, the single test unit passed.

Let's modify our example to show the different ways we can allow some tolerance in the row count by using the `tol=` argument.

```{python}
smaller_small_table = small_table.sample(n=12)  # within the lower bound

validation = (
    pb.Validate(data=smaller_small_table)
    .row_count_match(count=13, tol=(2, 0))  # minus 2 but plus 0, i.e., 11-13
    .interrogate()
)

validation
```

A relative tolerance can also be given as a float between `0` and `1`:

```{python}
validation = (
    pb.Validate(data=smaller_small_table)
    .row_count_match(count=13, tol=0.05)  # 5% relative tolerance around 13
    .interrogate()
)

validation
```

If the actual row count falls outside the tolerance range, the single test unit fails:

```{python}
even_smaller_table = small_table.sample(n=2)

validation = (
    pb.Validate(data=even_smaller_table)
    .row_count_match(count=13, tol=5)  # plus or minus 5; this test will fail
    .interrogate()
)

validation
```

col_count_match(self, count: 'int | Any', inverse: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Validate whether the column count of the table matches a specified count.

The `col_count_match()` method checks whether the column count of the target table matches a specified count. This validation will operate over a single test unit, which is whether the column count matches the specified count.

We also have the option to invert the validation step by setting `inverse=True`. This will make the expectation that the column count of the target table *does not* match the specified count.

Parameters
----------
count
    The expected column count of the table. This can be an integer value, a Polars or Pandas DataFrame object, or an Ibis backend table. If a DataFrame/table is provided, the column count of that object will be used as the expected count.
inverse
    Should the validation step be inverted? If `True`, then the expectation is that the column count of the target table should not match the specified `count=` value.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing.
Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False) ``` For the examples here, we'll use the built in dataset `"game_revenue"`. The table can be obtained by calling `load_dataset("game_revenue")`. ```{python} import pointblank as pb game_revenue = pb.load_dataset("game_revenue") pb.preview(game_revenue) ``` Let's validate that the number of columns in the table matches a fixed value. In this case, we will use the value `11` as the expected column count. ```{python} validation = ( pb.Validate(data=game_revenue) .col_count_match(count=11) .interrogate() ) validation ``` The validation table shows that the expectation value of `11` matches the actual count of columns in the target table. So, the single test unit passed. 
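The `inverse=` option can be sketched just as briefly (reusing `game_revenue` from above); here the expectation is that the table does *not* have the supplied column count:

```python
# Inverted check: should pass because the table has 11 columns, not 10
validation = (
    pb.Validate(data=game_revenue)
    .col_count_match(count=10, inverse=True)
    .interrogate()
)
```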
data_freshness(self, column: 'str', max_age: 'str | datetime.timedelta', reference_time: 'datetime.datetime | str | None' = None, timezone: 'str | None' = None, allow_tz_mismatch: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate that data in a datetime column is not older than a specified maximum age. The `data_freshness()` validation method checks whether the most recent timestamp in the specified datetime column is within the allowed `max_age=` from the `reference_time=` (which defaults to the current time). This is useful for ensuring data pipelines are delivering fresh data and for enforcing data SLAs. This method helps detect stale data by comparing the maximum (most recent) value in a datetime column against an expected freshness threshold. Parameters ---------- column The name of the datetime column to check for freshness. This column should contain date or datetime values. max_age The maximum allowed age of the data. Can be specified as: (1) a string with a human-readable duration like `"24 hours"`, `"1 day"`, `"30 minutes"`, `"2 weeks"`, etc. (supported units: `seconds`, `minutes`, `hours`, `days`, `weeks`), or (2) a `datetime.timedelta` object for precise control. reference_time The reference point in time to compare against. Defaults to `None`, which uses the current time (UTC if `timezone=` is not specified). Can be: (1) a `datetime.datetime` object (timezone-aware recommended), (2) a string in ISO 8601 format (e.g., `"2024-01-15T10:30:00"` or `"2024-01-15T10:30:00+05:30"`), or (3) `None` to use the current time. timezone The timezone to use for interpreting the data and reference time. Accepts IANA timezone names (e.g., `"America/New_York"`), hour offsets (e.g., `"-7"`), or ISO 8601 offsets (e.g., `"-07:00"`). When `None` (default), naive datetimes are treated as UTC. See the *The `timezone=` Parameter* section for details. allow_tz_mismatch Whether to allow timezone mismatches between the column data and reference time. By default (`False`), a warning note is added when comparing timezone-naive with timezone-aware datetimes. Set to `True` to suppress these warnings. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value or callable that determines whether the validation step should be active. 
Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. How Timezones Affect Freshness Checks ------------------------------------- Freshness validation involves comparing two times: the **data time** (the most recent timestamp in your column) and the **execution time** (when and where the validation runs). Timezone confusion typically arises because these two times may originate from different contexts. Consider these common scenarios: - your data timestamps are stored in UTC (common for databases), but you're running validation on your laptop in New York (Eastern Time) - you develop and test validation locally, then deploy it to a cloud workflow that runs in UTC—suddenly your 'same' validation behaves differently - your data comes from servers in multiple regions, each recording timestamps in their local timezone The `timezone=` parameter exists to solve this problem by establishing a single, explicit timezone context for the freshness comparison. When you specify a timezone, Pointblank interprets both the data timestamps (if naive) and the execution time in that timezone, ensuring consistent behavior whether you run validation on your laptop or in a cloud workflow. **Scenario 1: Data has timezone-aware datetimes** ```python # Your data column has values like: 2024-01-15 10:30:00+00:00 (UTC) # Comparison is straightforward as both sides have explicit timezones .data_freshness(column="updated_at", max_age="24 hours") ``` **Scenario 2: Data has naive datetimes (no timezone)** ```python # Your data column has values like: 2024-01-15 10:30:00 (no timezone) # Specify the timezone the data was recorded in: .data_freshness(column="updated_at", max_age="24 hours", timezone="America/New_York") ``` **Scenario 3: Ensuring consistent behavior across environments** ```python # Pin the timezone to ensure identical results whether running locally or in the cloud .data_freshness( column="updated_at", max_age="24 hours", timezone="UTC", # Explicit timezone removes environment dependence ) ``` The `timezone=` Parameter --------------------------- The `timezone=` parameter accepts several convenient formats, making it easy to specify timezones in whatever way is most natural for your use case. The following examples illustrate the three supported input styles. 
**IANA Timezone Names** (recommended for regions with daylight saving time): ```python timezone="America/New_York" # Eastern Time (handles DST automatically) timezone="Europe/London" # UK time timezone="Asia/Tokyo" # Japan Standard Time timezone="Australia/Sydney" # Australian Eastern Time timezone="UTC" # Coordinated Universal Time ``` **Simple Hour Offsets** (quick and easy): ```python timezone="-7" # UTC-7 (e.g., Mountain Standard Time) timezone="+5" # UTC+5 (e.g., Pakistan Standard Time) timezone="0" # UTC timezone="-12" # UTC-12 ``` **ISO 8601 Offset Format** (precise, including fractional hours): ```python timezone="-07:00" # UTC-7 timezone="+05:30" # UTC+5:30 (e.g., India Standard Time) timezone="+00:00" # UTC timezone="-09:30" # UTC-9:30 ``` When a timezone is specified: - naive datetime values in the column are assumed to be in this timezone. - the reference time (if naive) is assumed to be in this timezone. - the validation report will show times in this timezone. When `None` (default): - if your column has timezone-aware datetimes, those timezones are used - if your column has naive datetimes, they're treated as UTC - the current time reference uses UTC Note that IANA timezone names are preferred when daylight saving time transitions matter, as they automatically handle the offset changes. Fixed offsets like `"-7"` or `"-07:00"` do not account for DST. Recommendations for Working with Timestamps ------------------------------------------- When working with datetime data, storing timestamps in UTC in your databases is strongly recommended since it provides a consistent reference point regardless of where your data originates or where it's consumed. Using timezone-aware datetimes whenever possible helps avoid ambiguity—when a datetime has an explicit timezone, there's no guessing about what time it actually represents. If you're working with naive datetimes (which lack timezone information), always specify the `timezone=` parameter so Pointblank knows how to interpret those values. When providing `reference_time=` as a string, use ISO 8601 format with the timezone offset included (e.g., `"2024-01-15T10:30:00+00:00"`) to ensure unambiguous parsing. Finally, prefer IANA timezone names (like `"America/New_York"`) over fixed offsets (like `"-05:00"`) when daylight saving time transitions matter, since IANA names automatically handle the twice-yearly offset changes. To see all available IANA timezone names in Python, use `zoneinfo.available_timezones()` from the standard library's `zoneinfo` module. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False) ``` The simplest use of `data_freshness()` requires just two arguments: the `column=` containing your timestamps and `max_age=` specifying how old the data can be. In this first example, we create sample data with an `"updated_at"` column containing timestamps from 1, 12, and 20 hours ago. By setting `max_age="24 hours"`, we're asserting that the most recent timestamp should be within 24 hours of the current time. Since the newest record is only 1 hour old, this validation passes. 
```{python} import pointblank as pb import polars as pl from datetime import datetime, timedelta # Create sample data with recent timestamps recent_data = pl.DataFrame({ "id": [1, 2, 3], "updated_at": [ datetime.now() - timedelta(hours=1), datetime.now() - timedelta(hours=12), datetime.now() - timedelta(hours=20), ] }) validation = ( pb.Validate(data=recent_data) .data_freshness(column="updated_at", max_age="24 hours") .interrogate() ) validation ``` The `max_age=` parameter accepts human-readable strings with various time units. You can chain multiple `data_freshness()` calls to check different freshness thresholds simultaneously—useful for tiered SLAs where you might want warnings at 30 minutes but errors at 2 days. ```{python} # Check data is fresh within different time windows validation = ( pb.Validate(data=recent_data) .data_freshness(column="updated_at", max_age="30 minutes") # Very fresh .data_freshness(column="updated_at", max_age="2 days") # Reasonably fresh .data_freshness(column="updated_at", max_age="1 week") # Within a week .interrogate() ) validation ``` When your data contains naive datetimes (timestamps without timezone information), use the `timezone=` parameter to specify what timezone those values represent. Here we have event data recorded in Eastern Time, so we set `timezone="America/New_York"` to ensure the freshness comparison is done correctly. ```{python} # Data with naive datetimes (assume they're in Eastern Time) eastern_data = pl.DataFrame({ "event_time": [ datetime.now() - timedelta(hours=2), datetime.now() - timedelta(hours=5), ] }) validation = ( pb.Validate(data=eastern_data) .data_freshness( column="event_time", max_age="12 hours", timezone="America/New_York" # Interpret times as Eastern ) .interrogate() ) validation ``` For reproducible validations or historical checks, you can use `reference_time=` to compare against a specific point in time instead of the current time. This is particularly useful for testing or when validating data snapshots. The reference time should include a timezone offset (like `+00:00` for UTC) to avoid ambiguity. ```{python} validation = ( pb.Validate(data=recent_data) .data_freshness( column="updated_at", max_age="24 hours", reference_time="2024-01-15T12:00:00+00:00" ) .interrogate() ) validation ``` tbl_match(self, tbl_compare: 'Any', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate whether the target table matches a comparison table. The `tbl_match()` method checks whether the target table's composition matches that of a comparison table. The validation performs a comprehensive comparison using progressively stricter checks (from least to most stringent): 1. **Column count match**: both tables must have the same number of columns 2. **Row count match**: both tables must have the same number of rows 3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order) 4. **Schema match (order)**: columns in the correct order (case-insensitive names) 5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order) 6. **Data match**: values in corresponding cells must be identical This progressive approach helps identify exactly where tables differ. The validation will fail at the first check that doesn't pass, making it easier to diagnose mismatches. 
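For instance, two tables can pass the count checks and the loose schema check yet still fail at the ordering check. Here is a hedged sketch of such a pair; the outcome described follows from the check order above:

```python
import polars as pl

# Same column count, same row count, and the same names/dtypes
# (loose schema match), but the columns appear in a different order:
target = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
compare = pl.DataFrame({"b": ["x", "y"], "a": [1, 2]})

# In a tbl_match() step, this pair would pass checks 1-3 and fail at
# check 4 (schema match with column order), pinpointing the mismatch.
```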
This validation operates over a single test unit (pass/fail for complete table match).

Parameters
----------
tbl_compare
    The comparison table to validate against. This can be a DataFrame object (Polars or Pandas), an Ibis table object, or a callable that returns a table. If a callable is provided, it will be executed during interrogation to obtain the comparison table.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that the same preprocessing is **not** applied to the comparison table; only the target table is preprocessed.

Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'.
The threshold values can either be set as a proportion of failing test units among all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Cross-Backend Validation
------------------------
The `tbl_match()` method supports **automatic backend coercion** when comparing tables from different backends (e.g., comparing a Polars DataFrame against a Pandas DataFrame, or comparing database tables from DuckDB/SQLite against in-memory DataFrames). When tables with different backends are detected, the comparison table is automatically converted to match the data table's backend before validation proceeds.

**Certified Backend Combinations:**

All combinations of the following backends have been tested and certified to work (in both directions):

- Pandas DataFrame
- Polars DataFrame
- DuckDB (native)
- DuckDB (as Ibis table)
- SQLite (via Ibis)

Note that database backends (DuckDB, SQLite, PostgreSQL, MySQL, Snowflake, BigQuery) are automatically materialized during validation:

- if comparing **against Polars**: materialized to Polars
- if comparing **against Pandas**: materialized to Pandas
- if **both tables are database backends**: both materialized to Polars

This ensures optimal performance and type consistency.

**Data Types That Work Best in Cross-Backend Validation:**

- numeric types: int, float columns (including proper NaN handling)
- string types: text columns with consistent encodings
- boolean types: True/False values
- null values: `None` and `NaN` are treated as equivalent across backends
- list columns: nested list structures (with basic types)

**Known Limitations:**

While many data types work well in cross-backend validation, there are some known limitations to be aware of:

- date/datetime types: When converting between Polars and Pandas, date objects may be represented differently. For example, `datetime.date` objects in Pandas may become `pd.Timestamp` objects when converted from Polars, leading to false mismatches. To work around this, ensure both tables use the same datetime representation before comparison.
- custom types: User-defined types or complex nested structures may not convert cleanly between backends and could cause unexpected comparison failures.
- categorical types: Categorical/factor columns may have different internal representations across backends.
- timezone-aware datetimes: Timezone handling differs between backends and may cause comparison issues.

Here are some ideas to overcome such limitations:

- for date/datetime columns, consider using `pre=` preprocessing to normalize representations before comparison.
- when working with custom types, manually convert tables to the same backend before using `tbl_match()`.
- use the same datetime precision (e.g., milliseconds vs microseconds) in both tables.

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False)
```

For the examples here, we'll create two simple tables to demonstrate the `tbl_match()` validation.

```{python}
import pointblank as pb
import polars as pl

# Create the first table
tbl_1 = pl.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["w", "x", "y", "z"],
    "c": [4.0, 5.0, 6.0, 7.0]
})

# Create an identical table
tbl_2 = pl.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["w", "x", "y", "z"],
    "c": [4.0, 5.0, 6.0, 7.0]
})

pb.preview(tbl_1)
```

Let's validate that `tbl_1` matches `tbl_2`. Since these tables are identical, the validation should pass.

```{python}
validation = (
    pb.Validate(data=tbl_1)
    .tbl_match(tbl_compare=tbl_2)
    .interrogate()
)

validation
```

The validation table shows that the single test unit passed, indicating that the two tables match completely. Now, let's create a table with a slight difference and see what happens.

```{python}
# Create a table with one different value
tbl_3 = pl.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["w", "x", "y", "z"],
    "c": [4.0, 5.5, 6.0, 7.0]  # Changed 5.0 to 5.5
})

validation = (
    pb.Validate(data=tbl_1)
    .tbl_match(tbl_compare=tbl_3)
    .interrogate()
)

validation
```

The validation table shows that the single test unit failed because the tables don't match (one value is different in column `c`).

conjointly(self, *exprs: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Perform multiple row-wise validations for joint validity.

The `conjointly()` validation method checks whether each row in the table passes multiple validation conditions simultaneously. This enables compound validation logic where a test unit (typically a row) must satisfy all specified conditions to pass the validation.

This method accepts multiple validation expressions as callables, which should return boolean expressions when applied to the data. You can use lambdas that incorporate Polars/Pandas/Ibis expressions (based on the target table type) or create more complex validation functions.

The validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
*exprs
    Multiple validation expressions provided as callable functions. Each callable should accept a table as its single argument and return a boolean expression or Series/Column that evaluates to boolean values for each row.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'.

The threshold values can either be set as a proportion of failing test units among all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).
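As a brief sketch of the four input schemes above (the particular values here are illustrative, not prescriptive):

```python
import pointblank as pb

# Scheme 1: the `Thresholds` class (most explicit)
thresholds = pb.Thresholds(warning=0.05, error=0.10, critical=0.15)

# Scheme 2: a tuple; positions are (warning, error, critical)
thresholds = (0.05, 0.10, 0.15)

# Scheme 3: a dictionary with any of the three valid keys
thresholds = {"warning": 0.05, "critical": 0.15}

# Scheme 4: a single value sets only the 'warning' level
thresholds = 10  # an absolute count of 10 failing test units

# Any of these forms can then be passed to a step, e.g.:
# .conjointly(..., thresholds=thresholds)
```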
Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2], "c": [10, 4, 8, 9, 10, 5], } ) pb.preview(tbl) ``` Let's validate that the values in each row satisfy multiple conditions simultaneously: 1. Column `a` should be greater than 2 2. Column `b` should be less than 7 3. The sum of `a` and `b` should be less than the value in column `c` We'll use `conjointly()` to check all these conditions together: ```{python} validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pl.col("a") > 2, lambda df: pl.col("b") < 7, lambda df: pl.col("a") + pl.col("b") < pl.col("c") ) .interrogate() ) validation ``` The validation table shows that not all rows satisfy all three conditions together. For a row to pass the conjoint validation, all three conditions must be true for that row. We can also use preprocessing to filter the data before applying the conjoint validation: ```{python} # Define preprocessing function for serialization compatibility def filter_by_c_gt_5(df): return df.filter(pl.col("c") > 5) validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pl.col("a") > 2, lambda df: pl.col("b") < 7, lambda df: pl.col("a") + pl.col("b") < pl.col("c"), pre=filter_by_c_gt_5 ) .interrogate() ) validation ``` This allows for more complex validation scenarios where the data is first prepared and then validated against multiple conditions simultaneously. Or, you can use the backend-agnostic column expression helper [`expr_col()`](`pointblank.expr_col`) to write expressions that work across different table backends: ```{python} tbl = pl.DataFrame( { "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2], "c": [10, 4, 8, 9, 10, 5], } ) # Using backend-agnostic syntax with expr_col() validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pb.expr_col("a") > 2, lambda df: pb.expr_col("b") < 7, lambda df: pb.expr_col("a") + pb.expr_col("b") < pb.expr_col("c") ) .interrogate() ) validation ``` Using [`expr_col()`](`pointblank.expr_col`) allows your validation code to work consistently across Pandas, Polars, and Ibis table backends without changes, making your validation pipelines more portable. See Also -------- Look at the documentation of the [`expr_col()`](`pointblank.expr_col`) function for more information on how to use it with different table backends. specially(self, expr: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Perform a specialized validation with customized logic. The `specially()` validation method allows for the creation of specialized validation expressions that can be used to validate specific conditions or logic in the data. This method provides maximum flexibility by accepting a custom callable that encapsulates your validation logic. 
The callable function can have one of two signatures:

- a function accepting a single parameter (the data table): `def validate(data): ...`
- a function with no parameters: `def validate(): ...`

The second form is particularly useful for environment validations that don't need to inspect the data table.

The callable function must ultimately return one of:

1. a single boolean value or boolean list
2. a table where the final column contains boolean values (column name is unimportant)

The validation will operate over the number of test units that is equal to the number of rows in the data table (if returning a table with boolean values). If returning a scalar boolean value, the validation will operate over a single test unit. For a return of a list of boolean values, the length of the list constitutes the number of test units.

Parameters
----------
expr
    A callable function that defines the specialized validation logic. This function should: (1) accept the target data table as its single argument (though it may ignore it), or (2) take no parameters at all (for environment validations). The function must ultimately return boolean values representing validation results. Design your function to incorporate any custom parameters directly within the function itself using closure variables or default parameters.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Preprocessing
-------------
The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table.
This is useful for performing any necessary transformations or filtering on the data before the validation step is applied.

The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data.

Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'.

The threshold values can either be set as a proportion of failing test units among all test units (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

The `specially()` method offers maximum flexibility for validation, allowing you to create custom validation logic that fits your specific needs. The following examples demonstrate different patterns and use cases for this powerful validation approach.

### Simple validation with direct table access

This example shows the most straightforward use case where we create a function that directly checks if the sum of two columns is positive.

```{python}
import pointblank as pb
import polars as pl

simple_tbl = pl.DataFrame({
    "a": [5, 7, 1, 3, 9, 4],
    "b": [6, 3, 0, 5, 8, 2]
})

# Simple function that validates directly on the table
def validate_sum_positive(data):
    return data.select(pl.col("a") + pl.col("b") > 0)

(
    pb.Validate(data=simple_tbl)
    .specially(expr=validate_sum_positive)
    .interrogate()
)
```

The function returns a Polars DataFrame with a single boolean column indicating whether the sum of columns `a` and `b` is positive for each row. Each row in the resulting DataFrame is a distinct test unit. This pattern works well for simple validations where you don't need configurable parameters.
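Since `expr=` accepts any callable, the same check can also be written inline as a lambda; here is a minimal sketch equivalent to the named function above:

```python
# Inline lambda form of the same sum-positivity check
(
    pb.Validate(data=simple_tbl)
    .specially(expr=lambda data: data.select((pl.col("a") + pl.col("b")) > 0))
    .interrogate()
)
```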
### Advanced validation with closure variables for parameters

When you need to make your validation configurable, you can use the function factory pattern (a function that returns a closure) to create parameterized validations:

```{python}
# Create a parameterized validation function using closures
def make_column_ratio_validator(col1, col2, min_ratio):
    def validate_column_ratio(data):
        return data.select((pl.col(col1) / pl.col(col2)) > min_ratio)
    return validate_column_ratio

(
    pb.Validate(data=simple_tbl)
    .specially(
        expr=make_column_ratio_validator(col1="a", col2="b", min_ratio=0.5)
    )
    .interrogate()
)
```

This approach allows you to create reusable validation functions that can be configured with different parameters without modifying the function itself.

### Validation function returning a list of booleans

This example demonstrates how to create a validation function that returns a list of boolean values, where each element represents a separate test unit:

```{python}
import pointblank as pb
import polars as pl

# Create sample data
transaction_tbl = pl.DataFrame({
    "transaction_id": [f"TX{i:04d}" for i in range(1, 11)],
    "amount": [120.50, 85.25, 50.00, 240.75, 35.20, 150.00, 85.25, 65.00, 210.75, 90.50],
    "category": ["food", "shopping", "entertainment", "travel", "utilities",
                 "food", "shopping", "entertainment", "travel", "utilities"]
})

# Define a validation function that returns a list of booleans
def validate_transaction_rules(data):
    # Create a list to store individual test results
    test_results = []

    # Check each row individually against multiple business rules
    for row in data.iter_rows(named=True):
        # Rule: transaction IDs must start with "TX" and be 6 chars long
        valid_id = row["transaction_id"].startswith("TX") and len(row["transaction_id"]) == 6

        # Rule: Amounts must be appropriate for their category
        valid_amount = True
        if row["category"] == "food" and (row["amount"] < 10 or row["amount"] > 200):
            valid_amount = False
        elif row["category"] == "utilities" and (row["amount"] < 20 or row["amount"] > 300):
            valid_amount = False
        elif row["category"] == "entertainment" and row["amount"] > 100:
            valid_amount = False

        # A transaction passes if it satisfies both rules
        test_results.append(valid_id and valid_amount)

    return test_results

(
    pb.Validate(data=transaction_tbl)
    .specially(
        expr=validate_transaction_rules,
        brief="Validate transaction IDs and amounts by category."
    )
    .interrogate()
)
```

This example shows how to create a validation function that applies multiple business rules to each row and returns a list of boolean results. Each boolean in the list represents a separate test unit, and a test unit passes only if all rules are satisfied for a given row.

The function iterates through each row in the data table, checking:

1. if transaction IDs follow the required format
2. if transaction amounts are appropriate for their respective categories

This approach is powerful when you need to apply complex, conditional logic that can't be easily expressed using the built-in validation functions.

### Table-level validation returning a single boolean

Sometimes you need to validate properties of the entire table rather than row-by-row.
In these cases, your function can return a single boolean value: ```{python} def validate_table_properties(data): # Check if table has at least one row with column 'a' > 10 has_large_values = data.filter(pl.col("a") > 10).height > 0 # Check if mean of column 'b' is positive has_positive_mean = data.select(pl.mean("b")).item() > 0 # Return a single boolean for the entire table return has_large_values and has_positive_mean ( pb.Validate(data=simple_tbl) .specially(expr=validate_table_properties) .interrogate() ) ``` This example demonstrates how to perform multiple checks on the table as a whole and combine them into a single validation result. ### Environment validation that doesn't use the data table The `specially()` validation method can even be used to validate aspects of your environment that are completely independent of the data: ```{python} def validate_pointblank_version(): try: import importlib.metadata version = importlib.metadata.version("pointblank") version_parts = version.split(".") # Get major and minor components regardless of how many parts there are major = int(version_parts[0]) minor = int(version_parts[1]) # Check both major and minor components for version `0.9+` return (major > 0) or (major == 0 and minor >= 9) except Exception as e: # More specific error handling could be added here print(f"Version check failed: {e}") return False ( pb.Validate(data=simple_tbl) .specially( expr=validate_pointblank_version, brief="Check Pointblank version `>=0.9.0`." ) .interrogate() ) ``` This pattern shows how to validate external dependencies or environment conditions as part of your validation workflow. Notice that the function doesn't take any parameters at all, which makes it cleaner when the validation doesn't need to access the data table. By combining these patterns, you can create sophisticated validation workflows that address virtually any data quality requirement in your organization. prompt(self, prompt: 'str', model: 'str', columns_subset: 'str | list[str] | None' = None, batch_size: 'int' = 1000, max_concurrent: 'int' = 3, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool | Callable' = True) -> 'Validate' Validate rows using AI/LLM-powered analysis. The `prompt()` validation method uses Large Language Models (LLMs) to validate rows of data based on natural language criteria. Similar to other Pointblank validation methods, this generates binary test results (pass/fail) that integrate seamlessly with the standard reporting framework. Like `col_vals_*()` methods, `prompt()` evaluates data against specific criteria, but instead of using programmatic rules, it uses natural language prompts interpreted by an LLM. Like `rows_distinct()` and `rows_complete()`, it operates at the row level and allows you to specify a subset of columns for evaluation using `columns_subset=`. The system automatically combines your validation criteria from the `prompt=` parameter with the necessary technical context, data formatting instructions, and response structure requirements. This is all so you only need to focus on describing your validation logic in plain language. Each row becomes a test unit that either passes or fails the validation criteria, producing the familiar True/False results that appear in Pointblank validation reports. 
This method is particularly useful for complex validation rules that are difficult to express with traditional validation methods, such as semantic checks, context-dependent validation, or subjective quality assessments.

Parameters
----------
prompt
    A natural language description of the validation criteria. This prompt should clearly describe what constitutes valid vs invalid rows. Some examples: `"Each row should contain a valid email address and a realistic person name"`, `"Values should indicate positive sentiment"`, `"The description should mention a country name"`.
model
    The model to be used. This should be in the form of `provider:model` (e.g., `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`, `"ollama"`, and `"bedrock"`. The model name should be the specific model to be used from the provider. Model names are subject to change so consult the provider's documentation for the most up-to-date model names.
columns_subset
    A single column or list of columns to include in the validation. If `None`, all columns will be included. Specifying fewer columns can improve performance and reduce API costs so try to include only the columns necessary for the validation.
batch_size
    Number of rows to process in each batch. Larger batches are more efficient but may hit API limits. Default is `1000`.
max_concurrent
    Maximum number of concurrent API requests. Higher values speed up processing but may hit rate limits. Default is `3`.
pre
    An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table.
segments
    An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list); see the sketch following the *Returns* section below.
thresholds
    Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.
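The `segments=` forms described above can be illustrated as follows; the `category` and `region` column names here are hypothetical stand-ins:

```python
# One step per distinct value in a column:
segments = "category"

# One step per listed value of a column (a (column, values) tuple):
segments = ("category", ["food", "travel"])

# A combination of both, provided as a list:
segments = ["region", ("category", ["food", "travel"])]

# Any of these can then be passed to the step, e.g.:
# .prompt(..., segments=("category", ["food", "travel"]))
```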
Constructing the `model` Argument
---------------------------------
The `model=` argument should be constructed using the provider and model name separated by a colon (`provider:model`). The provider text can be any of:

- `"anthropic"` (Anthropic)
- `"openai"` (OpenAI)
- `"ollama"` (Ollama)
- `"bedrock"` (Amazon Bedrock)

The model name should be the specific model to be used from the provider. Model names are subject to change so consult the provider's documentation for the most up-to-date model names.

Notes on Authentication
-----------------------
API keys are automatically loaded from environment variables or `.env` files and are **not** stored in the validation object for security reasons. You should consider using a secure method for handling API keys. One way to do this is to load the API key from an environment variable and retrieve it using the `os` module (specifically the `os.getenv()` function). Places to store the API key might include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.

Another solution is to store one or more model provider API keys in an `.env` file (in the root of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`) then the AI validation will automatically load the API key from the `.env` file. An `.env` file might look like this:

```plaintext
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
```

There's no need to have the `python-dotenv` package installed when using `.env` files in this way.

**Provider-specific setup**:

- **OpenAI**: set `OPENAI_API_KEY` environment variable or create `.env` file
- **Anthropic**: set `ANTHROPIC_API_KEY` environment variable or create `.env` file
- **Ollama**: no API key required, just ensure Ollama is running locally
- **Bedrock**: configure AWS credentials through standard AWS methods

AI Validation Process
---------------------
The AI validation process works as follows:

1. data batching: the data is split into batches of the specified size
2. row deduplication: duplicate rows (based on selected columns) are identified and only unique combinations are sent to the LLM for analysis
3. json conversion: each batch of unique rows is converted to JSON format for the LLM
4. prompt construction: the user prompt is embedded in a structured system prompt
5. llm processing: each batch is sent to the LLM for analysis
6. response parsing: LLM responses are parsed to extract validation results
7. result projection: results are mapped back to all original rows using row signatures
8. result aggregation: results from all batches are combined

**Performance Optimization**: the process uses row signature memoization to avoid redundant LLM calls. When multiple rows have identical values in the selected columns, only one representative row is validated, and the result is applied to all matching rows. This can dramatically reduce API costs and processing time for datasets with repetitive patterns.
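To make the row-signature idea concrete, here is a toy sketch of the deduplication concept in plain Python; it illustrates the general technique only, not Pointblank's internal implementation:

```python
# Rows with identical values in the selected columns share one result.
rows = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "bad", "name": ""},
    {"email": "a@x.com", "name": "Ann"},  # duplicate signature of row 0
]

def expensive_check(row):
    # Stand-in for one LLM evaluation; here just a trivial rule
    return "@" in row["email"] and row["name"] != ""

memo = {}
results = []
for row in rows:
    signature = tuple(sorted(row.items()))  # hashable row signature
    if signature not in memo:
        memo[signature] = expensive_check(row)  # only unique rows are "sent"
    results.append(memo[signature])

print(results)  # [True, False, True], with two checks performed rather than three
```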
The LLM receives data in this JSON format: ```json { "columns": ["col1", "col2", "col3"], "rows": [ {"col1": "value1", "col2": "value2", "col3": "value3", "_pb_row_index": 0}, {"col1": "value4", "col2": "value5", "col3": "value6", "_pb_row_index": 1} ] } ``` The LLM returns validation results in this format: ```json [ {"index": 0, "result": true}, {"index": 1, "result": false} ] ``` Prompt Design Tips ------------------ For best results, design prompts that are: - boolean-oriented: frame validation criteria to elicit clear valid/invalid responses - specific: clearly define what makes a row valid/invalid - unambiguous: avoid subjective language that could be interpreted differently - context-aware: include relevant business rules or domain knowledge - example-driven: consider providing examples in the prompt when helpful **Critical**: Prompts must be designed so the LLM can determine whether each row passes or fails the validation criteria. The system expects binary validation responses, so avoid open-ended questions or prompts that might generate explanatory text instead of clear pass/fail judgments. Good prompt examples: - "Each row should contain a valid email address in the 'email' column and a non-empty name in the 'name' column" - "The 'sentiment' column should contain positive sentiment words (happy, good, excellent, etc.)" - "Product descriptions should mention at least one technical specification" Poor prompt examples (avoid these): - "What do you think about this data?" (too open-ended) - "Describe the quality of each row" (asks for description, not validation) - "How would you improve this data?" (asks for suggestions, not pass/fail) Performance Considerations -------------------------- AI validation is significantly slower than traditional validation methods due to API calls to LLM providers. However, performance varies dramatically based on data characteristics: **High Memoization Scenarios** (seconds to minutes): - data with many duplicate rows in the selected columns - low cardinality data (repeated patterns) - small number of unique row combinations **Low Memoization Scenarios** (minutes to hours): - high cardinality data with mostly unique rows - large datasets with few repeated patterns - all or most rows requiring individual LLM evaluation The row signature memoization optimization can reduce processing time significantly when data has repetitive patterns. For datasets where every row is unique, expect longer processing times similar to validating each row individually. 
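As a rough way to reason about cost before running a validation, you can estimate the number of LLM requests from the unique-row count and `batch_size=`. A back-of-envelope sketch under the batching and deduplication behavior described above (the row counts here are made up):

```python
import math

n_rows = 50_000    # total rows in the table (hypothetical)
n_unique = 1_200   # unique combinations in the selected columns (hypothetical)
batch_size = 1000  # as passed via prompt(batch_size=...)

# Only unique row signatures are sent to the LLM, in batches:
n_requests = math.ceil(n_unique / batch_size)
print(n_requests)  # 2 requests, versus 50 without deduplication
```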
**Strategies to Reduce Processing Time**:

- test on data slices: define a sampling function like `def sample_1000(df): return df.head(1000)` and use `pre=sample_1000` to validate on smaller samples
- filter relevant data: define filter functions like `def active_only(df): return df.filter(df["status"] == "active")` and use `pre=active_only` to focus on a specific subset
- optimize column selection: use `columns_subset=` to include only the columns necessary for validation
- start with smaller batches: begin with `batch_size=100` for testing, then increase gradually
- reduce concurrency: lower the concurrency (e.g., `max_concurrent=1`) if hitting rate limits
- use faster/cheaper models: consider using smaller or more efficient models for initial testing before switching to more capable models

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

The following examples demonstrate how to use AI validation for different types of data quality checks. These examples show both basic usage and more advanced configurations with custom thresholds and actions.

**Basic AI validation example:**

This first example shows a simple validation scenario where we want to check that customer records have both valid email addresses and non-empty names. Notice how we use `columns_subset=` to focus only on the relevant columns, which improves both performance and cost-effectiveness.

```python
import pointblank as pb
import polars as pl

# Sample data with email and name columns
tbl = pl.DataFrame({
    "email": ["john@example.com", "invalid-email", "jane@test.org"],
    "name": ["John Doe", "", "Jane Smith"],
    "age": [25, 30, 35]
})

# Validate using AI
validation = (
    pb.Validate(data=tbl)
    .prompt(
        prompt="Each row should have a valid email address and a non-empty name",
        columns_subset=["email", "name"],  # Only check these columns
        model="openai:gpt-4o-mini",
    )
    .interrogate()
)

validation
```

In this example, the AI will identify that the second row fails validation because it has both an invalid email format (`"invalid-email"`) and an empty name field. The validation results will show 1 out of 3 rows failing the criteria.

**Advanced example with custom thresholds:**

This more sophisticated example demonstrates how to use AI validation with custom thresholds and actions. Here we're validating phone number formats to ensure they include area codes, which is a common data quality requirement for customer contact information.

```python
customer_data = pl.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
    "phone_number": [
        "(555) 123-4567",  # Valid with area code
        "555-987-6543",    # Valid with area code
        "123-4567",        # Missing area code
        "(800) 555-1234",  # Valid with area code
        "987-6543"         # Missing area code
    ]
})

validation = (
    pb.Validate(data=customer_data)
    .prompt(
        prompt="Do all the phone numbers include an area code?",
        columns_subset="phone_number",  # Only check the `phone_number` column
        model="openai:gpt-4o",
        batch_size=500,
        max_concurrent=5,
        thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3),
        actions=pb.Actions(error="Too many phone numbers missing area codes.")
    )
    .interrogate()
)
```

This validation will identify that 2 out of 5 phone numbers (40%) are missing area codes, which exceeds all threshold levels.
The validation will trigger the specified error action since the failure rate (40%) is above the error threshold (20%). The AI can recognize various phone number formats and determine whether they include area codes.

## Aggregation Steps

These validation methods check aggregated column values (sums, averages, standard deviations) against fixed values or column references.

col_sum_gt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column sum satisfy a greater than comparison?

The `col_sum_gt()` validation method checks whether the sum of values in a column is greater than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single sum value that is then compared against the target. The comparison used in this function is `sum(column) > value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the sum to be computed.
value
    The value to compare the column sum against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose sum will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a sum that falls short of the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison: for `col_sum_gt()`, a tolerance of `tol=0.5` effectively relaxes the comparison to `sum(column) > value - 0.5`.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active.
Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------
The `col_sum_gt()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines.

To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_sum_gt(columns="revenue")  # Compares sum(current.revenue) vs sum(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_sum_gt(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------
The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary: for `col_sum_gt()`, a tolerance of `tol=0.5` relaxes the comparison to `sum(column) > value - 0.5`, so a sum up to `0.5` below the target can still pass validation.

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'.

Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns.
The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the sum of column `a` is greater than `15`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_gt(columns="a", value=15)
    .interrogate()
)

validation
```

The validation result shows whether the sum comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column.

When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_gt(columns=["a", "b"], value=15)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_gt(columns="a", value=15, tol=1.0)
    .interrogate()
)

validation
```

The sum of column `a` is exactly `15`, so the strict comparison `15 > 15` fails; the tolerance of `1.0` lowers the effective boundary to `14`, allowing the step to pass.

col_sum_lt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column sum satisfy a less than comparison?

The `col_sum_lt()` validation method checks whether the sum of values in a column is less than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single sum value that is then compared against the target. The comparison used in this function is `sum(column) < value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the sum to be computed.
value
    The value to compare the column sum against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose sum will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a sum that exceeds the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison: for `col_sum_lt()`, a tolerance of `tol=0.5` effectively relaxes the comparison to `sum(column) < value + 0.5`.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable.
brief
    An optional brief description of the validation step that will be displayed in the reporting table.
col_sum_lt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column sum satisfy a less than comparison?

The `col_sum_lt()` validation method checks whether the sum of values in a column is less than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single sum value that is then compared against the target. The comparison used in this function is `sum(column) < value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the sum to be computed.
value
    The value to compare the column sum against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose sum will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a sum that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sum_lt()`, a tolerance of `tol=0.5` would mean the sum can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_sum_lt()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_sum_lt(columns="revenue")  # Compares sum(current.revenue) vs sum(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_sum_lt(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sum_lt()`, a tolerance of `tol=0.5` would mean the sum can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary, as sketched below.
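A minimal, hedged sketch of that boundary shift (this assumes the relaxed check for `col_sum_lt()` becomes `sum(column) < value + tol`; the table is illustrative):

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame({"a": [1, 2, 3, 4, 5]})  # sum(a) == 15

# Exact comparison: 15 < 15 is False, so the step fails
strict = pb.Validate(data=tbl).col_sum_lt(columns="a", value=15).interrogate()

# With tol=0.5 the boundary is assumed to relax to 15.5, so 15 passes
relaxed = pb.Validate(data=tbl).col_sum_lt(columns="a", value=15, tol=0.5).interrogate()
```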
Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the sum of column `a` is less than `15`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_lt(columns="a", value=15)
    .interrogate()
)

validation
```

The validation result shows whether the sum comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_lt(columns=["a", "b"], value=15)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_lt(columns="a", value=15, tol=1.0)
    .interrogate()
)

validation
```

col_sum_ge(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column sum satisfy a greater than or equal to comparison?

The `col_sum_ge()` validation method checks whether the sum of values in a column is at least a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single sum value that is then compared against the target. The comparison used in this function is `sum(column) >= value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the sum to be computed.
value
    The value to compare the column sum against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose sum will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a sum that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sum_ge()`, a tolerance of `tol=0.5` would mean the sum can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_sum_ge()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_sum_ge(columns="revenue")  # Compares sum(current.revenue) vs sum(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_sum_ge(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sum_ge()`, a tolerance of `tol=0.5` would mean the sum can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'.
Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the sum of column `a` is at least `15`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_ge(columns="a", value=15)
    .interrogate()
)

validation
```

The validation result shows whether the sum comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_ge(columns=["a", "b"], value=15)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_ge(columns="a", value=15, tol=1.0)
    .interrogate()
)

validation
```

col_sum_le(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column sum satisfy a less than or equal to comparison?

The `col_sum_le()` validation method checks whether the sum of values in a column is at most a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single sum value that is then compared against the target. The comparison used in this function is `sum(column) <= value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the sum to be computed.
value
    The value to compare the column sum against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose sum will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a sum that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sum_le()`, a tolerance of `tol=0.5` would mean the sum can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_sum_le()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_sum_le(columns="revenue")  # Compares sum(current.revenue) vs sum(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_sum_le(columns="revenue", value=pb.ref("baseline_revenue"))
```
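For a self-contained picture of the reference workflow, here is a minimal, hedged sketch with two small Polars frames (the frames and figures are illustrative; with `value=None` the current sum is compared against the baseline sum as described above):

```python
import pointblank as pb
import polars as pl

current_data = pl.DataFrame({"revenue": [100.0, 250.0, 75.0]})   # sums to 425.0
baseline_data = pl.DataFrame({"revenue": [120.0, 260.0, 90.0]})  # sums to 470.0

validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_sum_le(columns="revenue")  # 425.0 <= 470.0, so this passes
    .interrogate()
)
```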
Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sum_le()`, a tolerance of `tol=0.5` would mean the sum can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the sum of column `a` is at most `15`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_le(columns="a", value=15)
    .interrogate()
)

validation
```

The validation result shows whether the sum comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_le(columns=["a", "b"], value=15)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_le(columns="a", value=15, tol=1.0)
    .interrogate()
)

validation
```

col_sum_eq(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column sum satisfy an equal to comparison?
The `col_sum_eq()` validation method checks whether the sum of values in a column equals a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single sum value that is then compared against the target. The comparison used in this function is `sum(column) == value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the sum to be computed.
value
    The value to compare the column sum against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose sum will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a sum that differs from the target by up to `0.5` will still pass. The `tol=` parameter is particularly useful with `col_sum_eq()` since exact equality comparisons on floating-point aggregations can be problematic due to numerical precision. Setting a small tolerance (e.g., `tol=0.001`) allows for minor differences that arise from floating-point arithmetic.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_sum_eq()` method supports comparing column aggregations against reference data.
This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_sum_eq(columns="revenue")  # Compares sum(current.revenue) vs sum(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_sum_eq(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter is particularly useful with `col_sum_eq()` since exact equality comparisons on floating-point aggregations can be problematic due to numerical precision. Setting a small tolerance (e.g., `tol=0.001`) allows for minor differences that arise from floating-point arithmetic.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the sum of column `a` equals `15`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_eq(columns="a", value=15)
    .interrogate()
)

validation
```

The validation result shows whether the sum comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column.
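Because a sum of floats rarely lands exactly on a target, a small tolerance is often the difference between a brittle and a robust equality step. A minimal, hedged sketch (the `float_tbl` frame is illustrative):

```python
import pointblank as pb
import polars as pl

# Ten 0.1 values sum to 0.9999999999999999 in IEEE-754 doubles, so an
# exact equality check against 1.0 may fail; a tiny tol= absorbs the
# rounding error.
float_tbl = pl.DataFrame({"x": [0.1] * 10})

validation = (
    pb.Validate(data=float_tbl)
    .col_sum_eq(columns="x", value=1.0, tol=1e-9)
    .interrogate()
)
```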
When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_eq(columns=["a", "b"], value=15)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_sum_eq(columns="a", value=15, tol=1.0)
    .interrogate()
)

validation
```

col_avg_gt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column average satisfy a greater than comparison?

The `col_avg_gt()` validation method checks whether the average of values in a column is greater than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single average value that is then compared against the target. The comparison used in this function is `average(column) > value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the average to be computed.
value
    The value to compare the column average against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose average will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, an average that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_gt()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_avg_gt()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_avg_gt(columns="revenue")  # Compares avg(current.revenue) vs avg(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_avg_gt(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_gt()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)` (see the sketch after the list below).

There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only
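To see the step-level override in action, here is a minimal, hedged sketch in which a global 'warning'-only threshold is overridden by a step-level setting that trips all three levels on failure (the table and values are illustrative):

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame({"a": [1, 2, 3, 4, 5]})  # avg(a) == 3

validation = (
    pb.Validate(data=tbl, thresholds=1)                        # global: any failure -> 'warning'
    .col_avg_gt(columns="a", value=10, thresholds=(1, 1, 1))   # step-level override: all three levels
    .interrogate()
)
```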
Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the average of column `a` is greater than `3`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_gt(columns="a", value=3)
    .interrogate()
)

validation
```

The validation result shows whether the average comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_gt(columns=["a", "b"], value=3)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_gt(columns="a", value=3, tol=1.0)
    .interrogate()
)

validation
```

col_avg_lt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column average satisfy a less than comparison?

The `col_avg_lt()` validation method checks whether the average of values in a column is less than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single average value that is then compared against the target. The comparison used in this function is `average(column) < value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the average to be computed.
value
    The value to compare the column average against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose average will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, an average that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_lt()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table, as sketched below.
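To make the callable form of `active=` concrete, here is a minimal, hedged sketch; the lambda assumes a table exposing a Polars-style `.columns` attribute, and the `has_columns()`/`has_rows()` helpers mentioned above could serve the same purpose:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame({"a": [1, 2, 3, 4, 5]})

# The step only runs when column "a" is present; the callable receives
# the target table and returns a boolean, as described above.
validation = (
    pb.Validate(data=tbl)
    .col_avg_lt(columns="a", value=10, active=lambda table: "a" in table.columns)
    .interrogate()
)
```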
Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_avg_lt()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_avg_lt(columns="revenue")  # Compares avg(current.revenue) vs avg(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_avg_lt(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_lt()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the average of column `a` is less than `3`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_lt(columns="a", value=3)
    .interrogate()
)

validation
```

The validation result shows whether the average comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_lt(columns=["a", "b"], value=3)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_lt(columns="a", value=3, tol=1.0)
    .interrogate()
)

validation
```

col_avg_ge(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column average satisfy a greater than or equal to comparison?

The `col_avg_ge()` validation method checks whether the average of values in a column is at least a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single average value that is then compared against the target. The comparison used in this function is `average(column) >= value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the average to be computed.
value
    The value to compare the column average against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose average will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, an average that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_ge()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_avg_ge()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_avg_ge(columns="revenue")  # Compares avg(current.revenue) vs avg(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_avg_ge(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_ge()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'.
Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the average of column `a` is at least `3`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_ge(columns="a", value=3)
    .interrogate()
)

validation
```

The validation result shows whether the average comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_ge(columns=["a", "b"], value=3)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_ge(columns="a", value=3, tol=1.0)
    .interrogate()
)

validation
```

col_avg_le(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column average satisfy a less than or equal to comparison?

The `col_avg_le()` validation method checks whether the average of values in a column is at most a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single average value that is then compared against the target. The comparison used in this function is `average(column) <= value`.

Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely.

Parameters
----------
columns
    A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the average to be computed.
value
    The value to compare the column average against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose average will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set).
tol
    A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, an average that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_le()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.
thresholds
    Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values should typically be set as absolute counts (e.g., `1`) to indicate pass/fail; proportional thresholds behave the same way here because the single test unit either wholly passes or wholly fails.
brief
    An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True`, the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
actions
    Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active
    A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table.

Returns
-------
Validate
    The `Validate` object with the added validation step.

Using Reference Data
--------------------

The `col_avg_le()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object:

```python
validation = (
    pb.Validate(data=current_data, reference=baseline_data)
    .col_avg_le(columns="revenue")  # Compares avg(current.revenue) vs avg(baseline.revenue)
    .interrogate()
)
```

When `value=None` and reference data is set, the method automatically compares against the same column in the reference data.
You can also explicitly specify reference columns using the `ref()` helper:

```python
.col_avg_le(columns="revenue", value=pb.ref("baseline_revenue"))
```

Understanding Tolerance
-----------------------

The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_avg_le()`, a tolerance of `tol=0.5` would mean the average can be within `0.5` of the target value and still pass validation.

For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary.

Thresholds
----------

The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts:

- `thresholds=1` means any failure triggers a 'warning'
- `thresholds=(1, 1, 1)` means any failure triggers all three levels

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

Examples
--------

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below:

```{python}
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 2, 2, 2, 2],
    }
)

pb.preview(tbl)
```

Let's validate that the average of column `a` is at most `3`:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_le(columns="a", value=3)
    .interrogate()
)

validation
```

The validation result shows whether the average comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_le(columns=["a", "b"], value=3)
    .interrogate()
)

validation
```

Using tolerance for flexible comparisons:

```{python}
validation = (
    pb.Validate(data=tbl)
    .col_avg_le(columns="a", value=3, tol=1.0)
    .interrogate()
)

validation
```

col_avg_eq(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate'

Does the column average satisfy an equal to comparison?
The `col_avg_eq()` validation method checks whether the average of values in a column equals a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single average value that is then compared against the target. The comparison used in this function is `average(column) == value`. Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely. Parameters ---------- columns A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the average to be computed. value The value to compare the column average against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose average will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set). tol A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, an average that differs from the target by up to `0.5` will still pass. The `tol=` parameter is particularly useful with `col_avg_eq()` since exact equality comparisons on floating-point aggregations can be problematic due to numerical precision. Setting a small tolerance (e.g., `tol=0.001`) allows for minor differences that arise from floating-point arithmetic. thresholds Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step.
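Since the `active=` parameter accepts a callable evaluated against the target table, a step can be enabled conditionally. Here is a minimal sketch, assuming (per the parameter description above) that `pb.has_columns()` takes a column name and yields a callable suitable for `active=`:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame({"a": [1, 2, 3, 4, 5]})

validation = (
    pb.Validate(data=tbl)
    # Hypothetical usage: only run the step if column "a" exists in the table
    .col_avg_eq(columns="a", value=3, active=pb.has_columns("a"))
    .interrogate()
)
```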
Using Reference Data -------------------- The `col_avg_eq()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object: ```python validation = ( pb.Validate(data=current_data, reference=baseline_data) .col_avg_eq(columns="revenue") # Compares avg(current.revenue) vs avg(baseline.revenue) .interrogate() ) ``` When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper: ```python .col_avg_eq(columns="revenue", value=pb.ref("baseline_revenue")) ``` Understanding Tolerance ----------------------- The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter is particularly useful with `col_avg_eq()` since exact equality comparisons on floating-point aggregations can be problematic due to numerical precision. Setting a small tolerance (e.g., `tol=0.001`) allows for minor differences that arise from floating-point arithmetic. For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts: - `thresholds=1` means any failure triggers a 'warning' - `thresholds=(1, 1, 1)` means any failure triggers all three levels Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 3, 4, 5], "b": [2, 2, 2, 2, 2], } ) pb.preview(tbl) ``` Let's validate that the average of column `a` equals `3`: ```{python} validation = ( pb.Validate(data=tbl) .col_avg_eq(columns="a", value=3) .interrogate() ) validation ``` The validation result shows whether the average comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column.
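To see why this particular step passes, it can help to compute the aggregation directly; a quick cross-check with Polars, using the `tbl` defined above:

```python
# mean(a) = (1 + 2 + 3 + 4 + 5) / 5 = 3.0, satisfying the equality exactly
print(tbl["a"].mean())  # 3.0
```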
When validating multiple columns, each column gets its own validation step: ```{python} validation = ( pb.Validate(data=tbl) .col_avg_eq(columns=["a", "b"], value=3) .interrogate() ) validation ``` Using tolerance for flexible comparisons: ```{python} validation = ( pb.Validate(data=tbl) .col_avg_eq(columns="a", value=3, tol=1.0) .interrogate() ) validation ``` col_sd_gt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate' Does the column standard deviation satisfy a greater than comparison? The `col_sd_gt()` validation method checks whether the standard deviation of values in a column is greater than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single standard deviation value that is then compared against the target. The comparison used in this function is `standard deviation(column) > value`. Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely. Parameters ---------- columns A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the standard deviation to be computed. value The value to compare the column standard deviation against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose standard deviation will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set). tol A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a standard deviation that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_gt()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. thresholds Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Using Reference Data -------------------- The `col_sd_gt()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object: ```python validation = ( pb.Validate(data=current_data, reference=baseline_data) .col_sd_gt(columns="revenue") # Compares sd(current.revenue) vs sd(baseline.revenue) .interrogate() ) ``` When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper: ```python .col_sd_gt(columns="revenue", value=pb.ref("baseline_revenue")) ``` Understanding Tolerance ----------------------- The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_gt()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts: - `thresholds=1` means any failure triggers a 'warning' - `thresholds=(1, 1, 1)` means any failure triggers all three levels Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4.
a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 3, 4, 5], "b": [2, 2, 2, 2, 2], } ) pb.preview(tbl) ``` Let's validate that the standard deviation of column `a` is greater than `2`: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_gt(columns="a", value=2) .interrogate() ) validation ``` The validation result shows whether the standard deviation comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_gt(columns=["a", "b"], value=2) .interrogate() ) validation ``` Using tolerance for flexible comparisons: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_gt(columns="a", value=2, tol=1.0) .interrogate() ) validation ``` col_sd_lt(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate' Does the column standard deviation satisfy a less than comparison? The `col_sd_lt()` validation method checks whether the standard deviation of values in a column is less than a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single standard deviation value that is then compared against the target. The comparison used in this function is `standard deviation(column) < value`. Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely. Parameters ---------- columns A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the standard deviation to be computed. value The value to compare the column standard deviation against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose standard deviation will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set). tol A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a standard deviation that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_lt()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. 
thresholds Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Using Reference Data -------------------- The `col_sd_lt()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object: ```python validation = ( pb.Validate(data=current_data, reference=baseline_data) .col_sd_lt(columns="revenue") # Compares sd(current.revenue) vs sd(baseline.revenue) .interrogate() ) ``` When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper: ```python .col_sd_lt(columns="revenue", value=pb.ref("baseline_revenue")) ``` Understanding Tolerance ----------------------- The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_lt()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'.
Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts: - `thresholds=1` means any failure triggers a 'warning' - `thresholds=(1, 1, 1)` means any failure triggers all three levels Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 3, 4, 5], "b": [2, 2, 2, 2, 2], } ) pb.preview(tbl) ``` Let's validate that the standard deviation of column `a` is less than `2`: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_lt(columns="a", value=2) .interrogate() ) validation ``` The validation result shows whether the standard deviation comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_lt(columns=["a", "b"], value=2) .interrogate() ) validation ``` Using tolerance for flexible comparisons: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_lt(columns="a", value=2, tol=1.0) .interrogate() ) validation ``` col_sd_ge(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate' Does the column standard deviation satisfy a greater than or equal to comparison? The `col_sd_ge()` validation method checks whether the standard deviation of values in a column is at least a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single standard deviation value that is then compared against the target. The comparison used in this function is `standard deviation(column) >= value`. Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely. Parameters ---------- columns A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the standard deviation to be computed. value The value to compare the column standard deviation against.
This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose standard deviation will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set). tol A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a standard deviation that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_ge()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. thresholds Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Using Reference Data -------------------- The `col_sd_ge()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object: ```python validation = ( pb.Validate(data=current_data, reference=baseline_data) .col_sd_ge(columns="revenue") # Compares sd(current.revenue) vs sd(baseline.revenue) .interrogate() ) ``` When `value=None` and reference data is set, the method automatically compares against the same column in the reference data.
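Tolerance can be combined with a reference comparison. A hedged sketch of the assumed semantics (per the *Understanding Tolerance* section below, `tol=` relaxes the `>=` boundary):

```python
# Passes when sd(current.revenue) >= sd(baseline.revenue) - 0.5
.col_sd_ge(columns="revenue", tol=0.5)
```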
You can also explicitly specify reference columns using the `ref()` helper: ```python .col_sd_ge(columns="revenue", value=pb.ref("baseline_revenue")) ``` Understanding Tolerance ----------------------- The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_ge()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts: - `thresholds=1` means any failure triggers a 'warning' - `thresholds=(1, 1, 1)` means any failure triggers all three levels Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 3, 4, 5], "b": [2, 2, 2, 2, 2], } ) pb.preview(tbl) ``` Let's validate that the standard deviation of column `a` is at least `2`: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_ge(columns="a", value=2) .interrogate() ) validation ``` The validation result shows whether the standard deviation comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_ge(columns=["a", "b"], value=2) .interrogate() ) validation ``` Using tolerance for flexible comparisons: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_ge(columns="a", value=2, tol=1.0) .interrogate() ) validation ``` col_sd_le(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate' Does the column standard deviation satisfy a less than or equal to comparison?
The `col_sd_le()` validation method checks whether the standard deviation of values in a column is at most a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single standard deviation value that is then compared against the target. The comparison used in this function is `standard deviation(column) <= value`. Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely. Parameters ---------- columns A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the standard deviation to be computed. value The value to compare the column standard deviation against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose standard deviation will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set). tol A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a standard deviation that differs from the target by up to `0.5` will still pass. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_le()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. thresholds Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step.
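Because an aggregation step has a single test unit, step-level thresholds are usually absolute counts. A minimal sketch of setting them per step, using the documented tuple scheme (positions map to 'warning' and 'error'):

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame({"a": [1, 2, 3, 4, 5]})

validation = (
    pb.Validate(data=tbl)
    # With one test unit, any failure trips both 'warning' and 'error'
    .col_sd_le(columns="a", value=2, thresholds=(1, 1))
    .interrogate()
)
```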
Using Reference Data -------------------- The `col_sd_le()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object: ```python validation = ( pb.Validate(data=current_data, reference=baseline_data) .col_sd_le(columns="revenue") # Compares sd(current.revenue) vs sd(baseline.revenue) .interrogate() ) ``` When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper: ```python .col_sd_le(columns="revenue", value=pb.ref("baseline_revenue")) ``` Understanding Tolerance ----------------------- The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter expands the acceptable range for the comparison. For `col_sd_le()`, a tolerance of `tol=0.5` would mean the standard deviation can be within `0.5` of the target value and still pass validation. For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts: - `thresholds=1` means any failure triggers a 'warning' - `thresholds=(1, 1, 1)` means any failure triggers all three levels Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 3, 4, 5], "b": [2, 2, 2, 2, 2], } ) pb.preview(tbl) ``` Let's validate that the standard deviation of column `a` is at most `2`: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_le(columns="a", value=2) .interrogate() ) validation ``` The validation result shows whether the standard deviation comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column.
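For reference, the aggregated value can be computed directly. Note that Polars' `std()` defaults to the sample standard deviation (`ddof=1`), which is assumed here to match what the validation computes:

```python
# sd(a) = sqrt(((1-3)**2 + (2-3)**2 + (3-3)**2 + (4-3)**2 + (5-3)**2) / 4)
#       = sqrt(2.5) ≈ 1.5811, which is at most 2, so the step above passes
print(tbl["a"].std())  # ~1.5811
```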
When validating multiple columns, each column gets its own validation step: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_le(columns=["a", "b"], value=2) .interrogate() ) validation ``` Using tolerance for flexible comparisons: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_le(columns="a", value=2, tol=1.0) .interrogate() ) validation ``` col_sd_eq(self: 'Validate', columns: 'str | Collection[str]', value: 'float | int | Column | ReferenceColumn | None' = None, tol: 'float' = 0, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, brief: 'str | bool | None' = None, actions: 'Actions | None' = None, active: 'bool | Callable' = True) -> 'Validate' Does the column standard deviation satisfy an equal to comparison? The `col_sd_eq()` validation method checks whether the standard deviation of values in a column equals a specified `value=`. This is an aggregation-based validation where the entire column is reduced to a single standard deviation value that is then compared against the target. The comparison used in this function is `standard deviation(column) == value`. Unlike row-level validations (e.g., `col_vals_gt()`), this method treats the entire column as a single test unit. The validation either passes completely (if the aggregated value satisfies the comparison) or fails completely. Parameters ---------- columns A single column or a list of columns to validate. If multiple columns are supplied, there will be a separate validation step generated for each column. The columns must contain numeric data for the standard deviation to be computed. value The value to compare the column standard deviation against. This can be: (1) a numeric literal (`int` or `float`), (2) a [`col()`](`pointblank.col`) object referencing another column whose standard deviation will be used for comparison, (3) a [`ref()`](`pointblank.ref`) object referencing a column in reference data (when `Validate(reference=)` has been set), or (4) `None` to automatically compare against the same column in reference data (shorthand for `ref(column_name)` when reference data is set). tol A tolerance value for the comparison. The default is `0`, meaning exact comparison. When set to a positive value, the comparison becomes more lenient. For example, with `tol=0.5`, a standard deviation that differs from the target by up to `0.5` will still pass. The `tol=` parameter is particularly useful with `col_sd_eq()` since exact equality comparisons on floating-point aggregations can be problematic due to numerical precision. Setting a small tolerance (e.g., `tol=0.001`) allows for minor differences that arise from floating-point arithmetic. thresholds Failure threshold levels so that the validation step can react accordingly when failing test units exceed the set levels. Since this is an aggregation-based validation with only one test unit, threshold values typically should be set as absolute counts (e.g., `1`) to indicate pass/fail, or as proportions where any value less than `1.0` means failure is acceptable. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. actions Optional actions to take when the validation step meets or exceeds any set threshold levels.
If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. active A boolean value or callable that determines whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). A callable can also be provided; it will receive the data table as its single argument and must return a boolean value. The callable is evaluated *before* any `pre=` processing. Inspection functions like [`has_columns()`](`pointblank.has_columns`) and [`has_rows()`](`pointblank.has_rows`) can be used here to conditionally activate a step based on properties of the target table. Returns ------- Validate The `Validate` object with the added validation step. Using Reference Data -------------------- The `col_sd_eq()` method supports comparing column aggregations against reference data. This is useful for validating that statistical properties remain consistent across different versions of a dataset, or for comparing current data against historical baselines. To use reference data, set the `reference=` parameter when creating the `Validate` object: ```python validation = ( pb.Validate(data=current_data, reference=baseline_data) .col_sd_eq(columns="revenue") # Compares sd(current.revenue) vs sd(baseline.revenue) .interrogate() ) ``` When `value=None` and reference data is set, the method automatically compares against the same column in the reference data. You can also explicitly specify reference columns using the `ref()` helper: ```python .col_sd_eq(columns="revenue", value=pb.ref("baseline_revenue")) ``` Understanding Tolerance ----------------------- The `tol=` parameter allows for fuzzy comparisons, which is especially important for floating-point aggregations where exact equality is often unreliable. The `tol=` parameter is particularly useful with `col_sd_eq()` since exact equality comparisons on floating-point aggregations can be problematic due to numerical precision. Setting a small tolerance (e.g., `tol=0.001`) allows for minor differences that arise from floating-point arithmetic. For equality comparisons (`col_*_eq`), the tolerance creates a range `[value - tol, value + tol]` within which the aggregation is considered valid. For inequality comparisons, the tolerance shifts the comparison boundary. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. Since aggregation validations operate on a single test unit (the aggregated value), threshold values are typically set as absolute counts: - `thresholds=1` means any failure triggers a 'warning' - `thresholds=(1, 1, 1)` means any failure triggers all three levels Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4.
a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples, we'll use a simple Polars DataFrame with numeric columns. The table is shown below: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 3, 4, 5], "b": [2, 2, 2, 2, 2], } ) pb.preview(tbl) ``` Let's validate that the standard deviation of column `a` equals `2`: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_eq(columns="a", value=2) .interrogate() ) validation ``` The validation result shows whether the standard deviation comparison passed or failed. Since this is an aggregation-based validation, there is exactly one test unit per column. When validating multiple columns, each column gets its own validation step: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_eq(columns=["a", "b"], value=2) .interrogate() ) validation ``` Using tolerance for flexible comparisons: ```{python} validation = ( pb.Validate(data=tbl) .col_sd_eq(columns="a", value=2, tol=1.0) .interrogate() ) validation ``` ## Column Selection Use the `col()` function along with column selection helpers to flexibly select columns for validation. Combine `col()` with `starts_with()`, `matches()`, etc. for selecting multiple target columns. col(exprs: 'str | ColumnSelector | ColumnSelectorNarwhals | nw.selectors.Selector') -> 'Column | ColumnLiteral | ColumnSelectorNarwhals' Helper function for referencing a column in the input table. Many of the validation methods (i.e., `col_vals_*()` methods) in Pointblank have a `value=` argument. These validations are comparisons between column values and a literal value, or between column values and adjacent values in another column. The `col()` helper function is used to specify that it is a column being referenced, not a literal value. The `col()` function doesn't check that the column exists in the input table. It acts to signal that the value being compared is a column value. During validation (i.e., when [`interrogate()`](`pointblank.Validate.interrogate`) is called), Pointblank will then check that the column exists in the input table. For creating expressions to use with the `conjointly()` validation method, use the [`expr_col()`](`pointblank.expr_col`) function instead. Parameters ---------- exprs Either the name of a single column in the target table, provided as a string, or an expression involving column selector functions (e.g., `starts_with("a")`, `ends_with("e") | starts_with("a")`, etc.). Returns ------- Column | ColumnLiteral | ColumnSelectorNarwhals A column object or expression representing the column reference.
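To make the literal-versus-column distinction concrete, here is a short sketch (columns `a` and `b` are hypothetical):

```python
.col_vals_gt(columns="a", value=5)            # each value of a compared to the literal 5
.col_vals_gt(columns="a", value=pb.col("b"))  # each value of a compared to b in the same row
```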
Usage with the `columns=` Argument ----------------------------------- The `col()` function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) If specifying a single column with certainty (you have the exact name), `col()` is not necessary since you can just pass the column name as a string (though it is still valid to use `col("column_name")`, if preferred). However, if you want to select columns based on complex logic involving multiple column selector functions (e.g., columns that start with `"a"` but don't end with `"e"`), you need to use `col()` to wrap expressions involving column selector functions and logical operators such as `&`, `|`, `-`, and `~`. Here is an example of such usage with the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) validation method: ```python col_vals_gt(columns=col(starts_with("a") & ~ends_with("e")), value=10) ``` If using only a single column selector function, you can pass the function directly to the `columns=` argument of the validation method, or you can use `col()` to wrap the function (either is valid though the first is more concise). Here is an example of that simpler usage: ```python col_vals_gt(columns=starts_with("a"), value=10) ``` Usage with the `value=`, `left=`, and `right=` Arguments -------------------------------------------------------- The `col()` function can be used in the `value=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) and in the `left=` and `right=` arguments (either or both) of these two validation methods: - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) You cannot use column selector functions such as [`starts_with()`](`pointblank.starts_with`) in any of the `value=`, `left=`, or `right=` arguments since there would be no guarantee that a single column will be resolved from the target table with this approach. The `col()` function is used to signal that the value being compared is a column value and not a literal value.
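As a sketch of column references supplying range bounds (here `low` and `high` are hypothetical columns holding per-row bounds):

```python
# Each row's value of x is checked against that row's values in low and high
.col_vals_between(columns="x", left=pb.col("low"), right=pb.col("high"))
```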
Available Selectors ------------------- There is a collection of selectors available in pointblank, allowing you to select columns based on attributes of column names and positions. The selectors are: - [`starts_with()`](`pointblank.starts_with`) - [`ends_with()`](`pointblank.ends_with`) - [`contains()`](`pointblank.contains`) - [`matches()`](`pointblank.matches`) - [`everything()`](`pointblank.everything`) - [`first_n()`](`pointblank.first_n`) - [`last_n()`](`pointblank.last_n`) Alternatively, we support selectors from the Narwhals library! Those selectors can additionally take advantage of the data types of the columns. The selectors are: - `boolean()` - `by_dtype()` - `categorical()` - `matches()` - `numeric()` - `string()` Have a look at the [Narwhals API documentation on selectors](https://narwhals-dev.github.io/narwhals/api-reference/selectors/) for more information. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `a` and `b` and we'd like to validate that the values in column `a` are greater than the values in column `b`. We can use the `col()` helper function to reference the comparison column when creating the validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 7, 6, 5], "b": [4, 2, 3, 3, 4, 3], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=pb.col("b")) .interrogate() ) validation ``` From the results of the validation table it can be seen that values in `a` were greater than values in `b` for every row (or test unit). Using `value=pb.col("b")` specifies that the greater-than comparison is across columns, not with a fixed literal value. If you want to select an arbitrary set of columns upon which to base a validation, you can use column selector functions (e.g., [`starts_with()`](`pointblank.starts_with`), [`ends_with()`](`pointblank.ends_with`), etc.) to specify columns in the `columns=` argument of a validation method. Let's use the [`starts_with()`](`pointblank.starts_with`) column selector function to select columns that start with `"paid"` and validate that the values in those columns are greater than `10`. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [16.32, 16.25, 15.75], "paid_2022": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.col(pb.starts_with("paid")), value=10) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the [`starts_with()`](`pointblank.starts_with`) column selector function. This is not strictly necessary when using a single column selector function, so `columns=pb.starts_with("paid")` would be equivalent usage here. However, the use of `col()` is required when using multiple column selector functions with logical operators.
Here is an example of that more complex usage: ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "hours_2022": [160, 180, 160], "hours_2023": [182, 168, 175], "hours_2024": [200, 165, 190], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.starts_with("paid") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the [`starts_with()`](`pointblank.starts_with`) and [`matches()`](`pointblank.matches`) column selector functions, combined with the `&` operator. This is necessary to specify the set of columns that start with `"paid"` *and* match the text `"2023"` or `"2024"`. If you'd like to take advantage of Narwhals selectors, that's also possible. Here is an example of using the `numeric()` column selector function to select all numeric columns for validation, checking that their values are greater than `0`. ```{python} import narwhals.selectors as ncs tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "hours_2022": [160, 180, 160], "hours_2023": [182, 168, 175], "hours_2024": [200, 165, 190], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_ge(columns=pb.col(ncs.numeric()), value=0) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the `numeric()` column selector function from Narwhals. As with the other selectors, this is not strictly necessary when using a single column selector, so `columns=ncs.numeric()` would also be fine here. Narwhals selectors can also use operators to combine multiple selectors. Here is an example of using the `numeric()` and `matches()` selectors from Narwhals together to select all numeric columns that fit a specific pattern. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_status": ["ft", "ft", "pt"], "2023_status": ["ft", "pt", "ft"], "2024_status": ["ft", "pt", "ft"], "2022_pay_total": [18.62, 16.95, 18.25], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_lt(columns=pb.col(ncs.numeric() & ncs.matches("2023|2024")), value=30) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the `numeric()` and `matches()` column selector functions from Narwhals, combined with the `&` operator. This is necessary to specify the set of columns that are numeric *and* match the text `"2023"` or `"2024"`. See Also -------- Create a column expression for use in `conjointly()` validation with the [`expr_col()`](`pointblank.expr_col`) function. starts_with(text: 'str', case_sensitive: 'bool' = False) -> 'StartsWith' Select columns that start with specified text. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `starts_with()` selector function can be used to select one or more columns that start with some specified text. So if the set of table columns consists of `[name_first, name_last, age, address]` and you want to validate columns that start with `"name"`, you can use `columns=starts_with("name")`.
This will select the `name_first` and `name_last` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `starts_with()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- text The text that the column name should start with. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- StartsWith A `StartsWith` object, which can be used to select columns that start with the specified text. Relevant Validation Methods where `starts_with()` can be Used ------------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `starts_with()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `starts_with()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that start with `"a"` and end with `"e"`, you can use the `starts_with()` and [`ends_with()`](`pointblank.ends_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(starts_with("a") & ends_with("e")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation).
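As a minimal sketch of the difference and negation operators (assuming a hypothetical table with columns `name_first`, `name_last`, and `age`):

```python
# Columns starting with "name", except those ending with "last":
# resolves to `name_first` under the assumed columns
col(starts_with("name") - ends_with("last"))

# Every column that doesn't start with "name": resolves to `age`
col(~starts_with("name"))
```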
Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `name`, `paid_2021`, `paid_2022`, and `person_id` and we'd like to validate that the values in columns that start with `"paid"` are greater than `10`. We can use the `starts_with()` column selector function to specify the columns that start with `"paid"` as the columns to validate. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [16.32, 16.25, 15.75], "paid_2022": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.starts_with("paid"), value=10) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `paid_2021` and one for `paid_2022`. The values in both columns were all greater than `10`. We can also use the `starts_with()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that start with `"paid"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "hours_2022": [160, 180, 160], "hours_2023": [182, 168, 175], "hours_2024": [200, 165, 190], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.starts_with("paid") & pb.matches("23|24")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `paid_2023` and one for `paid_2024`. ends_with(text: 'str', case_sensitive: 'bool' = False) -> 'EndsWith' Select columns that end with specified text. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `ends_with()` selector function can be used to select one or more columns that end with some specified text. So if the set of table columns consists of `[first_name, last_name, age, address]` and you want to validate columns that end with `"name"`, you can use `columns=ends_with("name")`. This will select the `first_name` and `last_name` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `ends_with()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- text The text that the column name should end with. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- EndsWith An `EndsWith` object, which can be used to select columns that end with the specified text. 
Relevant Validation Methods where `ends_with()` can be Used ----------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `ends_with()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `ends_with()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that end with `"e"` and start with `"a"`, you can use the `ends_with()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(ends_with("e") & starts_with("a")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `name`, `2021_pay`, `2022_pay`, and `person_id` and we'd like to validate that the values in columns that end with `"pay"` are greater than `10`. We can use the `ends_with()` column selector function to specify the columns that end with `"pay"` as the columns to validate.
```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2021_pay": [16.32, 16.25, 15.75], "2022_pay": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.ends_with("pay"), value=10) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2021_pay` and one for `2022_pay`. The values in both columns were all greater than `10`. We can also use the `ends_with()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that end with `"pay"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_hours": [160, 180, 160], "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2022_pay": [18.62, 16.95, 18.25], "2023_pay": [19.29, 17.75, 18.35], "2024_pay": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.ends_with("pay") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2023_pay` and one for `2024_pay`. contains(text: 'str', case_sensitive: 'bool' = False) -> 'Contains' Select columns that contain specified text. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `contains()` selector function can be used to select one or more columns that contain some specified text. So if the set of table columns consists of `[profit, conv_first, conv_last, highest_conv, age]` and you want to validate columns that have `"conv"` in the name, you can use `columns=contains("conv")`. This will select the `conv_first`, `conv_last`, and `highest_conv` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `contains()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- text The text that the column name should contain. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- Contains A `Contains` object, which can be used to select columns that contain the specified text. 
Relevant Validation Methods where `contains()` can be Used ---------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `contains()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `contains()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that have the text `"_n"` and start with `"item"`, you can use the `contains()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(contains("_n") & starts_with("item")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `name`, `2021_pay_total`, `2022_pay_total`, and `person_id` and we'd like to validate that the values in columns having `"pay"` in the name are greater than `10`. We can use the `contains()` column selector function to specify the column names that contain `"pay"` as the columns to validate.
```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2021_pay_total": [16.32, 16.25, 15.75], "2022_pay_total": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.contains("pay"), value=10) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2021_pay_total` and one for `2022_pay_total`. The values in both columns were all greater than `10`. We can also use the `contains()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that contain `"pay"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_hours": [160, 180, 160], "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2022_pay_total": [18.62, 16.95, 18.25], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.contains("pay") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2023_pay_total` and one for `2024_pay_total`. matches(pattern: 'str', case_sensitive: 'bool' = False) -> 'Matches' Select columns that match a specified regular expression pattern. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `matches()` selector function can be used to select one or more columns matching a provided regular expression pattern. So if the set of table columns consists of `[rev_01, rev_02, profit_01, profit_02, age]` and you want to validate columns that have two digits at the end of the name, you can use `columns=matches(r"[0-9]{2}$")`. This will select the `rev_01`, `rev_02`, `profit_01`, and `profit_02` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `matches()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- pattern The regular expression pattern that the column name should match. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- Matches A `Matches` object, which can be used to select columns that match the specified pattern. 
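Since column-name matching is case-insensitive by default, the `case_sensitive=True` option can be supplied when exact casing matters. A minimal sketch (assuming a hypothetical table with columns `Rev_01` and `rev_02`):

```python
# Case-insensitive (the default): resolves to both `Rev_01` and `rev_02`
matches(r"^rev")

# Case-sensitive: resolves only to `rev_02` under the assumed columns
matches(r"^rev", case_sensitive=True)
```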
Relevant Validation Methods where `matches()` can be Used --------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `matches()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `matches()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns whose names start with five digits and end with `"_id"`, you can use the `matches()` and [`ends_with()`](`pointblank.ends_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(matches(r"^[0-9]{5}") & ends_with("_id")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `name`, `id_old`, `new_identifier`, and `pay_2021` and we'd like to validate that text values in columns having `"id"` or `"identifier"` in the name have a specific syntax. We can use the `matches()` column selector function to specify the columns that match the pattern.
```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "id_old": ["ID0021", "ID0032", "ID0043"], "new_identifier": ["ID9054", "ID9065", "ID9076"], "pay_2021": [16.32, 16.25, 15.75], } ) validation = ( pb.Validate(data=tbl) .col_vals_regex(columns=pb.matches("id|identifier"), pattern=r"ID[0-9]{4}") .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `id_old` and one for `new_identifier`. The values in both columns all match the pattern `"ID[0-9]{4}"`. We can also use the `matches()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that contain `"pay"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_hours": [160, 180, 160], "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2022_pay_total": [18.62, 16.95, 18.25], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.contains("pay") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2023_pay_total` and one for `2024_pay_total`. everything() -> 'Everything' Select all columns. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `everything()` selector function can be used to select every column in the table. If you have a table with six columns and they're all suitable for a specific type of validation, you can use `columns=everything()` and all six columns will be selected for validation. Returns ------- Everything An `Everything` object, which can be used to select all columns. Relevant Validation Methods where `everything()` can be Used ------------------------------------------------------------ This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `everything()` selector function doesn't need to be used in isolation.
Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `everything()` function can be composed with other column selectors to create fine-grained column selections. For example, to select all columns except those starting with `"id_"`, you can use the `everything()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(everything() - starts_with("id_")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with several numeric columns and we'd like to validate that all of these columns have values less than `1000`. We can use the `everything()` column selector function to select all columns for validation. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_lt(columns=pb.everything(), value=1000) .interrogate() ) validation ``` From the results of the validation table we get four validation steps, one for each column in the table. The values in every column were all lower than `1000`. We can also use the `everything()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select every column except those that begin with `"2023"`, we can use the `-` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_lt(columns=pb.col(pb.everything() - pb.starts_with("2023")), value=1000) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2024_hours` and one for `2024_pay_total`. first_n(n: 'int', offset: 'int' = 0) -> 'FirstN' Select the first `n` columns in the column list. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.).
The `first_n()` selector function can be used to select *n* columns positioned at the start of the column list. So if the set of table columns consists of `[rev_01, rev_02, profit_01, profit_02, age]` and you want to validate the first two columns, you can use `columns=first_n(2)`. This will select the `rev_01` and `rev_02` columns and a validation step will be created for each. The `offset=` parameter can be used to skip a certain number of columns from the start of the column list. So if you want to select the third and fourth columns, you can use `columns=first_n(2, offset=2)`. Parameters ---------- n The number of columns to select from the start of the column list. Should be a positive integer value. If `n` is greater than the number of columns in the table, all columns will be selected. offset The offset from the start of the column list. The default is `0`. If `offset` is greater than the number of columns in the table, no columns will be selected. Returns ------- FirstN A `FirstN` object, which can be used to select the first `n` columns. Relevant Validation Methods where `first_n()` can be Used --------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `first_n()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `first_n()` function can be composed with other column selectors to create fine-grained column selections. For example, to select all column names starting with `"rev"` along with the first two columns, you can use the `first_n()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(first_n(2) | starts_with("rev")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second.
The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `paid_2021`, `paid_2022`, `paid_2023`, `paid_2024`, and `name` and we'd like to validate that the values in the first four columns are greater than `10`. We can use the `first_n()` column selector function to specify that the first four columns in the table are the columns to validate. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], "name": ["Alice", "Bob", "Charlie"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.first_n(4), value=10) .interrogate() ) validation ``` From the results of the validation table we get four validation steps. The values in all those columns were all greater than `10`. We can also use the `first_n()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select the first four columns but also omit those columns that end with `"2023"`, we can use the `-` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], "name": ["Alice", "Bob", "Charlie"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.col(pb.first_n(4) - pb.ends_with("2023")), value=10) .interrogate() ) validation ``` From the results of the validation table we get three validation steps, one each for `paid_2021`, `paid_2022`, and `paid_2024`. last_n(n: 'int', offset: 'int' = 0) -> 'LastN' Select the last `n` columns in the column list. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `last_n()` selector function can be used to select *n* columns positioned at the end of the column list. So if the set of table columns consists of `[age, rev_01, rev_02, profit_01, profit_02]` and you want to validate the last two columns, you can use `columns=last_n(2)`. This will select the `profit_01` and `profit_02` columns and a validation step will be created for each. The `offset=` parameter can be used to skip a certain number of columns from the end of the column list. So if you want to select the third and fourth columns from the end, you can use `columns=last_n(2, offset=2)`. Parameters ---------- n The number of columns to select from the end of the column list. Should be a positive integer value. If `n` is greater than the number of columns in the table, all columns will be selected. offset The offset from the end of the column list. The default is `0`. If `offset` is greater than the number of columns in the table, no columns will be selected.
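To make the `offset=` behavior concrete, here is a minimal sketch using the `[age, rev_01, rev_02, profit_01, profit_02]` column set described above:

```python
# Selects the last two columns: `profit_01` and `profit_02`
last_n(2)

# Skips the last two columns, then selects the next two from
# the end: `rev_01` and `rev_02`
last_n(2, offset=2)
```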
Returns ------- LastN A `LastN` object, which can be used to select the last `n` columns. Relevant Validation Methods where `last_n()` can be Used -------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `last_n()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors ---------------------------------------------------------------------- The `last_n()` function can be composed with other column selectors to create fine-grained column selections. For example, to select all column names starting with `"rev"` along with the last two columns, you can use the `last_n()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(last_n(2) | starts_with("rev")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Suppose we have a table with columns `name`, `paid_2021`, `paid_2022`, `paid_2023`, and `paid_2024` and we'd like to validate that the values in the last four columns are greater than `10`. We can use the `last_n()` column selector function to specify that the last four columns in the table are the columns to validate.
```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.last_n(4), value=10) .interrogate() ) validation ``` From the results of the validation table we get four validation steps. The values in all those columns were all greater than `10`. We can also use the `last_n()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select the last four columns but also omit those columns that end with `"2023"`, we can use the `-` operator to combine column selectors. ```{python} tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.col(pb.last_n(4) - pb.ends_with("2023")), value=10) .interrogate() ) validation ``` From the results of the validation table we get three validation steps, one each for `paid_2021`, `paid_2022`, and `paid_2024`. expr_col(column_name: 'str') -> 'ColumnExpression' Create a column expression for use in `conjointly()` validation. This function returns a `ColumnExpression` object that supports operations like `>`, `<`, `+`, etc. for use in [`conjointly()`](`pointblank.Validate.conjointly`) validation expressions. Parameters ---------- column_name The name of the column to reference. Returns ------- ColumnExpression A column expression that can be used in comparisons and operations. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Let's say we have a table with three columns: `a`, `b`, and `c`. We want to validate that: - The values in column `a` are greater than `2`. - The values in column `b` are less than `7`. - The sum of columns `a` and `b` is less than the values in column `c`. We can use the `expr_col()` function to create a column expression for each of these conditions. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2], "c": [10, 4, 8, 9, 10, 5], } ) # Using expr_col() to create backend-agnostic validation expressions validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pb.expr_col("a") > 2, lambda df: pb.expr_col("b") < 7, lambda df: pb.expr_col("a") + pb.expr_col("b") < pb.expr_col("c") ) .interrogate() ) validation ``` The above code creates a validation object that checks the specified conditions using the `expr_col()` function. The resulting validation table will show whether each condition was satisfied for each row in the table. See Also -------- The [`conjointly()`](`pointblank.Validate.conjointly`) validation method, which is where this function should be used. ## Segment Groups Combine multiple values into a single segment using `seg_*()` helper functions. seg_group(values: 'list[Any]') -> 'Segment' Group together values for segmentation.
Many validation methods have a `segments=` argument that can be used to specify one or more columns, or certain values within a column, to create segments for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). When passing in a column, or a tuple with a column and certain values, a segment will be created for each individual value within the column or given values. The `seg_group()` selector enables values to be grouped together into a segment. For example, if you were to create a segment for a column `"region"`, investigating just the "North" and "South" regions, a typical segment would look like: `segments=("region", ["North", "South"])` This would create two validation steps, one for each of the regions. If you wanted to group these two regions into a single segment, you could use the `seg_group()` function like this: `segments=("region", pb.seg_group(["North", "South"]))` You could create a second segment for the "East" and "West" regions like this: `segments=("region", pb.seg_group([["North", "South"], ["East", "West"]]))` There will be a validation step created for every segment. Note that if there aren't any segments created using `seg_group()` (or any other segment expression), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- values A list of values to be grouped into a segment. This can be a single list or a list of lists. Returns ------- Segment A `Segment` object, which can be used to combine values into a segment. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Let's say we're analyzing sales from our local bookstore, and want to check that the number of books sold for the month exceeds a certain threshold. We could pass in the argument `segments="genre"`, which would return a segment for each unique genre in the dataset. We could also pass in `segments=("genre", ["Fantasy", "Science Fiction"])`, to only create segments for those two genres. However, if we wanted to group these two genres into a single segment, we could use the `seg_group()` function. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "title": [ "The Hobbit", "Harry Potter and the Sorcerer's Stone", "The Lord of the Rings", "A Game of Thrones", "The Name of the Wind", "The Girl with the Dragon Tattoo", "The Da Vinci Code", "The Hitchhiker's Guide to the Galaxy", "The Martian", "Brave New World" ], "genre": [ "Fantasy", "Fantasy", "Fantasy", "Fantasy", "Fantasy", "Mystery", "Mystery", "Science Fiction", "Science Fiction", "Science Fiction", ], "units_sold": [875, 932, 756, 623, 445, 389, 678, 534, 712, 598], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns="units_sold", value=500, segments=("genre", pb.seg_group(["Fantasy", "Science Fiction"])) ) .interrogate() ) validation ``` What's more, we can create multiple segments, combining the genres in different ways.
```{python} validation = ( pb.Validate(data=tbl) .col_vals_gt( columns="units_sold", value=500, segments=("genre", pb.seg_group([ ["Fantasy", "Science Fiction"], ["Fantasy", "Mystery"], ["Mystery", "Science Fiction"] ])) ) .interrogate() ) validation ``` ## Interrogation and Reporting The validation plan is executed when `interrogate()` is called. After interrogation, view validation reports, extract metrics, or split data based on results. interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, extract_limit: 'int' = 500) -> 'Validate' Execute each validation step against the table and store the results. When a validation plan has been set with a series of validation steps, the interrogation process can then be invoked through `interrogate()`. Interrogation will evaluate each validation step against the table and store the results. The interrogation process will collect extracts of failing rows if the `collect_extracts=` option is set to `True` (the default). We can control the number of rows collected using the `get_first_n=`, `sample_n=`, and `sample_frac=` options. The `extract_limit=` option will enforce a hard limit on the number of rows collected when `collect_extracts=True`. After interrogation is complete, the `Validate` object will have gathered information, and we can use methods like [`n_passed()`](`pointblank.Validate.n_passed`), [`f_failed()`](`pointblank.Validate.f_failed`), etc., to understand how the table performed against the validation plan. A visual representation of the validation results can be viewed by printing the `Validate` object; this will display the validation table in an HTML viewing environment. Parameters ---------- collect_extracts An option to collect rows of the input table that didn't pass a particular validation step. The default is `True` and further options (i.e., `get_first_n=`, `sample_*=`) allow for fine control of how these rows are collected. collect_tbl_checked The processed data frames produced by executing the validation steps are collected and stored in the `Validate` object if `collect_tbl_checked=True`. This information is necessary for some methods (e.g., [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it can potentially make the object grow to a large size. To opt out of attaching this data, set this to `False`. get_first_n If the option to collect non-passing rows is chosen, there is the option here to collect the first `n` rows. Supply an integer number of rows to extract from the top of the subset table containing non-passing rows (the ordering of data from the original table is retained). sample_n If the option to collect non-passing rows is chosen, this option allows for the sampling of `n` rows. Supply an integer number of rows to sample from the subset table. If `n` happens to be greater than the number of non-passing rows, then all such rows will be returned. sample_frac If the option to collect non-passing rows is chosen, this option allows for the sampling of a fraction of those rows. Provide a number in the range of `0` to `1`. The number of rows to return could be very large; however, the `extract_limit=` option will apply a hard limit to the returned rows. extract_limit A value that limits the possible number of rows returned when extracting non-passing rows. The default is `500` rows. This limit is applied after any sampling or limiting options are applied.
If the number of rows to be returned is greater than this limit, then the number of rows returned will be limited to this value. This is useful for preventing the collection of too many rows when the number of non-passing rows is very large. Returns ------- Validate The `Validate` object with the results of the interrogation. Examples -------- Let's use a built-in dataset (`"game_revenue"`) to demonstrate some of the options of the interrogation process. A series of validation steps will populate our validation plan. After setting up the plan, the next step is to interrogate the table and see how well it aligns with our expectations. We'll use the `get_first_n=` option so that any extracts of failing rows are limited to the first `n` rows. ```{python} import pointblank as pb import polars as pl validation = ( pb.Validate(data=pb.load_dataset(dataset="game_revenue")) .col_vals_lt(columns="item_revenue", value=200) .col_vals_gt(columns="item_revenue", value=0) .col_vals_gt(columns="session_duration", value=5) .col_vals_in_set(columns="item_type", set=["iap", "ad"]) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}") ) validation.interrogate(get_first_n=10) ``` The validation table shows that step 3 (checking for `session_duration` greater than `5`) has 18 failing test units. This means that 18 rows in the table are problematic. We'd like to see the rows that failed this validation step and we can do that with the [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method. ```{python} pb.preview(validation.get_data_extracts(i=3, frame=True)) ``` The [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method will return a Polars DataFrame here with the first 10 rows that failed the validation step (we passed that into the [`preview()`](`pointblank.preview`) function for a better display). There are actually 18 rows that failed but we limited the collection of extracts with `get_first_n=10`. set_tbl(self, tbl: 'Any', tbl_name: 'str | None' = None, label: 'str | None' = None) -> 'Validate' Set or replace the table associated with the Validate object. This method allows you to replace the table associated with a Validate object with a different (but presumably similar) table. This is useful when you want to apply the same validation plan to multiple tables or when you have a validation workflow defined but want to swap in a different data source. Parameters ---------- tbl The table to replace the existing table with. This can be any supported table type including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths, GitHub URLs, or database connection strings. The same table type constraints apply as in the `Validate` constructor. tbl_name An optional name to assign to the new input table object. If no value is provided, the existing table name will be retained. label An optional label for the validation plan. If no value is provided, the existing label will be retained. Returns ------- Validate A new `Validate` object with the replacement table. When to Use ----------- The `set_tbl()` method is particularly useful in scenarios where you have: - multiple similar tables that need the same validation checks - a template validation workflow that should be applied to different data sources - YAML-defined validations where you want to override the table specified in the YAML The `set_tbl()` method creates a copy of the validation object with the new table, so the original validation object remains unchanged. 
This allows you to reuse validation plans across multiple tables without interference. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` We will first create two similar tables for our future validation plans. ```{python} import pointblank as pb import polars as pl # Create two similar tables table_1 = pl.DataFrame({ "x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1], "z": ["a", "b", "c", "d", "e"] }) table_2 = pl.DataFrame({ "x": [2, 4, 6, 8, 10], "y": [10, 8, 6, 4, 2], "z": ["f", "g", "h", "i", "j"] }) ``` Create a validation plan with the first table. ```{python} validation_table_1 = ( pb.Validate( data=table_1, tbl_name="Table 1", label="Validation applied to the first table" ) .col_vals_gt(columns="x", value=0) .col_vals_lt(columns="y", value=10) ) ``` Now apply the same validation plan to the second table. ```{python} validation_table_2 = ( validation_table_1 .set_tbl( tbl=table_2, tbl_name="Table 2", label="Validation applied to the second table" ) ) ``` Here is the interrogation of the first table: ```{python} validation_table_1.interrogate() ``` And the second table: ```{python} validation_table_2.interrogate() ``` get_tabular_report(self, title: 'str | None' = ':default:', incl_header: 'bool | None' = None, incl_footer: 'bool | None' = None, incl_footer_timings: 'bool | None' = None, incl_footer_notes: 'bool | None' = None) -> 'GT' Validation report as a GT table. The `get_tabular_report()` method returns a GT table object that represents the validation report. This validation table provides a summary of the validation results, including the validation steps, the number of test units, the number of failing test units, and the fraction of failing test units. The table also includes status indicators for the 'warning', 'error', and 'critical' levels. You could simply display the validation table without the use of the `get_tabular_report()` method. However, the method provides a way to customize the title of the report. In the future this method may provide additional options for customizing the report. Parameters ---------- title Options for customizing the title of the report. The default is the `":default:"` value which produces a generic title. Another option is `":tbl_name:"`, and that presents the name of the table as the title for the report. If no title is wanted, then `":none:"` can be used. Aside from keyword options, text can be provided for the title. This will be interpreted as Markdown text and transformed internally to HTML. incl_header Controls whether the header section should be displayed. If `None`, uses the global configuration setting. The header contains the table name, label, and threshold information. incl_footer Controls whether the footer section should be displayed. If `None`, uses the global configuration setting. The footer can contain validation timing information and notes. incl_footer_timings Controls whether validation timing information (start time, duration, end time) should be displayed in the footer. If `None`, uses the global configuration setting. Only applies when `incl_footer=True`. incl_footer_notes Controls whether notes from validation steps should be displayed in the footer. If `None`, uses the global configuration setting. Only applies when `incl_footer=True`. Returns ------- GT A GT table object that represents the validation report. 
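Customizing the title is demonstrated in the Examples below; as a minimal sketch of the header and footer toggles described above (assuming an interrogated `Validate` object named `validation`):

```python
# Keep the footer but suppress the timing information, retaining any notes
validation.get_tabular_report(
    incl_footer=True,
    incl_footer_timings=False,
    incl_footer_notes=True
)
```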
Examples -------- Let's create a `Validate` object with a few validation steps and then interrogate the data table to see how it performs against the validation plan. We can then generate a tabular report to get a summary of the results. ```{python} import pointblank as pb import polars as pl # Create a Polars DataFrame tbl_pl = pl.DataFrame({"x": [1, 2, 3, 4], "y": [4, 5, 6, 7]}) # Validate data using Polars DataFrame validation = ( pb.Validate(data=tbl_pl, tbl_name="tbl_xy", thresholds=(2, 3, 4)) .col_vals_gt(columns="x", value=1) .col_vals_lt(columns="x", value=3) .col_vals_le(columns="y", value=7) .interrogate() ) # Look at the validation table validation ``` The validation table is displayed with a default title ('Validation Report'). We can use the `get_tabular_report()` method to customize the title of the report. For example, we can set the title to the name of the table by using the `title=":tbl_name:"` option. This will use the string provided in the `tbl_name=` argument of the `Validate` object. ```{python} validation.get_tabular_report(title=":tbl_name:") ``` The title of the report is now set to the name of the table, which is 'tbl_xy'. This can be useful if you have multiple tables and want to keep track of which table the validation report is for. Alternatively, you can provide your own title for the report. ```{python} validation.get_tabular_report(title="Report for Table XY") ``` The title of the report is now set to 'Report for Table XY'. This can be useful if you want to provide a more descriptive title for the report. get_step_report(self, i: 'int', columns_subset: 'str | list[str] | Column | None' = None, header: 'str' = ':default:', limit: 'int | None' = 10) -> 'GT' Get a detailed report for a single validation step. The `get_step_report()` method returns a report of what went well---or what failed spectacularly---for a given validation step. The report includes a summary of the validation step and a detailed breakdown of the interrogation results. The report is presented as a GT table object, which can be displayed in a notebook or exported to an HTML file. :::{.callout-warning} The `get_step_report()` method is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- i The step number for which to get the report. columns_subset The columns to display in a step report that shows errors in the input table. By default all columns are shown (`None`). If a subset of columns is desired, we can provide a list of column names, a string with a single column name, a `Column` object, or a `ColumnSelector` object. The last two options allow for more flexible column selection using column selector functions. Errors are raised if the column names provided don't match any columns in the table (when provided as a string or list of strings) or if column selector expressions don't resolve to any columns. header Options for customizing the header of the step report. The default is the `":default:"` value which produces a header with a standard title and set of details underneath. Aside from this default, free text can be provided for the header. This will be interpreted as Markdown text and transformed internally to HTML. You can provide one of two templating elements: `{title}` and `{details}`. The default header has the template `"{title}{details}"` so you can easily start from that and modify as you see fit. 
If you don't want a header at all, you can set `header=None` to remove it entirely. limit The number of rows to display for those validation steps that check values in rows (the `col_vals_*()` validation steps). The default is `10` rows and the limit can be removed entirely by setting `limit=None`. Returns ------- GT A GT table object that represents the detailed report for the validation step. Types of Step Reports --------------------- The `get_step_report()` method produces a report based on the *type* of validation step. The following column-value or row-based validation methods will produce a report that shows the rows of the data that failed: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`) - [`conjointly()`](`pointblank.Validate.conjointly`) - [`prompt()`](`pointblank.Validate.prompt`) - [`rows_complete()`](`pointblank.Validate.rows_complete`) The [`rows_distinct()`](`pointblank.Validate.rows_distinct`) validation step will produce a report that shows duplicate rows (or duplicate values in one or a set of columns as defined in that method's `columns_subset=` parameter). The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation step will produce a report that shows the schema of the data table and the schema of the validation step. The report will indicate whether the schemas match or not. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Let's create a validation plan with a few validation steps and interrogate the data. With that, we'll have a look at the validation reporting table for the entire collection of steps and what went well or what failed. ```{python} import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="pandas"), tbl_name="small_table", label="Example for the get_step_report() method", thresholds=(1, 0.20, 0.40) ) .col_vals_lt(columns="d", value=3500) .col_vals_between(columns="c", left=1, right=8) .col_vals_gt(columns="a", value=3) .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}") .interrogate() ) validation ``` There were four validation steps performed, where the first three steps had failing test units and the last step had no failures. Let's get a detailed report for the first step by using the `get_step_report()` method. ```{python} validation.get_step_report(i=1) ``` The report for the first step is displayed.
The report includes a summary of the validation step and a detailed breakdown of the interrogation results. The report provides details on what the validation step was checking, the extent to which the test units failed, and a table that shows the failing rows of the data with the column of interest highlighted. The second and third steps also had failing test units. Reports for those steps can be viewed by using `get_step_report(i=2)` and `get_step_report(i=3)`, respectively. The final step did not have any failing test units. A report for the final step can still be viewed by using `get_step_report(i=4)`. The report will indicate that every test unit passed and a preview of the target table will be provided. ```{python} validation.get_step_report(i=4) ``` If you'd like to trim down the number of columns shown in the report, you can provide a subset of columns to display. For example, if you only want to see the columns `a`, `b`, and `c`, you can provide those column names as a list. ```{python} validation.get_step_report(i=1, columns_subset=["a", "b", "c"]) ``` If you'd like to increase or reduce the maximum number of rows shown in the report, you can provide a different value for the `limit` parameter. For example, if you'd like to see only up to 5 rows, you can set `limit=5`. ```{python} validation.get_step_report(i=3, limit=5) ``` Step 3 actually had 7 failing test units, but only the first 5 rows are shown in the step report because of the `limit=5` parameter. get_json_report(self, use_fields: 'list[str] | None' = None, exclude_fields: 'list[str] | None' = None) -> 'str' Get a report of the validation results as a JSON-formatted string. The `get_json_report()` method provides a machine-readable report of validation results in JSON format. This is particularly useful for programmatic processing, storing validation results, or integrating with other systems. The report includes detailed information about each validation step, such as assertion type, columns validated, threshold values, test results, and more. By default, all available validation information fields are included in the report. However, you can customize the fields to include or exclude using the `use_fields=` and `exclude_fields=` parameters. Parameters ---------- use_fields An optional list of specific fields to include in the report. If provided, only these fields will be included in the JSON output. If `None` (the default), all standard validation report fields are included. Have a look at the *Available Report Fields* section below for a list of fields that can be included in the report. exclude_fields An optional list of fields to exclude from the report. If provided, these fields will be omitted from the JSON output. If `None` (the default), no fields are excluded. This parameter cannot be used together with `use_fields=`. The *Available Report Fields* section provides a listing of fields that can be excluded from the report. Returns ------- str A JSON-formatted string representing the validation report, with each validation step as an object in the report array. Available Report Fields ----------------------- The JSON report can include any of the standard validation report fields, including: - `i`: the step number (1-indexed) - `i_o`: the original step index from the validation plan (pre-expansion) - `assertion_type`: the type of validation assertion (e.g., `"col_vals_gt"`, etc.)
- `column`: the column being validated (or columns used in certain validations) - `values`: the comparison values or parameters used in the validation - `inclusive`: whether the comparison is inclusive (for range-based validations) - `na_pass`: whether `NA`/`Null` values are considered passing (for certain validations) - `pre`: preprocessing function applied before validation - `segments`: data segments to which the validation was applied - `thresholds`: threshold level statement that was used for the validation step - `label`: custom label for the validation step - `brief`: a brief description of the validation step - `active`: whether the validation step is active - `all_passed`: whether all test units passed in the step - `n`: total number of test units - `n_passed`, `n_failed`: number of test units that passed and failed - `f_passed`, `f_failed`: fraction of test units that passed and failed - `warning`, `error`, `critical`: whether the namesake threshold level was exceeded (is `null` if threshold not set) - `time_processed`: when the validation step was processed (ISO 8601 format) - `proc_duration_s`: the processing duration in seconds Examples -------- Let's create a validation plan with a few validation steps and generate a JSON report of the results: ```{python} import pointblank as pb import polars as pl # Create a sample DataFrame tbl = pl.DataFrame({ "a": [5, 7, 8, 9], "b": [3, 4, 2, 1] }) # Create and execute a validation plan validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=6) .col_vals_lt(columns="b", value=4) .interrogate() ) # Get the full JSON report json_report = validation.get_json_report() print(json_report) ``` You can also customize which fields to include: ```{python} json_report = validation.get_json_report( use_fields=["i", "assertion_type", "column", "n_passed", "n_failed"] ) print(json_report) ``` Or which fields to exclude: ```{python} json_report = validation.get_json_report( exclude_fields=[ "i_o", "thresholds", "pre", "segments", "values", "na_pass", "inclusive", "label", "brief", "active", "time_processed", "proc_duration_s" ] ) print(json_report) ``` The JSON output can be further processed or analyzed programmatically: ```{python} import json # Parse the JSON report report_data = json.loads(validation.get_json_report()) # Extract and analyze validation results failing_steps = [step for step in report_data if step["n_failed"] > 0] print(f"Number of failing validation steps: {len(failing_steps)}") ``` See Also -------- - [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`): Get a formatted HTML report as a GT table - [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`): Get rows that failed validation get_sundered_data(self, type: 'str' = 'pass') -> 'Any' Get the data that passed or failed the validation steps. Validation of the data is one thing but, sometimes, you want to use the best part of the input dataset for something else. The `get_sundered_data()` method works with a `Validate` object that has been interrogated (i.e., the [`interrogate()`](`pointblank.Validate.interrogate`) method was used). We can get either the 'pass' data piece (rows with no failing test units across all column-value based validation functions), or the 'fail' data piece (rows with at least one failing test unit across the same series of validations). Details ------- There are some caveats to sundering.
The validation steps considered for this splitting will only involve steps where: - the step is of a certain check type, where test units are cells checked down a column (e.g., the `col_vals_*()` methods) - `active=` is not set to `False` - `pre=` has not been given an expression for modifying the input table So long as these conditions are met, the data will be split into two constituent tables: one with the rows that passed all validation steps and another with the rows that failed at least one validation step. Parameters ---------- type The type of data to return. Options are `"pass"` or `"fail"`, where the former returns a table only containing rows where test units always passed validation steps, and the latter returns a table only containing rows that had test units failing in at least one validation step. Returns ------- Any A table containing the data that passed or failed the validation steps. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(preview_incl_header=False) ``` Let's create a `Validate` object with two validation steps and then interrogate the data. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 6, 9, 7, 3, 2], "b": [9, 8, 10, 5, 10, 6], "c": ["c", "d", "a", "b", "a", "b"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation ``` From the validation table, we can see that the first and second steps each had 4 passing test units. A failing test unit will mark the entire row as failing in the context of the `get_sundered_data()` method. We can use this method to get the rows of data that passed during interrogation. ```{python} pb.preview(validation.get_sundered_data()) ``` The returned DataFrame contains the rows that passed all validation steps (we passed this object to [`preview()`](`pointblank.preview`) to show it in an HTML view). From the six-row input DataFrame, the first two rows and the last two rows had test units that failed validation. Thus the middle two rows are the only ones that passed all validation steps and that's what we see in the returned DataFrame. get_data_extracts(self, i: 'int | list[int] | None' = None, frame: 'bool' = False) -> 'dict[int, Any] | Any' Get the rows that failed for each validation step. After the [`interrogate()`](`pointblank.Validate.interrogate`) method has been called, the `get_data_extracts()` method can be used to extract the rows that failed in each column-value or row-based validation step (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`rows_distinct()`](`pointblank.Validate.rows_distinct`), etc.). The method returns a dictionary of tables containing the rows that failed in every validation step. If `frame=True` and `i=` is a scalar, the value is conveniently returned as a table (forgoing the dictionary structure). Parameters ---------- i The validation step number(s) from which the failed rows are obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. frame If `True` and `i=` is a scalar, return the value as a DataFrame instead of a dictionary. Returns ------- dict[int, Any] | Any A dictionary of tables containing the rows that failed in every compatible validation step. Alternatively, it can be a DataFrame if `frame=True` and `i=` is a scalar.
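Because the default return value is a dictionary keyed by (1-based) step number, a common pattern is to loop over it and check how many failing rows were collected per step. A minimal sketch (it assumes an interrogated `validation` object whose extracts are Polars DataFrames): ```python
extracts = validation.get_data_extracts()

# Keys are step numbers; values are tables of failing rows
for step, extract_tbl in extracts.items():
    if extract_tbl is not None:  # defensive check for steps without extracts
        print(f"Step {step}: {extract_tbl.height} failing rows collected")
```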
Compatible Validation Methods for Yielding Extracted Rows --------------------------------------------------------- The following validation methods operate on column values and will have rows extracted when there are failing test units. - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`) - [`conjointly()`](`pointblank.Validate.conjointly`) - [`prompt()`](`pointblank.Validate.prompt`) An extracted row for these validation methods means that a test unit failed for that row in the validation step. These row-based validation methods will also have rows extracted should there be failing rows: - [`rows_distinct()`](`pointblank.Validate.rows_distinct`) - [`rows_complete()`](`pointblank.Validate.rows_complete`) The extracted rows are a subset of the original table and are useful for further analysis or for understanding the nature of the failing test units. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(preview_incl_header=False) ``` Let's perform a series of validation steps on a Polars DataFrame. We'll use the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) in the first step, [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) in the second step, and [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) in the third step. The [`interrogate()`](`pointblank.Validate.interrogate`) method executes the validation; then, we can extract the rows that failed for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 3, 6, 1], "b": [1, 2, 1, 5, 2, 6], "c": [3, 7, 2, 6, 3, 1], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=4) .col_vals_lt(columns="c", value=5) .col_vals_ge(columns="b", value=1) .interrogate() ) validation.get_data_extracts() ``` The `get_data_extracts()` method returns a dictionary of tables, where each table contains a subset of rows from the input table. These are the rows that failed for each validation step. In the first step, the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) method was used to check if the values in column `a` were greater than `4`. The extracted table shows the rows where this condition was not met; look at the `a` column: all values are `4` or less. In the second step, the [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) method was used to check if the values in column `c` were less than `5`. In the extracted two-row table, we see that the values in column `c` are greater than `5`.
The third step ([`col_vals_ge()`](`pointblank.Validate.col_vals_ge`)) checked if the values in column `b` were greater than or equal to `1`. There were no failing test units, so the extracted table is empty (i.e., has columns but no rows). The `i=` argument can be used to narrow down the extraction to one or more steps. For example, to extract the rows that failed in the first step only: ```{python} validation.get_data_extracts(i=1) ``` Note that the first validation step is indexed at `1` (not `0`). This 1-based indexing is in place here to match the step numbers reported in the validation table. What we get back is still a dictionary, but it only contains one table (the one for the first step). If you want to get the extracted table as a DataFrame, set `frame=True` and provide a scalar value for `i`. For example, to get the extracted table for the second step as a DataFrame: ```{python} pb.preview(validation.get_data_extracts(i=2, frame=True)) ``` The extracted table is now a DataFrame, which can serve as a more convenient format for further analysis or visualization. We further used the [`preview()`](`pointblank.preview`) function to show the DataFrame in an HTML view. all_passed(self) -> 'bool' Determine if every validation step passed perfectly, with no failing test units. The `all_passed()` method determines if every validation step passed perfectly, with no failing test units. This method is useful for quickly checking if the table passed all validation steps with flying colors. If there's even a single failing test unit in any validation step, this method will return `False`. This validation metric might be overly stringent for some validation plans where failing test units are generally expected (and the strategy is to monitor data quality over time). However, the value of `all_passed()` could be suitable for validation plans designed to ensure that every test unit passes perfectly (e.g., checks for column presence, null-checking tests, etc.). Returns ------- bool `True` if all validation steps had no failing test units, `False` otherwise. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the second step will have a failing test unit (the value `10` isn't less than `9`). After interrogation, the `all_passed()` method is used to determine if all validation steps passed perfectly. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 9, 5], "b": [5, 6, 10, 3], "c": ["a", "b", "a", "a"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0) .col_vals_lt(columns="b", value=9) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.all_passed() ``` The returned value is `False` since the second validation step had a failing test unit. If it weren't for that one failing test unit, the return value would have been `True`. assert_passing(self) -> 'None' Raise an `AssertionError` if all tests are not passing. The `assert_passing()` method will raise an `AssertionError` if a test does not pass. This method simply wraps [`all_passed()`](`pointblank.Validate.all_passed`) for convenient use in test suites. The step number and the assertion made are printed in the `AssertionError` message if a failure occurs, ensuring some details are preserved. If the validation has not yet been interrogated, this method will automatically call [`interrogate()`](`pointblank.Validate.interrogate`) with default parameters before checking for passing tests.
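Since the raised exception is a plain `AssertionError`, this method drops straight into a pytest-style test. A minimal sketch (the `build_validation()` helper is hypothetical and stands in for whatever constructs your validation plan): ```python
def test_table_passes_validation():
    validation = build_validation()  # hypothetical helper returning a Validate object

    # Interrogates automatically if needed; raises AssertionError with
    # step details when any test unit fails
    validation.assert_passing()
```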
Raises ------- AssertionError If any validation step has failing test units. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the second step will have a failing test unit (the value `10` isn't less than `9`). The `assert_passing()` method is used to assert that all validation steps passed perfectly, automatically performing the interrogation if needed. ```{python} #| error: True import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 9, 5], "b": [5, 6, 10, 3], "c": ["a", "b", "a", "a"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0) .col_vals_lt(columns="b", value=9) # this assertion is false .col_vals_in_set(columns="c", set=["a", "b"]) ) # No need to call interrogate() explicitly validation.assert_passing() ``` assert_below_threshold(self, level: 'str' = 'warning', i: 'int | None' = None, message: 'str | None' = None) -> 'None' Raise an `AssertionError` if validation steps exceed a specified threshold level. The `assert_below_threshold()` method checks whether validation steps' failure rates are below a given threshold level (`"warning"`, `"error"`, or `"critical"`). This is particularly useful in automated testing environments where you want to ensure your data quality meets minimum standards before proceeding. If any validation step exceeds the specified threshold level, an `AssertionError` will be raised with details about which steps failed. If the validation has not yet been interrogated, this method will automatically call [`interrogate()`](`pointblank.Validate.interrogate`) with default parameters. Parameters ---------- level The threshold level to check against, which could be any of `"warning"` (the default), `"error"`, or `"critical"`. An `AssertionError` will be raised if any validation step exceeds this level. i Specific validation step number(s) to check. Can be provided as a single integer or a list of integers. If `None` (the default), all steps are checked. message Custom error message to use if the assertion fails. If `None`, a default message will be generated that lists the specific steps that exceeded the threshold. Returns ------- None Raises ------ AssertionError If any specified validation step exceeds the given threshold level. ValueError If an invalid threshold level is provided. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Below are some examples of how to use the `assert_below_threshold()` method. First, we'll create a simple Polars DataFrame with two columns (`a` and `b`). ```{python} import polars as pl tbl = pl.DataFrame({ "a": [7, 4, 9, 7, 12], "b": [9, 8, 10, 5, 10] }) ``` Then a validation plan will be created with thresholds (`warning=0.1`, `error=0.2`, `critical=0.3`).
After interrogating, we display the validation report table: ```{python} import pointblank as pb validation = ( pb.Validate(data=tbl, thresholds=(0.1, 0.2, 0.3)) .col_vals_gt(columns="a", value=5) # 1 failing test unit .col_vals_lt(columns="b", value=10) # 2 failing test units .interrogate() ) validation ``` Using `assert_below_threshold(level="warning")` will raise an `AssertionError` if any step exceeds the 'warning' threshold: ```{python} try: validation.assert_below_threshold(level="warning") except AssertionError as e: print(f"Assertion failed: {e}") ``` Check a specific step against the 'critical' threshold using the `i=` parameter: ```{python} validation.assert_below_threshold(level="critical", i=1) # Won't raise an error ``` As the first step is below the 'critical' threshold (it exceeds the 'warning' and 'error' thresholds), no error is raised and nothing is printed. We can also provide a custom error message with the `message=` parameter. Let's try that here: ```{python} try: validation.assert_below_threshold( level="error", message="Data quality too low for processing!" ) except AssertionError as e: print(f"Custom error: {e}") ``` See Also -------- - [`warning()`](`pointblank.Validate.warning`): get the 'warning' status for each validation step - [`error()`](`pointblank.Validate.error`): get the 'error' status for each validation step - [`critical()`](`pointblank.Validate.critical`): get the 'critical' status for each validation step - [`assert_passing()`](`pointblank.Validate.assert_passing`): assert all validations pass completely above_threshold(self, level: 'str' = 'warning', i: 'int | None' = None) -> 'bool' Check if any validation steps exceed a specified threshold level. The `above_threshold()` method checks whether validation steps exceed a given threshold level. This provides a non-exception-based alternative to [`assert_below_threshold()`](`pointblank.Validate.assert_below_threshold`) for conditional workflow control based on validation results. This method is useful in scenarios where you want to check if any validation steps failed beyond a certain threshold without raising an exception, allowing for more flexible programmatic responses to validation issues. Parameters ---------- level The threshold level to check against. Valid options are: `"warning"` (the least severe threshold level), `"error"` (the middle severity threshold level), and `"critical"` (the most severe threshold level). The default is `"warning"`. i Specific validation step number(s) to check. If a single integer, checks only that step. If a list of integers, checks all specified steps. If `None` (the default), checks all validation steps. Step numbers are 1-based (first step is `1`, not `0`). Returns ------- bool `True` if any of the specified validation steps exceed the given threshold level, `False` otherwise. Raises ------ ValueError If an invalid threshold level is provided. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Below are some examples of how to use the `above_threshold()` method. First, we'll create a simple Polars DataFrame with a single column (`values`). ```{python} import polars as pl tbl = pl.DataFrame({ "values": [1, 2, 3, 4, 5, 0, -1] }) ``` Then a validation plan will be created with thresholds (`warning=0.1`, `error=0.2`, `critical=0.3`). 
After interrogating, we display the validation report table: ```{python} import pointblank as pb validation = ( pb.Validate(data=tbl, thresholds=(0.1, 0.2, 0.3)) .col_vals_gt(columns="values", value=0) .col_vals_lt(columns="values", value=10) .col_vals_between(columns="values", left=0, right=5) .interrogate() ) validation ``` Let's check if any steps exceed the 'warning' threshold with the `above_threshold()` method. A message will be printed if that's the case: ```{python} if validation.above_threshold(level="warning"): print("Some steps have exceeded the warning threshold") ``` Check if only steps 2 and 3 exceed the 'error' threshold through use of the `i=` argument: ```{python} if validation.above_threshold(level="error", i=[2, 3]): print("Steps 2 and/or 3 have exceeded the error threshold") ``` You can use this in a workflow to conditionally trigger processes. Here's a snippet of how you might use this in a function: ```python def process_data(validation_obj): # Only continue processing if validation passes critical thresholds if not validation_obj.above_threshold(level="critical"): # Continue with processing print("Data meets critical quality thresholds, proceeding...") return True else: # Log failure and stop processing print("Data fails critical quality checks, aborting...") return False ``` Note that this is just a suggestion for how to implement conditional workflow processes. You should adapt this pattern to your specific requirements, which might include different threshold levels, custom logging mechanisms, or integration with your organization's data pipelines and notification systems. See Also -------- - [`assert_below_threshold()`](`pointblank.Validate.assert_below_threshold`): a similar method that raises an exception if thresholds are exceeded - [`warning()`](`pointblank.Validate.warning`): get the 'warning' status for each validation step - [`error()`](`pointblank.Validate.error`): get the 'error' status for each validation step - [`critical()`](`pointblank.Validate.critical`): get the 'critical' status for each validation step n(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, int] | int' Provides a dictionary of the number of test units for each validation step. The `n()` method provides the number of test units for each validation step. This is the total number of test units that were evaluated in the validation step. It is always an integer value. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. The method provides a dictionary of the number of test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. The total number of test units for a validation step is the sum of the number of passing and failing test units (i.e., `n = n_passed + n_failed`). Parameters ---------- i The validation step number(s) from which the number of test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. 
Returns ------- dict[int, int] | int A dictionary of the number of test units for each validation step or a scalar value. Examples -------- Different types of validation steps can have different numbers of test units. In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the number of test units for each step will be a little bit different. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 9, 5], "b": [5, 6, 10, 3], "c": ["a", "b", "a", "a"], } ) # Define a preprocessing function def filter_by_a_gt_1(df): return df.filter(pl.col("a") > 1) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0) .col_exists(columns="b") .col_vals_lt(columns="b", value=9, pre=filter_by_a_gt_1) .interrogate() ) ``` The first validation step checks that all values in column `a` are greater than `0`. Let's use the `n()` method to determine the number of test units for this validation step. ```{python} validation.n(i=1, scalar=True) ``` The returned value of `4` is the number of test units for the first validation step. This value is the same as the number of rows in the table. The second validation step checks for the existence of column `b`. Using the `n()` method, we can get the number of test units for this second step. ```{python} validation.n(i=2, scalar=True) ``` There's a single test unit here because the validation step is checking for the presence of a single column. The third validation step checks that all values in column `b` are less than `9` after filtering the table to only include rows where the value in column `a` is greater than `1`. Because the table is filtered, the number of test units will be less than the total number of rows in the input table. Let's prove this by using the `n()` method. ```{python} validation.n(i=3, scalar=True) ``` The returned value of `3` is the number of test units for the third validation step. When using the `pre=` argument, the input table can be mutated before performing the validation. The `n()` method is a good way to determine whether the mutation performed as expected. In all of these examples, the `scalar=True` argument was used to return the value as a scalar integer value. If `scalar=False`, the method will return a dictionary with an entry for the validation step number (from the `i=` argument) and the number of test units. Furthermore, leaving out the `i=` argument altogether will return a dictionary filled with the number of test units for each validation step. Here's what that looks like: ```{python} validation.n() ``` n_passed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, int] | int' Provides a dictionary of the number of test units that passed for each validation step. The `n_passed()` method provides the number of test units that passed for each validation step. This is the number of test units that passed in the validation step. It is always some integer value between `0` and the total number of test units. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. The method provides a dictionary of the number of passing test units for each validation step.
If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`n_failed()`](`pointblank.Validate.n_failed`) method (i.e., `n - n_failed`). Parameters ---------- i The validation step number(s) from which the number of passing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, int] | int A dictionary of the number of passing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps and, as it turns out, all of them will have failing test units. After interrogation, the `n_passed()` method is used to determine the number of passing test units for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12], "b": [9, 8, 10, 5, 10], "c": ["a", "b", "c", "a", "b"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.n_passed() ``` The returned dictionary shows that no validation step had all of its test units passing (each value is less than `5`, the total number of test units for each step). If we wanted to check the number of passing test units for a single validation step, we can provide the step number. Also, we could forego the dictionary and get a scalar value by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.n_passed(i=1) ``` The returned value of `4` is the number of passing test units for the first validation step. n_failed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, int] | int' Provides a dictionary of the number of test units that failed for each validation step. The `n_failed()` method provides the number of test units that failed for each validation step. This is the number of test units that did not pass in the validation step. It is always some integer value between `0` and the total number of test units. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. The method provides a dictionary of the number of failing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`n_passed()`](`pointblank.Validate.n_passed`) method (i.e., `n - n_passed`). Parameters ---------- i The validation step number(s) from which the number of failing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary.
Returns ------- dict[int, int] | int A dictionary of the number of failing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps and, as it turns out, all of them will have failing test units. After interrogation, the `n_failed()` method is used to determine the number of failing test units for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12], "b": [9, 8, 10, 5, 10], "c": ["a", "b", "c", "a", "b"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.n_failed() ``` The returned dictionary shows that all validation steps had failing test units. If we wanted to check the number of failing test units for a single validation step, we can provide the step number. Also, we could forego the dictionary and get a scalar value by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.n_failed(i=1) ``` The returned value of `1` is the number of failing test units for the first validation step. f_passed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, float] | float' Provides a dictionary of the fraction of test units that passed for each validation step. A measure of the fraction of test units that passed is provided by the `f_passed` attribute. This is the fraction of test units that passed the validation step over the total number of test units. Given this is a fractional value, it will always be in the range of `0` to `1`. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. This method provides a dictionary of the fraction of passing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`f_failed()`](`pointblank.Validate.f_failed`) method (i.e., `1 - f_failed()`). Parameters ---------- i The validation step number(s) from which the fraction of passing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, float] | float A dictionary of the fraction of passing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, all having some failing test units. After interrogation, the `f_passed()` method is used to determine the fraction of passing test units for each validation step. 
```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12, 3, 10], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "c", "a", "b", "d", "c"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.f_passed() ``` The returned dictionary shows the fraction of passing test units for each validation step. The values are all less than `1` since there were failing test units in each step. If we wanted to check the fraction of passing test units for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.f_passed(i=1) ``` The returned value is the proportion of passing test units for the first validation step (5 passing test units out of 7 total test units). f_failed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, float] | float' Provides a dictionary of the fraction of test units that failed for each validation step. A measure of the fraction of test units that failed is provided by the `f_failed` attribute. This is the fraction of test units that failed the validation step over the total number of test units. Given this is a fractional value, it will always be in the range of `0` to `1`. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. This method provides a dictionary of the fraction of failing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`f_passed()`](`pointblank.Validate.f_passed`) method (i.e., `1 - f_passed()`). Parameters ---------- i The validation step number(s) from which the fraction of failing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, float] | float A dictionary of the fraction of failing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, all having some failing test units. After interrogation, the `f_failed()` method is used to determine the fraction of failing test units for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12, 3, 10], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "c", "a", "b", "d", "c"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.f_failed() ``` The returned dictionary shows the fraction of failing test units for each validation step. The values are all greater than `0` since there were failing test units in each step. 
If we wanted to check the fraction of failing test units for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.f_failed(i=1) ``` The returned value is the proportion of failing test units for the first validation step (2 failing test units out of 7 total test units). warning(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, bool] | bool' Get the 'warning' level status for each validation step. The 'warning' status for a validation step is `True` if the fraction of failing test units meets or exceeds the threshold for the 'warning' level. Otherwise, the status is `False`. The ascribed name of 'warning' is semantic and does not imply that a warning message is generated; it is simply a status indicator that could be used to trigger some action to be taken. Here's how it fits in with other status indicators: - 'warning': the status obtained by calling `warning()`, least severe - 'error': the status obtained by calling [`error()`](`pointblank.Validate.error`), middle severity - 'critical': the status obtained by calling [`critical()`](`pointblank.Validate.critical`), most severe This method provides a dictionary of the 'warning' status for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Parameters ---------- i The validation step number(s) from which the 'warning' status is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, bool] | bool A dictionary of the 'warning' status for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the first step will have some failing test units; the rest will be completely passing. We've set thresholds here for each of the steps by using `thresholds=(2, 4, 5)`, which means: - the 'warning' threshold is `2` failing test units - the 'error' threshold is `4` failing test units - the 'critical' threshold is `5` failing test units After interrogation, the `warning()` method is used to determine the 'warning' status for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12, 3, 10], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "a", "a", "b", "b", "a"] } ) validation = ( pb.Validate(data=tbl, thresholds=(2, 4, 5)) .col_vals_gt(columns="a", value=5) .col_vals_lt(columns="b", value=15) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.warning() ``` The returned dictionary provides the 'warning' status for each validation step. The first step has a `True` value since the number of failing test units meets the threshold for the 'warning' level. The second and third steps have `False` values since the number of failing test units was `0`, which is below the threshold for the 'warning' level. We can also visually inspect the 'warning' status across all steps by viewing the validation table: ```{python} validation ``` We can see that there's a filled gray circle in the first step (look to the far right side, in the `W` column) indicating that the 'warning' threshold was met.
The other steps have empty gray circles. This means that thresholds were 'set but not met' in those steps. If we wanted to check the 'warning' status for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.warning(i=1) ``` The returned value is `True`, indicating that the first validation step met the 'warning' threshold. error(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, bool] | bool' Get the 'error' level status for each validation step. The 'error' status for a validation step is `True` if the fraction of failing test units meets or exceeds the threshold for the 'error' level. Otherwise, the status is `False`. The ascribed name of 'error' is semantic and does not imply that the validation process is halted; it is simply a status indicator that could be used to trigger some action to be taken. Here's how it fits in with other status indicators: - 'warning': the status obtained by calling [`warning()`](`pointblank.Validate.warning`), least severe - 'error': the status obtained by calling `error()`, middle severity - 'critical': the status obtained by calling [`critical()`](`pointblank.Validate.critical`), most severe This method provides a dictionary of the 'error' status for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Parameters ---------- i The validation step number(s) from which the 'error' status is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, bool] | bool A dictionary of the 'error' status for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the first step will have some failing test units; the rest will be completely passing. We've set thresholds here for each of the steps by using `thresholds=(2, 4, 5)`, which means: - the 'warning' threshold is `2` failing test units - the 'error' threshold is `4` failing test units - the 'critical' threshold is `5` failing test units After interrogation, the `error()` method is used to determine the 'error' status for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [3, 4, 9, 7, 2, 3, 8], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "a", "a", "b", "b", "a"] } ) validation = ( pb.Validate(data=tbl, thresholds=(2, 4, 5)) .col_vals_gt(columns="a", value=5) .col_vals_lt(columns="b", value=15) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.error() ``` The returned dictionary provides the 'error' status for each validation step. The first step has a `True` value since the number of failing test units meets the threshold for the 'error' level. The second and third steps have `False` values since the number of failing test units was `0`, which is below the threshold for the 'error' level.
We can also visually inspect the 'error' status across all steps by viewing the validation table: ```{python} validation ``` We can see that there are filled gray and yellow circles in the first step (far right side, in the `W` and `E` columns) indicating that the 'warning' and 'error' thresholds were met. The other steps have empty gray and yellow circles. This means that thresholds were 'set but not met' in those steps. If we wanted to check the 'error' status for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.error(i=1) ``` The returned value is `True`, indicating that the 'error' threshold was met in the first validation step. critical(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, bool] | bool' Get the 'critical' level status for each validation step. The 'critical' status for a validation step is `True` if the fraction of failing test units meets or exceeds the threshold for the 'critical' level. Otherwise, the status is `False`. The ascribed name of 'critical' is semantic and is thus simply a status indicator that could be used to trigger some action to be taken. Here's how it fits in with other status indicators: - 'warning': the status obtained by calling [`warning()`](`pointblank.Validate.warning`), least severe - 'error': the status obtained by calling [`error()`](`pointblank.Validate.error`), middle severity - 'critical': the status obtained by calling `critical()`, most severe This method provides a dictionary of the 'critical' status for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Parameters ---------- i The validation step number(s) from which the 'critical' status is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, bool] | bool A dictionary of the 'critical' status for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the first step will have many failing test units; the rest will be completely passing. We've set thresholds here for each of the steps by using `thresholds=(2, 4, 5)`, which means: - the 'warning' threshold is `2` failing test units - the 'error' threshold is `4` failing test units - the 'critical' threshold is `5` failing test units After interrogation, the `critical()` method is used to determine the 'critical' status for each validation step. ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [2, 4, 4, 7, 2, 3, 8], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "a", "a", "b", "b", "a"] } ) validation = ( pb.Validate(data=tbl, thresholds=(2, 4, 5)) .col_vals_gt(columns="a", value=5) .col_vals_lt(columns="b", value=15) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.critical() ``` The returned dictionary provides the 'critical' status for each validation step. The first step has a `True` value since the number of failing test units meets the threshold for the 'critical' level.
The second and third steps have `False` values since the number of failing test units was `0`, which is below the threshold for the 'critical' level. We can also visually inspect the 'critical' status across all steps by viewing the validation table: ```{python} validation ``` We can see that there are filled gray, yellow, and red circles in the first step (far right side, in the `W`, `E`, and `C` columns) indicating that the 'warning', 'error', and 'critical' thresholds were met. The other steps have empty gray, yellow, and red circles. This means that thresholds were 'set but not met' in those steps. If we want to check the 'critical' status for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```{python} validation.critical(i=1) ``` The returned value is `True`, indicating that the first validation step met the 'critical' threshold. ## Inspection and Assistance Functions for getting to grips with a new data table. Use `DataScan` for a quick overview, `preview()` for first/last rows, `col_summary_tbl()` for column summaries, and `missing_vals_tbl()` for missing value analysis. DataScan(data: 'Any', tbl_name: 'str | None' = None) -> 'None' Get a summary of a dataset. The `DataScan` class provides a way to get a summary of a dataset. The summary includes the following information: - the name of the table (if provided) - the type of the table (e.g., `"polars"`, `"pandas"`, etc.) - the number of rows and columns in the table - column-level information, including: - the column name - the column type - measures of missingness and distinctness - measures of negative, zero, and positive values (for numerical columns) - a sample of the data (the first 5 values) - statistics (if the column contains numbers, strings, or datetimes) To obtain a dictionary representation of the summary, you can use the `to_dict()` method. To get a JSON representation of the summary, you can use the `to_json()` method. To save the JSON text to a file, the `save_to_json()` method could be used (see the short sketch below). :::{.callout-warning} The `DataScan()` class is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- data The data to scan and summarize. This could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a database connection string. tbl_name Optionally, the name of the table could be provided as `tbl_name=`. Measures of Missingness and Distinctness ---------------------------------------- For each column, the following measures are provided: - `n_missing_values`: the number of missing values in the column - `f_missing_values`: the fraction of missing values in the column - `n_unique_values`: the number of unique values in the column - `f_unique_values`: the fraction of unique values in the column The fractions are calculated as the ratio of the measure to the total number of rows in the dataset.
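Here's the short usage sketch promised above: a minimal, illustrative example of scanning a table and accessing its summary programmatically. The accessor methods are the ones described above, but the exact shape of the returned dictionary and the signature of `save_to_json()` are assumptions here.

```python
import pointblank as pb

small_table = pb.load_dataset("small_table")
scan = pb.DataScan(data=small_table, tbl_name="small_table")

summary = scan.to_dict()    # dictionary representation of the summary
json_text = scan.to_json()  # the same information as JSON text

# Write the JSON text to a file (the path argument is assumed here)
scan.save_to_json("small_table_scan.json")
```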
Counts and Fractions of Negative, Zero, and Positive Values ----------------------------------------------------------- For numerical columns, the following measures are provided: - `n_negative_values`: the number of negative values in the column - `f_negative_values`: the fraction of negative values in the column - `n_zero_values`: the number of zero values in the column - `f_zero_values`: the fraction of zero values in the column - `n_positive_values`: the number of positive values in the column - `f_positive_values`: the fraction of positive values in the column The fractions are calculated as the ratio of the measure to the total number of rows in the dataset. Statistics for Numerical and String Columns ------------------------------------------- For numerical and string columns, several statistical measures are provided. Please note that for string columns, the statistics are based on the lengths of the strings in the column. The following descriptive statistics are provided: - `mean`: the mean of the column - `std_dev`: the standard deviation of the column Additionally, the following quantiles are provided: - `min`: the minimum value in the column - `p05`: the 5th percentile of the column - `q_1`: the first quartile of the column - `med`: the median of the column - `q_3`: the third quartile of the column - `p95`: the 95th percentile of the column - `max`: the maximum value in the column - `iqr`: the interquartile range of the column Statistics for Date and Datetime Columns ---------------------------------------- For date/datetime columns, the following statistics are provided: - `min`: the minimum date/datetime in the column - `max`: the maximum date/datetime in the column Returns ------- DataScan A DataScan object. preview(data: 'Any', columns_subset: 'str | list[str] | Column | None' = None, n_head: 'int' = 5, n_tail: 'int' = 5, limit: 'int' = 50, show_row_numbers: 'bool' = True, max_col_width: 'int' = 250, min_tbl_width: 'int' = 500, incl_header: 'bool | None' = None) -> 'GT' Display a table preview that shows some rows from the top, some from the bottom. To get a quick look at the data in a table, we can use the `preview()` function to display a preview of the table. The function shows a subset of the rows from the start and end of the table, with the number of rows from the start and end determined by the `n_head=` and `n_tail=` parameters (set to `5` by default). This function works with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). The view is optimized for readability, with column names and data types displayed in a compact format. The column widths are sized to fit the column names, dtypes, and column content up to a configurable maximum width of `max_col_width=` pixels. The table can be scrolled horizontally to view even very large datasets. Since the output is a Great Tables (`GT`) object, it can be further customized using the `great_tables` API. Parameters ---------- data The table to preview, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. When providing a CSV or Parquet file path (as a string or `pathlib.Path` object), the file will be automatically loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports glob patterns, directories containing .parquet files, and Spark-style partitioned datasets.
Connection strings enable direct database access via Ibis with optional table specification using the `::table_name` suffix. Read the *Supported Input Table Types* section for details on the supported table types. columns_subset The columns to display in the table, by default `None` (all columns are shown). This can be a string, a list of strings, a `Column` object, or a `ColumnSelector` object. The latter two options allow for more flexible column selection using column selector functions. Errors are raised if the column names provided don't match any columns in the table (when provided as a string or list of strings) or if column selector expressions don't resolve to any columns. n_head The number of rows to show from the start of the table. Set to `5` by default. n_tail The number of rows to show from the end of the table. Set to `5` by default. limit The limit value for the sum of `n_head=` and `n_tail=` (the total number of rows shown). If the sum of `n_head=` and `n_tail=` exceeds the limit, an error is raised. The default value is `50`. show_row_numbers Should row numbers be shown? The numbers shown reflect the row numbers of the head and tail in the input `data=` table. By default, this is set to `True`. max_col_width The maximum width of the columns (in pixels) before the text is truncated. The default value is `250` (`"250px"`). min_tbl_width The minimum width of the table in pixels. If the sum of the column widths is less than this value, all columns are sized up to reach this minimum width value. The default value is `500` (`"500px"`). incl_header Should the table include a header with the table type and table dimensions? Set to `True` by default. Returns ------- GT A GT object that displays the preview of the table. Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `preview()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback. Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix.
Examples include: ``` "duckdb:///path/to/database.ddb::table_name" "sqlite:///path/to/database.db::table_name" "postgresql://user:password@localhost:5432/database::table_name" "mysql://user:password@localhost:3306/database::table_name" "bigquery://project/dataset::table_name" "snowflake://user:password@account/database/schema::table_name" ``` When using connection strings, the Ibis library with the appropriate backend driver is required. Examples -------- It's easy to preview a table using the `preview()` function. Here's an example using the `small_table` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function): ```{python} import pointblank as pb small_table_polars = pb.load_dataset("small_table") pb.preview(small_table_polars) ``` This table is a Polars DataFrame, but the `preview()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. Here's an example using a DuckDB table handled by Ibis: ```{python} small_table_duckdb = pb.load_dataset("small_table", tbl_type="duckdb") pb.preview(small_table_duckdb) ``` The blue dividing line marks the end of the first `n_head=` rows and the start of the last `n_tail=` rows. We can adjust the number of rows shown from the start and end of the table by setting the `n_head=` and `n_tail=` parameters. Let's enlarge each of these to `10`: ```{python} pb.preview(small_table_polars, n_head=10, n_tail=10) ``` In the above case, the entire dataset is shown since the sum of `n_head=` and `n_tail=` is greater than the number of rows in the table (which is 13). The `columns_subset=` parameter can be used to show only specific columns in the table. You can provide a list of column names to make the selection. Let's try that with the `"game_revenue"` dataset as a Pandas DataFrame: ```{python} game_revenue_pandas = pb.load_dataset("game_revenue", tbl_type="pandas") pb.preview(game_revenue_pandas, columns_subset=["player_id", "item_name", "item_revenue"]) ``` Alternatively, we can use column selector functions like [`starts_with()`](`pointblank.starts_with`) and [`matches()`](`pointblank.matches`) to select columns based on text or patterns: ```{python} pb.preview(game_revenue_pandas, n_head=2, n_tail=2, columns_subset=pb.starts_with("session")) ``` Multiple column selector functions can be combined within [`col()`](`pointblank.col`) using operators like `|` and `&`: ```{python} pb.preview( game_revenue_pandas, n_head=2, n_tail=2, columns_subset=pb.col(pb.starts_with("item") | pb.matches("player")) ) ``` ### Working with CSV Files The `preview()` function can directly accept CSV file paths, making it easy to preview data stored in CSV files without manual loading: ```{python} # Get a path to a CSV file from the package data csv_path = pb.get_data_path("global_sales", "csv") pb.preview(csv_path) ``` You can also use a Path object to specify the CSV file: ```{python} from pathlib import Path csv_file = Path(pb.get_data_path("game_revenue", "csv")) pb.preview(csv_file, n_head=3, n_tail=3) ``` ### Working with Parquet Files The `preview()` function can directly accept Parquet files and datasets in various formats: ```{python} # Single Parquet file from package data parquet_path = pb.get_data_path("nycflights", "parquet") pb.preview(parquet_path) ``` You can also use glob patterns and directories: ```python # Multiple Parquet files with glob patterns pb.preview("data/sales_*.parquet") # Directory containing Parquet files pb.preview("parquet_data/") # Partitioned Parquet dataset
pb.preview("sales_data/") # Auto-discovers partition columns ``` ### Working with Database Connection Strings The `preview()` function supports database connection strings for direct preview of database tables. Connection strings must specify a table using the `::table_name` suffix: ```{python} # Get path to a DuckDB database file from package data duckdb_path = pb.get_data_path("game_revenue", "duckdb") pb.preview(f"duckdb:///{duckdb_path}::game_revenue") ``` For comprehensive documentation on supported connection string formats, error handling, and installation requirements, see the [`connect_to_table()`](`pointblank.connect_to_table`) function. col_summary_tbl(data: 'Any', tbl_name: 'str | None' = None) -> 'GT' Generate a column-level summary table of a dataset. The `col_summary_tbl()` function generates a summary table of a dataset, focusing on providing column-level information about the dataset. The summary includes the following information: - the type of the table (e.g., `"polars"`, `"pandas"`, etc.) - the number of rows and columns in the table - column-level information, including: - the column name - the column type - measures of missingness and distinctness - descriptive stats and quantiles - statistics for datetime columns The summary table is returned as a GT object, which can be displayed in a notebook or saved to an HTML file. :::{.callout-warning} The `col_summary_tbl()` function is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- data The table to summarize, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types. tbl_name Optionally, the name of the table could be provided as `tbl_name=`. Returns ------- GT A GT object that displays the column-level summaries of the table. Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - GitHub URLs (direct links to CSV or Parquet files on GitHub) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `col_summary_tbl()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. Examples -------- It's easy to get a column-level summary of a table using the `col_summary_tbl()` function. 
Here's an example using the `small_table` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function): ```{python} import pointblank as pb small_table = pb.load_dataset(dataset="small_table", tbl_type="polars") pb.col_summary_tbl(data=small_table) ``` The table used above was a Polars DataFrame, but the `col_summary_tbl()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. Here's an example using a DuckDB table handled by Ibis: ```{python} nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb") pb.col_summary_tbl(data=nycflights, tbl_name="nycflights") ``` missing_vals_tbl(data: 'Any') -> 'GT' Display a table that shows the missing values in the input table. The `missing_vals_tbl()` function generates a table that shows the missing values in the input table. The table is displayed using the Great Tables API, which allows for further customization of the table's appearance if so desired. Parameters ---------- data The table for which to display the missing values. This could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types. Returns ------- GT A GT object that displays the table of missing values in the input table. Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `missing_vals_tbl()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. The Missing Values Table ------------------------ The missing values table shows the proportion of missing values in each column of the input table. The table is divided into sectors, with each sector representing a range of rows in the table. The proportion of missing values in each sector is calculated for each column. The table is displayed using the Great Tables API, which allows for further customization of the table's appearance. To ensure that the table can scale to tables with many columns, each row in the reporting table represents a column in the input table. There are 10 sectors shown in the table, where the first sector represents the first 10% of the rows, the second sector represents the next 10% of the rows, and so on. Any sectors that are light blue indicate that there are no missing values in that sector. If there are missing values, the proportion of missing values is shown by a gray color (light gray for low proportions, dark gray to black for very high proportions).
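To make the sectoring logic concrete, here's a rough sketch of the per-sector calculation, assuming a Polars DataFrame (the function itself supports many more backends, and its exact binning and rendering may differ):

```python
import pointblank as pb

nycflights = pb.load_dataset("nycflights", tbl_type="polars")

n_sectors = 10
sector_size = -(-nycflights.height // n_sectors)  # ceiling division

# One dict per sector, mapping each column to its missing-value fraction
proportions = []
for start in range(0, nycflights.height, sector_size):
    sector = nycflights.slice(start, sector_size)
    proportions.append(
        {col: sector[col].null_count() / sector.height for col in sector.columns}
    )
```

Each entry in `proportions` then corresponds to one sector (roughly 10% of the rows), mirroring the ten sectors shown in the report.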
Examples -------- The `missing_vals_tbl()` function is useful for quickly identifying columns with missing values in a table. Here's an example using the `nycflights` dataset (loaded as a Polars DataFrame using the [`load_dataset()`](`pointblank.load_dataset`) function): ```{python} import pointblank as pb nycflights = pb.load_dataset("nycflights", tbl_type="polars") pb.missing_vals_tbl(nycflights) ``` The table shows the proportion of missing values in each column of the `nycflights` dataset. The table is divided into sectors, with each sector representing a range of rows in the table (with around 34,000 rows per sector). The proportion of missing values in each sector is calculated for each column. The various shades of gray indicate the proportion of missing values in each sector. Many columns have no missing values at all, and those sectors are colored light blue. load_dataset(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', tbl_type: "Literal['polars', 'pandas', 'duckdb']" = 'polars') -> 'Any' Load a dataset hosted in the library as a specified table type. The Pointblank library includes several datasets that can be loaded using the `load_dataset()` function. The datasets can be loaded as a Polars DataFrame, a Pandas DataFrame, or as a DuckDB table (which uses the Ibis library backend). These datasets are used throughout the documentation's examples to demonstrate the functionality of the library. They're also useful for experimenting with the library and trying out different validation scenarios. Parameters ---------- dataset The name of the dataset to load. Current options are `"small_table"`, `"game_revenue"`, `"nycflights"`, and `"global_sales"`. tbl_type The type of table to generate from the dataset. The named options are `"polars"`, `"pandas"`, and `"duckdb"`. Returns ------- Any The dataset for the `Validate` object. This could be a Polars DataFrame, a Pandas DataFrame, or a DuckDB table as an Ibis table. Included Datasets ----------------- There are four included datasets that can be loaded using the `load_dataset()` function: - `"small_table"`: A small dataset with 13 rows and 8 columns. This dataset is useful for testing and demonstration purposes. - `"game_revenue"`: A dataset with 2000 rows and 11 columns. Provides revenue data for a game development company. For the particular game, there are records of player sessions, the items they purchased, ads viewed, and the revenue generated. - `"nycflights"`: A dataset with 336,776 rows and 18 columns. This dataset provides information about flights departing from New York City airports (JFK, LGA, or EWR) in 2013. - `"global_sales"`: A dataset with 50,000 rows and 20 columns. Provides information about global sales of products across different regions and countries. Supported DataFrame Types ------------------------- The `tbl_type=` parameter can be set to one of the following: - `"polars"`: A Polars DataFrame. - `"pandas"`: A Pandas DataFrame. - `"duckdb"`: An Ibis table for a DuckDB database. Examples -------- Load the `"small_table"` dataset as a Polars DataFrame by calling `load_dataset()` with `dataset="small_table"` and `tbl_type="polars"`: ```{python} import pointblank as pb small_table = pb.load_dataset(dataset="small_table", tbl_type="polars") pb.preview(small_table) ``` Note that the `"small_table"` dataset is a Polars DataFrame and using the [`preview()`](`pointblank.preview`) function will display the table in an HTML viewing environment.
The `"game_revenue"` dataset can be loaded as a Pandas DataFrame by specifying the dataset name and setting `tbl_type="pandas"`: ```{python} game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="pandas") pb.preview(game_revenue) ``` The `"game_revenue"` dataset is a more real-world dataset with a mix of data types, and it's significantly larger than the `small_table` dataset at 2000 rows and 11 columns. The `"nycflights"` dataset can be loaded as a DuckDB table by specifying the dataset name and setting `tbl_type="duckdb"`: ```{python} nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb") pb.preview(nycflights) ``` The `"nycflights"` dataset is a large dataset with 336,776 rows and 18 columns. This dataset is truly a real-world dataset and provides information about flights originating from New York City airports in 2013. Finally, the `"global_sales"` dataset can be loaded as a Polars table by specifying the dataset name. Since `tbl_type=` is set to `"polars"` by default, we don't need to specify it: ```{python} global_sales = pb.load_dataset(dataset="global_sales") pb.preview(global_sales) ``` The `"global_sales"` dataset is a large dataset with 50,000 rows and 20 columns. Each record describes the sales of a particular product to a customer located in one of three global regions: North America, Europe, or Asia. get_data_path(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', file_type: "Literal['csv', 'parquet', 'duckdb']" = 'csv') -> 'str' Get the file path to a dataset included with the Pointblank package. This function provides direct access to the file paths of datasets included with Pointblank. These paths can be used in examples and documentation to demonstrate file-based data loading without requiring the actual data files. The returned paths can be used with `Validate(data=path)` to demonstrate CSV and Parquet file loading capabilities. Parameters ---------- dataset The name of the dataset to get the path for. Current options are `"small_table"`, `"game_revenue"`, `"nycflights"`, and `"global_sales"`. file_type The file format to get the path for. Options are `"csv"`, `"parquet"`, or `"duckdb"`. Returns ------- str The file path to the requested dataset file. Included Datasets ----------------- The available datasets are the same as those in [`load_dataset()`](`pointblank.load_dataset`): - `"small_table"`: A small dataset with 13 rows and 8 columns. Ideal for testing and examples. - `"game_revenue"`: A dataset with 2000 rows and 11 columns. Revenue data for a game company. - `"nycflights"`: A dataset with 336,776 rows and 18 columns. Flight data from NYC airports. - `"global_sales"`: A dataset with 50,000 rows and 20 columns. Global sales data across regions. 
File Types ---------- Each dataset is available in multiple formats: - `"csv"`: Comma-separated values file (`.csv`) - `"parquet"`: Parquet file (`.parquet`) - `"duckdb"`: DuckDB database file (`.ddb`) Examples -------- Get the path to a CSV file and use it with `Validate`: ```{python} import pointblank as pb # Get path to the small_table CSV file csv_path = pb.get_data_path("small_table", "csv") print(csv_path) # Use the path directly with Validate validation = ( pb.Validate(data=csv_path) .col_exists(["a", "b", "c"]) .col_vals_gt(columns="d", value=0) .interrogate() ) validation ``` Get a Parquet file path for validation examples: ```{python} # Get path to the game_revenue Parquet file parquet_path = pb.get_data_path(dataset="game_revenue", file_type="parquet") # Validate the Parquet file directly validation = ( pb.Validate(data=parquet_path, label="Game Revenue Data Validation") .col_vals_not_null(columns=["player_id", "session_id"]) .col_vals_gt(columns="item_revenue", value=0) .interrogate() ) validation ``` This is particularly useful for documentation examples where you want to demonstrate file-based workflows without requiring users to have specific data files: ```{python} # Example showing CSV file validation sales_csv = pb.get_data_path(dataset="global_sales", file_type="csv") validation = ( pb.Validate(data=sales_csv, label="Sales Data Validation") .col_exists(["customer_id", "product_id", "amount"]) .col_vals_regex(columns="customer_id", pattern=r"CUST_[0-9]{6}") .interrogate() ) ``` See Also -------- [`load_dataset()`](`pointblank.load_dataset`) for loading datasets directly as table objects. connect_to_table(connection_string: 'str') -> 'Any' Connect to a database table using a connection string. This utility function tests whether a connection string leads to a valid table and returns the table object if successful. It provides helpful error messages when no table is specified or when backend dependencies are missing. Parameters ---------- connection_string A database connection string with a required table specification using the `::table_name` suffix. Supported formats are outlined in the *Supported Connection String Formats* section. Returns ------- Any An Ibis table object for the specified database table. Supported Connection String Formats ----------------------------------- The `connection_string` parameter must include a valid connection string with a table name specified using the `::` syntax. Here are some examples of how to format connection strings for various backends: ``` DuckDB: "duckdb:///path/to/database.ddb::table_name" SQLite: "sqlite:///path/to/database.db::table_name" PostgreSQL: "postgresql://user:password@localhost:5432/database::table_name" MySQL: "mysql://user:password@localhost:3306/database::table_name" BigQuery: "bigquery://project/dataset::table_name" Snowflake: "snowflake://user:password@account/database/schema::table_name" ``` If the connection string does not include a table name, the function will attempt to connect to the database and list available tables, providing guidance on how to specify a table.
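As a hypothetical illustration of that discovery behavior, here's a sketch where the `::table_name` suffix is deliberately omitted; based on the description above, the call is expected to fail with a message that lists the available tables (the exact exception type is an assumption here):

```python
import pointblank as pb

duckdb_path = pb.get_data_path("game_revenue", "duckdb")

try:
    # No `::table_name` suffix: expected to fail with guidance
    pb.connect_to_table(f"duckdb:///{duckdb_path}")
except Exception as e:
    print(e)  # guidance listing available tables and the `::` syntax
```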
Examples -------- Connect to a DuckDB table: ```{python} import pointblank as pb # Get path to a DuckDB database file from package data duckdb_path = pb.get_data_path("game_revenue", "duckdb") # Connect to the `game_revenue` table in the DuckDB database game_revenue = pb.connect_to_table(f"duckdb:///{duckdb_path}::game_revenue") # Use with the `preview()` function pb.preview(game_revenue) ``` Here are some backend-specific connection examples: ```python # PostgreSQL pg_table = pb.connect_to_table( "postgresql://user:password@localhost:5432/warehouse::customer_data" ) # SQLite sqlite_table = pb.connect_to_table("sqlite:///local_data.db::products") # BigQuery bq_table = pb.connect_to_table("bigquery://my-project/analytics::daily_metrics") ``` This function requires the Ibis library with appropriate backend drivers: ```bash # You can install a set of common backends: pip install 'ibis-framework[duckdb,postgres,mysql,sqlite]' # ...or specific backends as needed: pip install 'ibis-framework[duckdb]' # for DuckDB pip install 'ibis-framework[postgres]' # for PostgreSQL ``` See Also -------- print_database_tables : List all available tables in a database for discovery print_database_tables(connection_string: 'str') -> 'list[str]' List all tables in a database from a connection string. The `print_database_tables()` function connects to a database and returns a list of all available tables. This is particularly useful for discovering what tables exist in a database before connecting to a specific table with `connect_to_table()`. The function automatically filters out temporary Ibis tables (memtables) to show only user tables. It supports all database backends available through Ibis, including DuckDB, SQLite, PostgreSQL, MySQL, BigQuery, and Snowflake. Parameters ---------- connection_string A database connection string *without* the `::table_name` suffix. Example: `"duckdb:///path/to/database.ddb"`. Returns ------- list[str] List of table names, excluding temporary Ibis tables. See Also -------- connect_to_table : Connect to a database table with full connection string documentation ## Table Pre-checks Helper functions for use with the `active=` parameter of validation methods. These inspect the target table before a step runs and conditionally skip the step when preconditions are not met. has_columns(*columns: 'str | list[str]') -> 'Callable[[Any], bool]' Check whether one or more columns exist in a table. This function returns a callable that, when given a table, checks whether all specified columns are present. It is primarily designed for use with the `active=` parameter of validation methods. When a validation step has `active=has_columns("col_a", "col_b")`, the step will be skipped (made inactive) if either `col_a` or `col_b` is missing from the target table. The callable is evaluated against the original table *before* any `pre=` processing is applied. This means the column check is performed on the raw input data, not on a pre-processed version of it. A note is attached to any skipped step in the validation report explaining which columns were not found. Parameters ---------- *columns One or more column names to check for in the table. Each argument can be a string or a list of strings. All specified columns must be present for the callable to return `True`. Returns ------- Callable[[Any], bool] A callable that accepts a table and returns `True` if every column in `columns` exists in the table, `False` otherwise. Raises ------ ValueError If no column names are provided.
TypeError If any of the provided column names is not a string or list of strings. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Using `has_columns()` with the `active=` parameter to conditionally run a validation step: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0, active=pb.has_columns("a")) .col_vals_gt(columns="a", value=0, active=pb.has_columns("z")) .interrogate() ) validation ``` The first step ran because column `a` exists. The second step was skipped because column `z` is missing, and the report note explains which column was not found. When checking for multiple columns, the step is only active when *all* columns are present: ```{python} validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0, active=pb.has_columns("a", "b")) .col_vals_gt(columns="a", value=0, active=pb.has_columns("a", "x", "y")) .interrogate() ) validation ``` The first step is active because both `a` and `b` exist. The second step is skipped because `x` and `y` are missing. Column names can also be provided as a list: ```{python} validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0, active=pb.has_columns(["a", "b"])) .interrogate() ) validation ``` has_rows(count: 'int | None' = None, *, min: 'int | None' = None, max: 'int | None' = None) -> 'Callable[[Any], bool]' Check whether a table has a certain number of rows. The `has_rows()` function returns a callable that, when given a table, checks whether the row count satisfies a specified condition. It is designed for use with the `active=` parameter of validation methods so that a validation step can be conditionally skipped when the target table is too small, too large, or empty. The callable supports several modes: - **exact count**: `has_rows(count=N)` returns `True` only if the table has exactly `N` rows. - **minimum**: `has_rows(min=N)` returns `True` if the table has at least `N` rows. - **maximum**: `has_rows(max=N)` returns `True` if the table has at most `N` rows. - **range**: `has_rows(min=A, max=B)` returns `True` if the row count falls within `[A, B]`. - **non-empty**: `has_rows()` (no arguments) returns `True` if the table has at least one row. A note is attached to any skipped step in the validation report explaining the row count condition that was not met. The callable is evaluated against the original table *before* any `pre=` processing is applied. This means the row count check is performed on the raw input data, not on a pre-processed version of it. Parameters ---------- count The exact number of rows the table should have. Cannot be used together with `min=` or `max=`. min The minimum number of rows (inclusive) the table should have. Can be used alone or with `max=`. max The maximum number of rows (inclusive) the table should have. Can be used alone or with `min=`. Returns ------- Callable[[Any], bool] A callable that accepts a table and returns `True` if the row count satisfies the specified condition, `False` otherwise. When the callable returns `False`, it stores diagnostic information that is used to generate a descriptive note in the validation report.
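Since the return value is just a callable, the modes described above can also be probed directly against a table outside of a validation plan. Here's a minimal sketch:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame({"x": [1, 2, 3]})

pb.has_rows(count=3)(tbl)        # True: exactly 3 rows
pb.has_rows(min=5)(tbl)          # False: fewer than 5 rows
pb.has_rows(min=2, max=10)(tbl)  # True: row count within [2, 10]
pb.has_rows()(tbl)               # True: the table is non-empty
```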
How It Works ------------ When [`interrogate()`](`pointblank.Validate.interrogate`) is called, each validation step whose `active=` parameter is a callable will have that callable evaluated with the target table. If the callable returns `False`, the step is deactivated and an explanatory note is added to the validation report. The note is locale-aware: if the [`Validate`](`pointblank.Validate`) object was created with a non-English `locale=`, the note will be translated accordingly. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Skip a validation step if the table is empty: ```{python} import pointblank as pb import polars as pl tbl = pl.DataFrame({"x": [1, 2, 3]}) empty_tbl = pl.DataFrame({"x": []}) validation = ( pb.Validate(data=empty_tbl) .col_vals_gt(columns="x", value=0, active=pb.has_rows()) .interrogate() ) validation ``` The step was skipped because the table has no rows. Only run a step when the table has at least a minimum number of rows: ```{python} validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="x", value=0, active=pb.has_rows(min=100)) .interrogate() ) validation ``` The step was skipped because the table has only 3 rows, which is fewer than the required minimum of `100`. You can also check for an exact count or a range: ```{python} validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="x", value=0, active=pb.has_rows(count=3)) .col_vals_gt(columns="x", value=0, active=pb.has_rows(min=2, max=10)) .col_vals_gt(columns="x", value=0, active=pb.has_rows(count=100)) .interrogate() ) validation ``` The first two steps ran because the table has exactly 3 rows (matching `count=3`) and falls within the range `[2, 10]`. The third step was skipped because `3` does not equal `100`. ## YAML Functions for using YAML to orchestrate validation workflows. yaml_interrogate(yaml: 'Union[str, Path]', set_tbl: 'Any' = None, namespaces: 'Optional[Union[Iterable[str], Mapping[str, str]]]' = None) -> 'Validate' Execute a YAML-based validation workflow. This is the main entry point for YAML-based validation workflows. It takes YAML configuration (as a string or file path) and returns a validated `Validate` object with interrogation results. The YAML configuration defines the data source, validation steps, and optional settings like thresholds and labels. This function automatically loads the data, builds the validation plan, executes all validation steps, and returns the interrogated results. Parameters ---------- yaml YAML configuration as string or file path. Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file. set_tbl An optional table to override the table specified in the YAML configuration. This allows you to apply a YAML-defined validation workflow to a different table than what's specified in the configuration. If provided, this table will replace the table defined in the YAML's `tbl` field before executing the validation workflow. This can be any supported table type including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths, GitHub URLs, or database connection strings. namespaces Optional module namespaces to make available for Python code execution in YAML configurations. Can be a dictionary mapping aliases to module names or a list of module names. See the "Using Namespaces" section below for detailed examples. 
Returns ------- Validate An instance of the `Validate` class that has been configured based on the YAML input. This object contains the results of the validation steps defined in the YAML configuration. It includes metadata like table name, label, language, and thresholds if specified. Raises ------ YAMLValidationError If the YAML is invalid, malformed, or execution fails. This includes syntax errors, missing required fields, unknown validation methods, or data loading failures. Using Namespaces ---------------- The `namespaces=` parameter enables custom Python modules and functions in YAML configurations. This is particularly useful for custom action functions and advanced Python expressions. **Namespace formats:** - Dictionary format: `{"alias": "module.name"}` maps aliases to module names - List format: `["module.name", "another.module"]` imports modules directly **Option 1: Inline expressions (no namespaces needed)** ```{python} import pointblank as pb # Simple inline custom action yaml_config = ''' tbl: small_table thresholds: warning: 0.01 actions: warning: python: "lambda: print('Custom warning triggered')" steps: - col_vals_gt: columns: [a] value: 1000 ''' result = pb.yaml_interrogate(yaml_config) result ``` **Option 2: External functions with namespaces** ```{python} # Define a custom action function def my_custom_action(): print("Data validation failed: please check your data.") # Add to current module for demo import sys sys.modules[__name__].my_custom_action = my_custom_action # YAML that references the external function yaml_config = ''' tbl: small_table thresholds: warning: 0.01 actions: warning: python: actions.my_custom_action steps: - col_vals_gt: columns: [a] value: 1000 # This will fail ''' # Use namespaces to make the function available result = pb.yaml_interrogate(yaml_config, namespaces={'actions': '__main__'}) result ``` This approach enables modular, reusable validation workflows with custom business logic. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples here, we'll use YAML configurations to define validation workflows. Let's start with a basic YAML workflow that validates the built-in `small_table` dataset. ```{python} import pointblank as pb # Define a basic YAML validation workflow yaml_config = ''' tbl: small_table steps: - rows_distinct - col_exists: columns: [date, a, b] ''' # Execute the validation workflow result = pb.yaml_interrogate(yaml_config) result ``` The validation table shows the results of our YAML-defined workflow. We can see that the `rows_distinct()` validation failed (because there are duplicate rows in the table), while the column existence checks passed. 
Now let's create a more comprehensive validation workflow with thresholds and metadata: ```{python} # Advanced YAML configuration with thresholds and metadata yaml_config = ''' tbl: small_table tbl_name: small_table_demo label: Comprehensive data validation thresholds: warning: 0.1 error: 0.25 critical: 0.35 steps: - col_vals_gt: columns: [d] value: 100 - col_vals_regex: columns: [b] pattern: '[0-9]-[a-z]{3}-[0-9]{3}' - col_vals_not_null: columns: [date, a] ''' # Execute the validation workflow result = pb.yaml_interrogate(yaml_config) print(f"Table name: {result.tbl_name}") print(f"Label: {result.label}") print(f"Total validation steps: {len(result.validation_info)}") ``` The validation results now include our custom table name and label. The thresholds we defined will determine when validation steps are marked as warnings, errors, or critical failures. You can also load YAML configurations from files. Here's how you would work with a YAML file: ```{python} from pathlib import Path import tempfile # Create a temporary YAML file for demonstration yaml_content = ''' tbl: small_table tbl_name: File-based Validation steps: - col_vals_between: columns: [c] left: 1 right: 10 - col_vals_in_set: columns: [f] set: [low, mid, high] ''' with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f: f.write(yaml_content) yaml_file_path = Path(f.name) # Load and execute validation from file result = pb.yaml_interrogate(yaml_file_path) result ``` This approach is particularly useful for storing validation configurations as part of your data pipeline or version control system, allowing you to maintain validation rules alongside your code. ### Governance Metadata YAML workflows support governance metadata via `owner`, `consumers`, and `version` top-level keys. These are forwarded to the `Validate` constructor and embedded in the validation report: ```{python} yaml_config = ''' tbl: small_table tbl_name: sales_pipeline owner: Data Engineering consumers: [Analytics, Finance, Compliance] version: "2.1.0" steps: - col_vals_not_null: columns: [a, b] ''' result = pb.yaml_interrogate(yaml_config) print(f"Owner: {result.owner}") print(f"Consumers: {result.consumers}") print(f"Version: {result.version}") ``` ### Aggregate Validations YAML supports aggregate validation methods for checking column-level statistics. These methods validate that a column's sum, average, or standard deviation meets a threshold: ```{python} yaml_config = ''' tbl: small_table steps: - col_sum_gt: columns: [d] value: 0 - col_avg_le: columns: [a] value: 10 ''' result = pb.yaml_interrogate(yaml_config) result ``` The 15 available aggregate methods follow the pattern `col_{stat}_{comparator}` where `{stat}` is `sum`, `avg`, or `sd` and `{comparator}` is `gt`, `lt`, `ge`, `le`, or `eq`. ### Data Freshness Check that a date/datetime column has recent data using `data_freshness`: ```yaml tbl: events.csv steps: - data_freshness: columns: event_date freshness: "24h" ``` ### Active Parameter Shortcut The `active=` parameter controls whether a validation step runs. 
It supports boolean values and Python expression shortcuts: ```yaml steps: - col_vals_gt: columns: [d] value: 100 active: false # Skip this step - col_vals_not_null: columns: [a] active: true # Always run (default) ``` ### Null Percentage Check Use `col_pct_null` to validate that the percentage of null values in a column is within bounds: ```yaml steps: - col_pct_null: columns: [a, b] value: 0.05 ``` ### Using `set_tbl=` to Override the Table The `set_tbl=` parameter allows you to override the table specified in the YAML configuration. This is useful when you have a template validation workflow but want to apply it to different tables: ```{python} import polars as pl # Create a test table with similar structure to small_table test_table = pl.DataFrame({ "date": ["2023-01-01", "2023-01-02", "2023-01-03"], "a": [1, 2, 3], "b": ["1-abc-123", "2-def-456", "3-ghi-789"], "d": [150, 200, 250] }) # Use the same YAML config but apply it to our test table yaml_config = ''' tbl: small_table # This will be overridden tbl_name: Test Table # This name will be used steps: - col_exists: columns: [date, a, b, d] - col_vals_gt: columns: [d] value: 100 ''' # Execute with table override result = pb.yaml_interrogate(yaml_config, set_tbl=test_table) print(f"Validation applied to: {result.tbl_name}") result ``` This feature makes YAML configurations more reusable and flexible, allowing you to define validation logic once and apply it to multiple similar tables. validate_yaml(yaml: 'Union[str, Path]') -> 'None' Validate YAML configuration against the expected structure. This function validates that a YAML configuration conforms to the expected structure for validation workflows. It checks for required fields, proper data types, and valid validation method names. This is useful for validating configurations before execution or for building configuration editors and validators. The function performs comprehensive validation including: - required fields ('tbl' and 'steps') - proper data types for all fields - valid threshold configurations - known validation method names - proper step configuration structure Parameters ---------- yaml YAML configuration as string or file path. Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file. Raises ------ YAMLValidationError If the YAML is invalid, malformed, or execution fails. This includes syntax errors, missing required fields, unknown validation methods, or data loading failures. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` For the examples here, we'll demonstrate how to validate YAML configurations before using them with validation workflows. This is particularly useful for building robust data validation systems where you want to catch configuration errors early. Let's start with validating a basic configuration: ```{python} import pointblank as pb # Define a basic YAML validation configuration yaml_config = ''' tbl: small_table steps: - rows_distinct - col_exists: columns: [a, b] ''' # Validate the configuration: no exception means it's valid pb.validate_yaml(yaml_config) print("Basic YAML configuration is valid") ``` The function completed without raising an exception, which means our configuration is valid and follows the expected structure. 
Now let's validate a more complex configuration with thresholds and metadata: ```{python} # Complex YAML configuration with all optional fields yaml_config = ''' tbl: small_table tbl_name: My Dataset label: Quality check lang: en locale: en thresholds: warning: 0.1 error: 0.25 critical: 0.35 steps: - rows_distinct - col_vals_gt: columns: [d] value: 100 - col_vals_regex: columns: [b] pattern: '[0-9]-[a-z]{3}-[0-9]{3}' ''' # Validate the configuration pb.validate_yaml(yaml_config) print("Complex YAML configuration is valid") # Count the validation steps import pointblank.yaml as pby config = pby.load_yaml_config(yaml_config) print(f"Configuration has {len(config['steps'])} validation steps") ``` This configuration includes all the optional metadata fields and complex validation steps, demonstrating that the validation handles the full range of supported options. Let's see what happens when we try to validate an invalid configuration: ```{python} # Invalid YAML configuration: missing required 'tbl' field invalid_yaml = ''' steps: - rows_distinct ''' try: pb.validate_yaml(invalid_yaml) except pb.yaml.YAMLValidationError as e: print(f"Validation failed: {e}") ``` The validation correctly identifies that our configuration is missing the required `'tbl'` field. Here's a practical example of using validation in a workflow builder: ```{python} def safe_yaml_interrogate(yaml_config): """Safely execute a YAML configuration after validation.""" try: # Validate the YAML configuration first pb.validate_yaml(yaml_config) print("✓ YAML configuration is valid") # Then execute the workflow result = pb.yaml_interrogate(yaml_config) print(f"Validation completed with {len(result.validation_info)} steps") return result except pb.yaml.YAMLValidationError as e: print(f"Configuration error: {e}") return None # Test with a valid YAML configuration test_yaml = ''' tbl: small_table steps: - col_vals_between: columns: [c] left: 1 right: 10 ''' result = safe_yaml_interrogate(test_yaml) ``` This pattern of validating before executing helps build more reliable data validation pipelines by catching configuration errors early in the process. Note that this function only validates the structure and does not check if the specified data source ('tbl') exists or is accessible. Data source validation occurs during execution with `yaml_interrogate()`. Supported Top-level Keys ------------------------ The following top-level keys are recognized in the YAML configuration: - `tbl`: data source specification (required) - `steps`: list of validation steps (required) - `tbl_name`: human-readable table name - `label`: validation description - `df_library`: DataFrame library (`"polars"`, `"pandas"`, `"duckdb"`) - `lang`: language code - `locale`: locale setting - `brief`: global brief template - `thresholds`: global failure thresholds - `actions`: global failure actions - `final_actions`: actions triggered after all steps complete - `owner`: data owner (governance metadata) - `consumers`: data consumers (governance metadata) - `version`: validation version string (governance metadata) - `reference`: reference table for comparison-based validations Unknown top-level keys are rejected, which catches typos like `tbl_nmae` or `step`. 
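To illustrate that rejection, here's a minimal sketch (reusing the error-handling pattern from the earlier examples) in which a mistyped `tbl_nmae` key should cause `validate_yaml()` to raise a `YAMLValidationError`:

```python
import pointblank as pb

bad_yaml = '''
tbl: small_table
tbl_nmae: Mistyped Key
steps:
  - rows_distinct
'''

try:
    pb.validate_yaml(bad_yaml)
except pb.yaml.YAMLValidationError as e:
    print(f"Rejected: {e}")
```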
Supported Validation Methods ---------------------------- In addition to all standard validation methods (e.g., `col_vals_gt`, `rows_distinct`, `col_schema_match`), the following methods are also supported: - `col_pct_null`: check the percentage of null values in a column - `data_freshness`: check that data is recent - aggregate methods: `col_sum_gt`, `col_sum_lt`, `col_sum_ge`, `col_sum_le`, `col_sum_eq`, `col_avg_gt`, `col_avg_lt`, `col_avg_ge`, `col_avg_le`, `col_avg_eq`, `col_sd_gt`, `col_sd_lt`, `col_sd_ge`, `col_sd_le`, `col_sd_eq` See Also -------- yaml_interrogate : execute YAML-based validation workflows yaml_to_python(yaml: 'Union[str, Path]') -> 'str' Convert YAML validation configuration to equivalent Python code. This function takes a YAML validation configuration and generates the equivalent Python code that would produce the same validation workflow. This is useful for documentation, code generation, or learning how to translate YAML workflows into programmatic workflows. The generated Python code includes all necessary imports, data loading, validation steps, and interrogation execution, formatted as executable Python code. Parameters ---------- yaml YAML configuration as string or file path. Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file. Returns ------- str A formatted Python code string enclosed in markdown code blocks that replicates the YAML workflow. The code includes import statements, data loading, validation method calls, and interrogation execution. Raises ------ YAMLValidationError If the YAML is invalid, malformed, or contains unknown validation methods. Examples -------- ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False) ``` Convert a basic YAML configuration to Python code: ```{python} import pointblank as pb # Define a YAML validation workflow yaml_config = ''' tbl: small_table tbl_name: Data Quality Check steps: - col_vals_not_null: columns: [a, b] - col_vals_gt: columns: [c] value: 0 ''' # Generate equivalent Python code python_code = pb.yaml_to_python(yaml_config) print(python_code) ``` The generated Python code shows exactly how to replicate the YAML workflow programmatically. This is particularly useful when transitioning from YAML-based workflows to code-based workflows, or when generating documentation that shows both YAML and Python approaches. For more complex workflows with thresholds and metadata: ```{python} # Advanced YAML configuration yaml_config = ''' tbl: small_table tbl_name: Advanced Validation label: Production data check thresholds: warning: 0.1 error: 0.2 steps: - col_vals_between: columns: [c] left: 1 right: 10 - col_vals_regex: columns: [b] pattern: '[0-9]-[a-z]{3}-[0-9]{3}' ''' # Generate the equivalent Python code python_code = pb.yaml_to_python(yaml_config) print(python_code) ``` The generated code includes all configuration parameters and thresholds, and maintains the same validation logic as the original YAML workflow.
Governance metadata (`owner`, `consumers`, `version`) and `reference` are also rendered in the generated Python code: ```{python} yaml_config = ''' tbl: small_table tbl_name: Sales Pipeline owner: Data Engineering consumers: [Analytics, Finance] version: "2.1.0" steps: - col_vals_not_null: columns: [a] - col_sum_gt: columns: [d] value: 0 ''' python_code = pb.yaml_to_python(yaml_config) print(python_code) ``` This function is also useful for educational purposes, helping users understand how YAML configurations map to the underlying Python API calls. ## Utility Functions Functions for accessing metadata about the target data and managing configuration. get_column_count(data: 'Any') -> 'int' Get the number of columns in a table. The `get_column_count()` function returns the number of columns in a table. The function works with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports direct input of CSV files, Parquet files, and database connection strings. Parameters ---------- data The table for which to get the column count, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types. Returns ------- int The number of columns in the table. Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `get_column_count()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback. GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw content URLs for downloading. The URL format should be: `https://github.com/user/repo/blob/branch/path/file.csv` or `https://github.com/user/repo/blob/branch/path/file.parquet` Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. 
Examples include: ``` "duckdb:///path/to/database.ddb::table_name" "sqlite:///path/to/database.db::table_name" "postgresql://user:password@localhost:5432/database::table_name" "mysql://user:password@localhost:3306/database::table_name" "bigquery://project/dataset::table_name" "snowflake://user:password@account/database/schema::table_name" ``` When using connection strings, the Ibis library with the appropriate backend driver is required. Examples -------- To get the number of columns in a table, we can use the `get_column_count()` function. Here's an example using the `small_table` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function): ```{python} import pointblank as pb small_table_polars = pb.load_dataset("small_table") pb.get_column_count(small_table_polars) ``` This table is a Polars DataFrame, but the `get_column_count()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. Here's an example using a DuckDB table handled by Ibis: ```{python} small_table_duckdb = pb.load_dataset("small_table", tbl_type="duckdb") pb.get_column_count(small_table_duckdb) ``` #### Working with CSV Files The `get_column_count()` function can directly accept CSV file paths: ```{python} # Get a path to a CSV file from the package data csv_path = pb.get_data_path("global_sales", "csv") pb.get_column_count(csv_path) ``` #### Working with Parquet Files The function supports various Parquet input formats: ```{python} # Single Parquet file from package data parquet_path = pb.get_data_path("nycflights", "parquet") pb.get_column_count(parquet_path) ``` You can also use glob patterns and directories: ```python # Multiple Parquet files with glob patterns pb.get_column_count("data/sales_*.parquet") # Directory containing Parquet files pb.get_column_count("parquet_data/") # Partitioned Parquet dataset pb.get_column_count("sales_data/") # Auto-discovers partition columns ``` #### Working with Database Connection Strings The function supports database connection strings for direct access to database tables: ```{python} # Get path to a DuckDB database file from package data duckdb_path = pb.get_data_path("game_revenue", "duckdb") pb.get_column_count(f"duckdb:///{duckdb_path}::game_revenue") ``` The function always returns the number of columns in the table as an integer value, which is `8` for the `small_table` dataset. get_row_count(data: 'Any') -> 'int' Get the number of rows in a table. The `get_row_count()` function returns the number of rows in a table. The function works with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports direct input of CSV files, Parquet files, and database connection strings. Parameters ---------- data The table for which to get the row count, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types. Returns ------- int The number of rows in the table. 
Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - GitHub URLs (direct links to CSV or Parquet files on GitHub) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `get_row_count()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback. GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw content URLs for downloading. The URL format should be: `https://github.com/user/repo/blob/branch/path/file.csv` or `https://github.com/user/repo/blob/branch/path/file.parquet` Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. Examples include: ``` "duckdb:///path/to/database.ddb::table_name" "sqlite:///path/to/database.db::table_name" "postgresql://user:password@localhost:5432/database::table_name" "mysql://user:password@localhost:3306/database::table_name" "bigquery://project/dataset::table_name" "snowflake://user:password@account/database/schema::table_name" ``` When using connection strings, the Ibis library with the appropriate backend driver is required. Examples -------- Getting the number of rows in a table is easily done by using the `get_row_count()` function. Here's an example using the `game_revenue` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function): ```{python} import pointblank as pb game_revenue_polars = pb.load_dataset("game_revenue") pb.get_row_count(game_revenue_polars) ``` This table is a Polars DataFrame, but the `get_row_count()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. 
Here's an example using a DuckDB table handled by Ibis: ```{python} game_revenue_duckdb = pb.load_dataset("game_revenue", tbl_type="duckdb") pb.get_row_count(game_revenue_duckdb) ``` #### Working with CSV Files The `get_row_count()` function can directly accept CSV file paths: ```{python} # Get a path to a CSV file from the package data csv_path = pb.get_data_path("global_sales", "csv") pb.get_row_count(csv_path) ``` #### Working with Parquet Files The function supports various Parquet input formats: ```{python} # Single Parquet file from package data parquet_path = pb.get_data_path("nycflights", "parquet") pb.get_row_count(parquet_path) ``` You can also use glob patterns and directories: ```python # Multiple Parquet files with glob patterns pb.get_row_count("data/sales_*.parquet") # Directory containing Parquet files pb.get_row_count("parquet_data/") # Partitioned Parquet dataset pb.get_row_count("sales_data/") # Auto-discovers partition columns ``` #### Working with Database Connection Strings The function supports database connection strings for direct access to database tables: ```{python} # Get path to a DuckDB database file from package data duckdb_path = pb.get_data_path("game_revenue", "duckdb") pb.get_row_count(f"duckdb:///{duckdb_path}::game_revenue") ``` The function always returns the number of rows in the table as an integer value, which is `2000` for the `game_revenue` dataset. get_action_metadata() -> 'dict | None' Access step-level metadata when authoring custom actions. Get the metadata for the validation step where an action was triggered. This can be called by user functions to get the metadata for the current action. This function can only be used within callables crafted for the [`Actions`](`pointblank.Actions`) class. Returns ------- dict | None A dictionary containing the metadata for the current step. If called outside of an action (i.e., when no action is being executed), this function will return `None`. Description of the Metadata Fields ---------------------------------- The metadata dictionary contains the following fields for a given validation step: - `step`: The step number. - `column`: The column name. - `value`: The value being compared (only available in certain validation steps). - `type`: The assertion type (e.g., `"col_vals_gt"`, etc.). - `time`: The time the validation step was executed (in ISO format). - `level`: The severity level (`"warning"`, `"error"`, or `"critical"`). - `level_num`: The severity level as a numeric value (`30`, `40`, or `50`). - `autobrief`: A localized and brief statement of the expectation for the step. - `failure_text`: Localized text that explains how the validation step failed. Examples -------- When creating a custom action, you can access the metadata for the current step using the `get_action_metadata()` function. 
Here's an example of a custom action that logs the metadata for the current step:

```{python}
import pointblank as pb

def log_issue():
    metadata = pb.get_action_metadata()
    print(f"Type: {metadata['type']}, Step: {metadata['step']}")

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(warning=log_issue),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(
        columns="session_duration",
        value=15,
    )
    .interrogate()
)

validation
```

Key pieces to note in the above example:

- `log_issue()` (the custom action) collects `metadata` by calling `get_action_metadata()`
- the `metadata` is a dictionary that is used to craft the log message
- the action is passed as a bare function to the `Actions` object within the `Validate` object (placing it within `Validate(actions=)` ensures it's set as an action for every validation step)

See Also
--------
Have a look at [`Actions`](`pointblank.Actions`) for more information on how to create custom actions for validation steps that exceed a set threshold value.

get_validation_summary() -> 'dict | None'

Access validation summary information when authoring final actions.

This function provides a convenient way to access summary information about the validation process within a final action. It returns a dictionary with key metrics from the validation process. This function can only be used within callables crafted for the [`FinalActions`](`pointblank.FinalActions`) class.

Returns
-------
dict | None
    A dictionary containing validation metrics. If called outside of a final action context, this function will return `None`.

Description of the Summary Fields
---------------------------------
The summary dictionary contains the following fields:

- `n_steps` (`int`): The total number of validation steps.
- `n_passing_steps` (`int`): The number of validation steps where all test units passed.
- `n_failing_steps` (`int`): The number of validation steps that had some failing test units.
- `n_warning_steps` (`int`): The number of steps that exceeded a 'warning' threshold.
- `n_error_steps` (`int`): The number of steps that exceeded an 'error' threshold.
- `n_critical_steps` (`int`): The number of steps that exceeded a 'critical' threshold.
- `list_passing_steps` (`list[int]`): List of step numbers where all test units passed.
- `list_failing_steps` (`list[int]`): List of step numbers for steps having failing test units.
- `dict_n` (`dict`): The number of test units for each validation step.
- `dict_n_passed` (`dict`): The number of test units that passed for each validation step.
- `dict_n_failed` (`dict`): The number of test units that failed for each validation step.
- `dict_f_passed` (`dict`): The fraction of test units that passed for each validation step.
- `dict_f_failed` (`dict`): The fraction of test units that failed for each validation step.
- `dict_warning` (`dict`): The 'warning' level status for each validation step.
- `dict_error` (`dict`): The 'error' level status for each validation step.
- `dict_critical` (`dict`): The 'critical' level status for each validation step.
- `all_passed` (`bool`): Whether or not every validation step had no failing test units.
- `highest_severity` (`str`): The highest severity level encountered during validation. This can be one of the following: `"warning"`, `"error"`, `"critical"`, `"some failing"`, or `"all passed"`.
- `tbl_row_count` (`int`): The number of rows in the target table. - `tbl_column_count` (`int`): The number of columns in the target table. - `tbl_name` (`str`): The name of the target table. - `validation_duration` (`float`): The duration of the validation in seconds. Note that the summary dictionary is only available within the context of a final action. If called outside of a final action (i.e., when no final action is being executed), this function will return `None`. Examples -------- Final actions are executed after the completion of all validation steps. They provide an opportunity to take appropriate actions based on the overall validation results. Here's an example of a final action function (`send_report()`) that sends an alert when critical validation failures are detected: ```python import pointblank as pb def send_report(): summary = pb.get_validation_summary() if summary["highest_severity"] == "critical": # Send an alert email send_alert_email( subject=f"CRITICAL validation failures in {summary['tbl_name']}", body=f"{summary['n_critical_steps']} steps failed with critical severity." ) validation = ( pb.Validate( data=my_data, final_actions=pb.FinalActions(send_report) ) .col_vals_gt(columns="revenue", value=0) .interrogate() ) ``` Note that `send_alert_email()` in the example above is a placeholder function that would be implemented by the user to send email alerts. This function is not provided by the Pointblank package. The `get_validation_summary()` function can also be used to create custom reporting for validation results: ```python def log_validation_results(): summary = pb.get_validation_summary() print(f"Validation completed with status: {summary['highest_severity'].upper()}") print(f"Steps: {summary['n_steps']} total") print(f" - {summary['n_passing_steps']} passing, {summary['n_failing_steps']} failing") print( f" - Severity: {summary['n_warning_steps']} warnings, " f"{summary['n_error_steps']} errors, " f"{summary['n_critical_steps']} critical" ) if summary['highest_severity'] in ["error", "critical"]: print("⚠️ Action required: Please review failing validation steps!") ``` Final actions work well with both simple logging and more complex notification systems, allowing you to integrate validation results into your broader data quality workflows. See Also -------- Have a look at [`FinalActions`](`pointblank.FinalActions`) for more information on how to create custom actions that are executed after all validation steps have been completed. write_file(validation: 'Validate', filename: 'str', path: 'str | None' = None, keep_tbl: 'bool' = False, keep_extracts: 'bool' = False, quiet: 'bool' = False) -> 'None' Write a Validate object to disk as a serialized file. Writing a validation object to disk with `write_file()` can be useful for keeping data validation results close at hand for later retrieval (with `read_file()`). By default, any data table that the validation object holds will be removed before writing to disk (not applicable if no data table is present). This behavior can be changed by setting `keep_tbl=True`, but this only works when the table is not of a database type (e.g., DuckDB, PostgreSQL, etc.), as database connections cannot be serialized. Extract data from failing validation steps can also be preserved by setting `keep_extracts=True`, which is useful for later analysis of data quality issues. 
The serialized file uses Python's pickle format for storage of the validation object state, including all validation results, metadata, and optionally the source data.

**Important note.** If your validation uses custom preprocessing functions (via the `pre=` parameter), these functions must be defined at the module level (not interactively or as lambda functions) to ensure they can be properly restored when loading the validation in a different Python session. Read the *Creating Serializable Validations* section below for more information.

:::{.callout-warning}
The `write_file()` function is currently experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
:::

Parameters
----------
validation
    The `Validate` object to write to disk.
filename
    The filename to create on disk for the validation object. Should not include the file extension as `.pkl` will be added automatically.
path
    An optional directory path where the file should be saved. If not provided, the file will be saved in the current working directory. The directory will be created if it doesn't exist.
keep_tbl
    An option to keep the data table that is associated with the validation object. The default is `False` where the data table is removed before writing to disk. For database tables (e.g., Ibis tables with database backends), the table is always removed even if `keep_tbl=True`, as database connections cannot be serialized.
keep_extracts
    An option to keep any collected extract data for failing rows from validation steps. By default, this is `False` (i.e., extract data is removed to save space).
quiet
    Whether to suppress the message that is normally printed when the file is written. By default, this is `False`, so a message will be printed when the file is successfully written.

Returns
-------
None
    This function doesn't return anything but saves the validation object to disk.

Creating Serializable Validations
---------------------------------
To ensure your validations work reliably across different Python sessions, the recommended approach is to use module-level functions. Create a separate Python file for your preprocessing functions:

```python
# preprocessing_functions.py
import polars as pl

def multiply_by_100(df):
    return df.with_columns(pl.col("value") * 100)

def add_computed_column(df):
    return df.with_columns(computed=pl.col("value") * 2 + 10)
```

Then import and use them in your validation:

```python
# your_main_script.py
import pointblank as pb
from preprocessing_functions import multiply_by_100, add_computed_column

validation = (
    pb.Validate(data=my_data)
    .col_vals_gt(columns="value", value=500, pre=multiply_by_100)
    .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column)
    .interrogate()
)

# Save validation and it will work reliably across sessions
pb.write_file(validation, "my_validation", keep_tbl=True)
```

### Problematic Patterns to Avoid

Don't use lambda functions, as they will cause immediate errors:

```python
validation = pb.Validate(data).col_vals_gt(
    columns="value", value=100,
    pre=lambda df: df.with_columns(pl.col("value") * 2)
)
```

Don't use interactive function definitions, as they may fail when loading:
```python
def my_function(df):  # Defined in notebook/REPL
    return df.with_columns(pl.col("value") * 2)

validation = pb.Validate(data).col_vals_gt(
    columns="value", value=100, pre=my_function
)
```

### Automatic Analysis and Guidance

When you call `write_file()`, it automatically analyzes your validation and provides:

- confirmation when all functions will work reliably
- warnings for functions that may cause cross-session issues
- clear errors for unsupported patterns (lambda functions)
- specific recommendations and code examples
- loading instructions tailored to your validation

### Loading Your Validation

To load a saved validation in a new Python session:

```python
# In a new Python session
import pointblank as pb

# Import the same preprocessing functions used when creating the validation
from preprocessing_functions import multiply_by_100, add_computed_column

# Upon loading the validation, functions will be automatically restored
validation = pb.read_file("my_validation.pkl")
```

### Testing Your Validation

To verify your validation works across sessions:

1. save your validation in one Python session
2. start a fresh Python session (restart kernel/interpreter)
3. import required preprocessing functions
4. load the validation using `read_file()`
5. test that preprocessing functions work as expected

### Performance and Storage

- use `keep_tbl=False` (default) to reduce file size when you don't need the original data
- use `keep_extracts=False` (default) to save space by excluding extract data
- set `quiet=True` to suppress guidance messages in automated scripts
- files are saved using pickle's highest protocol for optimal performance

Examples
--------
Let's create a simple validation and save it to disk:

```{python}
import pointblank as pb

# Create a validation
validation = (
    pb.Validate(data=pb.load_dataset("small_table"), label="My validation")
    .col_vals_gt(columns="d", value=100)
    .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
    .interrogate()
)

# Save to disk (without the original table data)
pb.write_file(validation, "my_validation")
```

To keep the original table data for later analysis:

```{python}
# Save with the original table data included
pb.write_file(validation, "my_validation_with_data", keep_tbl=True)
```

You can also specify a custom directory and keep extract data:

```python
pb.write_file(
    validation,
    filename="detailed_validation",
    path="/path/to/validations",
    keep_tbl=True,
    keep_extracts=True
)
```

### Working with Preprocessing Functions

For validations that use preprocessing functions to be portable across sessions, define your functions in a separate `.py` file:

```python
# In `preprocessing_functions.py`
import polars as pl

def multiply_by_100(df):
    return df.with_columns(pl.col("value") * 100)

def add_computed_column(df):
    return df.with_columns(computed=pl.col("value") * 2 + 10)
```

Then import and use them in your validation:

```python
# In your main script
import pointblank as pb
from preprocessing_functions import multiply_by_100, add_computed_column

validation = (
    pb.Validate(data=my_data)
    .col_vals_gt(columns="value", value=500, pre=multiply_by_100)
    .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column)
    .interrogate()
)

# This validation can now be saved and loaded reliably
pb.write_file(validation, "my_validation", keep_tbl=True)
```

When you load this validation in a new session, simply import the preprocessing functions again and they will be automatically restored.
See Also
--------
Use the [`read_file()`](`pointblank.read_file`) function to load a validation object that was previously saved with `write_file()`.

read_file(filepath: 'str | Path') -> 'Validate'

Read a Validate object from disk that was previously saved with `write_file()`.

This function loads a validation object that was previously serialized to disk using the `write_file()` function. The validation object will be restored with all its validation results, metadata, and optionally the source data (if it was saved with `keep_tbl=True`).

:::{.callout-warning}
The `read_file()` function is currently experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
:::

Parameters
----------
filepath
    The path to the saved validation file. Can be a string or Path object.

Returns
-------
Validate
    The restored validation object with all its original state, validation results, and metadata.

Examples
--------
Load a validation object that was previously saved:

```python
import pointblank as pb

# Load a validation object from disk
validation = pb.read_file("my_validation.pkl")

# View the validation results
validation
```

You can also load using just the filename (without extension):

```python
# This will automatically look for "my_validation.pkl"
validation = pb.read_file("my_validation")
```

The loaded validation object retains all its functionality:

```python
# Get validation summary
summary = validation.get_json_report()

# Get sundered data (if original table was saved)
if validation.data is not None:
    failing_rows = validation.get_sundered_data(type="fail")
```

See Also
--------
Use the [`write_file()`](`pointblank.write_file`) function to save a validation object to disk for later retrieval with this function.

ref(column_name: 'str') -> 'ReferenceColumn'

Reference a column from the reference data for aggregate comparisons.

This function is used with aggregate validation methods (like `col_sum_eq`, `col_avg_gt`, etc.) to compare the aggregate value of a column in the main data against the aggregate value of a column in the reference data. To use this function, you must first set the reference data on the `Validate` object using the `reference=` parameter in the constructor.

Parameters
----------
column_name
    The name of the column in the reference data to compute the aggregate from.

Returns
-------
ReferenceColumn
    A reference column marker that indicates the value should be computed from the reference data.

Examples
--------
```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False, preview_incl_header=False)
```

Suppose we have two DataFrames: a current data table and a reference (historical) table. We want to validate that the sum of a column in the current data matches the sum of the same column in the reference data.
```{python} import pointblank as pb import polars as pl # Current data current_data = pl.DataFrame({"sales": [100, 200, 300]}) # Reference (historical) data reference_data = pl.DataFrame({"sales": [100, 200, 300]}) validation = ( pb.Validate(data=current_data, reference=reference_data) .col_sum_eq("sales", pb.ref("sales")) .interrogate() ) validation ``` You can also compare different columns or use tolerance: ```{python} current_data = pl.DataFrame({"revenue": [105, 205, 305]}) reference_data = pl.DataFrame({"sales": [100, 200, 300]}) # Check if revenue sum is within 10% of sales sum validation = ( pb.Validate(data=current_data, reference=reference_data) .col_sum_eq("revenue", pb.ref("sales"), tol=0.1) .interrogate() ) validation ``` See Also -------- The [`col()`](`pointblank.col`) function for referencing columns within the same table. ## Test Data Generation Generate synthetic test data based on schema definitions. Use `generate_dataset()` to create data from a Schema object. generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, output: "Literal['polars', 'pandas', 'dict']" = 'polars', country: 'str | list[str] | dict[str, float]' = 'US', shuffle: 'bool' = True, weighted: 'bool' = True) -> 'Any' Generate synthetic test data from a schema. This function generates random data that conforms to a schema's column definitions. When the schema is defined using `Field` objects with constraints (e.g., `min_val=`, `max_val=`, `pattern=`, `preset=`), the generated data will respect those constraints. Parameters ---------- schema The schema object defining the structure and constraints of the data to generate. Each column can be specified using a field helper function (e.g., `int_field()`, `string_field()`) for fine-grained control, or as a simple dtype string (e.g., `"Int64"`, `"String"`) for unconstrained generation. n Number of rows to generate. The default is `100`. seed Random seed for reproducibility. If provided, the same seed will produce the same data. Default is `None` (non-deterministic). output Output format for the generated data. Options are: (1) `"polars"` (the default) returns a Polars DataFrame, (2) `"pandas"` returns a Pandas DataFrame, and (3) `"dict"` returns a dictionary of lists. country Country code(s) for locale-aware generation when using presets. Accepts a single ISO 3166-1 alpha-2 or alpha-3 code (e.g., `"US"`, `"DEU"`), a list of codes for uniform mixing (e.g., `["US", "DE", "JP"]`), or a dict mapping codes to positive weights (e.g., `{"US": 60, "DE": 25, "JP": 15}`). See the *Locale Mixing* section below for details. The default is `"US"`. shuffle When `country=` is a list or dict (multi-country mixing), controls whether rows from different countries are interleaved randomly (`True`, the default) or grouped by country in the order the countries are specified (`False`). Ignored when `country=` is a single string. weighted When `True`, names and locations are sampled according to real-world frequency tiers. Common names like "James" and "Smith" appear far more often than rare names. Large cities like New York and Los Angeles dominate over small towns. Only affects data files that have been migrated to the tiered format; flat-list data always uses uniform sampling. Default is `True`. Returns ------- DataFrame or dict Generated data in the requested format. Raises ------ ValueError If the schema has no columns or if constraints cannot be satisfied. ImportError If required optional dependencies are not installed. 
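As a quick illustration of the `output=` options described above, the following sketch (using a deliberately minimal schema) requests a plain dictionary of lists instead of a DataFrame:

```python
import pointblank as pb

schema = pb.Schema(id=pb.int_field(min_val=1, max_val=10))

# With output="dict", the result is a plain dict of lists
# (e.g., {"id": [3, 7, 1, ...]}) rather than a Polars or Pandas DataFrame
data = pb.generate_dataset(schema, n=5, seed=1, output="dict")
```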
Presets and the `country=` Parameter
------------------------------------
Several `string_field()` presets produce locale-aware data that varies depending on the `country=` parameter. The following presets are particularly affected:

- **Address-related presets** (`"address"`, `"city"`, `"state"`, `"postcode"`, `"phone_number"`, `"latitude"`, `"longitude"`, `"license_plate"`): produce addresses, cities, postal codes, phone numbers, and license plates formatted for the specified country. For example, `country="DE"` yields German street names and PLZ postal codes, while `country="JP"` yields Japanese addresses. License plates for CA, US, DE, AU, and GB use province/state-specific formats when location fields are present.
- **Person-related presets** (`"name"`, `"name_full"`, `"first_name"`, `"last_name"`, `"email"`, `"user_name"`): produce culturally appropriate names for the specified country. For example, `country="FR"` produces French names, while `country="KR"` produces Korean names.
- **Business-related presets** (`"job"`, `"company"`): when both are present, the job and company are drawn from the same industry for realism. The `"name_full"` preset will also add profession-matched titles (e.g., "Dr." for doctors, "Prof." for professors), and integer columns named `age` are automatically constrained to a working-age range (22--65).
- **Financial presets** (`"iban"`, `"ssn"`, `"license_plate"`): produce identifiers in the format used by the specified country.
- **Locale preset** (`"locale_code"`): returns a locale identifier (e.g., `"en_US"`, `"de_DE"`) derived from the country. Multilingual countries randomly select among their official locale codes (e.g., `"CH"` yields `"de_CH"`, `"fr_CH"`, or `"it_CH"`).

When multiple columns in the same schema use related presets, the generated data is automatically coherent across those columns within each row. Person-related presets will share the same identity (e.g., the email is derived from the name), address-related presets will share the same location (e.g., the city matches the address), and business-related presets will share the same industry context.

Locale Mixing
-------------
The `country=` parameter accepts three input forms for flexible locale control: (1) a **single string** (the default), such as `"US"` or `"DEU"`, which generates all rows from one locale; (2) a **list of strings**, such as `["US", "DE", "JP"]`, which splits rows equally across the listed countries; and (3) a **dict of weights**, such as `{"US": 0.6, "DE": 0.3, "FR": 0.1}`, which allocates rows proportionally (weights are auto-normalized, so `{"US": 6, "DE": 3, "FR": 1}` is equivalent). Row counts are distributed using largest-remainder apportionment so they always sum to exactly `n=`. Each country's rows are generated as an independent batch (preserving all cross-column coherence within each batch), then either interleaved randomly (`shuffle=True`, the default) or left in contiguous country blocks (`shuffle=False`).
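To make the three forms concrete, here is a short sketch (not executed here; it uses only presets and parameters documented on this page):

```python
import pointblank as pb

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
)

# Single locale: every row is generated from one country
us_only = pb.generate_dataset(schema, n=100, seed=23, country="US")

# List form: rows are split equally across the listed countries
mixed = pb.generate_dataset(schema, n=100, seed=23, country=["US", "DE", "JP"])

# Dict form: weights are auto-normalized; shuffle=False keeps each
# country's rows in contiguous blocks instead of interleaving them
blocks = pb.generate_dataset(
    schema,
    n=100,
    seed=23,
    country={"US": 6, "DE": 3, "FR": 1},
    shuffle=False,
)
```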
Supported Countries ------------------- The `country=` parameter currently supports 100 countries with full locale data: **Europe (38 countries):** Armenia (`"AM"`), Austria (`"AT"`), Azerbaijan (`"AZ"`), Belgium (`"BE"`), Bulgaria (`"BG"`), Croatia (`"HR"`), Cyprus (`"CY"`), Czech Republic (`"CZ"`), Denmark (`"DK"`), Estonia (`"EE"`), Finland (`"FI"`), France (`"FR"`), Georgia (`"GE"`), Germany (`"DE"`), Greece (`"GR"`), Hungary (`"HU"`), Iceland (`"IS"`), Ireland (`"IE"`), Italy (`"IT"`), Latvia (`"LV"`), Lithuania (`"LT"`), Luxembourg (`"LU"`), Malta (`"MT"`), Moldova (`"MD"`), Netherlands (`"NL"`), Norway (`"NO"`), Poland (`"PL"`), Portugal (`"PT"`), Romania (`"RO"`), Russia (`"RU"`), Serbia (`"RS"`), Slovakia (`"SK"`), Slovenia (`"SI"`), Spain (`"ES"`), Sweden (`"SE"`), Switzerland (`"CH"`), Ukraine (`"UA"`), United Kingdom (`"GB"`) **Americas (19 countries):** Argentina (`"AR"`), Bolivia (`"BO"`), Brazil (`"BR"`), Canada (`"CA"`), Chile (`"CL"`), Colombia (`"CO"`), Costa Rica (`"CR"`), Dominican Republic (`"DO"`), Ecuador (`"EC"`), El Salvador (`"SV"`), Guatemala (`"GT"`), Honduras (`"HN"`), Jamaica (`"JM"`), Mexico (`"MX"`), Panama (`"PA"`), Paraguay (`"PY"`), Peru (`"PE"`), United States (`"US"`), Uruguay (`"UY"`) **Asia-Pacific (22 countries):** Australia (`"AU"`), Bangladesh (`"BD"`), Cambodia (`"KH"`), China (`"CN"`), Hong Kong (`"HK"`), India (`"IN"`), Indonesia (`"ID"`), Japan (`"JP"`), Kazakhstan (`"KZ"`), Malaysia (`"MY"`), Myanmar (`"MM"`), Nepal (`"NP"`), New Zealand (`"NZ"`), Pakistan (`"PK"`), Philippines (`"PH"`), Singapore (`"SG"`), South Korea (`"KR"`), Sri Lanka (`"LK"`), Taiwan (`"TW"`), Thailand (`"TH"`), Uzbekistan (`"UZ"`), Vietnam (`"VN"`) **Middle East & Africa (21 countries):** Algeria (`"DZ"`), Cameroon (`"CM"`), Egypt (`"EG"`), Ethiopia (`"ET"`), Ghana (`"GH"`), Israel (`"IL"`), Jordan (`"JO"`), Kenya (`"KE"`), Lebanon (`"LB"`), Morocco (`"MA"`), Mozambique (`"MZ"`), Nigeria (`"NG"`), Rwanda (`"RW"`), Saudi Arabia (`"SA"`), Senegal (`"SN"`), South Africa (`"ZA"`), Tanzania (`"TZ"`), Tunisia (`"TN"`), Turkey (`"TR"`), Uganda (`"UG"`), United Arab Emirates (`"AE"`) Pytest Fixture -------------- When Pointblank is installed, a `generate_dataset` pytest fixture is automatically available in all test files: no imports or `conftest.py` setup required. The fixture behaves identically to this function, but derives a deterministic seed from the test's fully-qualified name when `seed=` is not provided. This means: - the **same test** always produces the **same data**, with no manual seed management. - **different tests** get different seeds, so they exercise different data. - **you** can still pass an explicit `seed=` to override the automatic seed. - **calling** the fixture **multiple times** within one test produces different (but still deterministic) data on each call. - the fixture exposes `.default_seed` and `.last_seed` attributes for debugging. 
```python def test_my_pipeline(generate_dataset): import pointblank as pb schema = pb.Schema( user_id=pb.int_field(unique=True), email=pb.string_field(preset="email"), age=pb.int_field(min_val=18, max_val=100), ) df = generate_dataset(schema, n=500, country="DE") # seed is derived from "test_my_pipeline" — same data every run result = my_pipeline(df) assert result.shape[0] == 500 ``` Multiple datasets can be generated within the same test, each with its own deterministic seed: ```python def test_merge(generate_dataset): customers = generate_dataset(customer_schema, n=1000, country="US") orders = generate_dataset(order_schema, n=5000) # Both DataFrames are deterministic; each call gets a unique seed ``` When a test fails, include the seed in the assertion message so the failure is easy to reproduce: ```python def test_age_range(generate_dataset): df = generate_dataset(schema, n=100) assert df["age"].min() >= 18, f"Failed with seed {generate_dataset.last_seed}" ``` Seed Stability -------------- A given seed (whether explicit or auto-derived) is guaranteed to produce identical output **within the same Pointblank version**. Across versions, changes to country data files or generator logic may alter the output for a given seed. For CI pipelines that require bit-exact data across library upgrades, save generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like `pytest-snapshot` and `syrupy`. Examples -------- Here we define a schema with field constraints and generate test data from it: ```{python} import pointblank as pb schema = pb.Schema( user_id=pb.int_field(min_val=1, unique=True), email=pb.string_field(preset="email"), age=pb.int_field(min_val=18, max_val=100), status=pb.string_field(allowed=["active", "pending", "inactive"]), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` It's also possible to generate data from a simple, dtype-only schema. Setting `output="pandas"` returns a Pandas DataFrame: ```{python} schema = pb.Schema(name="String", age="Int64", active="Boolean") pb.preview(pb.generate_dataset(schema, n=50, seed=23, output="pandas")) ``` When using presets, the `country=` parameter controls the locale. Here, `country="DE"` produces German names and addresses: ```{python} schema = pb.Schema( name=pb.string_field(preset="name"), address=pb.string_field(preset="address"), city=pb.string_field(preset="city"), ) pb.preview(pb.generate_dataset(schema, n=20, seed=23, country="DE")) ``` We can combine several field types with nullable columns in a mixed-type dataset: ```{python} from datetime import date, timedelta schema = pb.Schema( id=pb.int_field(min_val=1, unique=True), name=pb.string_field(preset="name"), score=pb.float_field(min_val=0.0, max_val=100.0), is_active=pb.bool_field(p_true=0.75), joined=pb.date_field(min_date=date(2020, 1, 1), max_date=date(2024, 12, 31)), session_time=pb.duration_field( min_duration=timedelta(minutes=1), max_duration=timedelta(hours=3), nullable=True, null_probability=0.2, ), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` int_field(min_val: 'int | None' = None, max_val: 'int | None' = None, allowed: 'list[int] | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None, dtype: 'str' = 'Int64') -> 'IntField' Create an integer column specification for use in a schema. 
The `int_field()` function defines the constraints and behavior for an integer column when generating synthetic data with `generate_dataset()`. You can control the range of values with `min_val=` and `max_val=`, restrict values to a specific set with `allowed=`, enforce uniqueness with `unique=True`, and introduce null values with `nullable=True` and `null_probability=`. The `dtype=` parameter lets you choose the specific integer type (e.g., `"Int8"`, `"UInt16"`, `"Int64"`), which also determines the valid range of values. When no constraints are specified, values are drawn uniformly from the full range of the chosen integer dtype. If both `min_val=` and `max_val=` are provided, values are drawn uniformly from that range. If `allowed=` is provided, values are sampled from that specific list. Parameters ---------- min_val Minimum value (inclusive). Default is `None` (no minimum, uses dtype lower bound). max_val Maximum value (inclusive). Default is `None` (no maximum, uses dtype upper bound). allowed List of allowed values (categorical constraint). When provided, values are sampled from this list. Cannot be combined with `min_val=`/`max_val=`. nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. When `True`, the generator will retry until it produces `n` distinct values (subject to retry limits). generator Custom callable that generates values. When provided, this overrides all other constraints (`min_val=`, `max_val=`, `allowed=`, etc.). The callable should take no arguments and return a single integer value. dtype Integer dtype. Default is `"Int64"`. Options: `"Int8"`, `"Int16"`, `"Int32"`, `"Int64"`, `"UInt8"`, `"UInt16"`, `"UInt32"`, `"UInt64"`. Returns ------- IntField An integer field specification that can be passed to `Schema()`. Raises ------ ValueError If `min_val` is greater than `max_val`, if `allowed` is an empty list, if `null_probability` is not between `0.0` and `1.0`, or if `dtype` is not a valid integer type. 
Examples -------- The `min_val=` and `max_val=` parameters constrain generated ranges, while `allowed=` restricts values to a specific set: ```{python} import pointblank as pb schema = pb.Schema( user_id=pb.int_field(min_val=1, unique=True), age=pb.int_field(min_val=0, max_val=120), rating=pb.int_field(allowed=[1, 2, 3, 4, 5]), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` It's possible to introduce missing values with `nullable=True` and `null_probability=`, and to select a smaller dtype with `dtype=`: ```{python} schema = pb.Schema( score=pb.int_field(min_val=0, max_val=255, dtype="UInt8"), optional_val=pb.int_field( min_val=1, max_val=50, nullable=True, null_probability=0.3, ), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` We can also enforce uniqueness with `unique=True` to produce distinct identifiers within a range: ```{python} schema = pb.Schema( record_id=pb.int_field(min_val=1000, max_val=9999, unique=True), priority=pb.int_field(allowed=[1, 2, 3]), ) pb.preview(pb.generate_dataset(schema, n=30, seed=10)) ``` For complete control, a custom `generator=` callable can be provided: ```{python} import random rng = random.Random(0) schema = pb.Schema( even_numbers=pb.int_field(generator=lambda: rng.choice(range(0, 100, 2))), ) pb.preview(pb.generate_dataset(schema, n=20, seed=5)) ``` float_field(min_val: 'float | None' = None, max_val: 'float | None' = None, allowed: 'list[float] | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None, dtype: 'str' = 'Float64') -> 'FloatField' Create a floating-point column specification for use in a schema. The `float_field()` function defines the constraints and behavior for a floating-point column when generating synthetic data with `generate_dataset()`. You can control the range of values with `min_val=` and `max_val=`, restrict values to a specific set with `allowed=`, enforce uniqueness with `unique=True`, and introduce null values with `nullable=True` and `null_probability=`. The `dtype=` parameter lets you choose between `"Float32"` and `"Float64"` precision. When both `min_val=` and `max_val=` are provided, values are drawn from a uniform distribution across that range. If neither is specified, values are drawn uniformly from a large default range. If `allowed=` is provided, values are sampled from that specific list. Parameters ---------- min_val Minimum value (inclusive). Default is `None` (no minimum). max_val Maximum value (inclusive). Default is `None` (no maximum). allowed List of allowed values (categorical constraint). When provided, values are sampled from this list. Cannot be combined with `min_val=`/`max_val=`. nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. When `True`, the generator will retry until it produces `n` distinct values. generator Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single float value. dtype Float dtype. Default is `"Float64"`. Options: `"Float32"`, `"Float64"`. Returns ------- FloatField A float field specification that can be passed to `Schema()`. 
Raises ------ ValueError If `min_val` is greater than `max_val`, if `allowed` is an empty list, if `null_probability` is not between `0.0` and `1.0`, or if `dtype` is not a valid float type. Examples -------- The `min_val=` and `max_val=` parameters define the generated value ranges: ```{python} import pointblank as pb schema = pb.Schema( price=pb.float_field(min_val=0.01, max_val=9999.99), probability=pb.float_field(min_val=0.0, max_val=1.0), temperature=pb.float_field(min_val=-40.0, max_val=50.0), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` It's also possible to restrict values to a discrete set with `allowed=`, which is useful for fixed pricing tiers or measurement levels: ```{python} schema = pb.Schema( discount=pb.float_field(allowed=[0.05, 0.10, 0.15, 0.20, 0.25]), weight_kg=pb.float_field(min_val=0.5, max_val=100.0), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` We can simulate missing measurements by introducing null values: ```{python} schema = pb.Schema( reading=pb.float_field( min_val=0.0, max_val=500.0, nullable=True, null_probability=0.2, ), calibration=pb.float_field(min_val=0.9, max_val=1.1), ) pb.preview(pb.generate_dataset(schema, n=30, seed=7)) ``` Setting `dtype="Float32"` gives reduced precision, and a custom `generator=` provides full control over value generation: ```{python} import random, math rng = random.Random(0) schema = pb.Schema( sensor_value=pb.float_field(min_val=-10.0, max_val=10.0, dtype="Float32"), log_value=pb.float_field(generator=lambda: math.log(rng.uniform(1, 1000))), ) pb.preview(pb.generate_dataset(schema, n=20, seed=99)) ``` string_field(min_length: 'int | None' = None, max_length: 'int | None' = None, pattern: 'str | None' = None, preset: 'str | None' = None, allowed: 'list[str] | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None) -> 'StringField' Create a string column specification for use in a schema. The `string_field()` function defines the constraints and behavior for a string column when generating synthetic data with `generate_dataset()`. It provides three main modes of string generation: (1) controlled random strings with `min_length=`/`max_length=`, (2) strings matching a regular expression via `pattern=`, or (3) realistic data using `preset=` (e.g., `"email"`, `"name"`, `"address"`). You can also restrict values to a fixed set with `allowed=`. Only one of `preset=`, `pattern=`, or `allowed=` can be specified at a time. When no special mode is selected, random alphanumeric strings are generated with lengths between `min_length=` and `max_length=` (defaulting to 1--20 characters). Parameters ---------- min_length Minimum string length (for random string generation). Default is `None` (defaults to `1`). Only applies when `preset=`, `pattern=`, and `allowed=` are all `None`. max_length Maximum string length (for random string generation). Default is `None` (defaults to `20`). Only applies when `preset=`, `pattern=`, and `allowed=` are all `None`. pattern Regular expression pattern that generated strings must match. Supports character classes (e.g., `[A-Z]`, `[0-9]`), quantifiers (e.g., `{3}`, `{2,5}`), alternation, and groups. Cannot be combined with `preset=` or `allowed=`. preset Preset name for generating realistic data. When specified, values are produced using locale-aware data generation, and the `country=` parameter of `generate_dataset()` controls the locale. Cannot be combined with `pattern=` or `allowed=`. 
See the **Available Presets** section below for the full list.
allowed
    List of allowed string values (categorical constraint). Values are sampled uniformly from this list. Cannot be combined with `preset=` or `pattern=`.
nullable
    Whether the column can contain null values. Default is `False`.
null_probability
    Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`.
unique
    Whether all values must be unique. Default is `False`. When `True`, the generator will retry until it produces `n` distinct values.
generator
    Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single string value.

Returns
-------
StringField
    A string field specification that can be passed to `Schema()`.

Raises
------
ValueError
    If more than one of `preset=`, `pattern=`, or `allowed=` is specified; if `allowed=` is an empty list; if `min_length` or `max_length` is negative; if `min_length` exceeds `max_length`; or if `preset` is not a recognized preset name.

Available Presets
-----------------
The `preset=` parameter accepts one of the following preset names, organized by category. When a preset is used, the `country=` parameter of `generate_dataset()` controls the locale for region-specific formatting (e.g., address formats, phone number patterns).

**Personal:** `"name"` (first + last name), `"name_full"` (full name with possible prefix or suffix), `"first_name"`, `"last_name"`, `"gender"` (person's gender, coherent with name), `"email"` (realistic email address), `"phone_number"`, `"address"` (full street address), `"city"`, `"state"`, `"country"`, `"country_code_2"` (ISO 3166-1 alpha-2 code, e.g., `"US"`), `"country_code_3"` (ISO 3166-1 alpha-3 code, e.g., `"USA"`), `"postcode"`, `"latitude"`, `"longitude"`

**Business:** `"company"` (company name), `"job"` (job title), `"catch_phrase"`

**Internet:** `"url"`, `"domain_name"`, `"ipv4"`, `"ipv6"`, `"user_name"`, `"password"`

**Text:** `"text"` (paragraph of text), `"sentence"`, `"paragraph"`, `"word"`

**Financial:** `"credit_card_number"`, `"credit_card_provider"` (Visa, Mastercard, American Express, or Discover), `"iban"`, `"currency_code"`

**Identifiers:** `"uuid4"`, `"md5"` (MD5 hash, 32 hex chars), `"sha1"` (SHA-1 hash, 40 hex chars), `"sha256"` (SHA-256 hash, 64 hex chars), `"ssn"` (social security number), `"license_plate"`

**Barcodes:** `"ean8"` (EAN-8 barcode with valid check digit), `"ean13"` (EAN-13 barcode with valid check digit)

**Date/Time (as strings):** `"date_this_year"`, `"date_this_decade"`, `"date_between"` (random date between 2000 and 2025), `"date_range"` (two dates joined with an en-dash, e.g., `"2012-05-12 – 2015-11-22"`), `"future_date"` (up to 1 year ahead), `"past_date"` (up to 10 years back), `"time"`

**Miscellaneous:** `"color_name"`, `"file_name"`, `"file_extension"`, `"mime_type"`, `"user_agent"` (browser user agent string with country-specific browser weighting), `"locale_code"` (locale identifier like `"en_US"`, `"de_DE"`; multilingual countries return a random official locale)

Coherent Data Generation
------------------------
When multiple columns in the same schema use related presets, the generated data will be coherent across those columns within each row.
Specifically: - **Person-related presets** (`"name"`, `"name_full"`, `"first_name"`, `"last_name"`, `"gender"`, `"email"`, `"user_name"`): the email and username will be derived from the person's name, and `"gender"` will match the person's first name. - **Address-related presets** (`"address"`, `"city"`, `"state"`, `"postcode"`, `"phone_number"`, `"latitude"`, `"longitude"`): the city, state, and postcode will correspond to the same location within the address. - **Credit card presets** (`"credit_card_number"`, `"credit_card_provider"`): the card number prefix and provider name will be consistent (e.g., "Visa" with a "4"-prefixed number). This coherence is automatic and requires no additional configuration. Examples -------- The `preset=` parameter generates realistic personal data, while `allowed=` restricts values to a categorical set: ```{python} import pointblank as pb schema = pb.Schema( name=pb.string_field(preset="name"), email=pb.string_field(preset="email", unique=True), status=pb.string_field(allowed=["active", "pending", "inactive"]), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` We can also generate strings that match a regular expression with `pattern=` (e.g., product codes, identifiers): ```{python} schema = pb.Schema( product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"), batch_id=pb.string_field(pattern=r"BATCH-[A-Z][0-9]{3}"), sku=pb.string_field(pattern=r"[A-Z]{2}[0-9]{6}"), ) pb.preview(pb.generate_dataset(schema, n=30, seed=23)) ``` For random alphanumeric strings, `min_length=` and `max_length=` control the length. Adding `nullable=True` introduces missing values: ```{python} schema = pb.Schema( short_code=pb.string_field(min_length=3, max_length=5), notes=pb.string_field( min_length=10, max_length=50, nullable=True, null_probability=0.4, ), ) pb.preview(pb.generate_dataset(schema, n=30, seed=7)) ``` It's possible to combine business and internet presets to build a company directory: ```{python} schema = pb.Schema( company=pb.string_field(preset="company"), domain=pb.string_field(preset="domain_name"), industry_tag=pb.string_field(allowed=["tech", "finance", "health", "retail"]), ) pb.preview(pb.generate_dataset(schema, n=20, seed=55)) ``` bool_field(p_true: 'float' = 0.5, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None) -> 'BoolField' Create a boolean column specification for use in a schema. The `bool_field()` function defines the constraints and behavior for a boolean column when generating synthetic data with `generate_dataset()`. The `p_true=` parameter controls the probability of generating `True` values, which is useful for simulating real-world distributions where events may be rare or common (e.g., 5% fraud rate, 80% active users). By default, `True` and `False` are equally likely (`p_true=0.5`). Setting `p_true=0.0` produces all `False` values, and `p_true=1.0` produces all `True` values. Parameters ---------- p_true Probability of generating `True`. Default is `0.5` (equal probability). Must be between `0.0` and `1.0`. nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. Note that boolean columns can only have 2 unique non-null values, so `n` must be `<= 2` when `unique=True` (or `<= 3` with `nullable=True`). 
generator Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single boolean value. Returns ------- BoolField A boolean field specification that can be passed to `Schema()`. Raises ------ ValueError If `p_true` is not between `0.0` and `1.0`, or if `null_probability` is not between `0.0` and `1.0`. Examples -------- The `p_true=` parameter controls the distribution of `True`/`False` values, allowing you to simulate different probabilities: ```{python} import pointblank as pb schema = pb.Schema( is_active=pb.bool_field(p_true=0.8), is_premium=pb.bool_field(p_true=0.2), is_verified=pb.bool_field(), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` Optional boolean flags can be simulated by combining `nullable=True` with `null_probability=`: ```{python} schema = pb.Schema( opted_in=pb.bool_field(p_true=0.6), has_referral=pb.bool_field( p_true=0.3, nullable=True, null_probability=0.25, ), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` Boolean fields can be combined with other field types in a realistic schema: ```{python} schema = pb.Schema( user_id=pb.int_field(min_val=1, unique=True), name=pb.string_field(preset="name"), email_verified=pb.bool_field(p_true=0.9), is_admin=pb.bool_field(p_true=0.05), ) pb.preview(pb.generate_dataset(schema, n=30, seed=10)) ``` date_field(min_date: 'str | date | None' = None, max_date: 'str | date | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None) -> 'DateField' Create a date column specification for use in a schema. The `date_field()` function defines the constraints and behavior for a date column when generating synthetic data with `generate_dataset()`. You can control the date range with `min_date=` and `max_date=`, enforce uniqueness with `unique=True`, and introduce null values with `nullable=True` and `null_probability=`. Dates are generated uniformly within the specified range. If no range is provided, the default range is 2000-01-01 to 2030-12-31. Both `min_date=` and `max_date=` accept either `datetime.date` objects or ISO 8601 date strings (e.g., `"2024-06-15"`). Parameters ---------- min_date Minimum date (inclusive). Can be an ISO format string (e.g., `"2020-01-01"`) or a `datetime.date` object. Default is `None` (defaults to `2000-01-01`). max_date Maximum date (inclusive). Can be an ISO format string (e.g., `"2024-12-31"`) or a `datetime.date` object. Default is `None` (defaults to `2030-12-31`). nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. When `True`, the generator will retry until it produces `n` distinct dates. Ensure the date range is large enough to accommodate the requested number of unique dates. generator Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single `datetime.date` value. Returns ------- DateField A date field specification that can be passed to `Schema()`. Raises ------ ValueError If `min_date` is later than `max_date`, or if a date string cannot be parsed. 
Examples -------- The `min_date=` and `max_date=` parameters accept `datetime.date` objects to define date ranges: ```{python} import pointblank as pb from datetime import date schema = pb.Schema( birth_date=pb.date_field( min_date=date(1960, 1, 1), max_date=date(2005, 12, 31), ), hire_date=pb.date_field( min_date=date(2020, 1, 1), max_date=date(2024, 12, 31), ), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` For convenience, ISO format strings can be used instead of `date` objects: ```{python} schema = pb.Schema( event_date=pb.date_field(min_date="2024-01-01", max_date="2024-12-31"), signup_date=pb.date_field(min_date="2023-06-01", max_date="2024-06-01"), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` We can introduce missing dates with `nullable=True` and enforce distinct values using `unique=True`: ```{python} schema = pb.Schema( order_date=pb.date_field( min_date="2024-01-01", max_date="2024-03-31", unique=True, ), cancel_date=pb.date_field( min_date="2024-01-01", max_date="2024-12-31", nullable=True, null_probability=0.5, ), ) pb.preview(pb.generate_dataset(schema, n=30, seed=7)) ``` datetime_field(min_date: 'str | datetime | None' = None, max_date: 'str | datetime | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None) -> 'DatetimeField' Create a datetime column specification for use in a schema. The `datetime_field()` function defines the constraints and behavior for a datetime column when generating synthetic data with `generate_dataset()`. You can control the datetime range with `min_date=` and `max_date=`, enforce uniqueness with `unique=True`, and introduce null values with `nullable=True` and `null_probability=`. Datetime values are generated uniformly (at second-level resolution) within the specified range. If no range is provided, the default range is 2000-01-01T00:00:00 to 2030-12-31T23:59:59. Both `min_date=` and `max_date=` accept `datetime` objects, `date` objects (which are converted to datetimes at midnight), or ISO 8601 datetime strings. Parameters ---------- min_date Minimum datetime (inclusive). Can be an ISO format string (e.g., `"2024-01-01T00:00:00"`), a `datetime.datetime` object, or a `datetime.date` object. Default is `None` (defaults to `2000-01-01 00:00:00`). max_date Maximum datetime (inclusive). Can be an ISO format string, a `datetime.datetime` object, or a `datetime.date` object. Default is `None` (defaults to `2030-12-31 23:59:59`). nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. With second-level resolution over a wide range, collisions are unlikely for moderate dataset sizes. generator Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single `datetime.datetime` value. Returns ------- DatetimeField A datetime field specification that can be passed to `Schema()`. Raises ------ ValueError If `min_date` is later than `max_date`, or if a datetime string cannot be parsed. 
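Plain `date` objects are also accepted for the range bounds but aren't shown in the examples below; here is a minimal sketch, relying on the documented conversion of `date` objects to datetimes at midnight (the column name is illustrative):

```python
from datetime import date

import pointblank as pb

# `date` bounds are converted to datetimes at midnight, so this range spans
# 2024-01-01 00:00:00 through 2024-06-30 00:00:00.
schema = pb.Schema(
    logged_at=pb.datetime_field(min_date=date(2024, 1, 1), max_date=date(2024, 6, 30)),
)
pb.preview(pb.generate_dataset(schema, n=10, seed=23))
```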
Examples -------- The `min_date=` and `max_date=` parameters accept `datetime` objects for precise range definitions: ```{python} import pointblank as pb from datetime import datetime schema = pb.Schema( created_at=pb.datetime_field( min_date=datetime(2024, 1, 1), max_date=datetime(2024, 12, 31), ), updated_at=pb.datetime_field( min_date=datetime(2024, 6, 1), max_date=datetime(2024, 12, 31), ), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` For a quick setup, ISO format strings work just as well: ```{python} schema = pb.Schema( event_time=pb.datetime_field( min_date="2024-03-01T08:00:00", max_date="2024-03-01T18:00:00", ), ) pb.preview(pb.generate_dataset(schema, n=30, seed=23)) ``` Optional timestamps can be simulated with `nullable=True`, and datetime fields work nicely alongside other field types: ```{python} schema = pb.Schema( order_id=pb.int_field(min_val=1000, max_val=9999, unique=True), placed_at=pb.datetime_field( min_date=datetime(2024, 1, 1), max_date=datetime(2024, 12, 31), ), shipped_at=pb.datetime_field( min_date=datetime(2024, 1, 2), max_date=datetime(2025, 1, 15), nullable=True, null_probability=0.3, ), ) pb.preview(pb.generate_dataset(schema, n=30, seed=7)) ``` time_field(min_time: 'str | time | None' = None, max_time: 'str | time | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None) -> 'TimeField' Create a time column specification for use in a schema. The `time_field()` function defines the constraints and behavior for a time-of-day column when generating synthetic data with `generate_dataset()`. You can control the time range with `min_time=` and `max_time=`, enforce uniqueness with `unique=True`, and introduce null values with `nullable=True` and `null_probability=`. Time values are generated uniformly (at second-level resolution) within the specified range. If no range is provided, the default range is 00:00:00 to 23:59:59. Both `min_time=` and `max_time=` accept `datetime.time` objects or ISO format time strings (e.g., `"09:30:00"`). Parameters ---------- min_time Minimum time (inclusive). Can be an ISO format string (e.g., `"08:00:00"`) or a `datetime.time` object. Default is `None` (defaults to `00:00:00`). max_time Maximum time (inclusive). Can be an ISO format string (e.g., `"17:30:00"`) or a `datetime.time` object. Default is `None` (defaults to `23:59:59`). nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. With second-level resolution within a time range, uniqueness is feasible for moderate dataset sizes. generator Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single value. Returns ------- TimeField A time field specification that can be passed to `Schema()`. Raises ------ ValueError If `min_time` is later than `max_time`, or if a time string cannot be parsed. 
Examples -------- The `min_time=` and `max_time=` parameters accept `datetime.time` objects, making it easy to define business-hours ranges: ```{python} import pointblank as pb from datetime import time schema = pb.Schema( start_time=pb.time_field( min_time=time(9, 0, 0), max_time=time(12, 0, 0), ), end_time=pb.time_field( min_time=time(13, 0, 0), max_time=time(17, 0, 0), ), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` ISO format strings can also be used for convenience: ```{python} schema = pb.Schema( login_time=pb.time_field(min_time="06:00:00", max_time="23:59:59"), alarm_time=pb.time_field(min_time="05:00:00", max_time="09:00:00"), ) pb.preview(pb.generate_dataset(schema, n=30, seed=23)) ``` It's possible to introduce optional time values with `nullable=True` and combine them with other field types: ```{python} schema = pb.Schema( employee_id=pb.int_field(min_val=100, max_val=999, unique=True), check_in=pb.time_field(min_time="07:00:00", max_time="10:00:00"), check_out=pb.time_field( min_time="16:00:00", max_time="20:00:00", nullable=True, null_probability=0.15, ), ) pb.preview(pb.generate_dataset(schema, n=30, seed=7)) ``` duration_field(min_duration: 'str | timedelta | None' = None, max_duration: 'str | timedelta | None' = None, nullable: 'bool' = False, null_probability: 'float' = 0.0, unique: 'bool' = False, generator: 'Callable[[], Any] | None' = None) -> 'DurationField' Create a duration column specification for use in a schema. The `duration_field()` function defines the constraints and behavior for a duration (timedelta) column when generating synthetic data with `generate_dataset()`. You can control the duration range with `min_duration=` and `max_duration=`, enforce uniqueness with `unique=True`, and introduce null values with `nullable=True` and `null_probability=`. Duration values are generated uniformly (at second-level resolution) within the specified range. If no range is provided, the default range is 0 seconds to 30 days. Both `min_duration=` and `max_duration=` accept `datetime.timedelta` objects or colon-separated strings in `"HH:MM:SS"` or `"MM:SS"` format. Parameters ---------- min_duration Minimum duration (inclusive). Can be a `"HH:MM:SS"` or `"MM:SS"` string, or a `datetime.timedelta` object. Default is `None` (defaults to 0 seconds). max_duration Maximum duration (inclusive). Can be a `"HH:MM:SS"` or `"MM:SS"` string, or a `datetime.timedelta` object. Default is `None` (defaults to 30 days). nullable Whether the column can contain null values. Default is `False`. null_probability Probability of generating a null value for each row when `nullable=True`. Must be between `0.0` and `1.0`. Default is `0.0`. unique Whether all values must be unique. Default is `False`. With second-level resolution within a duration range, uniqueness is feasible for moderate dataset sizes. generator Custom callable that generates values. When provided, this overrides all other constraints. The callable should take no arguments and return a single `datetime.timedelta` value. Returns ------- DurationField A duration field specification that can be passed to `Schema()`. Raises ------ ValueError If `min_duration` is greater than `max_duration`, or if a duration string cannot be parsed. 
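The shorter `"MM:SS"` string form isn't shown in the examples below; here is a minimal sketch using both documented string formats (the column names are illustrative):

```python
import pointblank as pb

# Per the parameter descriptions above, "MM:SS" strings carry minutes and
# seconds, while "HH:MM:SS" strings also include an hours component.
schema = pb.Schema(
    ad_break=pb.duration_field(min_duration="00:15", max_duration="02:30"),
    feature_film=pb.duration_field(min_duration="1:20:00", max_duration="3:00:00"),
)
pb.preview(pb.generate_dataset(schema, n=10, seed=23))
```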
Examples -------- The `min_duration=` and `max_duration=` parameters accept `timedelta` objects for defining duration ranges: ```{python} import pointblank as pb from datetime import timedelta schema = pb.Schema( session_length=pb.duration_field( min_duration=timedelta(minutes=5), max_duration=timedelta(hours=2), ), wait_time=pb.duration_field( min_duration=timedelta(seconds=30), max_duration=timedelta(minutes=15), ), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` Colon-separated strings can also be used for quick duration definitions: ```{python} schema = pb.Schema( call_duration=pb.duration_field(min_duration="0:01:00", max_duration="1:30:00"), break_time=pb.duration_field(min_duration="0:05:00", max_duration="0:30:00"), ) pb.preview(pb.generate_dataset(schema, n=30, seed=23)) ``` Optional durations can be created with `nullable=True`, and duration fields work well alongside other field types: ```{python} schema = pb.Schema( task_id=pb.int_field(min_val=1, max_val=500, unique=True), time_spent=pb.duration_field( min_duration=timedelta(minutes=1), max_duration=timedelta(hours=8), ), overtime=pb.duration_field( min_duration=timedelta(0), max_duration=timedelta(hours=4), nullable=True, null_probability=0.6, ), ) pb.preview(pb.generate_dataset(schema, n=30, seed=7)) ``` profile_fields(*, set: "Literal['minimal', 'standard', 'full']" = 'standard', split_name: 'bool' = True, include: 'list[str] | None' = None, exclude: 'list[str] | None' = None, prefix: 'str | None' = None) -> 'dict[str, StringField]' Create a dict of string field specifications representing a person profile. Returns a dictionary of `StringField` objects suitable for `**`-unpacking into a `Schema()`. Each field uses a preset that participates in the existing coherence system, so generated data will have coherent names, emails, addresses, and phone numbers within each row. Parameters ---------- set The base set of profile fields to include. Options are `"minimal"` (name, email, phone; 3-4 columns depending on `split_name=`), `"standard"` (name, email, city, state, postcode, phone; 6-7 columns), and `"full"` (name, email, address, city, state, postcode, phone, company, job; 9-10 columns). Default is `"standard"`. split_name Whether to split the name into separate `first_name` and `last_name` columns (`True`, the default) or use a single combined `name` column (`False`). include List of additional preset names to add to the base set. For example, `include=["company"]` adds a company column to the `"standard"` set. Presets already in the base set are silently ignored. exclude List of preset names to remove from the (possibly augmented) set. For example, `exclude=["postcode"]` removes the postcode column. Presets not in the set are silently ignored. prefix Optional string to prepend to every column name. For example, `prefix="customer_"` produces keys like `"customer_first_name"`, `"customer_email"`, etc. Returns ------- dict[str, StringField] A dictionary mapping column names to `StringField` objects, ordered logically (name fields first, then contact, address, phone, business). Raises ------ ValueError If `set=` is not one of `"minimal"`, `"standard"`, or `"full"`; if `include=` or `exclude=` contain unknown preset names; if a preset appears in both `include=` and `exclude=`; or if `include=` contains name presets incompatible with the `split_name=` setting. Examples -------- The default call returns the `"standard"` set of profile columns. 
The `**` operator unpacks the returned dictionary directly into `Schema()`, as if each `string_field()` call had been written by hand. All coherence rules apply automatically: emails are derived from names, and city/state/postcode/phone are internally consistent. ```{python} import pointblank as pb schema = pb.Schema( user_id=pb.int_field(unique=True), **pb.profile_fields(), ) pb.preview(pb.generate_dataset(schema, n=100, seed=23)) ``` Use `set=` to control how many columns are generated. The `"minimal"` set includes only `name`, `email`, and `phone`, while `"full"` adds `address`, `company`, and `job`. Setting `split_name=False` collapses `first_name` and `last_name` into a single combined `name` column: ```{python} schema = pb.Schema( **pb.profile_fields(set="minimal", split_name=False), balance=pb.float_field(min_val=0, max_val=10000), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` The `include=` and `exclude=` parameters let you customize the column set without switching to a different base set. Here we start from the `"full"` set but drop the business columns: ```{python} schema = pb.Schema( **pb.profile_fields(set="full", exclude=["company", "job"]), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23, country="DE")) ``` The `prefix=` parameter prepends a string to every column name, which is especially useful when a schema needs two independent profiles (e.g., a sender and a recipient). Each prefixed group maintains its own coherence: ```{python} schema = pb.Schema( **pb.profile_fields(set="minimal", prefix="sender_"), **pb.profile_fields(set="minimal", prefix="recipient_"), ) pb.preview(pb.generate_dataset(schema, n=50, seed=23)) ``` ## Prebuilt Actions Prebuilt action functions for common notification patterns. send_slack_notification(webhook_url: 'str | None' = None, step_msg: 'str | None' = None, summary_msg: 'str | None' = None, debug: 'bool' = False) -> 'Callable | None' Create a Slack notification function using a webhook URL. This function can be used in two ways: 1. With [`Actions`](`pointblank.Actions`) to notify about individual validation step failures 2. With [`FinalActions`](`pointblank.FinalActions`) to provide a summary notification after all validation steps have undergone interrogation The function creates a callable that sends notifications through a Slack webhook. Message formatting can be customized using templates for both individual steps and summary reports. Parameters ---------- webhook_url The Slack webhook URL. If `None` (and `debug=True`), a dry run is performed (see the *Offline Testing* section below for information on this). step_msg Template string for step notifications. Some of the available variables include: `"{step}"`, `"{column}"`, `"{value}"`, `"{type}"`, `"{time}"`, `"{level}"`, etc. See the *Available Template Variables for Step Notifications* section below for more details. If not provided, a default step message template will be used. summary_msg Template string for summary notifications. Some of the available variables are: `"{n_steps}"`, `"{n_passing_steps}"`, `"{n_failing_steps}"`, `"{all_passed}"`, `"{highest_severity}"`, etc. See the *Available Template Variables for Summary Notifications* section below for more details. If not provided, a default summary message template will be used. debug Print debug information if `True`. This includes the message content and the response from Slack. This is useful for testing and debugging the notification function. 
If `webhook_url` is `None`, the function will print the message to the console instead of sending it to Slack. This is useful for debugging and ensuring that your templates are formatted correctly.

Returns
-------
Callable
    A function that sends notifications to Slack.

Available Template Variables for Step Notifications
---------------------------------------------------
When creating a custom template for validation step alerts (`step_msg=`), the following templating strings can be used:

- `"{step}"`: The step number.
- `"{column}"`: The column name.
- `"{value}"`: The value being compared (only available in certain validation steps).
- `"{type}"`: The assertion type (e.g., `"col_vals_gt"`, etc.).
- `"{level}"`: The severity level (`"warning"`, `"error"`, or `"critical"`).
- `"{level_num}"`: The severity level as a numeric value (`30`, `40`, or `50`).
- `"{autobrief}"`: A localized and brief statement of the expectation for the step.
- `"{failure_text}"`: Localized text that explains how the validation step failed.
- `"{time}"`: The time of the notification.

Here's an example of how to construct a `step_msg=` template:

```python
step_msg = '''🚨 *Validation Step Alert*
• Step Number: {step}
• Column: {column}
• Test Type: {type}
• Value Tested: {value}
• Severity: {level} (level {level_num})
• Brief: {autobrief}
• Details: {failure_text}
• Time: {time}'''
```

This template will be filled with the relevant information when a validation step fails. The placeholders will be replaced with actual values when the Slack notification is sent.

Available Template Variables for Summary Notifications
------------------------------------------------------
When creating a custom template for a validation summary (`summary_msg=`), the following templating strings can be used:

- `"{n_steps}"`: The total number of validation steps.
- `"{n_passing_steps}"`: The number of validation steps where all test units passed.
- `"{n_failing_steps}"`: The number of validation steps that had some failing test units.
- `"{n_warning_steps}"`: The number of steps that exceeded a 'warning' threshold.
- `"{n_error_steps}"`: The number of steps that exceeded an 'error' threshold.
- `"{n_critical_steps}"`: The number of steps that exceeded a 'critical' threshold.
- `"{all_passed}"`: Whether or not every validation step had no failing test units.
- `"{highest_severity}"`: The highest severity level encountered during validation. This can be one of the following: `"warning"`, `"error"`, `"critical"`, `"some failing"`, or `"all passed"`.
- `"{tbl_row_count}"`: The number of rows in the target table.
- `"{tbl_column_count}"`: The number of columns in the target table.
- `"{tbl_name}"`: The name of the target table.
- `"{validation_duration}"`: The duration of the validation in seconds.
- `"{time}"`: The time of the notification.

Here's an example of how to put together a `summary_msg=` template:

```python
summary_msg = '''📊 *Validation Summary Report*

*Overview*
• Status: {highest_severity}
• All Passed: {all_passed}
• Total Steps: {n_steps}

*Step Results*
• Passing Steps: {n_passing_steps}
• Failing Steps: {n_failing_steps}
• Warning Level: {n_warning_steps}
• Error Level: {n_error_steps}
• Critical Level: {n_critical_steps}

*Table Info*
• Table Name: {tbl_name}
• Row Count: {tbl_row_count}
• Column Count: {tbl_column_count}

*Timing*
• Duration: {validation_duration}s
• Completed: {time}'''
```

This template will be filled with the relevant information when the validation summary is generated.
The placeholders will be replaced with actual values when the Slack notification is sent.

Offline Testing
---------------
If you want to test the function without sending actual notifications, you can leave the `webhook_url=` as `None` and set `debug=True`. This will print the message to the console instead of sending it to Slack, which is useful for debugging and for ensuring that your templates are formatted correctly.

Furthermore, the function can be called directly (i.e., outside of the context of a validation plan) to show the message templates with all possible variables. Here's an example of how to do this:

```python
import pointblank as pb

# Create a Slack notification function
notify_slack = pb.send_slack_notification(
    webhook_url=None,  # Leave as None for dry run
    debug=True,  # Enable debug mode to print message previews
)

# Call the function to see the message previews
notify_slack()
```

This will print the step and summary message previews to the console, allowing you to see how the templates will look when filled with actual data. You can then adjust your templates as needed before using them in a real validation plan.

When `step_msg=` and `summary_msg=` are not provided, the function will use default templates. However, you can customize the templates to include additional information or change the format to better suit your needs. Iterating on the templates can help you create more informative and visually appealing messages. Here's an example of that:

```python
import pointblank as pb

# Create a Slack notification function with custom templates
notify_slack = pb.send_slack_notification(
    webhook_url=None,  # Leave as None for dry run
    step_msg='''*Data Validation Alert*
• Type: {type}
• Level: {level}
• Step: {step}
• Column: {column}
• Time: {time}''',
    summary_msg='''*Data Validation Summary*
• Highest Severity: {highest_severity}
• Total Steps: {n_steps}
• Failed Steps: {n_failing_steps}
• Time: {time}''',
    debug=True,  # Enable debug mode to print message previews
)
```

These templates will be used with sample data when the function is called. The combination of `webhook_url=None` and `debug=True` allows you to test your custom templates without having to send actual notifications to Slack.

Examples
--------
When using an action with one or more validation steps, you typically provide callables that fire when a threshold for failed test units is met or exceeded. The callable can be a function or a lambda. The `send_slack_notification()` function creates a callable that sends a Slack notification when the validation step fails. Here is how it can be set up to work for multiple validation steps through the use of [Actions](`pointblank.Actions`):

```python
import pointblank as pb

# Create a Slack notification function
notify_slack = pb.send_slack_notification(
    webhook_url="https://hooks.slack.com/services/your/webhook/url"
)

# Create a validation plan
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(critical=notify_slack),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)

validation
```

By placing the `notify_slack()` function in the `Validate(actions=Actions(critical=))` argument, you can ensure that the notification is sent whenever the 'critical' threshold is reached (as set here, when 15% or more of the test units fail).
The notification will include information about the validation step that triggered the alert.

When using a [`FinalActions`](`pointblank.FinalActions`) object, the notification will be sent after all validation steps have been completed. This is useful for providing a summary of the validation process. Here is an example of how to set up a summary notification:

```python
import pointblank as pb

# Create a Slack notification function
notify_slack = pb.send_slack_notification(
    webhook_url="https://hooks.slack.com/services/your/webhook/url"
)

# Create a validation plan
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        final_actions=pb.FinalActions(notify_slack),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

In this case, the same `notify_slack()` function is used, but it is placed in `Validate(final_actions=FinalActions())`. This results in the summary notification being sent after all validation steps are completed, regardless of whether any steps failed. This flexibility is possible because `send_slack_notification()` creates a callable that can be used in both contexts: the function automatically determines whether to send a step notification or a summary notification based on the context in which it is called.

We can customize the message templates for both step and summary notifications to make them more informative. For example, Markdown formatting can make a message more readable and visually appealing. Here is an example of how to customize the templates:

```python
import pointblank as pb

# Create a Slack notification function
notify_slack = pb.send_slack_notification(
    webhook_url="https://hooks.slack.com/services/your/webhook/url",
    step_msg='''
🚨 *Validation Step Alert*
• Step Number: {step}
• Column: {column}
• Test Type: {type}
• Value Tested: {value}
• Severity: {level} (level {level_num})
• Brief: {autobrief}
• Details: {failure_text}
• Time: {time}''',
    summary_msg='''
📊 *Validation Summary Report*

*Overview*
• Status: {highest_severity}
• All Passed: {all_passed}
• Total Steps: {n_steps}

*Step Results*
• Passing Steps: {n_passing_steps}
• Failing Steps: {n_failing_steps}
• Warning Level: {n_warning_steps}
• Error Level: {n_error_steps}
• Critical Level: {n_critical_steps}

*Table Info*
• Table Name: {tbl_name}
• Row Count: {tbl_row_count}
• Column Count: {tbl_column_count}

*Timing*
• Duration: {validation_duration}s
• Completed: {time}''',
)

# Create a validation plan
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(default=notify_slack),
        final_actions=pb.FinalActions(notify_slack),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

In this example, we have customized the templates for both step and summary notifications. The step notification includes details about the validation step: the step number, column name, test type, value tested, severity level, brief description, and time of the notification.
The summary notification includes an overview of the validation process, including the status, number of steps, passing and failing steps, table information, and timing details. ---------------------------------------------------------------------- This is the CLI documentation for the package. ---------------------------------------------------------------------- ## CLI: pb ``` Usage: pb [OPTIONS] COMMAND [ARGS]... Pointblank CLI: Data validation and quality tools for data engineers. Use this CLI to validate data quality, explore datasets, and generate comprehensive reports for CSV, Parquet, and database sources. Suitable for data pipelines, ETL validation, and exploratory data analysis from the command line. Quick Examples: pb preview data.csv Preview your data pb scan data.csv Generate data profile pb validate data.csv Run basic validation Use pb COMMAND --help for detailed help on any command. Options: -v, --version Show the version and exit. -h, --help Show this message and exit. Commands: info Display information about a data source. preview Preview a data table showing head and tail rows. scan Generate a data scan profile report. missing Generate a missing values report for a data table. validate Perform single or multiple data validations. run Run a Pointblank validation script or YAML configuration. make-template Create a validation script or YAML configuration template. pl Execute Polars expressions and display results. datasets List available built-in datasets. requirements Check installed dependencies and their availability. ``` ### pb info ``` Usage: pb info [OPTIONS] [DATA_SOURCE] Display information about a data source. Shows table type, dimensions, column names, and data types. DATA_SOURCE can be: - CSV file path (e.g., data.csv) - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) Options: --help Show this message and exit. ``` ### pb preview ``` Usage: pb preview [OPTIONS] [DATA_SOURCE] Preview a data table showing head and tail rows. DATA_SOURCE can be: - CSV file path (e.g., data.csv) - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) - Piped data from pb pl command COLUMN SELECTION OPTIONS: For tables with many columns, use these options to control which columns are displayed: - --columns: Specify exact columns (--columns "name,age,email") - --col-range: Select column range (--col-range "1:10", --col-range "5:", --col-range ":15") - --col-first: Show first N columns (--col-first 5) - --col-last: Show last N columns (--col-last 3) Tables with >15 columns automatically show first 7 and last 7 columns with indicators. 
Options: --columns TEXT Comma-separated list of columns to display --col-range TEXT Column range like '1:10' or '5:' or ':15' (1-based indexing) --col-first INTEGER Show first N columns --col-last INTEGER Show last N columns --head INTEGER Number of rows from the top (default: 5) --tail INTEGER Number of rows from the bottom (default: 5) --limit INTEGER Maximum total rows to display (default: 50) --no-row-numbers Hide row numbers --max-col-width INTEGER Maximum column width in pixels (default: 250) --min-table-width INTEGER Minimum table width in pixels (default: 500) --no-header Hide table header --output-html PATH Save HTML output to file --help Show this message and exit. ``` ### pb scan ``` Usage: pb scan [OPTIONS] [DATA_SOURCE] Generate a data scan profile report. Produces a comprehensive data profile including: - Column types and distributions - Missing value patterns - Basic statistics - Data quality indicators DATA_SOURCE can be: - CSV file path (e.g., data.csv) - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) - Piped data from pb pl command Options: --output-html PATH Save HTML scan report to file -c, --columns TEXT Comma-separated list of columns to scan --help Show this message and exit. ``` ### pb missing ``` Usage: pb missing [OPTIONS] [DATA_SOURCE] Generate a missing values report for a data table. DATA_SOURCE can be: - CSV file path (e.g., data.csv) - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) - Piped data from pb pl command Options: --output-html PATH Save HTML output to file --help Show this message and exit. ``` ### pb validate ``` Usage: pb validate [OPTIONS] [DATA_SOURCE] Perform single or multiple data validations. Run one or more validation checks on your data in a single command. Use multiple --check options to perform multiple validations. DATA_SOURCE can be: - CSV file path (e.g., data.csv) - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) AVAILABLE CHECK_TYPES: Require no additional options: - rows-distinct: Check if all rows in the dataset are unique (no duplicates) - rows-complete: Check if all rows are complete (no missing values in any column) Require --column: - col-exists: Check if a specific column exists in the dataset - col-vals-not-null: Check if all values in a column are not null/missing Require --column and --value: - col-vals-gt: Check if column values are greater than a fixed value - col-vals-ge: Check if column values are greater than or equal to a fixed value - col-vals-lt: Check if column values are less than a fixed value - col-vals-le: Check if column values are less than or equal to a fixed value Require --column and --set: - col-vals-in-set: Check if column values are in an allowed set Use --list-checks to see all available validation methods with examples. 
  The default CHECK_TYPE is 'rows-distinct' which checks for duplicate rows.

  Examples:
  pb validate data.csv                       # Uses default validation (rows-distinct)
  pb validate data.csv --list-checks         # Show all available checks
  pb validate data.csv --check rows-distinct
  pb validate data.csv --check rows-distinct --show-extract
  pb validate data.csv --check rows-distinct --write-extract failing_rows_folder
  pb validate data.csv --check rows-distinct --exit-code
  pb validate data.csv --check col-exists --column price
  pb validate data.csv --check col-vals-not-null --column email
  pb validate data.csv --check col-vals-gt --column score --value 50
  pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending"

  Multiple validations in one command:
  pb validate data.csv --check rows-distinct --check rows-complete

Options:
  --list-checks         List available validation checks and exit
  --check CHECK_TYPE    Type of validation check to perform. Can be used
                        multiple times for multiple checks.
  --column TEXT         Column name or integer position as #N (1-based index)
                        for validation.
  --set TEXT            Comma-separated allowed values for col-vals-in-set
                        checks.
  --value FLOAT         Numeric value for comparison checks.
  --show-extract        Show extract of failing rows if validation fails
  --write-extract TEXT  Save failing rows to folder. Provide base name for
                        folder.
  --limit INTEGER       Maximum number of failing rows to save to CSV
                        (default: 500)
  --exit-code           Exit with non-zero code if validation fails
  --help                Show this message and exit.
```

### pb datasets

```
Usage: pb datasets [OPTIONS]

  List available built-in datasets.

Options:
  --help  Show this message and exit.
```

### pb requirements

```
Usage: pb requirements [OPTIONS]

  Check installed dependencies and their availability.

Options:
  --help  Show this message and exit.
```

### pb make-template

```
Usage: pb make-template [OPTIONS] [OUTPUT_FILE]

  Create a validation script or YAML configuration template.

  Creates a sample Python script or YAML configuration with examples showing
  how to use Pointblank for data validation. The template type is determined
  by the file extension:

  - .py files create Python script templates
  - .yaml/.yml files create YAML configuration templates

  Edit the template to add your own data loading and validation rules, then
  run it with 'pb run'.

  OUTPUT_FILE is the path where the template will be created.

  Examples:
  pb make-template my_validation.py        # Creates Python script template
  pb make-template my_validation.yaml      # Creates YAML config template
  pb make-template validation_template.yml # Creates YAML config template

Options:
  --help  Show this message and exit.
```

### pb run

```
Usage: pb run [OPTIONS] [VALIDATION_FILE]

  Run a Pointblank validation script or YAML configuration.

  VALIDATION_FILE can be:
  - A Python file (.py) that defines validation logic
  - A YAML configuration file (.yaml, .yml) that defines validation steps

  Python scripts should load their own data and create validation objects.
  YAML configurations define data sources and validation steps declaratively.

  If --data is provided, it will automatically replace the data source in
  your validation objects (Python scripts) or override the 'tbl' field (YAML
  configs).

  To get started quickly, use 'pb make-template' to create templates.
DATA can be: - CSV file path (e.g., data.csv) - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) Examples: pb make-template my_validation.py # Create a Python template pb run validation_script.py pb run validation_config.yaml pb run validation_script.py --data data.csv pb run validation_config.yaml --data small_table --output-html report.html pb run validation_script.py --show-extract --fail-on error pb run validation_config.yaml --write-extract extracts_folder --fail-on critical Options: --data TEXT Data source to replace in validation objects (Python scripts and YAML configs) --output-html PATH Save HTML validation report to file --output-json PATH Save JSON validation summary to file --show-extract Show extract of failing rows if validation fails --write-extract TEXT Save failing rows to folders (one CSV per step). Provide base name for folder. --limit INTEGER Maximum number of failing rows to save to CSV (default: 500) --fail-on [critical|error|warning|any] Exit with non-zero code when validation reaches this threshold level --help Show this message and exit. ``` ### pb pl ``` Usage: pb pl [OPTIONS] [POLARS_EXPRESSION] Execute Polars expressions and display results. Execute Polars DataFrame operations from the command line and display the results using Pointblank's visualization tools. POLARS_EXPRESSION should be a valid Polars expression that returns a DataFrame. The 'pl' module is automatically imported and available. Examples: # Direct expression pb pl "pl.read_csv('data.csv')" pb pl "pl.read_csv('data.csv').select(['name', 'age'])" pb pl "pl.read_csv('data.csv').filter(pl.col('age') > 25)" # Multi-line with editor (supports multiple statements) pb pl --edit # Multi-statement code example in editor: # csv = pl.read_csv('data.csv') # result = csv.select(['name', 'age']).filter(pl.col('age') > 25) # Multi-line with a specific editor pb pl --edit --editor nano pb pl --edit --editor code pb pl --edit --editor micro # From file pb pl --file query.py Piping to other pb commands pb pl "pl.read_csv('data.csv').head(20)" --pipe | pb validate --check rows-distinct pb pl --edit --pipe | pb preview --head 10 pb pl --edit --pipe | pb scan --output-html report.html pb pl --edit --pipe | pb missing --output-html missing_report.html Use --output-format to change how results are displayed: pb pl "pl.read_csv('data.csv')" --output-format scan pb pl "pl.read_csv('data.csv')" --output-format missing pb pl "pl.read_csv('data.csv')" --output-format info Note: For multi-statement code, assign your final result to a variable like 'result', 'df', 'data', or ensure it's the last expression. Options: -e, --edit Open editor for multi-line input -f, --file PATH Read query from file --editor TEXT Editor to use for --edit mode (overrides $EDITOR and auto-detection) -o, --output-format [preview|scan|missing|info] Output format for the result --preview-head INTEGER Number of head rows for preview --preview-tail INTEGER Number of tail rows for preview --output-html PATH Save HTML output to file --pipe Output data in a format suitable for piping to other pb commands --pipe-format [parquet|csv] Format for piped output (default: parquet) --help Show this message and exit. 
``` ---------------------------------------------------------------------- This is the User Guide documentation for the package. ---------------------------------------------------------------------- ### Index
![](../assets/pointblank_logo.svg){width=85%} **Data validation toolkit for assessing and monitoring data quality.**
Pointblank is a data validation framework for Python that makes data quality checks beautiful, powerful, and stakeholder-friendly. Instead of cryptic error messages, get stunning interactive reports that turn data issues into conversations.

Here's what a validation looks like (click "Show the code" to see how it's done):

```{python}
#| code-fold: true
#| code-summary: "Show the code"
import pointblank as pb
import polars as pl

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="polars"),
        tbl_name="game_revenue",
        label="Comprehensive validation of game revenue data",
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35),
        brief=True
    )
    .col_vals_regex(columns="player_id", pattern=r"^[A-Z]{12}[0-9]{3}$")  # STEP 1
    .col_vals_gt(columns="session_duration", value=20)                    # STEP 2
    .col_vals_ge(columns="item_revenue", value=0.20)                      # STEP 3
    .col_vals_in_set(columns="item_type", set=["iap", "ad"])              # STEP 4
    .col_vals_in_set(                                                     # STEP 5
        columns="acquisition",
        set=["google", "facebook", "organic", "crosspromo", "other_campaign"]
    )
    .col_vals_not_in_set(columns="country", set=["Mongolia", "Germany"])  # STEP 6
    .col_vals_between(                                                    # STEP 7
        columns="session_duration",
        left=10, right=50,
        pre=lambda df: df.select(pl.median("session_duration")),
        brief="Expect that the median of `session_duration` should be between `10` and `50`."
    )
    .rows_distinct(columns_subset=["player_id", "session_id", "time"])    # STEP 8
    .row_count_match(count=2000)                                          # STEP 9
    .col_count_match(count=11)                                            # STEP 10
    .col_vals_not_null(columns="item_type")                               # STEP 11
    .col_exists(columns="start_day")                                      # STEP 12
    .interrogate()
)

validation.get_tabular_report(title="Game Revenue Validation Report")
```

That's the kind of report you get from Pointblank: clear, interactive, and designed for everyone on your team.

And if you need help getting started or want to work faster, Pointblank has built-in AI support through the [`assistant()`](`assistant`) function to guide you along the way. You can also use [`DraftValidation`](user-guide/draft-validation.qmd) to quickly generate a validation plan from your existing data (great for getting started fast).

Ready to validate? Start with our [Installation](user-guide/installation.qmd) guide or jump straight to the [User Guide](user-guide/index.qmd).

By the way, Pointblank is made with 💙 by [Posit](https://posit.co/).

## What is Data Validation?

Data validation ensures your data meets quality standards before it's used in analysis, reports, or downstream systems. Pointblank provides a structured way to define validation rules, execute them, and communicate results to both technical and non-technical stakeholders.

With Pointblank you can:

- **Validate data** through a fluent, chainable API with [25+ validation methods](reference/index.qmd#validation-steps)
- **Set thresholds** to define acceptable levels of data quality (warning, error, critical)
- **Take actions** when thresholds are exceeded (notifications, logging, custom functions)
- **Generate reports** that make data quality issues immediately understandable
- **Inspect data** with built-in tools for previewing, summarizing, and finding missing values

## Why Pointblank?
Pointblank is designed for the entire data team, not just engineers:

- 🎨 **Beautiful Reports**: Interactive validation reports that stakeholders actually want to read
- 📊 **Threshold Management**: Define quality standards with warning, error, and critical levels
- 🔍 **Error Drill-Down**: Inspect failing data to get to root causes quickly
- 🔗 **Universal Compatibility**: Works with Polars, Pandas, DuckDB, MySQL, PostgreSQL, SQLite, and more
- 🌍 **Multilingual Support**: Reports available in 40 languages for global teams
- 📝 **YAML Support**: Write validations in YAML for version control and team collaboration
- ⚡ **CLI Tools**: Run validations from the command line for CI/CD pipelines or as quick checks
- 📋 **Rich Inspection**: Preview data, analyze columns, and visualize missing values

## Quick Examples

### Threshold-Based Quality

Set expectations and react when data quality degrades (with alerts, logging, or custom functions):

```python
validation = (
    pb.Validate(data=sales_data, thresholds=(0.01, 0.02, 0.05))  # Three threshold levels set
    .col_vals_not_null(columns="customer_id")
    .col_vals_in_set(columns="status", set=["pending", "shipped", "delivered"])
    .interrogate()
)
```

### YAML Workflows

Works wonderfully for CI/CD pipelines and team collaboration:

```yaml
validate:
  data: sales_data
  tbl_name: "sales_data"
  thresholds: [0.01, 0.02, 0.05]
  steps:
    - col_vals_not_null:
        columns: "customer_id"
    - col_vals_in_set:
        columns: "status"
        set: ["pending", "shipped", "delivered"]
```

```python
validation = pb.yaml_interrogate("validation.yaml")
```

### Command Line Power

Run validations without writing code:

```bash
# Quick validation
pb validate sales_data.csv --check col-vals-not-null --column customer_id

# Run YAML workflows
pb run validation.yaml --exit-code  # <- Great for CI/CD!

# Explore your data
pb scan sales_data.csv
pb missing sales_data.csv
```

## Installation

Install Pointblank using pip or conda:

```bash
pip install pointblank

# or
conda install conda-forge::pointblank
```

For specific backends:

```bash
pip install "pointblank[pl]"        # Polars support
pip install "pointblank[pd]"        # Pandas support
pip install "pointblank[duckdb]"    # DuckDB support
pip install "pointblank[postgres]"  # PostgreSQL support
```

See the [Installation guide](user-guide/installation.qmd) for more details.

## Text Formats

The docs are also available in `llms.txt` format:

- [`llms.txt`](llms.txt): a sitemap listing all documentation pages
- [`llms-full.txt`](llms-full.txt): all the documentation in one file

## Join the Community

We'd love to hear from you! Connect with us:

- [GitHub Issues](https://github.com/posit-dev/pointblank/issues) for bug reports and feature requests
- [Discord server](https://discord.com/invite/YH7CybCNCQ) for discussions and help
- [Contributing guidelines](https://github.com/posit-dev/pointblank/blob/main/CONTRIBUTING.md) if you'd like to contribute

---

**License**: MIT | **© 2024-2026 Posit Software, PBC**

### Getting Started

### Validation Plan

### Advanced Validation

### YAML

### Post Interrogation

### Data Inspection

### Test Data Generation

### The Pointblank CLI

### MCP Server

### Quickstart

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_footer_timings=False)
```

The Pointblank library is all about assessing the state of data quality for a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting.
We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various database tables. Let's walk through what data validation looks like in Pointblank. ## A Simple Validation Table This is a validation report table that is produced from a validation of a Polars DataFrame: ```{python} #| code-fold: true #| code-summary: "Show the code" import pointblank as pb ( pb.Validate(data=pb.load_dataset(dataset="small_table"), label="Example Validation") .col_vals_lt(columns="a", value=10) .col_vals_between(columns="d", left=0, right=5000) .col_vals_in_set(columns="f", set=["low", "mid", "high"]) .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$") .interrogate() ) ``` Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there's a lot of useful information packed into this validation table. Here's a diagram that describes a few of the important parts of the validation table: ![](/assets/validation-table-diagram.png){width=100%} There are three things that should be noted here: - validation steps: each step is a separate test on the table, focused on a certain aspect of the table - validation rules: the validation type is provided here along with key constraints - validation results: interrogation results are provided here, with a breakdown of test units (*total*, *passing*, and *failing*), threshold flags, and more The intent is to provide the key information in one place, and have it be interpretable by data stakeholders. For example, a failure can be seen in the second row (notice there's a CSV button). A data quality stakeholder could click this to download a CSV of the failing rows for that step. ## Example Code, Step-by-Step This section will walk you through the example code used above. ```python import pointblank as pb ( pb.Validate(data=pb.load_dataset(dataset="small_table")) .col_vals_lt(columns="a", value=10) .col_vals_between(columns="d", left=0, right=5000) .col_vals_in_set(columns="f", set=["low", "mid", "high"]) .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$") .interrogate() ) ``` Note these three key pieces in the code: - **data**: the `Validate(data=)` argument takes a DataFrame or database table that you want to validate - **steps**: the methods starting with `col_vals_` specify validation steps that run on specific columns - **execution**: the [`Validate.interrogate()`](`Validate.interrogate`) method executes the validation plan on the table This common pattern is used in a validation workflow, where `Validate` and [`Validate.interrogate()`](`Validate.interrogate`) bookend a validation plan generated through calling validation methods. In the next few sections we'll go a bit further by understanding how we can measure data quality and respond to failures. ## Understanding Test Units Each validation step will execute a type of validation test on the target table. For example, a [`Validate.col_vals_lt()`](`Validate.col_vals_lt`) validation step can test that each value in a column is less than a specified number. And the key finding that's reported in each step is the number of *test units* that pass or fail. In the validation report table, test unit metrics are displayed under the `UNITS`, `PASS`, and `FAIL` columns. 
This diagram explains what the tabulated values signify:

![](/assets/validation-test-units.png){width=100%}

Test units are dependent on the test being run. Some validation methods might test every value in a particular column, so each value will be a test unit. Others will only have a single test unit since they aren't testing individual values but rather whether the overall test passes or fails.

## Setting Thresholds for Data Quality Signals

Understanding test units is essential because they form the foundation of Pointblank's threshold system. Thresholds let you define acceptable levels of data quality, triggering different severity signals ('warning', 'error', or 'critical') when certain failure conditions are met.

Here's a simple example that uses a single validation step along with thresholds set using the `Thresholds` class:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(
        columns="a", value=7,
        # Set the 'warning' and 'error' thresholds ---
        thresholds=pb.Thresholds(warning=2, error=4)
    )
    .interrogate()
)
```

Looking at the validation report table, we can see:

- the `FAIL` column shows that 2 test units have failed
- the `W` column (short for 'warning') shows a filled gray circle indicating those failing test units reached that threshold value
- the `E` column (short for 'error') shows an open yellow circle indicating that the number of failing test units is below that threshold

The final threshold level, `C` (for 'critical'), wasn't set, so it appears on the validation table as a long dash.

## Taking Action on Threshold Exceedances

Pointblank becomes even more powerful when you combine thresholds with actions. The `Actions` class lets you trigger responses when validation failures exceed threshold levels, turning passive reporting into active notifications.

Here's a simple example that adds an action to the previous validation:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_lt(
        columns="a", value=7,
        thresholds=pb.Thresholds(warning=2, error=4),
        # Set an action for the 'warning' threshold ---
        actions=pb.Actions(
            warning="WARNING: Column 'a' has values that aren't less than 7."
        )
    )
    .interrogate()
)
```

Notice the printed warning message: `"WARNING: Column 'a' has values that aren't less than 7."`. The warning indicator (filled gray circle) visually confirms that this threshold was reached and that the action was triggered.

Actions make your validation workflows more responsive and integrated with your data pipelines. For example, you can generate console messages, Slack notifications, and more.

## Navigating the User Guide

As you continue exploring Pointblank's capabilities, you'll find the **User Guide** organized into sections that will help you navigate the various features.
### Getting Started The *Getting Started* section introduces you to Pointblank: - [Introduction](index.qmd): Overview of Pointblank and core concepts (**this article**) - [Installation](installation.qmd): How to install and set up Pointblank ### Validation Plan The *Validation Plan* section covers everything you need to know about creating robust validation plans: - [Overview](validation-overview.qmd): Survey of validation methods and their shared parameters - [Validation Methods](validation-methods.qmd): A closer look at the more common validation methods - [Column Selection Patterns](column-selection-patterns.qmd): Techniques for targeting specific columns - [Preprocessing](preprocessing.qmd): Transform data before validation - [Segmentation](segmentation.qmd): Apply validations to specific segments of your data - [Thresholds](thresholds.qmd): Set quality standards and trigger severity levels - [Actions](actions.qmd): Respond to threshold exceedances with notifications or custom functions - [Briefs](briefs.qmd): Add context to validation steps ### Advanced Validation The *Advanced Validation* section explores more specialized validation techniques: - [Expression-Based Validation](expressions.qmd): Use column expressions for advanced validation - [Schema Validation](schema-validation.qmd): Enforce table structure and column types - [Assertions](assertions.qmd): Raise exceptions to enforce data quality requirements - [Draft Validation](draft-validation.qmd): Create validation plans from existing data ### Post Interrogation After validating your data, the *Post Interrogation* section helps you analyze and respond to results: - [Validation Reports](validation-reports.qmd): Understand and customize the validation report table - [Step Reports](step-reports.qmd): View detailed results for individual validation steps - [Data Extracts](extracts.qmd): Extract and analyze failing data - [Sundering Validated Data](sundering.qmd): Split data based on validation results ### Data Inspection The *Data Inspection* section provides tools to explore and understand your data: - [Previewing Data](preview.qmd): View samples of your data - [Column Summaries](col-summary-tbl.qmd): Get statistical summaries of your data - [Missing Values Reporting](missing-vals-tbl.qmd): Identify and visualize missing data By following this guide, you'll gain a comprehensive understanding of how to validate, monitor, and maintain high-quality data with Pointblank. ::: {.callout-note} A [PDF version of the User Guide](../user-guide.pdf) is also available for offline reading. ::: ### Installation Pointblank can be installed using various package managers. The base installation gives you the core validation functionality, with optional dependencies for working with different data sources. 
## Basic Installation You can install Pointblank using your preferred package manager: ::: {.panel-tabset} ## pip ```bash pip install pointblank ``` ## uv ```bash uv pip install pointblank ``` ## conda ```bash conda install -c conda-forge pointblank ``` ## pixi ```bash # add pointblank to project pixi init name-of-project cd name-of-project pixi add pointblank ``` ::: ## DataFrame Libraries Pointblank requires a DataFrame library but doesn't include one by default, giving you the flexibility to choose either [Pandas](https://pandas.pydata.org) or [Polars](https://pola.rs): ::: {.panel-tabset} ## Polars ```bash # Using pip pip install pointblank[pl] # Or manually pip install polars>=1.24.0 ``` ## Pandas ```bash # Using pip pip install pointblank[pd] # Or manually pip install pandas>=2.2.3 ``` ::: Pointblank works seamlessly with both libraries, and you can choose the one that best fits your workflow and performance requirements. ## Optional Dependencies ### Ibis Backends To work with various database systems through [Ibis](https://ibis-project.org), you can install additional backends: ::: {.panel-tabset} ## pip ```bash pip install pointblank[sqlite] # SQLite pip install pointblank[duckdb] # DuckDB pip install pointblank[postgres] # PostgreSQL pip install pointblank[mysql] # MySQL pip install pointblank[mssql] # Microsoft SQL Server pip install pointblank[bigquery] # BigQuery pip install pointblank[pyspark] # Apache Spark pip install pointblank[databricks] # Databricks pip install pointblank[snowflake] # Snowflake # Example of installing multiple backends pip install pointblank[duckdb,postgres,sqlite] ``` ## uv ```bash uv pip install pointblank[sqlite] # SQLite uv pip install pointblank[duckdb] # DuckDB uv pip install pointblank[postgres] # PostgreSQL uv pip install pointblank[mysql] # MySQL uv pip install pointblank[mssql] # Microsoft SQL Server uv pip install pointblank[bigquery] # BigQuery uv pip install pointblank[pyspark] # Apache Spark uv pip install pointblank[databricks] # Databricks uv pip install pointblank[snowflake] # Snowflake # Example of installing multiple backends uv pip install pointblank[duckdb,postgres,sqlite] ``` ## conda ```bash conda install -c conda-forge pointblank-sqlite # SQLite conda install -c conda-forge pointblank-duckdb # DuckDB conda install -c conda-forge pointblank-postgres # PostgreSQL conda install -c conda-forge pointblank-mysql # MySQL conda install -c conda-forge pointblank-mssql # Microsoft SQL Server conda install -c conda-forge pointblank-bigquery # BigQuery conda install -c conda-forge pointblank-pyspark # Apache Spark conda install -c conda-forge pointblank-databricks # Databricks conda install -c conda-forge pointblank-snowflake # Snowflake # Example of installing multiple backends conda install -c conda-forge pointblank-duckdb pointblank-postgres pointblank-sqlite ``` ## pixi ```bash pixi add pointblank-sqlite # SQLite pixi add pointblank-duckdb # DuckDB pixi add pointblank-postgres # PostgreSQL pixi add pointblank-mysql # MySQL pixi add pointblank-mssql # Microsoft SQL Server pixi add pointblank-bigquery # BigQuery pixi add pointblank-pyspark # Apache Spark pixi add pointblank-databricks # Databricks pixi add pointblank-snowflake # Snowflake # Example of installing multiple backends pixi add pointblank-duckdb pointblank-postgres pointblank-sqlite ``` ::: ::: {.callout-note} Even when using exclusively Ibis backends, you still need either Pandas or Polars installed since Pointblank's reporting functionality (powered by [Great 
Tables](https://posit-dev.github.io/great-tables)) requires a DataFrame library for rendering tabular reporting results. ::: ### AI-Assisted Validation (Experimental) Pointblank includes experimental support for AI-assisted validation plan generation: ```bash pip install pointblank[generate] ``` This installs the necessary dependencies for working with LLM providers to help generate validation plans. See the [Draft Validation](draft-validation.qmd) article for how to create validation plans from existing data. ### Development Version If you want the latest development version with the newest features, you can install directly from GitHub: ```bash pip install git+https://github.com/posit-dev/pointblank.git ``` ## Verifying Your Installation You can verify your installation by importing Pointblank and checking the version: ```python import pointblank as pb print(pb.__version__) ``` ## System Requirements - Python 3.10 or higher - a supported DataFrame library (Pandas or Polars) - optional: Ibis (for database connectivity) ## Next Steps Now that you've installed Pointblank, you're ready to start validating your data. If you haven't read the [Introduction](index.qmd) yet, consider starting there to learn the basic concepts. If you encounter any installation issues, please [open an issue on GitHub](https://github.com/posit-dev/pointblank/issues/new) with details about your system and the specific error messages you're seeing. The maintainers actively monitor these issues and can help troubleshoot problems. For a quick test of your installation, try running a simple validation: ```python import pointblank as pb # Load a small dataset data = pb.load_dataset("small_table") # Create a simple validation validation = ( pb.Validate(data=data) .col_exists(columns=["a", "b", "c"]) .interrogate() ) # Display the validation results validation ``` ## Command Line Interface Once installed, Pointblank also provides a powerful command-line interface for quick data validation tasks: ```bash # Test the CLI with a built-in dataset pb validate small_table --check rows-distinct # Check if a column exists pb validate small_table --check col-exists --column a # Validate data ranges pb validate small_table --check col-vals-lt --column a --value 10 ``` The CLI is perfect for: - quick data quality checks in CI/CD pipelines - exploratory data analysis from the terminal - integration with shell scripts and automation workflows ::: {.callout-tip} ## See the CLI in Action Watch our [interactive CLI demonstrations](../demos/cli-interactive/index.qmd) to see these commands executing in real-time with actual output formatting. ::: Learn more about the CLI capabilities in the [Command Line Interface](cli.qmd) guide. ### Overview ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_footer_timings=False) ``` This article provides a quick overview of the data validation features in Pointblank. It introduces the key concepts and shows examples of the main functionality, giving you a foundation for using the library effectively. Later articles in the **User Guide** will expand on each section covered here, providing more explanations and examples. ## Validation Methods Pointblank's core functionality revolves around validation steps, which are individual checks that verify different aspects of your data. These steps are created by calling validation methods from the `Validate` class. When combined they create a comprehensive validation plan for your data. 
Here's an example of a validation that incorporates three different validation methods:

```{python}
import pointblank as pb
import polars as pl

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Three different validation methods."
    )
    .col_vals_gt(columns="a", value=0)
    .rows_distinct()
    .col_exists(columns="date")
    .interrogate()
)
```

This example showcases how you can combine different types of validations in a single validation plan:

- a column value validation with `Validate.col_vals_gt()`
- a row-based validation with `Validate.rows_distinct()`
- a table structure validation with `Validate.col_exists()`

Most validation methods share common parameters that enhance their flexibility and power. These shared parameters (overviewed in the next few sections) create a consistent interface across all validation steps while allowing you to customize validation behavior for specific needs.

## Column Selection Patterns

You can apply the same validation logic to multiple columns at once through the use of column selection patterns (used in the `columns=` parameter). This reduces repetitive code and makes your validation plans more maintainable:

```{python}
import narwhals.selectors as ncs

# Map validations across multiple columns
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Applying column mapping in `columns`."
    )
    # Apply validation rules to multiple columns ---
    .col_vals_not_null(
        columns=["a", "b", "c"]
    )
    # Apply to numeric columns only with a Narwhals selector ---
    .col_vals_gt(
        columns=ncs.numeric(),
        value=0
    )
    .interrogate()
)
```

This technique is particularly valuable when working with wide datasets containing many similarly structured columns or when applying standard quality checks across an entire table. It also ensures consistency in how validation rules are applied across related data columns.

## Preprocessing

Preprocessing (with the `pre=` parameter) allows you to transform or modify your data before applying validation checks, enabling you to validate derived or modified data without altering the original dataset:

```{python}
import polars as pl

# Define preprocessing functions for `pre=` parameters
def double_column_a(df):
    return df.with_columns(pl.col("a") * 2)

def square_column_c(df):
    return df.with_columns(pl.col("c").pow(2))

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Preprocessing validation steps via `pre=`."
    )
    .col_vals_gt(
        columns="a",
        value=5,
        # Apply transformation before validation ---
        pre=double_column_a  # Double values before checking
    )
    .col_vals_lt(
        columns="c",
        value=100,
        # Apply more complex transformation ---
        pre=square_column_c  # Square values before checking
    )
    .interrogate()
)
```

Preprocessing enables validation of transformed data without modifying your original dataset, making it ideal for checking derived metrics or validating normalized values. This approach keeps your validation code clean while allowing for sophisticated data quality checks on calculated results.

## Segmentation

Segmentation (through the `segments=` parameter) allows you to validate data across different groups, enabling you to identify segment-specific quality issues that might be hidden in aggregate analyses:

```{python}
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Segmenting validation steps via `segments=`."
) .col_vals_gt( columns="c", value=3, # Split into steps by categorical values in column 'f' --- segments="f" ) .interrogate() ) ``` Segmentation is powerful for detecting patterns of quality issues that may exist only in specific data subsets, such as certain time periods, categories, or geographical regions. It helps ensure that all significant segments of your data meet quality standards, not just the data as a whole. ## Thresholds Thresholds (set through the `thresholds=` parameter) let you set acceptable levels of failure before triggering warnings, errors, or critical notifications for individual validation steps: ```{python} ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars"), label="Using thresholds." ) # Add validation steps with different thresholds --- .col_vals_gt( columns="a", value=1, thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3) ) # Add another step with stricter thresholds --- .col_vals_lt( columns="c", value=10, thresholds=pb.Thresholds(warning=0.05, error=0.1) ) .interrogate() ) ``` Thresholds provide a nuanced way to monitor data quality, allowing you to set different severity levels based on the importance of each validation and your organization's tolerance for specific types of data issues. ## Actions Actions (which can be configured in the `actions=` parameter) allow you to define specific responses when validation thresholds are crossed. You can use simple string messages or custom functions for more complex behavior: ```{python} # Example 1: Action with a string message --- ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars"), label="Using actions with a string message." ) .col_vals_gt( columns="c", value=2, thresholds=pb.Thresholds(warning=0.1, error=0.2), # Add a print-to-console action for the 'warning' threshold --- actions=pb.Actions( warning="WARNING: Values below `{value}` detected in column 'c'." ) ) .interrogate() ) ``` ```{python} # Example 2: Action with a callable function --- def custom_action(): from datetime import datetime print(f"Data quality issue found ({datetime.now()}).") ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars"), label="Using actions with a callable function." ) .col_vals_gt( columns="a", value=5, thresholds=pb.Thresholds(warning=0.1, error=0.2), # Apply the function to the 'error' threshold --- actions=pb.Actions(error=custom_action) ) .interrogate() ) ``` With custom action functions, you can implement sophisticated responses like sending notifications or logging to external systems. ## Briefs Briefs (which can be set through the `brief=` parameter) allow you to customize descriptions associated with validation steps, making validation results more understandable to stakeholders. Briefs can be either automatically generated by setting `brief=True` or defined as custom messages for more specific explanations: ```{python} ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars"), label="Using `brief=` for displaying brief messages." ) .col_vals_gt( columns="a", value=0, # Use `True` for automatic generation of briefs --- brief=True ) .col_exists( columns=["date", "date_time"], # Add a custom brief for this validation step --- brief="Verify required date columns exist for time-series analysis" ) .interrogate() ) ``` Briefs make validation results more meaningful by providing context about why each check matters. 
They're particularly valuable in shared reports where stakeholders from various disciplines need to understand validation results in domain-specific terms.

## Getting More Information

Each validation step can be further customized and has additional options. See these pages for more information:

- [Validation Methods](validation-methods.qmd): A closer look at the more common validation methods
- [Column Selection Patterns](column-selection-patterns.qmd): Techniques for targeting specific columns
- [Preprocessing](preprocessing.qmd): Transform data before validation
- [Segmentation](segmentation.qmd): Apply validations to specific segments of your data
- [Thresholds](thresholds.qmd): Set quality standards and trigger severity levels
- [Actions](actions.qmd): Respond to threshold exceedances with notifications or custom functions
- [Briefs](briefs.qmd): Add context to validation steps

## Conclusion

Validation steps are the building blocks of data validation in Pointblank. By combining steps from different categories and leveraging common features like thresholds, actions, and preprocessing, you can create comprehensive data quality checks tailored to your specific needs. The next sections of this guide will dive deeper into each of these topics, providing detailed explanations and examples.

### Validation Methods

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_header=False, report_incl_footer_timings=False)
```

Pointblank provides a comprehensive suite of validation methods to verify different aspects of your data. Each method creates a validation step that becomes part of your validation plan. These validation methods cover everything from checking column values against expected values to validating the table structure and detecting duplicates. Combined into validation steps, they form the foundation of your data quality workflow.

Pointblank provides [over 40 validation methods](https://posit-dev.github.io/pointblank/reference/#validation-steps) to handle diverse data quality requirements. These are grouped into five main categories:

1. Column Value Validations
2. Row-based Validations
3. Table Structure Validations
4. AI-Powered Validations
5. Aggregate Validations

A sixth, catch-all option, custom validations with [`Validate.specially()`](`Validate.specially`), is covered near the end of this article.

Within each of these categories, we'll walk through several examples showing how each validation method creates steps in your validation plan. And we'll use the `small_table` dataset for all of our examples. Here's a preview of it:

```{python}
# | echo: false
pb.preview(pb.load_dataset(dataset="small_table"), n_head=20, n_tail=20)
```

## Validation Methods to Validation Steps

In Pointblank, validation *methods* become validation *steps* when you add them to a validation plan. Each method creates a distinct step that performs a specific check on your data. Here's a simple example showing how three validation methods create three validation steps:

```{python}
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # Step 1: Check that values in column `a` are greater than 2 ---
    .col_vals_gt(columns="a", value=2, brief="Values in 'a' must exceed 2.")
    # Step 2: Check that column 'date' exists in the table ---
    .col_exists(columns="date", brief="Column 'date' must exist.")
    # Step 3: Check that the table has exactly 13 rows ---
    .row_count_match(count=13, brief="Table should have exactly 13 rows.")
    .interrogate()
)
```

Each validation method produces one step in the validation report above.
When combined, these steps form a complete validation plan that systematically checks different aspects of your data quality. ## Common Arguments Most validation methods in Pointblank share a set of common arguments that provide consistency and flexibility across different validation types: - `columns=`: specifies which column(s) to validate (used in column-based validations) - `pre=`: allows data transformation before validation - `segments=`: enables validation across different data subsets - `thresholds=`: sets acceptable failure thresholds - `actions=`: defines actions to take when validations fail - `brief=`: provides a description of what the validation is checking - `active=`: determines if the validation step should be executed (default is `True`) - `na_pass=`: controls how missing values are handled (only for column value validation methods) For column validation methods, the `na_pass=` parameter determines whether missing values (Null/None/NA) should pass validation (this parameter is covered in a later section). These arguments follow a consistent pattern across validation methods, so you don't need to memorize different parameter sets for each function. This systematic approach makes Pointblank more intuitive to work with as you build increasingly complex validation plans. We'll cover most of these common arguments in their own dedicated sections later in the **User Guide**, as some of them represent a deeper topic worthy of focused attention. ## 1. Column Value Validations These methods check individual values within columns against specific criteria: - **Comparison checks** ([`Validate.col_vals_gt()`](`Validate.col_vals_gt`), [`Validate.col_vals_lt()`](`Validate.col_vals_lt`), etc.) for comparing values to thresholds or other columns - **Range checks** ([`Validate.col_vals_between()`](`Validate.col_vals_between`), [`Validate.col_vals_outside()`](`Validate.col_vals_outside`)) for verifying that values fall within or outside specific ranges - **Set membership checks** ([`Validate.col_vals_in_set()`](`Validate.col_vals_in_set`), [`Validate.col_vals_not_in_set()`](`Validate.col_vals_not_in_set`)) for validating values against predefined sets - **Null value checks** ([`Validate.col_vals_null()`](`Validate.col_vals_null`), [`Validate.col_vals_not_null()`](`Validate.col_vals_not_null`)) for testing presence or absence of null values - **Pattern matching checks** ([`Validate.col_vals_regex()`](`Validate.col_vals_regex`), [`Validate.col_vals_within_spec()`](`Validate.col_vals_within_spec`)) for validating text patterns with regular expressions or against standard specifications - **Trending value checks** ([`Validate.col_vals_increasing()`](`Validate.col_vals_increasing`), [`Validate.col_vals_decreasing()`](`Validate.col_vals_decreasing`)) for verifying that values increase or decrease as you move down the rows - **Custom expression checks** ([`Validate.col_vals_expr()`](`Validate.col_vals_expr`)) for complex validations using custom expressions Now let's look at some key examples from select categories of column value validations. ### Comparison Checks Let's start with a simple example of how [`Validate.col_vals_gt()`](`Validate.col_vals_gt`) might be used to check if the values in a column are greater than a specified value. 
```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(columns="a", value=5)
    .interrogate()
)
```

If you're checking data in a column that contains Null/`None`/`NA` values and you'd like to disregard those values (i.e., let them pass validation), you can use `na_pass=True`. The following example checks values in column `c` of `small_table`, which contains two `None` values:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_le(columns="c", value=10, na_pass=True)
    .interrogate()
)
```

In the above validation table, we see that all test units passed. If we didn't use `na_pass=True`, there would be two failing test units, one for each `None` value in the `c` column.

It's possible to check column values against values in an adjacent column. To do this, supply the `value=` argument with the column name within the `col()` helper function. Here's an example of that:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(columns="a", value=pb.col("c"))
    .interrogate()
)
```

This validation checks that values in column `a` are less than values in column `c`.

### Checking for Missing Values

A very common thing to validate is that there are no Null/NA/missing values in a column. The [`Validate.col_vals_not_null()`](`Validate.col_vals_not_null`) method checks that every value in a column is present (i.e., not missing):

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_not_null(columns="a")
    .interrogate()
)
```

Column `a` has no missing values, and the above validation confirms this.

### Checking Percentage of Missing Values

While [`Validate.col_vals_not_null()`](`Validate.col_vals_not_null`) ensures there are no missing values at all, sometimes you need to validate that the proportion of missing values matches a specific percentage. The [`Validate.col_pct_null()`](`Validate.col_pct_null`) method checks whether the percentage of missing values in a column matches an expected value:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_pct_null(columns="c", p=0.15, tol=0.05)  # Expect ~15% missing values (±5%)
    .interrogate()
)
```

This validation checks that approximately 15% of values in column `c` are missing, allowing a tolerance of ±5% (so the acceptable range is 10-20%). The `tol=` parameter can accept various formats including absolute counts or percentage ranges:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_pct_null(columns="c", p=0.15, tol=(0.05, 0.10))  # Asymmetric tolerance: -5%/+10%
    .interrogate()
)
```

### Checking Strings with Regexes

A regular expression (regex) validation via the [`Validate.col_vals_regex()`](`Validate.col_vals_regex`) validation method checks if values in a column match a specified pattern. Here's an example with two validation steps, each checking text values in a column:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_regex(columns="b", pattern=r"^\d-[a-z]{3}-\d{3}$")
    .col_vals_regex(columns="f", pattern=r"high|low|mid")
    .interrogate()
)
```

### Checking Strings Against Specifications

The [`Validate.col_vals_within_spec()`](`Validate.col_vals_within_spec`) method validates column values against common data specifications like email addresses, URLs, postal codes, credit card numbers, ISBNs, VINs, and IBANs.
This is particularly useful when you need to validate that text data conforms to standard formats: ```{python} import polars as pl # Create a sample table with various data types sample_data = pl.DataFrame({ "isbn": ["978-0-306-40615-7", "0-306-40615-2", "invalid"], "email": ["test@example.com", "user@domain.co.uk", "not-an-email"], "zip": ["12345", "90210", "invalid"] }) ( pb.Validate(data=sample_data) .col_vals_within_spec(columns="isbn", spec="isbn") .col_vals_within_spec(columns="email", spec="email") .col_vals_within_spec(columns="zip", spec="postal_code[US]") .interrogate() ) ``` ### Checking for Trending Values The [`Validate.col_vals_increasing()`](`Validate.col_vals_increasing`) and [`Validate.col_vals_decreasing()`](`Validate.col_vals_decreasing`) validation methods check whether column values are increasing or decreasing as you move down the rows. These are useful for validating time series data, sequential identifiers, or any data where you expect monotonic trends: ```{python} import polars as pl # Create a sample table with increasing and decreasing values trend_data = pl.DataFrame({ "id": [1, 2, 3, 4, 5], "temperature": [20, 22, 25, 28, 30], "countdown": [100, 80, 60, 40, 20] }) ( pb.Validate(data=trend_data) .col_vals_increasing(columns="id") .col_vals_increasing(columns="temperature") .col_vals_decreasing(columns="countdown") .interrogate() ) ``` The `allow_stationary=` parameter lets you control whether consecutive identical values should pass validation. By default, stationary values (e.g., `[1, 2, 2, 3]`) will fail the increasing check, but setting `allow_stationary=True` will allow them to pass. ### Handling Missing Values with `na_pass=` When validating columns containing Null/None/NA values, you can control how these missing values are treated with the `na_pass=` parameter: ```{python} ( pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars")) .col_vals_le(columns="c", value=10, na_pass=True) .interrogate() ) ``` In the above example, column `c` contains two `None` values, but all test units pass because we set `na_pass=True`. Without this setting, those two values would fail the validation. In summary, `na_pass=` works like this: - `na_pass=True`: missing values pass validation regardless of the condition being tested - `na_pass=False` (the default): missing values fail validation ## 2. Row-based Validations Row-based validations focus on examining properties that span across entire rows rather than individual columns. These are essential for detecting issues that can't be found by looking at columns in isolation: - [`Validate.rows_distinct()`](`Validate.rows_distinct`): ensures no duplicate rows exist in the table - [`Validate.rows_complete()`](`Validate.rows_complete`): verifies that no rows contain any missing values These row-level validations are particularly valuable for ensuring data integrity and completeness at the record level, which is crucial for many analytical and operational data applications. ### Checking Row Distinctness Here's an example where we check for duplicate rows with [`Validate.rows_distinct()`](`Validate.rows_distinct`): ```{python} ( pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars")) .rows_distinct() .interrogate() ) ``` We can also adapt the [`Validate.rows_distinct()`](`Validate.rows_distinct`) check to use a single column or a subset of columns. To do that, we need to use the `columns_subset=` parameter. 
Here's an example of that:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .rows_distinct(columns_subset="b")
    .interrogate()
)
```

### Checking Row Completeness

Another important validation is checking for complete rows: rows that have no missing values across all columns or a specified subset of columns. The [`Validate.rows_complete()`](`Validate.rows_complete`) validation method performs this check. Here's an example checking if all rows in the table are complete (have no missing values in any column):

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .rows_complete()
    .interrogate()
)
```

As the report indicates, there are some incomplete rows in the table.

## 3. Table Structure Validations

Table structure validations ensure that the overall architecture of your data meets expectations. These structural checks form a foundation for more detailed data quality assessments:

- [`Validate.col_exists()`](`Validate.col_exists`): verifies a column exists in the table
- [`Validate.col_schema_match()`](`Validate.col_schema_match`): ensures table matches a defined schema
- [`Validate.col_count_match()`](`Validate.col_count_match`): confirms the table has the expected number of columns
- [`Validate.row_count_match()`](`Validate.row_count_match`): verifies the table has the expected number of rows
- [`Validate.tbl_match()`](`Validate.tbl_match`): validates that the target table matches a comparison table
- [`Validate.data_freshness()`](`Validate.data_freshness`): checks that data is recent and not stale

These structural validations provide essential checks on the fundamental organization of your data tables, ensuring they have the expected dimensions and components needed for reliable data analysis.

### Checking Column Presence

If you need to check for the presence of individual columns, the [`Validate.col_exists()`](`Validate.col_exists`) validation method is useful. In this example, we check whether the `date` column is present in the table:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_exists(columns="date")
    .interrogate()
)
```

That column is present, so the single test unit of this validation step is a passing one.

### Checking the Table Schema

For deeper checks of table structure, a schema validation can be performed with the [`Validate.col_schema_match()`](`Validate.col_schema_match`) validation method, where the goal is to check whether the structure of a table matches an expected schema. To define an expected table schema, we need to use the `Schema` class.

Here is a simple example that (1) prepares a schema consisting of column names, and (2) uses that `schema` object in a [`Validate.col_schema_match()`](`Validate.col_schema_match`) validation step:

```{python}
schema = pb.Schema(columns=["date_time", "date", "a", "b", "c", "d", "e", "f"])

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_schema_match(schema=schema)
    .interrogate()
)
```

The [`Validate.col_schema_match()`](`Validate.col_schema_match`) validation step will only have a single test unit (signifying pass or fail). We can see in the above validation report that the column schema validation passed.

More often, a schema will be defined using column names and column types. We can do that by using a list of tuples in the `columns=` parameter of `Schema`.
Here's an example of that approach in action:

```{python}
schema = pb.Schema(
    columns=[
        ("date_time", "Datetime(time_unit='us', time_zone=None)"),
        ("date", "Date"),
        ("a", "Int64"),
        ("b", "String"),
        ("c", "Int64"),
        ("d", "Float64"),
        ("e", "Boolean"),
        ("f", "String"),
    ]
)

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_schema_match(schema=schema)
    .interrogate()
)
```

The [`Validate.col_schema_match()`](`Validate.col_schema_match`) validation method has several boolean parameters that control how stringent the checks are:

- `complete=`: requires exact column matching (all expected columns must exist, no extra columns allowed)
- `in_order=`: enforces that columns appear in the same order as defined in the schema
- `case_sensitive_colnames=`: column names must match with exact letter case
- `case_sensitive_dtypes=`: data type strings must match with exact letter case

These parameters all default to `True`, providing strict schema validation. Setting any to `False` relaxes the validation requirements, making the checks more flexible when exact matching isn't necessary or practical for your use case.

### Comparing Tables with `tbl_match()`

The [`Validate.tbl_match()`](`Validate.tbl_match`) validation method provides a comprehensive way to verify that two tables are identical. It performs a progressive series of checks, from least to most stringent:

1. Column count match
2. Row count match
3. Schema match (loose - case-insensitive, any order)
4. Schema match (order - columns in correct order)
5. Schema match (exact - case-sensitive, correct order)
6. Data match (cell-by-cell comparison)

This progressive approach helps identify exactly where tables differ. Here's an example comparing the `small_table` dataset with itself:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .tbl_match(tbl_compare=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .interrogate()
)
```

This validation method is especially useful for:

- Verifying that data transformations preserve expected properties
- Comparing production data against a golden dataset
- Ensuring data consistency across different environments
- Validating that imported data matches source data

### Checking Row and Column Counts

Row and column count validations check the number of rows and columns in a table. Using [`Validate.row_count_match()`](`Validate.row_count_match`) checks whether the number of rows in a table matches a specified count.

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .row_count_match(count=13)
    .interrogate()
)
```

The [`Validate.col_count_match()`](`Validate.col_count_match`) validation method checks if the number of columns in a table matches a specified count.

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_count_match(count=8)
    .interrogate()
)
```

Expectations on column and row counts can be useful in certain situations, and they align nicely with schema checks.

### Validating Data Freshness

Late or missing data is one of the most common (and costly) data quality issues in production systems. When data pipelines fail silently or experience delays, downstream analytics and ML models can produce stale or misleading results. The [`Validate.data_freshness()`](`Validate.data_freshness`) validation method helps catch these issues early by verifying that your data contains recent records.
Data freshness validation works by checking a datetime column against a maximum allowed age. If the most recent timestamp in that column is older than the specified threshold, the validation fails. This simple check can prevent major downstream problems caused by stale data. Here's an example that validates data is no older than 2 days: ```{python} import polars as pl from datetime import datetime, timedelta # Simulate a data feed that should be updated daily recent_data = pl.DataFrame({ "event": ["login", "purchase", "logout", "signup"], "event_time": [ datetime.now() - timedelta(hours=1), datetime.now() - timedelta(hours=6), datetime.now() - timedelta(hours=12), datetime.now() - timedelta(hours=18), ], "user_id": [101, 102, 103, 104] }) ( pb.Validate(data=recent_data) .data_freshness(column="event_time", max_age="2d") .interrogate() ) ``` The `max_age=` parameter accepts a flexible string format: `"30m"` for 30 minutes, `"6h"` for 6 hours, `"2d"` for 2 days, or `"1w"` for 1 week. You can also combine units: `"1d 12h"` for 1.5 days. When validation succeeds, the report includes details about the data's age in the footer. When it fails, you'll see exactly how old the most recent data is and what threshold was exceeded. This context helps quickly diagnose whether you're dealing with a minor delay or a major pipeline failure. Data freshness validation is particularly valuable for: - monitoring ETL pipelines to catch failures before they cascade to reports and dashboards - validating data feeds to ensure third-party data sources are delivering as expected - including freshness checks in automated data quality tests as part of continuous integration - building alerting systems that trigger notifications when critical data becomes stale You might wonder why not just use [`Validate.col_vals_gt()`](`Validate.col_vals_gt`) with a datetime threshold. While that approach works, [`Validate.data_freshness()`](`Validate.data_freshness`) offers several advantages: the method name clearly communicates your intent, the `max_age=` string format (e.g., `"2d"`) is more readable than datetime arithmetic, it auto-generates meaningful validation briefs, the report footer shows helpful context about actual data age and thresholds, and timezone mismatches between your data and comparison time are handled gracefully with informative warnings. ::: {.callout-note} When comparing timezone-aware and timezone-naive datetimes, Pointblank will include a warning in the validation report. For consistent results, ensure your data and comparison times use compatible timezone settings. ::: ## 4. AI-Powered Validations AI-powered validations use Large Language Models (LLMs) to validate data based on natural language criteria. This opens up new possibilities for complex validation rules that are difficult to express with traditional programmatic methods. ### Validating with Natural Language Prompts The [`Validate.prompt()`](`Validate.prompt`) validation method allows you to describe validation criteria in plain language. The LLM interprets your prompt and evaluates each row, producing pass/fail results just like other Pointblank validation methods. 
This is particularly useful for: - Semantic checks (e.g., "descriptions should mention a product name") - Context-dependent validation (e.g., "prices should be reasonable for the product category") - Subjective quality assessments (e.g., "comments should be professional and constructive") - Complex rules that would require extensive regex patterns or custom functions Here's a simple example that validates whether text descriptions contain specific information: ```{python} #| eval: false import polars as pl # Create sample data with product descriptions products = pl.DataFrame({ "product": ["Widget A", "Gadget B", "Tool C"], "description": [ "High-quality widget made in USA", "Innovative gadget with warranty", "Professional tool" ], "price": [29.99, 49.99, 19.99] }) # Validate that descriptions mention quality or features ( pb.Validate(data=products) .prompt( prompt="Each description should mention either quality, features, or warranty", columns_subset=["description"], model="anthropic:claude-sonnet-4-5" ) .interrogate() ) ``` The `columns_subset=` parameter lets you specify which columns to include in the validation, improving performance and reducing API costs by only sending relevant data to the LLM. **Note:** To use [`Validate.prompt()`](`Validate.prompt`), you need to have the appropriate API credentials configured for your chosen LLM provider (Anthropic, OpenAI, Ollama, or AWS Bedrock). ## 5. Aggregate Validations Aggregate validations operate on column-level statistics rather than individual row values. These methods compute an aggregate value (such as sum, average, or standard deviation) from a column and compare it against an expected value. Unlike row-level validations where each row is a test unit, aggregate validations treat the entire column as a single test unit that either passes or fails. Pointblank provides three families of aggregate validation methods: - **Sum validations** ([`Validate.col_sum_eq()`](`Validate.col_sum_eq`), [`Validate.col_sum_gt()`](`Validate.col_sum_gt`), [`Validate.col_sum_lt()`](`Validate.col_sum_lt`), [`Validate.col_sum_ge()`](`Validate.col_sum_ge`), [`Validate.col_sum_le()`](`Validate.col_sum_le`)) for validating the sum of column values - **Average validations** ([`Validate.col_avg_eq()`](`Validate.col_avg_eq`), [`Validate.col_avg_gt()`](`Validate.col_avg_gt`), [`Validate.col_avg_lt()`](`Validate.col_avg_lt`), [`Validate.col_avg_ge()`](`Validate.col_avg_ge`), [`Validate.col_avg_le()`](`Validate.col_avg_le`)) for validating the mean of column values - **Standard deviation validations** ([`Validate.col_sd_eq()`](`Validate.col_sd_eq`), [`Validate.col_sd_gt()`](`Validate.col_sd_gt`), [`Validate.col_sd_lt()`](`Validate.col_sd_lt`), [`Validate.col_sd_ge()`](`Validate.col_sd_ge`), [`Validate.col_sd_le()`](`Validate.col_sd_le`)) for validating the standard deviation of column values Each family supports the five comparison operators: equal to (`_eq`), greater than (`_gt`), less than (`_lt`), greater than or equal to (`_ge`), and less than or equal to (`_le`). 
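Since the method names follow a consistent `col_<statistic>_<operator>()` pattern, learning one family effectively gives you all fifteen methods. As a quick sketch (reusing the `small_table` dataset and making loose assumptions about its values), here is one method from each family in a single plan:

```{python}
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_sum_gt(columns="a", value=0)   # the sum of `a` must be positive
    .col_avg_lt(columns="c", value=50)  # the mean of `c` must be under 50
    .col_sd_ge(columns="d", value=0)    # the SD of `d` must be non-negative
    .interrogate()
)
```

Each of these steps contributes exactly one test unit to the report, since the aggregate itself is what's being tested.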
### Validating Column Sums Here's an example validating that the sum of column `a` equals 55: ```{python} import polars as pl agg_data = pl.DataFrame({ "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "b": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100], }) ( pb.Validate(data=agg_data) .col_sum_eq(columns="a", value=55) .col_sum_gt(columns="b", value=500) .interrogate() ) ``` ### Validating Column Averages Average validations are useful for ensuring that typical values remain within expected bounds: ```{python} ( pb.Validate(data=agg_data) .col_avg_eq(columns="a", value=5.5) .col_avg_ge(columns="b", value=50) .interrogate() ) ``` ### Validating Standard Deviations Standard deviation validations help ensure data variability is within expected ranges: ```{python} ( pb.Validate(data=agg_data) .col_sd_gt(columns="a", value=2) .col_sd_lt(columns="b", value=35) .interrogate() ) ``` ### Using Tolerance for Fuzzy Comparisons Floating-point arithmetic can introduce small precision errors, making exact equality comparisons unreliable. The `tol=` parameter allows for fuzzy comparisons by specifying an acceptable tolerance: ```{python} ( pb.Validate(data=agg_data) .col_avg_eq(columns="a", value=5.5, tol=0.01) # Pass if average is within ±0.01 of 5.5 .col_sum_eq(columns="b", value=550, tol=1) # Pass if sum is within ±1 of 550 .interrogate() ) ``` For equality comparisons, the tolerance creates a range `[value - tol, value + tol]` within which the aggregate is considered valid. ### Comparing Against Reference Data Aggregate validations shine when comparing current data against a baseline or reference dataset. This is invaluable for detecting drift in data properties over time: ```{python} # Current data current_data = pl.DataFrame({"revenue": [100, 200, 150, 175, 125]}) # Historical baseline baseline_data = pl.DataFrame({"revenue": [95, 205, 145, 180, 130]}) ( pb.Validate(data=current_data, reference=baseline_data) .col_sum_eq(columns="revenue", tol=50) # Compare sums with tolerance .col_avg_eq(columns="revenue", tol=5) # Compare averages with tolerance .interrogate() ) ``` When `value=None` (the default) and reference data is set, aggregate methods automatically compare against the same column in the reference data. ## 6. Custom Validations with `specially()` While Pointblank provides over 40 built-in validation methods, there are times when you need to implement custom validation logic that goes beyond these standard checks. The [`Validate.specially()`](`Validate.specially`) method gives you complete flexibility to create bespoke validations for domain-specific business rules, complex multi-column checks, or cross-dataset referential integrity constraints. ### Basic Custom Validations The `specially()` method accepts a callable function that performs your custom validation logic. The function should return boolean values indicating whether each test unit passes: ```{python} import polars as pl simple_tbl = pl.DataFrame({ "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2] }) # Custom validation: sum of two columns must be positive def validate_sum_positive(data): return data.select(pl.col("a") + pl.col("b") > 0) ( pb.Validate(data=simple_tbl) .specially( expr=validate_sum_positive, brief="Sum of columns 'a' and 'b' must be positive" ) .interrogate() ) ``` This validation passes because all rows have a positive sum for columns `a` and `b`. The `specially()` method provides the flexibility to implement any validation logic you can express in Python, making it a powerful tool for custom data quality checks. 
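The same mechanism extends to custom statistical checks (one of the use cases noted in *When to Use `specially()`* below). Here's a minimal sketch, reusing `simple_tbl` from above, that validates the median of a column, a statistic without a dedicated built-in method:

```{python}
# Custom validation: the median of column `a` must be at least 4
def validate_median(data):
    # Returns a one-row boolean result, acting as a single table-level test unit
    return data.select((pl.col("a").median() >= 4).alias("median_ok"))

(
    pb.Validate(data=simple_tbl)
    .specially(
        expr=validate_median,
        brief="Median of column 'a' must be at least 4"
    )
    .interrogate()
)
```

Because the function returns a single boolean result, this behaves much like the aggregate validation methods covered earlier: one test unit that passes or fails for the whole column.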
### Cross-Dataset Referential Integrity

One powerful use case for `specially()` is validating relationships between multiple datasets. This is particularly valuable for checking foreign key constraints, conditional existence rules, and cardinality relationships that span multiple tables.

#### Foreign Key Validation

Verify that all keys in one dataset exist in another:

```{python}
# Create related datasets: Orders and OrderDetails
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["A", "B", "A", "C", "B"],
    "status": ["completed", "pending", "completed", "cancelled", "completed"]
})

order_details = pl.DataFrame({
    "detail_id": [101, 102, 103, 104, 105, 106, 107, 108, 109],
    "order_id": [1, 1, 1, 2, 3, 3, 4, 5, 5],
    "product_id": ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9"],
    "quantity": [2, 1, 3, 1, 2, 1, 1, 2, 1]
})

# Validate foreign key constraint
def check_foreign_key(df):
    """Check if all order_ids in order_details exist in orders table"""
    valid_order_ids = orders.select("order_id")
    # Semi join returns only rows with matching keys
    return df.join(valid_order_ids, on="order_id", how="semi").height == df.height

(
    pb.Validate(data=order_details, tbl_name="order_details")
    .specially(
        expr=check_foreign_key,
        brief="All order_ids must exist in orders table"
    )
    .interrogate()
)
```

This validation ensures referential integrity by confirming that every `order_id` in the `order_details` table has a corresponding record in the `orders` table. The use of a semi-join makes this check efficient, as it only verifies the existence of matching keys without returning full joined data.

#### Conditional Existence Checks

Implement "if X then Y must exist" logic across datasets:

```{python}
def check_completed_orders_have_details(df):
    """Completed orders must have at least one detail record"""
    completed_orders = df.filter(pl.col("status") == "completed")
    order_ids_with_details = (
        order_details
        .select("order_id")
        .unique()
        .with_columns(pl.lit(True).alias("has_details"))
    )

    # A left join keeps every completed order; orders without a matching
    # detail record get a null marker, which we convert to False
    return completed_orders.join(
        order_ids_with_details, on="order_id", how="left"
    ).select(
        pl.col("has_details").fill_null(False)
    )

(
    pb.Validate(data=orders, tbl_name="orders")
    .specially(
        expr=check_completed_orders_have_details,
        brief="Completed orders must have detail records"
    )
    .interrogate()
)
```

This validation implements conditional business logic: only orders with a `completed` status are required to have detail records. This pattern is common in real-world scenarios where certain records trigger mandatory relationships while others don't. The validation returns a boolean for each completed order, allowing you to see exactly which records pass or fail.
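The foreign key check earlier collapses everything into a single pass/fail test unit. As a sketch of a compact alternative (reusing the `orders` and `order_details` tables defined above), `is_in()` gives one test unit per detail row, in the same spirit as the conditional check just shown, so the report counts exactly how many rows are orphaned:

```{python}
def check_foreign_key_rowwise(df):
    """Return one boolean per row: True when the row's order_id exists in orders"""
    return df.select(pl.col("order_id").is_in(orders["order_id"]).alias("fk_ok"))

(
    pb.Validate(data=order_details, tbl_name="order_details")
    .specially(
        expr=check_foreign_key_rowwise,
        brief="Row-level FK check: order_id must exist in orders"
    )
    .interrogate()
)
```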
#### Cardinality Constraints

Validate that relationships between datasets follow specific cardinality rules:

```{python}
def check_quantity_ratio(df):
    """Each order should have exactly 3x quantity units in details"""
    order_counts = orders.group_by("order_id").agg(pl.len().alias("order_count"))
    detail_quantities = order_details.group_by("order_id").agg(
        pl.col("quantity").sum().alias("total_quantity")
    )

    joined = order_counts.join(detail_quantities, on="order_id", how="left")

    return joined.with_columns(
        (pl.col("total_quantity") == pl.col("order_count") * 3).alias("valid_ratio")
    ).select("valid_ratio")

(
    pb.Validate(data=orders, tbl_name="orders")
    .specially(
        expr=check_quantity_ratio,
        brief="Each order should have 3x quantity units in details",
        thresholds=(0.4, 0.7),  # Allow some flexibility
    )
    .interrogate()
)
```

Cardinality constraints like this validate that the relationship between datasets follows expected patterns. In this example, we expect each order to have a specific quantity ratio in the detail records. Note the use of `thresholds=` to allow some flexibility (not every order needs to meet this requirement perfectly, but too many violations would indicate a data quality issue).

#### Composite Keys with Business Logic

Validate complex relationships involving multiple columns and conditional logic:

```{python}
# More complex scenario with composite keys
employees = pl.DataFrame({
    "dept_id": ["D1", "D1", "D2", "D2", "D3"],
    "emp_id": ["E001", "E002", "E003", "E004", "E005"],
    "emp_name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "is_manager": [True, False, True, False, False]
})

projects = pl.DataFrame({
    "project_id": ["P1", "P2", "P3", "P4"],
    "dept_id": ["D1", "D2", "D1", "D3"],
    "manager_emp_id": ["E001", "E003", "E001", "E005"]
})

def check_project_manager_validity(df):
    """Project managers must be valid managers in their department"""
    validation_result = df.join(
        employees,
        left_on=["dept_id", "manager_emp_id"],
        right_on=["dept_id", "emp_id"],
        how="left"
    ).with_columns(
        # Manager must exist in dept AND have manager status
        (pl.col("emp_name").is_not_null() & pl.col("is_manager").fill_null(False)).alias("valid_manager")
    ).select("valid_manager")

    return validation_result

(
    pb.Validate(data=projects, tbl_name="projects")
    .specially(
        expr=check_project_manager_validity,
        brief="Project managers must be valid managers in their department"
    )
    .interrogate()
)
```

This example demonstrates validation using composite keys (both `dept_id` and `emp_id`) combined with conditional business logic (checking the `is_manager` flag). Such validations are common in enterprise systems where relationships must satisfy multiple constraints simultaneously. The validation reveals that one project (`P4`) fails because employee `E005` is not a manager, even though they exist in the same department.
### Reusable Validation Factories For validations you'll use repeatedly, create factory functions that generate customized validators: ```{python} def make_foreign_key_validator(reference_table, key_columns): """Factory function to create reusable foreign key validators""" def validate_fk(df): if isinstance(key_columns, str): keys = [key_columns] else: keys = key_columns ref_keys = reference_table.select(keys).unique() matched = df.join(ref_keys, on=keys, how="semi") return matched.height == df.height return validate_fk # Use the factory across multiple validations ( pb.Validate(data=order_details, tbl_name="order_details") .specially( expr=make_foreign_key_validator(orders, "order_id"), brief="FK constraint: order_id → orders" ) .interrogate() ) ``` Factory functions like `make_foreign_key_validator()` make your validation code more maintainable and reusable. Once defined, you can use the same factory to validate different foreign key relationships across your entire data pipeline, ensuring consistency in how these constraints are checked. This pattern is particularly valuable in production environments where you validate multiple related tables. ### When to Use `specially()` The `specially()` method is ideal for: - cross-dataset validations: foreign keys, referential integrity, conditional existence - complex business rules: multi-column checks, conditional logic, domain-specific constraints - custom statistical tests: validations requiring calculations not covered by built-in methods - SQL-style checks: converting complex SQL queries into validation steps - prototype validations: testing new validation patterns before implementing them as dedicated methods By combining `specially()` with Pointblank's built-in validation methods, you can create comprehensive data quality checks that address both standard and highly specific validation requirements. ## Conclusion In this article, we've explored the various types of validation methods that Pointblank offers for ensuring data quality. These methods provide a framework for validating column values, checking row properties, verifying table structures, using AI for complex semantic validations, and validating aggregate statistics across columns. By combining these validation methods into comprehensive plans, you can systematically test your data against business rules and quality expectations. And this all helps to ensure your data remains reliable and trustworthy. ### Column Selection Patterns ```{python} #| echo: false #| output: false import pointblank as pb pb.config(report_incl_header=False, report_incl_footer_timings=False) ``` Data validation often requires working with columns in flexible ways. Pointblank offers two powerful approaches: 1. Applying validation rules across multiple columns: validate many columns with a single rule 2. Comparing values between columns: create validations that compare values across different columns This guide covers both approaches in detail with practical examples. ## Part 1: Applying Rules Across Multiple Columns Many of Pointblank's validation methods perform column-level checks. These methods provide the `columns=` parameter, which accepts not just a single column name but multiple columns through various selection methods. Why is this useful? Often you'll want to perform the same validation check (e.g., checking that numerical values are all positive) across multiple columns. Rather than defining the same rules multiple times, you can map the validation across those columns in a single step. 
Let's explore this using the `game_revenue` dataset: ```{python} #| echo: false pb.preview(pb.load_dataset(dataset="game_revenue")) ``` ### Using a List of Column Names The simplest way to validate multiple columns is to provide a list to the `columns=` parameter. In the `game_revenue` dataset, we have two columns with numerical data: `item_revenue` and `session_duration`. If we expect all values in both columns to be greater than `0`, we can write: ```{python} import pointblank as pb ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_gt( columns=["item_revenue", "session_duration"], value=0 ) .interrogate() ) ``` The validation report shows two validation steps were created from a single method call! All validation parameters are shared across all generated steps, including thresholds and briefs: ```{python} ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_gt( columns=["item_revenue", "session_duration"], value=0, thresholds=(0.1, 0.2, 0.3), brief="`{col}` must be greater than zero." ) .interrogate() ) ``` In this example, you can see that the validation report displays customized briefs for each column ("`item_revenue` must be greater than zero." and "`session_duration` must be greater than zero."), automatically substituting the column name using the `{col}` placeholder in the brief template. This feature is particularly helpful when reviewing reports, as it provides clear, human-readable descriptions of what each validation step is checking. When working with multiple columns through a single validation call, these dynamically generated briefs make your validation reports more understandable for both technical and non-technical stakeholders. ### Using Pointblank's Column Selectors For more advanced column selection, Pointblank provides selector functions that resolve columns based on: - text patterns in column names - column position - column data type Two common selectors, `starts_with()` and `ends_with()`, resolve columns based on text patterns in column names. The `game_revenue` dataset has three columns starting with "item": `item_type`, `item_name`, and `item_revenue`. Let's check that these columns contain no missing values: ```{python} ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_not_null(columns=pb.starts_with("item")) .interrogate() ) ``` Three validation steps were automatically created because three columns matched the pattern. The complete list of column selectors includes: - `starts_with()` - `ends_with()` - `contains()` - `matches()` - `everything()` - `first_n()` - `last_n()` ### Combining Column Selectors Column selectors can be combined for more powerful selection. To do this, use the `col()` helper function with logical operators: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) For example, to select all columns except the first four: ```{python} col_selection = pb.col(pb.everything() - pb.first_n(4)) ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_not_null( columns=col_selection, thresholds=(1, 0.05, 0.1) ) .interrogate() ) ``` This selects every column except the first four, resulting in seven validation steps. 
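The other operators compose in the same way. As a quick sketch (assuming the `game_revenue` column names shown in the preview above), here's a selection that targets the `item` columns while excluding the revenue figure, combining `&` and `~`:

```{python}
# Select columns starting with "item" that don't end with "revenue"
col_selection = pb.col(pb.starts_with("item") & ~pb.ends_with("revenue"))

(
    pb.Validate(data=pb.load_dataset("game_revenue"))
    .col_vals_not_null(columns=col_selection)
    .interrogate()
)
```

This resolves to `item_type` and `item_name`, giving two validation steps.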
### Narwhals Selectors Pointblank also supports column selectors from the [Narwhals](https://narwhals-dev.github.io/narwhals/) library, which include: - `matches()` - `by_dtype()` - `boolean()` - `categorical()` - `datetime()` - `numeric()` - `string()` Here's an example selecting all numeric columns: ```{python} import narwhals.selectors as ncs ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_gt( columns=ncs.numeric(), value=0 ) .interrogate() ) ``` And selecting all string columns matching "item_": ```{python} ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_not_null(columns=pb.col(ncs.string() & ncs.matches("item_"))) .interrogate() ) ``` This example demonstrates the power of combining Narwhals selectors with logical operators. By using `ncs.string()` to select string columns and then filtering with `ncs.matches("item_")`, we can precisely target text columns with specific naming patterns. This type of targeted selection is particularly valuable when working with wide datasets that have consistent column naming conventions, allowing you to apply appropriate validation rules to logically grouped columns without explicitly listing each one. ### Caveats for Using Column Selectors While column selectors are powerful, there are some caveats. If a selector doesn't match any columns, the validation won't fail but will show an 'explosion' in the report: ```{python} ( pb.Validate(data=pb.load_dataset("game_revenue")) .col_vals_not_null(columns=pb.starts_with("items")) .col_vals_gt(columns="item_revenue", value=0) .interrogate() ) ``` Notice that although there was a problem with Step 1 (that should be addressed), the interrogation did move on to Step 2 without complication. To mitigate uncertainty, include validation steps that check for the existence of key columns with [`Validate.col_exists()`](`Validate.col_exists`) or verify the schema with [`Validate.col_schema_match()`](`Validate.col_schema_match`). ## Part 2: Comparing Values Between Columns Sometimes you need to compare values across different columns rather than against fixed values. Pointblank enables this through the `col()` helper function. Let's look at examples using the `small_table` dataset: ```{python} # | echo: false pb.preview(pb.load_dataset(dataset="small_table"), n_head=20, n_tail=20) ``` ### Using `col()`{.qd-no-link} to Specify a Comparison Column While we typically use validation methods to compare column values against fixed values: ```python ... .col_vals_gt(columns="a", value=2, ...) ... ``` We can also compare values between columns by using `col()` in the `value=` parameter: ```python ... .col_vals_gt(columns="a", value=pb.col("x"), ...) ... ``` This checks that each value in column `a` is greater than the corresponding value in column `x`. Here's a concrete example: ```{python} ( pb.Validate(data=pb.load_dataset("small_table")) .col_vals_gt( columns="d", value=pb.col("c") ) .interrogate() ) ``` Notice that the validation report shows both column names (`d` and `c`). There are two failing test units because of missing values in column `c`. When comparing across columns, missing values in either column can cause failures. To handle missing values, use `na_pass=True`: ```{python} ( pb.Validate(data=pb.load_dataset("small_table")) .col_vals_gt( columns="d", value=pb.col("c"), na_pass=True ) .interrogate() ) ``` Now all tests pass. 
The following validation methods accept a `col()` expression in their `value=` parameter:

- [`Validate.col_vals_gt()`](`Validate.col_vals_gt`)
- [`Validate.col_vals_lt()`](`Validate.col_vals_lt`)
- [`Validate.col_vals_ge()`](`Validate.col_vals_ge`)
- [`Validate.col_vals_le()`](`Validate.col_vals_le`)
- [`Validate.col_vals_eq()`](`Validate.col_vals_eq`)
- [`Validate.col_vals_ne()`](`Validate.col_vals_ne`)

### Using `col()` in Range Checks

For range validations via [`Validate.col_vals_between()`](`Validate.col_vals_between`) and [`Validate.col_vals_outside()`](`Validate.col_vals_outside`), you can use a mix of column references and fixed values:

```{python}
(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(
        columns="d",
        left=pb.col("c"), right=10_000,
        na_pass=True
    )
    .interrogate()
)
```

The validation report shows the range as `[c, 10000]`, indicating that the lower bound comes from column `c` while the upper bound is fixed at `10000`.

## Advanced Examples: Combining Both Approaches

The true power comes from combining both approaches: validating multiple columns and using cross-column comparisons:

```{python}
validation = (
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(
        columns=["c", "d"],
        value=pb.col("a"),
        na_pass=True
    )
    .interrogate()
)

validation
```

This creates validation steps checking that values in both columns `c` and `d` are greater than their corresponding values in column `a`.

## Conclusion

Pointblank provides flexible approaches to working with columns:

1. Column selection: validate multiple columns with a single validation rule
2. Cross-column comparison: compare values between columns

These capabilities allow you to:

- write more concise validation code
- apply consistent validation rules across similar columns
- create dynamic validations that check relationships between columns
- build comprehensive data quality checks with minimal code

By getting familiar with these techniques, you can create more elegant and powerful validation plans while also reducing repetition in your code.

### Preprocessing

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

While the available validation methods can do a lot for you, there are likewise a lot of things you *can't* easily do with them. What if you wanted to validate that

- string lengths in a column are less than 10 characters?
- the median of values in a column is less than the median of values in another column?
- there are at least three instances of every categorical value in a column?

These constitute more sophisticated validation requirements, yet such examinations are quite prevalent in practice. Rather than expanding our library to encompass every conceivable validation scenario (a pursuit that would yield an unwieldy and potentially infinite collection), we instead employ a more elegant approach. By transforming the table under examination through judicious preprocessing and exposing key metrics, we may subsequently employ the existing collection of validation methods. This compositional strategy affords us considerable analytical power while maintaining conceptual clarity and implementation parsimony.

Central to this approach is the idea of composability. Pointblank makes it easy to safely transform the target table for a given validation via the `pre=` argument. Any computed columns are available for the (short) lifetime of the validation step during interrogation. This composability means:
1. we can validate on different forms of the initial dataset (e.g., validating on aggregate forms, validating on calculated columns, etc.)
2. there's no need to start an entirely new validation process for each transformed version of the data (i.e., one tabular report could be produced instead of several)

This compositional paradigm allows us to use data transformation effectively within our validation workflows, maintaining both flexibility and clarity in our data quality assessments.

## Transforming Data with Lambda Functions

Now, through examples, let's look at the process of performing the validations mentioned above. We'll use the `small_table` dataset for all of the examples. Here it is in its entirety:

```{python}
#| echo: false
pb.preview(pb.load_dataset(dataset="small_table", tbl_type="polars"), n_head=20, n_tail=20)
```

In getting to grips with the basics, we'll try to validate that string lengths in the `b` column are less than 10 characters. We can't directly use the [`Validate.col_vals_lt()`](`Validate.col_vals_lt`) validation method with that column because it is meant to be used with a column of numeric values. Let's just give that method what it needs and create a column with string lengths! The target table is a Polars DataFrame so we'll provide a lambda function that uses the Polars API to add in that numeric column:

```{python}
import polars as pl

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths"
    )
    .col_vals_lt(
        # The generated column, via `pre=` (see below) ---
        columns="string_lengths",

        # The string length value to be less than ---
        value=10,

        # The lambda that gets string lengths from column `b` ---
        pre=lambda df: df.with_columns(string_lengths=pl.col("b").str.len_chars())
    )
    .interrogate()
)
```

The validation was successfully constructed and we can see from the validation report table that all strings in `b` had lengths less than 10 characters. Also note that the icon under the `TBL` column is no longer a rightward-facing arrow, but one that is indicative of a transformation taking place.

Let's examine the transformation approach more closely. In the previous example, we're not directly testing the `b` column itself. Instead, we're validating the `string_lengths` column that was generated by the lambda function provided to `pre=`. The Polars API's `with_columns()` method does the heavy lifting, creating numerical values that represent each string's length in the original column.

That transformation occurs only during interrogation and only for that validation step. Any prior or subsequent steps would normally use the as-provided `small_table`. Because data transformation is isolated at the step level, you don't have to generate separate validation plans for each form of the data; you're free to fluidly transform the target table as necessary to perform validations on different representations of the data.

## Using Custom Functions for Preprocessing

While lambda functions work well for simple transformations, custom named functions can make your validation code more organized and reusable, especially for complex preprocessing logic.
Let's implement the same string length validation using a dedicated function:

```{python}
def add_string_lengths(df):
    # This generates string lengths from column `b`; the new column with
    # the values is called `string_lengths` (will be placed as the last column)
    return df.with_columns(string_lengths=pl.col("b").str.len_chars())

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths for column `b`."
    )
    .col_vals_lt(
        # Use of a column selector function to select the last column ---
        columns=pb.last_n(1),

        # The string length to be less than ---
        value=10,

        # Custom function for generating string lengths in a new column ---
        pre=add_string_lengths
    )
    .interrogate()
)
```

The column-generating logic was placed in the `add_string_lengths()` function, which is then passed to `pre=`.

Notice we're using `pb.last_n(1)` in the `columns` parameter. This is a convenient column selector that targets the last column in the DataFrame, which in our case is the newly created `string_lengths` column. This saves us from having to explicitly write out the column name, making our code more adaptable if column names change. Despite not specifying the name directly, you'll still see the actual column name (`string_lengths`) displayed in the validation report.

## Creating Parameterized Preprocessing Functions

So far we've used simple functions and lambdas, but sometimes you may want to create more flexible preprocessing functions that can be configured with parameters. Let's create a reusable function that can calculate string lengths for any column:

```{python}
def string_length_calculator(column_name):
    """Returns a preprocessing function that calculates string lengths for the specified column."""
    def preprocessor(df):
        return df.with_columns(string_lengths=pl.col(column_name).str.len_chars())
    return preprocessor

# Validate string lengths in column b
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="String lengths for column `b`."
    )
    .col_vals_lt(
        columns=pb.last_n(1),
        value=10,
        pre=string_length_calculator(column_name="b")
    )
    .interrogate()
)
```

This pattern is called a *function factory*, which is a function that creates and returns another function. The outer function (`string_length_calculator()`) accepts parameters that customize the behavior of the returned preprocessing function. The inner function (`preprocessor()`) is what actually gets called during validation.

This approach offers several benefits as it:

- creates reusable, configurable preprocessing functions
- keeps your validation code DRY
- allows you to separate configuration from implementation
- enables easy application of the same transformation to different columns

You could extend this pattern to create even more sophisticated preprocessing functions with multiple parameters, default values, and complex logic.

## Using Narwhals to Preprocess Many Types of DataFrames

In the previous examples we used a Polars table. You might have a situation where you perform data validation variously on Pandas and Polars DataFrames. This is where Narwhals becomes handy: it provides a single, consistent API that works across multiple DataFrame types, eliminating the need to learn and switch between different APIs depending on your data source.

Let's obtain `small_table` as a Pandas DataFrame. We'll construct a validation step to verify that the median of column `c` is greater than the median of column `a`.
```{python}
import narwhals as nw

# Define preprocessing function using Narwhals for cross-backend compatibility
def get_median_columns_c_and_a(df):
    return nw.from_native(df).select(nw.median("c"), nw.median("a"))

(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="pandas"),
        tbl_name="small_table",
        label="Median comparison.",
    )
    .col_vals_gt(
        columns="c",
        value=pb.col("a"),

        # Using Narwhals to modify the table; generates table with columns `c` and `a` ---
        pre=get_median_columns_c_and_a
    )
    .interrogate()
)
```

The goal is to check that the median value of `c` is greater than the corresponding median of column `a`, which we set up through the `columns=` and `value=` parameters in the [`Validate.col_vals_gt()`](`Validate.col_vals_gt`) method.

There's a bit to unpack here so let's look at the preprocessing function first. Narwhals can translate a Pandas DataFrame to a Narwhals DataFrame with its `from_native()` function. After that initiating step, you're free to use the Narwhals API (which is modeled on a subset of the Polars API) to do the necessary data transformation. In this case, we are getting the medians of the `c` and `a` columns and ending up with a one-row, two-column table.

We should note that the transformed table is, perhaps surprisingly, a Narwhals DataFrame (we didn't have to go back to a Pandas DataFrame by using `.to_native()`). Pointblank is able to work directly with the Narwhals DataFrame for validation purposes, which makes the workflow more concise.

One more thing to note: Pointblank provides some convenient syntactic sugar for working with Narwhals. If you name the function's DataFrame parameter `dfn` instead of `df`, the system automatically applies `nw.from_native()` to the input DataFrame first. This lets you write more concise code without having to explicitly convert the DataFrame to a Narwhals format.

## Swapping in a Totally Different DataFrame

Sometimes data validation requires looking at completely transformed versions of your data (such as aggregated summaries, pivoted views, or even reference tables). While this approach goes against the typical paradigm of validating a single *target table*, there are legitimate use cases where you might need to validate properties that only emerge after significant transformations.

Let's now try to prepare the final validation scenario, checking that there are at least three instances of every categorical value in column `f` (which contains string values in the set of `"low"`, `"mid"`, and `"high"`). This time, we'll prepare the transformed table (transformed by Polars expressions) outside of the Pointblank code.

```{python}
data_original = pb.load_dataset(dataset="small_table", tbl_type="polars")

data_transformed = data_original.group_by("f").len(name="n")

data_transformed
```

Then, we'll plug in the `data_transformed` DataFrame with a preprocessing function:

```{python}
# Define preprocessing function to use the transformed data
def use_transformed_data(df):
    # The incoming table is ignored; the pre-computed counts table is returned
    return data_transformed

(
    pb.Validate(
        data=data_original,
        tbl_name="small_table",
        label="Category counts.",
    )
    .col_vals_ge(
        columns="n",
        value=3,
        pre=use_transformed_data
    )
    .interrogate()
)
```

We can see from the validation report table that there are three test units. This corresponds to a row for each of the categorical value counts. From the report, we find that two of the three test units are passing test units (turns out there are only two instances of `"mid"` in column `f`).
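Since `pre=` accepts any callable, the same table swap could be written more compactly with a lambda that ignores its input. Here's a sketch equivalent to the step above:

```python
(
    pb.Validate(data=data_original, tbl_name="small_table", label="Category counts.")
    .col_vals_ge(
        columns="n",
        value=3,
        # Ignore the incoming table and return the pre-computed counts
        pre=lambda df: data_transformed
    )
    .interrogate()
)
```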
Note that the swapped-in table can be any table type that Pointblank supports, like a Polars DataFrame (as shown here), a Pandas DataFrame, a Narwhals DataFrame, or any other compatible format. This flexibility allows you to validate properties of your data that might only be apparent after significant reshaping or aggregation.

## Conclusion

The preprocessing capabilities in Pointblank provide the power and flexibility for validating complex data properties beyond what's directly possible with the standard validation methods. Through the `pre=` parameter, you can:

- transform your data on-the-fly with computed columns
- generate aggregated metrics to validate statistical properties
- work seamlessly across different DataFrame types using Narwhals
- swap in completely different tables when validating properties that emerge only after transformation

By combining these preprocessing techniques with Pointblank's validation methods, you can create comprehensive data quality checks that address virtually any validation scenario without needing an endless library of specialized validation functions. This composable approach keeps your validation code concise while allowing you to verify even the most complex data quality requirements.

Remember that preprocessing happens just for the specific validation step, keeping your validation plan organized and maintaining the integrity of your original data throughout the rest of the validation process.

### Segmentation

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

When validating data, you often need to analyze specific subsets or segments of your data separately. Maybe you want to ensure that data quality meets standards in each geographic region, for each product category, or across different time periods. This is where the `segments=` argument can be useful.

Data segmentation lets you split a validation step into multiple segments, with each segment receiving its own validation step. Rather than validating an entire table at once, you could instead validate different partitions separately and get separate results for each.

The `segments=` argument is available in many validation methods; typically it's in those methods that check values within rows, and those methods that examine entire rows ([`Validate.rows_distinct()`](`Validate.rows_distinct`), [`Validate.rows_complete()`](`Validate.rows_complete`)). When you use it, Pointblank will:

1. split your data according to your segmentation criteria
2. run the validation separately on each segment
3. report results individually for each segment

Let's explore how to use the `segments=` argument through a few practical examples.

## Basic Segmentation by Column Values

The simplest way to segment data is by the unique values in a column. For the upcoming example, we'll use the `small_table` dataset, which contains a categorical-value column called `f`.
First, let's preview the dataset:

```{python}
table = pb.load_dataset()

pb.preview(table)
```

Now, let's validate that values in column `d` are greater than `100`, but we'll also segment the validation by the categorical values in column `f`:

```{python}
validation_1 = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
        label="Segmented validation by category"
    )
    .col_vals_gt(
        columns="d",
        value=100,

        # Segment by unique values in column `f` ---
        segments="f"
    )
    .interrogate()
)

validation_1
```

In the validation report, notice that instead of a single validation step, we have multiple steps: one for each unique value in the `f` column. The segmentation is clearly indicated in the `STEP` column with labels like `SEGMENT f / high`, making it easy to identify which segment each validation result belongs to. This clear labeling helps when reviewing reports, especially with complex validations that use multiple segmentation criteria.

## Segmenting on Specific Values

Sometimes you don't want to segment on all unique values in a column, but only on specific ones of interest. You can do this by providing a tuple with the column name and a list of values:

```{python}
validation_2 = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
        label="Segmented validation on specific categories"
    )
    .col_vals_gt(
        columns="d",
        value=100,
        segments=("f", ["low", "high"])  # Only segment on "low" and "high" values in column `f`
    )
    .interrogate()
)

validation_2
```

In this example, we only create validation steps for the `"low"` and `"high"` segments, ignoring any rows with `f` equal to `"mid"`.

## Multiple Segmentation Criteria

For more complex segmentation, you can provide a list of columns or column-value tuples. This creates segments based on combinations of criteria:

```{python}
validation_3 = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
        label="Multiple segmentation criteria"
    )
    .col_vals_gt(
        columns="d",
        value=100,

        # Segment by values in `f` AND specific values in `a` ---
        segments=["f", ("a", [1, 2])]
    )
    .interrogate()
)

validation_3
```

This creates validation steps for each combination of values in column `f` and the specified values in column `a`.

## Segmentation with Preprocessing

You can combine segmentation with preprocessing for powerful and flexible validations. All preprocessing is applied before segmentation occurs, which means you can create derived columns to segment on:

```{python}
import polars as pl

# Define preprocessing function for creating a categorical column
def add_d_category_column(df):
    return df.with_columns(
        d_category=pl.when(pl.col("d") > 150).then(pl.lit("high")).otherwise(pl.lit("low"))
    )

validation_4 = (
    pb.Validate(
        data=pb.load_dataset(tbl_type="polars"),
        tbl_name="small_table",
        label="Segmentation with preprocessing",
    )
    .col_vals_gt(
        columns="d",
        value=100,

        # Create a column containing categorical values ---
        pre=add_d_category_column,

        # Segment by the computed column `d_category` generated via `pre=` ---
        segments="d_category",
    )
    .interrogate()
)

validation_4
```

In this example, we first create a derived column `d_category` based on whether `d` is greater than `150`. Then, we segment our validation based on this derived column by using `segments="d_category"`.

## When to Use Segmentation

Segmentation is particularly useful when:

1. Data quality standards vary by group: different regions, product lines, or customer segments might have different acceptable thresholds
2. Identifying problem areas: segmentation helps pinpoint exactly where data quality issues exist, rather than just knowing that some issue exists somewhere in the data
3. Generating detailed reports: by segmenting, you get more granular reporting that can be shared with different stakeholders responsible for different parts of the data
4. Tracking improvements over time: segmented validations make it easier to see if data quality is improving in specific areas that were previously problematic

By using segmentation strategically in these scenarios, you can transform your data validation from a simple pass/fail system into a much more nuanced diagnostic tool that provides actionable insights about data quality across different dimensions. This targeted approach not only helps identify issues more precisely but also enables more effective communication of data quality metrics to relevant stakeholders.

## Segmentation vs. Multiple Validation Steps

So why use segmentation instead of just creating separate validation steps for each segment using filtering in the `pre=` argument? Well, segmentation offers several nice advantages:

1. Conciseness: you define your validation logic once, not repeatedly for each segment
2. Consistency: we can be certain that the same validation is applied uniformly across segments
3. Clarity: the validation report will clearly organize results by segment (with extra labeling)
4. Convenience: there's no need to manually extract and filter subsets of your data

Segmentation can end up simplifying your validation code while also providing more structured and informative reporting about different portions of your data.

## Practical Example: Validating Sales Data by Region and Product Type

Let's see a more realistic example where we validate sales data segmented by both region and product type:

```{python}
import pandas as pd
import numpy as np

# Create a sample sales dataset
np.random.seed(123)

# Create a simple sales dataset
sales_data = pd.DataFrame({
    "region": np.random.choice(["North", "South", "East", "West"], 100),
    "product_type": np.random.choice(["Electronics", "Clothing", "Food"], 100),
    "units_sold": np.random.randint(5, 100, 100),
    "revenue": np.random.uniform(100, 10000, 100),
    "cost": np.random.uniform(50, 5000, 100)
})

# Calculate profit
sales_data["profit"] = sales_data["revenue"] - sales_data["cost"]
sales_data["profit_margin"] = sales_data["profit"] / sales_data["revenue"]

# Preview the dataset
pb.preview(sales_data)
```

Now, let's validate that profit margins are above 20% across different regions and product types:

```{python}
validation_5 = (
    pb.Validate(
        data=sales_data,
        tbl_name="sales_data",
        label="Sales data validation by region and product"
    )
    .col_vals_gt(
        columns="profit_margin",
        value=0.2,
        segments=["region", "product_type"],
        brief="Profit margin > 20% check"
    )
    .interrogate()
)

validation_5
```

This validation gives us a detailed breakdown of profit margin performance across the different regions and product types, making it easy to identify areas that need attention.

## Best Practices for Segmentation

Effective data segmentation requires thoughtful planning about how to divide your data in ways that make sense for your validation needs. When implementing segmentation in your data validation workflow, consider these key principles:

1. Choose meaningful segments: select segmentation columns that align with your business logic and organizational structure
2. Use preprocessing when needed: if your raw data doesn't have good segmentation columns, create them through preprocessing (with the `pre=` argument)
3. Combine with actions: for critical segments, define segment-specific actions using the `actions=` parameter to respond to validation failures

By implementing these best practices, you'll create more targeted, maintainable, and actionable data validations. Segmentation becomes most powerful when it aligns with natural divisions in your data and analytical processes, allowing for more precise identification of quality issues while maintaining a unified validation framework.

## Conclusion

Data segmentation can make your validations more targeted and informative. By dividing your data into meaningful segments, you can identify quality issues with greater precision, apply appropriate validation standards to different parts of your data, and generate more actionable reports.

The `segments=` parameter transforms validation from a monolithic process into a granular assessment of data quality across various dimensions of your dataset. Whether you're dealing with regional differences, product categories, time periods, or any other meaningful divisions in your data, segmentation makes it possible to validate each portion according to its specific requirements while maintaining the simplicity of a unified validation framework.

### Thresholds

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

Thresholds are a key concept in Pointblank that allow you to define acceptable limits for failing validation tests. Rather than a simple pass/fail model, thresholds enable you to signal failure at different severity levels ('warning', 'error', and 'critical'), giving you fine-grained control over how data quality issues are reported and handled.

When used with actions (covered in the next section), thresholds create a robust system for responding to data quality issues based on their severity. This approach allows you to:

- set different tolerance levels for different types of validation checks
- escalate responses based on the severity of data quality issues
- configure different notification strategies for different threshold levels
- create a more nuanced data validation workflow than simple pass/fail tests

## A Simple Example

Let's start with a basic example that demonstrates how thresholds work in practice:

```{python}
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_not_null(
        columns="c",

        # Set thresholds for the validation step ---
        thresholds=pb.Thresholds(warning=1, error=0.2)
    )
    .interrogate()
)
```

In this example, we're validating that column `c` contains no Null values.
We've set:

- A 'warning' threshold of `1` (triggers when 1 or more values are Null)
- An 'error' threshold of `0.2` (triggers when 20% or more values are Null)

Looking at the results:

- the `FAIL` column shows that 2 test units have failed
- the `W` column (for 'warning') shows a filled gray circle, indicating the warning threshold has been exceeded
- the `E` column (for 'error') shows an open yellow circle, indicating the error threshold has not been exceeded
- the `C` column (for 'critical') shows a dash since we didn't set a critical threshold

## Types of Threshold Values

Thresholds in Pointblank can be specified in two different ways:

### Absolute Thresholds

Absolute thresholds are specified as integers and represent a fixed number of failing test units:

```python
# Warning threshold of exactly 5 failing test units
thresholds_absolute = pb.Thresholds(warning=5)
```

With this configuration, the 'warning' threshold would be triggered if 5 or more test units fail.

### Proportional Thresholds

Proportional thresholds are specified as decimals between 0 and 1, representing a percentage of the total test units:

```python
# Error threshold of 10% of test units failing
thresholds_proportional = pb.Thresholds(error=0.1)
```

With this configuration, the 'error' threshold would be triggered if 10% or more of the test units fail.

### Boolean Shorthand

For cases where you want even a single failing test unit to trigger a level, you can use `True` as a convenient shorthand:

```python
# Critical threshold reached with a single failing test unit
thresholds_boolean = pb.Thresholds(critical=True)
```

This is equivalent to setting `critical=1` and provides an intuitive way to express "any failure triggers this level". This shorthand is particularly useful for strict validations where any failure at all should trigger immediate attention.

## Understanding Severity Levels

The three threshold levels in Pointblank ('warning', 'error', and 'critical') are inspired by traditional logging levels used in software development. These names suggest a progression of severity:

- **'warning'** (level `30`): indicates potential issues that don't necessarily prevent normal operation
- **'error'** (level `40`): suggests more serious problems that might impact data quality
- **'critical'** (level `50`): represents the most severe issues that likely require immediate attention

These numerical values (`30`, `40`, `50`) are used internally by Pointblank when determining threshold hierarchy and can be accessed through the `{level_num}` field in action metadata (covered in the next **User Guide** article).

While these names imply certain severity levels, they're ultimately just convenient labels for different thresholds. You have complete flexibility in how you use them:

- you could use 'warning' for issues that should block a pipeline
- you might configure 'critical' for minor issues that just need documentation
- the 'error' level could trigger informational emails rather than actual error handling

The naming is primarily a suggestion to help organize your validation strategy. What matters most is how you configure actions for each threshold level to suit your specific data quality requirements.
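To make that flexibility concrete, here's a small sketch (the threshold values and message text are illustrative) that inverts the conventional mapping, treating 'warning' as the serious, high-bar level and 'critical' as a low-bar informational one:

```python
# 'critical' triggers at just 1% failing units but is used for FYI messages,
# while 'warning' triggers at 30% and is treated as the blocking condition
thresholds_inverted = pb.Thresholds(warning=0.3, critical=0.01)

actions_inverted = pb.Actions(
    warning="Blocking issue: step {step} exceeded 30% failing test units",
    critical="FYI: minor data quality issue noted in step {step}",
)
```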
## Threshold Behavior

It's important to understand a few key behaviors of thresholds:

- thresholds are **inclusive**: a value equal to or exceeding the threshold will trigger the associated level
- thresholds can be **mixed**: you can use absolute values for some levels and proportional for others
- threshold levels are **hierarchical**: 'critical' is more severe than 'error', which is more severe than 'warning'
- when a test fails, **all** applicable threshold levels are marked in the report (though actions may only execute for the highest level by default)

## Setting Global Thresholds

You can set thresholds globally for all validation steps in a workflow using the `thresholds=` parameter in `Validate`:

```{python}
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),

        # Setting thresholds for all validation steps ---
        thresholds=pb.Thresholds(warning=1, error=0.1)
    )
    .col_vals_not_null(columns="a")
    .col_vals_gt(columns="a", value=2)
    .interrogate()
)
```

With this approach, the same thresholds are applied to every validation step in the workflow.

## Overriding Thresholds for Specific Steps

You can override global thresholds for specific validation steps by providing the `thresholds=` parameter in individual validation methods:

```{python}
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),

        # Setting global thresholds ---
        thresholds=pb.Thresholds(warning=1, error=0.1)
    )
    .col_vals_not_null(columns="a")
    .col_vals_gt(
        columns="a",
        value=2,

        # Step-specific threshold that overrides global ---
        thresholds=pb.Thresholds(warning=3)
    )
    .interrogate()
)
```

In this example, the second validation step uses its own 'warning' threshold of `3`, overriding the global setting of `1`.

## Ways to Define Thresholds

Pointblank offers multiple ways to define thresholds to accommodate different coding styles and requirements.

### 1. Using the `Thresholds` Class (Recommended)

The most explicit and flexible approach is using the `Thresholds` class:

```{python}
# Set individual thresholds for different levels
thresholds_all_levels = pb.Thresholds(warning=0.05, error=0.1, critical=0.25)

# Set only specific levels
thresholds_error_only = pb.Thresholds(error=0.15)
```

This approach allows you to:

- set any combination of threshold levels
- use descriptive parameter names for clarity
- skip levels you don't need to set

### 2. Using a Tuple

For concise code, you can use a tuple where positions represent 'warning', 'error', and 'critical' levels in that order:

```{python}
# (warning, error, critical)
thresholds_tuple = (1, 0.1, 0.25)

# Shorter tuples are also allowed
thresholds_tuple_warning = (3,)            # Only the 'warning' threshold
thresholds_tuple_warning_error = (3, 0.2)  # Both 'warning' and 'error' thresholds
```

While concise, this approach requires you to start with the 'warning' level and add levels in order.

### 3. Using a Dictionary

You can also use a dictionary with keys that match the threshold level names:

```{python}
# Can use any combination of threshold levels
thresholds_dict = {"warning": 1, "critical": 0.15}
```

The dictionary must use the exact keys `"warning"`, `"error"`, and/or `"critical"`.

### 4. Using a Single Value

The simplest approach is using a single numeric value, which sets just the 'warning' threshold:

```{python}
# Sets 'warning' threshold to `5`
thresholds_single = 5
```

This is equivalent to `pb.Thresholds(warning=5)`.
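To recap, the forms above can express the same configuration; here's a quick sketch (with an illustrative value) of four equivalent ways to set a 'warning' threshold of one failing test unit:

```python
# Four equivalent 'warning' threshold specifications
t_class = pb.Thresholds(warning=1)   # explicit `Thresholds` class
t_tuple = (1,)                       # tuple: (warning, error, critical)
t_dict = {"warning": 1}              # dictionary with level-name keys
t_single = 1                         # single value sets 'warning' only
```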
## Thresholds and Validation Steps

Let's look at a more complete validation workflow that demonstrates different threshold configurations:

```{python}
# Create a validation workflow with global and step-specific thresholds
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),

        # Global thresholds applied to all steps unless overridden ---
        thresholds=pb.Thresholds(warning=0.05, error=0.1, critical=0.2)
    )
    # Step 1: Uses global thresholds ---
    .col_vals_not_null(columns="b")

    # Step 2: Overrides with step-specific thresholds ---
    .col_vals_gt(
        columns="a",
        value=2,
        thresholds=pb.Thresholds(warning=1, critical=0.3)  # No 'error' threshold
    )

    # Step 3: Uses a simplified tuple notation ---
    .col_vals_not_null(columns="c", thresholds=(2, 0.15))
    .interrogate()
)
```

## Thresholds and Actions

While thresholds by themselves provide visual indicators of validation severity in reports, their real power emerges when combined with Actions. The Actions system (covered in the next article) allows you to specify what happens when a threshold is exceeded.

For example, you might configure:

- A 'warning' threshold that logs a message
- An 'error' threshold that sends an email notification
- A 'critical' threshold that blocks a data pipeline

Here's a simple preview of how thresholds and actions work together:

```{python}
(
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),

        # Define thresholds for all three severity levels ---
        thresholds=pb.Thresholds(warning=1, error=2, critical=3),

        # Define actions for different threshold levels ---
        actions=pb.Actions(
            warning="Warning: {step} has {FAIL} failing values",
            error="ERROR: Step {step} exceeded the 'error' threshold",
            critical="CRITICAL: Data quality issue in column {col}"
        )
    )
    .col_vals_not_null(columns="c")
    .interrogate()
)
```

## Conclusion

Thresholds are a powerful feature that transforms Pointblank from a simple validation tool into a sophisticated data quality monitoring system. By setting appropriate thresholds, you can:

1. Define different severity levels for data quality issues
2. Customize tolerance levels for different types of validation checks
3. Create a more nuanced approach to data validation than binary pass/fail
4. Enable targeted actions based on the severity of issues detected

In the next article, we'll explore the Actions system in depth, showing you how to define automatic responses when thresholds are exceeded.

### Actions

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

Actions transform data validation from passive reporting to active response by automatically executing code when quality issues arise. They bridge the gap between detection and intervention, enabling immediate notifications and comprehensive logging when thresholds are exceeded. Whether you need simple console messages for interactive analysis or complex alerting for production pipelines, Actions provide the framework to make your validation workflows responsive.

For example, when validating revenue values, you can configure immediate alerts if failures exceed acceptable thresholds, ensuring data issues are addressed promptly rather than discovered later.

In this article, we'll explore how to use Actions to respond to threshold violations during data validation, and Final Actions to execute code after all validation steps are complete, giving you powerful tools to monitor, alert, and report on your data's quality.
## How Actions Work

Let's look at an example of how this works in practice. The following validation plan contains a single step (using [`Validate.col_vals_gt()`](`Validate.col_vals_gt`)) where the `thresholds=` and `actions=` parameters are set using `Thresholds` and `Actions` calls:

```{python}
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_gt(
        columns="c",
        value=2,
        thresholds=pb.Thresholds(warning=1, error=5),

        # Emit a console message when the warning threshold is exceeded ---
        actions=pb.Actions(warning="WARNING: failing test found.")
    )
    .interrogate()
)
```

The code uses `thresholds=pb.Thresholds(warning=1, error=5)` to set a 'warning' threshold of `1` and an 'error' threshold of `5` failing test units. The results part of the validation table shows that:

- The `FAIL` column shows that 3 test units have failed
- The `W` column (short for 'warning') shows a filled gray circle indicating it's reached its threshold level
- The `E` ('error') column shows an open yellow circle indicating it's below the threshold level

More importantly, the text `"WARNING: failing test found."` has been emitted. Here it appears above the validation table and that's because the action is executed eagerly during interrogation (before the report has even been generated).

So, an action is executed for a particular condition (e.g., 'warning') within a validation step if these three things are true:

1. there is a threshold set for that condition (either globally, or as part of that step)
2. there is an associated action set for the condition (again, either set globally or within the step)
3. during interrogation, the threshold value for the condition was exceeded by the number or proportion of failing test units

There is a lot of flexibility for setting both thresholds and actions and everything here is considered optional. Put another way, you can set various thresholds and various actions as needed and the interrogation phase will determine whether all the requirements are met for executing an action.

## Defining Actions

Actions can be defined in several ways, providing flexibility for different notification needs.

### Using String Messages

There are a few options in how to define the actions:

1. **String**: a message to be displayed in the console
2. **Callable**: a function to be called
3. **List of Strings/Callables**: for execution of multiple messages or functions

The actions are executed at interrogation time when the threshold level assigned to the action is exceeded by the number or proportion of failing test units. When providing a string, it will simply be printed to the console. A callable will also be executed at the time of interrogation. If providing a list of strings or callables, each item in the list will be executed in order. Such a list can contain a mix of strings and callables.

Displaying console messages may be a simple approach, but it is effective. And the strings don't have to be static: there are templating features that can be useful for constructing strings for a variety of situations. The following placeholders are available for use:

- `{type}`: The validation step type where the action is executed (e.g., `"col_vals_gt"`, etc.)
- `{level}`: The threshold level where the action is executed (`"warning"`, `"error"`, or `"critical"`)
- `{step}` or `{i}`: The step number in the validation workflow where the action is executed
- `{col}` or `{column}`: The column name where the action is executed
- `{val}` or `{value}`: An associated value for the validation method
- `{time}`: A datetime value for when the action was executed

Here's an example where we prepare a console message with a number of value placeholders (`action_str`) and use it globally at `Actions(critical=)`:

```{python}
action_str = "[{LEVEL}: {TYPE}]: Step {step} has failed validation. ({time})"

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),

        # Use `action_str` for any critical thresholds exceeded ---
        actions=pb.Actions(critical=action_str),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.10)
    .col_vals_ge(columns="session_duration", value=15)
    .interrogate()
)
```

What we get here are two messages in the console, corresponding to critical failures in steps 2 and 3. The placeholders were replaced with the correct text for the context. Note that some of the resulting text is capitalized (e.g., `"CRITICAL"`, `"COL_VALS_GT"`, etc.) and this is because we capitalized the placeholder text itself. Have a look at the `Actions` documentation article for more details on this.

### Using Callable Functions

Aside from strings, any callable can be used as an action value. Here's an example where we use a custom function as part of an action:

```{python}
def duration_issue():
    from datetime import datetime
    print(f"Data quality issue found ({datetime.now()}).")

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(
        columns="session_duration",
        value=15,

        # Use the `duration_issue()` function as an action for this step ---
        actions=pb.Actions(warning=duration_issue),
    )
    .interrogate()
)
```

In this case, the 'warning' action is set to call the user's `duration_issue()` function. This action is only executed when the 'warning' threshold is exceeded in step 3. Because all three thresholds are exceeded in that step, the 'warning' action of executing the function occurs (resulting in a message being printed to the console).

This is an example where actions can be defined locally for an individual validation step. The global threshold setting applied to all three validation steps but the step-level action only applied to step 3. You are free to mix and match both threshold and action settings at the global level (i.e., set in the `Validate` call) or at the step level. The key thing to be aware of is that step-level settings of thresholds and actions take precedence.

## Accessing Context in Actions

While string templates provide helpful placeholders to access information about validation steps, callable functions offer more flexibility through access to detailed metadata. When using functions as actions, you can retrieve comprehensive information about the validation context, allowing for complex logic and dynamic responses to validation issues.
### Using `get_action_metadata()`{.qd-no-link} in Callables

To access information about the validation step where an action was triggered, we can call `get_action_metadata()` in the body of a function to be used within `Actions`. This provides useful context about the validation step that triggered the action.

```{python}
def print_problem():
    m = pb.get_action_metadata()
    print(f"{m['level']} ({m['level_num']}) for Step {m['step']}: {m['failure_text']}")

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),

        # Use the `print_problem()` function as the action ---
        actions=pb.Actions(default=print_problem),
        brief=True,
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

In this example, we're creating a function called `print_problem()` that prints information about each validation step that fails. We then apply this function as the default action for all threshold levels using `actions=pb.Actions(default=print_problem)`. (Note that the `default=` and `highest_only=` parameters will be covered in more detail in following sections.)

We end up seeing two messages printed for failures in Steps 2 and 3. And though those steps had more than one threshold exceeded, only the most severe level in each yielded a console message (due to the default `highest_only=True` behavior).

By setting the action in `Validate(actions=)`, we applied it to all validation steps where thresholds are exceeded. This eliminates the need to set `actions=` at every validation step (though you can do this as a local override, even setting `actions=None` to disable globally set actions).

### Available Metadata Fields

The dictionary returned by `get_action_metadata()` contains the following fields:

- `step`: The step number.
- `column`: The column name.
- `value`: The value being compared (only available in certain validation steps).
- `type`: The assertion type (e.g., `"col_vals_gt"`, etc.).
- `time`: The time the validation step was executed (in ISO format).
- `level`: The severity level (`"warning"`, `"error"`, or `"critical"`).
- `level_num`: The severity level as a numeric value (`30`, `40`, or `50`).
- `autobrief`: A localized and brief statement of the expectation for the step.
- `failure_text`: Localized text that explains how the validation step failed.

## Customizing Action Behavior

The `Actions` class has two additional parameters that provide more control over how actions are executed:

### Setting Default Actions with `default=`

Instead of specifying actions separately for each threshold level, you can use the `default=` parameter to set a common action for all levels:

```{python}
def log_all_issues():
    m = pb.get_action_metadata()
    print(f"[{m['level'].upper()}] Validation failed in step {m['step']} with level {m['level']}")

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),

        # The `log_all_issues()` callable is set to every threshold ---
        actions=pb.Actions(default=log_all_issues),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

The `default=` parameter sets the same action for all threshold levels.
If you later specify an action for a specific level, it will override this default for that level only.

When using the `default=` parameter, be aware that your action (whether a string template or callable function) needs to work across all validation steps where thresholds might be exceeded. Not all validation methods provide the same context for string templates or in the metadata dictionary returned by `get_action_metadata()`. For example, some validation steps like [`Validate.col_vals_gt()`](`Validate.col_vals_gt`) provide a `value` field that can be accessed with `{value}` in string templates, while others like [`Validate.col_exists()`](`Validate.col_exists`) don't have this concept. When creating default actions, either use only the universally available placeholders (`{step}`, `{level}`, `{type}`, and `{time}`), or include conditional logic in your callable functions to handle different validation types appropriately.

### Controlling Action Execution with `highest_only=`

By default, Pointblank only executes the action for the most severe threshold level that's been exceeded. If you want actions for all exceeded thresholds to be executed, you can set `highest_only=False`:

```{python}
(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(
            warning="Warning threshold exceeded in step {step}",
            error="Error threshold exceeded in step {step}",
            critical="Critical threshold exceeded in step {step}",

            # Execute all applicable actions ---
            highest_only=False
        ),
    )
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

In this example, if all three thresholds are exceeded in a step, you'll see all three messages printed, rather than just the critical one.

The default behavior (`highest_only=True`) helps prevent notification fatigue by limiting the number of actions executed when multiple thresholds are exceeded in the same validation step. For example, if a validation step fails with 60% of rows not passing, it would exceed 'warning', 'error', and 'critical' thresholds simultaneously. With `highest_only=True`, only the critical action would execute.

You might want to set `highest_only=False` when:

- different threshold levels need to trigger different types of notifications (e.g., warnings to Slack, errors to email, critical to urgent notifications)
- you need comprehensive logging of all severity levels for audit purposes
- you're building a dashboard that displays counts of issues at each severity level

## Using Multiple Actions for a Threshold

You can specify multiple actions to be executed for a single threshold level by providing a list:

```{python}
def send_notification():
    print("📧 Notification sent to data team")

def log_to_system():
    print("📝 Issue logged in system")

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(critical=0.15),

        # Set multiple actions for the critical threshold exceedance ---
        actions=pb.Actions(
            critical=[
                "CRITICAL: Data validation failed",  # First action: display message
                send_notification,                   # Second action: call function
                log_to_system                        # Third action: call another function
            ]
        ),
    )
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

When providing a list of actions, they will be executed in sequence when the threshold is exceeded. This allows you to combine different types of actions such as displaying messages, sending notifications, and logging events.
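Combining the ideas above, per-level actions plus `highest_only=False` give you the routing pattern mentioned earlier (warnings to Slack, errors to email, critical issues to urgent notifications). Here's a minimal sketch, where the three notification functions are hypothetical stand-ins:

```python
def notify_slack():
    print("Posting warning to the team Slack channel...")  # hypothetical stand-in

def notify_email():
    print("Emailing the data owners...")  # hypothetical stand-in

def page_on_call():
    print("Paging the on-call engineer...")  # hypothetical stand-in

# Route each severity level to its own channel; with `highest_only=False`,
# every exceeded level fires its own notification
actions_routed = pb.Actions(
    warning=notify_slack,
    error=notify_email,
    critical=page_on_call,
    highest_only=False,
)
```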
## Final Actions

### Creating Final Actions

When you need to execute actions after all validation steps are complete, Pointblank provides the `FinalActions` class. Unlike `Actions`, which trigger on a per-step basis during the validation process, `FinalActions` executes after the entire validation is complete, giving you a way to respond to the overall validation results.

Here's how to use `FinalActions`:

```{python}
def send_alert():
    summary = pb.get_validation_summary()
    if summary["highest_severity"] == "critical":
        print(f"ALERT: Critical validation failures found in `{summary['tbl_name']}`")

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        tbl_name="game_revenue",
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),

        # Set final actions to be executed after all interrogations ---
        final_actions=pb.FinalActions(
            "Validation complete.",  # 1. a string message
            send_alert               # 2. a callable function
        )
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.10)
    .interrogate()
)
```

In this example:

- We define the function `send_alert()` that checks the validation summary for critical failures
- We provide a simple string message `"Validation complete."` that will print to the console
- Both actions will execute in order after all validation steps have completed

Because the 'critical' threshold was exceeded in Step 2, we see the alert printed by `send_alert()` after the simple string message.

`FinalActions` accepts any number of actions as positional arguments. Each argument can be:

1. **String**: A message to be displayed in the console
2. **Callable**: A function to be called with no arguments
3. **List of Strings/Callables**: Multiple actions to execute in sequence

All actions will be executed in the order they are provided after all validation steps have completed.

### Using `get_validation_summary()`{.qd-no-link} in Final Actions

When creating a callable function to use with `FinalActions`, you can access information about the overall validation results using the `get_validation_summary()` function.
This gives you a dictionary with comprehensive information about the validation:

```python
def comprehensive_report():
    summary = pb.get_validation_summary()
    print(f"Validation Report for {summary['tbl_name']}:")
    print(f"- Steps: {summary['n_steps']}")
    print(f"- Passing steps: {summary['n_passing_steps']}")
    print(f"- Failing steps: {summary['n_failing_steps']}")

    # Take additional actions based on results
    if summary["n_failing_steps"] > 0:

        # Create a Slack notification function ---
        notify = pb.send_slack_notification(
            webhook_url="https://hooks.slack.com/services/your/webhook/url",
            summary_msg="""
            🚨 *Validation Failure Alert*
            • Table: {tbl_name}
            • Failed Steps: {n_failing_steps} of {n_steps}
            • Highest Severity: {highest_severity}
            • Time: {time}
            """,
        )

        # Execute the notification function
        notify()

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        tbl_name="game_revenue",
        final_actions=pb.FinalActions(comprehensive_report),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .interrogate()
)
```

```{python}
# | echo: false
(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        tbl_name="game_revenue",
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .interrogate()
)
```

Here we used the `send_slack_notification()` function, which is available in Pointblank as a pre-built action. It can be used by itself in `final_actions=` but here it's integrated into the user's `comprehensive_report()` function to provide finer control with conditional logic.

### Combining Step-level and Final Actions

You can use both `Actions` and `FinalActions` together for comprehensive validation control:

```{python}
def log_step_failure():
    m = pb.get_action_metadata()
    print(f"Step {m['step']} failed with {m['level']}")

def generate_summary():
    summary = pb.get_validation_summary()

    # Sum up total failed test units across all steps
    total_failed = sum(summary["dict_n_failed"].values())

    # Sum up total test units across all steps
    total_units = sum(summary["dict_n"].values())

    print(f"Validation complete: {total_failed} failures out of {total_units} tests")

(
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10),

        # Set an action for each step (highest threshold exceeded) ---
        actions=pb.Actions(default=log_step_failure),

        # Set a final action to get a summary of the validation process ---
        final_actions=pb.FinalActions(generate_summary),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .interrogate()
)
```

This approach allows you to:

1. log individual step failures during the validation process using `Actions`
2. generate a comprehensive report after all validation steps are complete using `FinalActions`

Using both action types gives you fine-grained control over when and how notifications and other actions are triggered in your validation workflow.

## Conclusion

Actions provide a powerful mechanism for responding to data validation results in Pointblank.
By combining threshold settings with appropriate actions, you can create sophisticated data quality workflows that:

- provide immediate feedback through console messages
- execute custom functions when validation thresholds are exceeded
- customize notifications based on severity levels
- generate comprehensive reports after validation is complete
- automate responses to data quality issues

The flexible design of `Actions` and `FinalActions` allows you to start simple with basic console messages and gradually build up to complex validation workflows with conditional logic, custom reporting, and integrations with other systems like Slack, email, or logging services.

When designing your validation strategy, consider leveraging both step-level actions for immediate responses and final actions for holistic reporting. This combination provides comprehensive control over your data validation process and helps ensure that data quality issues are detected, reported, and addressed efficiently.

### Briefs

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

When validating data with Pointblank, it's often helpful to have descriptive labels for each validation step. This is where *briefs* come in. A brief is a short description of what a validation step is checking, and it appears in the `STEP` column of the validation report table. Briefs make your validation reports more readable and they help others understand what each step is verifying without needing to look at the code.

Briefs can be set in two ways:

1. Globally: applied to all validation steps via the `brief=` parameter in `Validate`
2. Locally: set for individual validation steps via the `brief=` parameter in each validation method

Understanding these two approaches to adding briefs gives you flexibility in how you document your validation process. Global briefs provide consistency across all steps and save time when you want similar descriptions throughout, while step-level briefs allow for precise customization when specific validations need more detailed or unique explanations. In practice, many validation workflows will combine both approaches (i.e., setting a useful global brief template while overriding it for steps that require special attention).

## Global Briefs

To set a global brief that applies to all validation steps, use the `Validate(brief=)` parameter when creating a `Validate` object:

```{python}
import pointblank as pb
import polars as pl

# Sample data
data = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "value": [10, 20, 30, 40, 50],
    "category": ["A", "B", "C", "A", "B"]
})

# Create a validation with a global brief
(
    pb.Validate(
        data=data,

        # Global brief template ---
        brief="Step {step}: {auto}"
    )
    .col_vals_gt(columns="value", value=5)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)
```

In this example, every validation step will have a brief description that follows the pattern `"Step X: [auto-generated description]"`. This is a simple example of template-based briefs. Later in this guide, we'll explore the full range of templating elements available for creating custom brief descriptions that precisely communicate what each validation step is checking.
## Step-level Briefs

You can also set briefs for individual validation steps:

```{python}
(
    pb.Validate(data=data)
    .col_vals_gt(
        columns="value",
        value=5,
        brief="Check if values exceed minimum threshold of 5"
    )
    .col_vals_in_set(
        columns="category",
        set=["A", "B", "C"],
        brief="Verify categories are valid"
    )
    .interrogate()
)
```

Local briefs override any global briefs that might be set.

## Brief Templating

Briefs support templating elements that get replaced with specific values:

- `{auto}`: an auto-generated description of the validation
- `{step}`: the step number in the validation plan
- `{col}`: the column name(s) being validated
- `{value}`: the comparison value used in the validation (when applicable)
- `{thresholds}`: a short summary of the threshold levels set (or unset) for the step
- `{segment}`, `{segment_column}`, `{segment_value}`: information on the step's segment

Here's how to use these templates:

```{python}
(
    pb.Validate(data=data)
    .col_vals_gt(
        columns="value",
        value=5,
        brief="Step {step}: Checking column '{col}' for values `> 5`"
    )
    .col_vals_in_set(
        columns="category",
        set=["A", "B", "C"],
        brief="{auto} **(Step {step})**"
    )
    .interrogate()
)
```

These template elements make briefs highly flexible and customizable. You can combine multiple templating elements in a single brief to create descriptive yet concise validation step descriptions. The templates help maintain consistency across your validation reports while providing enough detail to understand what each step is checking.

Note that not all templating elements will be relevant for every validation step. For instance, `{value}` is only applicable to validation functions that hold a comparison value like [`Validate.col_vals_gt()`](`Validate.col_vals_gt`). If you include a templating element that isn't relevant to a particular step, it will not be replaced with a corresponding value.

Briefs support the use of Markdown formatting, allowing you to add emphasis with **bold** or _italic_ text, include `inline code` formatting, or other Markdown elements to make your briefs more visually distinctive and informative. This can be especially helpful when you want certain parts of your briefs to stand out in the validation report.

## Automatic Briefs

If you want Pointblank to generate briefs for you automatically, you can set `brief=True`. Here, we'll make that setting at the global level (by using `Validate(brief=True)`):

```{python}
(
    pb.Validate(
        data=data,

        # Setting for automatically generated briefs ---
        brief=True
    )
    .col_vals_gt(columns="value", value=5)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)
```

Automatic briefs are descriptive and include information about what's being validated, including the column names and the validation conditions.

## Briefs Localized to a Specified Language

When using the `lang=` parameter in `Validate`, automatically generated briefs will be created in the specified language (along with other elements of the validation report table):

```{python}
(
    pb.Validate(
        data=data,

        # Setting the language as Spanish ---
        lang="es",

        # Automatically generate all briefs in Spanish
        brief=True
    )
    .col_vals_gt(columns="value", value=5)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)
```

When using the `lang=` parameter in combination with the `{auto}` templating element, the auto-generated portion of the brief will also be translated to the specified language.
This makes it possible to create fully localized validation reports where both custom text and auto-generated descriptions appear in the same language. Pointblank supports several languages for localized briefs, including French (`"fr"`), German (`"de"`), Spanish (`"es"`), Italian (`"it"`), and Portuguese (`"pt"`). For the complete list of supported languages, refer to the `Validate` documentation. ## Disabling Briefs If you've set a global brief but want to disable it for specific validation steps, you can set `brief=False`: ```{python} ( pb.Validate( data=data, # Global brief template --- brief="Step {step}: {auto}" ) .col_vals_gt(columns="value", value=5) # This step uses the global brief setting .col_vals_in_set( columns="category", set=["A", "B", "C"], # No brief for this step --- brief=False ) .interrogate() ) ``` ## Practical Example: Comprehensive Validation with Briefs In real-world data validation scenarios, you'll likely work with more complex datasets and apply various types of validation checks. This final example brings together many of the brief-generating techniques we've covered, showing how you can mix different approaches in a single validation workflow. ```{python} # Create a slightly larger dataset data_2 = pl.DataFrame({ "id": [1, 2, 3, 4, 5, 6, 7, 8], "value": [10, 20, 30, 40, 50, 60, 70, 80], "ratio": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], "category": ["A", "B", "C", "A", "B", "C", "A", "B"], "date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06", "2023-01-07", "2023-01-08"] }) ( pb.Validate(data=data_2) .col_vals_gt( columns="value", value=0, # Plaintext brief --- brief="All values must be positive." ) .col_vals_between( columns="ratio", left=0, right=1, # Template-based brief --- brief="**Step {step}**: Ratios should be between `0` and `1`." ) .col_vals_in_set( columns="category", set=["A", "B", "C"], # Automatically generated brief --- brief=True ) .interrogate() ) ``` The example above demonstrates: - plaintext briefs with direct messages - template-based briefs with Markdown formatting - automatically generated briefs (`brief=True`) By combining these different brief styles, you can create validation reports that are informative, consistent, and tailored to your specific data quality requirements. ## Best Practices for Using Briefs Well-crafted briefs can significantly enhance the readability and usefulness of your validation reports. Here are some guidelines to follow: 1. Be concise: briefs should be short and to the point; they're meant to quickly communicate the purpose of a validation step 2. Be specific: include relevant details or conditions that make the validation meaningful 3. Use templates consistently: if you're using template elements like `"{step}"` or `"{col}"`, try to use them consistently across all briefs for a cleaner look 4. Use auto-generated briefs as a starting point: you can start with `Validate(brief=True)` to see what Pointblank generates automatically, then customize as needed 5. Add custom briefs for complex validations: custom briefs are especially useful for complex validations where the purpose might not be immediately obvious from the code Following these best practices will help ensure your validation reports are easy to understand for everyone who needs to review them. ## Conclusion Briefs help make validation reports more readable and understandable. 
By using global briefs, step-level briefs, or a combination of both, you can create validation reports that clearly communicate what each validation step is checking. Whether you want automatically generated descriptions or precisely tailored custom messages, the brief system provides the flexibility to make your data validation work more transparent and easier to interpret for all stakeholders.

### Expression-Based Validation

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

While Pointblank offers many specialized validation functions for common data quality checks, sometimes you need more flexibility for complex validation requirements. This is where expression-based validation with [`Validate.col_vals_expr()`](`Validate.col_vals_expr`) comes in.

The [`Validate.col_vals_expr()`](`Validate.col_vals_expr`) method allows you to:

- combine multiple conditions in a single validation step
- access row-wise values across multiple columns

Now let's explore how to use these capabilities through a collection of examples!

## Basic Usage

At its core, [`Validate.col_vals_expr()`](`Validate.col_vals_expr`) validates whether an expression evaluates to `True` for each row in your data. Here's a simple example:

```{python}
import pointblank as pb
import polars as pl

# Load small_table dataset as a Polars DataFrame
small_table_pl = pb.load_dataset(dataset="small_table", tbl_type="polars")

(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        # Use Polars expression syntax ---
        expr=pl.col("d") > pl.col("a") * 50,
        brief="Column `d` should be more than 50 times larger than `a`."
    )
    .interrogate()
)
```

In this example, we're validating that for each row, the value in column `d` is more than 50 times larger than the value in column `a`.

## Notes on Expression Syntax

The expression syntax depends on your table type:

- **Polars**: uses Polars expression syntax with `pl.col("column_name")`
- **Pandas**: uses standard Python/NumPy syntax

The expression should:

- evaluate to a boolean result for each row
- reference columns using the appropriate syntax for your table type
- use standard operators (`+`, `-`, `*`, `/`, `>`, `<`, `==`, etc.)
- not include assignments

## Complex Expressions

The real power of [`Validate.col_vals_expr()`](`Validate.col_vals_expr`) comes with complex expressions that would be difficult to represent using the standard validation functions:

```{python}
# Load game_revenue dataset as a Polars DataFrame
game_revenue_pl = pb.load_dataset(dataset="game_revenue", tbl_type="polars")

(
    pb.Validate(data=game_revenue_pl)
    .col_vals_expr(
        # Use Polars expression syntax ---
        expr=(pl.col("session_duration") > 20) | (pl.col("item_revenue") > 10),
        brief="Sessions should be either long (>20 min) or high-value (>$10)."
    )
    .interrogate()
)
```

This validates that either the session duration is longer than 20 minutes OR the item revenue is greater than $10.

## Example: Multiple Conditions

You can create sophisticated validations with multiple conditions:

```{python}
# Create a simple Polars DataFrame
employee_df = pl.DataFrame({
    "age": [25, 30, 15, 40, 35],
    "income": [50000, 75000, 0, 100000, 60000],
    "years_experience": [3, 8, 0, 15, 7]
})

(
    pb.Validate(data=employee_df, tbl_name="employee_data")
    .col_vals_expr(
        # Complex condition with multiple comparisons ---
        expr=(
            (pl.col("age") >= 18) &
            (pl.col("income") / (pl.col("years_experience") + 1) <= 25000)
        ),
        brief="Adults should have reasonable income-to-experience ratios."
    )
    .interrogate()
)
```

## Example: Handling Null Values

When working with expressions, consider how to handle null/missing values:

```{python}
(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        # Check for nulls before division ---
        expr=(pl.col("c").is_not_null()) & ((pl.col("c") / pl.col("a")) > 1.5),
        brief="Ratio of `c`/`a` should exceed 1.5 (when `c` is not null)."
    )
    .interrogate()
)
```

## Best Practices

Here are some tips and tricks for effectively using expression-based validation with [`Validate.col_vals_expr()`](`Validate.col_vals_expr`).

### Document Your Expressions

Always provide clear documentation in the `brief=` parameter:

```{python}
(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        expr=pl.col("d") > pl.col("a") * 1.5,
        # Document which columns are being compared ---
        brief="Column `d` should be more than 1.5 times larger than column `a`."
    )
    .interrogate()
)
```

### Handle Edge Cases

Consider potential edge cases like division by zero or nulls:

```{python}
(
    pb.Validate(data=small_table_pl)
    .col_vals_expr(
        # Check denominator before division ---
        expr=(pl.col("a") != 0) & (pl.col("d") / pl.col("a") > 1.5),
        brief="Ratio of `d`/`a` should exceed 1.5 (avoiding division by zero)."
    )
    .interrogate()
)
```

### Test on Small Datasets First

When developing complex expressions, test on a small sample of your data first to ensure your logic is correct before applying it to large datasets.

## Conclusion

The [`Validate.col_vals_expr()`](`Validate.col_vals_expr`) method provides a powerful way to implement complex validation logic in Pointblank when standard validation methods aren't sufficient. By leveraging expressions, you can create sophisticated data quality checks tailored to your specific requirements, combining conditions across multiple columns and applying transformations as needed.

This flexibility makes expression-based validation an essential tool for addressing complex data quality scenarios in your validation workflows.

### Schema Validation

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

Schema validation in Pointblank allows you to verify that your data conforms to an expected structure and type specification. This is particularly useful when ensuring data consistency across systems or validating incoming data against predefined requirements.

Let's first look at the dataset we'll use for the first example:

```{python}
import pointblank as pb

# Preview the small_table dataset we'll use throughout this guide
pb.preview(pb.load_dataset(dataset="small_table", tbl_type="polars"))
```

## Schema Definition and Validation

A schema in Pointblank is created using the `Schema` class, which defines the expected structure of a table. Once created, you apply schema validation through the [`Validate.col_schema_match()`](`Validate.col_schema_match`) validation step.

```{python}
# Create a schema definition matching small_table structure
schema = pb.Schema(
    columns=[
        ("date_time",),      # Only check column name
        ("date",),           # Only check column name
        ("a", "Int64"),      # Check name and type
        ("b", "String"),     # Check name and type
        ("c", "Int64"),      # Check name and type
        ("d", "Float64"),    # Check name and type
        ("e", "Boolean"),    # Check name and type
        ("f",),              # Only check column name
    ]
)

# Validate the small_table against the schema
small_table_validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        label="Schema validation of `small_table`.",
    )
    .col_schema_match(schema=schema)
    .interrogate()
)

small_table_validation
```

The output shows the validation passed successfully. When all columns have the correct names and types as specified in the schema, the validation passes with a single passing test unit. If there were discrepancies, this would fail, but the basic output wouldn't show specific issues. For detailed information about validation results, use [`Validate.get_step_report()`](`Validate.get_step_report`):

```{python}
small_table_validation.get_step_report(i=1)
```

The step report provides specific details about which columns were checked and whether they matched the schema, helping diagnose issues when validation fails.

## Schema Components and Column Types

When defining a schema, you need to specify column names and optionally their data types. By default, Pointblank enforces strict validation where:

- all columns in your table must match the specified schema
- column order must match the schema
- column types are case-sensitive
- type names must match exactly

The schema definition accepts column types as string representations, which vary depending on your data source:

- `string`: Character data (may also be `"String"`, `"varchar"`, `"character"`, etc.)
- `integer`: Integer values (may also be `"Int64"`, `"int"`, `"bigint"`, etc.)
- `numeric`: Numeric values including integers and floating-point numbers (may also be `"Float64"`, `"double"`, `"decimal"`, etc.)
- `boolean`: Logical values (`True`/`False`) (may also be `"Boolean"`, `"bool"`, etc.)
- `datetime`: Date and time values (may also be `"Datetime"`, `"timestamp"`, etc.)
- `date`: Date values (may also be `"Date"`, etc.)
- `time`: Time values

For specific database engines or DataFrame libraries, you may need to use their exact type names (like `"VARCHAR(255)"` for SQL databases or `"Int64"` for Polars integers).

## Discovering Column Types

To easily determine the correct type string for columns in your data, Pointblank provides two helpful functions:

```{python}
import polars as pl
from datetime import date

# Define a sample dataframe
sample_df = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "join_date": [date(2020, 1, 1), date(2021, 3, 15), date(2022, 7, 10)]
})
```

```{python}
# Method 1: Using `preview()`, which displays column types alongside column names
pb.preview(sample_df)
```

```{python}
# Method 2: Using `col_summary_tbl()` which shows column types and other details
pb.col_summary_tbl(sample_df)
```

These functions help you identify the exact type strings to use in your schema definitions, eliminating guesswork and ensuring compatibility with your data source.

## Creating a Schema

You can create a schema in four different ways, each with its own advantages. All schema objects can be printed to display their column names and data types.

### 1. Using a List of Tuples with `columns=`

This approach allows for mixed validation: some columns checked for both name and type, others only for name:

```{python}
schema_tuples = pb.Schema(
    # List of tuples approach: flexible for mixed type/name checking ---
    columns=[
        ("name", "String"),  # Check name and type
        ("age", "Int64"),    # Check name and type
        ("height",)          # Check name only
    ]
)

print(schema_tuples)
```

This is the only method that allows checking just column names for some columns while checking both names and types for others.

### 2. Using a Dictionary with `columns=`

This approach is often the most readable when defining a schema manually, especially for larger schemas:

```{python}
schema_dict = pb.Schema(
    # Dictionary approach (keys are column names, values are data types) ---
    columns={
        "name": "String",
        "age": "Int64",
        "height": "Float64",
        "created_at": "Datetime"
    }
)

print(schema_dict)
```

With this method, you must always provide both column names (as keys) and their types (as values).

### 3. Using Keyword Arguments

For more readable code with a small number of columns:

```{python}
schema_kwargs = pb.Schema(
    # Keyword arguments approach (more readable for simple schemas) ---
    name="String",
    age="Int64",
    height="Float64"
)

print(schema_kwargs)
```

Like the dictionary method, this approach requires both column names and types.

### 4. Extracting from an Existing Table with `tbl=`

You can automatically extract a schema from an existing table:

```{python}
import polars as pl

# Create a sample dataframe
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "height": [5.6, 6.0, 5.8]
})

# Extract schema from table
schema_from_table = pb.Schema(tbl=df)

print(schema_from_table)
```

This is especially useful when you want to validate that future data matches the structure of a reference dataset.

## Multiple Data Types for a Column

You can specify multiple acceptable types for a column by providing a list of types:

```{python}
# Schema with multiple possible types for a column
schema_multi_types = pb.Schema(
    columns={
        "name": "String",
        "age": ["Int64", "Float64"],  # Accept either integer or float
        "active": "Boolean"
    }
)

print(schema_multi_types)
```

This is useful when working with data sources that might represent the same information in different ways (e.g., integers sometimes stored as floats).

## Schema Validation Options

When using `col_schema_match()`, you can customize validation behavior with several important options:

| Option | Default | Description |
|--------|---------|-------------|
| `complete` | `True` | Require exact column presence (no extra columns allowed) |
| `in_order` | `True` | Enforce column order |
| `case_sensitive_colnames` | `True` | Make column name matching case-sensitive |
| `case_sensitive_dtypes` | `True` | Make data type matching case-sensitive |
| `full_match_dtypes` | `True` | Require exact (not partial) type name matches |

### Controlling Column Presence

By default, [`Validate.col_schema_match()`](`Validate.col_schema_match`) requires a complete match between the schema's columns and the table's columns.
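To see the strict default in action, here's a minimal sketch (with a small hypothetical table) where one extra column not listed in the schema causes the schema step to fail:

```python
import polars as pl
import pointblank as pb

tiny_tbl = pl.DataFrame({
    "id": [1, 2],
    "name": ["A", "B"],
    "note": ["x", "y"]  # Extra column, not in the schema below
})

# With the default `complete=True`, the unexpected `note` column
# makes the schema validation step fail
(
    pb.Validate(data=tiny_tbl)
    .col_schema_match(schema=pb.Schema(columns={"id": "Int64", "name": "String"}))
    .interrogate()
)
```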
You can make this more flexible:

```{python}
# Create a sample table
users_table_extra = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "extra_col": ["a", "b", "c"]  # Extra column not in schema
})

# Create a schema
schema = pb.Schema(
    columns={"id": "Int64", "name": "String", "age": "Int64"}
)

# Validate without requiring all columns to be present
validation = (
    pb.Validate(data=users_table_extra)
    .col_schema_match(
        schema=schema,
        # Allow schema columns to be a subset ---
        complete=False
    )
    .interrogate()
)

validation.get_step_report(i=1)
```

### Column Order Enforcement

You can control whether column order matters in your validation:

```{python}
# Create a sample table
users_table = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})

# Create a schema
schema = pb.Schema(
    columns={"name": "String", "age": "Int64", "id": "Int64"}
)

# Validate without enforcing column order
validation = (
    pb.Validate(data=users_table)
    .col_schema_match(
        schema=schema,
        # Don't enforce column order ---
        in_order=False
    )
    .interrogate()
)

validation.get_step_report(i=1)
```

### Case Sensitivity

Control whether column names and data types are case-sensitive:

```{python}
# Create schema with different case characteristics
case_schema = pb.Schema(
    columns={"ID": "int64", "NAME": "string", "AGE": "int64"}
)

# Create validation with case-insensitive column names and types
validation = (
    pb.Validate(data=users_table)
    .col_schema_match(
        schema=case_schema,
        # Ignore case in column names ---
        case_sensitive_colnames=False,
        # Ignore case in data type names ---
        case_sensitive_dtypes=False
    )
    .interrogate()
)

validation.get_step_report(i=1)
```

### Type Matching Precision

Control how strictly data types must match:

```{python}
# Create schema with simplified type names
type_schema = pb.Schema(
    # Using simplified type names ---
    columns={"id": "int", "name": "str", "age": "int"}
)

# Allow partial type matches
validation = (
    pb.Validate(data=users_table)
    .col_schema_match(
        schema=type_schema,
        # Ignore case in data type names ---
        case_sensitive_dtypes=False,
        # Allow partial type name matches ---
        full_match_dtypes=False
    )
    .interrogate()
)

validation.get_step_report(i=1)
```

## Common Schema Validation Patterns

This section explores common patterns for applying schema validation to different scenarios. Each pattern addresses specific validation needs you might encounter when working with real-world data.

We'll examine the step reports ([`Validate.get_step_report()`](`Validate.get_step_report`)) for these validations since they provide more detailed information about what was checked and how the validation performed, offering an intuitive way to understand the results beyond simple pass/fail indicators.

### Structural Validation Only

When you only care about column names but not their types:

```{python}
# Create a schema with only column names
structure_schema = pb.Schema(
    columns=["id", "name", "age", "extra_col"]
)

# Validate structure only
validation = (
    pb.Validate(data=users_table_extra)
    .col_schema_match(schema=structure_schema)
    .interrogate()
)

validation.get_step_report(i=1)
```

### Mixed Validation

Validate types for critical columns but just presence for others:

```{python}
# Mixed validation for different columns
mixed_schema = pb.Schema(
    columns=[
        ("id", "Int64"),      # Check name and type
        ("name", "String"),   # Check name and type
        ("age",),             # Check name only
        ("extra_col",)        # Check name only
    ]
)

# Validate with mixed approach
validation = (
    pb.Validate(data=users_table_extra)
    .col_schema_match(schema=mixed_schema)
    .interrogate()
)

validation.get_step_report(i=1)
```

### Progressive Schema Evolution

As your data evolves, you might need to adapt your validation approach:

```{python}
# Original schema
original_schema = pb.Schema(
    columns={
        "id": "Int64",
        "name": "String"
    }
)

# New data with additional columns
evolved_data = pl.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],            # New column
    "active": [True, False, True]   # New column
})

# Validate with flexible parameters
validation = (
    pb.Validate(evolved_data)
    .col_schema_match(
        schema=original_schema,
        # Allow extra columns ---
        complete=False,
        # Don't enforce order ---
        in_order=False
    )
    .interrogate()
)

validation.get_step_report(i=1)
```

## Integrating with Larger Validation Workflows

Schema validation is often just one part of a comprehensive data validation strategy. You can combine schema checks with other validation steps:

```{python}
# Define a schema
schema = pb.Schema(
    columns={
        "id": "Int64",
        "name": "String",
        "age": "Int64"
    }
)

# Create a validation plan
validation = (
    pb.Validate(
        users_table,
        label="User data validation",
        thresholds=pb.Thresholds(warning=0.05, error=0.1)
    )
    # Add schema validation ---
    .col_schema_match(schema=schema)
    # Add other validation steps ---
    .col_vals_not_null(columns="id")
    .col_vals_gt(columns="age", value=26)
    .interrogate()
)

validation
```

This approach allows you to first validate the structure of your data and then check specific business rules or constraints.

## Best Practices

1. Define schemas early: document and define expected data structures early in your data workflow.
2. Choose the right creation method:
   - use a dictionary with `columns=` for readability with many columns
   - use a list of tuples with `columns=` for mixed name/type validation
   - use keyword arguments for small schemas with simple column names
   - use `tbl=` to extract schemas from reference datasets
3. Be deliberate about strictness: choose validation parameters based on your specific needs:
   - strict validation (`complete=True`) for critical data interfaces
   - flexible validation (`complete=False`, `in_order=False`) for evolving datasets
4. Reuse schemas: create schema definitions that can be reused across multiple validation contexts.
5. Version control schemas: as your data evolves, maintain versions of your schemas to track changes.
6. Extract schemas from reference data: when you have a 'golden' dataset that represents your ideal structure, use `Schema(tbl=reference_data)` to extract its schema.
7. Consider type flexibility: use multiple types per column (`["Int64", "Float64"]`) when working with data from diverse sources.
8. Combine with targeted validation: use schema validation for structural checks and add specific validation steps for business rules.

## Conclusion

Schema validation provides a powerful mechanism for ensuring your data adheres to expected structural requirements. It serves as an excellent first line of defense in your data validation strategy, verifying that the data you're working with has the expected shape before applying more detailed business rule validations.

The `Schema` class offers multiple ways to define schemas, from manual specification with dictionaries or keyword arguments to automatic extraction from reference tables. When combined with the flexible options of [`Validate.col_schema_match()`](`Validate.col_schema_match`), you can implement validation approaches ranging from strict structural enforcement to more flexible evolution-friendly checks.

By understanding the different schema creation methods and validation options, you can efficiently validate the structure of your data tables and ensure they meet your requirements before processing.

### Assertions

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

In addition to validation steps that create reports, Pointblank provides **assertions**: a lightweight way to confirm data quality by raising exceptions when validation conditions aren't met. Assertions are particularly useful in:

- data processing pipelines where you need to halt execution if data doesn't meet expectations
- testing environments where you want to verify data properties programmatically
- scripts and functions where you need immediate notification of data problems

## Basic Assertion Workflow

The assertion workflow pairs the familiar validation steps with assertion methods that check whether the validations meet your requirements:

```{python}
import pointblank as pb
import polars as pl

# Create sample data
sample_data = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "value": [10.5, 8.3, -2.1, 15.7, 7.2]
})

# Create a validation plan and assert that all steps pass
(
    pb.Validate(data=sample_data)
    .col_vals_gt(columns="id", value=0, brief="IDs must be positive")
    .col_vals_gt(columns="value", value=-5, brief="Values should exceed -5")
    # Will automatically `interrogate()` and raise an AssertionError if any validation fails ---
    .assert_passing()
)
```

This simple pattern allows you to integrate data quality checks into your data pipelines. With it, you can create clear stopping points when data doesn't meet specified criteria.

## Assertion Methods

Pointblank offers two types of assertions:

1. Full Passing Assertions: using [`Validate.assert_passing()`](`Validate.assert_passing`) to verify that every single test unit passes
2. Threshold-Based Assertions: using [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) to verify that failure rates stay within acceptable thresholds

### `assert_passing()`

The [`Validate.assert_passing()`](`Validate.assert_passing`) method is the strictest form of assertion, requiring every single validation test unit to pass:

```{python}
try:
    (
        pb.Validate(data=sample_data)
        .col_vals_gt(columns="value", value=0)
        # Direct assertion: automatically interrogates ---
        .assert_passing()
    )
except AssertionError as e:
    print("AssertionError:", str(e))
```

### `assert_below_threshold()`

The [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) method is more flexible as it allows some failures as long as they stay below specified threshold levels.
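In its simplest form, this looks like the following sketch (reusing `sample_data` from above, where 1 of 5 values fails the check; the threshold values here are chosen just for illustration):

```python
(
    pb.Validate(data=sample_data, thresholds=(0.25, 0.5))  # warning=25%, error=50%
    .col_vals_gt(columns="value", value=0)                 # 1/5 test units fail (20%)
    .assert_below_threshold(level="error")                 # passes: 20% is below 50%
)
```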
Pointblank uses three severity thresholds that increase in order of seriousness:

- **'warning'** (least severe): the first threshold that gets triggered when failures exceed this level
- **'error'** (more severe): the middle threshold indicating more serious data quality issues
- **'critical'** (most severe): the highest threshold indicating critical data quality problems

```{python}
# Create a two-column DataFrame for this example
tbl_pl = pl.DataFrame({
    "a": [4, 6, 9, 7, 12, 8, 7, 12, 10, 7],
    "b": [9, 8, 10, 5, 10, 9, 14, 6, 6, 8],
})

# Set thresholds: warning=0.2 (20%), error=0.3 (30%), critical=0.4 (40%)
validation = (
    pb.Validate(data=tbl_pl, thresholds=(0.2, 0.3, 0.4))
    .col_vals_gt(columns="b", value=5)   # 1/10 failing (10% failure rate)
    .col_vals_lt(columns="a", value=11)  # 2/10 failing (20% failure rate)
    .col_vals_ge(columns="b", value=8)   # 3/10 failing (30% failure rate)
    .interrogate()
)

validation
```

The validation report above visually indicates threshold levels with colored circles:

- gray circles in the `W` column indicate the 'warning' threshold
- yellow circles in the `E` column indicate the 'error' threshold
- red circles in the `C` column indicate the 'critical' threshold

This won't pass the [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) assertion for the 'error' level because step 3 reaches that threshold (its 30% failure rate meets the 30% error level):

```{python}
try:
    validation.assert_below_threshold(level="error")
except AssertionError as e:
    print("AssertionError:", str(e))
```

We can check against the 'error' threshold for specific steps with the `i=` parameter:

```{python}
validation.assert_below_threshold(level="error", i=[1, 2])
```

This passes because the highest threshold reached in steps 1 and 2 is 'warning'.

The [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) method takes these parameters:

- `level=`: threshold level to check against (`"warning"`, `"error"`, or `"critical"`)
- `i=`: optional specific step number(s) to check
- `message=`: optional custom error message

This is particularly useful when:

- working with real-world data where some percentage of failures is acceptable
- implementing different severity levels for data quality rules
- gradually improving data quality with stepped thresholds

::: {.callout-note}
Assertion methods like [`Validate.assert_passing()`](`Validate.assert_passing`) and [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) will automatically call [`Validate.interrogate()`](`Validate.interrogate`) if needed, so you don't have to explicitly include this step when using assertions directly.
:::

## Using Status Check Methods

In addition to assertion methods that raise exceptions, Pointblank provides status check methods that return boolean values:

### `all_passed()`

The [`Validate.all_passed()`](`Validate.all_passed`) method will return `True` only if every single test unit in every validation step passed:

```{python}
validation = (
    pb.Validate(data=sample_data)
    .col_vals_gt(columns="value", value=0)
    .interrogate()
)

if not validation.all_passed():
    print("Validation failed: some values are not positive")
```

### `warning()`, `error()`, and `critical()`

The methods [`Validate.warning()`](`Validate.warning`), [`Validate.error()`](`Validate.error`), and [`Validate.critical()`](`Validate.critical`) all return information about whether validation steps exceeded that specific threshold level. While assertion methods raise exceptions to halt execution when thresholds are exceeded, these status methods give you fine-grained control to implement custom logic based on different validation quality levels.

```{python}
validation = (
    pb.Validate(data=sample_data, thresholds=(0.05, 0.10, 0.20))
    .col_vals_gt(columns="value", value=0)  # Some values are negative
    .interrogate()
)

validation
```

The [`Validate.warning()`](`Validate.warning`) method returns a dictionary mapping step numbers to boolean values. A `True` value means that step exceeds the warning threshold:

```{python}
# Get dictionary of warning status for each step
warning_status = validation.warning()

print(f"Warning status: {warning_status}")

# {1: True} means step 1 exceeds warning threshold
```

You can check a specific step using the `i=` parameter, and get a single boolean with `scalar=True`:

```{python}
# Check error threshold for specific step
has_errors = validation.error(i=1, scalar=True)

if has_errors:
    print("Step 1 exceeded the error threshold.")
```

Similarly, we can check if any steps exceed the 'critical' threshold:

```{python}
# Check against critical threshold
critical_status = validation.critical()

print(f"Critical status: {critical_status}")
```

These methods are particularly useful for:

1. Conditional logic: taking different actions based on threshold severity
2. Reporting: generating summary reports about validation quality
3. Monitoring: tracking data quality trends over time
4. Graceful degradation: implementing fallback logic when quality decreases

Each method has these options:

- without parameters: returns a dictionary mapping step numbers to boolean status values
- with `i=`: check specific step(s)
- with `scalar=True`: return a single boolean instead of a dictionary (when checking a specific step)

## Customizing Error Messages

You can provide custom error messages when assertions fail to make them more meaningful in your specific workflow context:

```{python}
# Create a validation with potential failures
validation = (
    pb.Validate(data=sample_data, thresholds=(0.2, 0.3, 0.4))
    .col_vals_gt(columns="value", value=0)
    .interrogate()
)

# Display the validation results
validation
```

When you need to customize the error message that appears when an assertion fails, use the `message=` parameter:

```{python}
try:
    # Custom message for threshold assertion
    validation.assert_below_threshold(
        level="warning",
        message="Data quality too low for processing!"
    )
except AssertionError as e:
    print(f"Custom handling of failure: {e}")
```

Descriptive error messages are essential in production systems where multiple team members might need to interpret validation failures. The custom message lets you provide context appropriate to your specific workflow or data pipeline stage.

## Combining Assertions with Actions

Actions and assertions serve complementary but distinct purposes in data validation workflows:

- Actions trigger during validation but shouldn't raise errors (as this would halt report generation)
- Assertions are designed to raise errors based on specific conditions, making them ideal for flow control after validation completes

Here's a simplified example showing how to use them together.
The print statements simulate logging or monitoring that would be valuable in production data pipelines: ```{python} # Define a simple action function (won't raise errors) def notify_quality_issue(message="Data quality issue detected"): print(f"ACTION TRIGGERED: {message}") # Create data with known failures problem_data = pl.DataFrame({ "id": [1, 2, 3, -4, 5], # One negative ID "value": [10.5, 8.3, -2.1, 15.7, 7.2] # One negative value }) # First use actions for automated responses during validation print("Running validation with actions...") validation = ( pb.Validate(data=problem_data, thresholds=(0.1, 0.2, 0.3)) .col_vals_gt( columns="id", value=0, brief="IDs must be positive", actions=pb.Actions(warning=notify_quality_issue) ) .interrogate() # Actions trigger here but won't stop report generation ) # Then use assertions after validation for workflow control print("\nNow using assertion for flow control...") try: validation.assert_below_threshold(level="warning") print("This line won't execute if the assertion fails") except AssertionError as e: print(f"Validation failed threshold check: {e}") print("Implementing fallback process...") ``` This approach gives you the best of both worlds: - Actions provide immediate notification during validation without interrupting the process - Assertions control workflow execution after validation when important thresholds are exceeded This pattern works well in data pipelines where you want both: (1) automated responses during validation and (2) clear decision points after validation is complete. ## Best Practices for Assertions When using assertions in your data workflows, consider these best practices: 1. **Choose the right assertion type**: - use [`Validate.assert_passing()`](`Validate.assert_passing`) for critical validations where any failure is unacceptable - use [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) for validations where some failure rate is acceptable 2. **Set appropriate thresholds** that match your data quality requirements: ```python # Example threshold strategy validation = pb.Validate( data=sample_data, # warning at 1%, error at 5%, critical at 10% thresholds=pb.Thresholds(warning=0.01, error=0.05, critical=0.10) ) ``` 3. **Use a graduated approach** to validation severity: ```python # Critical validations: must be perfect validation_1.assert_passing() # Important validations: must be below error threshold validation_2.assert_below_threshold(level="error") # Monitor-only validations: check warning status warning_status = validation_3.warning() ``` 4. **Placement in pipelines**: place assertions at critical points where data quality is essential 5. **Error handling**: wrap assertions in try-except blocks for better error handling in production systems 6. 
**Combine with reporting**: use both assertions and reporting approaches for comprehensive quality control

## Conclusion

Pointblank's assertion methods give you flexible options for enforcing data quality requirements:

- [`Validate.assert_passing()`](`Validate.assert_passing`) for strict validation where every test unit must pass
- [`Validate.assert_below_threshold()`](`Validate.assert_below_threshold`) for more flexible validation where some failures are tolerable
- Status methods ([`Validate.warning()`](`Validate.warning`), [`Validate.error()`](`Validate.error`), and [`Validate.critical()`](`Validate.critical`)) for programmatic threshold checking

By using these assertion methods appropriately, you can build robust data pipelines with different levels of quality enforcement (from strict validation of critical data properties to more lenient checks for less critical aspects). This graduated approach to data quality helps create systems that are both reliable and practical in real-world data environments.

### Draft Validation

Draft validation in Pointblank leverages large language models (LLMs) to automatically generate validation plans for your data. This feature is especially useful when starting validation on a new dataset or when you need to quickly establish baseline validation coverage. The `DraftValidation` class connects to various LLM providers to analyze your data's characteristics and generate a complete validation plan tailored to its structure and content.

## How `DraftValidation`{.qd-no-link} Works

When you use `DraftValidation`, the process works through these steps:

1. a statistical summary of your data is generated using the `DataScan` class
2. this summary is converted to JSON format and sent to your selected LLM provider
3. the LLM uses the summary along with knowledge about Pointblank's validation capabilities to generate a validation plan
4. the result is returned as executable Python code that you can use directly or modify as needed

The process never sends your full dataset to the LLM provider; only a summary is transmitted, which includes column names, data types, basic statistics, and a small sample of values.

## Requirements and Setup

To use the `DraftValidation` feature, you'll need:

1. an API key from a supported LLM provider
2. the required Python packages installed

You can install all necessary dependencies with:

```bash
pip install pointblank[generate]
```

This will install the `chatlas` package and other dependencies required for `DraftValidation`.

### Supported LLM Providers

The `DraftValidation` class supports multiple LLM providers:

- **Anthropic** (Claude models)
- **OpenAI** (GPT models)
- **Ollama** (local LLMs)
- **Amazon Bedrock** (AWS-hosted models)

Each provider has different capabilities and performance characteristics, but all can be used to generate validation plans through a consistent interface.

## Basic Usage

The simplest way to use `DraftValidation` is to provide your data and specify an LLM model. Let's try it out with the `global_sales` dataset.

```python
import pointblank as pb

# Load a dataset
data = pb.load_dataset(dataset="global_sales", tbl_type="polars")

# Generate a validation plan
pb.DraftValidation(
    data=data,
    model="anthropic:claude-sonnet-4-5",
    api_key="your_api_key_here"  # Replace with your actual API key
)
```

````plaintext
```python
import pointblank as pb

# Define schema based on column names and dtypes
schema = pb.Schema(columns=[
    ("product_id", "String"),
    ("product_category", "String"),
    ("customer_id", "String"),
    ("customer_segment", "String"),
    ("region", "String"),
    ("country", "String"),
    ("city", "String"),
    ("timestamp", "Datetime(time_unit='us', time_zone=None)"),
    ("quarter", "String"),
    ("month", "Int64"),
    ("year", "Int64"),
    ("price", "Float64"),
    ("quantity", "Int64"),
    ("status", "String"),
    ("email", "String"),
    ("revenue", "Float64"),
    ("tax", "Float64"),
    ("total", "Float64"),
    ("payment_method", "String"),
    ("sales_channel", "String")
])

# The validation plan
validation = (
    pb.Validate(
        data=your_data,  # Replace your_data with the actual data variable
        label="Draft Validation",
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
    )
    .col_schema_match(schema=schema)
    .col_vals_not_null(columns=[
        "product_category", "customer_segment", "region", "country",
        "price", "quantity", "status", "email", "revenue", "tax",
        "total", "payment_method", "sales_channel"
    ])
    .col_vals_between(columns="month", left=1, right=12, na_pass=True)
    .col_vals_between(columns="year", left=2021, right=2023, na_pass=True)
    .col_vals_gt(columns="price", value=0)
    .col_vals_gt(columns="quantity", value=0)
    .col_vals_gt(columns="revenue", value=0)
    .col_vals_gt(columns="tax", value=0)
    .col_vals_gt(columns="total", value=0)
    .col_vals_in_set(columns="product_category", set=[
        "Manufacturing", "Retail", "Healthcare"
    ])
    .col_vals_in_set(columns="customer_segment", set=[
        "Government", "Consumer", "SMB"
    ])
    .col_vals_in_set(columns="region", set=[
        "Asia Pacific", "Europe", "North America"
    ])
    .col_vals_in_set(columns="status", set=[
        "returned", "shipped", "delivered"
    ])
    .col_vals_in_set(columns="payment_method", set=[
        "Apple Pay", "PayPal", "Bank Transfer", "Credit Card"
    ])
    .col_vals_in_set(columns="sales_channel", set=[
        "Partner", "Distributor", "Phone"
    ])
    .row_count_match(count=50000)
    .col_count_match(count=20)
    .rows_distinct()
    .interrogate()
)

validation
```
````

### Managing API Keys

While you can directly provide API keys as shown above, there are more secure approaches:

```python
import os

# Get API key from environment variable
api_key = os.getenv("ANTHROPIC_API_KEY")

draft_validation = pb.DraftValidation(
    data=data,
    model="anthropic:claude-sonnet-4-5",
    api_key=api_key
)
```

You can also store API keys in a `.env` file in your project's root directory:

```
# Contents of .env file
ANTHROPIC_API_KEY=your_anthropic_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```

If your API keys have standard names (like `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`), `DraftValidation` will automatically find and use them:

```python
# No API key needed if stored in .env with standard names
draft_validation = pb.DraftValidation(
    data=data,
    model="anthropic:claude-sonnet-4-5"
)
```

## Example Output for `nycflights`

Here's an example of a validation plan that might be generated by `DraftValidation` for the `nycflights` dataset:

```python
pb.DraftValidation(
    data=pb.load_dataset(dataset="nycflights", tbl_type="duckdb"),
    model="anthropic:claude-sonnet-4-5"
)
```

````plaintext
```python
import pointblank as pb

# Define schema based on column names and dtypes
schema =
pb.Schema(columns=[ ("year", "int64"), ("month", "int64"), ("day", "int64"), ("dep_time", "int64"), ("sched_dep_time", "int64"), ("dep_delay", "int64"), ("arr_time", "int64"), ("sched_arr_time", "int64"), ("arr_delay", "int64"), ("carrier", "string"), ("flight", "int64"), ("tailnum", "string"), ("origin", "string"), ("dest", "string"), ("air_time", "int64"), ("distance", "int64"), ("hour", "int64"), ("minute", "int64") ]) # The validation plan validation = ( pb.Validate( data=your_data, # Replace your_data with the actual data variable label="Draft Validation", thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35) ) .col_schema_match(schema=schema) .col_vals_not_null(columns=[ "year", "month", "day", "sched_dep_time", "carrier", "flight", "origin", "dest", "distance", "hour", "minute" ]) .col_vals_between(columns="month", left=1, right=12) .col_vals_between(columns="day", left=1, right=31) .col_vals_between(columns="sched_dep_time", left=106, right=2359) .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True) .col_vals_between(columns="air_time", left=20, right=695, na_pass=True) .col_vals_between(columns="distance", left=17, right=4983) .col_vals_between(columns="hour", left=1, right=23) .col_vals_between(columns="minute", left=0, right=59) .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"]) .col_count_match(count=18) .row_count_match(count=336776) .rows_distinct() .interrogate() ) validation ``` ```` Notice how the generated plan includes: 1. A schema validation with appropriate data types 2. Not-null checks for required columns 3. Range validations for numerical data 4. Set membership checks for categorical data 5. Row and column count validations 6. Appropriate handling of missing values with `na_pass=True` ## Working with Model Providers ### Specifying Models When using `DraftValidation`, you specify the model in the format `"provider:model_name"`: ```python # Using Anthropic's Claude model pb.DraftValidation(data=data, model="anthropic:claude-sonnet-4-5") # Using OpenAI's GPT model pb.DraftValidation(data=data, model="openai:gpt-4-turbo") # Using a local model with Ollama pb.DraftValidation(data=data, model="ollama:llama3:latest") # Using Amazon Bedrock pb.DraftValidation(data=data, model="bedrock:anthropic.claude-3-sonnet-20240229-v1:0") ``` ### Model Performance and Privacy Different models have different capabilities when it comes to generating validation plans: - Anthropic Claude Sonnet 4.5 generally provides the most comprehensive and accurate validation plans - OpenAI GPT-4 models also perform well - Local models through Ollama can be useful for private data but they currently have reduced capabilities here A key advantage of `DraftValidation` is that your actual dataset is not sent to the LLM provider. Instead, only a summary is transmitted, which includes: - the number of rows and columns - column names and data types - basic statistics (min, max, mean, median, missing values count) - a small sample of values from each column (usually 5-10 values) This approach protects your data while still providing enough context for the LLM to generate relevant validation rules. ## Customizing Generated Plans The validation plan generated by `DraftValidation` is just a starting point. You'll typically want to: 1. review the generated code for correctness 2. replace `your_data` with your actual data variable name that exists in your workspace 3. ensure the data object referenced is actually present in your workspace 4. 
adjust thresholds and validation parameters 5. add domain-specific validation rules 6. remove any unnecessary checks For example, you might start by capturing the text representation of your draft validation. This will give you the raw Python code that you can copy into a new code cell in your notebook or script. From there, you can customize it by modifying thresholds to match your organization's data quality standards, adding business-specific validation rules that require domain knowledge, or removing checks that aren't relevant to your use case. Once you've made your modifications, you can execute the customized validation plan as you would any other Pointblank validation. ## Under the Hood ### The Generated Data Summary To understand what the LLM works with, here's an example of the data summary format that's sent: ```json { "table_info": { "rows": 336776, "columns": 18, "table_type": "duckdb" }, "column_info": [ { "column_name": "year", "column_type": "int64", "missing_values": 0, "min": 2013, "max": 2013, "mean": 2013.0, "median": 2013, "sample_values": [2013, 2013, 2013, 2013, 2013] }, { "column_name": "month", "column_type": "int64", "missing_values": 0, "min": 1, "max": 12, "mean": 6.548819, "median": 7, "sample_values": [1, 1, 1, 1, 1] }, // Additional columns... ] } ``` ### The Prompt Strategy The `DraftValidation` class uses a carefully crafted prompt that instructs the LLM to: 1. use the schema information to create a `Schema` object 2. include [`Validate.col_vals_not_null()`](`Validate.col_vals_not_null`) for columns with no missing values 3. add appropriate range validations based on min/max values 4. include row and column count validations 5. format the output as clean, executable Python code The prompt also contains constraints to ensure consistent, high-quality results, such as using line breaks in long lists for readability, applying `na_pass=True` for columns with missing values, and avoiding duplicate validations. ## Best Practices and Troubleshooting ### When to Use `DraftValidation`{.qd-no-link} Drafting a validation is most useful when: - working with a new dataset for the first time - needing to quickly establish baseline validation - exploring potential validation rules before formalizing them - validating columns with consistent patterns (numeric ranges, categories, etc.) Consider writing validation plans manually when you need very specific business rules, are working with sensitive data, need complex validation logic, or need to validate relationships between columns. ### Recommended Workflow and Common Issues Here's a recommended workflow incorporating `DraftValidation`: 1. generate an initial plan with `DraftValidation` 2. review the generated validations for relevance 3. adjust thresholds and parameters as needed 4. add specific business logic and cross-column validations 5. store the final validation plan in version control It's possible that you might bump up against some issues. Here are some common ones and solutions you might try: - Authentication Errors: ensure your API key is valid and correctly passed to `DraftValidation` - Package Not Found: make sure you've installed the required packages with `pip install pointblank[generate]` - Unsupported Model: verify you're using the correct `provider:model` format - Poor Quality Plans: try a more capable model ## Conclusion `DraftValidation` provides a powerful way to jumpstart your data validation process by leveraging LLMs to generate context-aware validation plans. 
By analyzing your data's structure and content, `DraftValidation` can create comprehensive validation rules that would otherwise take significant time to develop manually. The feature balances privacy (by sending only data summaries) with utility (by generating executable validation code). While the generated plans should always be reviewed and refined, they provide an excellent starting point for ensuring your data meets your quality requirements. By understanding how `DraftValidation` works and how to customize its output, you can significantly accelerate your data validation workflows and improve the quality of your data throughout your projects. ### YAML Validation Workflows Pointblank supports defining validation workflows using YAML configuration files, providing a declarative, readable, and maintainable approach to data validation. YAML workflows are particularly useful for teams, version control, automation pipelines, and scenarios where you want to separate validation logic from application code. YAML validation workflows offer several advantages: they're easy to read and write, can be version controlled alongside your data processing code, enable non-programmers to contribute to data quality definitions, and provide a clear separation between validation logic and execution code. The YAML approach complements Pointblank's Python API, giving you flexibility to choose the right tool for each situation. Simple, repetitive validations work well in YAML, while complex logic with custom functions might be better suited for the Python API. ## Basic YAML Validation Structure A YAML validation workflow consists of a few key components: - `tbl`: specifies the data source (file path, dataset name, or Python expression) - `steps`: defines the validation checks to perform - Optional metadata: table name, label, thresholds, actions, and other configuration Here's a simple example validating the built-in `small_table` dataset: ```yaml tbl: small_table df_library: polars # Optional: specify DataFrame library tbl_name: "Small Table Validation" label: "Basic data quality checks" steps: - rows_distinct - col_exists: columns: [a, b, c, d] - col_vals_not_null: columns: [a, b] ``` You can save this configuration to a .yaml file and execute it using the `yaml_interrogate()` function: ```{python} import pointblank as pb from pathlib import Path # Save the YAML configuration to a file yaml_content = """ tbl: small_table df_library: polars tbl_name: "Small Table Validation" label: "Basic data quality checks" steps: - rows_distinct - col_exists: columns: [a, b, c, d] - col_vals_not_null: columns: [a, b] """ yaml_file = Path("basic_validation.yaml") yaml_file.write_text(yaml_content) # Execute the validation from the file result = pb.yaml_interrogate(yaml_file) result ``` The validation table shows the results of each step, just as if you had written the equivalent Python code. You can also pass YAML content directly as a string for quick testing, but working with files is the recommended approach for production workflows. ## Data Sources in YAML The `tbl` field supports various data source types, making it easy to work with different kinds of data. You can also control the DataFrame library used for loading data with the `df_library` parameter. 
### DataFrame Library Selection By default, Pointblank loads data as Polars DataFrames, but you can specify alternative libraries: ```yaml # Load as Polars DataFrame (default) tbl: small_table df_library: polars # Load as Pandas DataFrame tbl: small_table df_library: pandas # Load as DuckDB table (via Ibis) tbl: small_table df_library: duckdb ``` This is particularly useful when using validation expressions that require specific DataFrame APIs: ```yaml # Using Pandas-specific operations tbl: small_table df_library: pandas steps: - specially: expr: "lambda df: df.assign(total=df['a'] + df['d'])" # Using Polars-specific operations tbl: small_table df_library: polars steps: - specially: expr: "lambda df: df.select(pl.col('a') + pl.col('d') > 0)" ``` ### File-based Sources ```yaml # CSV files (respects df_library setting) tbl: "data/customers.csv" df_library: pandas # Parquet files tbl: "warehouse/sales.parquet" df_library: polars # Multiple files with patterns tbl: "logs/*.parquet" ``` ### Built-in Datasets ```yaml # Use Pointblank's built-in datasets tbl: small_table tbl: game_revenue tbl: nycflights ``` ### Python Expressions for Complex Sources For more complex data loading, use the `python:` block syntax. This syntax can be used with several parameters throughout your YAML configuration: - `tbl`: For complex data source loading (as shown below) - `expr`: For custom validation expressions in `col_vals_expr` - `pre`: For data preprocessing before validation steps - `actions`: For callable action functions (`warning`, `error`, `critical`, and `default`) ```yaml # Load data with custom Polars operations tbl: python: | pl.scan_csv("sales_data.csv") .filter(pl.col("date") >= "2024-01-01") .head(1000) # Load from a database connection tbl: python: | pl.read_database( query="SELECT * FROM customers WHERE active = true", connection="postgresql://user:pass@localhost/db" ) ``` ## Reusable Templates with `set_tbl=` One of the most powerful features of YAML validation workflows is the ability to create reusable templates that can be applied to different datasets. Using the `set_tbl=` parameter with `yaml_interrogate()`, you can define validation logic once and apply it to multiple data sources. ### Creating Validation Templates When creating templates for use with `set_tbl=`, the `tbl` field is still required but its value will be overridden. 
The recommended approach is to use `tbl: null`: ```yaml tbl: null tbl_name: "Sales Data Validation Template" label: "Standard validation checks for sales data" steps: - col_exists: columns: [customer_id, revenue, region, date] - col_vals_not_null: columns: [customer_id, revenue] - col_vals_gt: columns: [revenue] value: 0 - col_vals_in_set: columns: [region] set: [North, South, East, West] ``` ### Applying Templates to Multiple Datasets Here's a practical example showing how to apply the same validation template to multiple quarterly datasets, demonstrating the power of reusable YAML configurations: ```{python} import pointblank as pb import polars as pl # Define the template once sales_template = """ tbl: null # Will be overridden tbl_name: "Sales Data Validation" label: "Standard sales validation checks" thresholds: warning: 0.05 error: 0.1 steps: - col_exists: columns: [customer_id, revenue, region] - col_vals_not_null: columns: [customer_id, revenue] - col_vals_gt: columns: [revenue] value: 0 - col_vals_in_set: columns: [region] set: [North, South, East, West] """ # Create different datasets q1_data = pl.DataFrame({ "customer_id": [1, 2, 3, 4], "revenue": [100, 200, 150, 300], "region": ["North", "South", "East", "West"] }) q2_data = pl.DataFrame({ "customer_id": [5, 6, 7, 8], "revenue": [250, 180, 220, 350], "region": ["South", "North", "West", "East"] }) # Apply the same template to both datasets q1_result = pb.yaml_interrogate(sales_template, set_tbl=q1_data) q2_result = pb.yaml_interrogate(sales_template, set_tbl=q2_data) print(f"Q1 validation: {all(v.all_passed for v in q1_result.validation_info)}") print(f"Q2 validation: {all(v.all_passed for v in q2_result.validation_info)}") ``` ### Template Best Practices 1. **Use `tbl: null`**: this clearly indicates the template expects a data source to be provided 2. **Include comprehensive metadata**: use `tbl_name`, `label`, and `brief` to make results self-documenting 3. **Set appropriate thresholds**: define warning/error levels that make sense for your use case 4. **Version control templates**: store templates in your repository alongside your data processing code 5. **Test with sample data**: validate your templates work with representative datasets ### Common Template Patterns For API response validation, you can ensure that responses have the expected structure and valid status codes: ```yaml tbl: null tbl_name: "API Response Validation" brief: "Standard checks for API response data" steps: - col_exists: columns: [user_id, status, timestamp] - col_vals_in_set: columns: [status] set: [success, error, pending] - col_vals_not_null: columns: [user_id, timestamp] ``` For file upload validation, you can check file sizes and formats to ensure they meet your requirements: ```yaml tbl: null tbl_name: "File Upload Validation" steps: - col_vals_gt: columns: [file_size] value: 0 - col_vals_lt: columns: [file_size] value: 10485760 # 10MB limit - col_vals_in_set: columns: [file_type] set: [csv, json, xlsx, parquet] ``` This template approach is particularly valuable in data pipelines, ETL processes, and automated testing scenarios where you need to apply consistent validation logic across multiple similar datasets. ## Validation Steps YAML supports all of Pointblank's validation methods. 
Here are some common patterns: ### Column-based Validations ```yaml tbl: worldcities.csv steps: # Check for missing values - col_vals_not_null: columns: [city_name, country] # Validate value ranges - col_vals_between: columns: latitude left: -90 right: 90 # Check set membership - col_vals_in_set: columns: country_code set: [US, CA, MX, UK, DE, FR] # Regular expression validation - col_vals_regex: columns: postal_code pattern: "^[0-9]{5}(-[0-9]{4})?$" ``` ### Row-based Validations ```yaml tbl: sales_data.csv steps: # Check for duplicate rows - rows_distinct # Ensure complete rows (no missing values) - rows_complete # Check row count - row_count_match: count: 1000 ``` ### Schema Validations Schema validation ensures your data has the expected structure and column types. The `col_schema_match` validation method uses a `schema` key that contains a `columns` list, where each item in the list can specify a column name alone or a column name with its expected data type. Each `column` entry can be specified as: - `column_name`: column name as a scalar string (structure validation, no type checking) - `[column_name, "data_type"]`: column name with type validation (as a list with two elements) - `[column_name]`: column name in a single-item list (equivalent to scalar, for consistency) ```yaml tbl: customer_data.csv steps: # Complete schema validation (structure and types) - col_schema_match: schema: columns: - [customer_id, "int64"] - [name, "object"] - [email, "object"] - [signup_date, "datetime64[ns]"] # Structure-only validation (column names without types) - col_schema_match: schema: columns: - customer_id - name - email complete: false brief: "Check that core columns exist" ``` #### Schema Validation Options Schema validations support the full range of validation options: ```yaml tbl: data_file.csv steps: - col_schema_match: schema: columns: - [id, "int64"] - name complete: false # Allow extra columns in_order: false # Column order doesn't matter case_sensitive_colnames: false # Case-insensitive column names case_sensitive_dtypes: false # Case-insensitive type names full_match_dtypes: false # Allow partial type matching brief: "Flexible schema validation" ``` #### Other Structure Validations ```yaml tbl: customer_data.csv steps: # Check column count - col_count_match: count: 4 ``` ### Trend Validations Validate that values follow increasing or decreasing patterns across rows: ```yaml tbl: time_series_data.csv steps: # Ensure timestamp values increase - col_vals_increasing: columns: timestamp brief: "Timestamps must be in chronological order" # Validate countdown timer decreases - col_vals_decreasing: columns: countdown allow_stationary: true brief: "Countdown values should decrease (ties allowed)" # Check trend with tolerance - col_vals_increasing: columns: temperature decreasing_tol: 0.5 brief: "Temperature trends upward (small drops < 0.5°C allowed)" ``` ### Specification-based Validations Validate values against common data specifications like email addresses, URLs, postal codes, and more: ```yaml tbl: user_contact_info.csv steps: # Validate email addresses - col_vals_within_spec: columns: email spec: "email" # Validate US ZIP codes - col_vals_within_spec: columns: zip_code spec: "postal_code[US]" # Validate URLs - col_vals_within_spec: columns: website spec: "url" na_pass: true ``` Available specifications include: `"email"`, `"url"`, `"phone"`, `"ipv4"`, `"ipv6"`, `"mac"`, `"isbn"`, `"vin"`, `"credit_card"`, `"swift"`, `"postal_code[]"`, `"iban[]"`. 
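To try these patterns without creating any files on disk, here's a small self-contained sketch that runs two of the checks above against the built-in `small_table` dataset (the bounds and set values are illustrative):

```python
import pointblank as pb

yaml_config = """
tbl: small_table
steps:
  # Validate value ranges
  - col_vals_between:
      columns: c
      left: 1
      right: 10
      na_pass: true
  # Check set membership
  - col_vals_in_set:
      columns: f
      set: [low, mid, high]
"""

result = pb.yaml_interrogate(yaml_config)
result
```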
### Table Comparison Validate that an entire table matches a reference table: ```yaml tbl: processed_output.csv steps: # Compare against expected output - tbl_match: tbl_compare: python: | pb.load_dataset("expected_output", tbl_type="polars") brief: "Output matches expected results" ``` The `tbl_match()` validation performs comprehensive comparison including column count, row count, schema, and data values. It supports cross-backend validation (e.g., comparing Polars vs. Pandas DataFrames). ### AI-Powered Validation Use Large Language Models to validate data based on natural language criteria: ```yaml tbl: customer_feedback.csv steps: # Validate sentiment - prompt: prompt: "Customer feedback should express positive sentiment" model: "anthropic:claude-sonnet-4" columns_subset: [feedback_text, rating] batch_size: 500 thresholds: warning: 0.1 # Validate semantic correctness - prompt: prompt: "Product descriptions should mention the product category and at least one benefit" model: "openai:gpt-4" columns_subset: [product_name, description, category] ``` **Note**: AI validations require API keys to be set as environment variables (e.g., `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`) or in a `.env` file. These validations are best suited for semantic, context-dependent, or subjective quality checks rather than simple numeric comparisons. ## Thresholds and Severity Levels Thresholds determine when validation failures trigger different severity levels. You can set global thresholds for the entire workflow: ```yaml tbl: sales_data.csv tbl_name: "Sales Data Quality Check" thresholds: warning: 0.05 # 5% failure rate triggers warning error: 0.10 # 10% failure rate triggers error critical: 0.15 # 15% failure rate triggers critical steps: - col_vals_not_null: columns: [customer_id, amount] - col_vals_gt: columns: amount value: 0 ``` You can also set thresholds for individual validation steps: ```yaml tbl: user_data.csv steps: - col_vals_not_null: columns: email thresholds: warning: 1 # Any missing email is a warning error: 0.01 # 1% missing emails is an error - col_vals_regex: columns: email pattern: "^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$" thresholds: error: 1 # Any invalid email format is an error ``` ## Actions: Responding to Validation Failures Actions define what happens when validation thresholds are exceeded. You can use string templates with placeholder variables or callable functions. 
### String Template Actions ```yaml tbl: orders.csv thresholds: warning: 0.02 error: 0.05 actions: warning: "Warning: Step {step} found {n_failed} failures in {col} column" error: "Error in {TYPE} validation: {n_failed}/{n} rows failed (Step {step})" critical: "Critical failure detected at {time}" steps: - col_vals_not_null: columns: [order_id, customer_id] ``` Available template variables include: - `{step}`: validation step number - `{col}`: column name being validated - `{val}`: specific failing value (when applicable) - `{n_failed}`: number of failing rows - `{n}`: total number of rows checked - `{TYPE}`: validation method name (e.g., "COL_VALS_NOT_NULL") - `{LEVEL}`: severity level ("WARNING", "ERROR", "CRITICAL") - `{time}`: timestamp of the validation ### Callable Actions For more complex responses, use Python callable functions: ```yaml tbl: critical_data.csv thresholds: error: 1 actions: error: python: | lambda: print("ALERT: Critical data validation failed!") critical: python: | lambda: print("CRITICAL: Validation failure - manual intervention required!") steps: - col_vals_not_null: columns: [transaction_id, amount] ``` Note: The Python environment in YAML actions is restricted for security. You can use built-in functions like `print()`, basic operations, and available DataFrame libraries, but cannot import external modules like `requests` or `logging`. For external notifications, consider using string template actions or handling alerts in your application code after the validation completes. ### Step-level Actions You can also define actions for individual validation steps: ```yaml tbl: financial_data.csv steps: - col_vals_not_null: columns: account_balance thresholds: error: 1 actions: error: "Missing account balance detected in step {step}." - col_vals_gt: columns: account_balance value: 0 actions: warning: python: | lambda: print("Negative balance warning triggered.") ``` ## Advanced Features ### Pre-processing with the `pre` Parameter You can apply data transformations before validation using the `pre` parameter: ```yaml tbl: transactions.csv steps: # Validate only recent transactions - col_vals_gt: columns: amount value: 0 pre: python: | lambda df: df.filter( pl.col("transaction_date") >= "2024-01-01" ) # Check completeness for active customers only - col_vals_not_null: columns: [email, phone] pre: | lambda df: df.filter(pl.col("status") == "active") ``` Note that you can use either the explicit `python:` block syntax or the shortcut syntax (just `pre: |`) for the lambda expressions. ### Complex Expressions For advanced validation logic, use a `col_vals_expr` step with custom expressions: ```yaml tbl: sales_data.csv steps: # Custom business logic validation - col_vals_expr: expr: python: | ( pl.when(pl.col("product_type") == "premium") .then(pl.col("price") >= 100) .when(pl.col("product_type") == "standard") .then(pl.col("price").is_between(20, 99)) .otherwise(pl.col("price") <= 19) ) ``` ### Brief Descriptions Add human-readable descriptions to validation steps. 
The `brief` parameter supports string templating and automatic generation: ```yaml tbl: customer_data.csv brief: "Customer data quality validation for {auto}" steps: - col_vals_not_null: columns: customer_id brief: "Ensure all customers have valid IDs" - col_vals_regex: columns: email pattern: "^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$" brief: "Validate email format compliance" - col_vals_between: columns: age left: 13 right: 120 brief: "Check reasonable age ranges" # Use automatic brief generation - col_vals_not_null: columns: phone_number brief: true # Template variables in briefs - col_vals_in_set: columns: status set: [active, inactive, pending] brief: "Column '{col}' must be one of: {set}" ``` Brief Templating Options: - custom strings: Write your own descriptive text - `true`: Automatically generates a brief based on the validation method and parameters - `{auto}`: Placeholder for auto-generated text within custom strings - template variables: Use the same variables available in actions: - `{col}`: column name(s) being validated - `{step}`: the step number in the validation plan - `{value}`: the comparison value used in the validation (for single-value comparisons) - `{pattern}`: for regex validations, the pattern being matched ### Governance Metadata YAML workflows support governance metadata that identifies ownership and usage of validation workflows. These fields are embedded in the validation report: ```yaml tbl: sales_data.csv tbl_name: "Sales Pipeline" owner: "Data Engineering" consumers: [Analytics Team, Finance, Compliance] version: "2.1.0" steps: - col_vals_not_null: columns: [customer_id, revenue] - col_vals_gt: columns: [revenue] value: 0 ``` The `owner`, `consumers`, and `version` fields are forwarded to the `Validate` constructor and appear in the validation report footer. These fields are optional and do not affect validation behavior. ### Data Freshness and Null Percentage Two additional validation methods support common data quality checks: **`data_freshness`**: Validate that a date/datetime column has recent data: ```yaml steps: - data_freshness: columns: event_date freshness: "24h" ``` **`col_pct_null`**: Validate that the fraction of null values does not exceed a maximum: ```yaml steps: - col_pct_null: columns: [email, phone] value: 0.05 ``` ### Aggregate Validations Aggregate methods validate column-level statistics like sum, average, and standard deviation: ```yaml steps: # Check that total revenue is positive - col_sum_gt: columns: [revenue] value: 0 # Validate average rating is at most 5 - col_avg_le: columns: [rating] value: 5 # Ensure temperature variation is bounded - col_sd_lt: columns: [temperature] value: 10 ``` Available methods follow the `col_{stat}_{comparator}` pattern where `{stat}` is `sum`, `avg`, or `sd`, and `{comparator}` is a comparison such as `gt`, `lt`, `ge`, `le`, or `eq` (the Aggregate Validations entry in the reference below lists the available method names). ### Step Activation Control The `active` parameter allows you to temporarily disable validation steps without removing them from the configuration: ```yaml steps: # This step is disabled - col_vals_gt: columns: [amount] value: 0 active: false # This step runs normally (active: true is the default) - col_vals_not_null: columns: [customer_id] ``` This is useful for debugging, phased rollouts, or temporarily skipping steps that are known to fail.
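As a quick illustration of step activation, here's a sketch against the built-in `small_table` dataset (the comparison value is arbitrary); the first step is skipped during interrogation while the second runs normally:

```python
import pointblank as pb

yaml_config = """
tbl: small_table
steps:
  # Disabled: shown as inactive in the report's EVAL column
  - col_vals_gt:
      columns: d
      value: 10000
      active: false
  # Runs normally
  - col_vals_not_null:
      columns: date
"""

result = pb.yaml_interrogate(yaml_config)
result
```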
### Reference Tables The `reference` top-level key specifies a reference table for comparison-based validations: ```yaml tbl: current_data.csv reference: python: | pb.load_dataset("baseline_data", tbl_type="polars") steps: - tbl_match: tbl_compare: python: | pb.load_dataset("baseline_data", tbl_type="polars") ``` ## Working with YAML Files ### Loading from Files You can save your YAML configuration to files and load them: ```{python} # Create a YAML file yaml_content = """ tbl: small_table tbl_name: "File-based Validation" steps: - col_vals_between: columns: c left: 1 right: 10 - col_vals_in_set: columns: f set: [low, mid, high] """ # Save to file from pathlib import Path yaml_file = Path("validation_config.yaml") yaml_file.write_text(yaml_content) # Load and execute result = pb.yaml_interrogate(yaml_file) result ``` ### Converting YAML to Python Use `yaml_to_python()` to generate equivalent Python code from your YAML configuration: ```{python} yaml_config = """ tbl: small_table tbl_name: "Example Validation" thresholds: warning: 0.1 error: 0.2 actions: warning: "Warning: {TYPE} validation failed" steps: - col_vals_gt: columns: a value: 0 - col_vals_in_set: columns: f set: [low, mid, high] """ # Generate Python code python_code = pb.yaml_to_python(yaml_config) print(python_code) ``` This is useful for: - learning how YAML maps to Python API calls - transitioning from YAML to code-based workflows - generating documentation that shows both approaches - debugging YAML configurations ## Practical Examples ### Data Pipeline Validation Here's a comprehensive example for validating data in a processing pipeline: ```yaml tbl: python: | ( pl.scan_csv("raw_data/customer_events.csv") .filter(pl.col("event_date") >= "2024-01-01") ) tbl_name: "Customer Events Pipeline Validation" label: "Daily data quality check for customer events" thresholds: warning: 0.01 # 1% failure rate error: 0.05 # 5% failure rate actions: warning: "Pipeline warning: {TYPE} validation found {n_failed} issues" error: python: | lambda: print("ERROR: Pipeline validation failed - manual review required") steps: # Schema validation - col_schema_match: schema: columns: - [customer_id, "int64"] - [event_type, "object"] - [event_date, "object"] - [revenue, "float64"] brief: "Validate table structure matches expected schema" # Data completeness - col_vals_not_null: columns: [customer_id, event_type, event_date] brief: "Critical fields must be complete" # Business logic validation - col_vals_in_set: columns: event_type set: [signup, purchase, cancellation, upgrade] brief: "Event types must be from approved list" # Data quality checks - col_vals_gt: columns: revenue value: 0 na_pass: true brief: "Revenue values must be positive when present" # Temporal validation - col_vals_expr: expr: python: | pl.col("event_date").str.strptime(pl.Date, "%Y-%m-%d").is_not_null() brief: "Event dates must be valid YYYY-MM-DD format" ``` ### Quality Monitoring Dashboard For ongoing data quality monitoring: ```yaml tbl: warehouse/daily_metrics.parquet tbl_name: "Daily Metrics Quality Check" thresholds: warning: 5 # 5 failing rows error: 50 # 50 failing rows critical: 100 # 100 failing rows actions: warning: "Quality check warning: {n_failed} rows failed {TYPE} validation" error: "Quality degradation detected: Step {step} failed for {n_failed}/{n} rows" critical: python: | lambda: print("CRITICAL: Data quality failure detected - immediate attention required") highest_only: false steps: - row_count_match: count: 10000 brief: "Verify expected daily record 
count" - col_vals_not_null: columns: [date, metric_value, source_system] brief: "Core fields must be complete" - col_vals_between: columns: metric_value left: 0 right: 1000000 brief: "Metric values within reasonable range" - rows_distinct: columns_subset: [date, metric_name, source_system] brief: "No duplicate metric records per day" ``` ## Best Practices ### Organization and Structure 1. use descriptive names: give your validations clear `tbl_name` and `label` values 2. add brief descriptions: document what each validation step checks 3. group related validations: organize steps logically (schema, completeness, business rules) 4. version control: store YAML files in git alongside your data processing code ### Error Handling and Monitoring 1. set appropriate thresholds: start conservative and adjust based on your data patterns 2. use actions for alerting: set up notifications for critical failures 3. document expected failures: some data quality issues might be acceptable 4. monitor validation results: track validation performance over time ### Performance Considerations 1. use the `pre` parameter efficiently: apply filters early to reduce data volume 2. order validations strategically: put fast, likely-to-fail checks first 3. consider data source location: local files are faster than remote sources 4. use appropriate column selections: only validate the columns you need ## Wrapping Up YAML validation workflows provide a powerful, declarative approach to data validation in Pointblank. Such workflows are great at expressing common validation patterns in a readable format that can be easily shared, version controlled, and maintained by teams. Key advantages of YAML workflows: - readable: non-programmers can understand and contribute to validation logic - maintainable: easy to modify validation rules without changing application code - portable: YAML files can be shared between projects and teams - version controlled: track changes to validation logic over time - flexible: support for simple checks and complex custom logic Use YAML workflows when you want declarative, maintainable validation definitions, and fall back to the Python API when you need complex programmatic logic or tight integration with application code. The two approaches complement each other well and can be used together as your validation needs evolve. ### YAML Reference This reference provides a comprehensive guide to all YAML keys and parameters supported by Pointblank's YAML validation workflows. Use this document as a quick lookup when building validation configurations. 
## Global Configuration Keys ### Top-level Structure ```yaml tbl: data_source # REQUIRED: Data source specification df_library: "polars" # OPTIONAL: DataFrame library ("polars", "pandas", "duckdb") tbl_name: "Custom Table Name" # OPTIONAL: Human-readable table name label: "Validation Description" # OPTIONAL: Description for the validation workflow lang: "en" # OPTIONAL: Language code (default: "en") locale: "en" # OPTIONAL: Locale setting (default: "en") brief: "Global brief: {auto}" # OPTIONAL: Global brief template owner: "Data Engineering" # OPTIONAL: Data owner (governance metadata) consumers: [Analytics, Finance] # OPTIONAL: Data consumers (governance metadata) version: "1.0.0" # OPTIONAL: Validation version (governance metadata) reference: # OPTIONAL: Reference table for comparison validations python: | pb.load_dataset("ref_table") thresholds: # OPTIONAL: Global failure thresholds warning: 0.1 error: 0.2 critical: 0.3 actions: # OPTIONAL: Global failure actions warning: "Warning message template" error: "Error message template" critical: "Critical message template" highest_only: false final_actions: # OPTIONAL: Actions triggered after all steps complete warning: "Post-validation warning" error: "Post-validation error" steps: # REQUIRED: List of validation steps - validation_method_name - validation_method_name: parameter: value ``` ### Data Source (`tbl`) The `tbl` key specifies the data source and supports multiple formats: ```yaml # File paths tbl: "data/file.csv" tbl: "data/file.parquet" # Built-in datasets tbl: small_table tbl: game_revenue tbl: nycflights # Python expressions for complex data loading tbl: python: | pl.scan_csv("data.csv").filter(pl.col("date") >= "2024-01-01") ``` #### Using Templates with `set_tbl=` For reusable validation templates that will always use a custom data source via the `set_tbl=` parameter in `yaml_interrogate()`, the `tbl` field is still required but its value doesn't matter since it will be overridden. Recommended approaches: ```yaml # Option 1: Use a valid dataset name (gets overridden anyway) tbl: small_table # Will be ignored when `set_tbl=` is used # Option 2: Use YAML null (clearest semantic intent) tbl: null # Indicates table will be provided via `set_tbl=` ``` When using `yaml_interrogate()` with `set_tbl=`, the validation template becomes fully reusable: ```python # Define reusable template template = """ tbl: null # Will be overridden tbl_name: "Sales Validation" steps: - col_exists: columns: [customer_id, revenue, region] - col_vals_gt: columns: [revenue] value: 0 """ # Apply to different datasets q1_result = pb.yaml_interrogate(template, set_tbl=q1_data) q2_result = pb.yaml_interrogate(template, set_tbl=q2_data) ``` ### DataFrame Library (`df_library`) The `df_library` key controls which DataFrame library is used to load data sources. 
This parameter affects both built-in datasets and file loading: ```yaml # Use Polars DataFrames (default) df_library: polars # Use Pandas DataFrames df_library: pandas # Use DuckDB tables (via Ibis) df_library: duckdb ``` Examples with different libraries: ```yaml # Load built-in dataset as Pandas DataFrame tbl: small_table df_library: pandas steps: - specially: expr: "lambda df: df.assign(validation_result=df['a'] > 0)" # Load CSV file as Polars DataFrame tbl: "data/sales.csv" df_library: polars steps: - col_vals_gt: columns: amount value: 0 # Load dataset as DuckDB table tbl: nycflights df_library: duckdb steps: - row_count_match: count: 336776 ``` The `df_library` parameter is particularly useful when: - using validation expressions that require specific DataFrame APIs (e.g., Pandas `.assign()`, Polars `.select()`) - integrating with existing pipelines that use a specific DataFrame library - optimizing performance for different data sizes and operations - ensuring compatibility with downstream processing steps ### Global Thresholds Thresholds define when validation failures trigger different severity levels: ```yaml thresholds: warning: 0.05 # 5% failure rate triggers warning error: 0.10 # 10% failure rate triggers error critical: 0.15 # 15% failure rate triggers critical ``` - values: numbers between `0` and `1` (percentages) or integers (row counts) - levels: `warning`, `error`, `critical` ### Global Actions Actions define responses when thresholds are exceeded. When supplying a string to a severity level ('warning', 'error', 'critical'), you can use template variables that will be automatically substituted with contextual information: ```yaml actions: warning: "Warning: {n_failed} failures in step {step}" error: python: | lambda: print("Error detected!") critical: "Critical failure at {time}" highest_only: false # Execute all applicable actions vs. only highest severity ``` Template variables available for action strings: - `{step}`: current validation step number - `{col}`: column name(s) being validated - `{val}`: validation value or threshold - `{n_failed}`: number of failing records - `{n}`: total number of records - `{type}`: validation method type - `{level}`: severity level ('warning'/'error'/'critical') - `{time}`: timestamp of validation ## Validation Methods Reference ### Column Value Validations #### Comparison Methods `col_vals_gt`: are column data greater than a fixed value or data in another column? ```yaml - col_vals_gt: columns: [column_name] # REQUIRED: Column(s) to validate value: 100 # REQUIRED: Comparison value na_pass: true # OPTIONAL: Pass NULL values (default: false) pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values must be > 100" # OPTIONAL: Step description ``` `col_vals_lt`: are column data less than a fixed value or data in another column? ```yaml - col_vals_lt: columns: [column_name] value: 100 na_pass: true # ... (same parameters as col_vals_gt) ``` `col_vals_ge`: are column data greater than or equal to a fixed value or data in another column? ```yaml - col_vals_ge: columns: [column_name] value: 100 na_pass: true # ... (same parameters as col_vals_gt) ``` `col_vals_le`: are column data less than or equal to a fixed value or data in another column? ```yaml - col_vals_le: columns: [column_name] value: 100 na_pass: true # ... 
(same parameters as col_vals_gt) ``` `col_vals_eq`: are column data equal to a fixed value or data in another column? ```yaml - col_vals_eq: columns: [column_name] value: "expected_value" na_pass: true # ... (same parameters as col_vals_gt) ``` `col_vals_ne`: are column data not equal to a fixed value or data in another column? ```yaml - col_vals_ne: columns: [column_name] value: "forbidden_value" na_pass: true # ... (same parameters as col_vals_gt) ``` #### Range Methods `col_vals_between`: are column data between two specified values (inclusive)? ```yaml - col_vals_between: columns: [column_name] # REQUIRED: Column(s) to validate left: 0 # REQUIRED: Lower bound right: 100 # REQUIRED: Upper bound inclusive: [true, true] # OPTIONAL: Include bounds [left, right] na_pass: false # OPTIONAL: Pass NULL values pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values between 0 and 100" # OPTIONAL: Step description ``` `col_vals_outside`: are column data outside of two specified values? ```yaml - col_vals_outside: columns: [column_name] left: 0 right: 100 inclusive: [false, false] # OPTIONAL: Exclude bounds [left, right] na_pass: false # ... (same parameters as col_vals_between) ``` #### Set Membership Methods `col_vals_in_set`: are column data part of a specified set of values? ```yaml - col_vals_in_set: columns: [column_name] # REQUIRED: Column(s) to validate set: [value1, value2, value3] # REQUIRED: Allowed values na_pass: false # OPTIONAL: Pass NULL values pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values in allowed set" # OPTIONAL: Step description ``` `col_vals_not_in_set`: are column data not part of a specified set of values? ```yaml - col_vals_not_in_set: columns: [column_name] set: [forbidden1, forbidden2] # REQUIRED: Forbidden values na_pass: false # ... (same parameters as col_vals_in_set) ``` #### NULL Value Methods `col_vals_null`: are column data null (missing)? ```yaml - col_vals_null: columns: [column_name] # REQUIRED: Column(s) to validate pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values must be NULL" # OPTIONAL: Step description ``` `col_vals_not_null`: are column data not null (not missing)? ```yaml - col_vals_not_null: columns: [column_name] # ... (same parameters as col_vals_null) ``` #### Pattern Matching Methods `col_vals_regex`: do string-based column data match a regular expression? ```yaml - col_vals_regex: columns: [column_name] # REQUIRED: Column(s) to validate pattern: "^[A-Z]{2,3}$" # REQUIRED: Regular expression pattern na_pass: false # OPTIONAL: Pass NULL values pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values match pattern" # OPTIONAL: Step description ``` `col_vals_within_spec`: do column data conform to a specification (email, URL, postal codes, etc.)? 
```yaml - col_vals_within_spec: columns: [column_name] # REQUIRED: Column(s) to validate spec: "email" # REQUIRED: Specification type na_pass: false # OPTIONAL: Pass NULL values pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values match spec" # OPTIONAL: Step description ``` Available specification types: - `"email"` - Email addresses - `"url"` - Internet URLs - `"phone"` - Phone numbers - `"ipv4"` - IPv4 addresses - `"ipv6"` - IPv6 addresses - `"mac"` - MAC addresses - `"isbn"` - International Standard Book Numbers (10 or 13 digit) - `"vin"` - Vehicle Identification Numbers - `"credit_card"` - Credit card numbers (uses Luhn algorithm) - `"swift"` - Business Identifier Codes (SWIFT-BIC) - `"postal_code[]"` - Postal codes for specific countries (e.g., `"postal_code[US]"`, `"postal_code[CA]"`) - `"zip"` - Alias for US ZIP codes (`"postal_code[US]"`) - `"iban[]"` - International Bank Account Numbers (e.g., `"iban[DE]"`, `"iban[FR]"`) Examples: ```yaml # Email validation - col_vals_within_spec: columns: user_email spec: "email" # US postal codes - col_vals_within_spec: columns: zip_code spec: "postal_code[US]" # German IBAN - col_vals_within_spec: columns: account_number spec: "iban[DE]" ``` #### Custom Expression Methods `col_vals_expr`: do column data agree with a predicate expression? ```yaml - col_vals_expr: expr: # REQUIRED: Custom validation expression python: | pl.when(pl.col("status") == "active") .then(pl.col("value") > 0) .otherwise(pl.lit(True)) pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Custom validation rule" # OPTIONAL: Step description ``` #### Trend Validation Methods `col_vals_increasing`: are column data increasing row-by-row? ```yaml - col_vals_increasing: columns: [column_name] # REQUIRED: Column(s) to validate allow_stationary: false # OPTIONAL: Allow consecutive equal values (default: false) decreasing_tol: 0.5 # OPTIONAL: Tolerance for negative movement (default: null) na_pass: false # OPTIONAL: Pass NULL values pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values must increase" # OPTIONAL: Step description ``` This validation checks whether values in a column increase as you move down the rows. Useful for validating time-series data, sequence numbers, or any monotonically increasing values. Parameters: - `allow_stationary`: If `true`, allows consecutive values to be equal (stationary phases). For example, `[1, 2, 2, 3]` would pass when `true` but fail at the third value when `false`. - `decreasing_tol`: Absolute tolerance for negative movement. Setting this to `0.5` means values can decrease by up to 0.5 units and still pass. Setting any value also sets `allow_stationary` to `true`. 
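To make the tolerance rule concrete, here's a rough sketch of the pairwise check (a hypothetical helper for illustration only, not pointblank's implementation):

```python
# Illustrative only: a sequence "increases" under a tolerance if no step
# down between consecutive values is larger than decreasing_tol.
def increases_with_tolerance(values, decreasing_tol=0.0):
    return all(b - a >= -decreasing_tol for a, b in zip(values, values[1:]))

print(increases_with_tolerance([1.0, 0.8, 1.3], decreasing_tol=0.5))  # True: drop of 0.2 is within tolerance
print(increases_with_tolerance([1.0, 0.3, 1.3], decreasing_tol=0.5))  # False: drop of 0.7 exceeds tolerance
```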
Examples: ```yaml # Strict increasing validation - col_vals_increasing: columns: timestamp_seconds brief: "Timestamps must strictly increase" # Allow stationary values - col_vals_increasing: columns: version_number allow_stationary: true brief: "Version numbers should increase (ties allowed)" # With tolerance for small decreases - col_vals_increasing: columns: temperature decreasing_tol: 0.1 brief: "Temperature trend (small drops allowed)" ``` `col_vals_decreasing`: are column data decreasing row-by-row? ```yaml - col_vals_decreasing: columns: [column_name] # REQUIRED: Column(s) to validate allow_stationary: false # OPTIONAL: Allow consecutive equal values (default: false) increasing_tol: 0.5 # OPTIONAL: Tolerance for positive movement (default: null) na_pass: false # OPTIONAL: Pass NULL values pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Values must decrease" # OPTIONAL: Step description ``` This validation checks whether values in a column decrease as you move down the rows. Useful for countdown timers, inventory depletion, or any monotonically decreasing values. Parameters: - `allow_stationary`: If `true`, allows consecutive values to be equal (stationary phases). For example, `[10, 8, 8, 5]` would pass when `true` but fail at the third value when `false`. - `increasing_tol`: Absolute tolerance for positive movement. Setting this to `0.5` means values can increase by up to 0.5 units and still pass. Setting any value also sets `allow_stationary` to `true`. Examples: ```yaml # Strict decreasing validation - col_vals_decreasing: columns: countdown_timer brief: "Timer must strictly decrease" # Allow stationary values - col_vals_decreasing: columns: priority_score allow_stationary: true brief: "Priority scores should decrease (ties allowed)" # With tolerance for small increases - col_vals_decreasing: columns: stock_level increasing_tol: 5 brief: "Stock levels decrease (small restocks allowed)" ``` ### Row-based Validations `rows_distinct`: are row data distinct? ```yaml - rows_distinct # Simple form - rows_distinct: # With parameters columns_subset: [col1, col2] # OPTIONAL: Check subset of columns pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "No duplicate rows" # OPTIONAL: Step description ``` `rows_complete`: are row data complete? ```yaml - rows_complete # Simple form - rows_complete: # With parameters columns_subset: [col1, col2] # OPTIONAL: Check subset of columns pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Complete rows only" # OPTIONAL: Step description ``` ### Structure Validations `col_exists`: does column exist in the table? ```yaml - col_exists: columns: [col1, col2, col3] # REQUIRED: Column(s) that must exist thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Required columns exist" # OPTIONAL: Step description ``` `col_schema_match`: does the table have expected column names and data types? 
```yaml - col_schema_match: schema: # REQUIRED: Expected schema columns: - [column_name, "data_type"] # Column with type validation - column_name # Column name only (no type check) - [column_name] # Alternative syntax complete: true # OPTIONAL: Require exact column set in_order: true # OPTIONAL: Require exact column order case_sensitive_colnames: true # OPTIONAL: Case-sensitive column names case_sensitive_dtypes: true # OPTIONAL: Case-sensitive data types full_match_dtypes: true # OPTIONAL: Exact type matching thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Schema validation" # OPTIONAL: Step description ``` `row_count_match`: does the table have n rows? ```yaml - row_count_match: count: 1000 # REQUIRED: Expected row count thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Expected row count" # OPTIONAL: Step description ``` `col_count_match`: does the table have n columns? ```yaml - col_count_match: count: 10 # REQUIRED: Expected column count thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Expected column count" # OPTIONAL: Step description ``` `tbl_match`: does the table match a comparison table? ```yaml - tbl_match: tbl_compare: # REQUIRED: Comparison table python: | pb.load_dataset("reference_table", tbl_type="polars") pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.0 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Table structure matches" # OPTIONAL: Step description ``` This validation performs a comprehensive comparison between the target table and a comparison table, using progressively stricter checks: 1. **Column count match**: both tables have the same number of columns 2. **Row count match**: both tables have the same number of rows 3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order) 4. **Schema match (order)**: columns in correct order (case-insensitive names) 5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order) 6. **Data match**: values in corresponding cells are identical The validation fails at the first check that doesn't pass, making it easy to diagnose mismatches. This operates over a single test unit (pass/fail for complete table match). **Cross-backend validation**: `tbl_match()` supports automatic backend coercion when comparing tables from different backends (e.g., Polars vs. Pandas, DuckDB vs. SQLite). The comparison table is automatically converted to match the target table's backend. Examples: ```yaml # Compare against reference dataset - tbl_match: tbl_compare: python: | pb.load_dataset("expected_output", tbl_type="polars") brief: "Output matches expected results" # Compare against CSV file - tbl_match: tbl_compare: python: | pl.read_csv("reference_data.csv") brief: "Matches reference CSV" # Compare with preprocessing on target table only - tbl_match: tbl_compare: python: | pb.load_dataset("reference_table", tbl_type="polars") pre: | lambda df: df.select(["id", "name", "value"]) brief: "Selected columns match reference" ``` ### Special Validation Methods `conjointly`: do multiple row-wise validations pass jointly?
```yaml - conjointly: expressions: # REQUIRED: List of lambda expressions - "lambda df: df['d'] > df['a']" - "lambda df: df['a'] > 0" - "lambda df: df['a'] + df['d'] < 12000" thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "All conditions must pass" # OPTIONAL: Step description ``` `specially`: do table data pass a custom validation function? ```yaml - specially: expr: # REQUIRED: Custom validation function "lambda df: df.select(pl.col('a') + pl.col('d') > 0)" thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Custom validation" # OPTIONAL: Step description ``` Alternative syntax with Python expressions: ```yaml - specially: expr: python: | lambda df: df.select(pl.col('amount') > 0) ``` For Pandas DataFrames (when using `df_library: pandas`): ```yaml - specially: expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)" ``` ### AI-Powered Validation `prompt`: validate rows using AI/LLM-powered analysis ```yaml - prompt: prompt: "Values should be positive and realistic" # REQUIRED: Natural language criteria model: "anthropic:claude-sonnet-4" # REQUIRED: Model identifier columns_subset: [column1, column2] # OPTIONAL: Columns to validate batch_size: 1000 # OPTIONAL: Rows per batch (default: 1000) max_concurrent: 3 # OPTIONAL: Concurrent API requests (default: 3) pre: | # OPTIONAL: Data preprocessing lambda df: df.filter(condition) thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "AI validation" # OPTIONAL: Step description ``` This validation method uses Large Language Models (LLMs) to validate rows of data based on natural language criteria. Each row becomes a test unit that either passes or fails the validation criteria, producing binary True/False results that integrate with standard Pointblank reporting. **Supported models:** - **Anthropic**: `"anthropic:claude-sonnet-4"`, `"anthropic:claude-opus-4"` - **OpenAI**: `"openai:gpt-4"`, `"openai:gpt-4-turbo"`, `"openai:gpt-3.5-turbo"` - **Ollama**: `"ollama:"` (e.g., `"ollama:llama3"`) - **Bedrock**: `"bedrock:"` **Authentication**: API keys are automatically loaded from environment variables or `.env` files: - **OpenAI**: Set `OPENAI_API_KEY` environment variable or add to `.env` file - **Anthropic**: Set `ANTHROPIC_API_KEY` environment variable or add to `.env` file - **Ollama**: No API key required (runs locally) - **Bedrock**: Configure AWS credentials through standard AWS methods Example `.env` file: ```plaintext ANTHROPIC_API_KEY="your_anthropic_api_key_here" OPENAI_API_KEY="your_openai_api_key_here" ``` **Performance optimization**: The validation process uses row signature memoization to avoid redundant LLM calls. When multiple rows have identical values in the selected columns, only one representative row is validated, and the result is applied to all matching rows. This dramatically reduces API costs and processing time for datasets with repetitive patterns. 
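The caching idea itself is straightforward; here's a rough illustrative sketch of row-signature memoization (hypothetical helper names, not pointblank's actual implementation):

```python
# Illustrative only: validate each distinct combination of the selected
# column values once, then reuse the verdict for identical rows.
def validate_rows(rows, columns_subset, ask_llm):
    cache = {}
    results = []
    for row in rows:
        signature = tuple(row[col] for col in columns_subset)
        if signature not in cache:
            # One LLM call per unique signature, not per row
            cache[signature] = ask_llm(signature)
        results.append(cache[signature])
    return results
```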
Examples: ```yaml # Basic AI validation - prompt: prompt: "Email addresses should look realistic and professional" model: "anthropic:claude-sonnet-4" columns_subset: [email] # Complex semantic validation - prompt: prompt: "Product descriptions should mention the product category and include at least one benefit" model: "openai:gpt-4" columns_subset: [product_name, description, category] batch_size: 500 max_concurrent: 5 # Sentiment analysis - prompt: prompt: "Customer feedback should express positive sentiment" model: "anthropic:claude-sonnet-4" columns_subset: [feedback_text, rating] # Context-dependent validation - prompt: prompt: "For high-value transactions (amount > 1000), a detailed justification should be provided" model: "openai:gpt-4" columns_subset: [amount, justification, approver] thresholds: warning: 0.05 error: 0.15 # Local model with Ollama - prompt: prompt: "Transaction descriptions should be clear and professional" model: "ollama:llama3" columns_subset: [description] ``` **Best practices for AI validation:** - Be specific and clear in your prompt criteria - Include only necessary columns in `columns_subset` to reduce API costs - Start with smaller `batch_size` for testing, increase for production - Adjust `max_concurrent` based on API rate limits - Use thresholds appropriate for probabilistic validation results - Consider cost implications for large datasets - Test prompts on sample data before full deployment **When to use AI validation:** - Semantic checks (e.g., "does the description match the category?") - Context-dependent validation (e.g., "is the justification appropriate for the amount?") - Subjective quality assessment (e.g., "is the text professional?") - Pattern recognition that's hard to express programmatically - Natural language understanding tasks **When NOT to use AI validation:** - Simple numeric comparisons (use `col_vals_gt`, `col_vals_lt`, etc.) - Exact pattern matching (use `col_vals_regex`) - Schema validation (use `col_schema_match`) - Performance-critical validations with large datasets - When deterministic results are required ### Data Quality Methods `col_pct_null`: is the percentage of null values in a column within bounds? ```yaml - col_pct_null: columns: [column_name] # REQUIRED: Column(s) to validate value: 0.05 # REQUIRED: Maximum allowed null fraction thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Null rate check" # OPTIONAL: Step description ``` `data_freshness`: is the data in a date/datetime column recent? ```yaml - data_freshness: columns: [date_column] # REQUIRED: Date/datetime column freshness: "24h" # REQUIRED: Maximum age of data thresholds: # OPTIONAL: Step-level thresholds warning: 0.1 actions: # OPTIONAL: Step-level actions warning: "Custom message" brief: "Data is recent" # OPTIONAL: Step description ``` ### Aggregate Validations Aggregate methods validate column-level statistics (sum, average, standard deviation) against a threshold. 
They follow the pattern `col_{stat}_{comparator}`: ```yaml # Sum validations - col_sum_gt: columns: [revenue] value: 0 brief: "Total revenue is positive" # Average validations - col_avg_le: columns: [rating] value: 5 brief: "Average rating at most 5" # Standard deviation validations - col_sd_lt: columns: [temperature] value: 10 brief: "Temperature variation is bounded" ``` Available aggregate methods: - **Sum**: `col_sum_gt`, `col_sum_lt`, `col_sum_ge`, `col_sum_le`, `col_sum_eq` - **Average**: `col_avg_gt`, `col_avg_lt`, `col_avg_ge`, `col_avg_le`, `col_avg_eq` - **Standard deviation**: `col_sd_gt`, `col_sd_lt`, `col_sd_ge`, `col_sd_le`, `col_sd_eq` All aggregate methods accept these common parameters: `columns`, `value`, `thresholds`, `actions`, `brief`, `active`, and `pre`. ## Column Selection Patterns All validation methods that accept a `columns` parameter support these selection patterns: ```yaml # Single column columns: column_name # Multiple columns as list columns: [col1, col2, col3] # Column selector functions (when used in Python expressions) columns: python: | starts_with("prefix_") # Examples of common patterns columns: [customer_id, order_id] # Specific columns columns: user_email # Single column ``` ## Parameter Details ### Common Parameters These parameters are available for most validation methods: - `columns`: column selection (string, list, or selector expression) - `na_pass`: whether to pass NULL/missing values (boolean, default: false) - `pre`: data preprocessing function (Python lambda expression) - `thresholds`: step-level failure thresholds (dict) - `actions`: step-level failure actions (dict) - `brief`: step description (string, boolean, or template) - `active`: whether the step is active (boolean, default: true) ### Active Parameter The `active` parameter controls whether a validation step runs. It defaults to `true`; set it to `false` to skip a step without removing it from the configuration: ```yaml steps: # This step will be skipped - col_vals_gt: columns: [amount] value: 0 active: false # This step runs normally (default active: true) - col_vals_not_null: columns: [customer_id] ``` ### Brief Parameter Options The `brief` parameter supports several formats: ```yaml brief: "Custom description" # Custom text brief: true # Auto-generated description brief: false # No description brief: "Step {step}: {auto}" # Template with auto-generated text brief: "Column '{col}' validation" # Template with variables ``` template variables: `{step}`, `{col}`, `{value}`, `{set}`, `{pattern}`, `{auto}` ### Python Expressions Several parameters support Python expressions using the `python:` block syntax: ```yaml # Data source loading tbl: python: | pl.scan_csv("data.csv").filter(pl.col("active") == True) # Preprocessing pre: python: | lambda df: df.filter(pl.col("date") >= "2024-01-01") # Custom expressions expr: python: | pl.col("value").is_between(0, 100) # Callable actions actions: error: python: | lambda: print("VALIDATION ERROR: Critical data quality issue detected!") ``` Note: The Python environment in YAML is restricted for security. Only built-in functions (`print`, `len`, `str`, etc.), `Path` from pathlib, and available DataFrame libraries (`pl`, `pd`) are accessible. You cannot import additional modules like `requests`, `logging`, or custom libraries. 
You can also use the shortcut syntax for lambda expressions: ```yaml # Shortcut syntax (equivalent to python: block) pre: | lambda df: df.filter(pl.col("status") == "active") ``` ### Restricted Python Environment For security reasons, the Python environment in YAML configurations is restricted to a safe subset of functionality. The available namespace includes: Built-in functions: - basic types: `str`, `int`, `float`, `bool`, `list`, `dict`, `tuple`, `set` - math functions: `sum`, `min`, `max`, `abs`, `round`, `len` - iteration: `range`, `enumerate`, `zip` - output: `print` Available modules: - `Path` from pathlib for file path operations - `pb` (`pointblank`) for dataset loading and validation functions - `pl` (`polars`) if available on the system - `pd` (`pandas`) if available on the system Restrictions: - cannot import external libraries (`requests`, `logging`, `os`, `sys`, etc.) - cannot use `__import__`, `exec`, `eval`, or other dynamic execution functions - file operations are limited to `Path` functionality **Examples of valid callable actions:** ```yaml # Simple output with built-in functions actions: warning: python: | lambda: print(f"WARNING: {sum([1, 2, 3])} validation issues detected") # Using available variables and string formatting actions: error: python: | lambda: print("ERROR: Data validation failed at " + str(len("validation"))) # Multiple statements in lambda (using parentheses) actions: critical: python: | lambda: ( print("CRITICAL ALERT:"), print("Immediate attention required"), print("Contact data team") )[-1] # Return the last value ``` For complex alerting, logging, or external system integration, use string template actions instead of callable actions, and handle the external communication in your application code after validation completes. ## Best Practices ### Organization - use descriptive `tbl_name` and `label` values - add `brief` descriptions for complex validations - group related validations logically - use consistent indentation and formatting ### Performance - apply `pre` filters early to reduce data volume - order validations from fast to slow - use `columns_subset` for row-based validations when appropriate - consider data source location (local vs. remote) - choose `df_library` based on data size and operations: - `polars`: fastest for large datasets and analytical operations - `pandas`: best for complex transformations and data science workflows - `duckdb`: optimal for analytical queries on very large datasets ### Maintainability - store YAML files in version control - use template variables in actions and briefs - document expected failures with comments - test configurations with `validate_yaml()` before deployment - specify `df_library` explicitly when using library-specific validation expressions - keep DataFrame library choice consistent within related validation workflows ### Error Handling - set appropriate thresholds based on data patterns - use actions for monitoring and alerting - start with conservative thresholds and adjust - consider using `highest_only: false` for comprehensive reporting ## Validation Reports ```{python} #| echo: false #| output: false import pointblank as pb ``` After interrogating your data with a validation plan, Pointblank automatically generates a *validation report*. That tabular report comprehensively summarizes the results of all validation steps. It'll be your primary tool for understanding data quality at a glance, identifying issues, and communicating results to stakeholders.
Validation reports are [Great Tables](https://github.com/posit-dev/great-tables) objects that provide rich information about each validation step. Each report includes identifying information for the step, pass/fail statistics, threshold exceedances, and visual status indicators. The report makes it easy to quickly assess overall data quality and pinpoint specific areas that need attention. ## Viewing the Validation Report The most straightforward way to view a validation report is to simply print the `Validate` object after calling `interrogate()`: ```{python} import pointblank as pb import polars as pl # Sample data data = pl.DataFrame({ "id": range(1, 11), "value": [120, 85, 47, 210, 30, 155, 175, 95, 205, 140], "category": ["A", "B", "C", "A", "D", "B", "A", "E", "A", "C"], "ratio": [0.5, 0.7, 0.3, 1.2, 0.8, 0.9, 0.4, 1.5, 0.6, 0.2], }) # Create and interrogate a validation validation = ( pb.Validate(data=data, tbl_name="sales_data") .col_vals_gt(columns="value", value=50, brief=True) .col_vals_in_set(columns="category", set=["A", "B", "C"], brief=True) .col_exists(columns=["id", "value"], brief=True) .interrogate() ) # Display the validation report validation ``` In a notebook or interactive environment, simply typing the validation object name displays the report automatically. In a script or REPL, you might need to explicitly call `validation.get_tabular_report().show()` to display the table. ::: {.callout-note} You can display a validation report even before calling `interrogate()`. The report will show your validation plan with all the steps you've defined, but it won't contain any interrogation results. Additionally, validation steps that use column selection patterns (like validating multiple columns at once) won't be expanded into individual rows yet, as that expansion happens during interrogation. ::: ## Understanding Report Components The validation report table consists of several key components that work together to provide a complete picture of your data quality: ### Report Header The report header (title and subtitle area) contains important metadata about the validation: - **Title**: by default, shows "Pointblank Validation" but can be customized - **Label**: your custom label for the validation (if provided via the `label=` parameter) - **Table Information**: the table name and type (Polars, Pandas, DuckDB, etc.) - **Thresholds**: the warning, error, and critical threshold values used This header information provides essential context for interpreting the validation results, especially when sharing reports with stakeholders or reviewing historical validations. ### Report Footer The report footer contains several pieces of information that provide context and traceability: **Timestamps**: The footer shows when the interrogation was performed, including the start time, duration, and end time. This helps track when data quality checks were executed, which is especially useful when archiving reports or monitoring data quality over time. **Governance Metadata**: When you provide governance parameters to `Validate`, they are displayed in the footer as well.
This metadata helps document data ownership and dependencies: - **Owner**: who is responsible for the data quality (e.g., `"data-platform-team"`) - **Consumers**: who depends on this data (e.g., `["ml-team", "analytics"]`) - **Version**: the version of the validation plan or data contract (e.g., `"2.1.0"`) Here's an example showing governance metadata in a validation report: ```{python} # Example with governance metadata governance_validation = ( pb.Validate( data=data, tbl_name="sales_data", label="Sales data validation", owner="data-platform-team", consumers=["ml-team", "analytics", "finance"], version="1.2.0", ) .col_vals_gt(columns="value", value=0) .col_vals_in_set(columns="category", set=["A", "B", "C", "D", "E"]) .interrogate() ) governance_validation ``` The governance metadata appears below the timestamps in the footer, making it easy to identify who owns the data, who depends on it, and which version of the validation rules is being applied. ::: {.callout-tip} Governance metadata is particularly useful in enterprise environments where data lineage and accountability are important. By including `owner`, `consumers`, and `version` in your validations, you create self-documenting reports that can be easily understood by anyone reviewing them. ::: ::: {.callout-note} Throughout this documentation, the footer is hidden in example reports for brevity. This is controlled through a global option (see the section on controlling header and footer display later in this guide). In practice, including the footer provides valuable timestamp information for tracking when validations were executed. ::: ### Report Columns The validation report table includes the following columns, each providing specific information about the validation steps: #### Status Indicator (first column, unlabeled) The first column is an unlabeled vertical colored bar that provides instant visual feedback about each step's status: - **Green**: all test units passed the validation - **Light green (semi-transparent)**: some test units failed but no thresholds were exceeded - **Gray**: the 'warning' threshold was exceeded - **Yellow**: the 'error' threshold was exceeded - **Red**: the 'critical' threshold was exceeded This visual indicator allows you to quickly scan the report and identify problem areas. #### Step Number (second column, unlabeled) The second column contains the sequential step number, starting from 1. This number is used when referencing specific steps in other methods like `get_step_report(i=2)` or when extracting data from specific validation steps. #### TYPE The TYPE column displays the validation method name along with an icon that visually represents the type of validation being performed. The validation method indicates what aspect of data quality is being checked, such as: - `col_vals_gt()`: column values greater than - `col_vals_in_set()`: column values in a set - `col_exists()`: column existence check - `rows_distinct()`: row uniqueness check - and many others... When you provide a brief message (via `brief=True` for auto-generated briefs or `brief="custom text"` for custom messages), it appears within the TYPE column below the validation method name. These briefs provide human-readable explanations of what each validation step is checking, making the report more accessible to non-technical stakeholders.
```{python}
# Example showing brief messages in the TYPE column
validation_with_briefs = (
    pb.Validate(data=data, tbl_name="sales_data")
    .col_vals_gt(
        columns="value",
        value=50,
        brief="Sales values should always exceed the $50 threshold"
    )
    .col_vals_in_set(
        columns="category",
        set=["A", "B", "C"],
        brief=True  # Auto-generated brief
    )
    .interrogate()
)

validation_with_briefs
```

In the above report, you'll see the custom brief message appear below the `col_vals_gt` method name in the first step, and an automatically generated brief below `col_vals_in_set` in the second step.

#### COLUMNS

The column(s) being validated in this step. For validation methods that don't target specific columns (like `row_count_match()`), this will show an em dash (—).

#### VALUES

The comparison value(s) or criteria used in the validation. For example:

- for `col_vals_gt(value=100)`, this shows `100`
- for `col_vals_in_set(set=["A", "B", "C"])`, this shows `A | B | C`
- for existence checks, this shows an em dash (—)

#### TBL

Icons indicating whether any preprocessing or segmentation was applied:

- **Table icon**: standard validation on the original data
- **Transformation icon**: preprocessing function was applied via `pre=`
- **Segmentation icon**: data was segmented via `segments=`

These icons help you understand if you're validating transformed or segmented data.

#### EVAL

Indicates whether the validation step was evaluated:

- **Checkmark**: step was successfully evaluated
- **Error icon**: an evaluation error occurred (e.g., column not found)
- **Inactive icon**: step was marked as inactive

This column is crucial for identifying validation steps that couldn't be executed properly.

#### UNITS

The number of units tested in this validation step. A 'test unit' is the atomic unit being validated, which varies by validation type:

- for column value checks: each cell in the target column(s)
- for row checks: each row
- for table checks: typically 1 (the table itself)

This number is formatted with locale-appropriate thousand separators for readability. Also, since space is limited, values are often abbreviated, so a figure like 43,534 will appear as `43.5K`.

#### PASS

The number and fraction of test units that passed the validation, displayed as:

```
n_passed
f_passed
```

For example, the cell with

```
8
0.80
```

means 8 test units passed out of the total, representing an 80% success rate (though `f_passed` is always expressed as a fractional value from `0` to `1`).

#### FAIL

The number and fraction of test units that failed the validation, displayed similarly to PASS:

```
n_failed
f_failed
```

For example, the cell with

```
2
0.20
```

means 2 test units failed, a 20% failure rate expressed as the fractional value `0.20`. Note that this fractional `f_failed` value is what's used to set failure thresholds for the 'warning', 'error', and 'critical' states.

#### W, E, C (Warning, Error, Critical)

Three columns showing whether each threshold level was exceeded for the three different states.

- **Long dash**: threshold wasn't set for a state
- **Empty colored circle**: threshold was set but wasn't exceeded for a given state
- **Filled colored circle**: threshold was set and exceeded

In terms of colors, the 'warning' state is gray, the 'error' state is yellow, and the 'critical' state is red. Having visual indicators makes it easy to identify which validation steps have crossed into warning, error, or critical territory.
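To see these three states in action, here's a minimal sketch that reuses the `data` DataFrame from the first example and sets step-level thresholds with the tuple shorthand (the threshold values here are arbitrary, chosen so that only the 'warning' level is exceeded):

```python
# A sketch: step-level thresholds populate the W, E, and C columns
# (reuses the `data` DataFrame from the first example; values are arbitrary)
threshold_demo = (
    pb.Validate(data=data, tbl_name="sales_data")
    .col_vals_gt(
        columns="value",
        value=50,
        thresholds=(0.1, 0.25, 0.35),  # warning, error, critical
    )
    .interrogate()
)

# Two of the ten test units fail (`f_failed` is 0.20), so the gray 'warning'
# circle is filled while the 'error' and 'critical' circles remain empty
threshold_demo
```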
#### EXT

Indicates whether failing row data was extracted for this step:

- **Em dash (—)**: no extract available
- **Download button**: click to download failing rows as CSV

When extracts are available, you can download them directly from the report for further analysis or to share with data stewards who need to fix the issues.

## Understanding Validation Status

The validation report helps you quickly understand the overall status of your data:

- **All green status indicators**: all validations passed completely
- **Light green indicators**: minor failures below the warning threshold
- **Gray, yellow, or red indicators**: threshold exceedances requiring attention
- **Error icons in the EVAL column**: validation steps that couldn't be evaluated

By scanning the status indicator column, you can immediately identify which validation steps need attention and prioritize your data quality efforts accordingly.

## Customizing the Report Title

You can customize the validation report's title using the `title=` parameter in `get_tabular_report()`. This is particularly useful when generating multiple reports or when you want to provide more context:

```{python}
# Default title
validation.get_tabular_report()
```

```{python}
# Use the table name as the title
validation.get_tabular_report(title=":tbl_name:")
```

```{python}
# Provide a custom title (supports Markdown)
validation.get_tabular_report(title="**Sales Data** Quality Report")
```

```{python}
# No title
validation.get_tabular_report(title=":none:")
```

The title customization options are:

- `":default:"` (default): shows `"Pointblank Validation"`
- `":tbl_name:"`: uses the table name from the `tbl_name=` parameter
- `":none:"`: hides the title completely
- Any string: custom title text (Markdown is supported)

## Customizing with Great Tables

Since the validation report is a Great Tables object, you can leverage the full power of Great Tables to customize its appearance. This allows you to match your organization's branding, highlight specific information, or adjust the presentation for different audiences.

### Guide to Internal Column Names

When working with Great Tables methods to customize the validation report, you'll need to use the *internal column names* rather than the display labels you see in the rendered table. This is because Great Tables operates on the underlying data table structure, where columns have technical names that differ from their user-facing labels.

For example, the step number column in the report (which has no display label) is stored internally as `"i"`, and the `"TYPE"` column is internally named `"type_upd"`. Most Great Tables methods that target specific columns (like `tab_style()`, `cols_width()`, `cols_hide()`, etc.) require these internal names.

Here's the complete mapping from display labels to internal column names:

1. Status indicator (no label): `"status_color"`
2. Step number (no label): `"i"`
3. `TYPE`: `"type_upd"`
4. `COLUMNS`: `"columns_upd"`
5. `VALUES`: `"values_upd"`
6. `TBL`: `"tbl"`
7. `EVAL`: `"eval"`
8. `UNITS`: `"test_units"`
9. `PASS`: `"pass"`
10. `FAIL`: `"fail"`
11. `W`: `"w_upd"`
12. `E`: `"e_upd"`
13. `C`: `"c_upd"`
14. `EXT`: `"extract_upd"`

Always use these internal names when calling Great Tables methods. Using the display labels (like `"TYPE"` or `"UNITS"`) will result in errors since these labels only exist in the rendered output, not in the underlying data structure.

In the examples that follow, you'll see how to use these internal column names to customize various aspects of the validation report.
### Adding Custom Styling

You can apply custom styles to the report table:

```{python}
from great_tables import style, loc

# Get the report as a Great Tables object
report = validation.get_tabular_report()

# Add custom styling using internal column names
report = (
    report
    .tab_style(
        style=style.fill(color="#F0F8FF"),
        locations=loc.body(columns="i")  # Internal name for step number
    )
    .tab_style(
        style=style.text(weight="bold"),
        locations=loc.body(columns="type_upd")  # Internal name for TYPE
    )
)

report
```

### Modifying Column Widths

Adjust column widths to optimize the layout:

```{python}
report = (
    validation
    .get_tabular_report()
    .cols_width(
        cases={
            "status_color": "20px",  # Status indicator column
            "i": "40px",             # Step number column
            "type_upd": "170px",     # TYPE column
            "columns_upd": "100px",  # COLUMNS column
        }
    )
)

report
```

### Hiding Columns

Hide specific columns that aren't relevant for your audience:

```{python}
# Hide the TBL and EVAL columns for a cleaner presentation (using internal names)
report = (
    validation
    .get_tabular_report()
    .cols_hide(columns=["tbl", "eval"])  # Use internal column names
)

report
```

### Adding a Source Note

Add information about the data source or validation context:

```{python}
report = (
    validation
    .get_tabular_report()
    .tab_source_note(
        source_note="Data validated on 2025-10-10 | Production database snapshot"
    )
)

report
```

## Exporting the Report

Great Tables provides multiple export options for sharing validation reports:

```python
# Save as a standalone HTML file
validation.get_tabular_report().write_raw_html("validation_report.html")

# Save as a PNG image
validation.get_tabular_report().save("validation_report.png")

# Open in browser
validation.get_tabular_report().show("browser")
```

## Controlling Header and Footer Display

You can control whether the header and footer appear in the validation report:

```{python}
# Hide the footer
validation.get_tabular_report(incl_footer=False)
```

```{python}
# Hide the header
validation.get_tabular_report(incl_header=False)
```

```{python}
# Hide both
validation.get_tabular_report(incl_header=False, incl_footer=False)
```

You can also set these preferences globally using `pb.config()`:

```python
# Set global preferences
pb.config(report_incl_header=True, report_incl_footer_timings=False)
```

## Best Practices for Validation Reports

Here are some guidelines for creating effective validation reports:

#### 1. Use Descriptive Table Names and Labels

Provide meaningful names and labels to make reports self-documenting:

```python
validation = pb.Validate(
    data=sales_df,
    tbl_name="Q3_2025_sales",
    label="Quarterly sales data validation for financial reporting"
)
```

#### 2. Include Governance Metadata for Accountability

Add ownership and dependency information for enterprise data governance:

```python
validation = pb.Validate(
    data=sales_df,
    tbl_name="Q3_2025_sales",
    label="Quarterly sales data validation",
    owner="data-platform-team",
    consumers=["ml-team", "analytics", "finance"],
    version="2.1.0"
)
```

This creates a clear record of who is responsible for the data, who depends on it, and which version of the validation rules is being applied.

#### 3. Add Brief Messages for Stakeholder Reports

When sharing reports with non-technical stakeholders, always include briefs:

```python
.col_vals_between(
    columns="price",
    left=0,
    right=10000,
    brief="Product prices must be between $0 and $10,000"
)
```

#### 4. Set Appropriate Thresholds
Configure thresholds that align with your data quality requirements:

```python
validation = pb.Validate(
    data=data,
    tbl_name="customer_data",
    thresholds=pb.Thresholds(
        warning=0.01,   # 1% failure triggers warning
        error=0.05,     # 5% failure triggers error
        critical=0.10   # 10% failure triggers critical
    )
)
```

#### 5. Customize for Your Audience

Tailor the report presentation to your audience:

- **Technical teams**: include all columns, show preprocessing indicators
- **Management**: hide technical columns, emphasize status indicators
- **Data stewards**: include extract download buttons, detailed briefs

#### 6. Combine with Other Reporting Tools

Use validation reports alongside other Pointblank features:

- **Step reports**: drill down into specific failing steps with `get_step_report()`
- **Extracts**: use `get_data_extracts()` to get all failing data for analysis
- **Sundered data**: use `get_sundered_data()` to split data into passing/failing sets

#### 7. Archive Reports for Trend Analysis

Save validation reports over time to track data quality trends:

```python
from datetime import datetime

# Save with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
validation.get_tabular_report().write_raw_html(f"validation_report_{timestamp}.html")
```

## Conclusion

The validation report is your primary interface for understanding data quality after running a validation. By providing a comprehensive overview of all validation steps, visual status indicators, and detailed statistics, it enables you to:

- quickly assess overall data quality across multiple dimensions
- identify specific validation steps that need attention
- communicate data quality status to technical and non-technical stakeholders
- track threshold exceedances and their severity levels
- access failing data through extract downloads

Combined with customization options from Great Tables, you can create reports that perfectly match your organization's needs and workflows. Whether you're validating data in an interactive notebook, generating automated quality reports, or presenting findings to stakeholders, the validation report provides the clarity and detail you need to maintain high data quality standards.

### Step Reports

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_footer_timings=False)
```

While validation reports provide a comprehensive overview of all validation steps, sometimes you need to focus on a specific validation step in greater detail. This is where *step reports* come in.

A step report is a detailed examination of a single validation step, providing in-depth information about the test units that were validated and their pass/fail status. Step reports are especially useful when debugging validation failures, investigating problematic data, or communicating detailed findings to colleagues who are responsible for specific data quality issues.
## Creating a Step Report

To create a step report, you first need to run a validation and then use the [`Validate.get_step_report()`](`Validate.get_step_report`) method, specifying which validation step you want to examine:

```{python}
import pointblank as pb
import polars as pl

# Sample data as a Polars DataFrame
data = pl.DataFrame({
    "id": range(1, 11),
    "value": [10, 20, 3, 35, 50, 2, 70, 8, 20, 4],
    "category": ["A", "B", "C", "A", "D", "F", "A", "E", "H", "G"],
    "ratio": [0.5, 0.7, 0.3, 1.2, 0.8, 0.9, 0.4, 1.5, 0.6, 0.2],
    "status": [
        "active", "active", "inactive", "active", "inactive",
        "active", "inactive", "active", "active", "inactive"
    ]
})

# Create a validation
validation = (
    pb.Validate(data=data, tbl_name="example_data")
    .col_vals_gt(columns="value", value=10)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)

# Get step report for the second validation step (i=2)
step_report = validation.get_step_report(i=2)

step_report
```

In this example, we first create and interrogate a validation object with two steps. We then generate a step report for the second validation step (`i=2`), which checks whether the values in the `category` column are in the set `["A", "B", "C"]`.

Note that step numbers in Pointblank start at `1`, matching the step numbers shown in the validation report (i.e., not 0-based indexing). So the first step is referred to with `i=1`, the second step with `i=2`, and so on.

## Understanding Step Report Components

A step report consists of several key components that provide detailed information about the validation step:

1. Header: displays the validation step number, type of validation, and a brief description
2. Table Body: presents either the failing rows, a sample of completely passing data, or an expected/actual comparison (for a [`Validate.col_schema_match()`](`Validate.col_schema_match`) step)

The step report table highlights passing and failing rows, making it easy to identify problematic data points. This is especially useful for diagnosing issues when dealing with large datasets.

## Different Types of Step Reports

It's important to note that step reports vary in appearance and structure depending on the type of validation method used:

- Value-based validations (like [`Validate.col_vals_gt()`](`Validate.col_vals_gt`), [`Validate.col_vals_in_set()`](`Validate.col_vals_in_set`)): show individual rows that failed validation
- Uniqueness checks ([`Validate.rows_distinct()`](`Validate.rows_distinct`)): group together the duplicate records in order of appearance
- Schema validations ([`Validate.col_schema_match()`](`Validate.col_schema_match`)): display column-level information about expected vs. actual data types

Additionally, step reports for value-based validations and uniqueness checks operate in two distinct modes:

1. When errors are present: the report shows only the failing rows and, for value-based validations, clearly highlights the column under study
2. When no errors exist: the report header clearly indicates success, and a sample of the data is shown (along with the studied column highlighted, for value-based validations)

This variation in reporting style allows step reports to effectively communicate the specific type of validation being performed and display relevant information in the most appropriate format. When you're working with different validation types, expect to see different step report layouts optimized for each context.
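To illustrate the second mode, here's a minimal sketch (reusing the `data` DataFrame from above) of a step where every test unit passes; the resulting step report indicates success and shows a sample of the data:

```python
# A sketch: a step report for a step with no failing test units
# (reuses the `data` DataFrame defined above; every `ratio` value is positive)
passing_validation = (
    pb.Validate(data=data, tbl_name="example_data")
    .col_vals_gt(columns="ratio", value=0)
    .interrogate()
)

# The report header indicates success and a sample of rows is shown, with
# the `ratio` column highlighted
passing_validation.get_step_report(i=1)
```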
### Value-Based Validation Step Reports

Value-based step reports focus on showing individual rows where values in the target column failed the validation check. These reports highlight the specific column being validated and clearly display which values violated the condition.

```{python}
# Create sample data with some validation failures
data = pl.DataFrame({
    "id": range(1, 8),
    "value": [120, 85, 47, 210, 30, 10, 5],
    "category": ["A", "B", "C", "A", "D", "B", "E"]
})

# Create a validation with a value-based check
validation_values = (
    pb.Validate(data=data, tbl_name="sales_data")
    .col_vals_gt(
        columns="value",
        value=50,
        brief="Sales values should exceed $50"
    )
    .interrogate()
)

# Display the step report for the value-based validation
validation_values.get_step_report(i=1)
```

This report clearly identifies which rows contain values that don't meet our threshold, making it easy to investigate these specific data points.

### Uniqueness Validation Step Reports

Uniqueness checks produce a different type of step report that groups duplicate records together. This format makes it easy to identify patterns in duplicate data.

```{python}
# Create sample data with some duplicate rows based on the combination of columns
data = pl.DataFrame({
    "customer_id": [101, 102, 103, 101, 104, 105, 102],
    "order_date": [
        "2023-01-15", "2023-01-16", "2023-01-16", "2023-01-15",
        "2023-01-17", "2023-01-18", "2023-01-19"
    ],
    "product": [
        "Laptop", "Phone", "Tablet", "Laptop",
        "Monitor", "Keyboard", "Headphones"
    ]
})

# Create a validation checking for unique customer-product combinations
validation_duplicates = (
    pb.Validate(data=data, tbl_name="order_data")
    .rows_distinct(
        columns_subset=["customer_id", "product"],
        brief="Customer should not order the same product twice"
    )
    .interrogate()
)

# Display the step report for the uniqueness validation
validation_duplicates.get_step_report(i=1)
```

The report organizes duplicate records together, making it easy to see which combinations are repeated and how many times they appear.

### Schema Validation Step Reports

Schema validation step reports have a completely different structure, comparing expected versus actual column data types and presence.

```{python}
schema = pb.Schema(
    columns=[
        ("date_time", "timestamp"),
        ("dates", "date"),
        ("a", "int64"),
        ("b",),
        ("c",),
        ("d", "float64"),
        ("e", ["bool", "boolean"]),
        ("f", "str"),
    ]
)

validation_schema = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="duckdb"),
        tbl_name="small_table",
        label="Step report for a schema check"
    )
    .col_schema_match(schema=schema)
    .interrogate()
)

# Display the step report for the schema validation
validation_schema.get_step_report(i=1)
```

This report style focuses on comparing the expected schema against the actual table structure, highlighting mismatches in data types or missing/extra columns. The table format makes it easy to see exactly where the schema expectations differ from reality.

## Customizing Step Reports

Step reports can be customized with several parameters to better focus your analysis and tailor the output to your specific needs. The [`Validate.get_step_report()`](`Validate.get_step_report`) method offers multiple customization options to help you create more effective reports.

When a dataset has many columns, you might want to focus on just those relevant to your analysis.
You can create a step report containing only a subset of the columns in the target table:

```{python}
validation.get_step_report(
    i=2,
    # Only show these columns ---
    columns_subset=["id", "category", "status"]
)
```

This approach makes step reports much easier to interpret by highlighting just the essential columns that help understand the validation failures.

For large datasets with many failing rows, you might want to use `limit=` to set a cap on the number of rows shown in the report:

```{python}
validation.get_step_report(
    i=2,
    # Only show up to 2 failing rows ---
    limit=2
)
```

The report header can also be extensively customized to provide more specific context. You can replace the default header with plain text or Markdown formatting:

```{python}
validation.get_step_report(
    i=2,
    header="Category Values Validation: *Critical Analysis*"
)
```

For more advanced header customization, you can use the templating system with the `{title}` and `{details}` elements to retain parts of the default header while adding your own content. The `{title}` template is the default title whereas `{details}` provides information on the assertion, number of failures, etc.

Let's move away from the default template of `{title}{details}` and provide a custom title to go with the details text:

```{python}
validation.get_step_report(
    i=2,
    header="Custom Category Validation Report {details}"
)
```

We can keep `{title}` and `{details}` and add some more context in between the two:

```{python}
validation.get_step_report(
    i=2,
    header=(
        "{title}<br>"
        "<br>"
        "This validation is critical for our data quality standards."
        "<br><br>"
        "{details}"
    )
)
```

You could always use more HTML and CSS to do *a lot* of customization:

```{python}
validation.get_step_report(
    i=2,
    header=(
        "VALIDATION SUMMARY\n\n{details}\n\n"
        "<hr>"
        "<div style='color: gray; font-size: smaller;'>"
        "{title}"
        "</div>"
    )
)
```

If you prefer no header at all, simply set `header=None`:

```{python}
validation.get_step_report(
    i=2,
    header=None
)
```

These customization options can be combined to create highly focused reports tailored to specific needs:

```{python}
validation.get_step_report(
    i=2,
    columns_subset=["id", "category"],
    header="*Category Validation:* Top Issues",
    limit=2
)
```

Through these customization options, you can craft step reports that effectively communicate the most important information to different audiences. Technical teams might benefit from seeing all columns but with a limited number of examples. Business stakeholders might prefer a focused view with only the most relevant columns. For documentation purposes, custom headers provide important context about what's being validated.

Remember that customizing your step reports is about more than aesthetics: it's about making complex validation information more accessible and actionable for all stakeholders involved in data quality.

## Using Step Reports for Data Investigation

Step reports can be powerful tools for investigating data quality issues. Let's look at a more complex example:

```{python}
# Create a more complex dataset with multiple issues
complex_data = pl.DataFrame({
    "id": range(1, 11),
    "value": [10, 20, 3, 40, 50, 2, 70, 80, 90, 7],
    "ratio": [0.1, 0.2, 0.3, 1.4, 0.5, 0.6, 0.7, 0.8, 1.2, 0.9],
    "category": ["A", "B", "C", "A", "D", "B", "A", "C", "B", "E"]
})

# Create a validation with multiple steps
validation_complex = (
    pb.Validate(data=complex_data, tbl_name="complex_data")
    .col_vals_gt(columns="value", value=10)
    .col_vals_le(columns="ratio", value=1.0)
    .col_vals_in_set(columns="category", set=["A", "B", "C"])
    .interrogate()
)

# Get step report for the ratio validation (step 2)
ratio_report = validation_complex.get_step_report(i=2)

ratio_report
```

In this example, we're investigating issues with the `ratio` column by generating a step report specifically for that validation step. The step report shows exactly which rows have values that exceed our maximum threshold of `1.0`.

## Combining Step Reports with Extracts

For more advanced analysis, you can extract the data behind a step report into a DataFrame:

```{python}
# Extract the data from the step report
failing_ratios = validation_complex.get_data_extracts(i=2)

failing_ratios
```

This extracts the failing rows from the validation step, which you can then further analyze or fix as needed. Note that the parameter `i=2` corresponds directly to the step number shown in the validation report; it's the same numbering system used for [`Validate.get_step_report()`](`Validate.get_step_report`).

These extracts are particularly valuable for analysts who need to:

- perform additional calculations on problematic data
- feed failing records into correction pipelines
- create visualizations of data patterns that led to validation failures
- export problem records to share with data owners

It's worth noting that the validation report itself includes export buttons on the far right of each row that allow you to download CSV files of the failing data directly. This serves as a convenient delivery mechanism for sharing extracts with colleagues who may not be working in Python, making the validation report not just a visual tool but also a practical means of distributing problematic data for further investigation.
## Step Reports with Segmented Data

When working with segmented validation, step reports become even more valuable as they allow you to investigate issues within specific segments:

```{python}
# Create data with different regions
segmented_data = pl.DataFrame({
    "id": range(1, 10),
    "value": [10, 20, 3, 40, 50, 2, 6, 8, 60],
    "region": [
        "North", "North", "South", "South", "East",
        "East", "West", "West", "West"
    ]
})

# Create a validation with segments
segmented_validation = (
    pb.Validate(data=segmented_data, tbl_name="regional_data")
    .col_vals_gt(
        columns="value",
        value=10,
        segments="region"  # Segment by region
    )
    .interrogate()
)

# Get step report for a specific segment (the 'West' region)
# For segmented validations, each segment gets its own step number
west_report = segmented_validation.get_step_report(i=4)

west_report
```

For segmented validations, each segment is treated as a separate validation step with its own step number. This allows you to investigate issues specific to each data segment using the appropriate step number from the validation report.

## Best Practices for Using Step Reports

Here are some guidelines for effectively using step reports in your data validation workflow:

1. Generate step reports selectively: create reports only for steps that require detailed investigation rather than for all steps
2. Use the `limit=` parameter for large datasets: when working with large datasets, focus only on a subset of failing rows to avoid information overload
3. Share specific step reports with stakeholders: when collaborating with domain experts, share relevant step reports to help them understand and address specific data quality issues (and customize the header to improve clarity)
4. Combine with extracts for deeper analysis: use the [`Validate.get_data_extracts()`](`Validate.get_data_extracts`) method to extract the failing rows for further analysis or correction
5. Document findings from step reports: when you discover patterns or insights from step reports, document them to inform future data quality improvements

Remember that step reports are most valuable when used strategically as part of a broader data quality framework. By following these best practices, you can use step reports not just for troubleshooting, but to develop a deeper understanding of your data's characteristics and quality patterns over time. This approach transforms step reports from simple debugging tools into strategic assets for continuous data quality improvement.

## Conclusion

Step reports provide a focused lens into specific validation steps, allowing you to investigate data quality issues in detail. By generating targeted reports for specific validation steps, you can:

- pinpoint exactly which data points are causing validation failures
- communicate specific issues to relevant stakeholders
- gather insights that might be missed in the aggregate validation report
- track improvements in specific aspects of data quality over time

Whether you're debugging validation failures, investigating edge cases, or communicating specific data quality issues to colleagues, step reports can give you the detailed information you need to understand and resolve data quality problems effectively.

### Data Extracts

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False)
```

When validating data, identifying exactly which rows failed is critical for diagnosing and resolving data quality issues. This is where *data extracts* come in.
Data extracts consist of target table rows containing at least one cell that failed validation. While the validation report provides an overview of pass/fail statistics, data extracts give you the actual problematic records for deeper investigation.

This article will cover:

- which validation methods collect data extracts
- multiple ways to access and work with data extracts
- practical examples of using extracts for data quality improvement
- advanced techniques for analyzing extract patterns

## The Validation Methods that Work with Data Extracts

The following validation methods operate on column values and will have rows extracted when there are failing test units in those rows:

- [`Validate.col_vals_gt()`](`Validate.col_vals_gt`)
- [`Validate.col_vals_lt()`](`Validate.col_vals_lt`)
- [`Validate.col_vals_ge()`](`Validate.col_vals_ge`)
- [`Validate.col_vals_le()`](`Validate.col_vals_le`)
- [`Validate.col_vals_eq()`](`Validate.col_vals_eq`)
- [`Validate.col_vals_ne()`](`Validate.col_vals_ne`)
- [`Validate.col_vals_between()`](`Validate.col_vals_between`)
- [`Validate.col_vals_outside()`](`Validate.col_vals_outside`)
- [`Validate.col_vals_in_set()`](`Validate.col_vals_in_set`)
- [`Validate.col_vals_not_in_set()`](`Validate.col_vals_not_in_set`)
- [`Validate.col_vals_null()`](`Validate.col_vals_null`)
- [`Validate.col_vals_not_null()`](`Validate.col_vals_not_null`)
- [`Validate.col_vals_regex()`](`Validate.col_vals_regex`)
- [`Validate.col_vals_expr()`](`Validate.col_vals_expr`)
- [`Validate.conjointly()`](`Validate.conjointly`)

These row-based validation methods will also have rows extracted should there be failing rows:

- [`Validate.rows_distinct()`](`Validate.rows_distinct`)
- [`Validate.rows_complete()`](`Validate.rows_complete`)

Note that some validation methods like [`Validate.col_exists()`](`Validate.col_exists`) and [`Validate.col_schema_match()`](`Validate.col_schema_match`) don't generate data extracts because they validate structural aspects of the table rather than checking column values.

## Accessing Data Extracts

There are three primary ways to access data extracts in Pointblank:

1. the **CSV** buttons in validation reports
2. through the [`Validate.get_data_extracts()`](`Validate.get_data_extracts`) method
3. inspecting a subset of failed rows in step reports

Let's explore each approach using examples.

### CSV Data from Validation Reports

Data extracts are embedded within validation report tables. Let's look at an example, using the `small_table` dataset, where data extracts are collected in a single validation step due to failing test units:

```{python}
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(columns="d", value=3000)
    .interrogate()
)

validation
```

The single validation step checks whether values in `d` are less than `3000`. Within that column, values range from `108.34` to `9999.99`, so it makes sense that we can see 4 failing test units in the `FAIL` column.

If you look at the far right of the validation report you'll find there's a `CSV` button. Pressing it initiates the download of a CSV file, and that file contains the data extract for this validation step. The `CSV` button only appears when:

1. there is a non-zero number of failing test units
2. the validation step is based on the use of a column-value or a row-based validation method (the methods outlined in the section entitled *The Validation Methods that Work with Data Extracts*)

Access to CSV data for the test unit errors is useful when the validation report is shared with other data quality stakeholders, since it is easily accessible and doesn't require further use of Pointblank. The stakeholder can simply open the downloaded CSV in their preferred spreadsheet software, import it into a different analysis environment like R or Julia, or process it with any tool that supports CSV files. This cross-platform compatibility makes the CSV export particularly valuable in mixed-language data teams where different members might be working with different tools.

### `get_data_extracts()`

For programmatic access to data extracts, Pointblank provides the [`Validate.get_data_extracts()`](`Validate.get_data_extracts`) method. This allows you to work with extract data directly in your Python workflow:

```{python}
# Get data extracts from step 1
extract_1 = validation.get_data_extracts(i=1, frame=True)

extract_1
```

The extracted table is of the same type (a Polars DataFrame) as the target table. Previously we used `load_dataset()` with the `tbl_type="polars"` option to fetch the dataset in that form.

Note these important details about using [`Validate.get_data_extracts()`](`Validate.get_data_extracts`):

- the parameter `i=1` corresponds to the step number shown in the validation report (1-indexed, not 0-indexed)
- setting `frame=True` returns the data as a DataFrame rather than a dictionary (only works when `i` is a single integer)
- the extract includes all columns from the original data, not just the column being validated
- an additional `_row_num_` column is added to identify the original row positions

### Step Reports

Step reports provide another way to access and visualize failing data. When you generate a step report for a validation step that has failing rows, those failing rows are displayed directly in the report:

```{python}
# Get a step report for the first validation step
step_report = validation.get_step_report(i=1)

step_report
```

Step reports offer several advantages for working with data extracts as they:

1. provide immediate visual context by highlighting the specific column being validated
2. format the data for better readability, especially useful when sharing results with colleagues
3. include additional metadata about the validation step and failure statistics

For steps with many failures, you can customize how many rows to display:

```{python}
# Limit to just 2 rows of failing data
limited_report = validation.get_step_report(i=1, limit=2)

limited_report
```

Step reports are particularly valuable when you want to quickly inspect the failing data without extracting it into a separate DataFrame. They provide a bridge between the high-level validation report and the detailed data extracts.

## Viewing Data Extracts with `preview()`{.qd-no-link}

To get a consistent HTML representation of any data extract (regardless of the table type), we can use the `preview()` function:

```{python}
pb.preview(data=extract_1)
```

The view is optimized for readability, with column names and data types displayed in a compact format. Notice that the `_row_num_` column is now part of the table stub and doesn't steal focus from the table's original columns.
The `preview()` function is designed to provide the head and tail (5 rows each) of the table, so very large extracts won't overflow the display.

## Working with Multiple Validation Steps

When validating data with multiple steps, you can extract failing rows from any step or combine extracts from multiple steps:

```{python}
# Create a validation with multiple steps
multi_validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(columns="a", value=3)                                 # Step 1
    .col_vals_lt(columns="d", value=3000)                              # Step 2
    .col_vals_regex(columns="b", pattern="^[0-9]-[a-z]{3}-[0-9]{3}$")  # Step 3
    .interrogate()
)

multi_validation
```

### Extracting Data from a Specific Step

You can access extracts from any specific validation step:

```{python}
# Get extracts from step 2 (`d < 3000` validation)
less_than_failures = multi_validation.get_data_extracts(i=2, frame=True)

less_than_failures
```

Using `frame=True` means that the returned value will be a DataFrame (not a dictionary that contains a single DataFrame). If a step has no failing rows, an empty DataFrame will be returned:

```{python}
# Get extracts from step 3 (regex check)
regex_failures = multi_validation.get_data_extracts(i=3, frame=True)

regex_failures
```

### Getting All Extracts at Once

To retrieve extracts from all steps with failures in one command:

```{python}
# Get all extracts
all_extracts = multi_validation.get_data_extracts()

# Display the step numbers that have extracts
print(f"Steps with data extracts: {list(all_extracts.keys())}")
```

A dictionary of DataFrames is returned, and only steps with failures will appear in this dictionary.

### Getting Specific Extracts

You can also retrieve data extracts from several specified steps as a dictionary:

```{python}
# Get extracts from steps 1 and 2 as a dictionary
extract_dict = multi_validation.get_data_extracts(i=[1, 2])

# The keys are the step numbers
print(f"Dictionary keys: {list(extract_dict.keys())}")

# Get the number of failing rows in each extract
for step, extract in extract_dict.items():
    print(f"Step {step}: {len(extract)} failing rows")
```

Note that `frame=True` cannot be used when retrieving multiple extracts.

## Applications of Data Extracts

Once you have extracted the failing data, there are numerous ways to analyze and use this information to improve data quality. Let's explore some practical applications.
### Finding Patterns Across Validation Steps

You can analyze patterns across different validation steps by combining extracts:

```{python}
# Get a consolidated view of all rows that failed any validation
all_failure_rows = set()

for step, extract in all_extracts.items():
    if len(extract) > 0:
        all_failure_rows.update(extract["_row_num_"])

print(f"Total unique rows with failures: {len(all_failure_rows)}")
print(f"Row numbers with failures: {sorted(all_failure_rows)}")
```

### Identifying Rows with Multiple Failures

You might want to find rows that failed multiple validation checks, as these often represent more serious data quality issues:

```{python}
# Get row numbers from each extract
step1_rows = set(multi_validation.get_data_extracts(i=1, frame=True)["_row_num_"])
step2_rows = set(multi_validation.get_data_extracts(i=2, frame=True)["_row_num_"])

# Find rows that failed both validations
common_failures = step1_rows.intersection(step2_rows)

print(f"Rows failing both step 1 and step 2: {common_failures}")
```

### Statistical Analysis of Failing Values

Once you have data extracts, you can perform statistical analysis to identify patterns in the failing data:

```{python}
# Get extracts from step 2
d_value_failures = multi_validation.get_data_extracts(i=2, frame=True)

# Basic statistical analysis of the failing values
if len(d_value_failures) > 0:
    print(f"Min failing value: {d_value_failures['d'].min()}")
    print(f"Max failing value: {d_value_failures['d'].max()}")
    print(f"Mean failing value: {d_value_failures['d'].mean()}")
```

These analysis techniques help you thoroughly investigate data quality issues by examining failing data from multiple perspectives. Rather than treating failures as isolated incidents, you can identify patterns that might indicate systematic problems in your data pipeline.

### Detailed Analysis with `col_summary_tbl()`{.qd-no-link}

For a more comprehensive view of the statistical properties of your extract data, you can use the `col_summary_tbl()` function:

```{python}
# Get extracts from step 2
d_value_failures = multi_validation.get_data_extracts(i=2, frame=True)

# Generate a comprehensive statistical summary of the failing data
pb.col_summary_tbl(d_value_failures)
```

This statistical overview provides:

1. a count of values (including missing values)
2. type information for each column
3. distribution metrics like min, max, mean, and quartiles for numeric columns
4. frequency of common values for categorical columns
5. missing value counts and proportions

Using `col_summary_tbl()` on data extracts lets you quickly understand the characteristics of failing data without writing custom analysis code. This approach is particularly valuable when:

- you need to understand the statistical properties of failing records
- you want to compare distributions of failing vs. passing data
- you're looking for anomalies or unexpected patterns within the failing rows

For example, if values failing a validation check are concentrated at certain quantiles or have an unusual distribution shape, this might indicate a systematic data collection or processing issue rather than random errors.

## Using Extracts for Data Quality Improvement

Data extracts are especially valuable for:

1. **Root Cause Analysis**: examining the full context of failing rows to understand why they failed
2. **Data Cleaning**: creating targeted cleanup scripts that focus only on problematic records
3. **Feedback Loops**: sharing specific examples with data providers to improve upstream quality
4. **Pattern Recognition**: identifying systemic issues by analyzing groups of failing records

Here's an example of using extracts to create a corrective action plan:

```{python}
import polars as pl

# Create a new sample of an extract DF
sample_extract = pl.DataFrame({
    "id": range(1, 11),
    "value": [3500, 4200, 3800, 9800, 5500, 7200, 8300, 4100, 7600, 3200],
    "category": ["A", "B", "A", "C", "B", "A", "C", "B", "A", "B"],
    "region": [
        "South", "South", "North", "East", "South",
        "South", "East", "South", "West", "South"
    ]
})

# Identify which regions have the most failures
region_counts = (
    sample_extract
    .group_by("region")
    .agg(pl.len().alias("failure_count"))
    .sort("failure_count", descending=True)
)

region_counts
```

Analysis shows that 6 out of 10 failing records (60%) are from the `"South"` region, making it the highest priority area for data quality investigation. This suggests a potential systemic issue with data collection or processing in that specific region.

## Best Practices for Working with Data Extracts

When incorporating data extracts into your data quality workflow:

1. Use extracts for investigation, not just reporting: the real value is in the insights you gain from analyzing the problematic data
2. Combine with other Pointblank features: data extracts work well with step reports and can inform threshold settings for future validations
3. Consider sampling for very large datasets: if your extracts contain thousands of rows, focus your investigation on a representative sample
4. Look beyond individual validation steps: cross-reference extracts from different steps to identify complex issues that span multiple validation rules
5. Document patterns in failing data: record and share insights about common failure modes to build organizational knowledge about data quality issues

By integrating these practices into your data validation workflow, you'll transform data extracts from simple error lists into powerful diagnostic tools. The most successful data quality initiatives treat extracts as the starting point for investigation rather than the end result of validation. When systematically analyzed and documented, patterns in failing data can reveal underlying issues in data systems, collection methods, or business processes that might otherwise remain hidden.

Remember that the ultimate goal isn't just to identify problematic records, but to use that information to implement targeted improvements that prevent similar issues from occurring in the future.

## Conclusion

Data extracts bridge the gap between high-level validation statistics and the detailed context needed to fix data quality issues. By providing access to the actual failing records, Pointblank enables you to:

- pinpoint exactly which data points caused validation failures
- understand the full context around problematic values
- develop targeted strategies for data cleanup and quality improvement
- communicate specific examples to stakeholders

Whether you're accessing extracts through CSV downloads, the [`Validate.get_data_extracts()`](`Validate.get_data_extracts`) method, or step reports, this feature provides the detail needed to move from identifying problems to implementing solutions.

### Sundering Validated Data

```{python}
#| echo: false
#| output: false
import pointblank as pb

pb.config(report_incl_header=False, report_incl_footer_timings=False)
```

Sundering data? First off, let's get the correct meaning across here. Sundering is really just splitting, dividing, cutting into two pieces.
And it's a useful thing we can do in Pointblank to any data that we are validating. When you interrogate the data, you learn about which rows have test failures within them. With more validation steps, we get an even better picture of this simply by virtue of more testing.

The power of sundering lies in its ability to separate your data into two distinct categories:

1. rows that pass all validation checks (clean data)
2. rows that fail one or more validation checks (problematic data)

This approach allows you to:

- focus your analysis on clean, reliable data
- isolate problematic records for investigation or correction
- create pipelines that handle good and bad data differently

Let's use the `small_table` dataset in our examples to show just how sundering is done. Here's that table:

```{python}
#| echo: false
pb.preview(pb.load_dataset(dataset="small_table"), n_head=20, n_tail=20)
```

## A Simple Example Where Data is Torn Asunder

We'll begin with a very simple validation plan, having only a single step. There *will be* failing test units here.

```{python}
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_ge(columns="d", value=1000)
    .interrogate()
)

validation
```

We see six failing test units in the `FAIL` column of the above validation report table. There is a data extract (collection of failing rows) available. Let's use the [`Validate.get_data_extracts()`](`Validate.get_data_extracts`) method to have a look at it.

```{python}
validation.get_data_extracts(i=1, frame=True)
```

These six rows of data had failing test units in column `d`. Indeed, we can see that all values in that column are less than `1000` (and we asserted that values should be greater than or equal to `1000`). This is the 'bad' data, if you will. Using the [`Validate.get_sundered_data()`](`Validate.get_sundered_data`) method, we get the 'good' part:

```{python}
validation.get_sundered_data()
```

This is a Polars DataFrame of seven rows. All values in `d` were passing test units (i.e., fulfilled the expectation outlined in the validation step) and, in many ways, this is like a 'good extract'.

You can always collect the failing rows with [`Validate.get_sundered_data()`](`Validate.get_sundered_data`) by using the `type="fail"` option. Let's try that here:

```{python}
validation.get_sundered_data(type="fail")
```

It gives us the same rows as in the DataFrame obtained from using `validation.get_data_extracts(i=1, frame=True)`.

Two important things to know about [`Validate.get_sundered_data()`](`Validate.get_sundered_data`) concern the table rows returned from `type="pass"` (the default) and `type="fail"`:

- the sum of rows across the two returned tables is equal to the row count of the original table
- the rows in each split table are mutually exclusive (i.e., you won't find the same row in both)

You can think of sundered data as a filtered version of the original dataset based on validation results. While the simple example illustrates how this process works on a basic level, the value of the method is better seen in a slightly more complex example.

## Using `get_sundered_data()` with a More Comprehensive Validation

The previous example used exactly one validation step. You're likely to use more than that in standard practice, so let's see how [`Validate.get_sundered_data()`](`Validate.get_sundered_data`) works in those common situations.
Here's a validation with three steps:

```{python}
validation_2 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_ge(columns="d", value=1000)
    .col_vals_not_null(columns="c")
    .col_vals_gt(columns="a", value=2)
    .interrogate()
)

validation_2
```

There are quite a few failures here across the three validation steps. In the `FAIL` column of the validation report table, there are 12 failing test units if we were to tally them up. So if the input table has 13 rows in total, does this mean there would be one row in the table returned by [`Validate.get_sundered_data()`](`Validate.get_sundered_data`)? Not so:

```{python}
validation_2.get_sundered_data()
```

There are four rows. This is because the different validation steps tested values in different columns of the table; some rows must have had failing test units in more than one of the tested columns. The rows that didn't have any failing test units across the three different tests (in three different columns) are the ones seen above. This brings us to the third important thing about the sundering process:

- a row is returned in the 'passing' set only when it has no test-unit failures across all validation steps; all other rows are placed in the 'failing' set

In validations where many validation steps are used, we can be more confident about the level of data quality for those rows returned in the passing set. But not every type of validation step is considered within this splitting procedure. The next section will explain the rules on that.

## The Validation Methods Considered When Sundering

The sundering procedure relies on the use of row-level validation methods. This makes sense as it's impossible to judge the quality of a row when using the [`Validate.col_exists()`](`Validate.col_exists`) validation method, for example. Luckily, we have many row-level validation methods; here's a list:

- [`Validate.col_vals_gt()`](`Validate.col_vals_gt`)
- [`Validate.col_vals_lt()`](`Validate.col_vals_lt`)
- [`Validate.col_vals_ge()`](`Validate.col_vals_ge`)
- [`Validate.col_vals_le()`](`Validate.col_vals_le`)
- [`Validate.col_vals_eq()`](`Validate.col_vals_eq`)
- [`Validate.col_vals_ne()`](`Validate.col_vals_ne`)
- [`Validate.col_vals_between()`](`Validate.col_vals_between`)
- [`Validate.col_vals_outside()`](`Validate.col_vals_outside`)
- [`Validate.col_vals_in_set()`](`Validate.col_vals_in_set`)
- [`Validate.col_vals_not_in_set()`](`Validate.col_vals_not_in_set`)
- [`Validate.col_vals_null()`](`Validate.col_vals_null`)
- [`Validate.col_vals_not_null()`](`Validate.col_vals_not_null`)
- [`Validate.col_vals_regex()`](`Validate.col_vals_regex`)
- [`Validate.col_vals_expr()`](`Validate.col_vals_expr`)
- [`Validate.rows_distinct()`](`Validate.rows_distinct`)
- [`Validate.rows_complete()`](`Validate.rows_complete`)
- [`Validate.conjointly()`](`Validate.conjointly`)

This is the same list of validation methods that are considered when creating data extracts.

There are some additional caveats though. Even if using a validation method drawn from the set above, the validation step won't be used for sundering if:

- the `active=` parameter for that step has been set to `False`
- the `pre=` parameter has been used

The first one makes intuitive sense (you decided to skip this validation step entirely), but the second one requires some explanation.
Using `pre=` allows you to modify the target table, and there's no easy or practical way to map rows in a mutated table back to the original table (e.g., a mutation may drastically reduce the number of rows).

## Practical Applications of Sundering

### 1. Creating Clean Datasets for Analysis

One of the most common use cases for sundering is preparing validated data for downstream analysis:

```{python}
# Comprehensive validation for analysis-ready data
analysis_validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_not_null(columns=["a", "b", "c", "d", "e", "f"])  # No missing values
    .col_vals_gt(columns="a", value=0)                          # Positive values only
    .col_vals_lt(columns="d", value=10000)                      # No extreme outliers
    .interrogate()
)

# Extract only the clean data that passed all checks
clean_data = analysis_validation.get_sundered_data(type="pass")

# Use the clean data for your analysis
pb.preview(clean_data)
```

This approach ensures that any subsequent analysis is based on data that meets your quality standards, reducing the risk of misleading results or spurious conclusions due to problematic records. By making validation an explicit step in your analytical workflow, you create a natural quality gate that prevents invalid data from influencing your findings.

### 2. Creating Parallel Workflows for Clean and Problematic Data

You can use sundering to create parallel processing paths:

```{python}
# Get both clean and problematic data
clean_data = analysis_validation.get_sundered_data(type="pass")
problem_data = analysis_validation.get_sundered_data(type="fail")

# Process clean data (in real applications, you'd do more here)
print(f"Clean data size: {len(clean_data)} rows")

# Log problematic data for investigation
print(f"Problematic data size: {len(problem_data)} rows")
```

This approach enables you to build robust data processing pathways with separate handling for clean and problematic data. In production environments, you could save problematic records to a separate location for further investigation, generate detailed logs of validation failures, and trigger automated notifications to data stewards when issues arise. By establishing clear protocols for handling both data streams, you create a systematic approach to data quality that balances immediate analytical needs with longer-term data improvement goals.

### 3. Data Quality Monitoring and Improvement

Tracking the ratio of passing to failing rows over time can help monitor data quality trends:

```{python}
# Calculate data quality metrics
total_rows = len(pb.load_dataset(dataset="small_table"))
passing_rows = len(clean_data)
quality_score = passing_rows / total_rows

print(f"Data quality score: {quality_score:.2%}")
print(f"Passing rows: {passing_rows} out of {total_rows}")
```

By tracking these metrics over time, you can measure the impact of your data quality improvement efforts and communicate progress to stakeholders. This approach transforms sundering from a one-time filtering tool into an ongoing data quality management system, where improving the ratio of passing rows becomes a measurable business objective aligned with broader data governance goals.

## Best Practices for Using Sundered Data

When incorporating data sundering into your workflow, consider these best practices:

1. Be comprehensive in your validation: the more validation steps you include (assuming they're meaningful), the more confidence you can have in your passing dataset
2. Document your validation criteria: when sharing sundered data with others, always document the criteria used to determine passing rows
3. Consider traceability: for audit purposes, it may be valuable to add a column indicating whether a record was originally in the passing or failing set
4. Balance strictness and practicality: if you're too strict with validation rules, you might end up with very few passing rows; consider the appropriate level of strictness for your use case
5. Use sundering as part of a pipeline: automate the process of validation, sundering, and subsequent handling of the two resulting datasets
6. Continually refine validation rules: as you learn more about your data and domain, update your validation rules to improve the accuracy of your sundering process

By following these best practices, data scientists and engineers can transform sundering from a simple utility into a strategic component of their data quality framework. When implemented thoughtfully, sundering enables a shift from reactive data cleaning to proactive quality management, where validation criteria evolve alongside your understanding of the data. The ultimate goal isn't just to separate good data from bad, but to gradually improve your entire dataset over time by addressing the root causes of validation failures that appear in the failing set. This approach turns data validation from a gatekeeper function into a continuous improvement process.

## Conclusion

Data sundering provides a powerful way to separate your data based on validation results. While the concept is simple (splitting data into passing and failing sets), the feature can be very useful in many data workflows. By integrating sundering into your data pipeline, you can:

- ensure that downstream analysis only works with validated data
- create focused datasets for different purposes
- improve overall data quality through systematic identification and isolation of problematic records
- build more robust data pipelines that explicitly handle data quality issues

So long as you're aware of the rules and limitations of sundering, you're likely to find it to be a simple and useful way to filter your input table on the basis of a validation plan, turning data validation from a passive reporting tool into an active component of your data processing workflow.

### Previewing Data

```{python}
#| echo: false
#| output: false
import pointblank as pb
```

In many cases, it's *good* to look at your data tables. Before validating a table, you'll likely want to inspect a portion of it before diving into the creation of data-quality rules. This is pretty easily done with Polars and Pandas DataFrames; however, it's not as easy with database tables, and each table backend displays things differently. To make this common task a little better, you can use the `preview()` function in Pointblank. It has been designed to work with every table that the package supports (i.e., DataFrames and Ibis-backend tables, the latter of which are largely database tables). Plus, what's shown in the output is consistent, no matter what type of data you're looking at.

## Viewing a Table with `preview()`{.qd-no-link}

Let's look at how `preview()` works. It requires only a table; for this first example, let's use the `nycflights` dataset:

```{python}
import pointblank as pb

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars")

pb.preview(nycflights)
```

This is an HTML table using the style of the other reporting tables in the library.
The header is more minimal here, only showing the type of table we're looking at (`POLARS` in this case) along with the table dimensions. The column headers provide both the column names and the column data types. By default, we're getting the first five rows and the last five rows. Row numbers (from the original dataset) provide an indication of which rows are the head and tail rows. The blue lines provide additional demarcation of the column containing the row numbers and the head and tail row groups. Finally, any cells with missing values are prominently styled with red lettering and a lighter red background.

If you'd rather not see the row numbers in the table, you can use the `show_row_numbers=False` option. Let's try that with the `game_revenue` dataset as a DuckDB table:

```{python}
game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")

pb.preview(game_revenue, show_row_numbers=False)
```

With the above preview, the row numbers are gone. The horizontal blue line still serves to divide the top and bottom rows of the table, however.

## Adjusting the Number of Rows Shown

It could be that displaying the top five and bottom five rows is not preferred. This can be changed with the `n_head=` and `n_tail=` arguments. Maybe you want three rows from the top along with just the last row? Let's try that out with the `small_table` dataset as a Pandas DataFrame:

```{python}
small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")

pb.preview(small_table, n_head=3, n_tail=1)
```

If you're looking at a small table and want to see the entirety of it, you can enlarge the `n_head=` and `n_tail=` values:

```{python}
small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")

pb.preview(small_table, n_head=10, n_tail=10)
```

Given that the table has 13 rows, asking for 20 rows to be displayed effectively shows the entire table.

## Previewing a Subset of Columns

The preview scales well to tables that have many columns by allowing for a horizontal scroll. However, previewing data from all columns can be impractical if you're only concerned with a key set of them. To preview only a subset of a table's columns, we can use the `columns_subset=` argument. Let's do this with the `nycflights` dataset and provide a list of six columns from that table.

```{python}
pb.preview(
    nycflights,
    columns_subset=["hour", "minute", "sched_dep_time", "year", "month", "day"]
)
```

What we see are the six columns we specified from the `nycflights` dataset. Note that the columns are displayed in the order provided in the `columns_subset=` list. This can be useful for making quick, side-by-side comparisons. In the example above, we placed `hour` and `minute` next to the `sched_dep_time` column. In the original dataset, `sched_dep_time` is far apart from the other two columns, but it's useful to have them next to each other in the preview since `hour` and `minute` are derived from `sched_dep_time` (and this lets us spot-check any issues).

We can also use column selectors within `columns_subset=`. Suppose we want to only see those columns that have `"dep_"` or `"arr_"` in the name. To do that, we use the `matches()` column selector function:

```{python}
pb.preview(nycflights, columns_subset=pb.matches("dep_|arr_"))
```

Several selectors can be combined together through use of the `col()` function and operators such as `&` (*and*), `|` (*or*), `-` (*difference*), and `~` (*not*).
Let's look at a column selection case where:

- the first three columns are selected
- all columns containing `"dep_"` or `"arr_"` are selected
- any columns beginning with `"sched"` are omitted

This is how we put that together within `col()`:

```{python}
pb.preview(
    nycflights,
    columns_subset=pb.col((pb.first_n(3) | pb.matches("dep_|arr_")) & ~ pb.starts_with("sched"))
)
```

This gives us a preview with only the columns that fit the specific selection rules. Incidentally, using selectors with a dataset through `preview()` is a good way to test out the use of selectors more generally. Since they are primarily used to select columns for validation, trying them beforehand with `preview()` can help verify that your selection logic is sound.

### Column Summaries

```{python}
#| echo: false
#| output: false
import pointblank as pb
```

While previewing a table with `preview()` is undoubtedly a good thing to do, sometimes you need more. This is where summarizing a table comes in. When you view a summary of a table, the column-by-column info can quickly increase your understanding of a dataset. Plus, it allows you to quickly catch anomalies in your data (e.g., the maximum value of a column could be far outside the realm of possibility).

Pointblank provides a function to make it extremely easy to view column-level summaries in a single table. That function is called `col_summary_tbl()` and, just like `preview()`, it supports any table that Pointblank can use for validation. And no matter what the input data is, the resultant reporting table is consistent in its design and construction.

## Trying out `col_summary_tbl()`{.qd-no-link}

The function only requires a table. Let's use the `small_table` dataset (a very simple table) to start us off:

```{python}
import pointblank as pb

small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

pb.col_summary_tbl(small_table)
```

The header provides the type of table we're looking at (`POLARS`, since this is a Polars DataFrame) and the table dimensions. The rest of the table focuses on the column-level summaries. As such, each row represents a summary of a column in the `small_table` dataset.

There's a lot of information in this summary table to digest. Some of it is intuitive since this sort of table summarization isn't all that uncommon, but other aspects could give you pause. So we'll carefully wade through how to interpret this report.

## Data Categories in the Column Summary Table

On the left side of the table are icons of different colors. These represent categories that the columns fall into. There are only five categories, and each column can belong to only one of them. The categories (and their letter marks) are:

- `N`: numeric
- `S`: string-based
- `D`: date/datetime
- `T/F`: boolean
- `O`: object

The numeric category (`N`) takes data types such as floats and integers. The `S` category is for string-based columns. Date or datetime values are lumped into the `D` category. Boolean columns (`T/F`) have their own category and are *not* considered numeric (e.g., `0`/`1`). The `O` category is a catchall for all other types of columns.

Given the disparity of these categories and that we want them in the same table, some statistical measures will be sensible for certain column categories but not for others. With that in mind, we'll explain how each category is represented in the column summary table.

## Numeric Data

Three columns in `small_table` are numeric: `a` (`Int64`), `c` (`Int64`), and `d` (`Float64`).
The common measures of the missing count/proportion (`NA`) and the unique value count/proportion (`UQ`) are provided for the numeric data type. For these two measures, the top number is the absolute count (of missing values and unique values, respectively). The bottom number is that count divided by the row count, which makes each proportion a value between `0` and `1` (bounds included). The next two columns represent the mean (`Mean`) and the standard deviation (`SD`). The minimum (`Min`), maximum (`Max`), and a set of quantiles occupy the next few columns (including `P5`, `Q1`, `Med` for median, `Q3`, and `P95`). Finally, the interquartile range (`IQR`: `Q3` - `Q1`) is the last measure provided.

## String Data

String data is present in `small_table`, in columns `b` and `f`. The missing value (`NA`) and uniqueness (`UQ`) measures are accounted for here. The statistical measures are all based on string lengths: every string in a column is converted to its length, and a subset of the statistics is presented on those numeric values. To avoid some understandable confusion when reading the table, the cells holding these values are annotated with the text `"SL"` (for string length). It makes less sense to provide a full suite of quantile values here, so only the minimum (`Min`), median (`Med`), and maximum (`Max`) are provided.

## Date/Datetime Data and Boolean Data

We see that in the first two rows of our summary table there are summaries of the `date_time` and `date` columns. The summaries we provide for a date/datetime category (notice the green `D` to the left of the column names) are:

1. the missing count/proportion (`NA`)
2. the unique value count/proportion (`UQ`)
3. the minimum and maximum dates/datetimes

One column, `e`, is of the `Boolean` type. Because columns of this type can only have `True`, `False`, or missing values, we provide summary data for missingness (under `NA`) and proportions of `True` and `False` values (under `UQ`).

### Missing Values Reporting

```{python}
#| echo: false
#| output: false
import pointblank as pb
```

Sometimes values just aren't there: they're missing. This can either be expected or another thing to worry about. Either way, we can dig a little deeper if need be and use the `missing_vals_tbl()` function to generate a summary table that can elucidate how many values are missing, and roughly where.

## Using and Understanding `missing_vals_tbl()`{.qd-no-link}

The missing values table is arranged a lot like the column summary table (generated via the `col_summary_tbl()` function) in that columns of the input table are arranged as rows in the reporting table. Let's use `missing_vals_tbl()` on the `nycflights` dataset, which has a lot of missing values:

```{python}
import pointblank as pb

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="polars")

pb.missing_vals_tbl(nycflights)
```

There are 18 columns in `nycflights` and they're arranged down the missing values table as rows. To the right, we see column headers indicating 10 columns that represent row sectors. Row sectors are groups of rows, and each sector contains a tenth of the total rows in the table. The leftmost sectors correspond to rows at the top of the table, whereas the sectors on the right are closer to the bottom. If you'd like to know which rows make up each row sector, there are details on this in the table footer area (click the `ROW SECTORS` text or the disclosure triangle).

Now that we know about row sectors, we need to understand the visuals here.
A light blue cell indicates there are no (`0`) missing values within a given row sector of a column. For `nycflights` we can see that several columns have no missing values at all (i.e., the light blue color makes up the entire row in the missing values table). When there are missing values in a column's row sector, you'll be met with a grayscale color. The proportion of missing values corresponds to the color ramp from light gray to solid black.

Interestingly, most of the columns that have missing values appear to be related to each other in terms of the extent of missing values (i.e., the appearance in the reporting table looks roughly the same, indicating a sort of systematic missingness). These columns are `dep_time`, `dep_delay`, `arr_time`, `arr_delay`, and `air_time`. The odd column out with regard to the distribution of missing values is `tailnum`. By scanning that row and observing that the grayscale color values are all a little different, we see that the degree of missingness is more variable and not related to the other columns containing missing values.

## Missing Value Tables from the Other Datasets

The `small_table` dataset has only 13 rows. Let's use that as a Pandas DataFrame with `missing_vals_tbl()`:

```{python}
small_table = pb.load_dataset(dataset="small_table", tbl_type="pandas")

pb.missing_vals_tbl(small_table)
```

It appears that only column `c` has missing values. And since the table is very small in terms of row count, most of the row sectors contain only a single row.

The `game_revenue` dataset has *no* missing values. And this can be easily proven by using `missing_vals_tbl()` with it:

```{python}
game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="duckdb")

pb.missing_vals_tbl(game_revenue)
```

We see nothing but light blue in this report! The header also indicates that there are no missing values by displaying a large green check mark (the other report tables provided a count of total missing values across all columns).

### Test Data Generation

```{python}
#| echo: false
#| output: false
import pointblank as pb
pb.config(report_incl_footer_timings=False)
```

Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.

::: {.callout-note}
Throughout this guide, we use `pb.preview()` to display generated datasets with nice HTML formatting. This is optional: `pb.generate_dataset()` returns a standard DataFrame that you can display or manipulate however you prefer.
:::

## Quick Start

Generate test data using a schema with field constraints:

```{python}
# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

## Field Types

Pointblank provides helper functions for defining typed columns with constraints:

| Function | Description | Key Parameters |
|----------|-------------|----------------|
| `int_field()` | Integer columns | `min_val`, `max_val`, `allowed`, `unique` |
| `float_field()` | Float columns | `min_val`, `max_val`, `allowed` |
| `string_field()` | String columns | `preset`, `pattern`, `allowed`, `unique` |
| `bool_field()` | Boolean columns | `p_true` (probability of True) |
| `date_field()` | Date columns | `min_date`, `max_date` |
| `datetime_field()` | Datetime columns | `min_date`, `max_date` |
| `time_field()` | Time columns | `min_val`, `max_val` |
| `duration_field()` | Duration columns | `min_val`, `max_val` |
| `profile_fields()` | Bundled person-profile fields | `set`, `split_name`, `include`, `exclude`, `prefix` |

### Integer Fields

Integer fields support range constraints with `min_val` and `max_val`, discrete allowed values with `allowed`, and uniqueness enforcement with `unique=True`:

```{python}
schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    quantity=pb.int_field(min_val=1, max_val=100),
    rating=pb.int_field(allowed=[1, 2, 3, 4, 5]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

The `unique=True` constraint ensures no duplicate values appear in that column, which is useful for generating primary keys or identifiers.

### Float Fields

Float fields work similarly to integers, with `min_val` and `max_val` defining the range of generated values:

```{python}
schema = pb.Schema(
    price=pb.float_field(min_val=0.0, max_val=1000.0),
    discount=pb.float_field(min_val=0.0, max_val=0.5),
    temperature=pb.float_field(min_val=-40.0, max_val=50.0),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.

### String Fields with Presets

Presets generate realistic data like names, emails, and addresses. When you include related fields like `name` and `email` in the same schema, Pointblank ensures **coherence** (e.g., the email address will be derived from the person's name), making the generated data more realistic:

```{python}
schema = pb.Schema(
    full_name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

This coherence extends to other related fields like `user_name`, which will also reflect the person's name when included alongside name and email fields.
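For instance, here's a minimal sketch (using only presets documented on this page) that puts `name`, `email`, and `user_name` in one schema; each generated row should then have an email and a username that both echo the person's name:

```python
# Sketch: name, email, and username should all agree within each row
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    user_name=pb.string_field(preset="user_name"),
)

# Inspect a sample to confirm the name/email/username linkage
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```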
### String Fields with Patterns

Use regex patterns to generate strings matching specific formats:

```{python}
schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    phone=pb.string_field(pattern=r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"),
    hex_color=pb.string_field(pattern=r"#[0-9A-F]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.

### Boolean Fields

Control the probability of `True` values:

```{python}
schema = pb.Schema(
    is_active=pb.bool_field(p_true=0.8),   # 80% True
    is_premium=pb.bool_field(p_true=0.2),  # 20% True
    is_verified=pb.bool_field(),           # 50% True (default)
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.

### Date and Datetime Fields

Temporal fields accept Python `date` and `datetime` objects for their range boundaries, generating values uniformly distributed within the specified period:

```{python}
from datetime import date, datetime

schema = pb.Schema(
    birth_date=pb.date_field(
        min_date=date(1960, 1, 1),
        max_date=date(2005, 12, 31)
    ),
    created_at=pb.datetime_field(
        min_date=datetime(2024, 1, 1),
        max_date=datetime(2024, 12, 31)
    ),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

The same pattern applies to `time_field()` and `duration_field()`, allowing you to generate realistic temporal data for any use case.

## Available Presets

The `preset=` parameter in `string_field()` supports many data types:

**Personal Data:**

- `name`: full name (first + last)
- `name_full`: full name with optional prefix/suffix (e.g., "Dr. Ana Sousa", "Prof. Tanaka Yuki")
- `first_name`: first name only
- `last_name`: last name only
- `gender`: person's gender (`"male"` or `"female"`), coherent with name fields
- `email`: email address
- `phone_number`: phone number in country-specific format

**Location Data:**

- `address`: full street address
- `city`: city name
- `state`: state/province name
- `country`: country name
- `country_code_2`: ISO 3166-1 alpha-2 country code (e.g., `"US"`)
- `country_code_3`: ISO 3166-1 alpha-3 country code (e.g., `"USA"`)
- `postcode`: postal/ZIP code
- `latitude`: latitude coordinate
- `longitude`: longitude coordinate

**Business Data:**

- `company`: company name
- `job`: job title
- `catch_phrase`: business catch phrase

**Internet Data:**

- `url`: website URL
- `domain_name`: domain name
- `ipv4`: IPv4 address
- `ipv6`: IPv6 address
- `user_name`: username
- `password`: password

**Financial Data:**

- `credit_card_number`: credit card number
- `credit_card_provider`: card network name (Visa, Mastercard, American Express, or Discover); coherent with `credit_card_number`
- `iban`: International Bank Account Number
- `currency_code`: currency code (USD, EUR, etc.)
**Identifiers:**

- `uuid4`: UUID version 4
- `md5`: MD5 hash (32 hex characters)
- `sha1`: SHA-1 hash (40 hex characters)
- `sha256`: SHA-256 hash (64 hex characters)
- `ssn`: Social Security Number (country-specific format)
- `license_plate`: vehicle license plate (location-aware for CA, US, DE, AU, GB)

**Barcodes:**

- `ean8`: EAN-8 barcode with valid check digit
- `ean13`: EAN-13 barcode with valid check digit

**Date/Time:**

- `date_this_year`: a date within the current year
- `date_this_decade`: a date within the current decade
- `date_between`: a random date between 2000 and 2025
- `date_range`: two dates joined with an en-dash (e.g., `"2012-05-12 – 2015-11-22"`)
- `future_date`: a date up to 1 year in the future
- `past_date`: a date up to 10 years in the past
- `time`: a time value

**Text:**

- `word`: single word
- `sentence`: full sentence
- `paragraph`: paragraph of text
- `text`: multiple paragraphs

**Miscellaneous:**

- `color_name`: color name
- `file_name`: file name
- `file_extension`: file extension
- `mime_type`: MIME type
- `user_agent`: browser user agent string (country-weighted)
- `locale_code`: locale identifier (e.g., `"en_US"`, `"de_DE"`; multilingual countries return a random official locale)

## Profile Fields

When generating person-profile data, you often need several related presets together: a name, an email derived from that name, an address, a phone number, and so on. Rather than wiring up each column individually, the `profile_fields()` helper returns a ready-made dictionary of `StringField` objects that you can unpack directly into a `Schema()`.

### Basic Usage

With no arguments, `profile_fields()` returns the **standard** set of seven columns: `first_name`, `last_name`, `email`, `city`, `state`, `postcode`, and `phone_number`. All coherence rules apply automatically: emails are derived from names, and city/state/postcode/phone are internally consistent.

```{python}
schema = pb.Schema(
    user_id=pb.int_field(unique=True, min_val=1),
    **pb.profile_fields(),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

The `**` operator unpacks the dictionary into keyword arguments, as if you had written each `string_field(preset=...)` call by hand.

### Choosing a Set

Three built-in sets control how many columns are generated:

| Set | Columns |
|-----|---------|
| `"minimal"` | `first_name`, `last_name`, `email`, `phone_number` |
| `"standard"` | `first_name`, `last_name`, `email`, `city`, `state`, `postcode`, `phone_number` |
| `"full"` | `first_name`, `last_name`, `email`, `address`, `city`, `state`, `postcode`, `phone_number`, `company`, `job` |

```{python}
# Minimal profile: just name, email, and phone
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="minimal")),
        n=100,
        seed=23,
    )
)
```

```{python}
# Full profile: includes address, company, and job title
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="full")),
        n=100,
        seed=23,
    )
)
```

### Combined vs. Split Names

By default, names are split into `first_name` and `last_name` columns. Set `split_name=False` to get a single `name` column instead:

```{python}
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(set="minimal", split_name=False)),
        n=100,
        seed=23,
    )
)
```

### Adding and Removing Columns

Use `include=` to add presets to the base set and `exclude=` to remove them. Both accept lists of preset names. The available profile presets are: `first_name`, `last_name`, `name`, `email`, `address`, `city`, `state`, `postcode`, `phone_number`, `company`, and `job`.
```{python}
# Standard set + company column
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(include=["company"])),
        n=100,
        seed=23,
    )
)
```

```{python}
# Standard set without city and state
pb.preview(
    pb.generate_dataset(
        pb.Schema(**pb.profile_fields(exclude=["city", "state"])),
        n=100,
        seed=23,
    )
)
```

You can combine `include=` and `exclude=` in the same call, as long as the same preset does not appear in both.

### Column Prefixes

The `prefix=` parameter prepends a string to every column name. This is especially useful when a schema needs two independent profiles (e.g., sender and recipient):

```{python}
schema = pb.Schema(
    **pb.profile_fields(set="minimal", prefix="sender_"),
    **pb.profile_fields(set="minimal", prefix="recipient_"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
```

Each prefixed group maintains its own coherence: the sender's email is derived from the sender's name, and the recipient's email from the recipient's name.

### Combining with Other Field Types

Since `profile_fields()` returns a plain dictionary, it composes naturally with any other field types:

```{python}
schema = pb.Schema(
    id=pb.int_field(unique=True, min_val=1000),
    **pb.profile_fields(),
    active=pb.bool_field(p_true=0.8),
    signup_date=pb.date_field(
        min_date="2024-01-01",
        max_date="2025-12-31",
    ),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23, country="DE"))
```

## Country-Specific Data

One of the most powerful features is generating locale-aware data. Use the `country=` parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.

Let's create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures *consistency across related fields*. The city, address, postcode, and coordinates will all correspond to the same location:

```{python}
# Schema with linked location fields
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    address=pb.string_field(preset="address"),
    postcode=pb.string_field(preset="postcode"),
    latitude=pb.string_field(preset="latitude"),
    longitude=pb.string_field(preset="longitude"),
)
```

Here's German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:

```{python}
pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="DE"))
```

Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan's geographic boundaries:

```{python}
pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="JP"))
```

Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil's CEP format:

```{python}
pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="BR"))
```

This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally consistent location data matters.

### Data Coherence

Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:

**Address coherence** activates when *any* address-related preset is present (`address`, `city`, `state`, `postcode`, `latitude`, `longitude`, `phone_number`, `license_plate`).
All of these fields will refer to the same location within each row.

**Person coherence** activates when *any* person-related preset is present (`name`, `name_full`, `first_name`, `last_name`, `email`, `user_name`). The email and username are derived from the person's name.

**Business coherence** activates when *both* `job` and `company` are present. When active:

- the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
- `name_full` gains profession-matched titles: a doctor may appear as "Dr. Ana Sousa" and a professor as "Prof. Tanaka Yuki". For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., "Herr Dr. med. Klaus Weber").
- integer columns whose name contains `age` (e.g., `age`, `person_age`) are automatically constrained to a working-age range (22–65).

Here's an example showing all three coherence systems working together:

```{python}
schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    license_plate=pb.string_field(preset="license_plate"),
    age=pb.int_field(),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23, country="DE"))
```

**License plate coherence** is part of address coherence. For `CA`, `US`, `DE`, `AU`, and `GB`, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like `"CABC 123"` while a British Columbia row produces `"AB1 23C"`. Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.

### Supported Countries

Pointblank currently supports 100 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., `"US"`) or alpha-3 codes (e.g., `"USA"`).
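For example (a small sketch reusing the coherence schema defined just above), both spellings should be accepted and select the same locale data:

```python
# Alpha-2 and alpha-3 codes refer to the same country
df_alpha2 = pb.generate_dataset(schema, n=50, seed=23, country="US")
df_alpha3 = pb.generate_dataset(schema, n=50, seed=23, country="USA")
```

The full list of supported countries, grouped by region, follows.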
**Europe (38 countries):**

- Armenia (`AM`), Austria (`AT`), Azerbaijan (`AZ`), Belgium (`BE`), Bulgaria (`BG`), Croatia (`HR`), Cyprus (`CY`), Czech Republic (`CZ`), Denmark (`DK`), Estonia (`EE`), Finland (`FI`), France (`FR`), Georgia (`GE`), Germany (`DE`), Greece (`GR`), Hungary (`HU`), Iceland (`IS`), Ireland (`IE`), Italy (`IT`), Latvia (`LV`), Lithuania (`LT`), Luxembourg (`LU`), Malta (`MT`), Moldova (`MD`), Netherlands (`NL`), Norway (`NO`), Poland (`PL`), Portugal (`PT`), Romania (`RO`), Russia (`RU`), Serbia (`RS`), Slovakia (`SK`), Slovenia (`SI`), Spain (`ES`), Sweden (`SE`), Switzerland (`CH`), Ukraine (`UA`), United Kingdom (`GB`)

**Americas (19 countries):**

- Argentina (`AR`), Bolivia (`BO`), Brazil (`BR`), Canada (`CA`), Chile (`CL`), Colombia (`CO`), Costa Rica (`CR`), Dominican Republic (`DO`), Ecuador (`EC`), El Salvador (`SV`), Guatemala (`GT`), Honduras (`HN`), Jamaica (`JM`), Mexico (`MX`), Panama (`PA`), Paraguay (`PY`), Peru (`PE`), United States (`US`), Uruguay (`UY`)

**Asia-Pacific (22 countries):**

- Australia (`AU`), Bangladesh (`BD`), Cambodia (`KH`), China (`CN`), Hong Kong (`HK`), India (`IN`), Indonesia (`ID`), Japan (`JP`), Kazakhstan (`KZ`), Malaysia (`MY`), Myanmar (`MM`), Nepal (`NP`), New Zealand (`NZ`), Pakistan (`PK`), Philippines (`PH`), Singapore (`SG`), South Korea (`KR`), Sri Lanka (`LK`), Taiwan (`TW`), Thailand (`TH`), Uzbekistan (`UZ`), Vietnam (`VN`)

**Middle East & Africa (21 countries):**

- Algeria (`DZ`), Cameroon (`CM`), Egypt (`EG`), Ethiopia (`ET`), Ghana (`GH`), Israel (`IL`), Jordan (`JO`), Kenya (`KE`), Lebanon (`LB`), Morocco (`MA`), Mozambique (`MZ`), Nigeria (`NG`), Rwanda (`RW`), Saudi Arabia (`SA`), Senegal (`SN`), South Africa (`ZA`), Tanzania (`TZ`), Tunisia (`TN`), Turkey (`TR`), Uganda (`UG`), United Arab Emirates (`AE`)

Additional countries and expanded coverage are planned for future releases.

### Mixing Multiple Countries

When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the `country=` parameter instead of a single string.

Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):

```{python}
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    postcode=pb.string_field(preset="postcode"),
)

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))
```

To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:

```{python}
pb.preview(
    pb.generate_dataset(
        schema,
        n=200,
        seed=23,
        country={"US": 0.7, "DE": 0.2, "FR": 0.1},
    )
)
```

Weights are auto-normalized, so `{"US": 7, "DE": 2, "FR": 1}` is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly `n`.

By default, rows from different countries are interleaved randomly (`shuffle=True`). Set `shuffle=False` to keep rows grouped by country in the order the countries are listed:

```{python}
pb.preview(
    pb.generate_dataset(
        schema,
        n=120,
        seed=23,
        country=["US", "DE", "JP"],
        shuffle=False,
    )
)
```

All coherence systems (address, person, business) work correctly within each country's batch of rows.
A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.

### Frequency-Weighted Sampling

By default, names and cities are sampled uniformly at random from the locale data, giving every entry the same probability of being selected. Real-world distributions are far from uniform, though: "James" and "Maria" appear orders of magnitude more often than "Thaddeus" or "Xiomara", and more people live in New York City than in Flagstaff. The `weighted=True` parameter makes generated data reflect this natural skew.

```{python}
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="US", weighted=True))
```

With weighting enabled, you will see popular names like James, John, Mary, and Patricia appear more frequently, while unusual names surface only occasionally. Similarly, cities like New York, Los Angeles, and Chicago dominate the output while smaller cities appear less often.

The feature works by organizing locale data into four frequency tiers. Each tier has a sampling probability that determines how likely its members are to be selected:

| Tier | Probability | Contents |
|------|-------------|----------|
| very_common | 45% | The top ~10% of entries by real-world frequency |
| common | 30% | The next ~20% of entries |
| uncommon | 20% | The next ~30% of entries |
| rare | 5% | The remaining ~40% of entries |

When a value is needed, a tier is first chosen according to these probabilities and then a single entry is picked uniformly at random within that tier. This two-step approach keeps sampling fast while producing a realistic long-tail distribution.

Setting `weighted=False` pools all entries across every tier and samples them uniformly, which can be useful when you want an even spread rather than a realistic distribution.

Weighted sampling combines seamlessly with multi-country mixing. Each country's batch uses its own tiered data independently, so a mixed dataset will have weighted US names alongside weighted German names:

```{python}
pb.preview(
    pb.generate_dataset(
        schema,
        n=200,
        seed=23,
        country={"US": 0.6, "DE": 0.4},
        weighted=True,
    )
)
```

All 100 supported country locales have tiered name and location data, so `weighted=True` produces realistic frequency distributions for every country.

## Output Formats

The `generate_dataset()` function supports multiple output formats via the `output=` parameter, making it easy to integrate with your preferred data processing library.

```{python}
schema = pb.Schema(
    id=pb.int_field(min_val=1),
    name=pb.string_field(preset="name"),
)
```

The default output is a Polars DataFrame, which offers excellent performance and a modern API for data manipulation:

```{python}
polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")

pb.preview(polars_df)
```

If your workflow uses Pandas, simply specify `output="pandas"` to get a **Pandas DataFrame**:

```{python}
pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")

pb.preview(pandas_df)
```

Both formats work seamlessly with Pointblank's validation functions, so you can choose whichever fits best with your existing data pipeline.
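As a quick sketch of that interoperability (reusing `pandas_df` from above), either output type can be handed straight to `Validate`:

```python
# The Pandas output works with Validate just as the Polars output does
validation = (
    pb.Validate(pandas_df, tbl_name="generated_pandas_data")
    .col_vals_gt(columns="id", value=0)  # ids were generated with min_val=1
    .interrogate()
)
```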
## Using Generated Data for Validation Testing

A common use case is generating test data to validate your validation rules:

```{python}
# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
```

Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.

## Pytest Fixture

When Pointblank is installed, a `generate_dataset` **pytest fixture** is automatically available in all your test files. There is no need to import anything or add configuration to `conftest.py`: the fixture is registered via pytest's plugin system.

The fixture works identically to `pb.generate_dataset()`, but with one key difference: when you don't supply a `seed=` parameter, a deterministic seed is automatically derived from the test's fully-qualified name. This means:

- the **same test** always produces the **same data**: no manual seed management required.
- *different tests* get different seeds, so they exercise different datasets.
- you can still pass an explicit `seed=` to override the automatic seed when needed.

### Basic Usage

Use it by adding `generate_dataset` to your test function's parameter list:

```{.python filename="test_pipeline.py"}
import pointblank as pb
import polars as pl  # used below to filter the generated DataFrame

def test_etl_handles_nulls(generate_dataset):
    schema = pb.Schema(
        user_id=pb.int_field(unique=True),
        email=pb.string_field(preset="email", nullable=True, null_probability=0.3),
        age=pb.int_field(min_val=0, max_val=120),
    )
    df = generate_dataset(schema, n=500)

    result = my_etl_pipeline(df)

    assert result.filter(pl.col("email").is_null()).shape[0] == 0
```

All parameters from `generate_dataset()` are supported: `n=`, `seed=`, `output=`, and `country=`:

```python
def test_german_data(generate_dataset):
    schema = pb.Schema(
        name=pb.string_field(preset="name"),
        city=pb.string_field(preset="city"),
    )
    df = generate_dataset(schema, n=200, country="DE", output="pandas")

    assert len(df) == 200
```

### Multiple Datasets in One Test

Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:

```python
def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)

    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)

    assert result.shape[0] > 0
```

### Testing Across Locales

The fixture makes locale testing particularly concise when combined with `pytest.mark.parametrize`:

```python
import pytest
import pointblank as pb

@pytest.mark.parametrize("country", ["US", "DE", "JP", "BR"])
def test_name_normalizer(generate_dataset, country):
    schema = pb.Schema(name=pb.string_field(preset="name_full"))
    df = generate_dataset(schema, n=100, country=country)

    result = normalize_names(df)

    assert result["name"].str.len_chars().min() > 0
```

### Sharing Schemas Across Tests

Define schemas as fixtures in `conftest.py` and compose them with `generate_dataset`:

```{.python filename="conftest.py"}
import pytest
import pointblank as pb

@pytest.fixture
def customer_schema():
    return pb.Schema(
        id=pb.int_field(unique=True),
        name=pb.string_field(preset="name"),
        email=pb.string_field(preset="email"),
        city=pb.string_field(preset="city"),
    )
```

```{.python filename="test_validation.py"}
def test_customer_validation(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=200, country="DE")

    validation = pb.Validate(df).col_vals_not_null(columns="email").interrogate()

    assert validation.all_passed()
```

```{.python filename="test_export.py"}
def test_customer_export(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=50, country="JP")

    exported = export_to_parquet(df)

    assert exported.exists()
```

### Debugging with Seed Introspection

The fixture callable exposes two attributes that make debugging failed tests straightforward:

- `generate_dataset.default_seed`: the base seed derived from the test name (available before any call)
- `generate_dataset.last_seed`: the seed actually used for the most recent call (accounts for the call counter and explicit overrides)

Include `.last_seed` in assertion messages so failures are immediately reproducible:

```python
def test_age_range(generate_dataset):
    schema = pb.Schema(age=pb.int_field(min_val=18, max_val=100))
    df = generate_dataset(schema, n=500)

    min_age = df["age"].min()
    assert min_age >= 18, (
        f"Expected min age >= 18, got {min_age} (seed={generate_dataset.last_seed})"
    )
```

You can also use `.default_seed` to reproduce the exact dataset outside of pytest:

```python
# In a REPL or notebook, reproduce the data from a failed test
# (SEED_FROM_TEST is a placeholder: paste in the seed value the test
# reported via `.default_seed` or `.last_seed`):
import pointblank as pb

df = pb.generate_dataset(schema, n=500, seed=SEED_FROM_TEST)
```

### Seed Stability

A given seed (whether explicit or auto-derived) is guaranteed to produce identical output **within the same Pointblank version**. Across versions, changes to country data files or generator logic may alter the output for a given seed.

For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like `pytest-snapshot` and `syrupy`.

## Conclusion

Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows.
By incorporating test data generation into your process, you can:

- quickly prototype validation rules before working with production data
- create reproducible test fixtures for automated testing and CI/CD pipelines
- generate locale-specific data for internationalization testing across 100 countries
- ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
- produce datasets of any size with consistent, realistic values

Whether you're building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.

### Data Inspection

Pointblank’s CLI (`pb`) makes it easy to view your data before running validations. It has several commands that are exceedingly useful for understanding your data’s structure, checking for obvious issues, and confirming that your data source is being read correctly. We also make it easy to explore data in various formats and locations. Let's go through each of the commands for inspecting and exploring data.

## `pb info`: Inspecting the Data Structure

Use `pb info` to display basic information about your data source. Here's how this works with a local CSV file:

```bash
pb info worldcities.csv
```

![](/assets/pb-info-worldcities-csv.png){width=100%}

This command shows (1) the table type (e.g., `pandas`, `polars`, etc.), (2) the number of rows and columns, and (3) the data source path or identifier.

That example used a local CSV file. The same file is also present in Pointblank's GitHub repository (in the `data_raw` directory) and the CLI is able to load the data from there as well:

```bash
pb info https://github.com/posit-dev/pointblank/blob/main/data_raw/worldcities.csv
```

![](/assets/pb-info-worldcities-github-csv.png){width=100%}

The `pb info` command is useful before running validations to confirm your data source's dimensions and whether it can even be loaded.

::: {.callout-note}
You can inspect a wide variety of data sources using the CLI! Here are some examples with `pb info`:

```bash
pb info small_table        # built-in dataset
pb info worldcities.csv    # single CSV file
pb info meteo.parquet      # single Parquet file
pb info "*.parquet"        # several Parquet files
pb info "data/*.parquet"   # partitioned Parquet files
pb info "duckdb:///warehouse/analytics.ddb::customer_metrics"  # DB table via connection string
pb info https://github.com/posit-dev/pointblank/blob/main/data_raw/global_sales.csv  # GitHub URL
```

And these input schemes work with all other commands that accept a `DATA_SOURCE`.
:::

## `pb preview`: Previewing Data

Use `pb preview` to view the first and last rows of your data. Let's try it out with the `worldcities.csv` file:

```bash
pb preview worldcities.csv
```

![](/assets/pb-preview-worldcities-csv.png){width=100%}

As can be seen, `pb preview` gives you a preview of the dataset as a table in the console. The dataset has 41K rows but we're electing to show only five rows from the head and from the tail. Let's go over some features of the table preview. First off, the table header provides information on the data source and the DataFrame library that handled the reading of the CSV. Below the column names are simplified representations of the data types (e.g., abbreviated markers for `object` and `Float64` columns).
We provide row numbers (in gray) in the table stub to indicate which of the rows are from the head or the tail (and a divider helps to distinguish these row groups). If you'd prefer to eliminate the row numbers, use the `--no-row-numbers` option:

```bash
pb preview worldcities.csv --no-row-numbers
```

![](/assets/pb-preview-worldcities-csv-no-row-numbers.png){width=100%}

While `pb preview` purposefully displays only a few rows, the number of columns shown can be more than you might need. Furthermore, if a table has *a lot* of columns, you'll only see some of the first and some of the last columns. This is where column selection becomes useful, and there are a few methods available for subsetting the preview table's columns.

A good one (provided you know the column names) is to use the `--columns` option along with a comma-delimited set of column names. Let's look at a preview of the included `game_revenue` dataset before subsetting the columns:

```bash
pb preview game_revenue
```

![](/assets/pb-preview-game_revenue-all-columns.png){width=100%}

That's 11 columns in total and, while all columns *are* shown (i.e., none in the middle are truncated from view), we start to see some necessary instances of abbreviating via `…` within the column names and in the displayed values. Let's now use the `--columns` option with a set of column names:

```bash
pb preview game_revenue --columns "player_id, item_type, item_name, start_day"
```

![](/assets/pb-preview-game_revenue-column-names.png){width=100%}

With that, the few columns that are displayed no longer have to abbreviate their data values. This is an important consideration since a selective display of columns becomes more necessary if column content is large or if the width of the terminal (in terms of characters) cannot be increased.

You may want to view ranges of columns by their indices. This is convenient when you want to get a closer look at a few side-by-side columns and you don't want to bother with getting the set of column names exactly right (i.e., for quick inspection). For this, we need to use the `--col-range` option with the desired left/right column bounds separated by a colon:

```bash
pb preview game_revenue --col-range "3:6"
```

![](/assets/pb-preview-game_revenue-column-range.png){width=100%}

In the case that you want to save a table preview as an HTML table in a standalone file, you can add in the `--output-html` option (just add a path/filename with an `.html` extension). And there are many more options that allow for quick iteration while previewing a table. Use `pb preview --help` to get a helpful listing.

## `pb scan`: Getting Column Summaries

We can use `pb scan` for fairly comprehensive summaries of column data, including:

- data types
- missing value counts
- unique value counts
- summary statistics (mean, standard deviation, min, max, quartiles, and the interquartile range)

Let's use this on the `worldcities.csv` dataset:

```bash
pb scan worldcities.csv
```

![](/assets/pb-scan-worldcities-csv.png){width=100%}

Each row in the summary table represents a column in the input dataset. Just as in `pb preview`, we get simplified dtypes (in the `Type` column). The `NA` and `UQ` columns indicate how many missing and unique values are in each column. The remaining columns are statistical measures, and there's an important thing to note here: the values provided for any string-based columns (here, `city_name` and `country`) are derived from string lengths.
When using `pb scan`, it's helpful to know that large numbers in the summary table are automatically abbreviated for readability, so you'll see values like `39.8k` or `38.0M` instead of long numbers that would require many more characters. For the best experience, try to use a terminal window that's at least 150 characters wide. This will help ensure that all column values are fully visible and not adversely abbreviated by the underlying table mechanism. If your table has many columns, that's not much of a problem for the reporting! Each column is represented as a row in the report, so you'll simply see more lines in the output (and you could always limit the number of columns reported).

There are two options for `pb scan`:

- `--columns "col1,col2"`: scan only specified columns
- `--output-html "file.html"`: save scan as an HTML file

Both of these options are also in the `pb preview` command and they behave the same way here.

## `pb missing`: Reporting on Missing Values

Use `pb missing` to generate a missing values report, visualizing missingness across columns and 10 *row sectors*. Here's an example using `worldcities.csv`:

```bash
pb missing worldcities.csv
```

![](/assets/pb-missing-worldcities-csv.png){width=100%}

This report is arranged similarly to that of `pb scan`, where each column in the input table gets a row in this report table. Each of the 10 row sectors represents 1/10 of the rows in the dataset, where sector `1` encompasses the head of the table, and `10` the tail.

More often than not, we expect few missing values, so a filled green circle signifies that the collection of rows in a sector (for a column) has no missing values. We don't see any red circles in the `worldcities.csv`-based example but, if we did, that would mean that sectors for a given column are entirely filled with missing values. What's in between the no-missing and completely-missing cases are percentages of missing values. For instance, we can see that row sector `3` of the `population` column has 18% missing values (which is very odd for a table with the sole purpose of providing population values). We also have cases where we see <1% of values in a row sector missing. The reporting of `pb missing` is very careful not to 'round down' in cases where there could be very few missing values (or even just one) in a large table.

Seeing this type of missing value report can be really important! You might not expect *any* missing values, but finding them will inform decisions on whether to institute checks for them. Another case is that missing values will pop up in specific sectors, indicating a change in how data is processed and appended to the table.

By way of options, there's only one for `pb missing` and it is `--output-html`. With that (as in the previous two commands discussed), we can write the missing values report to a standalone HTML file.

## Wrapping Up

Pointblank’s CLI provides a set of commands that make it easy to inspect, understand, and diagnose your data before you move on to validation or analysis. Using these tools can help you catch issues early and gain confidence in your data sources.
- use `pb info` before running validations to confirm your data source can be loaded
- use `pb preview` to quickly understand what the data looks like
- use `pb scan` for a quick data profile and to spot outliers or data quality issues
- use `pb missing` to visualize and diagnose missing data patterns

By incorporating these commands into your workflow, you’ll be better equipped to work efficiently with your data (and avoid surprises down the line).

### Data Validation

Validating data directly in the terminal with the Pointblank CLI offers a fast, scriptable, and repeatable way to check your data. This approach is especially useful for quick checks, CI/CD pipelines, and automation workflows, where you want immediate feedback and clear pass/fail results. The CLI commands are designed for efficiency: you can run validations with a single line, integrate them easily into shell scripts or data pipelines, and benefit from clear, color-coded output that’s easy to interpret at a glance.

The `pb validate` command lets you perform common validation checks directly on your data source with a simple command-line interface. This works well both for quick, one-off checks and for use in automated pipelines. For more complex validation logic, the `pb run` command serves as a runner for validation scripts written with the Pointblank Python API, allowing you to execute custom validation workflows from the command line.

## `pb validate`: Quick, One-Line Data Checks

The `pb validate` command is your go-to for running common validation checks directly on your data source. You specify exactly which check you want to run using the `--check` option, making your intent clear and your validation explicit. Here’s how you construct a validation command:

```bash
pb validate worldcities.csv --check <check-name> [other options]
```

You always provide the data source first, then specify one or more checks with `--check`. Each check can have its own options, such as `--column` or `--value`, depending on what you want to validate.

### Checking for Duplicate and Complete Rows

To check for duplicate rows, use the `rows-distinct` check:

```bash
pb validate worldcities.csv --check rows-distinct
```

![](/assets/pb-validate-rows-distinct-worldcities-csv.png){width=100%}

The output shows you whether your data contains any duplicate rows, how many rows were checked, and if any duplicates were found. The color-coding helps you quickly interpret the results, using green for pass and red for fail. Here, no duplicate rows were detected out of the 41K rows checked.

To check that every row is complete (i.e., no missing values in any column), use the `rows-complete` check:

```bash
pb validate worldcities.csv --check rows-complete
```

![](/assets/pb-validate-rows-complete-worldcities-csv.png){width=100%}

With this check we see that the `worldcities.csv` dataset has 739 rows containing at least one Null/missing value. And with any dataset, it's easy to quickly spot if there are any rows with missing data using this command.

### Checking for Nulls and Value Ranges

You can easily check for missing values in a column, or ensure that values fall within a certain range. Here’s how to check that all values in the `city_name` column are not null:

```bash
pb validate worldcities.csv --check col-vals-not-null --column city_name
```

![](/assets/pb-validate-worldcities-not-null-city_name.png){width=100%}

Perhaps surprisingly, we find that one row has a missing city name.
Let's now check whether all values in the `population` column are greater than zero:

```bash
pb validate worldcities.csv --check col-vals-gt --column population --value 0
```

![](/assets/pb-validate-worldcities-gt-0-population.png){width=100%}

With that, we find that there are 741 rows where the `population` value is not greater than 0 (note that this check also fails when cells are null or missing).

### Multiple Checks in One Command

You can chain several checks together in a single command. This is handy for comprehensive data quality checks:

```bash
pb validate worldcities.csv --check rows-distinct --check col-vals-not-null --column city_name --check col-vals-gt --column population --value 0
```

![](/assets/pb-validate-multi-check.png){width=100%}

Each check is shown one after the other in the terminal output, so you can review the result of each validation step individually as the command proceeds.

### Seeing and Saving Failing Rows

If a check fails, you might want to see which rows caused the failure. Use the `--show-extract` option to display failing rows right in the terminal:

```bash
pb validate worldcities.csv --check rows-complete --show-extract
```

![](/assets/pb-validate-show-extract.png){width=100%}

Or, save the failing rows to a CSV file for further investigation:

```bash
pb validate worldcities.csv --check rows-complete --show-extract --write-extract incomplete_failing_rows
```

![](/assets/pb-validate-write-extract.png){width=100%}

Note the additional lines in the output stating that failing rows were saved to a folder (`incomplete_failing_rows`) and that, within that folder, the `step_01_rows_complete.csv` file was written. Using a folder for extracts is necessary in practice since there may be multiple validations defined in a `pb validate` command.

### Advanced Options and CI/CD Integration

- use `--exit-code` to make the command exit with a non-zero code if any check fails; useful for CI/CD pipelines
- use `--limit` to control how many failing rows are shown or saved
- use `--list-checks` to see all available validation checks and their options

```bash
pb validate worldcities.csv --check col-vals-not-null --column city_name --exit-code
```

![](/assets/pb-validate-exit-code.png){width=100%}

## `pb run`: Custom Validation Workflows with Python

For more complex validation logic, use the `pb run` command. This lets you execute a Python script containing Pointblank validation steps, combining the flexibility of the Python API with the convenience of the CLI. You can always scaffold a template script using the `pb make-template` command:

```bash
pb make-template my_validation.py
```

![](/assets/pb-make-template.png){width=100%}

But for our example, we'll elect to make our own `worldcities_validation.py` file from scratch.
But for our example, we'll elect to make our own `worldcities_validation.py` file from scratch. It will:

- use the `worldcities.csv` file
- apply two thresholds (one for 'warning', another for 'error')
- have six validation steps

Here's what it looks like:

```python
import pointblank as pb

validation = (
    pb.Validate(
        data="worldcities.csv",
        thresholds=pb.Thresholds(
            warning=1,  # 1 failure
            error=0.05,  # 5% of rows failing
        ),
    )
    .col_schema_match(
        schema=pb.Schema(
            columns=[
                ("city_name", "object"),
                ("latitude", "float64"),
                ("longitude", "float64"),
                ("country", "object"),
                ("population", "float64"),
            ]
        ),
    )
    .col_vals_not_null(columns="city_name")
    .col_vals_not_null(columns="population")
    .col_vals_gt(columns="population", value=0, na_pass=True)
    .col_vals_between(columns="latitude", left=-90, right=90)
    .col_vals_between(columns="longitude", left=-180, right=180)
    .interrogate()
)
```

Now, we'll run the `.py` script from the terminal:

```bash
pb run worldcities_validation.py
```

![](/assets/pb-run-worldcities_validation.png){width=100%}

You’ll see a summary table that lists all of the steps and their results; you can include as many steps and as much logic as you need.

### Output Options

You can save the validation report as HTML or JSON (or both) for sharing or for automation:

```bash
pb run worldcities_validation.py --output-html report.html --output-json report.json
```

![](/assets/pb-run-worldcities_validation-output.png){width=100%}

There are also options to produce extracts (subsets of failing rows) with `--show-extract` or `--write-extract`, just like with `pb validate`. Let's do both in the following example:

```bash
pb run worldcities_validation.py --show-extract --write-extract worldcities_failures
```

![](/assets/pb-run-worldcities_validation-extracts.png){width=100%}

This shows a preview of each extract for those validation steps where extracts were produced (steps 2, 3, and 4). Individual CSV files with extracted rows for those steps were written to the `worldcities_failures` directory.

### Controlling Failure Behavior

It's possible to use the `--fail-on` option to control when the command should exit with an error, based on the severity of validation failures. This is especially useful for automated workflows and CI/CD pipelines. Let's try that with our `worldcities_validation.py` validation, which we've seen exceeds the 'warning' threshold in steps 2, 3, and 4:

```bash
pb run worldcities_validation.py --fail-on warning
```

![](/assets/pb-run-worldcities_validation-fail-on-warning.png){width=100%}

Notice that the final line states `Exiting with error due to warning, error, or critical validation failures`. Because we applied `--fail-on warning`, any presence of 'warning' (or higher levels such as 'error' or 'critical') will yield a non-zero exit code that should stop a pipeline process. We can prove this by running the following lines in the terminal:

```bash
pb run worldcities_validation.py --fail-on warning > /dev/null 2>&1
echo $?
```

which returns `1`.
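That non-zero exit code is the hook for automation. As a minimal sketch of a pipeline step (assuming a shell-based CI runner; the report filename is just an example):

```bash
#!/usr/bin/env bash
# Exit immediately if any command (including pb run) returns non-zero
set -euo pipefail

# Run the validation, write a JSON summary for later inspection, and
# stop the pipeline if any step reaches the 'warning' level
pb run worldcities_validation.py --fail-on warning --output-json report.json

echo "Data validation passed; continuing with downstream steps"
```

With `set -e` in effect, a failing `pb run` stops the script before the final `echo` ever runs.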
## Wrapping Up

Pointblank’s CLI gives you powerful tools for validating your data, whether you need a quick check or a custom workflow. Use `pb validate` for fast, one-liner checks and `pb run` for more advanced, scriptable validation logic. With clear output and flexible options, you can catch data issues early and keep your workflows running smoothly.

### CLI Reference

This page provides a complete reference for all Pointblank CLI commands. Each section shows the full help text as it appears in the terminal, giving you quick access to all available options and examples. For practical usage examples and workflows, see the [CLI Data Validation](cli-data-validation.qmd) and [CLI Data Inspection](cli-data-inspection.qmd) guides.

## `pb` - Main Command

The main entry point for all Pointblank CLI operations:

> ```
> Usage: pb [OPTIONS] COMMAND [ARGS]...
>
> Pointblank CLI: Data validation and quality tools for data engineers.
>
> Use this CLI to validate data quality, explore datasets, and generate
> comprehensive reports for CSV, Parquet, and database sources. Suitable for
> data pipelines, ETL validation, and exploratory data analysis from the
> command line.
>
> Quick Examples:
>
> pb preview data.csv      Preview your data
> pb scan data.csv         Generate data profile
> pb validate data.csv     Run basic validation
>
> Use pb COMMAND --help for detailed help on any command.
>
> Options:
> -v, --version  Show the version and exit.
> -h, --help     Show this message and exit.
>
> Commands:
> info           Display information about a data source.
> preview        Preview a data table showing head and tail rows.
> scan           Generate a data scan profile report.
> missing        Generate a missing values report for a data table.
> validate       Perform single or multiple data validations.
> run            Run a Pointblank validation script or YAML configuration.
> make-template  Create a validation script or YAML configuration template.
> pl             Execute Polars expressions and display results.
> datasets       List available built-in datasets.
> requirements   Check installed dependencies and their availability.
> ```

## `pb info` - Data Source Information

Display basic information about a data source:

> ```
> Usage: pb info [OPTIONS] [DATA_SOURCE]
>
> Display information about a data source.
>
> Shows table type, dimensions, column names, and data types.
>
> DATA_SOURCE can be:
>
> - CSV file path (e.g., data.csv)
> - Parquet file path or pattern (e.g., data.parquet, data/*.parquet)
> - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv)
> - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name)
> - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales)
>
> Options:
> --help  Show this message and exit.
> ```

## `pb preview` - Data Table Preview

Preview data showing head and tail rows:

> ```
> Usage: pb preview [OPTIONS] [DATA_SOURCE]
>
> Preview a data table showing head and tail rows.
>
> DATA_SOURCE can be:
>
> - CSV file path (e.g., data.csv)
> - Parquet file path or pattern (e.g., data.parquet, data/*.parquet)
> - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv)
> - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name)
> - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales)
> - Piped data from pb pl command
>
> COLUMN SELECTION OPTIONS:
>
> For tables with many columns, use these options to control which columns are
> displayed:
>
> - --columns: Specify exact columns (e.g., --columns "name,age,email")
> - --col-range: Select column range (e.g., --col-range "1:10", --col-range "5:", --col-range ":15")
> - --col-first: Show first N columns (e.g., --col-first 5)
> - --col-last: Show last N columns (e.g., --col-last 3)
>
> Tables with >15 columns automatically show first 7 and last 7 columns with
> indicators.
> > Options: > --columns TEXT Comma-separated list of columns to display > --col-range TEXT Column range like '1:10' or '5:' or ':15' > (1-based indexing) > --col-first INTEGER Show first N columns > --col-last INTEGER Show last N columns > --head INTEGER Number of rows from the top (default: 5) > --tail INTEGER Number of rows from the bottom (default: 5) > --limit INTEGER Maximum total rows to display (default: 50) > --no-row-numbers Hide row numbers > --max-col-width INTEGER Maximum column width in pixels (default: 250) > --min-table-width INTEGER Minimum table width in pixels (default: 500) > --no-header Hide table header > --output-html PATH Save HTML output to file > --help Show this message and exit. > ``` ## `pb scan` - Data Profile Reports Generate comprehensive data profiles: > ``` > Usage: pb scan [OPTIONS] [DATA_SOURCE] > > Generate a data scan profile report. > > Produces a comprehensive data profile including: > > - Column types and distributions > - Missing value patterns > - Basic statistics > - Data quality indicators > > DATA_SOURCE can be: > > - CSV file path (e.g., data.csv) > - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) > - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) > - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) > - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) > - Piped data from pb pl command > > Options: > --output-html PATH Save HTML scan report to file > -c, --columns TEXT Comma-separated list of columns to scan > --help Show this message and exit. > ``` ## `pb missing` - Missing Values Reports Generate reports focused on missing values: > ``` > Usage: pb missing [OPTIONS] [DATA_SOURCE] > > Generate a missing values report for a data table. > > DATA_SOURCE can be: > > - CSV file path (e.g., data.csv) > - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) > - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) > - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) > - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) > - Piped data from pb pl command > > Options: > --output-html PATH Save HTML output to file > --help Show this message and exit. > ``` ## `pb validate` - Quick Data Validations Perform single or multiple data validations: > ``` > Usage: pb validate [OPTIONS] [DATA_SOURCE] > > Perform single or multiple data validations. > > Run one or more validation checks on your data in a single command. Use > multiple --check options to perform multiple validations. 
> > DATA_SOURCE can be: > > - CSV file path (e.g., data.csv) > - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) > - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) > - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) > - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) > > AVAILABLE CHECK_TYPES: > > Require no additional options: > > - rows-distinct: Check if all rows in the dataset are unique (no duplicates) > - rows-complete: Check if all rows are complete (no missing values in any column) > > Require --column: > > - col-exists: Check if a specific column exists in the dataset > - col-vals-not-null: Check if all values in a column are not null/missing > > Require --column and --value: > > - col-vals-gt: Check if column values are greater than a fixed value > - col-vals-ge: Check if column values are greater than or equal to a fixed value > - col-vals-lt: Check if column values are less than a fixed value > - col-vals-le: Check if column values are less than or equal to a fixed value > > Require --column and --set: > > - col-vals-in-set: Check if column values are in an allowed set > > Use --list-checks to see all available validation methods with examples. The > default CHECK_TYPE is 'rows-distinct' which checks for duplicate rows. > > Examples: > > pb validate data.csv # Uses default validation (rows-distinct) > pb validate data.csv --list-checks # Show all available checks > pb validate data.csv --check rows-distinct > pb validate data.csv --check rows-distinct --show-extract > pb validate data.csv --check rows-distinct --write-extract failing_rows_folder > pb validate data.csv --check rows-distinct --exit-code > pb validate data.csv --check col-exists --column price > pb validate data.csv --check col-vals-not-null --column email > pb validate data.csv --check col-vals-gt --column score --value 50 > pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending" > > Multiple validations in one command: pb validate data.csv --check rows- > distinct --check rows-complete > > Options: > --list-checks List available validation checks and exit > --check CHECK_TYPE Type of validation check to perform. Can be used > multiple times for multiple checks. > --column TEXT Column name or integer position as #N (1-based index) > for validation. > --set TEXT Comma-separated allowed values for col-vals-in-set > checks. > --value FLOAT Numeric value for comparison checks. > --show-extract Show extract of failing rows if validation fails > --write-extract TEXT Save failing rows to folder. Provide base name for > folder. > --limit INTEGER Maximum number of failing rows to save to CSV > (default: 500) > --exit-code Exit with non-zero code if validation fails > --help Show this message and exit. > ``` ## `pb run` - Validation Scripts and YAML Run Python validation scripts or YAML configurations: > ``` > Usage: pb run [OPTIONS] [VALIDATION_FILE] > > Run a Pointblank validation script or YAML configuration. > > VALIDATION_FILE can be: - A Python file (.py) that defines validation logic > - A YAML configuration file (.yaml, .yml) that defines validation steps > > Python scripts should load their own data and create validation objects. > YAML configurations define data sources and validation steps declaratively. > > If --data is provided, it will automatically replace the data source in your > validation objects (Python scripts) or override the 'tbl' field (YAML > configs). 
> > To get started quickly, use 'pb make-template' to create templates. > > DATA can be: > > - CSV file path (e.g., data.csv) > - Parquet file path or pattern (e.g., data.parquet, data/*.parquet) > - GitHub URL to CSV/Parquet (e.g., https://github.com/user/repo/blob/main/data.csv) > - Database connection string (e.g., duckdb:///path/to/db.ddb::table_name) > - Dataset name from pointblank (small_table, game_revenue, nycflights, global_sales) > > Examples: > > pb make-template my_validation.py # Create a Python template > pb run validation_script.py > pb run validation_config.yaml > pb run validation_script.py --data data.csv > pb run validation_config.yaml --data small_table --output-html report.html > pb run validation_script.py --show-extract --fail-on error > pb run validation_config.yaml --write-extract extracts_folder --fail-on critical > > Options: > --data TEXT Data source to replace in validation objects > (Python scripts and YAML configs) > --output-html PATH Save HTML validation report to file > --output-json PATH Save JSON validation summary to file > --show-extract Show extract of failing rows if validation > fails > --write-extract TEXT Save failing rows to folders (one CSV per > step). Provide base name for folder. > --limit INTEGER Maximum number of failing rows to save to > CSV (default: 500) > --fail-on [critical|error|warning|any] > Exit with non-zero code when validation > reaches this threshold level > --help Show this message and exit. > ``` ## `pb make-template` - Template Generation Create validation script or YAML configuration templates: > ``` > Usage: pb make-template [OPTIONS] [OUTPUT_FILE] > > Create a validation script or YAML configuration template. > > Creates a sample Python script or YAML configuration with examples showing > how to use Pointblank for data validation. The template type is determined > by the file extension: - .py files create Python script templates - > .yaml/.yml files create YAML configuration templates > > Edit the template to add your own data loading and validation rules, then > run it with 'pb run'. > > OUTPUT_FILE is the path where the template will be created. > > Examples: > > pb make-template my_validation.py # Creates Python script template > pb make-template my_validation.yaml # Creates YAML config template > pb make-template validation_template.yml # Creates YAML config template > > Options: > --help Show this message and exit. > ``` ## `pb pl` - Polars Expression Execution Execute Polars expressions and display results: > ``` > Usage: pb pl [OPTIONS] [POLARS_EXPRESSION] > > Execute Polars expressions and display results. > > Execute Polars DataFrame operations from the command line and display the > results using Pointblank's visualization tools. > > POLARS_EXPRESSION should be a valid Polars expression that returns a > DataFrame. The 'pl' module is automatically imported and available. 
> > Examples: > > # Direct expression > pb pl "pl.read_csv('data.csv')" > pb pl "pl.read_csv('data.csv').select(['name', 'age'])" > pb pl "pl.read_csv('data.csv').filter(pl.col('age') > 25)" > > # Multi-line with editor (supports multiple statements) > pb pl --edit > > # Multi-statement code example in editor: > # csv = pl.read_csv('data.csv') > # result = csv.select(['name', 'age']).filter(pl.col('age') > 25) > > # Multi-line with a specific editor > pb pl --edit --editor nano > pb pl --edit --editor code > pb pl --edit --editor micro > > # From file > pb pl --file query.py > > Piping to other pb commands > pb pl "pl.read_csv('data.csv').head(20)" --pipe | pb validate --check rows-distinct > pb pl --edit --pipe | pb preview --head 10 > pb pl --edit --pipe | pb scan --output-html report.html > pb pl --edit --pipe | pb missing --output-html missing_report.html > > Use --output-format to change how results are displayed: > pb pl "pl.read_csv('data.csv')" --output-format scan > pb pl "pl.read_csv('data.csv')" --output-format missing > pb pl "pl.read_csv('data.csv')" --output-format info > > Note: For multi-statement code, assign your final result to a variable like > 'result', 'df', 'data', or ensure it's the last expression. > > Options: > -e, --edit Open editor for multi-line input > -f, --file PATH Read query from file > --editor TEXT Editor to use for --edit mode (overrides > $EDITOR and auto-detection) > -o, --output-format [preview|scan|missing|info] > Output format for the result > --preview-head INTEGER Number of head rows for preview > --preview-tail INTEGER Number of tail rows for preview > --output-html PATH Save HTML output to file > --pipe Output data in a format suitable for piping > to other pb commands > --pipe-format [parquet|csv] Format for piped output (default: parquet) > --help Show this message and exit. > ``` ## `pb datasets` - Built-in Datasets List available built-in datasets: > ``` > Usage: pb datasets [OPTIONS] > > List available built-in datasets. > > Options: > --help Show this message and exit. > ``` ## `pb requirements` - Dependency Check Check installed dependencies and their availability: > ``` > Usage: pb requirements [OPTIONS] > > Check installed dependencies and their availability. > > Options: > --help Show this message and exit. > ``` ## Common Data Source Types All commands that accept a `DATA_SOURCE` parameter support these formats: - **CSV files**: `data.csv`, `path/to/data.csv` - **Parquet files**: `data.parquet`, `data/*.parquet` (patterns supported) - **GitHub URLs**: `https://github.com/user/repo/blob/main/data.csv` - **Database connections**: `duckdb:///path/to/db.ddb::table_name` - **Built-in datasets**: `small_table`, `game_revenue`, `nycflights`, `global_sales` - **Piped data**: Output from `pb pl` command (where supported) ## Exit Codes and Automation Many commands support options useful for automation and CI/CD: - `--exit-code`: Exit with non-zero code on validation failure - `--fail-on [critical|error|warning|any]`: Control failure thresholds - `--output-html`, `--output-json`: Save reports for external consumption - `--write-extract`: Save failing rows for investigation These features make Pointblank CLI commands suitable for integration into data pipelines, quality gates, and automated workflows. ### MCP Quick Start Transform your data validation workflow with conversational AI in VS Code or Positron IDE. Here are three simple steps to start validating data through conversation (and no complex configuration required). ### 1. 
Install ```bash pip install pointblank[mcp] ``` ### 2. Configure Your IDE **For VS Code**: **Option 1: Workspace Configuration (Recommended for teams)** 1. Create a `.vscode/mcp.json` file in your project folder 2. Add this configuration: ```json { "servers": { "pointblank": { "command": "python", "args": ["-m", "pointblank_mcp_server.pointblank_server"] } } } ``` **Option 2: User Configuration (Personal use)** 1. Run command: `MCP: Open User Configuration` (Cmd/Ctrl + Shift + P) 2. Add the same JSON configuration above > ⚠️ **Security Note**: Only add MCP servers from trusted sources. VS Code will ask you to confirm trust when starting the server for the first time. **For Positron**: 1. Open Positron Settings 2. Navigate to MCP Server configuration 3. Add the configuration (format may vary) > **Note**: If you don't see MCP settings, you may need to install an MCP extension first. Search for "MCP" in the Extensions marketplace. ### 3. Start Chatting ``` "Load my sales data and check its quality" ``` That's basically how you get started. ## Essential Commands Master these five command patterns and you'll be able to handle most data validation scenarios. Think of these as your fundamental vocabulary for talking to Pointblank. ### Load Data ``` "Load the file /path/to/data.csv" "Load my Netflix dataset from the working directory" "Load the CSV file with sales metrics" "Load customer_data.csv as my main dataset" ``` ### Explore Data ``` "Analyze the data for netflix_data" "Show me a preview of the loaded data" "Create a column summary table" "Generate a missing values analysis" ``` **What you'll get**: Comprehensive data profiling with statistics including missing values, data types, distributions, and summary statistics for each column. The preview and summary tables are automatically generated as beautiful HTML files that open in your browser. This gives you a complete picture of your dataset's structure and characteristics before you define quality rules. ### Check Quality ``` "Create a validator for netflix_data" "Add validation that ratings are between 0 and 10" "Check that all release years are reasonable" "Apply the basic_quality template" ``` **What you'll get**: Actual data quality validation that checks your data against business rules and domain knowledge. This tells you if your data meets your specific quality requirements and identifies rows that fail validation criteria. ### Create Data Validations ``` "Add validation that show_id values are unique" "Check that cast field is not empty for movies" "Ensure vote_count is greater than 0" "Validate that country field follows ISO format" ``` **What you'll get**: Individual validation rules added to your validator. Each rule tests a specific business requirement and can be customized with thresholds and actions. ### Run and Export ``` "Run the validation and show results" "Export validation failures to CSV" "Get failed rows for step 2" "Save the validation report" ``` --- ## Top Three Use Cases Once you have Pointblank set up, these scenarios show the most common and powerful ways to use natural language data validation. Each example demonstrates the conversational flow and immediate value you'll experience. ### 1. 📊 Quick Data Check **When**: You get a new dataset ``` 👤 "I just got netflix_data.csv. Is it clean?" 🤖 "Loading dataset... Found 16,000 movies/shows with 18 columns. Quality analysis complete: - ✅ No duplicate show IDs - ⚠️ 466 missing countries (2.9%) - ⚠️ 132 missing directors (0.8%) Opening data preview in your browser..." 
``` ### 2. 🔄 Daily Validation **When**: Regular data quality checks ``` 👤 "Apply the basic_quality template to my data" 🤖 "Creating validator with basic quality checks... ✅ Data types validated ✅ Missing value thresholds checked ✅ Duplicate detection passed Validation complete! All checks passed." ``` ### 3. 🚨 Issue Investigation **When**: Something looks wrong ``` 👤 "Show me movies with ratings above 9.5" 🤖 "Creating validator for high ratings... Found 47 entries with ratings > 9.5 Exporting suspicious rows to CSV Most are documentaries - could be valid!" ``` These examples show how natural conversation can quickly identify and resolve data quality issues that might take hours to diagnose manually. ## Core Capabilities Pointblank's MCP server provides powerful tools for comprehensive data validation with beautiful, interactive HTML reports: ### Data Exploration - **Interactive HTML previews** with automatic browser opening showing head/tail rows - **Column summary tables** with detailed statistics and color-coded data types - **Missing values analysis** with visual patterns and percentages - **Data quality analysis** with comprehensive profiling insights ### Validation Workflows - **Validator creation** with flexible thresholds and configuration - **Many validation types** for comprehensive data quality checking - **Step-by-step validation** building with natural language commands - **Template-based validation** for common data quality patterns ### HTML Reports & Analysis - **Interactive validation reports** automatically opened in your browser - **Timestamped HTML files** for easy sharing and documentation - **Python code generation** for reproducible validation scripts All interactions use natural language, making advanced data validation accessible to users at any technical level while producing publication-ready HTML reports. ## Common Validation Rules Understanding what validation rules to ask for will help you quickly build comprehensive data quality checks. These examples cover the most frequent validation scenarios using Pointblank's built-in validation functions. ### Data Integrity - "Check for duplicate show IDs" - "Ensure no missing required fields like title" - "Validate that release years are between 1900 and 2025" ### Business Logic - "Ratings must be between 0 and 10" - "Budget must be positive numbers" - "Duration should be greater than 0" ### Cross-Field Validation - "Release year should match date_added year" - "Vote count should correlate with popularity" - "Movies should have directors specified" ### Available Templates Pointblank includes pre-built validation templates: - `basic_quality` - Essential data quality checks - `financial_data` - Money and numeric validations - `customer_data` - Personal information validations - `sensor_data` - Time series and measurement checks - `survey_data` - Response and rating validations These rule patterns can be combined and customized for your specific data and business requirements. The natural language interface makes it easy to express complex validation logic without learning technical syntax. ## Some Tips and Tricks These recommendations will help you get more value from your Pointblank MCP server and avoid some common pitfalls. ### Talk Naturally ✅ **Good:** "Check if customer emails look valid" ❌ **Avoid:** "Execute col_vals_regex on email column" ### Provide Context ✅ **Good:** "This is for the board presentation" ❌ **Avoid:** Just asking for validation without explanation ### Build Incrementally 1. 
Start with data profiling
2. Add basic validation rules
3. Create templates for reuse
4. Set up automated checks

### Save Templates

```
"Save these rules as 'customer_validation'"
"Apply the financial_data template"
"Use our standard survey validation"
```

### Interactive Visual Tables

Pointblank automatically generates beautiful, interactive HTML tables for data exploration:

```
"Show me a preview of the data"
"Generate a column summary table"
"Create a missing values analysis"
```

These commands create professional HTML tables with:

- **Color-coded data types** (numeric in purple, text in yellow)
- **Gradient styling** tailored to each table type
- **Automatic browser opening** for immediate viewing
- **Timestamped files** for easy reference and sharing

The tables open automatically in your default browser, making it easy to share data insights with colleagues or include in presentations. These practices help you build data quality workflows that scale with your needs while remaining accessible to those with varying technical backgrounds.

## File Support

Pointblank works with many major data file formats, making it easy to validate data regardless of how it's stored. This support means you can maintain consistent validation practices across your entire data ecosystem.

| Type | Extensions | Example | Backend Support |
|------|------------|---------|-----------------|
| **CSV** | `.csv` | `sales_data.csv` | pandas, polars |
| **Parquet** | `.parquet` | `big_data.parquet` | pandas, polars |
| **JSON** | `.json` | `api_response.json` | pandas, polars |
| **JSON Lines** | `.jsonl` | `streaming_data.jsonl` | pandas, polars |

The consistent natural language interface works the same regardless of file format, so you can focus on validation logic rather than technical details. Polars provides faster processing for large datasets, while Pandas offers broader format support.

## Quick Troubleshooting

When you encounter issues, these quick fixes resolve the most common problems. Furthermore, the natural language interface means you can always ask for help and explanations.

| Problem | Quick Fix |
|---------|-----------|
| "File not found" | Use absolute path: `/Users/name/Downloads/data.csv` |
| "DataFrame not found" | Check loaded datasets with "List my loaded dataframes" |
| "Validator not found" | Use "List active validators" to see available validators |
| "Validation too slow" | Try "Use pandas backend" or sample your data first |
| "HTML tables won't open" | Check your default browser settings |
| "Need validation ideas" | Ask "Show me validation templates" or "Suggest validations for my data" |

**Browser Issues**: The HTML tables automatically open in your default browser. If they don't appear, check that your browser isn't blocking pop-ups and that you have a default browser set in your system preferences.

Remember, you can always ask the AI to explain what's happening or suggest solutions when you run into problems.

## Now You're Ready!

You now have everything needed to start validating data through conversation. The beauty of Pointblank's MCP server is that it grows with your expertise: start with simple commands and gradually build more sophisticated validation workflows as you become comfortable with the interface. The AI will guide you through the process and help you create robust data quality checks!