bayescart package
Submodules
bayescart.bcart module
Bayesian CART (BCART) models.
This module implements the BCART class and several subclasses to perform Bayesian inference on classification and regression trees. Subclasses address performance improvements, it is recommended to use these.
- class bayescart.bcart.BCART(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
object
Base class for Bayesian CART (Classification and Regression Trees).
This class provides methods for initializing the tree, running the Markov chain Monte Carlo (MCMC) sampling, computing likelihoods and priors, resampling parameters, and updating tree moves (grow, prune, change, swap).
Specific MCMC-methods are abstract and should be implemented in derived classes.
- Parameters:
X (pd.DataFrame) – Feature data.
y (pd.Series) – Response data.
alpha (float, optional) – Hyperparameter for the tree prior (default 0.95).
beta (float, optional) – Hyperparameter controlling the split probability decay (default 0.5).
mu_0 (float or None, optional) – Initial value for the regression mean (default None, will be sampled).
sigma_0 (float or None, optional) – Initial value for the regression standard deviation (default None, will be sampled).
nu (float, optional) – Hyperparameter for the inverse gamma prior (default 3).
lambd (float, optional) – Scale hyperparameter for the inverse gamma prior (default 0.1).
mu_bar (float, optional) – Prior mean for regression (default 0).
a (float, optional) – Hyperparameter (default 0.3).
node_min_size (int, optional) – Minimum observations per node (default 5).
alpha_dirichlet (NDArrayLike, optional) – Parameter(s) for the Dirichlet prior in classification (default 1.0).
iters (int, optional) – Total number of MCMC iterations (default 1250).
burnin (int, optional) – Number of burn-in iterations (default 250).
thinning (int, optional) – Thinning factor (default 1).
store_tree_spacing (int, optional) – Spacing for storing tree copies (default 10).
max_stored_trees (int, optional) – Maximum number of stored trees (default 2000). Caps the total memory consumption.
move_prob (Sequence[float], optional) – Probabilities for the moves (grow, prune, change, swap) (default [0.25,0.25,0.25,0.25]).
verbose (str, optional) – Verbosity level.
seed (int or np.random.Generator, optional) – Random seed or generator (default 45).
debug (bool, optional) – If True, enables debugging assertions (default False).
light (bool, optional) – If True, uses lighter copies for speed (default True).
- is_classification
True if running a classification task.
- Type:
bool
- K
Number of classes (if classification).
- Type:
int
Initialize the BCART object with model and MCMC parameters.
- __init__(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Initialize the BCART object with model and MCMC parameters.
- calc_change_swap_mh_data(transition_p)[source]
Compute and store move-specific information needed for the change/swap move acceptance ratio.
- Parameters:
transition_p (float) – The transition probability ratio for the move.
- calc_grow_prune_mh_data(new_tree, node_to_split, tot_parents_with_two_leaves, b)[source]
Compute and store move-specific information needed for the grow/prune move acceptance ratio.
- calc_llik(tree)[source]
Compute the integrated log-likelihood P(Y|X,T) for the tree.
- Parameters:
tree (Tree)
- Returns:
The integrated log-likelihood.
- Return type:
float
- calc_log_tree_prob(tree)[source]
Compute the log prior probability of the tree structure P(T|X).
This is the prob of sampling a specific tree given a dataset x
- Parameters:
tree (Tree)
- Returns:
The log prior probability.
- Return type:
float
- change_copy(*args, light=False, **kwargs)[source]
Pick an internal node and change its splitting rule. Return a copied tree.
- Returns:
The new tree after the change move.
- Return type:
- copy()[source]
Create a copy of the BCART object.
- Returns:
A copy of the current BCART instance.
- Return type:
Self
- Raises:
NotImplementedError – To be implemented.
- deep_copy()[source]
Create a deep copy of the BCART object.
- Returns:
A deep copy.
- Return type:
Self
- Raises:
NotImplementedError – To be implemented.
- get_acceptance_prob(new_tree, move)[source]
Compute the acceptance probability for a proposed tree move.
- Parameters:
new_tree (Tree) – The proposed tree after a move.
move (str) – The move type.
- Returns:
The acceptance probability.
- Return type:
float
- get_leaves_counts(tree)[source]
For classification, compute the counts of observations per class in each leaf.
- Parameters:
tree (Tree)
- Returns:
Array of shape (n_leaves, n_classes).
- Return type:
NDArrayInt
- get_leaves_means(tree)[source]
For regression, compute for each leaf the number of observations, the mean, and un-normalized variance.
- Parameters:
tree (Tree)
- Returns:
(n_obs per leaf, means per leaf, un-normalized variances per leaf)
- Return type:
tuple
- get_log_posterior_prob(tree)[source]
Compute the log posterior probability of a tree (up to a normalizing constant).
This includes the full conditionals for leaf parameters, the tree prior, and the integrated likelihood.
P( heta, T | X, Y) prop P( heta | T, X, Y) * P(T | X, Y) P( heta | T, X, Y) = P(mu | T, X, Y, sigma) * P(sigma | T, X, Y) P(T | X, Y) prop P(Y | T, X) * P(T | X)
- Parameters:
tree (Tree)
- Returns:
The log posterior probability.
- Return type:
float
- get_p_split(depth)[source]
Compute the probability of splitting a node at a given depth.
- Parameters:
depth (int) – The node depth.
- Returns:
The splitting probability.
- Return type:
float
- get_posterior_params_mu(leaf, new_sigma)[source]
Get the posterior parameters for mu for a leaf (regression).
- Parameters:
leaf (Node) – The leaf node.
new_sigma (float) – The resampled sigma value.
- Returns:
(m, std) for the normal full conditional.
- Return type:
tuple
- get_posterior_params_p(leaf)[source]
Get the posterior Dirichlet parameters for a leaf (classification).
- Parameters:
leaf (Node)
- Returns:
The updated Dirichlet parameters.
- Return type:
NDArrayFloat
- get_posterior_params_sigma(leaf)[source]
Get the posterior parameters for sigma for a leaf (regression).
- Parameters:
leaf (Node)
- Returns:
(a, scale) parameters for the inverse gamma full conditional.
- Return type:
tuple
- grow_copy(*args, light=False, **kwargs)[source]
Pick a leaf which has > node_min_size observations. Split it using the splitting rule. Return a copied tree.
- Returns:
The new tree after the grow move.
- Return type:
- prune_copy(*args, light=False, **kwargs)[source]
Pick a node with two leaves as children and prune them. Return a copied tree.
- Returns:
The new tree after the prune move.
- Return type:
- resample_mu_sigma_old()[source]
(Deprecated) Resample mu and sigma using the old full conditional formulas.
- Raises:
Exception – Always raises an exception indicating deprecation.
- run()[source]
Run the MCMC algorithm for Bayesian CART.
- Returns:
A dictionary containing stored trees, integrated likelihoods, terminal region counts, cache counters, setup parameters, and data information.
- Return type:
dict
- sample_leaf_params()[source]
Sample new parameters for the leaves from the prior full conditionals.
- Returns:
(left_leaf_params, right_leaf_params)
- Return type:
tuple
- sample_prior_mu(sigma)[source]
Sample an initial mu value from the normal prior.
- Parameters:
sigma (float) – The sigma value used in the variance of the normal prior.
- Returns:
The sampled mu.
- Return type:
float
- sample_prior_p()[source]
Sample initial class probabilities from a Dirichlet prior.
- Returns:
The sampled probability vector.
- Return type:
NDArrayFloat
- sample_prior_sigma()[source]
Sample an initial sigma value from the inverse gamma prior.
- Returns:
The sampled sigma.
- Return type:
float
- swap_copy(*args, light=False, **kwargs)[source]
Pick a parent-child of internal nodes and swap their splitting rules. If the otehr child has the same rule, then swap parent with both children. Return a copied tree.
- Returns:
The new tree after the swap move.
- Return type:
- trans_prob_log(move, new_tree, old_tree)[source]
(Abstract in this base class) Compute the log transition probability for a move.
- Parameters:
- Returns:
The log transition probability.
- Return type:
float
- Raises:
AbstractMethodError – Always raised; to be implemented in a subclass.
- class bayescart.bcart.BCARTClassic(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCARTFast
,BCARTClassicSlow
BCART implementation using the classic CGM98 algorithm.
Initialize the BCART object with model and MCMC parameters.
- class bayescart.bcart.BCARTClassicSlow(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCART
A slower implementation of BCART following the original CGM98 algorithm.
This can be made more efficient by improving underlying operations in the BCART class.
Initialize the BCART object with model and MCMC parameters.
- calc_change_swap_mh_data(transition_p)[source]
Compute and store move-specific information needed for the change/swap move acceptance ratio.
- Parameters:
transition_p (float) – The transition probability ratio for the move.
- calc_grow_prune_mh_data(new_tree, node_to_split, tot_parents_with_two_leaves, b)[source]
Compute and store move-specific information needed for the grow/prune move acceptance ratio.
- trans_prob_log(move, new_tree, old_tree)[source]
(Abstract in this base class) Compute the log transition probability for a move.
- Parameters:
- Returns:
The log transition probability.
- Return type:
float
- Raises:
AbstractMethodError – Always raised; to be implemented in a subclass.
- class bayescart.bcart.BCARTFast(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCART
Optimized version of BCART that uses the “fast” classes.
Optimizations: - Improved memory usage on copy operations (“fast” classes) - Using custom sampling functions instead of scipy’s for speed Optimized for memory.
Initialize the BCART object with model and MCMC parameters.
- __init__(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Initialize the BCART object with model and MCMC parameters.
- get_log_posterior_prob(tree)[source]
Compute posterior probability for a tree (up to norming constant) P( heta, T | X, Y) prop P( heta | T, X, Y) * P(T | X, Y) P( heta | T, X, Y) = P(mu | T, X, Y, sigma) * P(sigma | T, X, Y) P(T | X, Y) prop P(Y | T, X) * P(T | X)
- Return type:
float
bayescart.eval module
Utility functions for BCART experiments evaluation and analysis.
This module contains helper functions for computing tree probabilities, comparing trees from BCART runs and provide summary tables of estimated posterior probabilities for the most commonly visited trees. This can be used to gauge convergence.
- bayescart.eval.calc_log_tree_prob(tree, run)[source]
Calculate the log prior probability of a tree given a run result.
- Parameters:
tree (Tree) – The tree for which to compute the prior probability.
run (dict) – A run result dictionary containing configuration and data information.
- Returns:
The log prior probability P(T|X) of the tree.
- Return type:
float
- bayescart.eval.calc_log_tree_prob_and_llik(tree, run)[source]
Calculate both the log prior probability and the integrated log-likelihood of a tree.
- Parameters:
tree (Tree) – The tree for which to compute the probabilities.
run (dict) – A run result dictionary containing configuration and data information.
- Returns:
A tuple (prior, llik) where ‘prior’ is the log prior probability and ‘llik’ is the integrated log-likelihood of the tree.
- Return type:
tuple of float
- bayescart.eval.calc_tree_llik(tree, run)[source]
Calculate the integrated log-likelihood of a tree given a run result.
- Parameters:
tree (Tree) – The tree for which to compute the likelihood.
run (dict) – A run result dictionary containing configuration and data information.
- Returns:
The integrated log-likelihood P(Y|X,T) of the tree.
- Return type:
float
- bayescart.eval.calc_tree_post_prob(tree, run)[source]
Calculate the (log) posterior probability of a tree given a run result.
- Parameters:
tree (Tree) – The tree for which to compute the posterior probability.
run (dict) – A run result dictionary containing configuration and data information.
- Returns:
The log posterior probability of the given tree.
- Return type:
float
- bayescart.eval.compare_trees(tree1, tree2, res_or_run, type='prob')[source]
Compare two trees to check if they are equivalent under a given mode.
There are different levels of comparison: - Basic, using buil-in tree comaprison (hard=0) - Probability, using data likelihood and tree prior (hard=1). Note that different likelihoods imply different partitions, with high probability. Instead, the prior should tell apart trees that have same partition but different structure so that the two trees are not probabilistically equivalent.
- Parameters:
tree1 (Tree) – The first tree to compare.
tree2 (Tree) – The second tree to compare.
res_or_run (dict or list) – Either a single run result dictionary or a list of run results, used to extract necessary configuration and data.
type (str, optional) – Comparison type (‘basic’ or ‘prob’). Default is ‘prob’.
- Returns:
True if the trees are considered equal under the chosen criteria, False otherwise.
- Return type:
bool
- bayescart.eval.produce_tree_table(res)[source]
Produce a summary table of tree posterior probability estimates across runs.
- This function:
Extracts the top 5 most frequent trees from each run.
Pools all top trees and computes a unique set.
Sorts the unique trees by their average empirical frequency (posterior probability) across runs.
Constructs a table with each unique tree’s estimated frequency per run, mean, standard deviation, and number of terminal nodes.
- Parameters:
res (list) – A list of run result dictionaries.
- Returns:
A pandas DataFrame summarizing the frequency estimates for the most frequent trees.
- Return type:
pd.DataFrame
- bayescart.eval.summarize_trees(run, top_n=None, type='prob', plot=False)[source]
Summarize the most common trees from a single run.
The function extracts the stored ‘tree_store’ from the run result, summarizes the top trees using frequency of occurrence (by the chosen comparison type), and optionally plots each unique tree.
- Parameters:
run (dict) – A run result dictionary containing a ‘tree_store’ key.
top_n (int or None, optional) – The number of top trees to extract. If None, all unique trees are returned.
type (str, optional) – The type of comparison to use (‘prob’ or ‘basic’). Default is ‘prob’.
plot (bool, optional) – If True, each unique tree is displayed using its show() method. Default is False.
- Returns:
A tuple (freq_list, unique_trees) representing the frequencies and corresponding unique trees.
- Return type:
tuple
bayescart.exceptions module
Custom exceptions for bayescart.
This module defines exceptions used by bayescart to signal errors in tree operations and abstract method implementations.
- exception bayescart.exceptions.AbstractMethodError[source]
Bases:
Exception
Exception raised when an abstract method has not been implemented.
bayescart.mytyping module
bayescart.node module
Node class for bayescart.
This module defines the Node class used to represent nodes in a tree. It extends the Node implementation from treelib.
- class bayescart.node.Node(id, is_l, data, rng, debug)[source]
Bases:
Node
Extended node class for Bayesian CART. Provides functionalities for adding data, splitting, and updating parameters.
This class is mostly a wrapper around the NodeData object, which handles the data and parameters associated with the node.
- is_l
Flag indicating if this node is a left child.
- Type:
bool
- _rng
The random number generator.
- Type:
np.random.Generator
- debug
If True, enable debug checks.
- Type:
bool
- _depth
The depth of the node.
- Type:
int
Create a new Node object to be placed inside a Tree object
- __init__(id, is_l, data, rng, debug)[source]
Create a new Node object to be placed inside a Tree object
- calc_avail_split_and_vars()[source]
Compute the number of available variables for splitting, and the number of available splits for the current split variable.
- Returns:
(number of available variables, number of available splits)
- Return type:
tuple
- count_values()[source]
Count the of observations per class.
- Returns:
Array of counts per class.
- Return type:
NDArrayInt
- property depth
- get_available_splits(*args, **kw)[source]
Get available splits from the underlying NodeData.
- Returns:
(avail_vars, is_cat)
- Return type:
tuple
- get_data_averages()[source]
Compute data averages (number of observations, mean, un-normalized variance) for regression.
- Returns:
(n, mean, un-normalized variance)
- Return type:
tuple
- get_data_split(split_var=None, split_val=None)[source]
Split the node’s data using the current (or provided) split rule. Returns the left and right data subsets.
- Parameters:
split_var (str or None, optional) – The splitting feature. If None, uses the current split_var.
split_val (Sequence[T] or T or None, optional) – The splitting value(s). If None, uses the current split_set.
- Returns:
(left_X, right_X, left_y, right_y)
- Return type:
tuple
- get_new_split()[source]
Sample a new split for the node.
- Returns:
(split_var, split_val)
- Return type:
tuple
- get_preds()[source]
Get the predictions from the node’s model.
- Returns:
(indices, predictions)
- Return type:
tuple
- get_split_data(split_var, split_val, left_params, right_params)[source]
Split the data at this node into two parts. Returns the children.
- Parameters:
split_var (str) – The feature to split on.
split_val (Sequence[T] or T) – The split rule.
left_params (Any) – Parameters for the left child.
right_params (Any) – Parameters for the right child.
- Returns:
(left_node_data, right_node_data)
- Return type:
tuple
- get_split_info()[source]
Retrieve the current split rule: split variable (str), split value (array if cat, float else), whether the split is categorical.
- Returns:
(split_var, split_val, is_cat)
- Return type:
tuple
- get_true_preds()[source]
Return the true predictions (i.e. the actual response values).
- Returns:
(indices, true response values)
- Return type:
tuple
- property id
bayescart.node_data module
Node data classes for bayescart.
This module defines classes that encapsulate the data associated with a tree node. It provides a base class NodeData along with subclasses for regression (NodeDataRegression) and classification (NodeDataClassification).
- class bayescart.node_data.NodeData(X, y, rng, debug, node_min_size, split_var=None, split_set=None, is_cat_split=None)[source]
Bases:
object
Abstract base class for data associated with a node in the CART tree.
- X
The data features at this node.
- Type:
pd.DataFrame
- y
The response variable at this node.
- Type:
pd.Series
- rng
Random number generator for sampling.
- Type:
np.random.Generator
- debug
If True, enables additional debugging checks.
- Type:
bool
- node_min_size
Minimum number of observations required in the node.
- Type:
int
- split_var
The variable used for splitting (if any).
- Type:
str
- split_set
The value(s) used in the splitting rule.
- Type:
Sequence[T] or T
- is_cat_split
Indicator whether the split is categorical.
- Type:
bool
- avail_splits
Cached available splits (initially None).
- Type:
Any
- property X: DataFrame
- __init__(X, y, rng, debug, node_min_size, split_var=None, split_set=None, is_cat_split=None)[source]
- calc_avail_split_and_vars()[source]
Compute the number of available variables and the number of available splits for the current split variable.
- Returns:
A tuple (n_avail_vars, n_splits).
- Return type:
tuple
- count_values()[source]
Compute the number of observations per class (classification only).
- Returns:
Array of counts for each class.
- Return type:
NDArrayInt
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_available_splits(force_eval=False, assert_cached=False)[source]
For each feature in X, return the available unique splits.
- Parameters:
force_eval (bool, optional) – If True, force re-evaluation of available splits (default is False).
assert_cached (bool, optional) – If True, raise an error if a cached value is not available (default is False).
- Returns:
- A tuple containing:
A dictionary mapping feature names to sorted unique available split values.
A dictionary mapping feature names to booleans indicating if the feature is categorical.
- Return type:
tuple
- get_data_averages()[source]
Compute the number of observations, the mean, and un-normalized variance (regression only).
- Returns:
(n, mean, un-normalized variance)
- Return type:
tuple
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_data_split(split_var, split_val)[source]
Split the data into left and right subsets based on the split rule. Returns the left and right data subsets.
- Parameters:
split_var (str) – The feature on which to split.
split_val (Sequence[T] or T) – The split value(s).
- Returns:
A tuple (left_X, right_X, left_y, right_y).
- Return type:
tuple
- static get_nobs_from_arg(x)[source]
Get the number of observations from a DataFrame or Series.
- Parameters:
x (pd.DataFrame or pd.Series)
- Returns:
The number of rows (observations).
- Return type:
int
- get_params(print=False)[source]
Retrieve the node-specific parameters (e.g., mu and sigma for regression, or p for classification).
- Parameters:
print (bool, optional) – If True, return a formatted string (default is False).
- Returns:
The node parameters.
- Return type:
Any
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_preds()[source]
Compute the model predictions (e.g., posterior mean).
- Returns:
A tuple (indices, predictions).
- Return type:
tuple
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_split_data(split_var, split_val, left_params, right_params)[source]
Split the current node’s data into two subsets (left and right) based on a given split rule and generate new NodeData objects for the children.
- Parameters:
split_var (str) – The feature on which to split.
split_val (Sequence[T] or T) – The split value(s).
left_params (Any) – Parameters for the left child.
right_params (Any) – Parameters for the right child.
- Returns:
A tuple (l_node_data, r_node_data) corresponding to the left and right NodeData objects.
- Return type:
tuple
- Raises:
AbstractMethodError – Always raised; this method should be implemented in a subclass.
- get_split_set(print=False)[source]
Get the splitting value(s) for this node.
- Parameters:
print (bool, optional) – If True, return a formatted string description (default is False).
- Returns:
The splitting value(s) or a description thereof.
- Return type:
Sequence[T] or T
- reset_avail_splits()[source]
Reset the cached available splits. Necessary whenever the underlying dataset changes
- sample_split(split_var, avail_vals)[source]
- Return type:
tuple
[bool
,Union
[Sequence
[TypeVar
(T
,str
,int
,float
)],TypeVar
(T
,str
,int
,float
)]]
- update_node_params(params)[source]
Update the node parameters with new values.
- Parameters:
params (tuple[float, float] or NDArrayFloat) – The new parameter values.
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- property y: Series
- class bayescart.node_data.NodeDataClassification(p, *args, **kwargs)[source]
Bases:
NodeData
Node data class for classification trees.
- p
The probability vector for the classes.
- Type:
NDArrayFloat
- count_values()[source]
Compute the number of observations per class (classification only).
- Returns:
Array of counts for each class.
- Return type:
NDArrayInt
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_params(print=False)[source]
Retrieve the node-specific parameters (e.g., mu and sigma for regression, or p for classification).
- Parameters:
print (bool, optional) – If True, return a formatted string (default is False).
- Returns:
The node parameters.
- Return type:
Any
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_split_data(split_var, split_val, left_params, right_params)[source]
Split the data for classification and generate new NodeDataClassification objects.
- Parameters:
split_var (str) – The feature to split on.
split_val (Sequence[T] or T) – The splitting rule.
left_params (NDArrayFloat) – New class probability vector for the left child.
right_params (NDArrayFloat) – New class probability vector for the right child.
- Returns:
(l_node_data, r_node_data)
- Return type:
tuple
- update_node_params(params)[source]
Update the node parameters with new values.
- Parameters:
params (tuple[float, float] or NDArrayFloat) – The new parameter values.
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- class bayescart.node_data.NodeDataClassificationFast(p, *args, **kwargs)[source]
Bases:
NodeDataFast
,NodeDataClassification
- class bayescart.node_data.NodeDataFast(X, y, rng, debug, node_min_size, split_var=None, split_set=None, is_cat_split=None)[source]
Bases:
NodeData
Efficient implementation of NodeData, reduces the cost of copy operations.
- class bayescart.node_data.NodeDataRegression(mu, sigma, *args, **kwargs)[source]
Bases:
NodeData
Node data class for regression trees.
- mu
The mean parameter.
- Type:
float
- sigma
The standard deviation parameter.
- Type:
float
- get_data_averages()[source]
Compute the number of observations, mean, and un-normalized variance for regression.
- Returns:
(n, mean, uvar)
- Return type:
tuple
- get_params(print=False)[source]
Retrieve the node-specific parameters (e.g., mu and sigma for regression, or p for classification).
- Parameters:
print (bool, optional) – If True, return a formatted string (default is False).
- Returns:
The node parameters.
- Return type:
Any
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- get_preds()[source]
Get predictions for regression.
- Returns:
(indices, predicted constant value equal to mu)
- Return type:
tuple
- get_split_data(split_var, split_val, left_params, right_params)[source]
Split the data for regression and generate new NodeDataRegression objects for the left and right children.
- Parameters:
split_var (str) – The feature to split on.
split_val (Sequence[T] or T) – The splitting threshold (or set).
left_params (tuple[float, float]) – Parameters (mu, sigma) for the left child.
right_params (tuple[float, float]) – Parameters (mu, sigma) for the right child.
- Returns:
(l_node_data, r_node_data)
- Return type:
tuple
- update_node_params(params)[source]
Update the node parameters with new values.
- Parameters:
params (tuple[float, float] or NDArrayFloat) – The new parameter values.
- Raises:
AbstractMethodError – Always raised; should be implemented in a subclass.
- class bayescart.node_data.NodeDataRegressionFast(mu, sigma, *args, **kwargs)[source]
Bases:
NodeDataFast
,NodeDataRegression
bayescart.parallel_tempering_base module
Bayesian CART (BCART) models solved using parallel tempering.
This module implements the parallel tempering algorithm for addressing the multimodality of BCART posterior space. This is an abstract base class. Subclasses will implement the specific parallel tempering logic, such as how to apply flattening of the posterior distribution.
- class bayescart.parallel_tempering_base.BCARTPT(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCARTClassic
Base class for Parallel Tempering variants of BCART.
Supports staggered activation of chains through steps_per_chain. This can help warmer chains to converge before adding more chains closer to the cold one (target chain).
- Parameters:
temps (Sequence[float] or NDArrayFloat) – List of temperatures for each chain. One of the ends must be 1.
steps_per_chain (Sequence[int] or NDArray[int] or None, optional) – Number of iterations to run before unlocking additional chains.
swap_type (str, optional) – Swap type: ‘stochastic’ or deterministic (default ‘stochastic’).
Initialize the BCART object with model and MCMC parameters.
- __init__(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)
Initialize the BCART object with model and MCMC parameters.
- get_acceptance_prob(new_tree, move)[source]
Compute the acceptance probability for a move in the parallel tempering framework.
- Parameters:
new_tree (Tree) – The proposed tree.
move (str) – The move type.
- Returns:
The acceptance probability.
- Return type:
float
- get_swap_acceptance_prob(tree_from, tree_to)[source]
Compute the swap acceptance probability between two chains.
- Parameters:
- Returns:
The swap acceptance probability.
- Return type:
float
- Raises:
NotImplementedError – This method must be implemented in subclasses.
- run()[source]
Run the Parallel Tempering MCMC sampler.
- Returns:
A dictionary containing MCMC results and tempering statistics.
- Return type:
dict
- tempering = ''
bayescart.tempering_geom_lik module
Bayesian CART (BCART) models solved using geometric-likelihood parallel tempering. Only the likelihood is flattened, thereby preserving prior information on the importnance of different tree sizes. The prior is used instrumentally to guide the search over small trees.
Disentangling the likelihood from the prior leads to sensible analytical simplifications and performacne gains (also in terms of convergence). However, this approach loses the ability to leverage prior inforamtion, as it is now used to constrain smaller trees in heated chains.
This module implements the geometric-likelihood parallel tempering algorithm for addressing the multimodality of BCART posterior space.
- class bayescart.tempering_geom_lik.BCARTGeomLik(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCARTPT
Geometric-Likelihood parallel tempering implementation.
In this variant, only the likelihood is flattened, while the prior remains unchanged.
Initialize the BCART object with model and MCMC parameters.
- __init__(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)
Initialize the BCART object with model and MCMC parameters.
bayescart.tempering_geometric module
Bayesian CART (BCART) models solved using geometric parallel tempering, the whole posterior distribution is flattened uniformly.
This module implements the geometric parallel tempering algorithm for addressing the multimodality of BCART posterior space.
- class bayescart.tempering_geometric.BCARTGeom(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCARTPT
Geometric parallel tempering implementation of Bayesian CART.
In this variant, the posterior of each chain is raised to a power controlled by the temperature. This yields a flattening effect. Chains furthest from the original one (t=1) are designed to explore the space fully. However, this leads to the search over very big trees, which are computationally expensive. It also slows down the convergence of the original chain (beyond runtime, because it is difficult to undo big trees).
Initialize the BCART object with model and MCMC parameters.
- __init__(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)
Initialize the BCART object with model and MCMC parameters.
- calc_change_swap_mh_data(transition_p)[source]
Compute and store move-specific information needed for the change/swap move acceptance ratio.
- Parameters:
transition_p (float) – The transition probability ratio for the move.
- calc_grow_prune_mh_data(new_tree, node_to_split, tot_parents_with_two_leaves, b)[source]
Compute and store move-specific information needed for the grow/prune move acceptance ratio.
- get_swap_acceptance_prob(tree_from, tree_to)[source]
Compute the acceptance probability for a swap between chains in the geometric tempering variant.
- trans_prob_log(move, new_tree, old_tree)[source]
(Abstract in this base class) Compute the log transition probability for a move.
- Parameters:
- Returns:
The log transition probability.
- Return type:
float
- Raises:
AbstractMethodError – Always raised; to be implemented in a subclass.
bayescart.tempering_pseudo_prior module
Bayesian CART (BCART) models solved using tempering with a pseudo-prior. Only the likelihood is flattened, thereby preserving prior information on the importnance of different tree sizes. The prior is left unchanged as it should be used to inform the model with previous information. Instead, a pseudo-prior is used instrumentally to guide the search over small trees.
We take as pseudo prior the same CGM98 prior to benefit from substantial cancellations in computing acceptance rates.
This module implements the tempering with a pseudo-prior algorithm for addressing the multimodality of BCART posterior space.
- class bayescart.tempering_pseudo_prior.BCARTPseudoPrior(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)[source]
Bases:
BCARTPT
Bayesian CART with a pseudo-prior used to bias the search towards smaller trees.
The pseudo-prior is used instrumentally in computing the acceptance probabilities.
- Parameters:
pprior_alpha (float, optional) – Pseudo-prior alpha parameter (default 0.95).
pprior_beta (float, optional) – Pseudo-prior beta parameter (default 1).
Initialize the BCART object with model and MCMC parameters.
- __init__(X, y, alpha=0.95, beta=0.5, mu_0=None, sigma_0=None, nu=3, lambd=0.1, mu_bar=0, a=0.3, node_min_size=5, alpha_dirichlet=1.0, iters=1250, burnin=250, thinning=1, store_tree_spacing=10, max_stored_trees=2000, move_prob=[0.25, 0.25, 0.25, 0.25], verbose='', seed=45, debug=False, light=True)
Initialize the BCART object with model and MCMC parameters.
- calc_log_p_split(tree, alpha, beta)[source]
Compute the log prior probability of the tree structure (ignoring the specific split values).
- Parameters:
tree (Tree)
alpha (float)
beta (float)
- Returns:
The computed log prior probability of the tree structure.
- Return type:
float
- get_acceptance_prob(new_tree, move)[source]
Compute the acceptance probability for a move using the pseudo-prior adjustment.
- Parameters:
new_tree (Tree)
move (str)
- Returns:
The acceptance probability.
- Return type:
float
- get_swap_acceptance_prob(tree_from, tree_to)[source]
Compute the swap acceptance probability for the pseudo-prior variant.
bayescart.tree module
Tree class for bayescart.
This module defines the Tree class that extends treelib’s Tree to support operations needed for Bayesian CART such as applying splits, copying subtrees, and computing prior probabilities.
- class bayescart.tree.Tree(root_node_data, rng, node_min_size, debug)[source]
Bases:
Tree
Extended tree class for Bayesian CART that builds on the treelib Tree.
This class maintains a counter for node IDs and provides additional methods for sampling leaves, copying trees, and applying splits.
- id_counter
Counter for unique node identifiers.
- Type:
int
- rng
Random generator used for sampling.
- Type:
np.random.Generator
- node_min_size
Minimum number of observations per node.
- Type:
int
- debug
If True, enables additional assertions.
- Type:
bool
- llik
Cached integrated log-likelihood of the tree.
- Type:
float or None
- log_tree_prior_prob
Cached log prior probability of the tree.
- Type:
float or None
- temperature
The current temperature for tempering.
- Type:
float
Initiate a new tree or copy another tree with a shallow or deep copy.
- __init__(root_node_data, rng, node_min_size, debug)[source]
Initiate a new tree or copy another tree with a shallow or deep copy.
- add_node(data, is_l, parent=None)[source]
Add a new node object to the tree and make the parent as the root by default.
The ‘node’ parameter refers to an instance of Class::Node.
- Return type:
- apply_split(node, split_var, split_val, l_leaf_params, r_leaf_params)[source]
Apply a split to a leaf node by generating two children nodes.
- Parameters:
node (Node) – The leaf node to be split.
split_var (str) – The feature to split on.
split_val (Sequence[T] or T) – The split value(s).
l_leaf_params (Any) – Parameters for the left child.
r_leaf_params (Any) – Parameters for the right child.
- Raises:
InvalidTreeError – If any of the resulting children have fewer observations than the minimum.
- check_split_struct(node)[source]
Check that the splitting rules along the subtree starting at a given node do not produce empty splits.
While this function does not guarantee that the split is valid, its fast to execute and can filter some obvious incompatibilities.
- Parameters:
node (Node) – The root of the subtree to check.
- Raises:
InvalidTreeError – If an empty split is encountered.
- copy(light=False, no_data=False, memo=None)[source]
Deep copy the tree with all node info. If light, don’t deep-copy (X,y) at each node.
- Return type:
- get_children(node)[source]
Get the children of the specified node.
- Parameters:
node (Node or int) – The node or its identifier.
- Returns:
A list of child nodes (ordered as left then right).
- Return type:
list
- Raises:
ValueError – If the node does not have exactly two children (in non-leaf cases).
- get_leaves()[source]
Return all the leaf nodes of the tree.
- Returns:
A list of leaf nodes.
- Return type:
list
- get_n_leaves()[source]
Get the number of leaf nodes.
- Returns:
The count of leaves.
- Return type:
int
- get_node(node_id)[source]
Get the object of the node with ID of
nid
.An alternative way is using ‘[]’ operation on the tree. But small difference exists between them:
get_node()
will return None ifnid
is absent, whereas ‘[]’ will raiseKeyError
.- Return type:
- get_parents_with_two_leaves()[source]
Get all internal nodes that have exactly two leaves as children.
- Returns:
List of nodes satisfying the condition.
- Return type:
list
- is_equal(other, hard=0, categories=None)[source]
Check if two trees are equal in structure and (optionally) in parameters.
The hard parameter controls how strict the testing should be. Hard = 0 checks they have the same structure and same splitting variables. Hard = 1 additionally checks the splitting values. Hard = 2 expects same structure, same splits, same data in nodes. Hard = 3 expects equality also for leaf parameters.
- Parameters:
other (Tree) – The tree to compare with.
hard (int, optional) – Level of strictness (0: structure and split variable; 1: includes split values; 2: includes data; 3: also leaf parameters) (default 0).
categories (dict[str, set] or None, optional) – Mapping of categorical variable values, if needed.
- Returns:
True if the trees are equal under the chosen criteria.
- Return type:
bool
- is_valid()[source]
Check whether the tree is valid.
The function recursively checks each node for logical consistency of splits, data assignments, and node properties.
- Returns:
True if the tree passes all validity checks.
- Return type:
bool
- remove_node(node)[source]
Remove a node indicated by ‘identifier’ with all its successors. Return the number of removed nodes.
- Return type:
int
- sample_leaf(node_min_size)[source]
Sample a leaf node at random from the tree that has at least node_min_size observations.
- Parameters:
node_min_size (int) – Minimum required observations.
- Returns:
A randomly chosen leaf node.
- Return type:
- Raises:
InvalidTreeError – If no leaf with sufficient observations is found.
- show()[source]
Print the tree. Update all the tags first. Tags are not automatically updated for performance speed.
- Returns:
The string representation of the tree.
- Return type:
str
- update_split(node, split_var, split_val)[source]
Change the splitting rule for the current node. Update the data subset of the children, and of all the descendants recursively.
- Parameters:
node (Node) – The node to update.
split_var (str) – The new splitting variable.
split_val (Sequence[T] or T) – The new splitting value(s).
- class bayescart.tree.TreeFast(root_node_data, rng, node_min_size, debug)[source]
Bases:
Tree
Fast implementation of Tree using NodeFast as the underlying node class. Tree copy is also improved.
Initiate a new tree or copy another tree with a shallow or deep copy.
- update_subtree_data(node)[source]
Update the subtree data with a fast preliminary check before performing the full update.
The preliminary check does not count how many observations fall in each node; just whether the splits make logical sense. This is fast. If so, proceed with the proper, expensive check.
bayescart.utils module
Utility functions for bayescart.
This module provides helper functions for sampling from distributions, computing log-probability densities, and a generic choice function.
- bayescart.utils.dirichlet_logpdf(x, alpha)[source]
Compute the log probability density function of a Dirichlet distribution.
- Parameters:
x (array-like) – The point (vector) at which to evaluate the log-pdf.
alpha (array-like) – The concentration parameters of the Dirichlet distribution.
- Returns:
The log probability density.
- Return type:
float
- bayescart.utils.invgamma_logpdf(x, a, scale)[source]
Compute the log probability density function of an inverse gamma distribution.
- Parameters:
x (float) – The point at which to evaluate the log-pdf.
a (float) – The shape parameter.
scale (float) – The scale parameter.
- Returns:
The log probability density.
- Return type:
float
- bayescart.utils.invgamma_rvs(a, scale, rng)[source]
Sample a random variate from an inverse gamma distribution.
- Parameters:
a (float) – The shape parameter (adjusted by 1/2 in this context).
scale (float) – The scale parameter.
rng (np.random.Generator) – The random number generator.
- Returns:
A random variate from the inverse gamma distribution, scaled accordingly.
- Return type:
float
- bayescart.utils.my_choice(rng, a, replace=False, p=None)[source]
Sample a random element from a generic sequence using the provided random generator.
- Parameters:
rng (np.random.Generator) – The random number generator.
a (Sequence[elem]) – The sequence to sample from.
replace (bool, optional) – Whether the sampling is done with replacement (default is False).
p (Sequence[float] or None, optional) – The probability weights associated with each element (default is None).
- Returns:
A randomly selected element from the sequence.
- Return type:
elem
- bayescart.utils.norm_logpdf(x, a, scale)[source]
Compute the log probability density function of a normal distribution.
- Parameters:
x (float) – The point at which to evaluate the log-pdf.
a (float) – The mean of the distribution.
scale (float) – The standard deviation of the distribution.
- Returns:
The log probability density.
- Return type:
float
- bayescart.utils.plot_chain_comm(res, title='')[source]
Plot the distribution of terminal regions across the chains from a PT run.
This function uses the stored parallel tempering (PT) statistics (the number of terminal nodes per chain) from the run results and plots them as histograms. It also prints the final swap acceptance probabilities.
- Parameters:
res (dict) – A run result dictionary containing ‘PT_swap_stats’ with key ‘PT_term_reg’ and ‘PT_swap_final_prob’.
title (str, optional) – The title of the plot (default is ‘’).
- Return type:
None
- bayescart.utils.plot_hists(res, idx_range=(0, 1), title='')[source]
Plot the distribution of terminal nodes for the cold chain across runs.
The function extracts the stored terminal node counts from each run in ‘res’ and plots histograms for a selected fraction of the post-burnin iterations, where idx_range is given as (min_percentage, max_percentage).
- Parameters:
res (list or sequence of dict) – A list of run result dictionaries, each containing a ‘setup’ key and a ‘tree_term_reg’ key.
idx_range (tuple, optional) – A tuple (min_percentage, max_percentage) specifying the range (as fractions) of the available post-burnin observations to use for plotting. Default is (0,1).
title (str, optional) – The title of the plot (default is ‘’).
- Return type:
None
- bayescart.utils.plot_swap_prob(res, title='')[source]
Plot the evolution of swap acceptance probabilities over time for PT chains.
- Parameters:
res (dict) – A run result dictionary containing ‘PT_swap_stats’ with key ‘PT_swap_prob_over_time’, which is a mapping from iteration numbers to swap acceptance probabilities.
title (str, optional) – The title of the plot (default is ‘’).
- Return type:
None
Module contents
bayescart: A Python package for Bayesian CART with tempering.
This package provides classes and functions to build, sample, and evaluate Bayesian CART models with advanced tempering strategies. The package is composed of modules for node data handling, tree structures, Bayesian CART sampling, and utility functions.
- Available objects:
BCARTClassic, BCARTPT, BCARTGeom, BCARTGeomLik, BCARTPseudoPrior
Tree
Node
NodeData, NodeDataRegression, NodeDataClassification
Utility functions and plotting functions
Evaluation and functions for analysis
Exceptions: InvalidTreeError, AbstractMethodError