Introduction


TabFSBench is a benchmarking tool for feature shifts in tabular data in open-environment scenarios. It aims to analyse the performance and robustness of a model in feature shifts.

TabFSBench offers the following advantages:

  • Various Models: Tree-based models, deep-learning models, LLMs and tabular LLMs.
  • Diverse Experiments: Single shift, most/least-revelant shift and random shift.
  • Exportable Datasets: Be able to export the feature-shift version of the given dataset.
  • Addable Components: Supports to add new datasets and models, and export the given dataset under the specific experiment.

Datasets


Datasets in TabFSBench. #Numerical means numerical features. #Categorical means categorical features.

Tasks Dataset #Samples #Numerical #Categorical #Labels Link
Binary Classification credit 1,000 7 13 2 https://www.openml.org/search?type=data&sort=runs&id=31&status=active
electricity 45,312 8 0 2 https://www.kaggle.com/datasets/vstacknocopyright/electricity
heart 918 6 5 2 https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
MiniBooNE 72,998 50 0 2 https://www.kaggle.com/datasets/alexanderliapatis/miniboone
Multi-Class Classification Iris 150 4 0 3 https://www.kaggle.com/datasets/uciml/iris
penguins 345 4 2 3 https://www.kaggle.com/datasets/youssefaboelwafa/clustering-penguins-species
eye_movements 10,936 27 0 4 https://www.kaggle.com/datasets/vinnyr12/eye-movements
jannis 83,733 54 0 4 https://www.openml.org/search?type=data&status=active&id=45021
Regression abalone 4,178 7 1 \ https://www.kaggle.com/datasets/rodolfomendes/abalone-dataset
bike 10,886 6 3 \ https://www.kaggle.com/datasets/abdullapathan/bikesharingdemand
concrete 1,031 8 0 \ https://www.kaggle.com/datasets/maajdl/yeh-concret-data
laptop 1,275 8 14 \ https://www.kaggle.com/datasets/owm4096/laptop-prices

Models


TabFSBench is possible to test three kinds of models’ performance directly, including tree-based models, deep learning models and tabular LLMs. For LLMs, TabFSBnech provides text files(.json) about the given dataset that can be used directly for LLM to finetune.

Tree-based models

  1. CatBoost: A powerful boosting-based model designed for efficient handling of categorical features.
  2. LightGBM: A machine-learning model based on the Boosting algorithm.
  3. XGBoost: A machine-learning model incrementally building multiple decision trees by optimizing the loss function.

Deep learning models

We use LAMDA-TALENT to evaluate deep-learning models. You can get details from LAMDA-TALENT.

  1. MLP: A multi-layer neural network, which is implemented according to RTDL.
  2. ResNet: A DNN that uses skip connections across many layers, which is implemented according to RTDL.
  3. SNN: An MLP-like architecture utilizing the SELU activation, which facilitates the training of deeper neural networks.
  4. DANets: A neural network designed to enhance tabular data processing by grouping correlated features and reducing computational complexity.
  5. TabCaps: A capsule network that encapsulates all feature values of a record into vectorial features.
  6. DCNv2: Consists of an MLP-like module combined with a feature crossing module, which includes both linear layers and multiplications.
  7. NODE: A tree-mimic method that generalizes oblivious decision trees, combining gradient-based optimization with hierarchical representation learning.
  8. GrowNet: A gradient boosting framework that uses shallow neural networks as weak learners.
  9. TabNet: A tree-mimic method using sequential attention for feature selection, offering interpretability and self-supervised learning capabilities.
  10. TabR: A deep learning model that integrates a KNN component to enhance tabular data predictions through an efficient attention-like mechanism.
  11. ModernNCA: A deep tabular model inspired by traditional Neighbor Component Analysis, which makes predictions based on the relationships with neighbors in a learned embedding space.
  12. AutoInt: A token-based method that uses a multi-head self-attentive neural network to automatically learn high-order feature interactions.
  13. Saint: A token-based method that leverages row and column attention mechanisms for tabular data.
  14. TabTransformer: A token-based method that enhances tabular data modeling by transforming categorical features into contextual embeddings.
  15. FT-Transformer: A token-based method which transforms features to embeddings and applies a series of attention-based transformations to the embeddings.
  16. TANGOS: A regularization-based method for tabular data that uses gradient attributions to encourage neuron specialization and orthogonalization.
  17. SwitchTab: A self-supervised method tailored for tabular data that improves representation learning through an asymmetric encoder-decoder framework. Following the original paper, our toolkit uses a supervised learning form, optimizing both reconstruction and supervised loss in each epoch.
  18. TabPFN: A general model which involves the use of pre-trained deep neural networks that can be directly applied to any tabular task. TabFSBench uses the first version of TabPFN and supports to evaluate TabPFNv2 by updating the version.

LLMs

  1. Llama3-8B: Llama3-8B is released by Meta AI in April 2024.
    • Due to memory limitations, TabFSBench only provides json files for LLM fine-tuning and testing ( datasetname_train.json / datasetname_test_i.json , i means the degree of feature shifts), asking users to use LLM locally.
    • TabFSBench provides the context of Credit Dataset. Users can rewrite background, features_information, declaration and question of llm() in ./model/utils.py.

Tabular LLMs

  1. TabLLM: A framework that leverages LLMs for efficient tabular data classification.
  2. UniPredict: A framework that firstly trains on multiple datasets to acquire a rich repository of prior knowledge. UniPredict-Light model that TabFSBench used is available at Google Drive. After downloading the model, place it in ./model/tabularLLM/files/unified/models and rename it to light_state.pt.

Experimental Results


1. Most models have the limited applicability in feature-shift scenarios.

  • Most models can’t handle feature shifts well.
  • No Model Consistently Outperforms.

2. Shifted features’ importance has a linear trend with model performance degradation.

  • Single strong correlated feature shifted causes greater model performance degradation.
  • Shifted feature set’s correlations have a relationship with model performance degradation linearly.

We use performance gap to measure the model performance Gap $Delta$. Sum of shifted feature set’s correlations refers to the sum of Pearson correlation coefficients of shifted features. Notably, model performance Gap $Delta$ and sum of shifted feature set’s correlations demonstrate a strong correlation, with a Pearson correlation coefficient of $\rho$ = 0.7405.

3. Model closed-environment performance correlates with feature-shift performance.

Model closed-environment performance vs. model feature-shift performance. Closed-environment means that the dataset does not have any degree of feature shift. Feature-shift means average model performance in all degrees of feature shifts.