Intermediate

Tabular Data with FastAI

Apply deep learning to structured (tabular) data using FastAI's TabularDataLoaders. Learn how to handle categorical and continuous variables, apply preprocessing, and train models that can be competitive with gradient boosting on some datasets.

TabularDataLoaders

FastAI makes it easy to work with CSV and DataFrame data for classification and regression tasks:

Python
from fastai.tabular.all import *

# Load the Adult Income dataset
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')

# Define column types
cat_names = ['workclass', 'education', 'marital-status', 'occupation',
             'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain',
              'capital-loss', 'hours-per-week']

# Create DataLoaders with preprocessing
dls = TabularDataLoaders.from_df(
    df,
    path=path,
    procs=[Categorify, FillMissing, Normalize],
    cat_names=cat_names,
    cont_names=cont_names,
    y_names='salary',
    y_block=CategoryBlock,
    valid_idx=list(range(800, 1000)),
    bs=64
)
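
The fixed range passed to valid_idx above holds out one contiguous slice of rows. If the file happens to be ordered in some way, a shuffled holdout may be more representative. A minimal sketch using only the standard library; the helper name random_valid_idx and its defaults are hypothetical and not part of FastAI:

```python
import random

def random_valid_idx(n_rows, valid_pct=0.2, seed=42):
    """Return a sorted list of row positions to hold out for validation.

    Hypothetical helper for illustration; not a FastAI function.
    """
    if not 0 < valid_pct < 1:
        raise ValueError("valid_pct must be between 0 and 1")
    rng = random.Random(seed)            # local RNG so the split is reproducible
    n_valid = int(n_rows * valid_pct)    # number of held-out rows
    return sorted(rng.sample(range(n_rows), n_valid))

# Example with a 1000-row table; in the tutorial this would be len(df)
valid_idx = random_valid_idx(1000)
print(len(valid_idx), valid_idx[:5])
```

The resulting list can be passed as valid_idx=random_valid_idx(len(df)) in the from_df call.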

Categorical vs Continuous Variables

Type        | Examples                     | How FastAI Handles It
Categorical | Color, country, product type | Learned embeddings (like word embeddings, but for categories)
Continuous  | Age, price, temperature      | Normalized to mean 0 and standard deviation 1

Entity Embeddings: FastAI uses learned embeddings for categorical variables. This technique, introduced in the paper "Entity Embeddings of Categorical Variables," allows the model to discover meaningful representations; for example, it can learn that Monday and Tuesday are similar but Saturday is different.
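
Mechanically, an embedding is a lookup table: each category code selects one row of weights, and the selected rows are joined with the continuous values to form the network's input. A minimal pure-Python sketch, assuming categories have already been converted to integer codes. EmbeddingTable, build_input, and the two example columns are illustrative names rather than FastAI classes, and the random values stand in for weights that a real model would learn during training:

```python
import random

class EmbeddingTable:
    """A lookup table holding one vector per category code."""

    def __init__(self, n_codes, dim, seed=0):
        rng = random.Random(seed)
        # Small random starting values; training would adjust them.
        self.weights = [[rng.gauss(0.0, 0.01) for _ in range(dim)]
                        for _ in range(n_codes)]

    def lookup(self, code):
        return self.weights[code]

def build_input(cat_codes, cont_values, tables):
    """Concatenate each category's vector with the continuous values."""
    features = []
    for code, table in zip(cat_codes, tables):
        features.extend(table.lookup(code))
    features.extend(cont_values)
    return features

# Hypothetical columns: day of week (7 values plus one slot for missing)
# and a store type with 3 values plus one slot for missing.
day_table = EmbeddingTable(n_codes=8, dim=4, seed=1)
store_table = EmbeddingTable(n_codes=4, dim=2, seed=2)

x = build_input(cat_codes=[2, 1], cont_values=[0.3, -1.1],
                tables=[day_table, store_table])
print(len(x))  # 4 + 2 embedding values, then 2 continuous values
```

Because the vectors are ordinary weights, categories that behave alike with respect to the target can end up with nearby vectors after training.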

Preprocessing Transforms

Python
# Built-in preprocessing transforms
procs = [
    Categorify,    # Map each category to an integer code
    FillMissing,   # Fill missing continuous values (median by default)
    Normalize,     # Scale continuous columns to mean 0, std 1
]

# For each continuous column with missing values, FillMissing also adds
# a boolean indicator column (e.g., age_na) so the model can tell when
# a value was filled in
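
To make the three steps concrete, here is a rough pure-Python sketch of what each transform does conceptually. The function names are hypothetical stand-ins, not FastAI's implementation. The sketch assumes missing values are represented as None, reserves code 0 for missing categories, and uses the population standard deviation:

```python
from statistics import mean, median, pstdev

def categorify(values):
    """Map each distinct category to an integer code; 0 means missing."""
    vocab = sorted({v for v in values if v is not None})
    code_of = {v: i + 1 for i, v in enumerate(vocab)}
    return [code_of.get(v, 0) for v in values], vocab

def fill_missing(values):
    """Replace None with the median and report which rows were filled."""
    fill = median(v for v in values if v is not None)
    was_missing = [v is None for v in values]
    return [fill if v is None else v for v in values], was_missing

def normalize(values):
    """Shift and scale so the column has mean 0 and std 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy columns for illustration only
workclass = ["Private", "State-gov", None, "Private"]
age = [39.0, None, 50.0, 28.0]

codes, vocab = categorify(workclass)
filled, age_na = fill_missing(age)
scaled = normalize(filled)

print(codes, vocab)
print(filled, age_na)
print(scaled)
```

In a real pipeline the vocabulary, median, mean, and standard deviation would normally be computed on the training rows only and then reused unchanged on validation data.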

Training a Tabular Model

Python
# Create learner
learn = tabular_learner(
    dls,
    layers=[200, 100],   # Hidden layer sizes
    metrics=accuracy
)

# Plot loss against learning rate and get a suggested value
learn.lr_find()

# Train
learn.fit_one_cycle(5, 1e-2)

# Make predictions
row, clas, probs = learn.predict(df.iloc[0])
print(f"Prediction: {clas}, Probabilities: {probs}")

Feature Engineering Tips

  • Date features: Use add_datepart(df, 'date_column') to automatically extract year, month, day, day-of-week, etc.
  • High-cardinality categories: Embeddings give each category a compact learned vector, so one-hot encoding is usually unnecessary.
  • Missing values: FillMissing creates indicator columns, letting the model learn when data is missing.
  • Embedding sizes: FastAI automatically chooses embedding dimensions based on each column's cardinality, or you can override them by passing emb_szs (a dict mapping column name to size) to tabular_learner.
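
For a sense of how embedding width could scale with cardinality, the sketch below reproduces the rule of thumb commonly cited from FastAI's source, min(600, round(1.6 * n_cat ** 0.56)). Treat the exact constants as an assumption and check get_emb_sz in your installed version before relying on them:

```python
def emb_sz_rule(n_cat):
    """Heuristic embedding width for a column with n_cat distinct codes.

    Assumed to mirror FastAI's default rule of thumb; verify against
    your installed version.
    """
    return min(600, round(1.6 * n_cat ** 0.56))

# Width grows slowly with cardinality and is capped at 600
for n_cat in (3, 10, 100, 10_000):
    print(n_cat, emb_sz_rule(n_cat))
```

The sub-linear growth is the point: a column with thousands of distinct values gets a vector of modest width, rather than thousands of one-hot inputs.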

Next Up: NLP

Learn how to apply FastAI to text classification and language model fine-tuning.
