Tutorial 03 — Regression

Open in Colab

Full-featured regression analysis in a single function call. Vectrix supports R-style formula syntax (y ~ x1 + x2), automatic diagnostics (heteroscedasticity, normality, multicollinearity), multiple methods (OLS, Ridge, Lasso, Robust), and prediction intervals — without leaving the Python ecosystem.

Direct Input

The simplest form: pass y and X directly as numpy arrays. Vectrix runs OLS by default with automatic constant term

import numpy as np
from vectrix import regress

np.random.seed(42)
X = np.random.randn(100, 2)
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + np.random.randn(100) * 0.5

model = regress(y=y, X=X)

Expected output:

=== Regression Summary ===
Method: OLS
Observations: 100
R-squared: 0.954
Adj. R-squared: 0.953
F-statistic: 1012.35 (p < 0.001)

              Coef    Std.Err    t-value    P>|t|
Intercept    3.012      0.050     60.24    0.000 ***
x1           1.987      0.052     38.21    0.000 ***
x2          -1.493      0.048    -31.10    0.000 ***

Formula Mode

With a DataFrame, use R-style formulas for a more natural interface

import pandas as pd
from vectrix import regress

df = pd.DataFrame({
    "sales": [100, 150, 200, 180, 250, 300, 280, 350, 400, 380],
    "ads": [10, 15, 20, 18, 25, 30, 28, 35, 40, 38],
    "price": [50, 48, 45, 47, 42, 40, 41, 38, 35, 36],
    "promo": [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
})

model = regress(data=df, formula="sales ~ ads + price + promo")

Formula Syntax

regress(data=df, formula="y ~ x1 + x2")       # Specific variables
regress(data=df, formula="y ~ .")              # All other columns
regress(data=df, formula="y ~ x1 * x2")       # With interaction term
regress(data=df, formula="y ~ x + I(x**2)")   # Polynomial terms

Result Object

The EasyRegressionResult provides direct access to all regression statistics

print(f"R-squared: {model.r_squared:.4f}")
print(f"Adj. R-squared: {model.adj_r_squared:.4f}")
print(f"F-statistic: {model.f_stat:.2f}")
print(f"Coefficients: {model.coefficients}")
print(f"P-values: {model.pvalues}")

Result Reference

Attribute / MethodTypeDescription
.coefficientsnp.ndarrayRegression coefficients (including intercept)
.pvaluesnp.ndarrayP-values for each coefficient
.r_squaredfloatR-squared (coefficient of determination)
.adj_r_squaredfloatAdjusted R-squared
.f_statfloatF-statistic
.summary()strFormatted regression table
.diagnose()strFull diagnostic report
.predict(X)DataFramePredictions with confidence intervals

Diagnostics

The diagnose() method runs four standard regression diagnostic tests

print(model.diagnose())

Expected output:

=== Regression Diagnostics ===

1. Multicollinearity (VIF):
   ads:    2.31 (OK)
   price:  2.18 (OK)
   promo:  1.45 (OK)

2. Heteroscedasticity (Breusch-Pagan):
   LM stat: 3.42, p-value: 0.331
   Result: No heteroscedasticity detected

3. Normality (Jarque-Bera):
   JB stat: 1.87, p-value: 0.393
   Result: Residuals appear normally distributed

4. Autocorrelation (Durbin-Watson):
   DW stat: 2.05
   Result: No significant autocorrelation

Diagnostic Tests

TestWhat It ChecksWarning Threshold
VIFMulticollinearity between predictorsVIF > 10 is problematic
Breusch-PaganNon-constant variance (heteroscedasticity)p-value below 0.05
Jarque-BeraNormality of residualsp-value below 0.05
Durbin-WatsonAutocorrelation in residualsFar from 2.0

Prediction with Intervals

Generate predictions with confidence intervals for new data

import pandas as pd

new_data = pd.DataFrame({
    "ads": [50, 75, 100],
    "price": [30, 25, 20],
    "promo": [1, 1, 0],
})

predictions = model.predict(new_data)
print(predictions)

Expected output:

   prediction   lower95   upper95
0      425.3     380.1     470.5
1      530.7     478.2     583.2
2      610.2     550.8     669.6

Regression Methods

Vectrix supports five regression methods. Switch by setting the method parameter

from vectrix import regress

ols_model    = regress(data=df, formula="sales ~ ads + price", method="ols")
ridge_model  = regress(data=df, formula="sales ~ ads + price", method="ridge")
lasso_model  = regress(data=df, formula="sales ~ ads + price", method="lasso")
huber_model  = regress(data=df, formula="sales ~ ads + price", method="huber")
quant_model  = regress(data=df, formula="sales ~ ads + price", method="quantile")

Method Comparison

MethodUse CaseHow It Works
olsDefault, no issues with dataMinimizes sum of squared residuals
ridgeMulticollinearity (correlated predictors)L2 regularization, shrinks coefficients
lassoFeature selection, sparse modelsL1 regularization, can zero out coefficients
huberOutliers in the dataRobust loss function, down-weights outliers
quantileMedian regression, skewed distributionsMinimizes absolute deviations

When to Use Each Method

OLS (default) — Use when your data is well-behaved: no extreme outliers, no highly correlated predictors, and residuals are roughly normal.

Ridge — Use when you have correlated predictors (e.g., temperature and humidity, GDP and employment). Ridge keeps all variables but shrinks their coefficients.

Lasso — Use when you suspect some predictors are irrelevant. Lasso can drive coefficients to exactly zero, effectively performing variable selection.

Huber — Use when your data contains outliers. Huber uses a hybrid loss function that behaves like OLS for small errors but is robust to large errors.

Quantile — Use when you want to predict the median rather than the mean, or when your data is heavily skewed.

Suppress Auto-Print

By default, regress() prints the summary automatically. To suppress this

model = regress(data=df, formula="sales ~ ads + price", summary=False)

Complete Example

import pandas as pd
import numpy as np
from vectrix import regress

np.random.seed(42)
n = 200
df = pd.DataFrame({
    "revenue": np.random.randn(n) * 50 + 500,
    "marketing": np.random.randn(n) * 10 + 50,
    "price": np.random.randn(n) * 5 + 30,
    "season": np.random.choice([0, 1], n),
})
df["revenue"] = 100 + 8 * df["marketing"] - 5 * df["price"] + 20 * df["season"] + np.random.randn(n) * 10

model = regress(data=df, formula="revenue ~ marketing + price + season")
print(model.diagnose())

future = pd.DataFrame({
    "marketing": [60, 70, 80],
    "price": [28, 25, 22],
    "season": [1, 0, 1],
})
print()
print("Predictions:")
print(model.predict(future))