Tutorial 03 — Regression
Run a full regression analysis in a single function call. Vectrix supports R-style formula syntax (y ~ x1 + x2), automatic diagnostics (multicollinearity, heteroscedasticity, normality, autocorrelation), five estimation methods (OLS, Ridge, Lasso, Huber, Quantile), and prediction intervals, all without leaving the Python ecosystem.
Direct Input
The simplest form: pass y and X directly as NumPy arrays. Vectrix runs OLS by default and adds the constant (intercept) term automatically:
import numpy as np
from vectrix import regress
np.random.seed(42)
X = np.random.randn(100, 2)
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + np.random.randn(100) * 0.5
model = regress(y=y, X=X)
Expected output:
=== Regression Summary ===
Method: OLS
Observations: 100
R-squared: 0.954
Adj. R-squared: 0.953
F-statistic: 1012.35 (p < 0.001)
               Coef   Std.Err   t-value    P>|t|
Intercept     3.012     0.050     60.24    0.000 ***
x1            1.987     0.052     38.21    0.000 ***
x2           -1.493     0.048    -31.10    0.000 ***
Formula Mode
With a DataFrame, use R-style formulas for a more natural interface:
import pandas as pd
from vectrix import regress
df = pd.DataFrame({
    "sales": [100, 150, 200, 180, 250, 300, 280, 350, 400, 380],
    "ads": [10, 15, 20, 18, 25, 30, 28, 35, 40, 38],
    "price": [50, 48, 45, 47, 42, 40, 41, 38, 35, 36],
    "promo": [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
})
model = regress(data=df, formula="sales ~ ads + price + promo")
Formula Syntax
regress(data=df, formula="y ~ x1 + x2") # Specific variables
regress(data=df, formula="y ~ .") # All other columns
regress(data=df, formula="y ~ x1 * x2") # With interaction term
regress(data=df, formula="y ~ x + I(x**2)") # Polynomial terms
Result Object
The EasyRegressionResult object provides direct access to all regression statistics:
print(f"R-squared: {model.r_squared:.4f}")
print(f"Adj. R-squared: {model.adj_r_squared:.4f}")
print(f"F-statistic: {model.f_stat:.2f}")
print(f"Coefficients: {model.coefficients}")
print(f"P-values: {model.pvalues}")
Result Reference
| Attribute / Method | Type | Description |
|---|---|---|
| .coefficients | np.ndarray | Regression coefficients (including intercept) |
| .pvalues | np.ndarray | P-values for each coefficient |
| .r_squared | float | R-squared (coefficient of determination) |
| .adj_r_squared | float | Adjusted R-squared |
| .f_stat | float | F-statistic |
| .summary() | str | Formatted regression table |
| .diagnose() | str | Full diagnostic report |
| .predict(X) | DataFrame | Predictions with confidence intervals |
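To see where these statistics come from, the sketch below recomputes them with plain NumPy on the simulated data from the Direct Input section. It illustrates the textbook OLS formulas and is not Vectrix's internal code; the variable names are chosen here for readability and are not part of the Vectrix API.

```python
import numpy as np

# Same simulated data as in the Direct Input example.
np.random.seed(42)
X = np.random.randn(100, 2)
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 1] + np.random.randn(100) * 0.5

n, k = X.shape                               # observations, predictors
A = np.column_stack([np.ones(n), X])         # prepend the constant (intercept) column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients

residuals = y - A @ beta
ss_res = float(residuals @ residuals)          # residual sum of squares
ss_tot = float(((y - y.mean()) ** 2).sum())    # total sum of squares

r_squared = 1.0 - ss_res / ss_tot
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)
f_stat = (r_squared / k) / ((1.0 - r_squared) / (n - k - 1))

# Coefficient standard errors and t-values from the estimated covariance matrix.
sigma2 = ss_res / (n - k - 1)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
t_values = beta / std_err

print("coefficients:", np.round(beta, 3))
print("std errors:  ", np.round(std_err, 3))
print("t-values:    ", np.round(t_values, 2))
print(f"R-squared: {r_squared:.4f}  Adj. R-squared: {adj_r_squared:.4f}  F: {f_stat:.2f}")
```

The adjusted R-squared and the F-statistic both use n - k - 1 residual degrees of freedom because the intercept counts as an estimated parameter alongside the k predictors.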
Diagnostics
The diagnose() method runs four standard regression diagnostic tests:
print(model.diagnose())
Expected output:
=== Regression Diagnostics ===
1. Multicollinearity (VIF):
ads: 2.31 (OK)
price: 2.18 (OK)
promo: 1.45 (OK)
2. Heteroscedasticity (Breusch-Pagan):
LM stat: 3.42, p-value: 0.331
Result: No heteroscedasticity detected
3. Normality (Jarque-Bera):
JB stat: 1.87, p-value: 0.393
Result: Residuals appear normally distributed
4. Autocorrelation (Durbin-Watson):
DW stat: 2.05
Result: No significant autocorrelation
Diagnostic Tests
| Test | What It Checks | Warning Threshold |
|---|---|---|
| VIF | Multicollinearity between predictors | VIF > 10 is problematic |
| Breusch-Pagan | Non-constant variance (heteroscedasticity) | p-value below 0.05 |
| Jarque-Bera | Normality of residuals | p-value below 0.05 |
| Durbin-Watson | Autocorrelation in residuals | Far from 2.0 |
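The four test statistics in the table can be sketched with NumPy alone. The functions below are a minimal illustration under stated assumptions, not Vectrix's implementation: the helper names are hypothetical, the Breusch-Pagan statistic uses the studentized (Koenker) form n·R², and the Jarque-Bera p-value uses the closed-form chi-square tail for 2 degrees of freedom.

```python
import numpy as np

def add_const(X):
    """Prepend a column of ones for the intercept."""
    return np.column_stack([np.ones(len(X)), X])

def r2(A, y):
    """R-squared of the least-squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns.
    Perfectly collinear columns give R_j^2 = 1 and an infinite VIF."""
    return np.array([
        1.0 / (1.0 - r2(add_const(np.delete(X, j, axis=1)), X[:, j]))
        for j in range(X.shape[1])
    ])

def breusch_pagan_lm(X, resid):
    """LM = n * R^2 of the auxiliary regression of squared residuals on X."""
    return len(resid) * r2(add_const(X), resid ** 2)

def jarque_bera(resid):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4); p-value is the chi-square(2) tail exp(-JB/2)."""
    e = resid - resid.mean()
    m2 = np.mean(e ** 2)
    skew = np.mean(e ** 3) / m2 ** 1.5
    kurt = np.mean(e ** 4) / m2 ** 2
    jb = len(e) / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, float(np.exp(-jb / 2.0))

def durbin_watson(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Demo on simulated, well-behaved data (independent predictors, i.i.d. noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
A = add_const(X)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

print("VIF:", np.round(vif(X), 2))
print("Breusch-Pagan LM:", round(float(breusch_pagan_lm(X, resid)), 2))
print("Jarque-Bera (stat, p):", jarque_bera(resid))
print("Durbin-Watson:", round(float(durbin_watson(resid)), 2))
```

A Durbin-Watson value near 2 indicates no first-order autocorrelation; values toward 0 suggest positive autocorrelation and values toward 4 suggest negative autocorrelation.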
Prediction with Intervals
Generate predictions with 95% confidence intervals for new data:
import pandas as pd
new_data = pd.DataFrame({
    "ads": [50, 75, 100],
    "price": [30, 25, 20],
    "promo": [1, 1, 0],
})
predictions = model.predict(new_data)
print(predictions)
Expected output:
   prediction  lower95  upper95
0       425.3    380.1    470.5
1       530.7    478.2    583.2
2       610.2    550.8    669.6
Regression Methods
Vectrix supports five regression methods. Switch between them by setting the method parameter:
from vectrix import regress
ols_model = regress(data=df, formula="sales ~ ads + price", method="ols")
ridge_model = regress(data=df, formula="sales ~ ads + price", method="ridge")
lasso_model = regress(data=df, formula="sales ~ ads + price", method="lasso")
huber_model = regress(data=df, formula="sales ~ ads + price", method="huber")
quant_model = regress(data=df, formula="sales ~ ads + price", method="quantile")
Method Comparison
| Method | Use Case | How It Works |
|---|---|---|
| ols | Default, no issues with data | Minimizes sum of squared residuals |
| ridge | Multicollinearity (correlated predictors) | L2 regularization, shrinks coefficients |
| lasso | Feature selection, sparse models | L1 regularization, can zero out coefficients |
| huber | Outliers in the data | Robust loss function, down-weights outliers |
| quantile | Median regression, skewed distributions | Minimizes absolute deviations |
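The "How It Works" column can be made concrete by writing out the loss each method minimizes. The functions below are a textbook sketch; the parameter names delta, q, and alpha are illustrative assumptions and are not taken from the Vectrix API.

```python
import numpy as np

def squared_loss(r):
    """OLS: every residual is penalized by its square."""
    return r ** 2

def huber_loss(r, delta=1.0):
    """Huber: quadratic for |r| <= delta, linear beyond it, so large residuals count less."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r ** 2, delta * (a - 0.5 * delta))

def pinball_loss(r, q=0.5):
    """Quantile: asymmetric absolute loss; q = 0.5 gives half the absolute deviation (median)."""
    return np.where(r >= 0, q * r, (q - 1.0) * r)

def l2_penalty(beta, alpha=1.0):
    """Ridge adds alpha * sum(beta^2) to the squared loss."""
    return alpha * np.sum(beta ** 2)

def l1_penalty(beta, alpha=1.0):
    """Lasso adds alpha * sum(|beta|) to the squared loss."""
    return alpha * np.sum(np.abs(beta))

# Compare how each loss treats small and large residuals.
r = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("squared:", squared_loss(r))
print("huber:  ", huber_loss(r, delta=1.0))
print("pinball:", pinball_loss(r, q=0.5))
```

The squared loss grows quadratically with the residual while the Huber and pinball losses grow only linearly for large residuals, which is why those two methods are less sensitive to outliers.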
When to Use Each Method
OLS (default) — Use when your data is well-behaved: no extreme outliers, no highly correlated predictors, and residuals are roughly normal.
Ridge — Use when you have correlated predictors (e.g., temperature and humidity, GDP and employment). Ridge keeps all variables but shrinks their coefficients.
Lasso — Use when you suspect some predictors are irrelevant. Lasso can drive coefficients to exactly zero, effectively performing variable selection.
Huber — Use when your data contains outliers. Huber uses a hybrid loss function that behaves like OLS for small errors but is robust to large errors.
Quantile — Use when you want to predict the median rather than the mean, or when your data is heavily skewed.
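To illustrate the Huber case, the sketch below fits the same contaminated data with OLS and with a Huber estimator computed by iteratively reweighted least squares (IRLS). This is one common way to compute a Huber fit and is an assumption here; the source does not say which solver Vectrix uses. The function names, the tuning constant c = 1.345, and the MAD-based scale estimate are likewise illustrative choices.

```python
import numpy as np

def fit_ols(x, y):
    """Ordinary least squares with an intercept; returns [intercept, slope]."""
    A = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def fit_huber_irls(x, y, c=1.345, max_iter=100, tol=1e-8):
    """Huber M-estimate via IRLS, starting from the OLS solution."""
    A = np.column_stack([np.ones(len(y)), x])
    beta = fit_ols(x, y)
    for _ in range(max_iter):
        resid = y - A @ beta
        # Robust scale: median absolute deviation, rescaled for normal errors.
        scale = max(np.median(np.abs(resid - np.median(resid))) / 0.6745, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 for |u| <= c, c/|u| for larger standardized residuals.
        weights = c / np.maximum(u, c)
        sw = np.sqrt(weights)
        new_beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# True line: y = 1 + 2x. The first 10 observations are shifted up by 20 (outliers).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=100)
y[:10] += 20.0

beta_ols = fit_ols(x, y)
beta_huber = fit_huber_irls(x, y)
print("OLS   [intercept, slope]:", np.round(beta_ols, 3))
print("Huber [intercept, slope]:", np.round(beta_huber, 3))
```

Because the outliers receive weights well below 1, the Huber intercept stays close to the true value of 1, while the OLS intercept is pulled toward the contaminated points.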
Suppress Auto-Print
By default, regress() prints the summary automatically. To suppress this:
model = regress(data=df, formula="sales ~ ads + price", summary=False)
Complete Example
import pandas as pd
import numpy as np
from vectrix import regress
np.random.seed(42)
n = 200
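# Simulated predictors; the revenue column is a placeholder that is overwritten below
df = pd.DataFrame({
    "revenue": np.random.randn(n) * 50 + 500,
    "marketing": np.random.randn(n) * 10 + 50,
    "price": np.random.randn(n) * 5 + 30,
    "season": np.random.choice([0, 1], n),
})
# True relationship: revenue rises with marketing and season, falls with price
df["revenue"] = 100 + 8 * df["marketing"] - 5 * df["price"] + 20 * df["season"] + np.random.randn(n) * 10
# Fit the model (prints the summary by default), then run the diagnostics
model = regress(data=df, formula="revenue ~ marketing + price + season")
print(model.diagnose())
# New scenarios to predict
future = pd.DataFrame({
    "marketing": [60, 70, 80],
    "price": [28, 25, 22],
    "season": [1, 0, 1],
})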
print()
print("Predictions:")
print(model.predict(future))
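The interval columns returned by predict() can be understood through the standard OLS formulas. The sketch below computes both a 95% confidence interval for the mean response and a wider 95% prediction interval for a new observation. The source does not state which of the two Vectrix reports in lower95 and upper95, so both are shown. The data is simulated here, and z = 1.96 is a large-sample stand-in for the exact t quantile.

```python
import numpy as np

# Simulated training data (illustrative; not the tutorial's revenue data).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# OLS fit with an intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
dof = n - A.shape[1]                      # residual degrees of freedom
sigma2 = float(resid @ resid) / dof       # estimated error variance
xtx_inv = np.linalg.inv(A.T @ A)

# New points to predict.
X_new = np.array([[0.0, 0.0], [1.0, -1.0]])
A_new = np.column_stack([np.ones(len(X_new)), X_new])
pred = A_new @ beta

# h_i = x_i' (A'A)^-1 x_i for each new row.
h = np.einsum("ij,jk,ik->i", A_new, xtx_inv, A_new)

z = 1.96  # approximation to the t quantile with `dof` degrees of freedom
se_mean = np.sqrt(sigma2 * h)             # uncertainty of the fitted mean
se_pred = np.sqrt(sigma2 * (1.0 + h))     # adds the noise of a single new observation

conf_int = np.column_stack([pred - z * se_mean, pred + z * se_mean])
pred_int = np.column_stack([pred - z * se_pred, pred + z * se_pred])

print("prediction:         ", np.round(pred, 3))
print("confidence interval:", np.round(conf_int, 3))
print("prediction interval:", np.round(pred_int, 3))
```

The prediction interval is always wider than the confidence interval because it accounts for the error variance of the new observation in addition to the uncertainty in the estimated coefficients.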