Commodity Price Prediction
- Chaitanya Singh
- Jul 25
- 2 min read

Data Loading and Cleanup
We began with a seven‑column CSV tracking daily tomato prices by market, unit, and minimum/maximum values. Since our goal was simply to predict the average price from the day's index, we dropped the other columns (Unit, Date, Market, Minimum, and Maximum), leaving just "Number" (a sequential day count) and "Average" (the daily average price). After removing rows with missing entries, the dataset contained 2,741 clean records.
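The cleanup can be sketched in pandas. The column names and the tiny in‑memory sample below are assumptions standing in for the real seven‑column CSV:

```python
import io
import pandas as pd

# A small in-memory sample mimicking the described seven-column CSV
# (column names here are assumptions, not the actual file's headers).
raw = io.StringIO(
    "Number,Date,Market,Unit,Minimum,Maximum,Average\n"
    "1,2021-01-01,Kalimati,kg,20,30,25\n"
    "2,2021-01-02,Kalimati,kg,22,32,\n"
    "3,2021-01-03,Kalimati,kg,21,29,24\n"
)
df = pd.read_csv(raw)

# Keep only the sequential day index and the target price,
# then drop rows with missing values.
df = df[["Number", "Average"]].dropna()
```

In the real pipeline, `pd.read_csv` would point at the tomato‑price file instead of the `StringIO` sample.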
Train‑Test Split
To evaluate generalization, we randomly split the data into an 80% training set and a 20% test set. This ensured the model learned from the bulk of historical observations while still facing unseen days during evaluation. (One caveat: a random split interleaves training and test days in time, so a chronological split would be a stricter test for forecasting.)
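With scikit‑learn, the split is one call. The arrays below are synthetic stand‑ins for the cleaned "Number" and "Average" columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical day indices and prices standing in for the cleaned data.
X = np.arange(100).reshape(-1, 1)                      # "Number" column
y = np.random.default_rng(0).normal(25, 5, size=100)   # "Average" prices

# 80/20 random split; random_state pins the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```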
Random Forest Regression
We chose a Random Forest regressor because it handles nonlinear relationships gracefully and requires little feature engineering. Feeding the day index (“Number”) and the corresponding average price into the forest, we trained an ensemble of decision trees to learn how price typically evolves over time.
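Training the forest on the day index looks roughly like this (synthetic trend data in place of the real series; the hyperparameters shown are illustrative defaults, not necessarily the ones the authors used):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: a noisy upward trend over the day index.
rng = np.random.default_rng(0)
X = np.arange(200).reshape(-1, 1)            # "Number"
y = 20 + 0.05 * X.ravel() + rng.normal(0, 1, 200)  # "Average"

# An ensemble of 100 decision trees, fit on the single day-index feature.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict the average price for day 50.
pred = model.predict(np.array([[50]]))
```

Because each tree partitions the day index into intervals, the forest's prediction is effectively a smoothed step function over time.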
Model Performance
On the held‑out test set, the Random Forest achieved an R² score of approximately 0.93. This indicates that 93% of the variance in daily average tomato prices is explained by our simple model. Given the inherently noisy nature of agricultural commodity prices—subject to weather swings, supply disruptions, and market sentiment—this level of fit suggests the day index itself captures strong seasonal or trend effects in tomato pricing.
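The evaluation step can be sketched end‑to‑end (again with synthetic data; the 0.93 figure above comes from the authors' actual dataset, not this sketch):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic seasonal-looking series in place of the real prices.
rng = np.random.default_rng(0)
X = np.arange(300).reshape(-1, 1)
y = 20 + 10 * np.sin(X.ravel() / 30) + rng.normal(0, 2, 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# R² on held-out data: fraction of variance explained by the model.
score = r2_score(y_test, model.predict(X_test))
```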
Visualizing the Fit
Plotting the training points (day index vs. average price) alongside the learned predictions reveals how the model tracks both upward and downward phases in the price series. Although some scatter remains—especially around abrupt price jumps—the forest’s piecewise‑constant predictions closely follow the data’s major waves.
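A minimal version of that plot, using matplotlib and the same kind of synthetic series as above:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic price series standing in for the real data.
rng = np.random.default_rng(0)
X = np.arange(300).reshape(-1, 1)
y = 20 + 10 * np.sin(X.ravel() / 30) + rng.normal(0, 2, 300)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Scatter the observations, overlay the forest's fitted curve.
fig, ax = plt.subplots()
ax.scatter(X, y, s=8, alpha=0.4, label="observed average price")
ax.plot(X, model.predict(X), color="red", label="forest prediction")
ax.set_xlabel("Day index (Number)")
ax.set_ylabel("Average price")
ax.legend()

buf = io.BytesIO()
fig.savefig(buf, format="png")
```

Zooming in on the prediction curve makes the piecewise‑constant structure of the trees visible between observed points.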
Next Steps
To push predictive accuracy even higher, we might:
- Incorporate calendar features: add month, week‑of‑year, or holiday flags to capture seasonal harvest cycles and demand spikes.
- Add exogenous variables: include weather data, fuel prices, or supply indices to explain sudden price shocks.
- Apply time‑series techniques: augment or replace our regression with ARIMA, Prophet, or LSTM models that explicitly model temporal autocorrelation.
- Tune hyperparameters: systematically search tree depth, number of trees, and minimum samples per leaf to balance bias and variance.
By enriching the feature set and exploring models designed for sequential data, we can build a more robust predictor of tomato price dynamics.
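Of the steps above, hyperparameter tuning is the most mechanical to sketch. A grid search over the parameters mentioned might look like this (the grid values are illustrative assumptions, and synthetic data again stands in for the real series):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned day-index/price data.
rng = np.random.default_rng(0)
X = np.arange(200).reshape(-1, 1)
y = 20 + 0.05 * X.ravel() + rng.normal(0, 1, 200)

# Search tree count, depth, and leaf size to balance bias and variance.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,  # 3-fold cross-validation on the training data
)
search.fit(X, y)
best = search.best_params_
```

For time series, swapping `cv=3` for scikit‑learn's `TimeSeriesSplit` would keep each validation fold strictly later than its training folds.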