AI Can Now Predict India's Air Quality With 97% Accuracy

Every winter, a brown curtain of smog descends over India's major cities, shutting schools, grounding flights, and pushing millions into respiratory distress. The government issues warnings, citizens stock up on masks, and then — after a few weeks — everyone forgets until it happens again. But what if an algorithm could see it coming days in advance, accurately enough to actually do something about it? A new study published in Scientific Reports suggests we're closer to that reality than most people realise. Researchers have built a machine learning model that predicts India's Air Quality Index with up to 97.68% accuracy — and the approach it uses is, genuinely, inspired by wolves.

The Air Crisis Numbers That Should Alarm You

India ranked 8th among 131 countries in global air pollution in 2022, with an average AQI of 144, far above the 0–50 range rated "Good". That's not just a statistic. It translates to more than 2 million Indians facing serious health consequences annually, from lung cancer and bronchitis to premature births and impaired cognitive development in children.

The sources aren't mysterious. Industry accounts for roughly 50% of India's air pollution load. Vehicles add another 27%. Crop burning, the stubble smoke that chokes Delhi every October and November, contributes 17%. The remaining fraction, about 6% on these figures, comes from domestic cooking fires. What's less understood is how these sources interact with weather, geography, and time of year to produce the precise pollution spikes that hospitalise people.

That's the gap this research targets. Not cleaning the air — but reading it, far enough in advance that governments and citizens can actually respond.

WHAT IS THE AIR QUALITY INDEX (AQI)?

The AQI is a single number that summarises air pollution from multiple pollutants (PM2.5, PM10, NO₂, SO₂, CO, and ozone) into one easy-to-read scale. In India, it runs from 0–50 (Good) to 401–500 (Severe). Think of it like a fever thermometer for the atmosphere: one number that tells you whether it's safe to go outside.
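As a rough illustration of how the scale turns a number into a label, the sketch below maps an AQI value to its band. Only the 0–50 and 401–500 bands are stated in this article; the intermediate cut-offs (100, 200, 300, 400) are the commonly published national bands and are included here as an assumption. The function name `aqi_category` is hypothetical.

```python
from bisect import bisect_left

# Upper bound of each band. Only 50 (Good) and 500 (Severe) come from the
# article; the intermediate cut-offs are an assumption based on the commonly
# published national scale.
UPPER_BOUNDS = [50, 100, 200, 300, 400, 500]
CATEGORIES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_category(aqi: float) -> str:
    """Return the band name for an AQI value between 0 and 500."""
    if not 0 <= aqi <= 500:
        raise ValueError(f"AQI must be between 0 and 500, got {aqi}")
    # bisect_left finds the first band whose upper bound is >= aqi.
    return CATEGORIES[bisect_left(UPPER_BOUNDS, aqi)]


if __name__ == "__main__":
    for value in (42, 180, 250, 450):
        print(value, aqi_category(value))
```

A lookup table keeps the cut-offs in one place, so a different national scale could be swapped in by editing two lists.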

Why Old Forecasting Models Keep Getting It Wrong

For decades, scientists tried to predict air quality using statistical methods: tools like the autoregressive integrated moving average (ARIMA) model, a workhorse of economic and financial time-series forecasting. The fundamental problem? Air pollution isn't a stock price. ARIMA assumes that future values depend on past values in a largely linear way, while air pollution is a nonlinear, chaotic system that changes by the hour, responds to wind shifts, and gets tangled up with dozens of interacting variables simultaneously.

Traditional machine learning methods like support vector regression and random forests did better, but they hit their own ceiling. Feed them a large, messy real-world dataset — say, five years of hourly pollution readings across 26 Indian cities — and they start to buckle. They struggle to figure out which variables actually matter. Is it the PM2.5 reading from three hours ago? The nitrogen oxide level? The season? Some combination of all three, weighted in a way no human would ever guess?

This is where feature selection becomes critical. In a dataset full of dozens of variables, not every variable is equally useful. Feeding irrelevant data into a model doesn't just waste computing power — it actively degrades accuracy. The researchers needed a smarter way to sift signal from noise.

8th: India's global air pollution rank (2022)
50%: share of India's air pollution that comes from industry
2M+: Indians facing serious health impacts annually

How Does a Wolf Pack Help Predict Air Pollution?

Grey Wolf Optimization (GWO) is one of those ideas that sounds implausible until you understand it. Grey wolves hunt in rigidly hierarchical packs: alphas lead, betas advise them, deltas scout and stand guard, and omegas follow. The pack doesn't scatter randomly; it systematically encircles prey, coordinates its approach, and closes in efficiently. In the 2014 paper that introduced GWO as a computational method, researchers translated this hunting behaviour into a mathematical search algorithm: the three best candidate solutions found so far play the alpha, beta, and delta, and every other candidate moves toward them. Instead of prey, the algorithm here hunts for the optimal set of features in a dataset.

In practice, here's what that means. The dataset used in this study — five years of air quality readings from the Kaggle repository — contains variables like PM2.5, PM10, NO, NO₂, NOx, NH₃, CO, SO₂, ozone, benzene, and toluene. Not all of them matter equally for predicting the AQI in, say, Hyderabad in July. GWO's job is to hunt down the combination that matters most, discarding noise, before passing its selection to the actual prediction engine.
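To make the wolf-pack idea concrete, here is a minimal sketch of GWO used as a feature selector. It is not the study's implementation. Each wolf is a vector of numbers between 0 and 1, one per feature, and a value above 0.5 means "keep this column". The position update follows the standard GWO formulation, while the thresholding step, the function name `gwo_select_features`, and the toy fitness function are assumptions made so the block runs with the standard library alone. In a real pipeline the fitness would presumably be the validation error of a decision tree trained on the selected columns.

```python
import random
from typing import Callable, List, Sequence, Tuple


def gwo_select_features(
    fitness: Callable[[Sequence[bool]], float],
    n_features: int,
    n_wolves: int = 10,  # must be at least 3 (alpha, beta, delta)
    n_iters: int = 40,
    seed: int = 0,
) -> Tuple[List[bool], float]:
    """Search for a feature mask that minimises `fitness` (lower is better)."""
    rng = random.Random(seed)
    # Each wolf is a point in [0, 1]^n_features; a coordinate above 0.5
    # means "keep this feature".
    wolves = [[rng.random() for _ in range(n_features)] for _ in range(n_wolves)]

    def to_mask(position: Sequence[float]) -> List[bool]:
        return [x > 0.5 for x in position]

    best_mask = to_mask(wolves[0])
    best_score = fitness(best_mask)

    for it in range(n_iters):
        # Rank the pack; the three fittest wolves act as alpha, beta, delta.
        ranked = sorted(wolves, key=lambda w: fitness(to_mask(w)))
        leaders = [list(w) for w in ranked[:3]]  # copies, wolves move below
        alpha_score = fitness(to_mask(leaders[0]))
        if alpha_score < best_score:
            best_mask, best_score = to_mask(leaders[0]), alpha_score

        # `a` shrinks linearly from 2 towards 0, shifting the pack from
        # exploration to exploitation.
        a = 2.0 * (1 - it / n_iters)
        for wolf in wolves:
            for d in range(n_features):
                pulls = []
                for leader in leaders:
                    A = 2 * a * rng.random() - a
                    C = 2 * rng.random()
                    dist = abs(C * leader[d] - wolf[d])
                    pulls.append(leader[d] - A * dist)
                # Move to the average of the three leader-guided positions,
                # clipped back into the unit interval.
                wolf[d] = min(1.0, max(0.0, sum(pulls) / 3))

    # Score the final positions as well.
    for wolf in wolves:
        score = fitness(to_mask(wolf))
        if score < best_score:
            best_mask, best_score = to_mask(wolf), score
    return best_mask, best_score


# Hypothetical toy fitness: pretend columns 0, 1 and 6 are the informative
# ones (an arbitrary choice). Missing one costs 1.0; keeping an
# uninformative column costs 0.1.
INFORMATIVE = {0, 1, 6}


def toy_fitness(mask: Sequence[bool]) -> float:
    missed = sum(1 for i in INFORMATIVE if not mask[i])
    extra = sum(1 for i, keep in enumerate(mask) if keep and i not in INFORMATIVE)
    return missed * 1.0 + extra * 0.1


if __name__ == "__main__":
    # 11 columns, matching the number of pollutants listed in the article.
    mask, score = gwo_select_features(toy_fitness, n_features=11)
    print(mask, score)
```

Because the search is stochastic, the returned mask is the best one found rather than a guaranteed optimum; fixing `seed` only makes a run repeatable.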

That prediction engine is a decision tree regressor, essentially a flowchart that learned from five years of historical data how pollution patterns lead to specific AQI outcomes. Before training, the researchers also corrected for a subtle but important data problem: the AQI categories (Good, Satisfactory, Moderate, Poor, Very Poor, Severe) were wildly unbalanced after data cleaning. Some categories had thousands of observations; others had barely any. An imbalanced dataset teaches a model to favour common cases and get the rare but dangerous ones (like "Severe") badly wrong. The team used SMOTE, the Synthetic Minority Over-sampling Technique, to generate realistic synthetic data points for underrepresented categories, creating a level playing field before training began.
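The core of SMOTE is simple enough to sketch in a few lines: pick a sample from the rare category, pick one of its nearest neighbours in the same category, and create a new point somewhere on the line between them. The block below is a simplified illustration rather than the study's code. The function name `smote`, its parameters, and the sample readings are all hypothetical, and a production pipeline would more likely rely on an established library implementation.

```python
import math
import random
from typing import List, Sequence


def smote(
    minority: List[Sequence[float]], n_new: int, k: int = 3, seed: int = 0
) -> List[List[float]]:
    """Create `n_new` synthetic points by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of `base`, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        # Random point on the line segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Hypothetical [PM2.5, PM10] readings from a handful of "Severe" days.
    severe_days = [[310.0, 420.0], [330.0, 450.0], [350.0, 480.0], [320.0, 440.0]]
    for point in smote(severe_days, n_new=5):
        print([round(v, 1) for v in point])
```

Every synthetic point lies between two real observations from the same category, which is why the technique adds plausible variety instead of simply duplicating rare rows.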

"The proposed optimized regression model attained maximum performance over conventional technique — with accuracy ranging from 88.98% for New Delhi to 97.68% for Visakhapatnam."

— Natarajan et al., Jain University & Leeds Beckett University · Scientific Reports, 2024

What the Results Actually Show — City by City

The results, when they came back, were striking. Across all six cities — New Delhi, Bangalore, Kolkata, Hyderabad, Chennai, and Visakhapatnam — the GWO-DT model consistently outperformed every comparison algorithm tested. Random Forest, K-Nearest Neighbor, and Support Vector Regression all fell short.

Kolkata showed the highest statistical precision, with an R-square value of 0.9874, meaning the model explained more than 98% of the variance in the city's AQI values. Hyderabad and Visakhapatnam both topped 97% accuracy. Even New Delhi, famously one of the most complex and unpredictable pollution environments on Earth, shaped by seasonal burning from neighbouring states, traffic, construction, and abrupt weather changes, hit 88.98%.
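For readers unfamiliar with the metric, R-square compares the model's squared errors with those of a baseline that always predicts the average. The sketch below computes it with the standard formula, 1 minus the ratio of residual to total sum of squares. The AQI values in the example are made up for illustration and are not taken from the study.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(actual) / len(actual)
    # Squared error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up AQI readings and predictions, for illustration only.
    actual = [100.0, 200.0, 300.0]
    predicted = [110.0, 190.0, 300.0]
    print(f"{r_squared(actual, predicted):.4f}")
```

A value of 1 means perfect predictions, and a value of 0 means the model does no better than guessing the average every time.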

Why did Bangalore perform slightly lower than the others on the R-square metric? The researchers note it as an open question — Bangalore's weather patterns and pollution mix are distinct, and the model may need more granular local data to refine its predictions further. That honesty matters. A model that acknowledges its limits is one you can actually deploy responsibly.

97.68%: peak accuracy (Visakhapatnam)
94.25%: average accuracy across all six cities
+4%: accuracy gain over the KNN model

THE SMOG FORECASTING OPPORTUNITY FOR INDIA

WHO air quality guidelines are routinely exceeded in 21 of India's most densely populated cities. A 97%-accurate AQI prediction system, integrated into city-level early warning infrastructure, could give public health officials 24–48 hours to issue school closures, outdoor work restrictions, and hospital surge preparations before the worst days hit.

The Questions That Still Need Answering

To be clear about what this study is and isn't: it's a proof-of-concept using historical data, not a live forecasting system. The model was trained and tested on data from 2015 to 2020 — a period that doesn't capture post-pandemic shifts in industrial activity, the rise in electric vehicle adoption, or recent changes in India's stubble-burning regulations. Real-world deployment would require continuous retraining as environmental and economic conditions evolve.

There's also the question of spatial resolution. The dataset draws from monitoring stations across each city, but air quality in a megacity like Delhi can vary enormously between, say, Okhla Industrial Area and Lodhi Garden — sometimes just a few kilometres apart. A model trained on city-level averages may not be fine-grained enough for neighbourhood-level alerts. The researchers themselves flag deep learning extensions as their next horizon, noting that architectures like LSTM networks could capture temporal patterns that decision trees still miss.

But these are the right questions to be asking — not objections that undermine the work, but directions that build on it. The fundamentals here are sound, the accuracy numbers are genuinely impressive, and the framework is reproducible. The dataset is publicly available on Kaggle. Any city administration, researcher, or technology firm in India could, right now, take this approach and adapt it.

GWO improves everything it touches — Adding Grey Wolf Optimization to a standard decision tree raised accuracy by up to 20 percentage points in cities like Bangalore, compared to training on imbalanced data alone.
Data quality is half the battle — SMOTE-based balancing was as important as the choice of algorithm; a model trained on imbalanced data consistently underperformed across all cities.
India needs city-specific models — No single universal model captured all six cities equally well, suggesting that local environmental factors demand locally tuned systems rather than one-size-fits-all national forecasting.

"In future, the proposed model can be extended using deep learning models for attaining better prediction performances in air quality monitoring." — Natarajan et al., Scientific Reports, 2024.

📄 Source & Citation

Primary Source: Natarajan SK, Shanmurthy P, Arockiam D, Balusamy B & Selvarajan S. (2024). Optimized machine learning model for air quality index prediction in major cities in India. Scientific Reports, 14, 6795. https://doi.org/10.1038/s41598-024-54807-1

Authors & Affiliations: Suresh Kumar Natarajan (Jain University, Bengaluru), Prakash Shanmurthy (Presidency University, Bengaluru), Daniel Arockiam (Amity University, Gwalior), Balamurugan Balusamy (Shiv Nadar Institution of Eminence, Delhi), Shitharth Selvarajan (Leeds Beckett University, UK)

Data & Code: Dataset available at Kaggle — Air Quality Data in India (2015–2020)

Key Themes: Air Quality Index · Machine Learning · Grey Wolf Optimization · Decision Tree Regression · India Air Pollution

Supporting References:

[1] World Health Organization. Ambient air quality and health fact sheet. WHO, 2023.

[2] Ameer S et al. (2019). Comparative analysis of machine learning techniques for predicting air quality in smart cities. IEEE Access, 7, 128325–128338.

[3] Mirjalili S, Mirjalili SM & Lewis A. (2014). Grey wolf optimizer. Advances in Engineering Software, 69, 46–61. doi:10.1016/j.advengsoft.2013.12.007

[4] IQAir World Air Quality Report 2022. IQAir.com

[5] Chawla NV et al. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
