Research Paper

Reinforcement Learning for Dynamic Portfolio Optimization in Indian Equity Markets

Alpha AI Research Publication — Training RL agents for adaptive portfolio allocation using Deep Q-Networks and Proximal Policy Optimization on NSE data.

By Alpha AI Research Team | March 5, 2026 | 21 min read


Authors: Kishlay Kumar, Alpha AI Quant Research | Published: March 2026 | Category: Reinforcement Learning & Finance

Abstract

This paper investigates the application of deep reinforcement learning (DRL) for dynamic portfolio optimization in Indian equity markets. We formulate the portfolio allocation problem as a Markov Decision Process and train agents using Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC) algorithms on a universe of the 50 most liquid NSE stocks over a 10-year period. Our results show that PPO-based agents achieve a Sharpe ratio of 1.68 in out-of-sample testing, significantly outperforming traditional mean-variance optimization (0.84) and an equal-weighted benchmark (0.72).

1. Introduction

Portfolio optimization — the problem of allocating capital across multiple assets to maximize risk-adjusted returns — has been a central challenge in quantitative finance since Markowitz's seminal work in 1952. Traditional approaches, including mean-variance optimization, Black-Litterman, and risk parity, rely on estimated parameters (expected returns, covariance matrices) that are notoriously unstable and often lead to concentrated, fragile portfolios. Indian markets, with their higher volatility, structural breaks, and evolving correlation structures, amplify these challenges.

Reinforcement learning offers a fundamentally different approach: rather than estimating parameters and optimizing a static objective, RL agents learn optimal decision policies through interaction with the market environment. The agent observes the current portfolio state and market conditions, takes allocation actions, receives reward signals (risk-adjusted returns), and iteratively improves its policy. This framework naturally handles transaction costs, portfolio constraints, and changing market dynamics without requiring explicit parameter estimation.

2. Problem Formulation

We model the portfolio optimization problem as a continuous-state, continuous-action MDP. The state space includes: current portfolio weights (50 dimensions), asset returns over multiple lookback windows (5, 10, 20, 60 days), technical indicators for each asset (RSI, MACD, volatility), cross-sectional features (relative momentum, relative value), and market regime indicators (VIX level, trend strength, correlation regime). The action space represents target portfolio weight adjustments, constrained to sum to 1 with no short-selling (long-only constraint matching typical Indian regulatory and practical requirements).
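The long-only, sum-to-one constraint on actions can be enforced by mapping the agent's unconstrained output onto the probability simplex. A minimal sketch, using a softmax mapping as one illustrative choice (the paper does not specify the mapping; simplex projection is an alternative):

```python
import numpy as np

def to_long_only_weights(raw_action: np.ndarray) -> np.ndarray:
    """Map an unconstrained agent output to valid long-only weights.

    Softmax guarantees non-negative weights that sum to 1, matching
    the long-only constraint described above. Illustrative choice;
    a Euclidean projection onto the simplex would also work.
    """
    z = raw_action - raw_action.max()  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

# Example: a 4-asset action vector (50 assets in the paper's setup)
raw = np.array([0.5, -1.2, 2.0, 0.0])
w = to_long_only_weights(raw)
```

Because softmax is smooth, gradients flow through the constraint during policy optimization, which is convenient for PPO- and SAC-style continuous-action agents.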

The reward function is carefully designed to align with practical investment objectives. We use a risk-adjusted return metric that combines portfolio return with a volatility penalty and a maximum drawdown penalty. Transaction costs are modeled realistically: 0.05% per trade for large-caps, 0.10% for mid-caps, plus a market impact function proportional to trade size relative to average daily volume. The episodic structure uses rolling 252-day episodes with 60-day warm-up periods for feature calculation.

3. Agent Architectures

We implement three DRL architectures optimized for the portfolio allocation task. The DQN agent discretizes the action space into 11 target weight levels per asset (0% to 10% in 1% increments) and uses a dueling network architecture with noisy layers for exploration. The PPO agent operates in continuous action space using a Gaussian policy network with 3 hidden layers (256, 128, 64 neurons) and a separate value network with the same architecture. The SAC agent adds an entropy regularization term that encourages diverse portfolio allocations.

All agents incorporate a shared feature extraction backbone: a temporal convolutional network processes the time-series features for each asset, followed by a cross-asset attention layer that captures inter-stock relationships. This architecture allows the agent to simultaneously process individual stock dynamics and cross-sectional market structure. Batch normalization and gradient clipping are used for training stability, and target networks are updated using Polyak averaging for DQN and SAC.
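The cross-asset attention layer can be sketched as single-head self-attention over per-asset embeddings. A minimal numpy illustration (random weight matrices stand in for learned parameters; the actual model uses trained weights and, per the text, follows a temporal convolutional backbone):

```python
import numpy as np

def cross_asset_attention(H: np.ndarray) -> np.ndarray:
    """Single-head self-attention across assets (illustrative).

    H: (n_assets, d) matrix of per-asset embeddings from the temporal
    convolutional backbone. Returns embeddings of the same shape in
    which each row mixes information from related assets.
    """
    n, d = H.shape
    rng = np.random.default_rng(0)  # random projections, for demonstration only
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(d)               # (n, n) asset-to-asset affinities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # each row is a distribution
    return A @ V

# 50 assets with 16-dimensional embeddings, as a shape check
H = np.random.default_rng(1).standard_normal((50, 16))
out = cross_asset_attention(H)
```

The (n, n) score matrix is what the interpretability analysis in Section 6 inspects when it reports high attention between correlated banking stocks.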

4. Training Methodology

Training follows a walk-forward approach with 5-year training windows and 1-year validation windows, rolled forward annually. Hyperparameter optimization uses Optuna with 200 trials per agent type, optimizing the learning rate, discount factor, network architecture, and reward function parameters on the validation set. Early stopping is triggered if the validation Sharpe ratio does not improve for 100 episodes.
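The walk-forward schedule can be generated mechanically. A minimal sketch at yearly granularity (the helper name and year range are illustrative):

```python
def walk_forward_windows(start_year, end_year, train_years=5, val_years=1):
    """Yield (train_range, val_range) year tuples, rolled forward annually.

    Each window pairs a 5-year training span with the following 1-year
    validation span, matching the methodology above.
    """
    windows = []
    y = start_year
    while y + train_years + val_years - 1 <= end_year:
        train = (y, y + train_years - 1)
        val = (y + train_years, y + train_years + val_years - 1)
        windows.append((train, val))
        y += 1  # roll the whole window forward by one year
    return windows

# Example: a 10-year history produces five train/validation pairs
windows = walk_forward_windows(2014, 2023)
```

Because each validation year is strictly after its training span, no future data leaks into model selection.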

To address the challenge of training RL agents in financial markets — where data is limited and non-stationary — we employ several augmentation techniques: bootstrap sampling of historical episodes, synthetic data generation using a calibrated stochastic volatility model, and adversarial training where episodes include simulated flash crashes and regime changes. These augmentations improve agent robustness by 23% measured by worst-case performance across market regimes.
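Of the augmentation techniques above, bootstrap sampling of historical episodes is the simplest to illustrate. A minimal block-bootstrap sketch (block length and function name are illustrative choices, not specified in the paper):

```python
import numpy as np

def block_bootstrap(returns: np.ndarray, episode_len: int,
                    block_len: int = 20, seed: int = 0) -> np.ndarray:
    """Resample a synthetic training episode by stitching contiguous blocks.

    Contiguous blocks preserve the short-range autocorrelation and
    volatility clustering that plain i.i.d. resampling would destroy.
    """
    rng = np.random.default_rng(seed)
    n = len(returns)
    blocks = []
    total = 0
    while total < episode_len:
        start = rng.integers(0, n - block_len + 1)  # random block start
        blocks.append(returns[start:start + block_len])
        total += block_len
    return np.concatenate(blocks)[:episode_len]  # trim to episode length

# Example: one synthetic 252-day episode from 500 days of history
hist = np.arange(500) / 1000.0
ep = block_bootstrap(hist, 252)
```

Varying the seed yields an effectively unlimited supply of plausible episodes from a fixed history, which is the point of the augmentation.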

5. Experimental Results

Out-of-sample testing covers January 2024 to December 2025, providing 2 years of unseen data spanning multiple market conditions. The PPO agent achieves the best overall performance with a CAGR of 24.8%, Sharpe ratio of 1.68, and maximum drawdown of 12.3%. The SAC agent follows with CAGR 22.1%, Sharpe 1.52, drawdown 14.1%. The DQN agent shows competitive returns (CAGR 20.3%) but higher drawdown (17.8%) due to its discretized action space limiting fine-grained allocation adjustments.

Compared to benchmarks: the Nifty 50 index delivered CAGR 13.2%, Sharpe 0.72, drawdown 18.5%. Traditional mean-variance optimization with monthly rebalancing achieved CAGR 16.4%, Sharpe 0.84, drawdown 22.1%. Risk parity delivered CAGR 14.8%, Sharpe 0.91, drawdown 15.3%. The RL agents' superior risk management is particularly notable: during the October 2024 market correction, when the Nifty fell 8.2%, the PPO agent's portfolio declined only 3.1%, having preemptively reduced equity exposure and rotated into defensive sectors.
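The headline metrics above follow standard definitions and can be computed from a daily return series. A minimal sketch (function name and sample series are illustrative):

```python
import numpy as np

def performance_stats(daily_returns, rf_daily=0.0, periods=252):
    """CAGR, annualized Sharpe ratio, and maximum drawdown
    from a series of daily returns (standard definitions)."""
    r = np.asarray(daily_returns)
    equity = np.cumprod(1 + r)                 # growth of 1 unit of capital
    years = len(r) / periods
    cagr = equity[-1] ** (1 / years) - 1
    excess = r - rf_daily
    sharpe = np.sqrt(periods) * excess.mean() / excess.std()
    peak = np.maximum.accumulate(equity)       # running high-water mark
    max_dd = np.max(1 - equity / peak)         # worst peak-to-trough loss
    return cagr, sharpe, max_dd

# Example: two years of alternating daily returns
r = np.tile([0.01, -0.005], 252)
cagr, sharpe, max_dd = performance_stats(r)
```

Using the same metric code for agents and benchmarks keeps the comparison in this section apples-to-apples.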

6. Interpretability Analysis

Understanding RL agent decisions is critical for practical adoption. We analyze the PPO agent's policy through three lenses: attention weight analysis reveals that the cross-asset attention mechanism learns meaningful relationships (high attention between correlated banking stocks, between IT stocks and USD/INR); SHAP value analysis of the policy network shows that VIX level is the most influential feature for overall equity allocation, while individual stock RSI drives stock-level weight decisions; and behavioral analysis in specific market scenarios confirms economically rational decision-making.

Notably, the agent develops emergent behaviors not explicitly programmed: it automatically reduces portfolio concentration during high-correlation regimes (when diversification benefits diminish), increases cash-equivalent allocation before anticipated volatility events (budget announcements, RBI meetings), and tilts toward momentum in trending markets and value in mean-reverting markets. These behaviors emerge purely from the reward optimization process, demonstrating the RL framework's ability to discover sophisticated portfolio management strategies.

7. Conclusions and Deployment Considerations

Our research demonstrates that deep reinforcement learning agents can significantly outperform traditional portfolio optimization methods in Indian equity markets. The PPO agent's combination of strong returns and risk management makes it the most promising architecture for practical deployment. Key advantages include: no need for return/covariance estimation, natural incorporation of transaction costs and constraints, and adaptive behavior across market regimes.

For practical deployment, we recommend: ensemble of multiple independently trained agents (reducing single-agent risk), maximum allocation limits per stock and sector (guardrails against extreme concentration), human oversight for major allocation shifts (>20% change in total equity exposure), and continuous monitoring of out-of-sample performance with automatic model updates. Alpha AI continues to develop and refine these RL-based portfolio optimization techniques for integration into our advisory platform.

Disclaimer: This research paper is for academic and educational purposes only. Backtested results do not guarantee future performance. Reinforcement learning models involve significant computational complexity and model risk. This does not constitute investment advice.
