NLP-Based Sentiment Analysis for Indian Financial Markets: A Multi-Lingual Approach

Alpha AI Research Division

Authors: Alpha AI Research Team | Published: March 2026 | Category: Natural Language Processing & Finance

Abstract

This paper introduces FinSentiment-IN, a multi-lingual natural language processing framework specifically designed for financial sentiment analysis in Indian markets. Unlike existing models trained predominantly on English-language Western financial texts, our approach addresses the unique challenge of Indian financial discourse, which spans multiple languages (English, Hindi, and Hinglish), includes India-specific financial terminology, and reflects domestic market dynamics. We fine-tune pre-trained language models on a curated dataset of 150,000 Indian financial texts and demonstrate significant improvements over general-purpose sentiment models for Indian equity market prediction.

1. Introduction

Financial sentiment analysis has emerged as a powerful tool for extracting trading signals from unstructured text data. However, the overwhelming majority of research focuses on English-language sources from US and European markets. Indian financial markets present unique NLP challenges: financial news and social media discussions occur in English, Hindi, and Hinglish (a Hindi-English hybrid); India-specific regulatory terms (SEBI circulars, RBI policy actions) carry distinct sentiment implications; and the retail-dominated nature of Indian markets makes social media sentiment particularly influential.

Existing sentiment analysis tools, including FinBERT and other financial NLP models, show degraded performance on Indian financial texts. Our analysis reveals that general FinBERT achieves only 61% accuracy on Indian financial news, compared to 84% on US financial news, highlighting the domain gap. This paper addresses this gap through targeted dataset creation, model fine-tuning, and integration with quantitative trading frameworks.

2. Dataset Construction

We construct the IndFinSent dataset comprising 150,000 annotated financial texts from five sources: (1) 40,000 English-language headlines from Economic Times, Moneycontrol, and LiveMint (2020-2025); (2) 30,000 Hindi financial news headlines from Hindi business portals; (3) 35,000 Hinglish social media posts from financial communities on Twitter/X and Telegram; (4) 25,000 SEBI and RBI circular summaries with market impact annotations; and (5) 20,000 quarterly earnings call transcripts from Nifty 100 companies.

Each text is annotated by three financial domain experts on a 5-point scale (strongly negative, negative, neutral, positive, strongly positive) with inter-annotator agreement measured via Krippendorff's alpha (0.78, indicating substantial agreement). We additionally annotate entity-level sentiment (company-specific vs. market-level), temporal relevance (immediate impact vs. long-term), and confidence level. The dataset is split 70-15-15 for training, validation, and testing, with temporal stratification to prevent data leakage.
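One plausible reading of the temporally stratified 70-15-15 split is a purely chronological one, where every validation and test text postdates the training texts. The sketch below illustrates that reading only; the function name `temporal_split`, the `(date, text)` record layout, and the toy data are assumptions for illustration, not the authors' pipeline.

```python
from datetime import date, timedelta

def temporal_split(records, train_pct=70, val_pct=15):
    """Split (date, text) records chronologically into train/val/test.

    Integer percentages avoid floating-point rounding at the boundaries.
    """
    ordered = sorted(records, key=lambda r: r[0])  # oldest first
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Toy data: 20 dated headlines (hypothetical), deliberately unsorted.
start = date(2024, 1, 1)
records = [(start + timedelta(days=i), f"headline {i}") for i in range(20)]
records.reverse()

train, val, test = temporal_split(records)
```

Because the split is by position in time order, no text in the training set is newer than any text in the validation or test sets, which is the property that guards against look-ahead leakage.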

3. Model Architecture

FinSentiment-IN builds upon the multilingual BERT (mBERT) architecture, chosen for its pre-training on 104 languages including Hindi. We implement a two-stage fine-tuning process: first, domain adaptation using a masked language modeling objective on 2 million unlabeled Indian financial texts to adapt the model's language representation to financial domain vocabulary. Second, task-specific fine-tuning on the IndFinSent labeled dataset for sentiment classification.

The architecture incorporates several innovations for Indian financial text: a custom tokenizer extension that handles common Hinglish patterns (e.g., "market girega", meaning "market will fall"), a regulatory event embedding that encodes the type and source of regulatory announcements, and an attention mechanism that weights entity mentions in proportion to their market-capitalization relevance. The final classification head outputs both sentiment probabilities and a confidence score calibrated through temperature scaling.
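Temperature scaling divides the classifier's logits by a single scalar T, fitted on validation data, before the softmax. A minimal sketch follows, assuming a five-class head and a simple grid search over T; the paper does not say how T is fitted, and the logits and labels here are toy values chosen to mimic an overconfident model.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Choose the temperature on the grid that minimizes validation NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]  # 0.5 .. 4.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy validation set: the model is very confident in class 0 every time,
# but is right only three times out of four (hypothetical values).
val_logits = [[4.0, 0.0, 0.0, 0.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
raw_conf = max(softmax(val_logits[0]))      # confidence before calibration
cal_conf = max(softmax(val_logits[0], T))   # confidence after calibration
```

On this toy set the fitted temperature is above 1, so the calibrated confidence is lower than the raw one and sits closer to the observed 75% hit rate. The predicted class is unchanged, since dividing by a positive scalar preserves the ordering of the logits.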

4. Experimental Results

FinSentiment-IN achieves 82.4% accuracy on the held-out test set, significantly outperforming baseline models: general mBERT (64.2%), English FinBERT applied directly (61.0%), and a Hindi-specific BERT model (67.8%). The improvement is most pronounced for Hinglish texts (84.1% vs. 52.3% for FinBERT) and regulatory announcements (79.6% vs. 58.4%), confirming the value of domain-specific fine-tuning for Indian financial contexts.

Ablation studies reveal the relative importance of each component: domain-adaptive pre-training contributes an accuracy improvement of 8.2 percentage points over direct fine-tuning, the custom Hinglish tokenizer adds 3.4 points, regulatory event embeddings contribute 2.1 points, and the entity-aware attention mechanism adds 1.5 points. Analysis of failure cases shows that the model struggles most with sarcasm in social media posts (47% accuracy), ambiguous regulatory language (62%), and texts requiring world knowledge beyond the training window.

5. Trading Signal Generation

We construct a sentiment-based trading strategy by aggregating FinSentiment-IN predictions across all relevant texts for each Nifty 50 stock daily. The aggregation uses time-decayed weighting (exponential decay with 3-day half-life), source credibility weighting (official news > analyst reports > social media), and volume weighting (number of mentions reflects attention intensity). The resulting daily sentiment score ranges from -1 (extremely bearish) to +1 (extremely bullish).
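The aggregation can be sketched as a weighted average in which each text's weight is the product of a time-decay term and a source-credibility term. The 3-day half-life comes from the text. The numeric credibility weights, the source labels, and the assumption that each text already carries a score in [-1, 1] are illustrative, since the paper gives only the ordering of sources. Mention volume enters only implicitly here, because each mention contributes one term to the sum.

```python
HALF_LIFE_DAYS = 3.0  # exponential decay half-life stated in the text

# Hypothetical credibility weights; only their ordering is given in the paper.
SOURCE_WEIGHT = {
    "official_news": 1.0,
    "analyst_report": 0.7,
    "social_media": 0.4,
}

def daily_sentiment(items):
    """Aggregate (score, age_days, source) tuples into one score in [-1, 1].

    Each item's weight halves for every HALF_LIFE_DAYS of age and is scaled
    by the credibility of its source. Returns 0.0 when there are no items.
    """
    num = 0.0
    den = 0.0
    for score, age_days, source in items:
        weight = 0.5 ** (age_days / HALF_LIFE_DAYS) * SOURCE_WEIGHT[source]
        num += weight * score
        den += weight
    return num / den if den else 0.0

# A bullish headline today and an equally bearish one from three days ago:
# the older item has half the weight, so the result is (1 - 0.5) / 1.5.
example = daily_sentiment([
    (1.0, 0, "official_news"),
    (-1.0, 3, "official_news"),
])
```

Normalizing by the sum of weights keeps the result inside [-1, 1] regardless of how many texts mention the stock on a given day.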

A long-short portfolio that buys the top quintile of sentiment scores and shorts the bottom quintile generates a 5-day forward return spread of 1.2% (annualized alpha of 14.3% after transaction costs). The sentiment signal shows low correlation (0.12) with traditional momentum and value factors, indicating that it carries incremental information. Combining sentiment with technical indicators in a gradient boosting ensemble improves the Sharpe ratio from 0.95 (sentiment-only) to 1.31 (combined), demonstrating complementarity with quantitative signals.
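The quintile spread can be illustrated with a small sketch: rank stocks by sentiment, then subtract the mean forward return of the bottom fifth from that of the top fifth. The function name, the equal weighting within each quintile, the omission of transaction costs, and the toy tickers and returns are all assumptions for illustration rather than the paper's backtest.

```python
def quintile_spread(sentiment, forward_return):
    """Mean forward return of the top sentiment quintile minus the bottom one.

    Both arguments map ticker -> float. Stocks are equal-weighted within
    each quintile, and transaction costs are ignored.
    """
    ranked = sorted(sentiment, key=sentiment.get)  # most bearish first
    k = max(1, len(ranked) // 5)                   # quintile size
    bottom, top = ranked[:k], ranked[-k:]

    def mean_return(tickers):
        return sum(forward_return[t] for t in tickers) / len(tickers)

    return mean_return(top) - mean_return(bottom)

# Toy universe of 10 hypothetical tickers whose forward returns rise
# with sentiment, so the spread should be positive.
tickers = [f"S{i}" for i in range(10)]
sentiment = {t: (i - 4.5) / 4.5 for i, t in enumerate(tickers)}
forward_return = {t: 0.001 * i for i, t in enumerate(tickers)}

spread = quintile_spread(sentiment, forward_return)
```

With 10 names each quintile holds two stocks, so the spread here is the average return of S8 and S9 minus the average return of S0 and S1.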

6. Real-Time Deployment Architecture

For production deployment, we design a real-time NLP pipeline that processes financial texts with sub-second latency. The architecture consists of: (a) news and social media ingestion layer using websockets and REST API polling across 15 Indian financial sources; (b) preprocessing pipeline for language detection, text cleaning, and entity extraction; (c) inference engine running the FinSentiment-IN model on GPU-accelerated infrastructure; and (d) signal aggregation and delivery to downstream trading systems.

The system processes approximately 10,000 texts per hour during market hours, with an average inference latency of 45 ms per text on an NVIDIA T4 GPU. We implement automatic model monitoring for concept drift: if daily accuracy on a holdout set drops below 75% for three consecutive days, retraining is triggered automatically using the latest labeled data. This production architecture is designed to keep the sentiment signals reliable and current as market dynamics and language patterns evolve.
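The retraining rule described above reduces to a streak counter over daily holdout accuracy. A minimal sketch follows, using the 75% threshold and three-day window from the text; the class name `DriftMonitor` and the choice to reset the streak after a trigger are assumptions.

```python
class DriftMonitor:
    """Signal retraining after `patience` consecutive days below `threshold`."""

    def __init__(self, threshold=0.75, patience=3):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0  # consecutive days below the threshold so far

    def update(self, daily_accuracy):
        """Record one day's holdout accuracy; return True if retraining is due."""
        if daily_accuracy < self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any day at or above the threshold breaks the streak
        if self.streak >= self.patience:
            self.streak = 0  # assumption: start counting afresh after a trigger
            return True
        return False

# Two bad days, a recovery, then three bad days in a row:
# only the final day triggers retraining.
monitor = DriftMonitor()
accuracies = [0.80, 0.74, 0.73, 0.76, 0.74, 0.72, 0.70]
triggers = [monitor.update(a) for a in accuracies]
```

Requiring consecutive days rather than a single bad reading keeps one noisy day of holdout accuracy from kicking off an unnecessary retraining run.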

7. Conclusions

FinSentiment-IN demonstrates that domain-specific, language-aware NLP models can significantly improve financial sentiment analysis for Indian markets. The multi-lingual capability addressing Hindi and Hinglish financial texts fills a critical gap in the existing research. Our trading analysis confirms that NLP-derived sentiment signals contain unique alpha not captured by traditional quantitative factors. Future work will extend the model to cover Marathi, Tamil, and Bengali financial discourse, explore the use of large language models (LLMs) for zero-shot sentiment analysis, and investigate the integration of visual information from stock charts in social media posts.

Disclaimer: This research paper is published for academic and educational purposes only. The trading results are based on historical backtests and do not guarantee future performance. This does not constitute investment advice.
