
⚙️ System Architecture

Overview

The system integrates document ingestion, vector retrieval, and LLM reasoning to produce explainable tax recommendations.


Pipeline Stages

  1. Data Sources:
    Finance Acts, CBDT notifications, company annual reports, competitor filings.

  2. Preprocessing:

    • OCR for scanned PDFs
    • Table extraction (TableNet / PdfTable)
    • Text chunking for embeddings
  3. Embeddings + Vector DB:

    • Convert text to embeddings (OpenAI/Instructor models)
    • Store in a semantic vector database (e.g., Pinecone, FAISS)
  4. RAG + LLM Layer:

    • Retrieve top-k relevant legal/financial segments
    • Generate grounded responses using a fine-tuned LLM
  5. Output:

    • Tax regime classification
    • Line-item to tax mapping
    • Risk-of-notice score
    • Competitor benchmarking report
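
The chunking and top-k retrieval steps in stages 2 to 4 can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names (`chunk_text`, `embed`, `cosine`, `top_k`) are hypothetical, the token-count "embedding" is a toy stand-in for a real embedding model, and the in-memory list stands in for a vector database such as Pinecone or FAISS. The sample sentences are made up for the demo.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (stage 2: text chunking)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase token counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Rank chunks by similarity to the query and keep the k best (stage 4)."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Illustrative, made-up segments standing in for parsed source documents.
segments = [
    "A notification sets a concessional tax rate for domestic companies.",
    "Depreciation on plant and machinery reduces taxable income.",
    "The annual report lists revenue from operations by segment.",
]
for score, segment in top_k("concessional tax rate for domestic companies", segments, k=2):
    print(f"{score:.2f}  {segment}")
```

In a real deployment the `embed` function would call the chosen embedding model and `top_k` would be a query against the vector store; the ranking logic stays the same.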

Modules

| Module | Function | Tools |
| --- | --- | --- |
| Data Ingestion | Extract structured data from PDFs | OCR, TableNet |
| Embeddings Engine | Create semantic search space | Sentence Transformers |
| RAG Layer | Retrieve and contextualize information | LangChain / LlamaIndex |
| Generation | Produce human-readable tax insights | LLM (GPT/Ollama) |
| Risk Scoring | Predict notice likelihood | ML model + heuristics |
| UI Dashboard | Visualize insights | Streamlit |
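
One way the Risk Scoring module could combine an ML model with heuristics is a weighted blend of the two. The sketch below is an assumption throughout: the `Filing` fields, the thresholds, the heuristic weights, and the `model_weight` default are all hypothetical placeholders chosen for illustration, and `model_probability` is assumed to come from a separately trained classifier.

```python
from dataclasses import dataclass


@dataclass
class Filing:
    """Hypothetical feature set for one company filing."""
    effective_tax_rate: float   # tax paid / profit before tax
    peer_median_rate: float     # median effective rate among competitors
    related_party_share: float  # fraction of revenue from related parties
    late_filings: int           # number of late filings on record


def heuristic_score(filing: Filing) -> float:
    """Sum of rule-based red flags, capped at 1.0 (placeholder thresholds)."""
    score = 0.0
    if filing.peer_median_rate - filing.effective_tax_rate > 0.05:
        score += 0.4  # pays noticeably less tax than peers
    if filing.related_party_share > 0.25:
        score += 0.3  # heavy related-party revenue
    if filing.late_filings > 0:
        score += 0.3  # history of late filings
    return min(score, 1.0)


def risk_of_notice(filing: Filing, model_probability: float,
                   model_weight: float = 0.6) -> float:
    """Blend a model probability with the heuristic score into one value in [0, 1]."""
    if not 0.0 <= model_probability <= 1.0:
        raise ValueError("model_probability must be in [0, 1]")
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must be in [0, 1]")
    return (model_weight * model_probability
            + (1.0 - model_weight) * heuristic_score(filing))


filing = Filing(effective_tax_rate=0.15, peer_median_rate=0.25,
                related_party_share=0.30, late_filings=1)
print(round(risk_of_notice(filing, model_probability=0.5), 2))
```

Keeping the heuristic flags separate from the model output also makes the score easier to explain, since each triggered rule can be reported alongside the final number.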

Diagram

```mermaid
flowchart TD
A[PDFs / Financial Data] --> B[OCR + Parser]
B --> C[Vector Embeddings]
C --> D[RAG Retriever]
D --> E[LLM Generator]
E --> F[Tax Insights + Risk Score]
```
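
The flow in the diagram can be read as a chain of stages that each take the pipeline state and return it enriched. The sketch below mirrors nodes A through E with placeholder logic only: `PipelineState`, the `*_stage` functions, and `run_pipeline` are hypothetical names, the word-set "embedding" and overlap ranking are stand-ins for a real embedding model and retriever, and `generate_stage` formats a string where the LLM call would go. Risk scoring (part of node F) is left out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Data handed from one stage to the next (hypothetical container)."""
    documents: list[str]                                 # A: raw text from PDFs / financial data
    query: str
    parsed: list[str] = field(default_factory=list)      # B: cleaned text
    index: list[tuple[set[str], str]] = field(default_factory=list)  # C: placeholder vectors
    retrieved: list[str] = field(default_factory=list)   # D: top matches
    answer: str = ""                                     # E: generated output
    trace: list[str] = field(default_factory=list)       # stage names in run order


Stage = Callable[[PipelineState], PipelineState]


def parse_stage(state: PipelineState) -> PipelineState:
    # Stand-in for OCR + parsing: only normalizes whitespace.
    state.parsed = [" ".join(doc.split()) for doc in state.documents]
    return state


def embed_stage(state: PipelineState) -> PipelineState:
    # Placeholder "embedding": the set of lowercase words in each text.
    state.index = [(set(text.lower().split()), text) for text in state.parsed]
    return state


def retrieve_stage(state: PipelineState, k: int = 2) -> PipelineState:
    # Rank by word overlap with the query and keep the k best.
    query_words = set(state.query.lower().split())
    ranked = sorted(state.index, key=lambda item: len(query_words & item[0]), reverse=True)
    state.retrieved = [text for _, text in ranked[:k]]
    return state


def generate_stage(state: PipelineState) -> PipelineState:
    # Placeholder for the LLM call: shows the retrieved context being passed along.
    state.answer = f"Q: {state.query}\nContext: " + " | ".join(state.retrieved)
    return state


def run_pipeline(state: PipelineState, stages: list[Stage]) -> PipelineState:
    """Apply each stage in order and record which stages ran."""
    for stage in stages:
        state = stage(state)
        state.trace.append(stage.__name__)
    return state


result = run_pipeline(
    PipelineState(documents=["  tax   regime  options ", "dividend policy"], query="tax regime"),
    [parse_stage, embed_stage, retrieve_stage, generate_stage],
)
print(result.trace)
print(result.answer)
```

Because every stage shares one signature, a real OCR parser, embedding model, or LLM client can replace its placeholder without changing `run_pipeline`.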