⚙️ System Architecture
Overview
The system integrates document ingestion, vector retrieval, and LLM reasoning to produce explainable tax recommendations.
Pipeline Stages
Data Sources:
Finance Acts, CBDT notifications, company annual reports, competitor filings.
Preprocessing:
- OCR for scanned PDFs
- Table extraction (TableNet / PdfTable)
- Text chunking for embeddings
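The chunking step can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions chosen for readability. A real pipeline would more likely split on tokens or on section boundaries of the legal text.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words.

    Hypothetical helper: sizes are counted in whitespace-separated words,
    which is an assumption made for this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a clause that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval on long statutory sentences.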
Embeddings + Vector DB:
- Convert text chunks to embeddings (e.g., OpenAI or Instructor models)
- Store in a semantic vector database (e.g., Pinecone, FAISS)
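To make the embed-then-store idea concrete, here is a small standard-library sketch. It is deliberately a stand-in: `toy_embed` is a hashed bag-of-words vector, not a real embedding model, and `InMemoryVectorStore` is a hypothetical class that mimics what a vector database does (store vectors, return nearest neighbours by cosine similarity). In the actual system these roles would be filled by the embedding model and vector DB listed above.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Stand-in embedding: hashed bag-of-words, L2-normalised.

    Assumption for illustration only; a real model captures semantics,
    this one only captures shared tokens.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Hypothetical minimal vector store using brute-force cosine search."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self._items.append((text, toy_embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        q = toy_embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), t) for t, v in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

Brute-force search is fine for a few thousand chunks; a dedicated index becomes worthwhile once the corpus grows beyond that.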
RAG + LLM Layer:
- Retrieve top-k relevant legal/financial segments
- Generate grounded responses using a fine-tuned LLM
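The two steps above (pick the top-k segments, then ground the generation in them) might look like this in outline. The function names `top_k` and `build_grounded_prompt`, and the prompt wording, are assumptions for illustration; the call to the LLM itself is left out because the model and client library are not specified here.

```python
import heapq


def top_k(scored_segments: list[tuple[float, str]], k: int = 3) -> list[str]:
    """Return the text of the k highest-scoring (score, text) pairs."""
    best = heapq.nlargest(k, scored_segments, key=lambda pair: pair[0])
    return [text for _, text in best]


def build_grounded_prompt(question: str, segments: list[str]) -> str:
    """Assemble a prompt that restricts the model to numbered sources.

    Hypothetical template: numbering the sources lets the answer cite
    them as [n], which supports explainability.
    """
    context = "\n".join(f"[{i}] {seg}" for i, seg in enumerate(segments, start=1))
    return (
        "Answer using only the numbered sources below and cite them as [n]. "
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Asking the model to cite source numbers makes each recommendation traceable back to a specific retrieved segment.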
Output:
- Tax regime classification
- Line-item to tax mapping
- Risk-of-notice score
- Competitor benchmarking report
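One way to carry these four outputs through the system is a single typed record. The sketch below is an assumption, not the project's schema: the class name `TaxInsight`, every field name, and the choice to express the risk score as a value in [0, 1] are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaxInsight:
    """Hypothetical container for the pipeline's outputs."""

    tax_regime: str                       # tax regime classification
    line_item_tax_map: dict[str, str]     # line item -> tax treatment
    risk_of_notice: float                 # assumed to lie in [0, 1]
    benchmark_summary: str = ""           # competitor benchmarking text
    sources: list[str] = field(default_factory=list)  # cited segments

    def __post_init__(self) -> None:
        if not 0.0 <= self.risk_of_notice <= 1.0:
            raise ValueError("risk_of_notice must be between 0 and 1")

    def to_json(self) -> str:
        """Serialise for the dashboard or an API response."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    # Placeholder values only; they do not represent real tax data.
    insight = TaxInsight("regime_a", {"example_line_item": "example_category"}, 0.25)
    print(insight.to_json())
```

Keeping a `sources` field alongside the results gives the dashboard what it needs to show where each conclusion came from.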
Modules
| Module | Function | Tools |
|---|---|---|
| Data Ingestion | Extract structured data from PDFs | OCR, TableNet |
| Embeddings Engine | Create semantic search space | Sentence Transformers |
| RAG Layer | Retrieve and contextualize information | LangChain / LlamaIndex |
| Generation | Produce human-readable tax insights | LLM (GPT / Ollama) |
| Risk Scoring | Predict notice likelihood | ML model + heuristics |
| UI Dashboard | Visualize insights | Streamlit |
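The Risk Scoring row combines an ML model with heuristics. One simple way to do that is a weighted blend, sketched below. Everything here is an assumption: the function names, the idea of scoring heuristics as the fraction of triggered rule flags, and the default `model_weight` of 0.7 are illustrative choices, not the method the project necessarily uses.

```python
def heuristic_score(flags: dict[str, bool]) -> float:
    """Fraction of rule flags that fired (0.0 when no rules are defined)."""
    if not flags:
        return 0.0
    return sum(flags.values()) / len(flags)


def risk_of_notice(
    model_probability: float,
    flags: dict[str, bool],
    model_weight: float = 0.7,  # hypothetical default, to be tuned
) -> float:
    """Blend a model probability with a rule-based score into [0, 1]."""
    if not 0.0 <= model_probability <= 1.0:
        raise ValueError("model_probability must be between 0 and 1")
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must be between 0 and 1")
    return (
        model_weight * model_probability
        + (1.0 - model_weight) * heuristic_score(flags)
    )
```

Because both inputs lie in [0, 1] and the weights sum to 1, the blended score stays in [0, 1], and the triggered flags can be shown next to the score as a plain-language explanation.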
Diagram
```mermaid
flowchart TD
    A[PDFs / Financial Data] --> B[OCR + Parser]
    B --> C[Vector Embeddings]
    C --> D[RAG Retriever]
    D --> E[LLM Generator]
    E --> F[Tax Insights + Risk Score]
```
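The diagram is a strictly linear flow, so the stages can be wired together as an ordered list of callables. The sketch below shows only that wiring; `run_pipeline` and the `Stage` alias are hypothetical names, and the stage functions in the usage example are trivial placeholders rather than real OCR, embedding, or LLM calls.

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]


def run_pipeline(data: Any, stages: list[tuple[str, Stage]]) -> tuple[Any, list[str]]:
    """Pass `data` through each named stage in order.

    Returns the final result plus the list of stage names that ran,
    which can serve as a simple audit trail.
    """
    trace: list[str] = []
    for name, stage in stages:
        data = stage(data)
        trace.append(name)
    return data, trace


if __name__ == "__main__":
    # Placeholder stages standing in for parse -> chunk -> summarise.
    demo_stages = [("parse", str.strip), ("chunk", str.split), ("count", len)]
    print(run_pipeline("  a b c ", demo_stages))
```

Naming each stage keeps the run traceable, which fits the goal of explainable output.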