Deep Learning · NLP · Intermediate

🔍 Document Intelligence Classifier

Transformer-based NLP pipeline for automated document classification and semantic search using ChromaDB and Pinecone vector databases.

Transformer Embeddings
Vector DB Semantic Search
Gradio Interface

Project Overview

The Document Intelligence Classifier is a deep learning NLP pipeline that automatically classifies documents into predefined categories and extracts structured information from unstructured text.

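The system uses transformer-based embeddings to generate dense vector representations of documents. These vectors are stored in ChromaDB for local use and in Pinecone for cloud-scale vector search. A Keras classification head fine-tuned on the same embeddings assigns each document to a category; because it operates on precomputed vectors rather than raw text, the classification step itself stays lightweight.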

A Gradio interface allows users to upload documents and receive instant classification results with confidence scores and semantic similarity matches. The system is designed to scale from hundreds to millions of documents.

What You'll Learn

  • Generate document embeddings using transformer models (sentence-transformers)
  • Build and query vector databases with ChromaDB and Pinecone
  • Fine-tune a Keras classification head on transformer embeddings
  • Implement semantic search with similarity scoring
  • Build a Gradio UI for document upload and real-time classification
  • Evaluate NLP models with precision, recall, F1, and confusion matrices

System Architecture

Documents (Input) → Tokenization (Preprocessing) → Transformer (Embeddings) → ChromaDB/Pinecone (Vector Store) → Classifier (Keras Head) → Results (Gradio UI)

Project Breakdown

01
Text Preprocessing

Cleaning, tokenizing, and normalizing document text. Handling multiple document formats (PDF, DOCX, TXT).
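
As a rough illustration of the cleaning and normalizing step, the sketch below uses only the Python standard library. It assumes the text has already been extracted from the PDF, DOCX, or TXT file by a separate parser (not shown). The function names normalize_text and simple_tokenize are hypothetical, and the project's actual preprocessing rules may differ.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFKC folds compatibility characters (ligatures, non-breaking spaces).
    text = unicodedata.normalize("NFKC", raw)
    # Replace control/format characters (Unicode category C*) with a space,
    # but keep newlines and tabs so the next steps can still see them.
    text = "".join(
        ch if ch in "\n\t" or not unicodedata.category(ch).startswith("C") else " "
        for ch in text
    )
    # Rejoin words hyphenated across line breaks, a common PDF artifact.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def simple_tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (keeps simple contractions)."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
```

Transformer models ship their own subword tokenizers, so a word-level tokenizer like this one is mainly useful for inspection and keyword statistics, not as input to the embedding model.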

02
Embedding Generation

Using sentence-transformers to encode documents into dense vectors. Batch processing for efficiency.
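
A minimal sketch of batched encoding with sentence-transformers might look like the following. The model name all-MiniLM-L6-v2 is an assumption (the project does not say which model it uses), and the helper names load_default_encoder and embed_documents are hypothetical. The encoder is passed in as an argument so that any object with a compatible encode method can be substituted.

```python
def load_default_encoder(model_name: str = "all-MiniLM-L6-v2"):
    """Load a sentence-transformers model; the model name is an assumption."""
    # Imported lazily so the module loads even without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embed_documents(texts, encoder=None, batch_size: int = 32):
    """Encode documents into dense vectors, one vector per input text."""
    if encoder is None:
        encoder = load_default_encoder()
    # encode() splits the input into batches of `batch_size` internally;
    # normalize_embeddings=True returns unit-length vectors, so a dot
    # product between two of them equals their cosine similarity.
    return encoder.encode(
        list(texts),
        batch_size=batch_size,
        normalize_embeddings=True,
        show_progress_bar=False,
    )
```

Larger batch sizes generally improve throughput on a GPU at the cost of memory, so the value of 32 is only a starting point.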

03
Vector Database

Indexing embeddings in ChromaDB locally and Pinecone in the cloud. Building semantic search endpoints.
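
To make the idea of a semantic search query concrete, here is a brute-force sketch in plain Python. It is not the ChromaDB or Pinecone API: it is a stand-in that scores every stored vector against the query, whereas a vector database typically answers the same question with an approximate nearest-neighbour index. The names cosine_similarity and top_k, and the dictionary used as the index, are illustrative assumptions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index: dict, k: int = 3):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings" keyed by document id.
toy_index = {
    "invoice": [1.0, 0.0],
    "contract": [0.0, 1.0],
    "receipt": [1.0, 1.0],
}
matches = top_k([1.0, 0.0], toy_index, k=2)
```

A linear scan like this is fine for a few thousand documents; beyond that, an indexed store is what keeps query latency manageable.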

04
Classification Model

Fine-tuning a dense Keras classification head on embeddings. Using early stopping and learning rate scheduling.
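
One way such a classification head could be set up in Keras is sketched below. The layer sizes, dropout rate, learning rate, and callback settings are all assumptions chosen for illustration, not values taken from the project, and build_head and make_callbacks are hypothetical helper names.

```python
# Assumed hyperparameters for illustration only.
EARLY_STOPPING = {"monitor": "val_loss", "patience": 3, "restore_best_weights": True}
LR_SCHEDULE = {"monitor": "val_loss", "factor": 0.5, "patience": 2}


def build_head(embedding_dim: int, num_classes: int,
               hidden_units: int = 256, dropout: float = 0.3):
    """Build a small dense classifier that takes embeddings as input."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    model = keras.Sequential([
        keras.layers.Input(shape=(embedding_dim,)),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dropout(dropout),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        # Expects integer class ids as labels.
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def make_callbacks():
    """Early stopping plus learning-rate reduction when validation loss stalls."""
    from tensorflow import keras

    return [
        keras.callbacks.EarlyStopping(**EARLY_STOPPING),
        keras.callbacks.ReduceLROnPlateau(**LR_SCHEDULE),
    ]
```

Training would then pass the embedding matrix and integer labels to model.fit together with a validation split and callbacks=make_callbacks(). Here the learning rate schedule is the plateau-based reduction; a fixed decay schedule would be an equally reasonable reading of the description.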

05
Evaluation

Reporting accuracy, F1, and per-class metrics. Visualizing confusion matrices and t-SNE embedding projections.
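
The per-class metrics can be derived directly from the confusion matrix, as the following standard-library sketch shows. The label names and predictions are made-up toy data, and the function names are hypothetical. In practice, scikit-learn's classification_report and confusion_matrix compute the same quantities.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_metrics(y_true, y_pred, labels):
    """Precision, recall, and F1 for each label, computed from the matrix."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as label, but wrong
        fn = sum(matrix[i]) - tp                 # truly label, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy data for illustration only.
labels = ["invoice", "contract", "memo"]
y_true = ["invoice", "invoice", "contract", "contract", "memo"]
y_pred = ["invoice", "contract", "contract", "contract", "invoice"]
report = per_class_metrics(y_true, y_pred, labels)
```

Reading the matrix row by row shows which categories get confused with each other, which overall accuracy alone hides.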

06
Gradio Interface

Building an interactive Gradio app for document upload, classification, and semantic similarity search.
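
A minimal Gradio app along these lines could be wired up as follows. This is a sketch under several assumptions: classify_text is a hypothetical placeholder that scores keywords instead of calling the trained model, the three category names are invented, and a text box stands in for file upload to keep the example short.

```python
def classify_text(text: str) -> dict[str, float]:
    """Placeholder classifier returning label -> confidence (sums to 1).

    A real implementation would embed the text and run the Keras head;
    the keyword counts here only make the sketch runnable on its own.
    """
    lowered = text.lower()
    scores = {
        "invoice": 1.0 + lowered.count("invoice"),
        "contract": 1.0 + lowered.count("agreement"),
        "other": 1.0,
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


def build_demo():
    """Create the Gradio interface; call .launch() on the result to serve it."""
    import gradio as gr  # imported lazily; requires the gradio package

    return gr.Interface(
        fn=classify_text,
        inputs=gr.Textbox(lines=10, label="Document text"),
        # gr.Label renders a dict of label -> confidence as ranked bars.
        outputs=gr.Label(num_top_classes=3, label="Predicted category"),
        title="Document Intelligence Classifier",
    )
```

Running build_demo().launch() starts a local web server. Swapping the text box for a file input and adding a second output for similarity matches would bring the sketch closer to the interface described above.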