Transformer-based NLP pipeline for automated document classification and semantic search using ChromaDB and Pinecone vector databases.
The Document Intelligence Classifier is a deep learning NLP pipeline that automatically classifies documents into predefined categories and extracts structured information from unstructured text.
The system uses transformer-based embeddings to generate dense vector representations of documents. These vectors are stored in ChromaDB for local use and in Pinecone for cloud-scale vector search. A dense Keras classification head, fine-tuned on the embeddings, assigns each document to a category.
A Gradio interface lets users upload a document and receive its predicted category with confidence scores, together with its closest semantic similarity matches. The system is designed to scale from hundreds to millions of documents.
Cleaning, tokenizing, and normalizing document text. Handling multiple document formats (PDF, DOCX, TXT).
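The cleaning and normalizing step can be sketched with the standard library alone. This is a minimal illustration, not the project's actual preprocessing code: the function names `normalize_text` and `tokenize` are hypothetical, and the specific rules (NFKC folding, rejoining hyphenated line breaks, dropping control characters) are assumptions about what "cleaning" covers. Extracting text from PDF or DOCX files would need a separate parsing library and is not shown.

```python
import re
import unicodedata
from typing import List


def normalize_text(raw: str, lowercase: bool = False) -> str:
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    # NFKC folds compatibility forms such as ligatures and non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across line breaks (a common PDF extraction artifact).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop control and format characters, keeping newlines and tabs for now.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse every run of whitespace (including newlines and tabs) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower() if lowercase else text


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens (illustrative, for statistics only)."""
    return re.findall(r"\w+", text.lower())
```

Lowercasing is left off by default because the embedding model applies its own tokenizer and may be case-sensitive; `tokenize` here is only useful for simple corpus statistics such as document length.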
Using sentence-transformers to encode documents into dense vectors. Batch processing for efficiency.
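Batch encoding can be sketched as follows. The helpers `batched`, `embed_documents`, and `sentence_transformer_encoder` are hypothetical names, and the model name `all-MiniLM-L6-v2` is an assumption, since the source does not say which sentence-transformers model is used. The encoder is passed in as a function so the batching logic can be exercised without downloading a model.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_documents(
    texts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
) -> List[Vector]:
    """Encode texts batch by batch and return one vector per input text."""
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode_batch(batch))
    if len(vectors) != len(texts):
        raise RuntimeError("encoder returned a different number of vectors than inputs")
    return vectors


def sentence_transformer_encoder(model_name: str = "all-MiniLM-L6-v2"):
    """Build an encode_batch function backed by sentence-transformers.

    The model name is an assumption. The import is done lazily so the
    batching helpers above work without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def encode_batch(batch: Sequence[str]) -> List[Vector]:
        # normalize_embeddings=True returns unit-length vectors.
        return model.encode(list(batch), normalize_embeddings=True).tolist()

    return encode_batch
```

A call would look like `embed_documents(texts, sentence_transformer_encoder(), batch_size=64)`. Note that `SentenceTransformer.encode` also accepts its own `batch_size` argument; explicit batching like this is mainly useful when documents are streamed from disk rather than held in memory at once.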
Indexing embeddings in ChromaDB locally and Pinecone in the cloud. Building semantic search endpoints.
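What a semantic search endpoint computes can be illustrated with a small in-memory index that ranks stored vectors by cosine similarity. The class `InMemoryVectorIndex` and its method names are hypothetical stand-ins written for this sketch; they are not the ChromaDB or Pinecone client APIs, and production vector databases typically use approximate nearest-neighbor indexes rather than the exhaustive scan shown here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _unit(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot index or query a zero vector")
    return [x / norm for x in vector]


class InMemoryVectorIndex:
    """Exhaustive cosine-similarity search over stored document vectors."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self._vectors: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, vector: Sequence[float]) -> None:
        """Insert a vector, or overwrite the one stored under doc_id."""
        if len(vector) != self.dimension:
            raise ValueError("vector has the wrong dimension")
        self._vectors[doc_id] = _unit(vector)

    def query(self, vector: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        """Return up to top_k (doc_id, cosine similarity) pairs, best first."""
        if len(vector) != self.dimension:
            raise ValueError("vector has the wrong dimension")
        q = _unit(vector)
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, stored)))
            for doc_id, stored in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

Keeping the same two operations (write vectors under an ID, query by vector for the top matches) behind one interface is one way to let a local store and a cloud store be swapped without changing the search endpoint.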
Fine-tuning a dense Keras classification head on embeddings. Using early stopping and learning rate scheduling.
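A training setup along these lines might look like the sketch below. The layer sizes, dropout rate, patience, and the halve-every-five-epochs schedule are all assumptions chosen for illustration; the source does not state the actual architecture or hyperparameters. The function names `step_decay`, `build_head`, and `train_head` are hypothetical. Keras is imported inside the functions so the schedule can be inspected without TensorFlow installed.

```python
def step_decay(epoch: int, lr: float) -> float:
    """Halve the learning rate every 5 epochs (illustrative schedule)."""
    if epoch > 0 and epoch % 5 == 0:
        return lr * 0.5
    return lr


def build_head(embedding_dim: int, num_classes: int):
    """Build a small dense classifier that takes embeddings as input."""
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(embedding_dim,)),
        keras.layers.Dense(256, activation="relu"),  # hidden size is an assumption
        keras.layers.Dropout(0.3),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss="sparse_categorical_crossentropy",  # expects integer class labels
        metrics=["accuracy"],
    )
    return model


def train_head(model, x_train, y_train, x_val, y_val, max_epochs: int = 50):
    """Fit the head with early stopping and a per-epoch learning rate schedule."""
    from tensorflow import keras

    callbacks = [
        # Stop once validation loss has not improved for 5 epochs and
        # restore the weights from the best epoch.
        keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        ),
        # Calls step_decay(epoch, current_lr) at the start of each epoch.
        keras.callbacks.LearningRateScheduler(step_decay),
    ]
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=max_epochs,
        batch_size=64,
        callbacks=callbacks,
        verbose=0,
    )
```

Because the head trains on precomputed embeddings rather than raw text, each epoch only runs a few dense layers, so a generous `max_epochs` with early stopping is inexpensive.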
Reporting accuracy, F1, and per-class metrics. Visualizing confusion matrices and t-SNE embedding projections.
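The reported metrics can be computed from the true and predicted labels as in this plain-Python sketch. The function names are hypothetical and the example category names used with it are invented for illustration. Libraries such as scikit-learn provide equivalent reports; this version only makes the definitions explicit.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of documents whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall, F1, and support for every label that appears."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must be equal in length")
    # counts[(true, predicted)] holds the confusion matrix entries.
    counts = Counter(zip(y_true, y_pred))
    report: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = counts[(label, label)]
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": float(tp + fn),
        }
    return report


def macro_f1(report: Dict[str, Dict[str, float]]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(row["f1"] for row in report.values()) / len(report)
```

Macro-averaged F1 weights every class equally, which makes weak performance on rare document categories visible even when overall accuracy is high.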
Building an interactive Gradio app for document upload, classification, and semantic similarity search.
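A minimal version of the app could be wired up as below. This sketch assumes Gradio 4.x (where `gr.File(type="filepath")` passes the uploaded file's path to the function), and it covers only the classification output, not the similarity search. The names `to_label_scores` and `build_app` are hypothetical, and `classify_file` stands for whatever function runs the preprocessing, embedding, and classification steps on one file.

```python
from typing import Callable, Dict, Sequence


def to_label_scores(
    probabilities: Sequence[float], class_names: Sequence[str]
) -> Dict[str, float]:
    """Pair class names with probabilities in the dict form gr.Label displays."""
    if len(probabilities) != len(class_names):
        raise ValueError("expected one probability per class name")
    return {name: float(p) for name, p in zip(class_names, probabilities)}


def build_app(classify_file: Callable[[str], Dict[str, float]]):
    """Wrap a file-path classifier in a Gradio interface.

    classify_file takes the path of the uploaded document and returns a
    {category: confidence} dict, for example via to_label_scores.
    """
    import gradio as gr  # imported lazily; only needed when the app is built

    return gr.Interface(
        fn=classify_file,
        inputs=gr.File(label="Document", type="filepath"),
        outputs=gr.Label(num_top_classes=3, label="Predicted category"),
        title="Document Intelligence Classifier",
    )
```

Calling `build_app(my_classifier).launch()` would start a local web server; `my_classifier` is a placeholder for the project's own prediction function.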