Research Assistant - LLM Research Pipeline
An end-to-end pipeline for processing research PDFs with LLMs (Ollama or Gemini). It generates categories dynamically, parses PDFs with an OCR fallback, extracts metadata with an LLM, scores each paper against multiple categories, removes duplicates, and writes topic-focused summaries.
Problem & Solution
The Problem
Organizing research PDFs is labor-intensive. It means extracting metadata, categorizing papers across multiple themes, filtering by topic relevance, detecting duplicates, and writing summaries. Manual workflows are slow, inconsistent, and hard to reproduce.
The Solution
Research Assistant automates this pipeline with LLMs. It generates a dynamic taxonomy from your topic, parses PDFs with OCR fallback, extracts rich metadata, scores papers across all categories, moves each to the best-fit folder, removes duplicates, and produces topic-focused summaries with CSV/JSONL indices for downstream analysis.
Technologies Used
Core & Parsing
LLM & Embeddings
Indexing & Quality
Key Achievements
Dynamic LLM-Driven Taxonomy
Generates categories from your topic (no hardcoding) and scores each paper across all categories simultaneously to choose the best placement.
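As a rough illustration of the scoring step, the sketch below scores one paper against every category and picks the highest. The names `pick_best_category` and `keyword_overlap_score` are hypothetical and not taken from the project. The keyword scorer is only a stand-in so the example runs offline, since the real pipeline presumably asks the LLM for each score.

```python
from typing import Callable, Dict, List, Tuple


def pick_best_category(
    paper_text: str,
    categories: List[str],
    score_fn: Callable[[str, str], float],
) -> Tuple[str, Dict[str, float]]:
    """Score the paper against every category and return the top one.

    score_fn is an assumption: any callable mapping (text, category)
    to a number, for example a wrapper around an LLM prompt.
    """
    if not categories:
        raise ValueError("at least one category is required")
    scores = {cat: score_fn(paper_text, cat) for cat in categories}
    # On ties, max() keeps the category that appears first in the list.
    best = max(scores, key=scores.get)
    return best, scores


def keyword_overlap_score(paper_text: str, category: str) -> float:
    """Stand-in scorer: fraction of the category's words found in the text."""
    words = category.lower().split()
    text = paper_text.lower()
    return sum(w in text for w in words) / len(words)


if __name__ == "__main__":
    best, scores = pick_best_category(
        "A study of graph neural networks for molecules",
        ["graph neural networks", "reinforcement learning"],
        keyword_overlap_score,
    )
    print(best, scores)
```

Returning the full score dictionary alongside the winner keeps the per-category scores available for an index or manifest.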
Accurate Parsing with OCR Fallback
Uses PyMuPDF for born-digital PDFs and falls back to OCR (ocrmypdf + Tesseract) for scanned documents, so scanned papers still yield usable text.
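One way the fallback decision could work is sketched below. The average-characters-per-page heuristic and its threshold are assumptions rather than the project's documented rule. The extractor and OCR step are passed in as callables, so the control flow runs without PyMuPDF or ocrmypdf installed.

```python
from typing import Callable, List


def needs_ocr(page_texts: List[str], min_chars_per_page: int = 50) -> bool:
    """Assumed heuristic: treat a PDF as scanned when its pages hold
    very little extractable text on average."""
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_with_fallback(
    path: str,
    extract_text: Callable[[str], List[str]],
    run_ocr: Callable[[str], str],
    min_chars_per_page: int = 50,
) -> List[str]:
    """Try direct extraction first; OCR and re-extract only if needed.

    extract_text: returns one string per page. In practice this might wrap
        PyMuPDF, e.g. [page.get_text() for page in fitz.open(path)].
    run_ocr: writes an OCR'd copy and returns its path. In practice this
        might invoke the ocrmypdf command-line tool.
    Both callables are hypothetical hooks for this sketch.
    """
    pages = extract_text(path)
    if needs_ocr(pages, min_chars_per_page):
        ocr_path = run_ocr(path)
        pages = extract_text(ocr_path)
    return pages
```

Injecting the two callables keeps the fallback logic easy to test with fakes and leaves the choice of parsing library open.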
Smart Deduplication & Resume
Combines hash-based exact matching with MinHash for near-duplicate detection. A SQLite cache and manifests support resumable processing at scale.
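The two deduplication layers can be sketched with the standard library alone: a SHA-256 digest catches byte-identical files, and a MinHash signature over word shingles estimates Jaccard similarity for near-duplicates. The shingle size, the number of hash functions, and all function names here are assumptions for illustration. The project may well use a dedicated MinHash library instead.

```python
import hashlib
import re
from typing import List, Set

NUM_HASHES = 64  # assumed signature length; more hashes give a tighter estimate


def exact_fingerprint(data: bytes) -> str:
    """Digest of the raw bytes; equal digests mean byte-identical content."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of overlapping k-word windows from the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed: int, token: str) -> int:
    """One member of a hash family, selected by using the seed as a salt."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str, num_hashes: int = NUM_HASHES) -> List[int]:
    """For each hash function, keep the minimum hash over all shingles."""
    tokens = shingles(text)
    if not tokens:
        return [0] * num_hashes  # empty texts all share one signature
    return [min(_seeded_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions approximates the Jaccard similarity."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A pipeline along these lines would check the exact fingerprint first and compute signatures only for files that pass it. Pairs whose estimated similarity exceeds a chosen threshold would then be flagged as near-duplicates.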
Topic-Focused Summaries & Indices
Produces per-paper summaries that emphasize your topic, Markdown summaries per category, and JSONL and CSV indices for analysis.
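Writing the two index formats might look like the sketch below. The field names in `FIELDS` and the output file names are hypothetical, since the project's actual schema is not shown here.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical column set; the real index schema may differ.
FIELDS = ["title", "category", "score", "summary"]


def write_indices(records: List[Dict], out_dir: Path) -> None:
    """Write the same records as JSONL (one object per line) and as CSV."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # JSONL keeps every key of each record, including nested values.
    with open(out_dir / "index.jsonl", "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

    # CSV is flat: only the columns in FIELDS are written, extra keys are
    # ignored, and missing keys become empty cells.
    with open(out_dir / "index.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

JSONL suits appending and streaming record by record, while the CSV opens directly in spreadsheet tools.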
Screenshots
Dynamic Taxonomy
LLM Metadata Extraction
Multi-Category Scoring
Smart Deduplication
Topic-Focused Summaries
CSV/JSONL Indices & Manifests