Back to Portfolio

Research Assistant - LLM Research Pipeline

AI & Research Tools View on GitHub View on PyPI

An intelligent, end-to-end pipeline for processing research PDFs using LLMs (Ollama or Gemini) with dynamic category generation, accurate PDF parsing with OCR fallback, LLM-based metadata extraction, multi-category scoring, deduplication, and topic-focused summarization.

Problem & Solution

The Problem

Organizing research PDFs is labor-intensive: extracting metadata, categorizing across multiple themes, filtering by topic relevance, detecting duplicates, and creating summaries. Manual workflows are slow, inconsistent, and hard to reproduce.

The Solution

Research Assistant automates this pipeline with LLMs. It generates a dynamic taxonomy from your topic, parses PDFs with OCR fallback, extracts rich metadata, scores papers across all categories, moves each to the best-fit folder, removes duplicates, and produces topic-focused summaries with CSV/JSONL indices for downstream analysis.

Technologies Used

Core & Parsing

Python 3.12+ PyMuPDF ocrmypdf Tesseract

LLM & Embeddings

Ollama Google Gemini nomic-embed-text

Indexing & Quality

SQLite Cache MinHash LSH pytest

Key Achievements

Dynamic LLM-Driven Taxonomy

Generates categories from your topic (no hardcoding) and scores each paper across all categories simultaneously to choose the best placement.

Accurate Parsing with OCR Fallback

Uses PyMuPDF for born-digital PDFs and seamlessly falls back to OCR (ocrmypdf + Tesseract) for scanned documents, ensuring high-quality text extraction.

Smart Deduplication & Resume

Combines hash-based exact matching with MinHash for near-duplicate detection. A SQLite cache and manifests support resumable processing at scale.

Topic-Focused Summaries & Indices

Produces per-paper summaries that emphasize your topic, plus JSONL and CSV indices for analysis, and Markdown summaries per category.

Screenshots

Taxonomy Generation

Dynamic Taxonomy

Metadata Extraction

LLM Metadata Extraction

Category Scoring

Multi-Category Scoring

Deduplication

Smart Deduplication

Summarization

Topic-Focused Summaries

Outputs

CSV/JSONL Indices & Manifests