Client Background
- Client Name: A Research Institution
- Industry Type: Academic Research / Medical Journals
- Products & Services: Research publications, academic data management
- Organization Size: 100+ contributors and researchers
- About Client: Prof. Nachiket leads academic initiatives requiring structured data extraction from journal archives to streamline research communication and collaboration.
Problem Statement
The client needed to extract corresponding author email addresses from all papers in IJOGR volumes 10, 11, and 12.
- Manual extraction was slow, repetitive, and error-prone.
- Each paper contained only one corresponding-author email, embedded inside the PDF.
- Large data volume across multiple issues made manual work impractical.
Our Solution
We designed and implemented the IJOGR Correspondence Extractor, a Python-based automated pipeline that:
- Crawls IJOGR archive pages for volumes 10–12.
- Collects article PDF links.
- Extracts corresponding author names and emails using text parsing and regex.
- Falls back to OCR for scanned PDFs.
- Outputs a clean CSV with Name, Email, PDF URL for each paper.
Solution Architecture
- Web crawler – Navigates archive → volume → issue → article → PDF.
- PDF parser – Extracts text using pdfminer.six.
- Regex engine – Identifies email addresses.
- Name capture logic – Anchors on “Corresponding Author” text.
- OCR fallback – Uses Tesseract for image-based PDFs.
- Output pipeline – Stores results in CSV for reproducibility and auditability.
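The OCR fallback decision can be sketched as below. The extractor callables stand in for pdfminer.six's text extraction and a Tesseract-based OCR pass; the length threshold is an assumed heuristic, not a value from the production script.

```python
from typing import Callable

MIN_TEXT_LEN = 50  # assumed threshold for "usable text layer"; tune per corpus

def extract_with_fallback(
    pdf_path: str,
    text_extractor: Callable[[str], str],
    ocr_extractor: Callable[[str], str],
) -> tuple[str, str]:
    """Return (text, method); method is 'text-layer' or 'ocr'."""
    text = text_extractor(pdf_path) or ""
    if len(text.strip()) >= MIN_TEXT_LEN:
        return text, "text-layer"
    # Scanned PDF with no usable text layer: fall back to OCR.
    return ocr_extractor(pdf_path), "ocr"
```

Injecting the extractors as callables keeps the decision logic testable without a real PDF or a local Tesseract install.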
Workflow Overview
- Crawler fetches PDFs.
- Parser extracts text.
- Regex Engine finds email addresses.
- Name Capture Logic associates names with emails.
- OCR Fallback ensures coverage for image-based PDFs.
- Output Pipeline logs results into CSV for reproducibility.
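The first workflow step, collecting article PDF links from an archive page, can be sketched with the standard-library parser; the production crawler uses BeautifulSoup, but the logic is the same. The base URL below is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkCollector(HTMLParser):
    """Walks anchor tags and keeps hrefs that point at PDFs."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the archive page URL.
            self.pdf_links.append(urljoin(self.base_url, href))

def collect_pdf_links(html: str, base_url: str) -> list[str]:
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_links
```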
Deliverables
- Automated pipeline script (extractor.py).
- CSV file with Name, Email, PDF URL for all papers in volumes 10–12.
- Documentation for setup, execution, and validation.
- QA notes highlighting edge cases and verification steps.
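The CSV deliverable can be written with the standard `csv` module; the three column names match the deliverable above, and the row data is hypothetical.

```python
import csv

def write_results(rows: list[dict], path: str) -> None:
    """Write one row per paper with the delivered columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Email", "PDF URL"])
        writer.writeheader()
        writer.writerows(rows)
```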
Tech Stack
- Framework used: Python scripting with modular functions.
- Languages/techniques used: Python, BeautifulSoup, pdfminer.six, regex, OCR (Tesseract).
- Model used: None (rule-based extraction).
- Skills used: Web scraping, PDF parsing, regex design, OCR integration, reproducible pipeline design.
- Databases used: CSV flat-file output (no external DB required).
- Web/cloud servers used: Local execution; adaptable to cloud (AWS EC2 / Azure VM).
Technical Challenges Faced During Project Execution
- PDFs with inconsistent layouts.
- Some PDFs were scanned images (no text layer).
- Identifying the correct author name near the email.
- Avoiding duplicate entries across issues.
How the Technical Challenges Were Solved
- Implemented OCR fallback for scanned PDFs.
- Designed regex anchored on “Corresponding Author” context.
- Added deduplication logic for PDF URLs.
- Introduced throttling to avoid server overload.
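The deduplication and throttling guards can be sketched together. The delay value is an assumed politeness interval, not a value from the production script.

```python
import time

DELAY_SECONDS = 1.0  # assumed politeness interval between requests

def iter_unique_urls(urls, delay: float = DELAY_SECONDS):
    """Yield each PDF URL once, sleeping between yields to avoid server overload."""
    seen: set[str] = set()
    for url in urls:
        if url in seen:
            # Same article linked from multiple issue pages: skip the repeat.
            continue
        seen.add(url)
        yield url
        time.sleep(delay)
```

Deduplicating on the PDF URL (rather than on name or email) is the safer key, since distinct papers can share a corresponding author.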
Business Impact
- Reduced extraction time from 12+ hours of manual work to roughly 2–4 hours of automated processing.
- Delivered a reproducible, auditable dataset for academic collaboration.
- Enabled scalability for future volumes with minimal effort.
- Improved confidence in data quality and reduced human error.