Overview

The cpc_parser package is a Python library that extracts structured question data from CPC (Certified Professional Coder) practice test PDF files. It converts unstructured PDF content into validated, structured data models suitable for analysis, benchmarking, and machine learning applications.

Key Features

Package Structure

cpc_parser/
├── __init__.py          # Package exports and version info
├── schema.py            # Pydantic data models (Question, QuestionDataset)
└── parse_pdf.py         # Main parsing logic (CPCTestParser)

Core Components

1. Data Models (schema.py)

2. Parser (parse_pdf.py)

Processing Flow

flowchart TD
    A["CPC Test PDF"] --> B["Initialize CPCTestParser"]
    B --> C["Parse Questions<br/>(Pages 4-35)"]
    C --> D["Parse Answer Key<br/>(Answer Key Section)"]
    D --> E["Parse Explanations<br/>(Explanations Section)"]
    E --> F["Combine Data"]
    F --> G["Validate with Pydantic"]
    G --> H["QuestionDataset"]
    H --> I["Export JSONL"]
    H --> J["Generate Statistics"]
    
    style A fill:#e1f5fe
    style H fill:#e8f5e8
    style I fill:#fff3e0
    style J fill:#fff3e0
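
The combine, validate, export, and statistics steps in the flow above can be illustrated with a minimal sketch. It is not the package's actual implementation: it skips PDF parsing entirely, uses standard-library dataclasses as a stand-in for the Pydantic models in schema.py, and every field and function name in it (number, stem, options, correct_answer, explanation, combine, to_jsonl, answer_distribution) is an assumption made for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class Question:
    """Hypothetical stand-in for the Pydantic Question model (fields assumed)."""

    number: int
    stem: str
    options: dict
    correct_answer: str
    explanation: str = ""

    def __post_init__(self):
        # Hand-written checks in place of Pydantic validation.
        if self.number < 1:
            raise ValueError(f"invalid question number: {self.number}")
        if self.correct_answer not in self.options:
            raise ValueError(
                f"question {self.number}: answer {self.correct_answer!r} "
                "is not one of the options"
            )


def combine(questions, answer_key, explanations):
    """Merge the three parsed sections, keyed by question number."""
    combined = []
    for number, raw in sorted(questions.items()):
        if number not in answer_key:
            raise ValueError(f"question {number} has no answer-key entry")
        combined.append(
            Question(
                number=number,
                stem=raw["stem"],
                options=raw["options"],
                correct_answer=answer_key[number],
                # A missing explanation is tolerated and left empty.
                explanation=explanations.get(number, ""),
            )
        )
    return combined


def to_jsonl(items):
    """Serialize one JSON object per line (JSONL)."""
    return "\n".join(json.dumps(asdict(q), ensure_ascii=False) for q in items)


def answer_distribution(items):
    """Count how often each option letter is the correct answer."""
    return dict(Counter(q.correct_answer for q in items))


if __name__ == "__main__":
    # Placeholder data standing in for the parsed PDF sections.
    parsed_questions = {
        1: {"stem": "Placeholder stem 1", "options": {"A": "x", "B": "y"}},
        2: {"stem": "Placeholder stem 2", "options": {"A": "x", "B": "y"}},
    }
    parsed_answers = {1: "A", 2: "B"}
    parsed_explanations = {1: "Placeholder explanation"}

    dataset = combine(parsed_questions, parsed_answers, parsed_explanations)
    print(to_jsonl(dataset))
    print(answer_distribution(dataset))
```

Keying all three sections by question number keeps the merge independent of the order in which each section was parsed, and validating at construction time means a mismatched answer key fails before anything is exported.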

Quick Start

Basic Usage