Cookbook: Data Extraction

Turn unstructured text (PDFs, emails) into structured JSON data.

The Problem

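You have 1,000 invoices in PDF format and you need to load them into a SQL database. Regex is too brittle, because every vendor lays out its invoices differently and a pattern written for one layout misses the next. An LLM agent handles that variation well.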
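To make the brittleness concrete, here is a small sketch. The pattern and the second invoice layout are invented for illustration; the point is only that a regex tuned to one layout returns nothing for another layout carrying the same information.

```typescript
// A pattern tuned to one vendor's layout (illustrative, not taken from a real parser).
const totalPattern = /Total Due:\s*\$(\d+(?:\.\d{2})?)/

// Returns the total as a number, or null when the pattern does not match.
function extractTotal(text: string): number | null {
  const match = totalPattern.exec(text)
  return match ? Number(match[1]) : null
}

const vendorA = 'INVOICE #INV-2024-001\nTotal Due: $550'
// Hypothetical second layout with the same amount, worded differently.
const vendorB = 'Invoice no. 7781\nAmount payable (USD): 550.00'

console.log(extractTotal(vendorA)) // 550
console.log(extractTotal(vendorB)) // null
```

Each new layout needs another pattern, which is the maintenance burden the agent approach avoids.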

The Pattern: Schema-Enforced Tool

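The trick is to create a "No-Op" tool whose only real purpose is to define the output structure. The model is asked to call the tool with arguments that match the schema, so the tool-call arguments are your structured data.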

extract-invoice.ts
import { Agent, Tool } from '@akios/sdk'
import { z } from 'zod'

// 1. Define the Schema
const InvoiceSchema = z.object({
  invoice_number: z.string(),
  date: z.string().describe("ISO 8601 format"),
  vendor: z.string(),
  line_items: z.array(z.object({
    description: z.string(),
    amount: z.number(),
    quantity: z.number()
  })),
  total: z.number()
})

// 2. Create a "Save" tool
const saveInvoice = new Tool({
  name: 'save_invoice',
  description: 'Call this tool to save the extracted invoice data.',
  schema: InvoiceSchema,
  execute: async (data) => {
    // Save to DB
    console.log("Saving:", data)
    return "Success"
  }
})

// 3. The Agent
const extractor = new Agent({
  name: 'Extractor',
  model: 'gpt-4o', // Smart models work best for complex extraction
  systemPrompt: `You are a data entry clerk.
  Extract info from the text and save it using the tool.
  If a field is missing, use an empty string for text and 0 for numbers.`,
  tools: [saveInvoice]
})

// 4. Run
const rawText = `
  INVOICE #INV-2024-001
  Date: Jan 15, 2024
  From: Acme Corp
  
  Services:
  - Consulting: $500 (2 hrs)
  - Hosting: $50
  
  Total Due: $550
`

await extractor.run(rawText)
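The `execute` callback above only logs the data. As a minimal sketch of what the "Save to DB" step might build, the function below turns an extracted invoice into parameterized INSERT statements. The table names (`invoices`, `invoice_items`), the column names, and the PostgreSQL-style `$1` placeholders are all assumptions, and the sample object is an illustrative value rather than real model output. The statements are only constructed here; you would pass them to whatever database driver you use.

```typescript
// Shape of the data the save_invoice tool receives (mirrors InvoiceSchema).
interface LineItem {
  description: string
  amount: number
  quantity: number
}

interface Invoice {
  invoice_number: string
  date: string
  vendor: string
  line_items: LineItem[]
  total: number
}

interface Statement {
  sql: string
  params: (string | number)[]
}

// Builds one INSERT for the invoice header and one per line item.
// Table and column names are hypothetical; $n placeholders follow PostgreSQL style.
function toInsertStatements(invoice: Invoice): Statement[] {
  const header: Statement = {
    sql: 'INSERT INTO invoices (invoice_number, invoice_date, vendor, total) VALUES ($1, $2, $3, $4)',
    params: [invoice.invoice_number, invoice.date, invoice.vendor, invoice.total]
  }
  const items = invoice.line_items.map((item): Statement => ({
    sql: 'INSERT INTO invoice_items (invoice_number, description, amount, quantity) VALUES ($1, $2, $3, $4)',
    params: [invoice.invoice_number, item.description, item.amount, item.quantity]
  }))
  return [header, ...items]
}

// Illustrative extraction result for the sample invoice (not real model output).
const sample: Invoice = {
  invoice_number: 'INV-2024-001',
  date: '2024-01-15',
  vendor: 'Acme Corp',
  line_items: [
    { description: 'Consulting', amount: 500, quantity: 2 },
    { description: 'Hosting', amount: 50, quantity: 0 }
  ],
  total: 550
}

const statements = toInsertStatements(sample)
console.log(statements.length) // 3
```

Keeping values in `params` instead of concatenating them into the SQL string matters here, since the text comes from untrusted documents.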

Cost Optimization

For high-volume workloads, switch to a cheaper model such as `gpt-3.5-turbo` or `mistral-small` once you have verified that the prompt works reliably on a representative sample of your invoices.
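One way to check reliability before switching models is to compare the cheaper model's output against a handful of hand-checked invoices and measure field-level accuracy. The sketch below assumes you have already collected both sets as plain objects, so it does not depend on the SDK. The `fieldAccuracy` helper, the sample data, and the 0.95 threshold are all hypothetical.

```typescript
// A flat record of extracted header fields; line items are left out to keep the sketch short.
type Fields = Record<string, string | number>

// Fraction of expected fields that the candidate extraction reproduced exactly.
function fieldAccuracy(expected: Fields[], actual: Fields[]): number {
  let matched = 0
  let total = 0
  expected.forEach((reference, i) => {
    const candidate = actual[i] ?? {}
    for (const key of Object.keys(reference)) {
      total += 1
      if (candidate[key] === reference[key]) matched += 1
    }
  })
  return total === 0 ? 0 : matched / total
}

// Hand-checked reference values (illustrative data).
const expected: Fields[] = [
  { invoice_number: 'INV-2024-001', vendor: 'Acme Corp', total: 550 },
  { invoice_number: 'INV-2024-002', vendor: 'Globex', total: 1200 }
]

// Output from the cheaper model on the same invoices (illustrative data).
const cheapModelOutput: Fields[] = [
  { invoice_number: 'INV-2024-001', vendor: 'Acme Corp', total: 550 },
  { invoice_number: 'INV-2024-002', vendor: 'Globex Inc', total: 1200 }
]

const accuracy = fieldAccuracy(expected, cheapModelOutput)
// Hypothetical acceptance threshold; pick one that fits your tolerance for errors.
const readyToSwitch = accuracy >= 0.95

console.log(accuracy.toFixed(2)) // "0.83"
console.log(readyToSwitch) // false
```

Exact matching is strict on purpose: a vendor name that is off by one word still lands in the wrong row once it reaches the database.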