Mitigate Procurement

Document parsing

How your uploaded files get converted into AI-readable text.

When you upload a PDF or Word document, the AI can't just "see" it the way you do. It needs structured text, and parsing is the conversion step that produces it.

What parsing does

When an AI run starts, your documents are loaded into the run's sandbox and parsed there. The parsing step:

  1. Extracts text from the document format (PDF, Word, Excel, etc.)
  2. Preserves structure — headings, paragraphs, lists, tables
  3. Handles tables — converts them into a format the AI can interpret
  4. Reads scanned documents — uses OCR (optical character recognition) for image-based PDFs
  5. Processes EDOC archives — extracts all files from the archive and parses each one

The output is clean text the AI agents can search and read efficiently.
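To make the table step concrete, here is a minimal sketch of how a table could be flattened into text lines. The pipe-delimited layout and the function name table_to_text are assumptions chosen for illustration; the actual output format the parser uses is not documented here.

```python
def table_to_text(rows):
    """Render a table (a list of rows, each a list of cell values) as pipe-delimited lines.

    Hypothetical helper: the real parser's table format may differ.
    """
    lines = []
    for row in rows:
        # Empty cells become empty strings; line breaks inside a cell are
        # flattened so that each table row stays on one searchable line.
        cells = ["" if cell is None else str(cell).strip().replace("\n", " ") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Illustrative data only.
requirements = [
    ["Requirement", "Mandatory", "Weight"],
    ["ISO 9001 certification", "Yes", 20],
    ["On-site support", None, 10],
]
print(table_to_text(requirements))
```

Keeping each row on a single line means a keyword search for a requirement returns the whole row, including the neighbouring cells that give it context.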

What affects parsing quality

Document type matters. A Word document or a native PDF (created from software, not scanned) gives the cleanest results. The text is already digital — parsing just restructures it.

Scanned documents are trickier. The system has to "read" images of text, so the result depends on scan quality. A clear, straight, high-resolution scan works well. A faded, skewed, or low-resolution scan may produce misread characters or missing words.

Tables can be complex. Simple tables parse well. Complex ones with merged cells, nested tables, or unusual formatting may lose some structure. If a key requirement sits inside a complex table, check the parsed result to confirm it came through intact.

Formatting-heavy documents — lots of text boxes, watermarks, multi-column layouts, embedded images with text — can sometimes lose content during parsing. The simpler the layout, the more reliable the result.

The technology

Parsing happens inside the sandbox using open-source libraries: PyMuPDF for PDFs, python-docx for Word, openpyxl for Excel, and Tesseract for OCR on scanned documents. Because the work happens inside your run's isolated sandbox, your document content is not sent to a separate third-party parsing service.
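As a rough illustration of how these libraries fit together, the sketch below picks a parser by file extension. It is not the product's actual code. The function names, the PARSERS table, the use of the pytesseract wrapper and Pillow to call Tesseract, and the rule "OCR any page with no text layer" are all assumptions made for this example.

```python
from pathlib import Path


def ocr_page(page):
    """OCR one PDF page. Assumes the pytesseract wrapper and Pillow are installed."""
    import io
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=300)  # render the page to an image
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image)


def parse_pdf(path):
    import fitz  # PyMuPDF

    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if not text.strip():
                # Assumed rule: no text layer means a scanned page, so try OCR.
                text = ocr_page(page)
            parts.append(text)
    return "\n".join(parts)


def parse_docx(path):
    from docx import Document  # python-docx

    # Body paragraphs only; tables would need a separate pass over .tables.
    return "\n".join(p.text for p in Document(path).paragraphs)


def parse_xlsx(path):
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True, data_only=True)
    lines = []
    for ws in wb.worksheets:
        lines.append(f"# Sheet: {ws.title}")
        for row in ws.iter_rows(values_only=True):
            lines.append("\t".join("" if v is None else str(v) for v in row))
    wb.close()
    return "\n".join(lines)


# Hypothetical dispatch table: extension -> parser function.
PARSERS = {".pdf": parse_pdf, ".docx": parse_docx, ".xlsx": parse_xlsx}


def parse_document(path):
    suffix = Path(path).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"no parser registered for {suffix!r}") from None
    return parser(str(path))
```

The third-party imports live inside each function, so a run that only sees Word files never needs the PDF or OCR libraries to be loaded.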

When parsing fails

The most common culprits are password protection, file corruption, and very poor scan quality. See FAQ and troubleshooting for specific issues.

If a file repeatedly fails to parse, try converting the source file to a different format and re-uploading.
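If you want to rule out obvious file damage before re-uploading, a quick local check can help. The sketch below is a hypothetical helper, not part of the product. It relies on two general facts about file formats: a PDF carries a "%PDF-" marker near the start of the file, and .docx and .xlsx files are ZIP containers. A file that fails the check is likely corrupted, and for Office files possibly password-protected.

```python
import zipfile
from pathlib import Path


def quick_file_check(path):
    """Return a short verdict on whether a file looks structurally intact."""
    path = Path(path)
    if path.stat().st_size == 0:
        return "empty file"

    suffix = path.suffix.lower()
    if suffix == ".pdf":
        with path.open("rb") as f:
            head = f.read(1024)
        # PDF readers expect the header marker near the start of the file.
        return "ok" if b"%PDF-" in head else "missing PDF header"
    if suffix in {".docx", ".xlsx"}:
        # Modern Office files are ZIP archives of XML parts.
        return "ok" if zipfile.is_zipfile(str(path)) else "not a valid ZIP container"
    return "no check for this type"
```

A verdict of "ok" only means the container looks right. It says nothing about scan quality or layout complexity, which still affect the parsed result.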

How agents read your documents

When an analysis or composition starts, the system creates an isolated workspace (a "sandbox") for that run and uploads your original documents into it. Parsing runs inside the sandbox, and the AI agent then reads the files directly using normal file-system tools — the same way a person would on their laptop:

  • Open and read a specific file ("read the technical proposal")
  • Search by keyword with ripgrep ("find every mention of ISO 9001")
  • List files in a folder
  • Run a small shell command to count lines, extract a section, or convert formats

There is no precomputed index. Reading happens on demand. The agent decides what to look at based on what it's currently checking, the same way you'd skim and jump around when reviewing a document.
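To show what on-demand keyword search looks like, here is a small pure-Python stand-in for ripgrep. The agent itself uses ripgrep, not this code. The assumption that parsed output is stored as .txt files, and the function name keyword_search, are hypothetical. The result is roughly what a case-insensitive, fixed-string ripgrep search with line numbers would report.

```python
from pathlib import Path


def keyword_search(root, phrase):
    """Return (relative path, line number, line) for each line containing phrase.

    Matching is case-insensitive and literal (no regular expressions).
    Nothing is indexed ahead of time: every call re-reads the files.
    """
    needle = phrase.lower()
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Each hit carries the file and line number, which is the kind of detail an evidence quote needs in order to point back to its source.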

When keyword search isn't enough

For most questions, keyword search is the fastest way to find an answer — it returns specific lines from specific files in milliseconds. But sometimes you need to find passages by meaning, not exact wording.

Imagine the RFP asks for "healthcare experience". A vendor's proposal might describe:

  • "We delivered electronic health records to three hospitals"
  • "Clinical data management for St. Mary's Medical Center"
  • "Medical sector projects make up 60% of our portfolio"

None of those passages contain the exact phrase "healthcare experience", but they're all relevant. When keyword search comes back empty for a fuzzy concept, the agent falls back to semantic search — a smaller AI model reads the relevant files, ranks passages by meaning, and returns the best matches. This is slower and more expensive than ripgrep, so the agent uses it only as a fallback.
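The fallback order can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: find_passages and toy_semantic_rank are hypothetical names, and the toy ranker uses a hand-written list of related terms where the real system uses an AI model to rank passages by meaning.

```python
# Toy stand-in for a semantic model: a hand-written map of related terms.
RELATED_TERMS = {"healthcare": ("health", "hospital", "clinical", "medical")}


def toy_semantic_rank(query, passages):
    """Rank passages by how many related terms they contain (illustration only)."""
    terms = [t for word in query.lower().split() for t in RELATED_TERMS.get(word, ())]
    scored = [(sum(t in p.lower() for t in terms), p) for p in passages]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # stable: ties keep input order
    return [p for score, p in ranked if score > 0]


def find_passages(query, passages, semantic_rank, top_k=3):
    """Try cheap exact matching first; use the semantic ranker only if it finds nothing."""
    exact = [p for p in passages if query.lower() in p.lower()]
    if exact:
        return "keyword", exact
    return "semantic", semantic_rank(query, passages)[:top_k]


proposal = [
    "We delivered electronic health records to three hospitals",
    "Clinical data management for St. Mary's Medical Center",
    "Medical sector projects make up 60% of our portfolio",
    "Pricing is valid for 90 days",  # unrelated line added for contrast
]
mode, results = find_passages("healthcare experience", proposal, toy_semantic_rank)
print(mode)
for passage in results:
    print("-", passage)
```

Passing the ranker in as an argument keeps the expensive step swappable and makes the "keyword first, semantic second" order explicit.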

Why this matters for you

Two practical implications:

  • Parsing happens per run, inside the sandbox. Your originals stay in S3; each run gets a fresh sandbox, opens the originals there, and parses on demand. If parsing quality is poor on a specific file, re-upload a cleaner version — every future run will pick up the new version.
  • Agents read like people. They don't memorize your documents. They open files, search for what they need, and quote what they found. That's why every finding has a specific evidence quote — the agent is showing you exactly what it read.
