
Intelligent Document Parsing Tool Guide

Note: Before using the individual functions, we recommend reading the Request Workflow to understand the basic PDF processing flow. Each function accepts its own parameters when you upload a file; the other steps are the same for all functions.

Intelligent Document Parsing:

json
{
    "getImage": "both",
    "isAllowOcr": 1,
    "imageOutputType": "base64str"
}

Required Parameters:

getImage: Image extraction type. page returns a full-page image for each page; objects returns the image objects within each page; both returns full-page images and image objects.

isAllowOcr: Whether to use OCR (0: disable; 1: enable).

imageOutputType: Image storage type, either base64str or url.
base64str: Images are returned inline in the API result as base64 strings. This can make responses very large and is not recommended for long documents.
url: Images are returned as platform links, which you can download to local storage or upload to your own cloud storage.
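
The three options above are sent as one JSON string. As a minimal sketch, the helper below assembles that string and rejects values outside the documented sets. `ParsingParameter` and `toJson` are hypothetical names chosen for illustration, not part of any ComPDF SDK.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of any SDK): builds the JSON string for the
// "parameter" form field from the three documented options.
final class ParsingParameter {
    private static final List<String> GET_IMAGE = Arrays.asList("page", "objects", "both");
    private static final List<String> OUTPUT_TYPE = Arrays.asList("base64str", "url");

    private ParsingParameter() {}

    static String toJson(String getImage, boolean allowOcr, String imageOutputType) {
        if (!GET_IMAGE.contains(getImage)) {
            throw new IllegalArgumentException("getImage must be page, objects, or both");
        }
        if (!OUTPUT_TYPE.contains(imageOutputType)) {
            throw new IllegalArgumentException("imageOutputType must be base64str or url");
        }
        // The allowed values contain no characters that need JSON escaping,
        // so plain concatenation is safe here.
        return "{\"getImage\":\"" + getImage + "\","
             + "\"isAllowOcr\":" + (allowOcr ? 1 : 0) + ","
             + "\"imageOutputType\":\"" + imageOutputType + "\"}";
    }
}
```

For example, `ParsingParameter.toJson("objects", true, "url")` yields the same options as the string used in the Java example in this guide.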

Java Example:

Replace apiKey with the publicKey obtained from the console, file with the file you want to convert, and language with the language in which error messages should be returned.

java
import java.io.File;
import java.io.IOException;
import okhttp3.*;

public class Main {
  public static void main(String[] args) throws IOException {
    OkHttpClient client = new OkHttpClient().newBuilder()
      .build();
    // Multipart form: the file, the error-message language, the document
    // password (empty here), and the parsing options as a JSON string.
    RequestBody body = new MultipartBody.Builder().setType(MultipartBody.FORM)
      .addFormDataPart("file", "{{file}}",
        RequestBody.create(MediaType.parse("application/octet-stream"),
                           new File("<file>")))
      .addFormDataPart("language", "{{language}}")
      .addFormDataPart("password", "")
      .addFormDataPart("parameter", "{\"getImage\":\"objects\",\"isAllowOcr\":1,\"imageOutputType\":\"url\"}")
      .build();
    Request request = new Request.Builder()
      .url("https://api-server.compdf.com/server/v1/process/idp/documentParsing")
      .method("POST", body)
      .addHeader("x-api-key", "{{apiKey}}")
      .build();
    // Close the response when done and print the JSON result.
    try (Response response = client.newCall(request).execute()) {
      System.out.println(response.body().string());
    }
  }
}

Result:

| File Type | File Description |
| --- | --- |
| .json | JSON file containing the completed intelligent document parsing result |

Return Data Structure Explanation:

text
code (integer) Operation status code
message (string) Description message
version (string) Version number
duration (integer) Total processing time (in milliseconds)
x_request_id (string) Request ID
image_process (array) Whether there is a watermark
msg (string) Description message
result (object) Core data
  ├─ markdown (string) Markdown-formatted text of the entire document
  ├─ total_count (integer) Total number of pages in the PDF document
  ├─ total_page_number (integer) Total number of pages in the PDF document
  ├─ success_count (integer) Total number of successfully processed pages
  ├─ valid_page_number (integer) Number of successfully parsed valid pages
  ├─ excel_base64 (string) Excel file base64 encoding
  ├─ catalog (object) Table of contents tree structure
  │  └─ toc (array)
  │     ├─ pos (array): Coordinates of the four corners of the table-of-contents entry, in order: top-left, top-right, bottom-right, bottom-left.
  │     ├─ paragraph_id (integer): ID of the paragraph where the title is located
  │     ├─ page_id (integer): Page number where the title is located (minimum page number is 1)
  │     ├─ hierarchy (integer): Title level, 1 for level 1 title, 2 for level 2 title, and so on
  │     ├─ pos_list (array): When title merging occurs, the coordinates of multiple titles before merging. When no title merging occurs, the coordinates of the title.
  │     ├─ title (string): Title content
  │     └─ sub_type (string): Title type: text_title, image_title, table_title

  ├─ pages (array) Paginated data container
  │  ├─ status (string): Page processing status/error message
  │  ├─ page_id (number): Current page number
  │  ├─ durations (number): Page processing time (milliseconds)
  │  ├─ image_id (string): Image address
  │  ├─ width (integer): Document page width (pixels)
  │  ├─ height (integer): Document page height (pixels)
  │  ├─ angle (integer): Text orientation angle: 0° (upright), 90° (rotated right), 180° (inverted), 270° (rotated left)
  │  ├─ content (array): Basic data: text lines or images, refer to textline and image descriptions
  │  └─ structured (array): Structured data, one of textblock, table, imageblock, footer, header

  └─ detail (array) Markdown detailed information (structure reused "paragraph data" model)
     ├─ page_id (integer): Current paragraph page number
     ├─ paragraph_id (integer): Current paragraph ID
     ├─ outline_level (integer): Title level (up to 5 levels supported): -1 = body text, 0 = level 1 title, 1 = level 2 title, and so on
     ├─ text (string): Text
     ├─ type (string): Type, paragraph (paragraph type, including body text, titles, formulas, etc.), image (image type), table (table type)
     ├─ image_url (string): Image address
     ├─ content (integer): Content type: 0 = body content (paragraph, image, table), 1 = non-body content (header, footer, sidebar)
     ├─ position (array): Coordinates of the four corners of the element's area, in order: top-left, top-right, bottom-right, bottom-left.
     ├─ sub_type (string): Subtype. When type is paragraph, possible values are catalog (table of contents), header (page header), footer (page footer), sidebar (sidebar), text (body text), text_title (text title), image_title (image title), table_title (table title); when type is image, possible values are stamp (seal), chart (chart), qrcode (QR code), barcode (barcode); when type is table, possible values are bordered (bordered table), borderless (borderless table).
     ├─ tags (array): Indicates whether there are special texts within the paragraph, including formula and handwritten.
     ├─ cells (array): Cell array, returned only when type is table
     │  ├─ row_span (integer): Cell row span, default is 1
     │  ├─ text (integer):
     │  ├─ type (integer):
     │  ├─ col (integer): Cell column number
     │  ├─ col_span (integer): Cell column span, default is 1
     │  ├─ page_id (integer):
     │  ├─ position (array): Coordinates of the four corners of the cell, in order: top-left, top-right, bottom-right, bottom-left.
     │  └─ row (integer): Cell row number

     └─ caption_id (object): Original OCR text result
        ├─ page_id (integer): Page number where the title is located
        └─ paragraph_id (integer): Paragraph ID where the title is located

metrics (array) Page-level performance metrics
  ├─ page_image_width (integer): Current page rendering width (pixels)
  ├─ page_image_height (integer): Current page rendering height (pixels)
  ├─ dpi (integer): Image resolution
  ├─ durations (number): Page processing time (milliseconds)
  ├─ status (string): Page processing status
  ├─ page_id (number): Current page number
  ├─ angle (integer): Text orientation angle: 0° (upright), 90° (rotated right), 180° (inverted), 270° (rotated left)
  └─ image_id (string): Page image ID (download method same as pages.image_id)
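
The outline_level convention in detail (-1 for body text, 0 for a level 1 title, 1 for a level 2 title) can be used to rebuild a document outline. The sketch below assumes the JSON has already been parsed into a hypothetical `Detail` class that holds only page_id, outline_level, and text. JSON parsing is omitted because the JDK has no built-in JSON parser.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlineSketch {
    // Hypothetical in-memory model of one "detail" entry (only the fields used here).
    static final class Detail {
        final int pageId;
        final int outlineLevel;
        final String text;

        Detail(int pageId, int outlineLevel, String text) {
            this.pageId = pageId;
            this.outlineLevel = outlineLevel;
            this.text = text;
        }
    }

    private OutlineSketch() {}

    // Returns one line per title, indented two spaces per outline level.
    static List<String> outline(List<Detail> details) {
        List<String> lines = new ArrayList<>();
        for (Detail d : details) {
            if (d.outlineLevel < 0) {
                continue; // -1 marks body text, which is not part of the outline
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < d.outlineLevel; i++) {
                sb.append("  ");
            }
            sb.append(d.text).append(" (p. ").append(d.pageId).append(")");
            lines.add(sb.toString());
        }
        return lines;
    }
}
```

Level 1 titles appear flush left and each deeper level is indented by two more spaces, so the result can be printed directly as a plain-text table of contents.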

Structured Data Specification:

Content (Text Line/Image)

Image Data

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Data ID |
| type | string | Data type (fixed value: image) |
| pos | array | Image four-corner coordinates, in the format [top-left (x,y), top-right (x,y), bottom-right (x,y), bottom-left (x,y)] |
| size | array | Image dimensions [width, height] |
| data | object | Image content object |
| ↳ data.region | array | Image region coordinates on the page |
| ↳ data.path | string | Image file path |
| ↳ data.base64 | string | Image file (jpg/png) as a base64 string |
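
Because pos lists four corners rather than a rectangle, rotated or skewed regions are possible, and an axis-aligned bounding box is often more convenient for cropping. The sketch below assumes pos has been parsed into a flat array of eight numbers (x1, y1, ..., x4, y4) in the corner order given above. That flat layout is an assumption, so adapt the indexing if the API returns nested pairs. `QuadBounds` is a hypothetical helper name.

```java
final class QuadBounds {
    private QuadBounds() {}

    // Returns {minX, minY, maxX, maxY} for four corners given as
    // x1, y1, x2, y2, x3, y3, x4, y4 (assumed flat layout).
    static double[] of(double[] pos) {
        if (pos == null || pos.length != 8) {
            throw new IllegalArgumentException("expected 8 numbers: four (x, y) corners");
        }
        double minX = pos[0], maxX = pos[0];
        double minY = pos[1], maxY = pos[1];
        for (int i = 2; i < 8; i += 2) {
            minX = Math.min(minX, pos[i]);
            maxX = Math.max(maxX, pos[i]);
            minY = Math.min(minY, pos[i + 1]);
            maxY = Math.max(maxY, pos[i + 1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }
}
```

The width and height of the box are then `maxX - minX` and `maxY - minY`.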

Textline Data

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Data ID (unique within the page) |
| type | string | Data type (fixed value: line) |
| text | string | Text line content (when sub_type=stamp, this is the seal text) |
| pos | array | Text line four-corner coordinates |
| score | number | Character confidence (generated only when OCR is performed on the input image) |
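
One possible use of score is to flag OCR lines for manual review. The sketch below assumes that a higher score means higher confidence and that the threshold is chosen by the caller; neither is stated in this guide. `Line` is a hypothetical model of a textline entry, with a null score standing for lines where no OCR was performed.

```java
import java.util.ArrayList;
import java.util.List;

final class LowConfidenceLines {
    // Hypothetical model of a textline entry; score is null when no OCR was run.
    static final class Line {
        final int id;
        final String text;
        final Double score;

        Line(int id, String text, Double score) {
            this.id = id;
            this.text = text;
            this.score = score;
        }
    }

    private LowConfidenceLines() {}

    // Returns the IDs of lines whose score is present and below the threshold.
    static List<Integer> idsBelow(List<Line> lines, double threshold) {
        List<Integer> ids = new ArrayList<>();
        for (Line line : lines) {
            if (line.score != null && line.score < threshold) {
                ids.add(line.id);
            }
        }
        return ids;
    }
}
```

Lines without a score are never flagged, since the absence of a score says nothing about text quality.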

Structured Data

Textblock

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Data ID |
| type | string | Block type (fixed value: textblock) |
| pos | array | Text block four-corner coordinates |
| content | array | Array of the IDs of the contained text lines |
| sub_type | string | Subtype (title/list/formula, etc.) |
| text | string | Block text content |
| outline_level | integer | Title level: -1 = body text, 0 = level 1 title, 1 = level 2 title, and so on (up to five levels supported) |

Table Data

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Data ID |
| type | string | Block type (fixed value: table) |
| sub_type | string | Table type (default value: bordered; borderless tables need special marking) |
| pos | array | Table four-corner coordinates |
| rows | integer | Total number of rows |
| cols | integer | Total number of columns |
| columns_width | array | Column width array |
| rows_height | array | Row height array |
| text | string | Table content (HTML/Markdown format) |

Imageblock

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Data ID |
| type | string | Block type (fixed value: image) |
| pos | array | Image block four-corner coordinates |
| text | string | Image annotation text (HTML/Markdown format) |
| image_url | string | Image file path |
| base64str | string | Image as a base64-encoded string |
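
When imageOutputType is base64str, the image arrives as text and has to be decoded before it can be saved or displayed. The sketch below uses the JDK's `java.util.Base64` and assumes the field holds plain standard base64 with no `data:` URI prefix and no line breaks, which this guide does not confirm. `Base64Image` is a hypothetical helper name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

final class Base64Image {
    private Base64Image() {}

    // Decodes a standard base64 string into raw image bytes.
    // Throws IllegalArgumentException if the input is not valid base64.
    static byte[] decode(String base64str) {
        return Base64.getDecoder().decode(base64str.trim());
    }

    // Decodes the string and writes the bytes to the given file,
    // creating or overwriting it. Returns the path that was written.
    static Path save(String base64str, Path target) throws IOException {
        return Files.write(target, decode(base64str));
    }
}
```

A call such as `Base64Image.save(block.base64str, Paths.get("image.png"))` would write the image to disk, where `block` stands for whatever object your JSON parser produces for an image block.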

Footer Block

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Block type (fixed value: footer) |
| pos | array | Block four-corner coordinates |
| blocks | array | Content block array (can contain textblock/imageblock/table) |

Header Block

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Block type (fixed value: header) |
| pos | array | Block four-corner coordinates |
| image_url | string | Header image path |
| base64str | string | Header image as a base64-encoded string |
| blocks | array | Content block array (can contain textblock/imageblock/table) |