Extract API

Odoo provides a service that automates the processing of invoices, bank statements, expenses, and resumes.

The service scans documents with an OCR engine, then uses AI-based algorithms to extract the fields of interest: the total, due date, and invoice lines for invoices; the initial and final balances and the date for bank statements; the total and date for expenses; and the name, email, and phone number for resumes.

This is a paid service. Processing a document costs one credit from your document digitization IAP account. More information about IAP accounts can be found here.

You can either use this service directly in the Accounting, Expense, or Recruitment App or through the API. The Extract API, which is detailed in the next section, allows you to integrate our service directly into your own projects.

Overview

The Extract API uses the JSON-RPC2 protocol; its endpoint routes are located at https://extract.api.odoo.com.

Version

The version of the Extract API is specified in the route.

The latest versions are:
  • invoices: 123

  • bank statements: 100

  • expenses: 132

  • applicant: 102

Flow

The flow is the same for each document type.

  1. Call /parse to submit your documents (one call for each document). On success, you receive a document_token in the response.
  2. You then have to regularly poll /get_result to get the document's parsing status.
    Alternatively, you can provide a webhook_url at the time of the call to /parse and you will be notified (via a POST request) when the result is ready.

All of these routes must be called with the HTTP POST method. A Python implementation of the full flow for invoices can be found here and a token for integration testing is provided in the integration testing section.
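The flow above can be sketched as a submit-then-poll loop. This is a minimal sketch, not the reference implementation mentioned above: `post` is a hypothetical callable that sends a JSON-RPC request to the `parse` or `get_result` route and returns the decoded `result` dictionary, and the polling interval and attempt limit are arbitrary. Treating any status that is neither `success` nor an `error_*` code as "not ready yet" is also an assumption.

```python
import time


def submit_and_wait(post, parse_params, interval=5.0, max_attempts=20, sleep=time.sleep):
    """Submit one document, then poll until its result is ready.

    `post(action, params)` is a hypothetical transport callable: it must send
    `params` to the "parse" or "get_result" route and return the `result`
    dictionary of the JSON-RPC response.
    """
    parsed = post("parse", parse_params)
    if parsed["status"] != "success":
        raise RuntimeError(f"parse failed: {parsed['status']}")

    poll_params = {
        "version": parse_params["version"],
        "document_token": parsed["document_token"],
        "account_token": parse_params["account_token"],
    }
    for _ in range(max_attempts):
        result = post("get_result", poll_params)
        status = result["status"]
        if status == "success":
            return result["results"]
        if status.startswith("error_"):
            raise RuntimeError(f"get_result failed: {status}")
        # Assumption: any other status means the document is still being processed.
        sleep(interval)
    raise TimeoutError("result not ready after polling")
```

Injecting `post` and `sleep` keeps the loop independent of any HTTP library and makes it easy to exercise without network access.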

Parse

Request the digitization of a document. The route will return a document_token that you can use to fetch the result of your request.

Routes

  • /api/extract/invoice/2/parse

  • /api/extract/bank_statement/1/parse

  • /api/extract/expense/2/parse

  • /api/extract/applicant/2/parse
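As an illustration, the routes can be assembled from the endpoint and the document type. This is a small sketch: the helper name `route_url` and the `ROUTE_VERSIONS` table are hypothetical, and the version numbers are simply copied from the routes listed in this document.

```python
BASE_URL = "https://extract.api.odoo.com"

# Version segment of each route, copied from the routes listed in this document.
ROUTE_VERSIONS = {
    "invoice": 2,
    "bank_statement": 1,
    "expense": 2,
    "applicant": 2,
}


def route_url(document_type, action="parse"):
    """Return the full URL of the `parse` or `get_result` route for a document type."""
    if document_type not in ROUTE_VERSIONS:
        raise ValueError(f"unknown document type: {document_type}")
    if action not in ("parse", "get_result"):
        raise ValueError(f"unknown action: {action}")
    return f"{BASE_URL}/api/extract/{document_type}/{ROUTE_VERSIONS[document_type]}/{action}"
```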

Request

jsonrpc (required)

See JSON-RPC2

method (required)

See JSON-RPC2

id (required)

See JSON-RPC2

params
account_token (required)

The token of the IAP account from which credits will be charged. Each successful call costs one credit.

version (required)

The version will determine the format of your requests and the format of the server response. You should use the latest version available.

documents (required)

The document must be provided as a Base64 string in the ASCII encoding. The list should contain only one document. This field is a list only for legacy reasons. The supported formats are pdf, png and jpg.

dbuuid (optional)

Unique identifier of the Odoo database.

webhook_url (optional)

A webhook URL can be provided. An empty POST request will be sent to webhook_url/document_token when the result is ready.

user_infos (optional)

Information concerning the person sending the document to the extract service. It can be the client or the supplier (depending on the perspective). This information is not required in order for the service to work but it greatly improves the quality of the result.

user_company_vat (optional)

VAT number of the user.

user_company_name (optional)

Name of the user's company.

user_company_country_code (optional)

Country code of the user. Format: ISO3166 alpha-2.

user_lang (optional)

The user language. Format: language_code + _ + locale (e.g. fr_FR, en_US).

user_email (optional)

The user email.

purchase_order_regex (optional)

Regex for purchase order identification. Will default to Odoo PO format if not provided.

perspective (optional)

Can be client or supplier. This field is useful for invoices only. client means that the user information provided relates to the client of the invoice. supplier means that it relates to the supplier. If not provided, client will be used.

{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "account_token": string,
        "version": int,
        "documents": [string],
        "dbuuid": string,
        "webhook_url": string,
        "user_infos": {
            "user_company_vat": string,
            "user_company_name": string,
            "user_company_country_code": string,
            "user_lang": string,
            "user_email": string,
            "purchase_order_regex": string,
            "perspective": string,
        },
    },
    "id": string,
}

Note

The user_infos parameter is optional but it greatly improves the quality of the result, especially for invoices. The more information you can provide, the better.
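The request body described above could be built as follows. This is a hedged sketch: the function name `build_parse_payload` is hypothetical, and using a random UUID for the JSON-RPC `id` is an assumption (any string identifier fits the schema shown above).

```python
import base64
import uuid


def build_parse_payload(document_bytes, account_token, version,
                        user_infos=None, webhook_url=None, dbuuid=None):
    """Build the JSON-RPC body for a /parse call with a single document."""
    params = {
        "account_token": account_token,
        "version": version,
        # Base64 string in ASCII; the list holds exactly one document.
        "documents": [base64.b64encode(document_bytes).decode("ascii")],
    }
    # Optional parameters are only included when provided.
    if dbuuid is not None:
        params["dbuuid"] = dbuuid
    if webhook_url is not None:
        params["webhook_url"] = webhook_url
    if user_infos is not None:
        params["user_infos"] = dict(user_infos)
    return {
        "jsonrpc": "2.0",
        "method": "call",
        "params": params,
        "id": str(uuid.uuid4()),  # assumption: any unique string works as id
    }
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of the POST request.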

Response

jsonrpc

See JSON-RPC2

id

See JSON-RPC2

result
status

The code indicating the status of the request. See the table below.

status_msg

A string giving verbose details about the request status.

document_token

Only present if the request is successful.

status                      status_msg
success                     Success
error_unsupported_version   Unsupported version
error_internal              An error occurred
error_no_credit             You don't have enough credit
error_unsupported_format    Unsupported file format
error_maintenance           Server is currently under maintenance, please try again later

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "status": string,
        "status_msg": string,
        "document_token": string,
    }
}

Note

The API does not actually use the JSON-RPC error scheme. Instead the API has its own error scheme bundled inside a successful JSON-RPC result.
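Because errors arrive inside a successful JSON-RPC result, a client has to inspect `status` itself. The sketch below shows one way to do that for a /parse response; the `ExtractError` class and the `unwrap_parse_response` helper are hypothetical names, not part of the API.

```python
class ExtractError(Exception):
    """Raised when the `status` field of a result reports an error."""

    def __init__(self, status, status_msg):
        super().__init__(f"{status}: {status_msg}")
        self.status = status
        self.status_msg = status_msg


def unwrap_parse_response(response):
    """Return the document_token of a /parse response, or raise ExtractError."""
    result = response["result"]
    if result["status"] != "success":
        raise ExtractError(result["status"], result.get("status_msg", ""))
    return result["document_token"]
```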

Get results

Routes

  • /api/extract/invoice/2/get_result

  • /api/extract/bank_statement/1/get_result

  • /api/extract/expense/2/get_result

  • /api/extract/applicant/2/get_result

Request

jsonrpc (required)

See JSON-RPC2

method (required)

See JSON-RPC2

id (required)

See JSON-RPC2

params
version (required)

The version should match the version passed to the /parse request.

document_token (required)

The document_token for which you want to get the current parsing status.

account_token (required)

The token of the IAP account that was used to submit the document.

{
    "jsonrpc": "2.0",
    "method": "call",
    "params": {
        "version": int,
        "document_token": string,
        "account_token": string,
    },
    "id": string,
}
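A matching request body for /get_result might be built like this. The sketch assumes the `document_token` is passed back exactly as the string returned by /parse; the helper name and the UUID-based `id` are illustrative choices.

```python
import uuid


def build_get_result_payload(document_token, account_token, version):
    """Build the JSON-RPC body for a /get_result call."""
    return {
        "jsonrpc": "2.0",
        "method": "call",
        "params": {
            # Should match the version passed to the /parse request.
            "version": version,
            "document_token": document_token,
            "account_token": account_token,
        },
        "id": str(uuid.uuid4()),  # assumption: any unique string works as id
    }
```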

Response

The fields detected by the parse vary considerably with the type of document. Each response contains a list of dictionaries, one per document. In each dictionary, the keys are the field names and the values are the extracted field values.

jsonrpc

See JSON-RPC2

id

See JSON-RPC2

result
status

The code indicating the status of the request. See the table below.

status_msg

A string giving verbose details about the request status.

results

Only present if the request is successful.

full_text_annotation

Contains the unprocessed full result from the OCR for the document.

status                           status_msg
success                          Success
error_unsupported_version        Unsupported version
error_internal                   An error occurred
error_maintenance                Server is currently under maintenance, please try again later
error_document_not_found         The document could not be found
error_unsupported_size           The document has been rejected because it is too small
error_no_page_count              Unable to get the page count of the PDF file
error_pdf_conversion_to_images   Couldn't convert the PDF to images
error_password_protected         The PDF file is protected by a password
error_too_many_pages             The document contains too many pages

{
    "jsonrpc": "2.0",
    "id": string,
    "result": {
        "status": string,
        "status_msg": string,
        "results": [
            {
                "full_text_annotation": string,
                "feature_1_name": feature_1_result,
                "feature_2_name": feature_2_result,
                ...
            },
            ...
        ]
    }
}
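Given a `result` shaped like the sketch above, the raw OCR text can be separated from the extracted features. This is an illustrative helper (the name `split_document_result` is hypothetical) that reads only the first entry of `results`, since each /parse call submits a single document.

```python
def split_document_result(result):
    """Return (full_text_annotation, features) for the first document in `result`.

    `result` is the `result` dictionary of a /get_result response. When
    `results` is absent or empty, an empty text and an empty dict are returned.
    """
    documents = result.get("results") or []
    if not documents:
        return "", {}
    first = documents[0]
    features = {name: value for name, value in first.items()
                if name != "full_text_annotation"}
    return first.get("full_text_annotation", ""), features
```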

Common fields

feature_result

Each field of interest we want to extract from the document, such as the total or the due date, is called a feature. An exhaustive list of the features extracted for each type of document can be found in the sections below.

For each feature, we return a list of candidates and we spotlight the candidate our model predicts to be the best fit for the feature.

selected_value (optional)

The best candidate for this feature.

selected_values (optional)

The best candidates for this feature.

candidates (optional)

List of all the candidates for this feature ordered by decreasing confidence score.

"feature_name": {
    "selected_value": candidate_12,
    "candidates": [candidate_12, candidate_3, candidate_4, ...]
}
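Since a feature may carry either `selected_value` or `selected_values`, client code can normalize both into a list. The helper below is a sketch under the assumption that each selected entry is a candidate dictionary with a `content` key, as in the candidate structure this document describes; the function name is hypothetical.

```python
def selected_contents(feature):
    """Return the content of the selected candidate(s) of a feature as a list."""
    if "selected_values" in feature:
        return [candidate["content"] for candidate in feature["selected_values"]]
    if "selected_value" in feature:
        return [feature["selected_value"]["content"]]
    # Both keys are optional, so a feature may have no selection at all.
    return []
```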

candidate

For each candidate we give its representation and position in the document. Candidates are sorted by decreasing order of suitability.

content

Representation of the candidate.

coords

[center_x, center_y, width, height, rotation_angle]. The position and dimensions are relative to the size of the page and are therefore between 0 and 1. The angle is a clockwise rotation measured in degrees.

page

Page of the original document on which the candidate is located (starts at 0).

"candidate": [
    {
        "content": string|float,
        "coords": [float, float, float, float, float],
        "page": int
    },
    ...
]
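To draw a candidate on a rendered page, the relative `coords` can be scaled to pixels. The sketch below ignores `rotation_angle` and returns the box of the unrotated rectangle; naming the edges left/top assumes the origin is the top-left corner of the page, which this document does not state. The function name and page dimensions are illustrative.

```python
def to_pixel_box(coords, page_width, page_height):
    """Convert [center_x, center_y, width, height, rotation_angle] to pixel edges.

    Returns (left, top, right, bottom). The rotation angle is ignored, so the
    result is only exact for candidates with a rotation of 0 degrees.
    """
    center_x, center_y, width, height, _rotation_angle = coords
    left = (center_x - width / 2) * page_width
    right = (center_x + width / 2) * page_width
    top = (center_y - height / 2) * page_height
    bottom = (center_y + height / 2) * page_height
    return left, top, right, bottom
```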

Invoices

Invoices are complex and can have a lot of different fields. The following table gives an exhaustive list of all the fields we can extract from an invoice.

Feature name

Specificities

SWIFT_code

content is a dictionary encoded as a string.

It contains information about the detected SWIFT code (or BIC).

Keys:

bic

The detected BIC (string).

name (optional)

The name of the bank (string).

country_code

The ISO3166 alpha-2 country code of the bank (string).

city (optional)

The city of the bank (string).

verified_bic

True if the BIC has been found in our database (bool).

name and city are present only if verified_bic is true.

iban

content is a string

aba

content is a string

VAT_Number

content is a string

Depending on the value of perspective in the user_infos, this will be the VAT number of the supplier or the client. If perspective is client, it'll be the supplier's VAT number. If it's supplier, it's the client's VAT number.

qr-bill

content is a string

payment_ref

content is a string

purchase_order

content is a string

Uses selected_values instead of selected_value

country

content is a string

currency

content is a string

date

content is a string

Format: YYYY-MM-DD

due_date

Same as for date

total_tax_amount

content is a float

invoice_id

content is a string

subtotal

content is a float

total

content is a float

supplier

content is a string

client

content is a string

email

content is a string

website

content is a string

invoice_lines feature

It is returned as a list of dictionaries where each dictionary represents an invoice line.

"invoice_lines": [
    {
        "description": string,
        "quantity": float,
        "subtotal": float,
        "total": float,
        "taxes": list[float],
        "unit_price": float
    },
    ...
]
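One possible sanity check on extracted invoice lines is to compare their summed subtotals with the invoice-level subtotal. This sketch assumes each line is a plain dictionary of floats, exactly as in the structure shown above; the function name and the 0.01 tolerance are arbitrary choices.

```python
import math


def lines_match_subtotal(invoice_lines, subtotal, tolerance=0.01):
    """Return True when the line subtotals add up to the invoice subtotal."""
    # math.fsum limits floating-point accumulation error across many lines.
    lines_sum = math.fsum(line["subtotal"] for line in invoice_lines)
    return math.isclose(lines_sum, subtotal, abs_tol=tolerance)
```

A mismatch does not necessarily mean the extraction is wrong (a line may simply have been missed), so a check like this is best used to flag documents for manual review.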

Bank statements

The following table lists all the fields that are extracted from bank statements.

Feature name

Specificities

balance_start

content is a float

balance_end

content is a float

date

content is a string

bank_statement_lines feature

It is returned as a list of dictionaries where each dictionary represents a bank statement line.

"bank_statement_lines": [
    {
        "amount": float,
        "description": string,
        "date": string,
    },
    ...
]
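Extracted statement lines can be cross-checked against the two balances. The sketch below assumes that `amount` is signed (negative for debits) and that the lines are plain dictionaries as in the structure shown above; neither point is confirmed by this document, and the function name and tolerance are illustrative.

```python
import math


def statement_is_balanced(balance_start, balance_end, bank_statement_lines, tolerance=0.01):
    """Return True when balance_start plus all line amounts equals balance_end."""
    # Assumption: amounts are signed, so debits reduce the running balance.
    movement = math.fsum(line["amount"] for line in bank_statement_lines)
    return math.isclose(balance_start + movement, balance_end, abs_tol=tolerance)
```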

Expenses

Expenses are less complex than invoices. The following table gives an exhaustive list of all the fields we can extract from an expense report.

Feature name

Specificities

description

content is a string

country

content is a string

date

content is a string

total

content is a float

currency

content is a string

Applicant

This type of document is meant for processing resumes. The following table gives an exhaustive list of all the fields we can extract from a resume.

Feature name

Specificities

name

content is a string

email

content is a string

phone

content is a string

mobile

content is a string

Integration Testing

You can test your integration by using integration_token as account_token in the /parse request.

Using this token puts you in test mode and lets you simulate the entire flow without actually parsing a document and without being billed one credit for each successful document parsing.

The only technical differences in test mode are that the document you send is not parsed by the system and that the response you get from /get_result is hard-coded.
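In practice, switching to test mode only changes the `account_token` parameter. The sketch below builds the `params` of a /parse request for test mode; the helper name is hypothetical, and the document bytes can be any placeholder since the document is not parsed.

```python
import base64

# Token described above for integration testing.
INTEGRATION_TOKEN = "integration_token"


def build_test_parse_params(document_bytes, version):
    """Build /parse params that put the request in test mode."""
    return {
        "account_token": INTEGRATION_TOKEN,
        "version": version,
        "documents": [base64.b64encode(document_bytes).decode("ascii")],
    }
```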

A Python implementation of the full flow for invoices can be found here.