OCR

By API 4 AI | Updated one month ago | Visual Recognition

How to extract text from a PDF using OCR API


There are numerous reasons why you might need to extract text from PDF files. If you have a large number of PDF documents to process, using the OCR API for text extraction can automate the process, saving a significant amount of manual effort and time.

The extracted text can be seamlessly integrated with other applications or systems, such as databases or content management systems, facilitating data sharing and integration. This also makes it easy to search for specific information within the document, ultimately saving time and effort.

Moreover, it empowers data mining and analysis, enabling you to extract valuable insights, patterns, or trends from extensive volumes of textual data.

Regardless of the reason, the OCR API is here to help you solve your problem.

Python implementation

Install the requests package to send requests to the API: pip install requests.

Parse command-line arguments

The script accepts command-line arguments and parses them with argparse.
The --api-key argument is your API key from RapidAPI; the positional pdf argument is the path to the PDF file.

def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser()
    # Get your key at https://rapidapi.com/api4ai-api4ai-default/api/ocr43/pricing
    parser.add_argument('--api-key', help='RapidAPI key.', required=True)
    parser.add_argument('pdf', type=Path,
                        help='Path to a pdf.')
    return parser.parse_args()
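
To see how these arguments are parsed without touching the command line, the parser can be given an explicit argument list. The following is a minimal sketch: build_parser is a hypothetical variant of parse_args that returns the parser itself, and the key and file name are placeholder values.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: same arguments as parse_args(), but returns the parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--api-key', help='RapidAPI key.', required=True)
    parser.add_argument('pdf', type=Path, help='Path to a pdf.')
    return parser


# Placeholder values, for illustration only.
args = build_parser().parse_args(['--api-key', 'MY_KEY', 'invoice.pdf'])
print(args.api_key)  # MY_KEY
print(args.pdf)      # invoice.pdf
```

Note that argparse exposes --api-key as args.api_key (the dash becomes an underscore) and converts the pdf argument to a Path object.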

Parse PDF using OCR API

The OCR API extracts the text from the PDF document.
Note that when a PDF has multiple pages, each page is returned as a separate item in the results field of the response.
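
The one-result-per-page layout can be illustrated with a mock response. This is only a sketch: the dictionary below contains just the keys that the tutorial code reads, while the text values and the 'ok' status code are made-up assumptions, not real API output.

```python
def make_result(text: str) -> dict:
    """Hypothetical helper: build one per-page result with only the keys used here."""
    return {
        'status': {'code': 'ok'},  # assumed value; the tutorial only checks for 'failure'
        'entities': [{'objects': [{'entities': [{'text': text}]}]}],
    }


# Mock response for a two-page PDF (placeholder text).
mock_response = {'results': [make_result('First page text'),
                             make_result('Second page text')]}

# Each item in 'results' corresponds to one page of the PDF.
pages = [result['entities'][0]['objects'][0]['entities'][0]['text']
         for result in mock_response['results']]

for number, page_text in enumerate(pages, start=1):
    print(f'Page {number}: {page_text}')
```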

def parse_pdf(pdf_path: Path, api_key: str) -> list:
    """
    Extract text from a pdf.
    Returns list of strings, representing pdf pages.
    """
    # We strongly recommend you use exponential backoff.
    error_statuses = (408, 409, 429, 500, 502, 503, 504)
    s = requests.Session()
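    # Retry does not retry POST requests by default, so allow them explicitly
    # (the allowed_methods argument requires urllib3 1.26 or newer).
    retries = Retry(backoff_factor=1.5, status_forcelist=error_statuses,
                    allowed_methods=['POST'])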

    s.mount('https://', HTTPAdapter(max_retries=retries))

    url = f'{API_URL}/v1/results'
    with pdf_path.open('rb') as f:
        api_res = s.post(url, files={'image': f},
                         headers={'X-RapidAPI-Key': api_key}, timeout=20)
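
    # Handle processing failure. Check the HTTP status before parsing the
    # body (an error response may not be JSON), then check every page.
    failed = api_res.status_code != 200
    if not failed:
        api_res_json = api_res.json()
        failed = any(result['status']['code'] == 'failure'
                     for result in api_res_json['results'])
    if failed:
        print('PDF processing failed.', file=sys.stderr)
        sys.exit(1)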

    # Each page is a different result.
    pages = [result['entities'][0]['objects'][0]['entities'][0]['text']
             for result in api_res_json['results']]
    return pages
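
To give a feel for what exponential backoff means for the waiting time between retries, here is a generic sketch. It is not the exact schedule that the Retry class applies (that depends on the urllib3 version and its own cap); backoff_delays and max_delay are hypothetical names used only for illustration.

```python
def backoff_delays(factor: float, attempts: int, max_delay: float = 120.0) -> list:
    """Generic exponential backoff: factor * 2**n seconds, capped at max_delay."""
    return [min(factor * 2 ** n, max_delay) for n in range(attempts)]


# With the same factor as in the tutorial code, the delay doubles on every attempt.
print(backoff_delays(1.5, 5))  # [1.5, 3.0, 6.0, 12.0, 24.0]
```

Doubling the delay gives a temporarily overloaded service progressively more time to recover instead of hitting it again at a fixed rate.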

Main function

def main():
    """
    Script entry function.
    """
    args = parse_args()
    pages = parse_pdf(args.pdf, args.api_key)
    for i, page_text in enumerate(pages, start=1):
        print(f'Text on page {i}:\n{page_text}\n')


if __name__ == '__main__':
    main()
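
If printing to the console is not enough, the list returned by parse_pdf could also be written to disk, one file per page. A minimal sketch under that assumption: save_pages and the page_<n>.txt naming scheme are hypothetical, and placeholder strings stand in for a real API response.

```python
import tempfile
from pathlib import Path


def save_pages(pages: list, out_dir: Path) -> list:
    """Write each page to out_dir/page_<n>.txt and return the created paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page_text in enumerate(pages, start=1):
        path = out_dir / f'page_{number}.txt'
        path.write_text(page_text, encoding='utf-8')
        paths.append(path)
    return paths


# Usage with placeholder text instead of a real API response.
with tempfile.TemporaryDirectory() as tmp:
    created = save_pages(['First page text', 'Second page text'], Path(tmp))
    print([p.name for p in created])  # ['page_1.txt', 'page_2.txt']
```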

Python code

"""
Parse PDF using OCR API.

Run script:
`python3 main.py --api-key <RAPID_API_KEY> <PATH_TO_PDF>`
"""

import argparse
import sys
from pathlib import Path

import requests
from requests.adapters import Retry, HTTPAdapter


# Base URL only: the '/v1/results' path is appended in parse_pdf().
API_URL = 'https://ocr43.p.rapidapi.com'


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser()
    # Get your key at https://rapidapi.com/api4ai-api4ai-default/api/ocr43/pricing
    parser.add_argument('--api-key', help='RapidAPI key.', required=True)
    parser.add_argument('pdf', type=Path,
                        help='Path to a pdf.')
    return parser.parse_args()


def parse_pdf(pdf_path: Path, api_key: str) -> list:
    """
    Extract text from a pdf.
    Returns list of strings, representing pdf pages.
    """
    # We strongly recommend you use exponential backoff.
    error_statuses = (408, 409, 429, 500, 502, 503, 504)
    s = requests.Session()
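    # Retry does not retry POST requests by default, so allow them explicitly
    # (the allowed_methods argument requires urllib3 1.26 or newer).
    retries = Retry(backoff_factor=1.5, status_forcelist=error_statuses,
                    allowed_methods=['POST'])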

    s.mount('https://', HTTPAdapter(max_retries=retries))

    url = f'{API_URL}/v1/results'
    with pdf_path.open('rb') as f:
        api_res = s.post(url, files={'image': f},
                         headers={'X-RapidAPI-Key': api_key}, timeout=20)
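
    # Handle processing failure. Check the HTTP status before parsing the
    # body (an error response may not be JSON), then check every page.
    failed = api_res.status_code != 200
    if not failed:
        api_res_json = api_res.json()
        failed = any(result['status']['code'] == 'failure'
                     for result in api_res_json['results'])
    if failed:
        print('PDF processing failed.', file=sys.stderr)
        sys.exit(1)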

    # Each page is a different result.
    pages = [result['entities'][0]['objects'][0]['entities'][0]['text']
             for result in api_res_json['results']]
    return pages


def main():
    """
    Script entry function.
    """
    args = parse_args()
    pages = parse_pdf(args.pdf, args.api_key)
    for i, page_text in enumerate(pages, start=1):
        print(f'Text on page {i}:\n{page_text}\n')


if __name__ == '__main__':
    main()

Testing the script

Let’s test the script with a sample PDF file.
Run the script: python3 main.py --api-key YOUR_API_KEY path/to/pdf.
The script prints the text recognized on each page.

Conclusion

In summary, using the OCR API to extract text from PDF documents makes it much easier to manage and reuse the information these files contain.

The OCR API also accepts input in several formats: JPG/PNG images as well as PDF files. The same approach therefore covers a range of document types and business needs.