English Wikipedia Vector Database

PAID
By ryan | Updated a month ago | Artificial Intelligence/Machine Learning

Introduction

Retrieval-augmented generation (RAG) applies generative AI to real-world problems by grounding it in external knowledge, and Wikipedia is a key source of such information. However, integrating Wikipedia's content into AI applications is complex, requiring steps like downloading dumps, text processing, vector conversion, indexing, and setting up APIs. This process is resource-intensive, demanding considerable time, computational effort, and storage. For small-scale projects or initial experiments, a ready-made online service greatly simplifies development.

This was the inspiration behind the creation of our online API. Through this API, users can bypass the complexities and directly retrieve similar Wikipedia articles with just a single API call. The subsequent sections will detail various use cases for this service.

Prerequisites

To proceed, you will need two access tokens:

  • Firstly, an access token is necessary to utilize the HuggingFace Inference API. To learn how to generate a HuggingFace token, refer to this guide.
  • Secondly, you must create a RapidAPI account to obtain an access token for our vector database API. For instructions on setting up your account and token, consult the RapidAPI Quick Start Guide.
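Neither token should be hard-coded in scripts you share. One simple pattern, purely our own suggestion (the variable names below are not mandated by either service), is to read them from environment variables:

```python
import os


def get_token(name):
    """Read an access token from the environment, failing loudly if it is missing."""
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"Please set the {name} environment variable")
    return token


# Hypothetical variable names -- pick whatever suits your setup:
# hf_token = get_token("HF_TOKEN")
# rapidapi_key = get_token("RAPIDAPI_KEY")
```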

Input and Output

We try to keep things minimal. There is a single POST endpoint, /search. The input has the following format: v is the text embedding vector, and k specifies how many results to return (maximum 5).

{
    "v" : [ 0.3, 0.5, 0.2 ... 0.7 ],
    "k" : 3
}
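A quick client-side sanity check can catch malformed payloads before the network round-trip. The make_query helper below is our own illustration, not part of the API:

```python
def make_query(vector, k):
    """Build a /search request body, enforcing the documented k <= 5 limit."""
    if not 1 <= k <= 5:
        raise ValueError("k must be between 1 and 5")
    return {"v": [float(x) for x in vector], "k": k}
```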

The response is largely self-explanatory: d contains the list of returned results, while e contains error information if anything goes wrong.

{
  "d": [
    {
      "title": "List of past presumed highest mountains",
      "url": "https://en.wikipedia.org/?curid=10374104",
      "score": 0.75751114
    },
    {
      "title": "List of highest mountains on Earth",
      "url": "https://en.wikipedia.org/?curid=1821694",
      "score": 0.7503605
    },
    {
      "title": "Siguang Ri",
      "url": "https://en.wikipedia.org/?curid=42714303",
      "score": 0.70713806
    }
  ],
  "e": ""
}
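When consuming this response in code, it is worth checking e before touching d. The parse_results helper below is our own sketch of such a check:

```python
def parse_results(resp):
    """Return (title, url, score) tuples, raising if the API reported an error."""
    if resp.get("e"):
        raise RuntimeError(f"API error: {resp['e']}")
    return [(hit["title"], hit["url"], hit["score"]) for hit in resp.get("d", [])]
```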

Example 1: Remote Embedding then Search

The simplest approach requires only a Python runtime, without any other local software dependencies. Text embedding can be done through the HuggingFace Inference API, and the resulting vector is then used to query our vector database. We use the all-MiniLM-L6-v2 model for embedding, a small and fast sentence-transformer.

#!/usr/bin/env python3

import json
import requests


def embed(text):
    """Embed text remotely via the HuggingFace Inference API."""
    HFAPI = "https://api-inference.huggingface.co/pipeline/feature-extraction/sentence-transformers/all-MiniLM-L6-v2"
    headers = {"Authorization": "Bearer REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN_HERE"}
    payload = {"inputs": [text]}
    response = requests.post(HFAPI, headers=headers, json=payload)
    response.raise_for_status()  # fail early on auth or rate-limit errors
    return response.json()[0]


def main():
    api_url = "https://english-wikipedia-vector-database.p.rapidapi.com/search"
    query = {"v": embed("What is the highest mountain on earth?"), "k": 5}
    headers = {
        "X-RapidAPI-Key": "REPLACE_WITH_YOUR_RAPIDAPI_TOKEN_HERE",
        "X-RapidAPI-Host": "english-wikipedia-vector-database.p.rapidapi.com"
    }

    response = requests.post(api_url, json=query, headers=headers)
    resp = response.json()
    print(json.dumps(resp, indent=2))


if __name__ == '__main__':
    main()

Example 2: Local Embedding then Search

The embedding can also be computed locally, but you need to install a few Python libraries first:

python3 -m venv venv
source venv/bin/activate
pip install sentence-transformers

The code is almost identical, except that the embed function now uses a local model to convert text into vectors.

#!/usr/bin/env python3

import json
import requests
from sentence_transformers import SentenceTransformer


# Load the model once at import time so repeated calls to embed() reuse it.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


def embed(text):
    """Embed text locally with sentence-transformers."""
    embeddings = model.encode([text])
    return embeddings[0].tolist()


def main():
    api_url = "https://english-wikipedia-vector-database.p.rapidapi.com/search"
    query = {"v": embed("How many stars are there in the sky?"), "k": 5}
    headers = {
        "X-RapidAPI-Key": "REPLACE_WITH_YOUR_RAPIDAPI_TOKEN_HERE",
        "X-RapidAPI-Host": "english-wikipedia-vector-database.p.rapidapi.com"
    }

    response = requests.post(api_url, json=query, headers=headers)
    resp = response.json()
    print(json.dumps(resp, indent=2))


if __name__ == '__main__':
    main()

Happy hacking!
