ComplianceSphere

By ComplianceSphere | Updated a month ago | Finance

Understanding ComplianceSphere data

Our data model provides a domain-specific language to describe people, companies, and the relationships between them.

ComplianceSphere collects data about political and economic influence and conflict. To correctly describe these real-world structures and their associated risks, we use a data model focused on the notion of entities.

Entities have various properties, denoting - for example - their name, creation date, and association with one or more countries. They can also use properties to reference other entities, e.g. a Passport entity will link to its holder. Some relationships are - in turn - entities: a Person can be linked to a Company via an Ownership entity, which might document further details, like the period and ownership percentage.

Using the matching API to build a screening process

Screening checks are a different challenge to normal text searches: your query is supposed to describe a person or company in some detail to allow the ComplianceSphere API to check if that entity (or a similar one) is flagged.

Step 1: Speak the language

Let’s say, for example, that you have a customers dataset that specifies the name, birth date, nationality and perhaps a national ID number for each person you want to check.

The first step is to write a piece of code that formats each of these entries so that it conforms to the entity format used by ComplianceSphere, assigning each column in your source data to one of the fields specified in the data dictionary. (This works not just for people, but also for companies, vessels, even crypto wallets.)

Here’s an example entity in JSON format:

{
    "schema": "Person",
    "properties": {
        "name": ["Arkadiii Romanovich Rotenberg", "Ротенберг Аркадий"],
        "birthDate": ["1951"],
        "nationality": ["Russia"]
    }
}

A few things to note:

  • The schema defines the type of entities to match this example against. Of course, the schema could also be Company, or a LegalEntity (a more general entity type that matches both people and companies!).
  • You can specify a list of property values, rather than a single value - for example, different variations of the name, or different addresses and identification numbers.
  • The API internally uses standardised formats for country codes, dates, phone numbers etc., but you can just supply a country name and the API will attempt to identify the correct country code (in this case: ru) for the entity.

Generating this JSON form of your records should be a simple exercise. Do not worry too much about details like whether a country name should live in the country or jurisdiction properties: the matching happens by data type (in this case: country), not precise field name.
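As a minimal sketch of this conversion step, the following Python function maps one row of a customers dataset to a Person entity. The source column names (full_name, date_of_birth, nationality, national_id) are hypothetical, and the idNumber property name is an assumption that should be checked against the data dictionary.

```python
import json

# Hypothetical source columns mapped to entity property names.
# "idNumber" is an assumed property name; verify it in the data dictionary.
COLUMN_TO_PROPERTY = {
    "full_name": "name",
    "date_of_birth": "birthDate",
    "nationality": "nationality",
    "national_id": "idNumber",
}


def to_entity(record: dict) -> dict:
    """Convert one customer row into a Person entity for matching."""
    properties = {}
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = record.get(column)
        if value:  # skip empty or missing columns
            # Property values are always lists, even for a single value.
            properties[prop] = [str(value)]
    return {"schema": "Person", "properties": properties}


customer = {
    "full_name": "Arkadii Rotenberg",
    "date_of_birth": "1951",
    "nationality": "Russia",
    "national_id": "",
}
entity = to_entity(customer)
print(json.dumps(entity, ensure_ascii=False, indent=4))
```

Empty columns are dropped rather than sent as empty strings, so the query only describes what you actually know about the customer.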

Step 2: Choose where to look

ComplianceSphere combines data from dozens of sources - some are sanctions lists, others databases of national politicians or entities involved in crime. When running a matching process, you probably want it to run against one of the provided collections that combine and de-duplicate entities from multiple data sources:

  • default is the widest collection of sources, including sanctioned entities, politicians and entities linked to crime.
  • sanctions contains only the entities mentioned in the international sanctions lists included by the system.
  • peps lists the entities known to be politically exposed persons (PEPs), e.g. politicians and their close associates and family members.

Which collection is queried is determined by the URL of the matching endpoint used in your integration, e.g. https://compliancesphere.p.rapidapi.com/match/sanctions.
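A small helper can keep the collection choice in one place. This is only a sketch: the function name match_url is made up for illustration, and the /match/peps path is inferred from the pattern of the other two endpoints rather than quoted from this README.

```python
BASE_URL = "https://compliancesphere.p.rapidapi.com"

# Collection names as listed above.
COLLECTIONS = {"default", "sanctions", "peps"}


def match_url(collection: str = "default") -> str:
    """Return the matching endpoint URL for the given collection."""
    if collection not in COLLECTIONS:
        raise ValueError(f"unknown collection: {collection!r}")
    return f"{BASE_URL}/match/{collection}"


print(match_url("sanctions"))
```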

Step 3: Chunk your lookups into batches

To avoid the overhead of sending thousands upon thousands of HTTP requests, you can group the entities to be matched into batches and send one batch per request. A good batch size is 20 or 50, not 5000.
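The batching could be sketched as below. The request body shape (a "queries" object keyed by a query ID) follows the /match example shown later in this README; the q0, q1, ... key scheme and the function name are illustrative choices, not requirements of the API.

```python
from itertools import islice


def batched_queries(entities, batch_size=50):
    """Yield request bodies that each carry up to batch_size queries."""
    numbered = iter(enumerate(entities))
    while True:
        chunk = list(islice(numbered, batch_size))
        if not chunk:
            return
        # Keys identify each query so results can be mapped back to records.
        yield {"queries": {f"q{index}": entity for index, entity in chunk}}


entities = [
    {"schema": "Person", "properties": {"name": [f"Person {i}"]}}
    for i in range(120)
]
bodies = list(batched_queries(entities, batch_size=50))
print([len(body["queries"]) for body in bodies])
```

Each yielded body would then be sent as the JSON payload of a single POST request to the chosen /match endpoint.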

Configuring the scoring system

Introduction

When you send a match query to the API, it will be processed in two stages: first, a search index is used to locate possible candidate results. This process is meant to optimise for recall, i.e. find a broad selection of result candidates. In a second stage, these candidates are scored against the query that has been provided by the API consumer. The following URL query parameters are used to configure that process:

  • algorithm selects the scoring algorithm. The different algorithms are described below, but you can also retrieve metadata about them programmatically using the /algorithms endpoint.
    • Set this to best to use the highest-quality algorithm available at any time. This will produce good results, but may mean that specific scores for matched entities change significantly over time.
  • threshold is defined as the numeric score limit above which a result should be considered a match. This parameter may need to be adapted in conjunction with the algorithm to avoid producing too many false positive matches.
  • cutoff describes the lower bound of the result score that should still be returned by the API. Lower this parameter to see more candidates that have been down-ranked by the scoring system.
  • limit gives the maximum number of matches returned. The ComplianceSphere dataset is de-duplicated, so there can usually only be one matching record for each query. Returning a large number of results therefore does not make sense like it would in a full-text search.
  • fuzzy is a boolean flag that can be used to disable fuzzy matching in the candidate finding stage. Fuzzy candidate finding has proven to be largely ineffective compared to other techniques (e.g. searching for phonetic and normalised forms of the names), so we recommend disabling it.

Recommended default: ?algorithm=best&fuzzy=false
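The query string can be assembled with the standard library, as in the sketch below. The parameter names come from the list above; the function name and its keyword defaults are illustrative, and parameters left as None are simply omitted so that the API applies its own defaults.

```python
from urllib.parse import urlencode


def scoring_query(algorithm="best", fuzzy=False, threshold=None, cutoff=None, limit=None):
    """Build the URL query string that configures candidate scoring."""
    params = {"algorithm": algorithm, "fuzzy": "true" if fuzzy else "false"}
    optional = (("threshold", threshold), ("cutoff", cutoff), ("limit", limit))
    for name, value in optional:
        if value is not None:  # leave unset parameters to the API defaults
            params[name] = value
    return "?" + urlencode(params)


print(scoring_query())
print(scoring_query("regression-v2", threshold=0.5, limit=5))
```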

Supported scoring mechanisms

The API supports several scoring mechanisms (“algorithms”) that can be used to compute and rank the results of a match query. Below is a narrative overview of the supported algorithms; please also refer to the technical documentation.

  • logic-v1 (currently also: best) implements a large number of deterministic rules to generate a match result suitable for screening systems. The rules include phonetic and fuzzy name matching, rules regarding the use of IMO, ISIN, LEI, OGRN, INN and other entity identifiers, and rules that reduce the score of matches in which supporting information (such as countries, DOB, gender, and address) differs between the query and the matching candidate. This model is calibrated to be used with the default threshold parameter value (0.7).

  • name-based and name-qualified are name-only scoring systems that combine the Jaro-Winkler and Soundex name comparison techniques to aggressively match entities by name. They attempt to loosely emulate the OFAC Sanctions Search web tool. This can be useful for regulatory purposes, or when you only know the names of the entities you need to screen. name-qualified provides a marginal improvement over name-based by computing the same score and then penalizing matches where the birth date or nationality is different for people, or where different registration numbers/tax identifiers are used for companies.

  • regression-v1 and regression-v2 are scoring systems that apply logistic regression to a wide set of features. They provide good results in particular if you can include multiple attributes to describe the entities you are screening for: dates of birth, nationalities, addresses, tax identifiers. Both models will produce high match scores only for multi-attribute matches, e.g. when a query shares the name and birth date or identification number of an entry in the database.

    • Please note: regression-v2 produces significantly lower score values than regression-v1. You may want to set the threshold parameter for matches to 0.5 when using it.
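If your integration switches between algorithms, it may help to keep the threshold next to the algorithm name. The sketch below uses only the two values mentioned above (0.7 as the default, 0.5 as the suggestion for regression-v2); the helper names are hypothetical, and treating a score exactly equal to the threshold as a non-match is an assumption.

```python
DEFAULT_THRESHOLD = 0.7

# Per-algorithm overrides; only regression-v2 has a documented suggestion.
THRESHOLD_OVERRIDES = {"regression-v2": 0.5}


def threshold_for(algorithm: str) -> float:
    """Return the threshold to send along with the given algorithm."""
    return THRESHOLD_OVERRIDES.get(algorithm, DEFAULT_THRESHOLD)


def is_match(score: float, algorithm: str) -> bool:
    """Client-side check: is the score above the algorithm's threshold?"""
    return score > threshold_for(algorithm)


print(threshold_for("logic-v1"), threshold_for("regression-v2"))
print(is_match(0.6, "logic-v1"), is_match(0.6, "regression-v2"))
```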

Fine-tuning the score weights

The logic-v1, name-based and name-qualified matchers support the fine-tuning of feature weights for custom scoring. For example, an API client may want to give more weight to a phonetic matching algorithm, or fully disable one of the existing mechanisms. Feature weights range from 0.0 to 1.0 and can be applied to any of the documented features by including a weights section in the body of the /match API request:

{
    "weights": {
        "name_literal_match": 0.0,
        "name_soundex_match": 1.0
    },
    "queries": {
        "q1": {
            "schema": "Person",
            "properties": {"name": ["Barack Ohbama"]}
        }
    }
}

The logic-v1 matcher includes some features that are weighted at 0.0 by default. These are meant to be enabled using custom weights if desired by the API user. Features that have a 0.0 weight are not computed by default, which has a positive impact on system performance.
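A request body with custom weights could be assembled as in this sketch. The function name with_weights is hypothetical; the range check mirrors the 0.0 to 1.0 range stated above, and the feature names are the two used in the example request.

```python
import json


def with_weights(queries: dict, weights: dict) -> dict:
    """Build a /match request body that carries custom feature weights."""
    for feature, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {feature!r} must be within 0.0-1.0")
    return {"weights": dict(weights), "queries": queries}


body = with_weights(
    queries={
        "q1": {"schema": "Person", "properties": {"name": ["Barack Ohbama"]}}
    },
    weights={"name_literal_match": 0.0, "name_soundex_match": 1.0},
)
print(json.dumps(body, indent=4))
```

Validating the range on the client side surfaces a typo such as 10 instead of 1.0 before the request is sent.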

Selecting your input data

In order to set up a matching solution with low error rates (both false positives and false negatives), it may be helpful to reflect on what input data you can provide in order to allow precise decision-making. Consider the following questions:

  • Do you know if a record in your screening set refers to a person or an organization? Setting the schema in your matching query to Person or Organization will increase precision.
  • Can you provide multiple name aliases? For persons, are you able to include the first and last name separately (in the firstName, lastName properties)?
  • The following can be useful qualifiers to include in your query in order to reduce false positives from name-only matches:
    • Can you provide a birth date or year of birth for individuals (birthDate)?
    • For companies, do you know any registration numbers (registrationNumber) or tax identifiers (taxNumber)?
    • Do you know the nationality of a person, or the country in which a company was registered (country)?
  • Finally, consider reducing the scope of your query. Using /match/default will search sanctions lists, the PEP database and a broad set of other risk-adjacent entities. For a simple sanctions screening system, consider using /match/sanctions instead: this will only produce matches sourced from sanctions lists.

Using the search function to find entities

The ComplianceSphere API exposes a full-text search function that you can use to find relevant entities. Below you can find some help on how to use the built-in syntax and advanced search operators effectively.

Finding an exact phrase or name

By default, ComplianceSphere tries to find matches based off your keywords pretty broadly, returning matches that include all of your keywords first, followed by matches that might only include one of your keywords. For example, if you type the keywords:

Ilham Aliyev

The search will return all matches that have the words “Ilham” and “Aliyev,” followed by matches that have either “Ilham” or “Aliyev” but not both, in them. Depending on your needs, this might not be ideal.

If you want the search to only return matches that have exactly “Ilham Aliyev”, then you should put quotation marks around those two keywords.

"Ilham Aliyev"

Allow for variations in spelling

Sometimes a name can be spelled many different ways or even misspelled many different ways. One way to solve this problem is to simply type each variation in the search form:

Aliyev Əliyev Aliyeva Əliyeva

You might capture all the variations you want, but you also might miss some by accident. Another way to look for variants of a name is to use the ~ operator:

Aliyev~2

This translates to: Give me matches that include the keyword Aliyev, but also matches that differ from it by up to two letter variations. A variation is adding, removing, or changing a letter. This includes Aliyev, of course, but also Əliyev, which is one variation away, and Əliyeva, which is two variations away from Aliyev.
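To see why those spellings fall within ~2, you can count the letter variations yourself. The sketch below computes a plain Levenshtein edit distance (insertions, deletions, substitutions); whether the search backend uses exactly this definition is an assumption, so treat it as an illustration rather than a reimplementation of the search.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-letter edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # remove a letter
                    current[j - 1] + 1,      # add a letter
                    previous[j - 1] + cost,  # change a letter (or keep it)
                )
            )
        previous = current
    return previous[-1]


for variant in ["Aliyev", "Əliyev", "Aliyeva", "Əliyeva"]:
    print(variant, edit_distance("Aliyev", variant))
```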

Search for words that should be in proximity to each other

If you do not want to find a precise keyword, but merely specify that two words are supposed to appear close to each other, you might want to use a proximity search, which also uses the ~ operator. This will try to find all the requested search keywords within a given distance from each other. For example, to find matches where the keywords Trump and Aliyev are ten or fewer words apart from each other, you can formulate the search as:

"Trump Aliyev"~10

Including and excluding combinations of keywords

You can tell the search to find matches to multiple keywords in a variety of ways or combinations, otherwise known as a composite search.

To tell the search that a keyword must exist in all resulting matches, use the + operator. Similarly, to tell the search that a keyword must not exist in any of the resulting matches, use the - operator.

+Trump -Aliyev

This translates to: Give me all matches in which each match must include the keyword Trump and must definitely not include the keyword Aliyev.

You can take these combinations a step further using the AND operator or the OR operator.

Trump AND Aliyev

This translates to: Give me all matches in which each match must contain both the keywords Trump and Aliyev, but don’t return any matches that contain only one of those keywords.

Trump OR Aliyev

This translates to: Give me all matches in which each match may contain the keywords Trump or Aliyev or both.

You can build on these searches even further like so:

+Aliyev AND (Obama OR Trump) -Georgia

This translates to: Give me all matches in which each match must contain the keyword Aliyev and must contain either the keyword Obama or the keyword Trump, but must not contain the keyword Georgia.
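If you generate search strings in code, small helpers for each operator keep the syntax consistent. The helper names below are hypothetical, and the name of the URL parameter that carries the search string is not covered in this README, so the sketch only shows how to build the string and percent-encode it for use in a URL.

```python
from urllib.parse import quote


def phrase(text, slop=None):
    """Exact phrase, or a proximity search when slop is given."""
    quoted = f'"{text}"'
    return quoted if slop is None else f"{quoted}~{slop}"


def fuzzy(term, edits=2):
    """Allow up to `edits` letter variations of a keyword."""
    return f"{term}~{edits}"


def must(term):
    return f"+{term}"


def must_not(term):
    return f"-{term}"


def any_of(*terms):
    return "(" + " OR ".join(terms) + ")"


query = f"{must('Aliyev')} AND {any_of('Obama', 'Trump')} {must_not('Georgia')}"
print(query)
print(phrase("Trump Aliyev", slop=10))
print(fuzzy("Aliyev"))
# Characters such as + and spaces must be percent-encoded inside a URL.
print(quote("+Trump -Aliyev"))
```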

Field queries

The data accessed by the search API can also be queried by directly naming a field (think: spreadsheet column) which should be considered. Field queries work by combining a field name with a value to be searched within that field:

name_parts:vladimir properties.lastName:Putin phones:+4915223433333

Some of the most useful fields available include the following, which collect multiple values of the same logical type:

  • countries - a list of all countries linked to an entity.
  • identifiers - any government identifier, including tax and corporate IDs, passport numbers etc.
  • name_parts - individual parts of a name after processing, e.g. john or smith. All values have been converted to the Latin alphabet and lowercased, and company types are contracted into an abbreviation (ooo, gmbh, asbl).
  • topics - a list of semantic tags related to the entity, like role.pep, sanction.
  • dates - dates linked to any entity, also accessible as year-only (e.g. dates:2022)
  • phones - any phone numbers given, in E.164 international format.
  • referents - all former IDs of an entity (before record linkage, e.g. ofac-12444)
  • others: genders, ibans, emails.

You can also query any property associated with an entity by addressing the property field directly using the properties. prefix. For example, the following are valid field names: properties.firstName, properties.innCode, properties.jurisdiction.

Field queries can lead to unexpected results because some fields are only searchable for precise values, or for the backend value stored in them. For example, a search for countries:Russia will not work - you must use countries:ru instead.
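One way to avoid the countries:Russia pitfall is to translate names to codes before building the field query. In this sketch the COUNTRY_CODES table is a tiny illustrative mapping (a real integration would use a complete ISO 3166 table), and both function names are hypothetical.

```python
# Tiny illustrative mapping from country names to two-letter codes.
COUNTRY_CODES = {"russia": "ru", "azerbaijan": "az"}


def field_query(field: str, value: str) -> str:
    """Combine a field name and a value into a field query."""
    return f"{field}:{value}"


def country_query(country_name: str) -> str:
    """Build a countries: query from a country name via its code."""
    code = COUNTRY_CODES.get(country_name.strip().lower())
    if code is None:
        raise ValueError(f"no country code known for {country_name!r}")
    return field_query("countries", code)


print(country_query("Russia"))
print(field_query("properties.lastName", "Putin"))
```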
