Deduplication
One of the core concepts of the OpenCTI knowledge graph is the set of underlying mechanisms that accurately de-duplicate and consolidate (i.e. upsert) information about entities and relationships.
Creation behavior
When an object is created in the platform, whether manually by a user or automatically by the connectors / workers chain, the platform checks whether it already exists, based on some of the object's properties. If the object already exists, the platform returns the existing object and, in some cases, updates it as well.
Technically, OpenCTI generates deterministic IDs based on the properties listed below (the "ID Contributing Properties") to prevent duplicates. Note also that there is a special link between name and aliases: an entity cannot have aliases that overlap with those of another entity, nor an alias that is already used as the name of another entity.
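To illustrate how a deterministic ID can be derived from contributing properties, here is a minimal Python sketch. It is not OpenCTI's actual implementation: the namespace UUID, the `normalize` helper, and the `deterministic_id` function are hypothetical names introduced for this example, and the trim-and-lowercase normalization is an assumption.

```python
import json
import uuid

# Hypothetical namespace chosen for this example only; not OpenCTI's real value.
EXAMPLE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def normalize(value):
    """Assumed normalization: trim and lowercase strings, leave other values as-is."""
    return value.strip().lower() if isinstance(value, str) else value


def deterministic_id(entity_type: str, contributing: dict) -> str:
    """Derive a stable ID from the entity type and its ID contributing properties."""
    data = {key: normalize(val) for key, val in contributing.items()}
    data["type"] = entity_type
    # Sorted keys and fixed separators make the serialization independent of key order.
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
    # UUIDv5 is a pure function of (namespace, payload): same input, same ID.
    return f"{entity_type}--{uuid.uuid5(EXAMPLE_NAMESPACE, payload)}"


first = deterministic_id("report", {"name": "APT Report", "published": "2023-01-01T00:00:00Z"})
again = deterministic_id("report", {"published": "2023-01-01T00:00:00Z", "name": " apt report "})
other = deterministic_id("report", {"name": "APT Report", "published": "2024-01-01T00:00:00Z"})
print(first == again, first == other)
```

Because the ID depends only on the contributing properties, creating the same report twice resolves to the same object, while changing a contributing property such as the publication date yields a different one.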
Entities
Type | Attributes |
---|---|
Area | (name OR x_opencti_alias ) AND x_opencti_location_type |
Attack Pattern | (name OR alias ) AND optional x_mitre_id |
Campaign | name OR alias |
Channel | name OR alias |
City | (name OR x_opencti_alias ) AND x_opencti_location_type |
Country | (name OR x_opencti_alias ) AND x_opencti_location_type |
Course Of Action | (name OR alias ) AND optional x_mitre_id |
Data Component | name OR alias |
Data Source | name OR alias |
Event | name OR alias |
Feedback Case | name AND created (date) |
Grouping | name AND context |
Incident | name OR alias |
Incident Response Case | name OR alias |
Indicator | pattern OR alias |
Individual | (name OR x_opencti_alias ) AND identity_class |
Infrastructure | name OR alias |
Intrusion Set | name OR alias |
Language | name OR alias |
Malware | name OR alias |
Malware Analysis | name OR alias |
Narrative | name OR alias |
Note | None |
Observed Data | name OR alias |
Opinion | None |
Organization | (name OR x_opencti_alias ) AND identity_class |
Position | (name OR x_opencti_alias ) AND x_opencti_location_type |
Region | name OR alias |
Report | name AND published (date) |
RFI Case | name AND created (date) |
RFT Case | name AND created (date) |
Sector | (name OR alias ) AND identity_class |
Task | None |
Threat Actor | name OR alias |
Tool | name OR alias |
Vulnerability | name OR alias |
Names and aliases management
The name and aliases of an entity define a set of unique values, so it's not possible to have the name equal to an alias and vice versa.
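The "name OR alias" matching can be sketched as a set-overlap check. This is an illustrative in-memory model, not the platform's code: the `Entity` class and `find_existing` function are hypothetical, and the case-insensitive comparison is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Entity:
    name: str
    aliases: Set[str] = field(default_factory=set)

    def labels(self) -> Set[str]:
        """All values that identify this entity: its name plus its aliases."""
        return {self.name.lower()} | {alias.lower() for alias in self.aliases}


def find_existing(entities: List[Entity], name: str,
                  aliases: Optional[Set[str]] = None) -> Optional[Entity]:
    """Return the first entity whose name or aliases overlap the incoming ones."""
    incoming = {name.lower()} | {alias.lower() for alias in (aliases or set())}
    for entity in entities:
        if entity.labels() & incoming:
            return entity
    return None


known = [Entity("APT28", {"Fancy Bear", "Sofacy"})]
match = find_existing(known, "Pawn Storm", {"Sofacy"})
print(match.name if match else None)
```

In this model, an incoming entity named "Pawn Storm" with the alias "Sofacy" resolves to the existing "APT28" entity rather than creating a second one.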
Relationships
The deduplication process for relationships is based on the following criteria:
- Type
- Source
- Target
- Start time within -30 days / +30 days of the existing relationship's start time
- Stop time within -30 days / +30 days of the existing relationship's stop time
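The criteria above can be sketched as a comparison function. The `Relationship` class and `is_duplicate` function are hypothetical names for this example, and treating two missing timestamps as a match is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(days=30)


@dataclass
class Relationship:
    rel_type: str
    source: str
    target: str
    start_time: Optional[datetime] = None
    stop_time: Optional[datetime] = None


def within_window(a: Optional[datetime], b: Optional[datetime]) -> bool:
    """True when both timestamps are within 30 days of each other."""
    if a is None or b is None:
        # Assumption: two missing timestamps match, one missing does not.
        return a is None and b is None
    return abs(a - b) <= WINDOW


def is_duplicate(existing: Relationship, incoming: Relationship) -> bool:
    """Apply the five criteria: type, source, target, start window, stop window."""
    return (
        existing.rel_type == incoming.rel_type
        and existing.source == incoming.source
        and existing.target == incoming.target
        and within_window(existing.start_time, incoming.start_time)
        and within_window(existing.stop_time, incoming.stop_time)
    )


stored = Relationship("uses", "intrusion-set--a", "malware--b",
                      datetime(2023, 1, 1), datetime(2023, 6, 1))
new = Relationship("uses", "intrusion-set--a", "malware--b",
                   datetime(2023, 1, 20), datetime(2023, 6, 15))
print(is_duplicate(stored, new))
```

Here the two relationships share type, source, and target, and their start and stop times differ by 19 and 14 days respectively, so the incoming one is treated as a duplicate of the stored one.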
Observables
For STIX Cyber Observables, OpenCTI also generates deterministic IDs based on the STIX specification, using the "ID Contributing Properties" defined for each type of observable.
Update behavior
In cases where an entity already exists in the platform, incoming creations can trigger updates to the existing entity's attributes. This logic has been implemented to converge the knowledge base towards the highest confidence and quality levels for both entities and relationships.
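As a rough illustration of converging towards the highest confidence, the sketch below applies an incoming creation to an existing entity only when its confidence is equal or higher. This rule is a deliberate simplification assumed for the example; the platform's actual decision logic is more detailed, and the `upsert` function and dictionary layout are hypothetical.

```python
def upsert(existing: dict, incoming: dict) -> dict:
    """Merge an incoming creation into an existing entity.

    Simplifying assumption: incoming attributes are applied only when the
    incoming confidence is greater than or equal to the existing one.
    """
    if incoming.get("confidence", 0) < existing.get("confidence", 0):
        # Lower-confidence data leaves the existing entity unchanged.
        return dict(existing)
    # Equal or higher confidence: incoming attributes overwrite existing ones.
    return {**existing, **incoming}


stored = {"name": "Emotet", "description": "Banking trojan", "confidence": 80}
weak = {"name": "Emotet", "description": "Unverified note", "confidence": 50}
strong = {"name": "Emotet", "description": "Modular loader", "confidence": 90}

print(upsert(stored, weak)["description"])
print(upsert(stored, strong)["description"])
```

With these inputs, the low-confidence creation keeps the stored description, while the high-confidence one replaces it and raises the stored confidence to 90.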
To understand in detail how the deduplication mechanism works in the context of the maximum confidence level, you can navigate through this diagram (section deduplication):