Deduplication
One of the core concepts of the OpenCTI knowledge graph is the set of underlying mechanisms that accurately de-duplicate and consolidate (i.e. upsert) information about entities and relationships.
Creation behavior
When an object is created in the platform, whether manually by a user or automatically by the connectors / workers chain, the platform checks whether it already exists, based on some of the object's properties. If the object already exists, the platform returns the existing object and, in some cases, updates it as well.
Technically, OpenCTI generates deterministic IDs from the properties listed below (the "ID Contributing Properties") to prevent duplicates. There is also a special link between name and aliases: an entity cannot have aliases that overlap with those of another entity, nor an alias that is already used as the name of another entity.
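As an illustration of the idea, here is a minimal sketch of how a deterministic ID can be derived from contributing properties. It is not OpenCTI's actual implementation: the namespace UUID, the `deterministic_id` helper, and the lowercase/trim normalization are all assumptions made for this example.

```python
import json
import uuid

# Hypothetical namespace chosen for this example only (not OpenCTI's).
EXAMPLE_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def deterministic_id(entity_type: str, contributing: dict) -> str:
    """Derive a stable ID from an entity type and its contributing properties."""
    # Assumption: string values are trimmed and lowercased before hashing,
    # so "Emotet" and " emotet " produce the same ID.
    normalized = {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in contributing.items()
    }
    # Sorted keys make the serialization independent of insertion order.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return f"{entity_type}--{uuid.uuid5(EXAMPLE_NAMESPACE, entity_type + payload)}"


first = deterministic_id("malware", {"name": "Emotet"})
second = deterministic_id("malware", {"name": " emotet "})
other = deterministic_id("malware", {"name": "TrickBot"})
```

Because the ID is a pure function of the contributing properties, `first` and `second` are identical, so a second creation request resolves to the existing object instead of producing a duplicate, while `other` gets a different ID.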
Entities
| Type | Attributes |
|---|---|
| Area | (`name` OR `x_opencti_alias`) AND `x_opencti_location_type` |
| Attack Pattern | (`name` OR `alias`) AND optional `x_mitre_id` |
| Campaign | `name` OR `alias` |
| Channel | `name` OR `alias` |
| City | (`name` OR `x_opencti_alias`) AND `x_opencti_location_type` |
| Country | (`name` OR `x_opencti_alias`) AND `x_opencti_location_type` |
| Course Of Action | (`name` OR `alias`) AND optional `x_mitre_id` |
| Data Component | `name` OR `alias` |
| Data Source | `name` OR `alias` |
| Event | `name` OR `alias` |
| Feedback Case | `name` AND `created` (date) |
| Grouping | `name` AND `context` |
| Incident | `name` OR `alias` |
| Incident Response Case | `name` OR `alias` |
| Indicator | `pattern` OR `alias` |
| Individual | (`name` OR `x_opencti_alias`) AND `identity_class` |
| Infrastructure | `name` OR `alias` |
| Intrusion Set | `name` OR `alias` |
| Language | `name` OR `alias` |
| Malware | `name` OR `alias` |
| Malware Analysis | `name` OR `alias` |
| Narrative | `name` OR `alias` |
| Note | None |
| Observed Data | `name` OR `alias` |
| Opinion | None |
| Organization | (`name` OR `x_opencti_alias`) AND `identity_class` |
| Position | (`name` OR `x_opencti_alias`) AND `x_opencti_location_type` |
| Region | `name` OR `alias` |
| Report | `name` AND `published` (date) |
| RFI Case | `name` AND `created` (date) |
| RFT Case | `name` AND `created` (date) |
| Sector | (`name` OR `alias`) AND `identity_class` |
| Task | None |
| Threat Actor | `name` OR `alias` |
| Tool | `name` OR `alias` |
| Vulnerability | `name` OR `alias` |
Names and aliases management
The name and aliases of an entity define a set of unique values, so it's not possible to have the name equal to an alias and vice versa.
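The uniqueness rule can be sketched as follows. This is a simplified illustration, not the platform's code: the helper names and the case-insensitive comparison are assumptions.

```python
from typing import Dict, List


def normalize(value: str) -> str:
    # Assumption: values are compared case-insensitively, ignoring outer spaces.
    return value.strip().lower()


def clean_aliases(name: str, aliases: List[str]) -> List[str]:
    """Drop aliases that repeat the name or an earlier alias."""
    seen = {normalize(name)}
    kept = []
    for alias in aliases:
        key = normalize(alias)
        if key not in seen:
            seen.add(key)
            kept.append(alias)
    return kept


def overlaps(entity_a: Dict, entity_b: Dict) -> bool:
    """Return True if the two entities share a name or alias value."""
    values_a = {normalize(v) for v in [entity_a["name"], *entity_a.get("aliases", [])]}
    values_b = {normalize(v) for v in [entity_b["name"], *entity_b.get("aliases", [])]}
    return not values_a.isdisjoint(values_b)


apt28 = {
    "name": "APT28",
    "aliases": clean_aliases("APT28", ["Fancy Bear", "apt28", "Fancy Bear"]),
}
sofacy = {"name": "Sofacy", "aliases": ["Fancy Bear"]}
```

Here `clean_aliases` keeps a single `"Fancy Bear"` alias for `apt28`, and `overlaps(apt28, sofacy)` is true because both entities carry that alias, which is the situation the name and alias link is meant to prevent.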
Relationships
The deduplication process for relationships is based on the following criteria:
- Type
- Source
- Target
- Start time within a window of -30 days / +30 days
- Stop time within a window of -30 days / +30 days
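The criteria above can be sketched as a comparison function. The `Relationship` class and its field names are hypothetical, and the handling of missing dates (a missing date only matches another missing date) is an assumption of this example rather than documented platform behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(days=30)


@dataclass
class Relationship:
    relationship_type: str
    source_id: str
    target_id: str
    start_time: Optional[datetime] = None
    stop_time: Optional[datetime] = None


def within_window(a: Optional[datetime], b: Optional[datetime]) -> bool:
    """True if both dates fall within 30 days of each other."""
    if a is None or b is None:
        # Assumption: a missing date only matches another missing date.
        return a is None and b is None
    return abs(a - b) <= WINDOW


def is_duplicate(existing: Relationship, incoming: Relationship) -> bool:
    """Apply the five criteria: type, source, target, start time, stop time."""
    return (
        existing.relationship_type == incoming.relationship_type
        and existing.source_id == incoming.source_id
        and existing.target_id == incoming.target_id
        and within_window(existing.start_time, incoming.start_time)
        and within_window(existing.stop_time, incoming.stop_time)
    )


base = Relationship("uses", "intrusion-set--a", "malware--b", datetime(2023, 1, 1))
close = Relationship("uses", "intrusion-set--a", "malware--b", datetime(2023, 1, 11))
far = Relationship("uses", "intrusion-set--a", "malware--b", datetime(2023, 2, 15))
```

In this sketch, `close` starts 10 days after `base` and is treated as the same relationship, while `far` starts 45 days later and is treated as a distinct one.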
Observables
For STIX Cyber Observables, OpenCTI also generates deterministic IDs based on the STIX specification, using the "ID Contributing Properties" defined for each type of observable.
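For reference, the STIX 2.1 specification defines these IDs as UUIDv5 values computed in a fixed namespace from the canonical JSON of the contributing properties. The sketch below follows that scheme, with one loudly stated simplification: `json.dumps` with sorted keys only approximates the canonical JSON form (RFC 8785) and is adequate for flat objects with plain string values. The `sco_id` helper is a name introduced for this example.

```python
import json
import uuid

# Namespace defined by the STIX 2.1 specification for deterministic SCO IDs.
STIX_SCO_NAMESPACE = uuid.UUID("00abedb4-aa42-466c-9c01-fed23315a9b7")


def sco_id(object_type: str, contributing: dict) -> str:
    """Build a deterministic ID from an observable's contributing properties."""
    # Simplification: approximates RFC 8785 canonical JSON for flat string values.
    canonical = json.dumps(
        contributing, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return f"{object_type}--{uuid.uuid5(STIX_SCO_NAMESPACE, canonical)}"


# For an IPv4 address, the only ID contributing property is "value".
ipv4_id = sco_id("ipv4-addr", {"value": "198.51.100.3"})
ipv4_again = sco_id("ipv4-addr", {"value": "198.51.100.3"})
```

Two observables with the same contributing properties therefore share one ID, regardless of which connector or user created them.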
Update behavior
In cases where an entity already exists in the platform, incoming creations can trigger updates to the existing entity's attributes. This logic has been implemented to converge the knowledge base towards the highest confidence and quality levels for both entities and relationships.
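One way to picture this convergence is the sketch below. It is only an illustration under stated assumptions: the rule that equal or higher confidence overwrites attributes while lower confidence only fills empty ones is assumed for the example, and the `upsert` function and the dictionary shape are hypothetical.

```python
def upsert(existing: dict, incoming: dict) -> dict:
    """Merge an incoming creation into an existing entity (illustrative rule)."""
    merged = dict(existing)
    if incoming.get("confidence", 0) >= existing.get("confidence", 0):
        # Assumption: equal or higher confidence may overwrite attributes.
        merged.update({k: v for k, v in incoming.items() if v is not None})
    else:
        # Assumption: lower confidence may only fill attributes that are empty.
        for key, value in incoming.items():
            if merged.get(key) is None and value is not None:
                merged[key] = value
    return merged


existing = {"name": "Emotet", "description": None, "confidence": 80}
low = {"name": "Emotet", "description": "Banking trojan", "confidence": 50}
high = {"name": "Emotet", "description": "Malware loader", "confidence": 90}

after_low = upsert(existing, low)
after_high = upsert(after_low, high)
```

With these inputs, the lower-confidence creation fills the empty description but leaves the confidence at 80, and the higher-confidence creation then replaces the description and raises the confidence to 90.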
To understand in detail how the deduplication mechanism works in the context of the maximum confidence level, you can navigate through this diagram (section deduplication):