Applying the Theory of Identity as History to Data Management
Identity can be difficult to define when an entity changes state, but tracking the history of those states offers a solution.
If you replaced the head and the handle of a hammer, would it be the same hammer? This thought experiment is a variant of the Ship of Theseus, a puzzle that dates back to Ancient Greece, and it reveals a paradoxical relationship between identity and change. How and to what extent can something change while retaining its identity?
A solution to this paradox is crucial to entity analytics, the science of linking bits of information to reveal the web of people, objects, events and other things of value in business and governance. This is the embodiment of what many consider to be the vision of “big data” – that by unifying data from many sources, the collective knowledge of things can be explored in its totality.
Entity analytics requires an ability to track changes to the identities of things over time. Otherwise the data is at risk of obsolescence, or worse, corruption. As things change, their links to historical records can break, making it impossible to know if things logged at different times are in fact the same.
The original solution to this problem was relational modeling – to identify things with arbitrary fixed identifiers called “keys,” and then reference things by their keys. But keys alone cannot satisfy entity analytics. Keys must belong to a managed key space. Conflicts arise when merging data from two or more key spaces. Business units and third-party data providers such as resellers, distributors, public records aggregators or social media services all have different representations of the same things. Moreover, the very usage of keys has waned as denormalization became an essential design for scalability.
Identities are scattered, messy and always changing in the world of enterprise data. The paradox of the hammer remains a curse.
I offer a practical solution to this paradox: Identity as history. An identity is the lineage of changes to an entity over time, rather than the state of an entity at one point in time. Attributes of an entity are most faithfully represented as immutable events, each signifying a change (or lack of change) to the entity, rather than as mutable properties. The collection of those events reveals the identity of an entity in its entirety. Let’s explore this theory, and then discuss its application to data management and entity analytics.
Everything has an identity that is capable of change. A person may change a name, address, relationship or facial structure. A mobile device may change a phone number, IP address or SIM card. But the self-identity of the thing in question will endure, even upon changing its every descriptive attribute. What remains constant is the history of changes to that thing. Identity exists as the lineage of changes to an entity over time, rather than the state of the entity at a single point in time.
Then what is an entity? What is the essence of something that makes it itself? And what is an identity if there is nothing constant to identify? This is unproductive thinking. It welcomes a mess of conceptual deconstruction where things are so picked apart that nothing can be described, let alone be useful. An entity is a concept with a practical purpose that we can perceive without further deconstruction. The concept of a person, for example, is easily perceived and has clear uses in business and governance. And the identity of a person is the history of changes we can attribute to the thing we believe to be a distinct person.
So if you replaced the head and the handle of a hammer, would it be the same hammer? According to this solution, it depends on whether you can trace the lineage of changes to the hammer. By replacing the head and the handle sequentially, you have associated the new parts with the old parts, thus preserving the identity of the hammer. By discarding the head and the handle and replacing both at once, you have constructed a new hammer with its own identity.
The identity of a hammer seems easy enough to track. What’s the best way to track the identity of something more complex – like a person?
Let’s look at how to apply this theory of “identity as history” to data management. Consider these two design choices. The first assumes an environment with a single managed key space. The second assumes many conflicting key spaces. Both can coexist in the same data management strategy.
1. Represent attributes as state changes.
Schemas often represent attributes one-dimensionally. The context of time is absent. When updating a record, new values overwrite the old values. This kind of design breaks the lineage of an identity, erasing its valuable history. Records that cite details of old identities may lose referential integrity, making it impossible to relate them to the things as they now exist.
What you want is a timeline of events that indicate changes to the states of entities. Think of attributes as facts that were true at a point in time. Today I can truthfully write, “Dave lives in the Research Triangle.” But that fact can expire. What will always be true is the dated statement, “Dave lived in the Research Triangle on February 1, 2016.” That fact is forever etched into my history – my identity. It will always describe me.
Decouple the attributes from the entity, and decouple the attributes from their values. Relate the attributes to their values in a one-to-many relationship. Represent the values as immutable, factual events that describe the attributes of the entity at a point in time, rather than as mutable properties of the entity in the present alone. And let each factual event answer these questions:
- Which entity changed?
- Which attribute changed?
- What was the old value?
- What was the new value?
- When did the change occur?
Adapt this model to your data architecture as appropriate. But remember this principle: an attribute is a fact that represents a change of state to an entity at a point in time.
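As a rough illustration of this principle, here is a minimal Python sketch of an append-only event log. Every name in it (AttributeEvent, EntityHistory, record, value_as_of) is hypothetical and chosen only for this example, and the second residence is invented to show a change of state. Each event answers the five questions listed above, and nothing is ever overwritten.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)  # frozen: an event cannot be altered once recorded
class AttributeEvent:
    entity_id: str             # which entity changed
    attribute: str             # which attribute changed
    old_value: Optional[str]   # the old value (None if no earlier value is known)
    new_value: Optional[str]   # the new value
    occurred_on: date          # when the change occurred


class EntityHistory:
    """Hypothetical append-only log of attribute events."""

    def __init__(self) -> None:
        self._events: List[AttributeEvent] = []

    def record(self, entity_id: str, attribute: str,
               new_value: Optional[str], occurred_on: date) -> AttributeEvent:
        # The old value is looked up from the log, never supplied by the caller.
        old_value = self.value_as_of(entity_id, attribute, occurred_on)
        event = AttributeEvent(entity_id, attribute, old_value, new_value, occurred_on)
        self._events.append(event)
        return event

    def value_as_of(self, entity_id: str, attribute: str,
                    as_of: date) -> Optional[str]:
        """Return the attribute's value on a given date, or None if unknown."""
        matching = [
            e for e in self._events
            if e.entity_id == entity_id
            and e.attribute == attribute
            and e.occurred_on <= as_of
        ]
        if not matching:
            return None
        # sorted() is stable, so the most recently recorded event wins ties.
        return sorted(matching, key=lambda e: e.occurred_on)[-1].new_value


history = EntityHistory()
history.record("dave", "residence", "Research Triangle", date(2016, 2, 1))
# Invented later move, included only to show that the old fact survives.
history.record("dave", "residence", "Elsewhere", date(2020, 1, 1))

print(history.value_as_of("dave", "residence", date(2016, 2, 1)))
print(history.value_as_of("dave", "residence", date(2021, 1, 1)))
```

The fact about February 1, 2016 stays queryable after the later change, which is the point of the design: the present value is just the latest event, while the identity is the whole log.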
2. Infer state changes with entity resolution.
The first approach assumes a world of perfect data governance. In this world, divisional leaders agree on how to define and manage data, and how to enact change. Stewards of the data follow clear policies of entry and management. They know which changes to apply to which records. They act consistently and without error. This perfect world enjoys a trusted “single version of the truth” with a single managed key space. And the identities of the things that matter to the organization are fully represented in the data over time.
The reality is never so ideal. Politics, policies, mistakes and other human factors lead to closed silos of messy data. Business units end up representing the same information in different ways. Data brought in from the outside world introduces other representations. And the identities of things that matter to the broader organization grow more and more fragmented and incoherent. This is the biggest barrier to entity analytics. You cannot solve the paradox of the hammer without its whole historical lineage at hand.
Reconstructing identities from disparate data requires a specialized approach to data integration called entity resolution – the process of linking records that represent the same entity despite describing it differently. The idea is to take a pair of records, decide if they represent the same entity, link them if they do, and repeat for each pair of records. The identity of an entity, including all known changes to the entity, becomes apparent in the linkage of its records.
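The pairwise loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any particular product works: the match rule in same_entity (same email, or same name and birth date), the record fields, and the sample records are all invented for illustration. Links are tracked with a small union-find structure so that records matched indirectly still end up in the same identity.

```python
from itertools import combinations
from typing import Dict, List, Optional


def norm(value: Optional[str]) -> str:
    """Normalize a field for comparison; missing values become ''."""
    return (value or "").strip().lower()


def same_entity(a: dict, b: dict) -> bool:
    """Hypothetical match rule: same email, or same name and birth date."""
    if norm(a.get("email")) and norm(a.get("email")) == norm(b.get("email")):
        return True
    return (
        bool(norm(a.get("name")))
        and norm(a.get("name")) == norm(b.get("name"))
        and a.get("birth_date") is not None
        and a.get("birth_date") == b.get("birth_date")
    )


def resolve(records: List[dict]) -> List[List[int]]:
    """Compare every pair of records and group the indices that link together."""
    parent = list(range(len(records)))  # each record starts as its own entity

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_entity(records[i], records[j]):
            parent[find(j)] = find(i)  # link the two groups

    clusters: Dict[int, List[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented sample records from three hypothetical key spaces.
records = [
    {"source": "crm", "name": "Pat Example", "email": "pat@example.com",
     "birth_date": "1980-01-01"},
    {"source": "billing", "name": "Patricia Example", "email": "PAT@example.com",
     "birth_date": "1980-01-01"},
    {"source": "reseller", "name": "patricia example ", "email": None,
     "birth_date": "1980-01-01"},
    {"source": "crm", "name": "Sam Sample", "email": "sam@example.com",
     "birth_date": "1975-05-05"},
]

print(resolve(records))  # [[0, 1, 2], [3]]
```

Records 0 and 2 never match each other directly; they are joined through record 1, which is how a lineage of differing representations resolves into one identity. Comparing every pair grows quadratically with the number of records, so a sketch like this only suits small data sets.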
An effective entity resolution system will quickly reveal the full identities of the things that matter to the organization. This is crucial to performing entity analytics when the information about the entities is scattered across large, diverse, and ungoverned data sets. It gives you the freedom to analyze the full scope and histories of entities as they exist across the organization, without having to work against the realities of decentralization and human error.
If you manage data for a large enterprise and you can see the value of entity analytics, then I recommend you take a serious look at Novetta Entity Analytics. This software solves the problem of identity and change in large, diverse, and ungoverned data – originally for matters of national security, now for everyone on Hadoop. It is the only software I have found to offer a framework for generic entity resolution that can also match a billion records in a matter of hours.
I’ll leave you with some final thoughts. Think about how you have changed over time. Think about the gradual evolution of everyone and everything in your life. Now think about how useful it would be to have the full history of everything that ever mattered to you at your fingertips. That is the power of entity analytics, and that is why you should care about the paradox of the hammer.
This article was originally posted at Novetta.com. It has been reprinted here with permission. Novetta retains all copyrights.