The New York Times recently published an article titled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” which described some of the many challenges of actually using the diverse data sets that make up the foundation of Big Data.
The article is very interesting and gets to some key issues, which can be broken down into three categories:
- Data type management – creating a clean data format that can be understood by a machine
- Taxonomic management – developing clear relationships between related data to create a coherent concept structure
- Ontological management – identifying the actual meaning of data and related terms (synonyms)
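The three categories can be sketched as a tiny, hypothetical Python pipeline. All field names, vocabularies, and data here are illustrative, not drawn from any real system:

```python
# Illustrative sketch of the three layers of data management described above.
# All field names, vocabularies, and values are hypothetical.

# 1. Data type management: coerce a raw record into a clean, machine-readable form.
raw = {"dose_mg": "50 ", "reported_effect": "Sleepiness"}
record = {
    "dose_mg": float(raw["dose_mg"]),
    "reported_effect": raw["reported_effect"].strip().lower(),
}

# 2. Taxonomic management: place terms in a hierarchy of related concepts.
taxonomy = {"sleepiness": "neurological effects", "nausea": "gastrointestinal effects"}

# 3. Ontological management: collapse synonyms onto one canonical meaning.
synonyms = {"sleepiness": "somnolence", "drowsiness": "somnolence"}

term = record["reported_effect"]
canonical = synonyms.get(term, term)
category = taxonomy.get(term)
print(canonical, category)  # → somnolence neurological effects
```

Note that only step 1 is "cleaning" in the narrow sense; steps 2 and 3 are design decisions about how the domain should be organized.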
The Patterns Are Only There If You See Them
Two of these three are firmly in the realm of Information Architecture. Yet the article implies that the biggest problem is some kind of inherent messiness. The key quote that caught my eye was this:
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
My experience with data management is that data cleanup beyond a certain point is really a reflection of a group’s confusion about how the data should be defined, organized, and sorted.
To put it another way, if you spend one day cleaning up a data set, you have a data cleanup problem.
But if you spend every day cleaning up data sets, you have an information architecture problem.
Searching for the True Big Data
It’s very hard to get some information scientists and analysts to think about things this way, largely because the creation of taxonomies and ontologies can seem so staid and slow compared to the excitement of “diving into the data” to find things. And yet it turns out to be critically important, at least as a consideration. For example, the article gives a classic example of a problem that might initially seem like a data cleaning problem but is, in fact, an ontological problem:
“…the Food and Drug Administration, National Institutes of Health and pharmaceutical companies often apply slightly different terms to describe the same side effect. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.”
This is not data cleaning; it is information architecture. And taking the time to think about the world this way – to carefully examine what is being said, and to propose a way things should be organized – is in many ways a lost art. The distinction would not have been lost on the great biologists and chemists of the 18th and 19th centuries, who were fully aware that classification itself was a critical part of actually seeing the order of the world emerge before them.
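The article's side-effect example can be made concrete with a small sketch. The synonym table below is a hand-built, hypothetical controlled vocabulary using the terms from the quote; the report data is invented for illustration:

```python
from collections import Counter

# Hypothetical synonym table mapping variant terms to one canonical concept,
# built from the article's FDA/NIH example. The mapping is illustrative only.
SYNONYMS = {"drowsiness": "somnolence", "sleepiness": "somnolence"}

def canonical(term: str) -> str:
    """Normalize a reported side effect to its canonical concept."""
    t = term.strip().lower()
    return SYNONYMS.get(t, t)

reports = ["Drowsiness", "somnolence", "Sleepiness", "drowsiness", "nausea"]

# Without the ontology, one side effect looks like three different, rarer ones.
print(Counter(t.strip().lower() for t in reports))

# With it, the true frequency emerges.
print(Counter(canonical(t) for t in reports))
# → Counter({'somnolence': 4, 'nausea': 1})
```

The point is that the dictionary itself is the information architecture: once the vocabulary is defined, the "cleaning" is a one-line lookup rather than painstaking work repeated on every project.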
Mendeleev’s 1869 insight into the periodic structure of the elements came after he spent nearly a decade sorting and resorting a deck of cards, each representing an element, by weight, electrostatic properties, temperature ranges, and any other measurable characteristic he could think of.
In both cases there was some data cleanup. But most of what Linnaeus and Mendeleev brought to the table was patience and an openness to seeing the patterns that were there. And, although there have been small changes to both these models as more information has come to light, they exist in almost exactly the same form as when they were created in the 18th and 19th centuries.
This is the profound, explanatory power of effective information architecture.
Why It Matters
Why am I bringing these old geezers into the story? Because they remind us of what we have lost in the rich, almost effortless pattern-matching technological environment we live in. The ability to think about and organize data is a SKILL, a learned ability. It is very different from algorithmic analysis or programming. Perhaps most tellingly, it is orthogonal to most statistical analysis in that it posits a structure that can be observed and defined directly, rather than one that must be inferred through statistical tests.
The conflation, by many modern data scientists, of “data that is messy” with “data that is disorganized or misunderstood” means we run the risk of repeating the same ad hoc and incorrect classifications over and over again. There is almost always a natural order in any data set we systematically examine, should we take the time to believe that the pattern will emerge.
When we apply the principles of information architecture alongside big data’s power to collect and filter information, we can discover things across data sets in ways we haven’t even imagined.