Can academia, researchers, decision makers and policy makers manage the challenges of broader collaboration and privacy to harness value from big data?

For academia, researchers, and decision-makers, as for many other sectors such as retail, healthcare, insurance, finance, capital markets, real estate, pharmaceuticals, and oil & gas, big data looks like a new Eldorado: a fantastic mine of information where we hope to find the key to great discoveries. In the social sciences and humanities, for example, questions such as “what is social media data telling us about our societies, our consumers, and the people who live near us?” or “how can big data be used to efficiently allocate resources, such as energy, capital, technology, or manpower?” are becoming more frequent.

Quite naturally, academia is abuzz, thinking about how best to use big data and its technologies and platforms. It is also teeming with competitiveness, as researchers race to make the discoveries waiting for us in those massive data sets.

A cautious approach

That being said, academics, researchers, and especially universities must exercise caution. There are some unique risks to take into account, risks that stem from academic research being (in most cases) made publicly available. All the results obtained, the methods used, and often all the data collected during a research project will eventually be made public. This level of transparency is unique to academia. So however magical big data may appear, using it requires careful consideration of both data protection laws and research ethics principles.

Concretely, when using data that was produced by individuals, we must not only take care to respect their privacy when processing the data, but also ensure that the content produced cannot be used against them. Further precautions must be taken when working with data from vulnerable individuals, such as young children and teenagers.

In that respect, the considerable experience universities have in handling human-subject data within the life sciences is of great help, and existing research protocols devised in that context are being adapted to address the new challenges posed by big data.

From big data to big collaboration


Researchers in statistics and mathematics have to be mindful of, and interact closely with, concepts of law and ethics. This is one of many examples of collaborations and interactions across academic fields driven by the use and exploration of big data. The reason is that big data requires a wider range of expertise to be applied simultaneously to answer research questions. In contrast, to understand the smaller, traditional data sets used not so long ago, knowing both the data and the methods was sufficient to tackle most projects.

Consider an effort to understand the drivers behind memes on Twitter. A meme is an idea, behavior, or style that spreads from person to person within a culture. A typical research project could, for example, try to understand how memes develop, whom they spread to, and why. Such a project would require a software engineer to build the database needed to manage the large data set; a statistician or machine learning expert to develop the proper methodology to explore the data and learn its properties; a linguist to help efficiently mine the tweets by identifying the relevant semantic structures; a social scientist to frame the research questions and interpret the results; and so on.
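
As a sketch of just one small slice of such a pipeline, the Python snippet below counts hashtags in a handful of invented tweets as a crude stand-in for meme detection. The tweets, the regular expression, and the choice of hashtags as meme markers are all assumptions made for illustration, not a description of any real project.

```python
from collections import Counter
import re

# Hypothetical hand-written tweets; a real project would pull these
# from the database built by the team's software engineer.
tweets = [
    "so much #winning today",
    "is #winning even still a thing?",
    "new cat video #caturday #winning",
]

# A crude stand-in for the linguist's semantic analysis:
# treat hashtags as markers of candidate memes.
def hashtags(text):
    return re.findall(r"#\w+", text.lower())

# Count how often each candidate meme appears in the corpus.
counts = Counter(tag for tweet in tweets for tag in hashtags(tweet))
print(counts.most_common(2))  # [('#winning', 3), ('#caturday', 1)]
```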

One of the difficulties for academia is that no culture of collaboration on this scale exists across these fields yet. Finding the right skills and synergies to produce efficient teamwork is the first challenge academics face when tackling a big data project.

The changing nature of data

But it is not the size of big data that requires so many different skills. In that sense, the term ‘big data’ is misleading. What makes big data different is the way the data is collected.

Traditionally data was collected for a specific purpose, relating to a specific scientific question. Academics sought to first delineate a scientific question, then build the most appropriate techniques to answer that question, and finally collect the appropriate data set to use the technique on. This systematic process led, hopefully, to an answer.

This tradition dates back to Greek Antiquity when, for instance, Eratosthenes measured the angle of the sun at different locations, and used geometry to compute the circumference of the globe.

In this setting, we have a clear research question: what is the circumference of the globe? To address this question, Eratosthenes used geometry to identify where he would find the answer: in the angle of the sun at different locations. Then, he worked out how to accurately measure that information: using a gnomon.
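
The arithmetic behind the result is a simple proportion, reproduced below in Python with the figures usually quoted for this measurement (an angle difference of about 7.2 degrees between two cities roughly 5,000 stadia apart); treat the exact numbers as illustrative.

```python
# Figures commonly quoted for Eratosthenes' measurement; treat them
# as illustrative rather than exact historical values.
angle_difference_deg = 7.2  # difference in the sun's angle between the two cities
distance_stadia = 5000      # distance between Alexandria and Syene

# The angle difference is the fraction of a full 360-degree circle
# that the distance spans, so scale the distance up accordingly.
circumference_stadia = distance_stadia * 360 / angle_difference_deg
print(circumference_stadia)  # 250000.0 stadia
```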

The research project presented three clear steps that could be addressed independently. This paradigm is now shifting. The problem big data poses for academics, and data scientists in general, is that extracting value and insight from the data requires doing two things simultaneously: discovering the informational content and how to extract it. The ‘where’ and the ‘how’ must be worked out at the same time.

This is why the comparison with an Eldorado and a gold rush is appropriate: we know there is valuable information to be extracted from big data, but we do not know exactly where to find it in the data set, or how to extract it.

Consider the Twitter memes we introduced above. The research question of interest in this example is the evolution of memes and their drivers. To address this question, we must define where this information can be found and how to extract it.

Both questions start from the intuition that the data set (tweets published online) contains a massive amount of information about memes: which ones are currently trending, how they form and spread, and so on. However, it is unclear where that information can be found in the corpus (should memes be detected automatically? should a domain expert list them?). There is also no established methodology for extracting the information once located (what measures should we use to describe meme evolution? how do these measures relate? which measures are computationally feasible?). Finally, all these questions must be answered simultaneously to coherently address the problem.
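
To make the ‘what measures’ question concrete, one candidate measure, assumed here purely for illustration, is the cumulative number of distinct adopters of a meme per day. A minimal Python sketch over invented observations:

```python
from datetime import date

# Hypothetical (user, day) sightings of a single meme.
observations = [
    ("alice", date(2016, 3, 1)),
    ("bob",   date(2016, 3, 1)),
    ("carol", date(2016, 3, 2)),
    ("alice", date(2016, 3, 3)),  # repeat adopters are not re-counted
    ("dave",  date(2016, 3, 3)),
]

# Cumulative number of distinct adopters at the end of each day.
seen, curve = set(), {}
for user, day in sorted(observations, key=lambda obs: obs[1]):
    seen.add(user)
    curve[day] = len(seen)

for day, adopters in sorted(curve.items()):
    print(day, adopters)  # 2016-03-01 2, 2016-03-02 3, 2016-03-03 4
```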

The most successful solution to date is to frame very focused research questions and aim to capture only the information you are interested in, and nothing else. Defining the scope of the project too broadly runs the risk of corrupting the extracted information or limiting the ability to extract it in a scalable way. In effect, to reduce the confusion caused by addressing both the ‘where’ and the ‘how’ at the same time, research questions must be very precise and well defined.

Using big data to understand networks

The need for a focused approach can be made more concrete by considering network analysis. A network is made up of individual entities and the relationships between them. These range from human social networks and neural networks to electrical networks and protein interaction networks.

We would expect that methods for analyzing one type of network would apply to others. This is true in principle; however, the research problem of interest differs for each type of network. For instance, when considering a social network, we are often interested in how communities form and interact. When considering a protein interaction network, we are interested in the role a particular protein plays in the overall network.

In terms of methodology, simultaneously addressing the global-scale community problem (in social networks) and the local-scale role problem (in protein networks) remains an open question. But by addressing each problem separately, efficient methods can be devised. The research question must therefore focus on one of the two aspects: either local or global.

The incompatibility of the two approaches is due to the very different mathematical tools used to address both problems. For community detection, the network is regarded as a large matrix, and tools usually rely on the eigenvectors of that matrix.
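
As a rough sketch of this algebraic approach (assuming Python with numpy and networkx, on a toy graph invented for the example), the sign pattern of the Laplacian's second eigenvector, the Fiedler vector, is a classical way to split a graph into two communities:

```python
import networkx as nx
import numpy as np

# Toy graph made up for the example: two triangles joined by one edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

# The network viewed as a matrix: the graph Laplacian L = D - A.
L = nx.laplacian_matrix(G).toarray().astype(float)

# Eigenvector of the second-smallest eigenvalue (the Fiedler vector).
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]

# The sign of each entry assigns the corresponding node to a community.
for node, value in zip(G.nodes(), fiedler):
    print(node, "community A" if value >= 0 else "community B")
```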

On the other hand, to determine the role of each node, the network is considered as a graph, and the occurrences of small shapes (triangles, for instance) around a given node are enumerated. The former approach being algebraic in nature, and the latter combinatorial, it is not straightforward to articulate the two in a single method.
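
The combinatorial side can be sketched just as briefly. Counting the triangles through each node is the simplest such shape count; the snippet below (again assuming networkx, using its built-in karate club graph) ranks nodes by triangle count as a crude indicator of local role:

```python
import networkx as nx

# Zachary's karate club: a classic small social network that ships
# with networkx, used here purely as a convenient example.
G = nx.karate_club_graph()

# Number of triangles through each node: the simplest "small shape"
# count, and a crude indicator of a node's local role.
triangles = nx.triangles(G)

# Rank nodes by triangle count to surface locally dense positions.
for node in sorted(triangles, key=triangles.get, reverse=True)[:5]:
    print(node, triangles[node])
```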

One set of tools is needed to identify what community a given agent belongs to, and another to describe the role of a given agent. However, there isn’t a tool that does both. As a consequence, it is necessary to decide which of the two aspects is to be studied in advance.

A threefold challenge

Like other sectors such as healthcare, retail, capital markets, and energy, academia sees the potential of big data. To harness this potential, academics face a threefold challenge: assembling teams with varied expertise, working on very focused yet meaningful research problems, and taking ethical questions into account. Many such efforts are appearing in academia, and we can hopefully expect great breakthroughs in the near future.
