The Importance of Data Scientists in the Era of Big Data

We live in the Age of Data, there is no doubt about it. Whether you are waking up in the morning, checking your e-mail, turning on your TV or reading the newspaper, you are surrounded and influenced by data at all times. This is especially true since the rise of mobile devices, which keep a door open to a universe of data with us all day long. But what most people do not realize is that they are not only data consumers but also data producers. Every time you browse the internet, your actions are being tracked. Every time you buy groceries, your shopping list is being recorded for analysis. Every time you listen to a song on iTunes, Spotify, etc., the music industry gets a step closer to understanding its customers. And these are just a few examples of how companies are being flooded with data every day. That data is crucial to understanding their business and their customers, and failing to analyze it will result in many missed opportunities.

Too much information is as problematic as too little. Big Data is the term that has recently become associated with the process of dealing with unmanageable amounts of information. It has become a big buzzword, and top influencers such as McKinsey and IBM are talking about it. Even the popular JP Morgan summer reading list of 2013 contained a book about Big Data.

When dealing with Big Data, traditional Business Intelligence techniques are no longer enough: companies get too much information from too many sources, and the data no longer fits in an Excel spreadsheet. So the big players in this game are the Data Scientists. These are the professionals capable of slicing and dicing this huge amount of information, creating complex algorithms that identify patterns to be used for business decisions, and translating all these bits and bytes into a language that can be understood by the stakeholders.

But the role of a Data Scientist is not always well defined. People wonder if they should be classified as Scientists, Engineers, Statisticians, Software Developers… and the truth is they are a mix of all of these. The process of understanding Big Data goes through lots of different phases, each of them requiring a unique set of skills. A visualization created by Drew Conway shows this intersection and is a great summary of what the Data Scientist role is about. There is also a great video in which he explains his thought process when creating the diagram.


Drew Conway Data Science Venn Diagram


First of all, infrastructure needs to be deployed to store and retrieve huge amounts of data. Traditional relational databases struggle with this job because they are hard to scale out. They are good at storing structured data, but this focus on structure makes them difficult to distribute across multiple machines. Therefore, NoSQL databases such as Cassandra, HBase and MongoDB are the choice for many Data Scientists. Especially popular is the distributed file system HDFS (Hadoop Distributed File System), which is the core part of the Hadoop framework along with MapReduce. HDFS is an open source system modeled on GFS (Google File System), which allowed Google to index a huge part of the vast internet and make it usable for its customers. Setting up the infrastructure, understanding the schemas and distributing your data across a computer cluster require a good amount of Hacking Skills, especially if that data has to be cleaned and refined.

But storing the data is not enough. It also has to be understood, and this requires substantive expertise about the specific market at hand. It requires constant communication with, and understanding of, the different business units of an organization in order to define relevant hypotheses that will be supported or rejected by the data. Once these hypotheses have been defined, the data cannot be manually examined to identify relevant information, because there is simply too much of it. Therefore, statistical analysis and machine learning techniques are the most suitable tools for this task. Clustering, regression, statistical inference, Support Vector Machines (SVM), decision trees… are very common terms in the Data Science vocabulary. But the use of those techniques is not as straightforward as in the past, since they have to be parallelized across multiple computers using techniques such as MapReduce. And finally, when data has been processed and analyzed, it has to be presented, so Data Visualization is an area that needs to be mastered by a Data Scientist. There is no value in extracting relevant information if it cannot be properly communicated.
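To make the MapReduce idea mentioned above a bit more concrete, here is a minimal sketch in plain Python that computes the average purchase amount per customer. It runs on a single machine and only imitates the map, shuffle and reduce steps. The function names, the record layout and the sample data are all made up for illustration, and none of this is Hadoop's actual API.

```python
from collections import defaultdict
from functools import reduce


def map_record(record):
    """Map step: turn one (customer, amount) record into (key, (sum, count))."""
    customer, amount = record
    return customer, (amount, 1)


def shuffle(pairs):
    """Shuffle step: group all mapped values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_values(values):
    """Reduce step: merge partial (sum, count) pairs into a single pair."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)


def mean_per_key(chunks):
    """Run map, shuffle and reduce over a list of data chunks.

    Each chunk stands in for the slice of data one machine would hold
    (an assumption of this sketch; here everything runs in one process).
    """
    mapped = [map_record(record) for chunk in chunks for record in chunk]
    result = {}
    for key, values in shuffle(mapped).items():
        total, count = reduce_values(values)
        result[key] = total / count
    return result


if __name__ == "__main__":
    # Hypothetical sample data, split into two chunks.
    chunks = [
        [("ana", 10.0), ("bob", 4.0)],
        [("ana", 20.0)],
    ]
    print(mean_per_key(chunks))
```

The reason for carrying a (sum, count) pair instead of an average is that partial sums and counts can be merged in any order, so each machine can reduce its own chunk first and the partial results can still be combined correctly afterwards.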

Data Scientist has been called the "sexiest job of the 21st century" by the Harvard Business Review. Also, some universities, such as Columbia University and New York University, have already jumped on the train by creating programs with very specific training for this type of professional. Over the past few years there has been a rise in companies demanding people with this set of skills. And there will be more to come as companies discover the hidden opportunities behind Big Data and the technologies to process it mature. So being an early adopter of Data Science and understanding its importance is going to be a key differentiator for any young professional or any company looking for success. Are you ready to jump on that train with me?