I recently went to an amazing data science meetup, which I highly recommend to all New York-based folk interested in all things data – Data Driven NYC, hosted (and generously catered) by Bloomberg in their beautiful 59th Street office.
Founder of Insight Data Science speaking at Data Driven NYC. This meetup had more than 300 people in attendance, and at least 150 more on the waiting list.
At this particular meetup, one of the speakers, the head of analytics at Knewton, used a great term that I hadn’t heard before – data empathy. Good data scientists empathize with where the data comes from. What an amazing point!
- Data will truly speak to you only if you really understand where it comes from. Who generates it? How? Why? What are their priorities? Motivations? Knowing this will make the data come alive; not knowing it will lead to many suboptimal decisions.
- Data is always messy – understanding why it’s messy makes cleaning it much more efficient and less frustrating.
And speaking of frustration with data cleanup and data empathy – it seems to me that on this particular point, electrical engineers (EE folk) turned data scientists have a noticeable advantage over, say, computer scientists and applied mathematicians! Nothing develops a deep understanding of everything that can go wrong with data collection better than instrumenting your own studies. Sitting in the lab verifying just how well your sensor is calibrated. Spending a day on a 2-minute experiment because the results keep coming back suspiciously low, and you cannot tell whether the values really are this low or your device is simply broken. Repeating a week’s worth of work because you recorded your data with the wrong dynamic range setting.
I am reminded of this because I am putting the finishing touches on our painstakingly conducted human motion energy study paper, which was recently accepted to a top-tier conference. Currently, having transitioned to a business role, I work with data that has nulls and strange outliers and odd matches and mismatches – but at least I am no longer literally running around in circles to verify my measurements’ calibrations.
To validate the data from , we replicated the measurements using our sensing units [pdf].