(My partner Bryan Senseman is co-author of this blog.)
Over the last month or so, I’ve assembled a few scripts to showcase the web-scraping capabilities of R and python. The tasks are generally pretty simple, involving web data that looks like tables and ultimately resides in either R or python/Pandas dataframes. Not surprisingly, the supporting community-developed libraries generally work well, combining with the power of the languages to deliver painless programming solutions for the tasks at hand.
For the most part…
Perhaps one in every five times I run the scripts, they fail to connect to the scrape sites or otherwise come back without data. I could have invested additional programming effort in handling the exceptions, but didn’t, satisfied that I could always make the scripts work by hand. Alas, I got “caught” by the errors as I ran the scripts for my obsessive partner, Bryan. And, of course, he let me have it, teasing that I program like a “stats guy”. Feigning annoyance, I retorted that it’s in the DNA of stats guys to get answers quickly – and move on to the next challenge. Unlike techies such as him.
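For what it’s worth, the exception handling I skipped wouldn’t have taken much. A minimal sketch in python (the function and parameter names here are my own illustration, not from the original scripts): a generic retry wrapper that backs off between attempts and re-raises the last error only if every attempt fails.

```python
import time

def with_retries(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any failure with a growing pause.

    fetch is whatever function actually hits the scrape site; it should
    raise an exception on a failed connection or a bad response.
    """
    last_err = None
    for i in range(attempts):
        try:
            return fetch(url)
        except Exception as err:  # connection refused, timeout, bad payload...
            last_err = err
            time.sleep(delay * (i + 1))  # back off a little more each time
    raise last_err  # all attempts failed; surface the final error
```

Wrapping each flaky scrape call in `with_retries` would have turned most of those one-in-five failures into a short pause rather than a crash in front of Bryan.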
Indeed, Bryan and I couldn’t be more different in our approaches. I’m the quintessential data geek, always posing new evidence-demanding questions, always looking for the latest analytics, visuals and learning methods, always investigating the newest data analysis software. Technology is simply a means to an end.
With my “ideal” customer, I perform exploration and predictive analytics exercises that end in markdown scripts written in R or python, on engagements measured in weeks or months. Once my work has been successfully completed and is demonstrably reproducible – the code works with the accompanying static data sets – my customers prefer that I invest any leftover time in additional analysis.
Bryan, in contrast, is the prototypical engineer. There’s no new architecture he can’t understand, no new technology he can’t get to work quickly, no business problem he can’t translate to systems-speak, and no systems challenge he can’t design and implement.
For Bryan, project success is all about deliverables that are correct and repeatable, wherein the code correctly executes with unseen data. (Bryan jokes that the data science reproducible standard means the code ran once.) His team methodically develops until the code works, going so far as to include negative logic (when a date field isn’t a date), ultimately cataloging the fixes to discovered problems. In the end, it’s all about correct, modular, automatable and maintainable.
My company, Inquidia, is a professional services firm. We specialize in helping customers implement applications that use data and analytics both as a means of evaluating organizational performance and as revenue-producing products. The business revolves around Enterprise Analytics, where we assist clients in developing and implementing systems that promote the use of analytics in the organization, and Data Science, where we help customers learn from data, exploring and modeling for business benefit. The Enterprise Analytics work is more aligned with technology and information systems, while Data Science is driven by statistics and algorithms. Computation is central to both. In Inquidia’s world, Bryan is Enterprise Analytics; Steve is Data Science.
EA and DS have converged for Inquidia on a very exciting project for which we’re developing sophisticated forecasting models using cloud-based computation and web deployment. Phase 1 involved lots of statistical exploration/modeling along with a high level proof of concept of the cloud as the computation platform. Phase 2 is design/build/implement the statistical results into production.
Over the last six months, the Enterprise Analytics team has grudgingly learned to appreciate the uncertainty, randomness, and absence of signal in the data. Ever the planners, they found it quite an education to discover that statistical forecasting efforts may yield less than satisfying results, even after weeks of exploration.
How is it, they asked, that you cannot identify models with confidence after six weeks of grind? And how can it be that certain models work for some series but not for others? Fortunately for Inquidia, we were able to cobble together a methodology from a selection of disparate models that seems to meet the forecasting needs. Auspiciously, the EA team began to empathize with the tribulations of data science.
The DS team, on the other hand, has had to buckle down on the programming side, understanding that its code must approach bulletproof – that the users running its algorithms might not be the forgiving analysts the team is accustomed to working with.
The DS’s are learning that in production apps, it’s not enough that the work be reproducible; it must also meet the more rigorous repeatable standard of development. Both code and outputs must be as simple as practical. A recent example of first-pass DS programming: ((FLOOR(("table"."yyyymm" / 100)) * 12) + FLOOR((100 * ((("table"."yyyymm" / 100) - 0.001) % 1)))) was simplified to ((("table"."year" * 12) + "table"."month") - 1). Both work; the latter is clearer and more maintainable.
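It’s easy to convince yourself the two expressions agree. A quick python sketch (the function names are mine, written only to mirror the SQL arithmetic) computes the month index both ways and checks that they match:

```python
import math

def month_index_original(yyyymm):
    """Mirror the first-pass SQL: pull year and month out of YYYYMM arithmetically."""
    return (math.floor(yyyymm / 100) * 12
            + math.floor(100 * ((yyyymm / 100 - 0.001) % 1)))

def month_index_simplified(year, month):
    """Mirror the cleaned-up SQL: a plain months-since-year-zero count."""
    return year * 12 + month - 1

# Spot check: December 2013 under both formulations
assert month_index_original(201312) == month_index_simplified(2013, 12)
```

The simplified form also makes the intent obvious at a glance: it’s just a count of months, which is exactly what the downstream models needed.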
Another difference between the DS and EA approaches involves understandability. A few weeks back, we needed to graphically display the predictive results of various analytic models over time. The EA standard was to label data points in time series visuals as MM/YYYY, sorted on the ordinal YYYYMM. The models deployed by the DS’s, however, required the “time periods” to be sequential without holes (so YYYYMM or MMYYYY didn’t work). In response, they created a numberofmonths field which worked perfectly in the model. When charted, however, an X value of 24167 wasn’t readily translatable to 12/2013 – except maybe by the DS’s – even though the plots were otherwise identical. The understandable fix was quickly made.
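The understandable fix amounts to inverting the month index for display. Assuming numberofmonths is computed as year * 12 + month - 1 (consistent with 24167 mapping to 12/2013), a small python helper along these lines – illustrative, not the project’s actual code – recovers the MM/YYYY label:

```python
def month_label(index):
    """Turn a sequential month index (year * 12 + month - 1) back into MM/YYYY."""
    year, month0 = divmod(index, 12)  # month0 is zero-based: 0 = January
    return f"{month0 + 1:02d}/{year}"

# The opaque axis value from the charts becomes a readable period label
print(month_label(24167))  # the 12/2013 case from the project
```

With labels like these on the X axis, the model keeps its hole-free sequential index while the charts read the way the EA standard intended.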
Perhaps DS is from Venus and EA from Mars, as suggested in an illuminating Harvard Business Review article entitled “Why IT Fumbles Analytics”. Authors Donald A. Marchand and Joe Peppard write as if they were present in many of our project’s early status meetings: “The conventional approach to an IT project, such as the installation of an ERP or a CRM system, focuses on building and deploying the technology on time, to plan, and within budget.
The information requirements and technology specifications are established up front, at the design stage, when processes are being reengineered…. a big data or analytics project can’t be treated like a conventional, large IT project, with its defined outcomes, required tasks, and detailed plans for carrying them out. The former is likely to be a much smaller, shorter initiative. Commissioned to address a problem or opportunity that someone has sensed, such a project frames questions to which the data might provide answers, develops hypotheses, and then iteratively experiments to gain knowledge and understanding.”
Tellingly, according to the authors, project organizations must differ between IT (EA) and DS endeavors, with EA adopting traditional project management driven by a preoccupation with planning, design, development, deployment, training and organizational change. The methodology of DS, in contrast, is discovery-driven, obsessing on theories/hypotheses, exploration of relevant data, experimentation, refinement – and repetition of same.
Yet project success for us must meet the objectives of both EA and DS – delivering the desired process change on time, within budget, while simultaneously promoting an evidence-based culture driven by data and analytics. No one said it’d be easy.
In the end, these are the beginnings of important lessons learned about Enterprise Analytics and Data Science collaboration in the emerging world of broadly-deployed analytical applications.
By Steve Miller and Bryan Senseman, from: http://www.information-management.com/blogs/big-data-analytics/developers-vs-data-scientists-different-approaches-to-a-common-goal-10028416-1.html