
Data Scientists and Data Engineers like Python and Scala
Python and Scala are popular among members of several well-attended SF Bay Area Meetups
In exchange for getting personalized recommendations, many Meetup members declare topics that they’re interested in. I recently looked at the topics listed by members of a few local data Meetups that I’ve frequented. These Meetups vary in size from 600 to 2,000 total (and 400 to 1,100 active) members.
I was particularly interested in the programming languages members expressed interest in. What I found confirmed trends that we’ve noticed in other data sets (such as online job postings): Python has surpassed R among data scientists and data engineers, Scala is second to Java among JVM languages, and many folks are interested in JavaScript. As PyData tools mature, I’ve encountered people who have shifted more of their data workflow from R over to Python.

23andMe flap at FDA indicates fundamental dilemma in health reform
We must go beyond hype for incentives to provide data to researchers
The FDA order stopping 23andMe from offering its genetic test kit strikes at the heart of the major issue in health care reform: the tension between individual care and collective benefit. Health is not an individual matter. As I will show, we need each other. And beyond narrow regulatory questions, the 23andMe issue opens up the whole goal of information sharing and the funding of health care reform.

Data Wrangling gets a fresh look
We are in the early days of productivity technology in data science
Data analysts have long lamented the amount of time they spend on data wrangling. Rightfully so, as some estimates suggest they spend a majority of their time on it. The problem is compounded by the fact that these days, data scientists are encouraged to cast their nets wide and investigate alternative (unstructured) data sources. The general perception is that data wrangling is the province of programmers and data scientists, but spend time around Excel users and you’ll learn that they do quite a bit of data wrangling too!
In my work I tend to write scripts and small programs to do data wrangling. That usually means some combination of SQL, Python, and Spark. I’ve played with Google Refine (now called OpenRefine) in the past, but I found the UI hard to get used to. Part of the problem may have been that I didn’t use the tool often enough to become comfortable.
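As a rough illustration of what those scripts look like, here is a minimal sketch (not an actual excerpt from my workflow; the file paths and record layout are hypothetical), written in Scala against the Spark API: a small job that cleans a raw export and aggregates it before any downstream analysis.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object WranglingSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "wrangling-sketch")

        // Hypothetical raw CSV export with three fields: id, timestamp, amount
        val raw = sc.textFile("data/raw_events.csv")

        val cleaned = raw
          .map(_.split(","))
          .filter(_.length == 3)  // drop malformed rows
          .filter(f => f(2).nonEmpty && f(2).forall(c => c.isDigit || c == '.'))  // keep numeric amounts
          .map(f => (f(0).trim, f(2).toDouble))

        // Aggregate amounts per id before handing off to analysis or charting
        cleaned.reduceByKey(_ + _).saveAsTextFile("data/cleaned_totals")

        sc.stop()
      }
    }

Even a short script like this replaces a surprising amount of manual spreadsheet work, though it still leaves the chart-and-inspect steps to other tools.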
For most users, data wrangling still tends to mean a series of steps that usually involves different tools (e.g., you often need to draw charts to spot outliers and anomalies). As I’ve pointed out in previous posts, workflows that involve many different tools require a lot of context-switching, which in turn affects productivity and impedes reproducibility.
We are washing our data at the side of the river on stones. We are really in the early, early ages of productivity technology in data science.
Joe Hellerstein (Strata-NYC 2012), co-founder and CEO of Trifacta

Pulling a Dick Cheney, context, and just getting started in data journalism
The New York Times is replacing Nate Silver’s FiveThirtyEight blog (which Silver took to ESPN back in July) with a brand new site intended to “produce clear analytical reporting and writing on opinion polls, economic indicators, politics, policy, education, and sports.” The venture will be headed by D.C. bureau chief David Leonhardt, who also helmed the search committee and selected himself for the job. Naturally, his colleagues are teasing Leonhardt for “pulling a Dick Cheney.” The new team will also include presidential historian Michael Beschloss, Nate Cohn of The New Republic, and economist Justin Wolfers.
Take it from me – if you are short on time, do not even attempt to play around on the new Spending Stories website. Developed by the folks at Open Knowledge Foundation and Journalism++, Spending Stories is intended to help journalists understand and contextualize spending data by making easy comparisons to other data. For example, using the site, I was able to see that US$15,000 is equal to 3% of private ambulance costs in Yorkshire, England; 0.02% of the cost of the contract awarded to IT company CGI for implementing healthcare.gov; and 90% of government spending per person per year in the UK in 2012. It’s a fun tool!

Behind the Scenes of the First Spark Summit
How it All Started
Spark is a popular open source cluster computing engine for Big Data analytics and a central component of the Berkeley Data Analytics Stack (BDAS). It started as a research project in the UC Berkeley AMPLab and was developed with a focus on attracting production users as well as a diverse community of open source contributors. A community quickly began to grow around Spark, even as a young project in the AMPLab. Before long, this community began gathering at monthly meetups and using mailing lists to discuss development efforts and share their experiences using Spark. More recently the project entered the Apache Incubator.
This year, the core Spark team spun out of the AMPLab to found Databricks, a startup that is using Spark to build next-generation software for analyzing and extracting value from data. At Databricks, we are dedicated to the success of the Spark project and are excited to see the community growing rapidly. This growth demonstrates the need for a larger event that brings the entire community together beyond the meetups. Thus we began planning the first Spark Summit. The Summit will be structured like a super-sized meetup. Meetups typically consist of a single talk, a single sponsor, and dozens of attendees, whereas the Summit will consist of 30 talks, 18 sponsors, hundreds of attendees, and a full day of training exercises.
We understand that an open source project is only as successful as its underlying community. Therefore, we want the Summit to be a community-driven event.
Behind the Scenes with the Summit Ops and the Program Committee
The first thing we did was bring in a third-party event producer with a track record of creating high-quality open source community events. By separating out event production, we allowed all of the community leaders to share ownership of the technical portion of the event. For example, instead of inviting speakers directly, we hosted an open call for talk submissions. Then we assembled a Program Committee consisting of representatives from 12 of the leading organizations in the Spark community. Finally, the PC members voted on all of the talk submissions to decide the final Summit agenda.
The event has been funded by assembling a sponsor network consisting of organizations within the community. With sponsors that are driving the development of the platform, the summit will be an environment that facilitates connections between developers with Spark skills and organizations searching for such developers. We decided that Databricks would participate in the Summit as a peer in this sponsor network.

Day-Long Immersions and Deep Dives at Strata Santa Clara 2014
Tutorials for designers, data scientists, data engineers, and managers
As the Program Development Director for Strata Santa Clara 2014, I am pleased to announce that the tutorial session descriptions are now live. We’re pleased to offer several day-long immersions including the popular Data Driven Business Day and Hardcore Data Science tracks. We curated these topics as we wanted to appeal to a broad range of attendees including business users and managers, designers, data analysts/scientists, and data engineers. In the coming months we’ll have a series of guest posts from many of the instructors and communities behind the tutorials.
Analytics for Business Users
We’re offering a series of data-intensive tutorials for non-programmers. Conrad Carlberg will lead a time-series forecasting tutorial aimed at Excel users – a topic that should appeal to those tasked with business forecasting. Grammar of Graphics author, SYSTAT creator, and noted statistician Leland Wilkinson will teach an introductory course on analytics using an innovative expert system he helped build.
Data Science Essentials
Scalding – a Scala API for Cascading – is one of the most popular open source projects in the Hadoop ecosystem. Vitaly Gordon will lead a hands-on tutorial on how to use Scalding to put together effective data processing workflows. Data analysts have long lamented the amount of time they spend on data wrangling. But what if you had access to tools and best practices that would make data wrangling less tedious? That’s exactly the tutorial that distinguished Professors and Trifacta co-founders, Joe Hellerstein and Jeff Heer, are offering.
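To give a sense of the Scalding API mentioned above, here is the canonical word-count job (a generic sketch, not material from the tutorial); input and output paths are supplied as command-line arguments.

    import com.twitter.scalding._

    // Read lines of text, split them into words, and count occurrences of each word.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))
        .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
        .groupBy('word) { _.size }
        .write(Tsv(args("output")))
    }

The same job runs unchanged in Scalding’s local mode or on a Hadoop cluster, which is a big part of the API’s appeal.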
The co-founders of Datascope Analytics are offering a glimpse into how they help clients identify the appropriate problem or opportunity to focus on by using design thinking (see the recent Datascope/IDEO post on Design Thinking and Data Science). We’re also happy to reprise the popular (Strata Santa Clara 2013) d3.js tutorial by Scott Murray.

How companies are using Spark
The inaugural Spark Summit will feature a wide variety of real-world applications
When an interesting piece of big data technology gets introduced, early adopters tend to focus on technical features and capabilities. Applications get built as companies develop confidence that it’s reliable and that it really scales to large data volumes. That seems to be where Spark is today. With over 90 contributors from 25 companies, it has one of the largest developer communities among big data projects (second only to Hadoop MapReduce).
I recently became an advisor to Databricks (a startup commercializing Spark) and a member of the program committee for the inaugural Spark Summit. As I pored over submissions to Spark’s first community gathering, I learned how companies have come to rely on Spark, Shark, and other components of the Berkeley Data Analytics Stack (BDAS). Spark is at that stage where companies are deploying it, and the upcoming Spark Summit in San Francisco will showcase many real-world applications. These applications cut across many domains including advertising, marketing, finance, and academic/scientific research, but can generally be grouped into the following categories:
Data processing workflows: ETL and Data Wrangling
Many companies rely on a wide variety of data sources for their analytic products. That means cleaning, transforming, and fusing (unstructured) external data with internal data sources. Many of them – particularly startups – use Spark for these types of data processing workflows. There are even companies that have created simple user interfaces that open up batch data processing tasks to non-programmers.
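As a hedged sketch of what such a workflow can look like (hypothetical file names and record layouts, written in Scala against the Spark RDD API), the job below parses a semi-structured external feed, normalizes the join key, and fuses it with an internal reference table.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object EtlSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "etl-sketch")

        // External, semi-structured feed, e.g. "user@example.com | 2013-11-01 | clicked"
        val external = sc.textFile("data/external_feed.txt")
          .map(_.split('|').map(_.trim))
          .filter(_.length == 3)
          .map(f => (f(0).toLowerCase, (f(1), f(2))))  // key by normalized email

        // Internal reference table, e.g. "user@example.com,acct-42,premium"
        val internal = sc.textFile("data/internal_accounts.csv")
          .map(_.split(','))
          .filter(_.length == 3)
          .map(f => (f(0).toLowerCase, (f(1), f(2))))

        // Fuse the two sources on the normalized key and persist the result
        external.join(internal).saveAsTextFile("data/fused_events")

        sc.stop()
      }
    }

Production pipelines add schema handling and error reporting, but the shape – parse, normalize, join, persist – is the same.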

Data journalism’s secrets, no more math-bashing, and a new way to create visualizations.
The ProPublica Nerd Blog this week features an article by Hassel Fallas, a data journalist at La Nación in Costa Rica. Fallas was a 2013 Fellow at the International Center for Journalists, where she studied up on Data-Driven Journalism’s Secrets. Spoiler alert: The secret is…don’t keep secrets.
Over at the data-driven journalism blog, A Fundamental Way Data Repositories Must Change offers some fascinating examples of how data has been manipulated in Romania and Rwanda, both historically and in the present day.
Google Chrome’s new extension, Knoema, provides access to more than 500 data repositories, along with visualization tools for use with those databases. Knoema’s CTO says the platform can be used solely as a data source but, more importantly, as a tool for journalists to create embeddable visualizations. Pretty cool.

Software, hardware, everywhere
Software and hardware are moving together, and the combined result is a new medium.
Real and virtual are crashing together. On one side is hardware that acts like software: IP-addressable, controllable with JavaScript APIs, able to be stitched into loosely coupled systems – the mashups of a new era. On the other is software that’s newly capable of dealing with the complex subtleties of the physical world – ingesting huge amounts of data, learning from it, and making decisions in real time.
The result is an entirely new medium that’s just beginning to emerge. We can see it in Ars Electronica Futurelab’s Spaxels, which use drones to render a three-dimensional pixel field; in Baxter, which layers emotive software onto an industrial robot so that anyone can operate it safely and efficiently; in OpenXC, which gives even hobbyist-level programmers access to the software in their cars; in SmartThings, which ties Web services to light switches.
The new medium is something broader than terms like “Internet of Things,” “Industrial Internet,” or “connected devices” suggest. It’s an entirely new discipline that’s being built by software developers, roboticists, manufacturers, hardware engineers, artists, and designers.

One step closer to a two-hour marathon
In September, Wilson Kipsang ran the Berlin Marathon in 2:03:23, shaving 15 seconds off the world record. That means it’s time to check in on the world record progression and update my article from two years ago. The following is a revised version of that article, including the new data point.
Abstract: I propose a model that explains why world record progressions in running speed are improving linearly, and should continue as long as the population of potential runners grows exponentially. Based on recent marathon world records, I extrapolate that we will break the two-hour barrier in 2043.
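One back-of-the-envelope way to see why an exponentially growing pool of runners can yield a linearly improving record (a simplifying sketch with assumed distributions, not necessarily the model developed below): if the fast tail of individual speeds is roughly exponential with rate \lambda above some baseline v_0, the expected best of n attempts grows like \ln n, so

    E[\max_{i \le n} v_i] \;\approx\; v_0 + \frac{\ln n}{\lambda},
    \qquad n(t) = n_0 e^{kt}
    \;\Longrightarrow\;
    \text{record speed at } t \;\approx\; v_0 + \frac{\ln n_0}{\lambda} + \frac{k}{\lambda}\, t.

That is, record speed climbs linearly in time, and extrapolating the fitted line to the speed of a 2:00:00 marathon gives a projected crossing date.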
Let me start with the punchline: