
Business Intelligence at Ask

by Erika Bakse, Head of BI

Ask Media Group manages over 15 web properties and coordinates performance marketing programs across most of these sites. The result? Lots of data.

The Business Intelligence team at Ask is tasked with making sense of all this data. That means wrangling over 200 million web log events a day (more than 1 TB of raw data), then cleansing, structuring, and combining data sources to give our users a holistic view of the business. Here’s how we do it.

Workflow of Data

The bulk of our data is generated by users interacting with our websites. Our homegrown logging system tracks everything on the page, as well as how users interact with the site, as a stream of JSON objects. These events are enriched in flight with extra goodies like geographic and device information, then streamed into Amazon S3, from which they are copied into our Snowflake data warehouse every minute. From there, we cleanse the data and transform it into a traditional dimensional model for analysis. Within 45 minutes of an event occurring on one of our websites, it is available for our business teams to analyze.
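
To make that concrete, here is a minimal sketch of the kind of enriched event we receive and how it might be flattened into a fact-table row; every field name here is illustrative rather than our actual schema.

```python
# Illustrative enriched web log event -- field names and values are hypothetical.
raw_event = {
    "event_id": "5f1c9a",
    "event_type": "page_view",
    "ts": "2018-06-01T17:42:13Z",
    "page_url": "https://www.ask.com/",
    "geo": {"country": "US", "region": "CA"},    # enriched in flight
    "device": {"type": "mobile", "os": "iOS"},   # enriched in flight
}

def to_fact_row(event):
    """Flatten a nested JSON event into columns for a (hypothetical) web events fact table."""
    return {
        "event_id": event["event_id"],
        "event_type": event["event_type"],
        "event_ts": event["ts"],
        "page_url": event["page_url"],
        "geo_country": event["geo"]["country"],
        "device_type": event["device"]["type"],
    }

print(to_fact_row(raw_event))
```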

But that’s only the beginning of the story. Ask boasts a world-class SEM program managing a portfolio of hundreds of millions of keywords, and the data for those campaigns is imported into the data warehouse daily as well. We process revenue data from the various types of ads we run on our own properties. We import information from our content management system to track our content lifecycle and performance. And we manage internal metadata for our revenue reporting and A/B testing.

Once all this data lands in the data warehouse, we build different data marts to serve various parts of the business. For example, we merge our SEM data with our web log and revenue data to get a complete view of a user’s experience on a property: how they entered the site through our marketing efforts, how they interacted with it once there, and how we monetized the session.
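
The join itself is conceptually simple. The sketch below shows the shape of such a session-level mart; the table and column names are purely illustrative, not our actual warehouse schema.

```python
# Illustrative SQL for a session-level mart joining web log, SEM, and revenue data.
# All table and column names are hypothetical stand-ins.
BUILD_SESSION_MART = """
CREATE OR REPLACE TABLE marts.sem_session_facts AS
SELECT
    w.session_id,
    s.campaign_id,
    s.keyword,
    s.cost_per_click,
    COUNT(w.event_id)     AS page_events,
    SUM(r.revenue_amount) AS session_revenue
FROM dw.web_log_events w
LEFT JOIN dw.sem_clicks s ON w.click_id   = s.click_id
LEFT JOIN dw.ad_revenue r ON w.session_id = r.session_id
GROUP BY 1, 2, 3, 4
"""
```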

Using the Data

Our internal users have a variety of options for interacting with the data warehouse. Snowflake is accessible via web browser, ODBC driver, JDBC driver, Python connector, Spark connector, R connector—you name it, we’ve got it. This lets our users work with whichever front-end tool best fits their needs.
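
As an example of the Python path, here is a minimal sketch using the Snowflake Python connector; the account, warehouse, and table names are placeholders, not our real configuration.

```python
import snowflake.connector

# Hypothetical connection details -- in practice credentials come from a secrets store.
conn = snowflake.connector.connect(
    user="jane.analyst",
    password="********",
    account="xy12345.us-east-1",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="MARTS",
)

try:
    cur = conn.cursor()
    # Query the illustrative session mart from the previous sketch.
    cur.execute(
        "SELECT campaign_id, SUM(session_revenue) AS revenue "
        "FROM sem_session_facts GROUP BY campaign_id ORDER BY revenue DESC LIMIT 10"
    )
    for campaign_id, revenue in cur:
        print(campaign_id, revenue)
finally:
    conn.close()
```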

The BI team also provides direct support for Looker and Alteryx. Looker is a great browser-based data visualization tool with fantastic semantic modeling capabilities, so users can focus more on answering questions than on writing SQL queries. Alteryx is our reporting workhorse: large reports containing thousands of data points go out to business teams every hour. It also handles our trickier database pulls and provides a front end for our metadata management workflows.

Ultimately our job is to remove any and all barriers between business and data, and we are constantly evaluating the best ways to do just that.


Maximizing the Impact of Data Scientists

by John Bryant, VP of Data Science

We can’t overstate the value of data scientists being able to push code into production. In some organizations, a data scientist has to wait for an engineer to translate their prototype code into production, leading to delays, disagreements, and disappointment. Barriers to production are barriers to impact, and a good data scientist is too valuable to hamstring in this way.

At Ask Media Group, we have enabled our data scientists to push software into the same environment that contains the rest of our production systems and platform. However, empowering data scientists and imbuing them with independence does not mean replacing engineering excellence. It just means that we have found a better way for data science and engineering to collaborate.

As an analogy, consider Hadoop/HDFS, which hides the complexity of parallel processing over huge datasets and lets the data scientist focus on their big data task. Much like Hadoop, our method for data science and engineering collaboration is an instantiation of separation of concerns. We define an abstraction: the data science side focuses on the application, commonly a data product, while the engineering side manages the details of the production environment. The power of this separation is that both data scientists and engineers get to do what they love, and thus have the opportunity to excel.

For example, one of our data products categorizes user queries into topics such as Vehicles, Travel, and Health. More precisely, it’s a multi-label categorization system with 24 top-level categories, each with 10-20 second-level categories beneath it, for a total of nearly 300 potential labels. For this system to be useful to the business, it needs a response time under 10 milliseconds, must scale to 2,000 requests per second, and must achieve 90% precision with coverage greater than 70%.

Our initial Python prototype, built with our go-to modeling toolkit (scikit-learn), involved complex features and learning; although it met our precision target, it didn’t hit the response time requirement. So we leveraged a technique called uptraining, using the very precise prototype to automatically annotate a large number of queries with highly accurate labels. In turn, we used that data to build a faster, simpler system with bag-of-words and word embedding (word2vec) features for multi-label logistic regression. With the resulting system, we are able to achieve both our response time and precision targets.
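
The fast model is straightforward to express in scikit-learn. The sketch below shows the bag-of-words half of the approach feeding one-vs-rest logistic regression; the queries and labels are made up, and the word2vec features are omitted for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data standing in for queries auto-annotated by the high-precision prototype.
queries = ["cheap flights to denver", "2018 suv lease deals", "flu symptoms in adults"]
labels = [["Travel"], ["Vehicles"], ["Health"]]

# Encode the label sets as a binary indicator matrix for multi-label learning.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Bag-of-words features feeding one logistic regression per label (one-vs-rest).
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(queries, y)

# Predict topic labels for a new query.
print(mlb.inverse_transform(model.predict(["best hotels in rome"])))
```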

To push a system like this into production, we need to concern ourselves with a lot more than precision and raw speed. In our production environment, apps run as services within containers on OpenShift, using linkerd to route requests. To hide the complexity of the environment, our data engineering team has provided a service creation abstraction on top of OpenShift that deals with deploying, versioning, scaling, routing, monitoring, and logging—all aspects of production systems for which a data scientist may have limited expertise.

Using this service creation abstraction, we can simply wrap the Python query categorization code in Flask and use a few Docker commands to make sure the model and all necessary libraries are loaded into the image. After code review and software/performance testing for our service, a single click in GitLab CI pushes our fast, highly accurate, scalable multi-label query topic prediction service to production.
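
Concretely, the Flask wrapper is only a few lines. This is a minimal sketch, assuming the trained pipeline and label binarizer were serialized with joblib and baked into the Docker image; the paths and route name are hypothetical.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical artifact paths -- the model files are assumed to be copied into the image.
model = joblib.load("model/query_categorizer.joblib")
mlb = joblib.load("model/label_binarizer.joblib")

@app.route("/categorize", methods=["POST"])
def categorize():
    """Return the predicted topic labels for a single query."""
    query = request.get_json()["query"]
    predicted = mlb.inverse_transform(model.predict([query]))[0]
    return jsonify({"query": query, "labels": list(predicted)})

if __name__ == "__main__":
    # In production this runs behind a WSGI server inside the container.
    app.run(host="0.0.0.0", port=8080)
```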

What’s awesome about this process is that a data scientist can finish preparing a model in the morning and deploy it that afternoon. Our data scientists are free to focus on leveraging data science to create business impact, without production concerns getting in their way.