Machine Learning

Practitioner

MLPractitioner.com

We are on your path to a practice in Machine Learning and Artificial Intelligence

Consider MLP your resource and information site for news, tools, and techniques to maintain your skill set.

Tools for Practitioners of Machine Learning and A.I.

What is Dataiku Data Science Studio?

Dataiku DSS is the collaborative data science software platform for teams of data scientists, data analysts, and engineers to explore, prototype, build, and deliver their own data projects more efficiently.

DSS shines as a freemium SaaS product that offers a cohesive workbench for working through machine learning problems interactively. Its true value is how the tools and APIs make it much easier, and in many cases possible at all, to work on big data science projects with a team, no matter where everyone is. Everything is in the cloud and it is secure.

Dataiku Data Science Studio

Install Locally on PC

Sampling of DSS Feature-Set

 

 

Advantages For Analytics Leaders

Driving transformation through data requires productive teams of skilled data professionals. But without the right tools, the right culture, and the right knowledge-sharing capabilities, even the most skilled professionals will struggle to deliver data-driven or AI-based solutions.

That’s why Dataiku offers leaders a transparent yet structured environment in which they can not only track individual and team performance, but also foster team collaboration, personal growth, and skill-set amplification.

 

Connect to more than 25 data storage systems

  • Analytical MPP databases (Teradata, Greenplum, Vertica)
  • Cloud databases (Amazon Redshift, Google BigQuery, Snowflake, Azure SQL)
  • Operational databases (Oracle, MS SQL Server, PostgreSQL, MySQL)
  • NoSQL stores (MongoDB, Cassandra, Elasticsearch)
  • Hadoop (HDFS)
  • Cloud object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage)
  • Remote data sources (API, HTTP, FTP, SCP, SFTP)
  • And more!

 

Get instant insights from your datasets

  • Create automatic reports on your datasets and point out potential data quality issues.
  • Generate univariate and multivariate statistics to produce a detailed dataset audit report.
  • Filter and search your data as easily as you would in Excel.
  • Leverage your business and domain knowledge by using custom semantic meanings in your analyses.
  • Scale your analytics by transparently running on Spark, Hadoop or SQL engines to produce insights.

Clean and enrich data interactively

  • Easy access to over 80 built-in visual processors for code-free data wrangling.
  • Automatically suggested contextual transformations.
  • Perform mass actions on your data.

 

Automated machine learning

  • Automatic feature engineering, generation, and selection to use any kind of data in your models.
  • Optimize your model hyperparameters using various cross-validation strategies.
  • Compare dozens of algorithms from the Dataiku interface, for both supervised and unsupervised tasks.
  • Get instant visual insights from your model (variable importance, feature interactions, or parameters), and assess the model’s performance through detailed metrics.

Deploy to production in one click

  • Empower analysts and data scientists to deploy models into production in a few clicks.
  • Data cleaning, enriching, preprocessing, as well as models, are bundled together for simplified scoring pipelines.
  • Deployed models are versioned, enabling users to deploy new versions, compare them, and roll back at any time.

Manage your data pipelines

  • Dataiku lets you package a whole workflow, optionally including data and models, as a single deployable and reproducible package.
  • Install dedicated automation instances of Dataiku to run your exported workflows.
  • Provides for fully staged deployment models: from dev to test to preproduction to production, all within a single UI.

 

Interactive Python, R, and SQL notebooks

  • Discover and plot data with interactive (REPL) notebooks.
  • Integrates Jupyter for advanced syntax coloring and completion (Python and R).
  • Create your own updatable custom reports.
  • Use pre-templated Notebooks to speed up your work.
  • Interactively query databases or data lakes through SQL Notebooks (support for Hive).

 

Integrated documentation and knowledge sharing

Dataiku is designed from the ground up for data teams. Collaboration features make it easy to share knowledge among team members and onboard new users much faster.

  • Add detailed descriptions on your Dataiku objects (datasets, code, models…).
  • Tag, comment and favorite any Dataiku objects.
  • Engage with other users of the platform through Discussions.
  • Create Wikis to document your projects.

 

Data governance

  • Organize all your data tasks into clearly identified projects.
  • Document all your actions and datasets.
  • Search for data, comments, features, or models in a centralized catalog.

 

 

Screenshots: Project Flow, Feature Importance, Confusion Matrix, Post-Training Model Results, Stacked Bar Chart, Decision Chart, Lift Charts, Density Charts, Bubble Map Chart, Preparation Script.

  


 

Video Demos | ‘In-Depth’


Data Science Studio - 101 - Basics

Dataiku DSS Tutorial 101 Basics Duration – 15:44

 

The DSS demo has three parts:

Part 1: Basics
Part 2: From Lab to Flow
Part 3: Machine Learning (Phase 1: Machine Learning Models | Phase 2: Model Scoring)

Part 1: Basics

This ‘in-depth’ tutorial project is based on a fictional online t-shirt retailer called ‘Haiku T-Shirt’. We will use its enterprise data to work through a hypothetical e-commerce business insight and analytics problem. At the end of the three parts, Haiku T-Shirt will be in a better position to predict the spending trajectories of customers based on a series of features about their previous behavior. This is a good business problem for Machine Learning to solve.

In Part 1 we will do some dataset hygiene and exploratory data analysis (EDA) to get acquainted with the interface. Keep the faith at this point: a lot of it will feel like UX acrobatics, but it becomes clearer once you are done with Part 2, ‘From Lab to Flow’. You will see how DSS handles data restructuring and data type normalization, and how all of this shows up visually on a ‘Flow’ diagram that tells you where you started and where you are going in a project. Very helpful and powerful. You’ll see.

Tool Synopsis

Dataiku Data Science Studio is a freemium SaaS product that offers a cohesive workbench for working through machine learning problems interactively. Its true value is how the tools and APIs make it much easier, and in many cases possible at all, to work on big data science projects with a team, no matter where everyone is. Everything is in the cloud and it is secure.

To get set up, you will first want to sign up for an account. There is a free tier, so there is no need to worry about running up a tab just learning the product. The higher tiers allow more back-end processing as a service (remember, deep learning can get heavy on the CPU) and private team collaboration.

In the setup phase, play with some test datasets and models to see how it all works, then figure out what you want to do. It will become clear very quickly.

After that, you will be tempted, as I was, to bring in something like the Iris or Titanic dataset and mess around as if you were working in a Jupyter notebook, but visually. You may never leave Jupyter, but you may find yourself going to DSS to untangle models and perform data munging tasks that can get messy in notebooks and could be simpler in the studio. You can also incorporate the code snippets you rely on into the DSS workflow, which is really handy since not every canned process can make it through without some of your own code. Right?

As you watch the demos, I believe you will see how the value of DSS pops, and you will go sign up for an account ASAP.

Data Science Studio - 102 - From Lab to Flow

Dataiku DSS Tutorial 102 From Lab to Flow Duration – 17:11


Part 2: From Lab to Flow

In this segment, you will do what any data analyst, engineer, scientist, or product owner needs to do in order to deliver insights to sales and marketing. DSS will help us get the data into the right range, convert IP addresses to geo-locations, and arrange the relevant user-agent data to simplify browser and OS stats for the analysis. We will also implement an inner join of the customers and orders datasets. The result is a new dataset from which customers without any matching order rows have been discarded.
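To make the inner join concrete, here is a minimal sketch in plain Python. It is not how DSS implements its join recipe. The sample rows, the `customer_id` key, and the `inner_join` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical sample rows; the column names are assumptions, not the tutorial's schema.
customers = [
    {"customer_id": "c1", "age": 34},
    {"customer_id": "c2", "age": 27},  # has no orders, so the join drops it
    {"customer_id": "c3", "age": 45},
]
orders = [
    {"customer_id": "c1", "total": 20.0},
    {"customer_id": "c1", "total": 35.5},
    {"customer_id": "c3", "total": 12.0},
]


def inner_join(left, right, key):
    """Return one merged row for every (left, right) pair that shares `key`."""
    # Group the right-hand rows by key so each lookup is a dictionary access.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        # A left row with no match yields nothing: that is the "inner" part.
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


customers_orders = inner_join(customers, orders, "customer_id")
for row in customers_orders:
    print(row)
```

Customer `c2` never appears in the output because it has no order rows, while `c1` appears twice, once per order.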

This part also introduces a new concept, the Lab, where we can manipulate copies of datasets to accelerate our flow. We will build some recipes that do things like filter outliers on features such as ‘Age of first order’. Sometimes unrealistic age values find their way into observations, and those rows are simply not valid.
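The same kind of outlier filter can be sketched in a few lines of plain Python. The column name `age_first_order`, the sample rows, and the 0 to 120 bounds are assumptions chosen for illustration, not values taken from the tutorial.

```python
# Hypothetical rows; "age_first_order" and the bounds below are assumptions.
rows = [
    {"customer_id": "c1", "age_first_order": 34},
    {"customer_id": "c2", "age_first_order": 212},  # clearly not a real age
    {"customer_id": "c3", "age_first_order": 19},
    {"customer_id": "c4", "age_first_order": -3},   # data-entry error
]

MIN_AGE = 0    # assumed lower bound for a plausible age
MAX_AGE = 120  # assumed upper bound for a plausible age


def filter_valid_ages(rows, column, low, high):
    """Keep only the rows whose value in `column` lies within [low, high]."""
    return [row for row in rows if low <= row[column] <= high]


clean_rows = filter_valid_ages(rows, "age_first_order", MIN_AGE, MAX_AGE)
kept_ids = [row["customer_id"] for row in clean_rows]
print(kept_ids)
```

Choosing the bounds is a judgment call about the data. The point is only that rows outside a plausible range are removed before any modeling.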

Watch how we take EDA to a higher level with visual analysis of customer spend vs. age vs. whether they participated in a company marketing campaign.

Once you finish Part 2, you can take these datasets into the machine learning realm and work on both labeled and unlabeled datasets, just as you would in practical business situations.

 

 

Data Science Studio - 103 - Machine Learning

Dataiku DSS Tutorial 103 Machine Learning Duration – 13:41


Part 3: Machine Learning

In Phase 1: Machine Learning (both phases are in one 13-minute video)

After the substantial data preparation and aggregation in Part 2, we will now create some machine learning models. The improvement or ‘tuning’ stages follow the initial creation. Along the way, we will see how many knobs and switches DSS gives us for improving models, and that is the ‘key’ that unlocks a lot of its power. In Python, by contrast, you would do a train-test split, choose and set up a Random Forest model, feed the training data to it, and see how it does. You might then want to compute the ROC AUC and look at the confusion matrix and some curves (which you would have to plot yourself).
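As a small taste of that manual work, the sketch below computes a confusion matrix and accuracy by hand for a binary ‘high-value customer’ label. The labels are made up, and the `confusion_matrix` and `accuracy` helpers are hypothetical stand-ins. In practice you would more likely reach for a library such as scikit-learn for the split, the model, and the metrics.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["tp"] += 1      # predicted high-value, really high-value
        elif guess and not truth:
            counts["fp"] += 1      # predicted high-value, actually not
        elif not guess and not truth:
            counts["tn"] += 1      # predicted not high-value, correctly
        else:
            counts["fn"] += 1      # missed a real high-value customer
    return counts


def accuracy(counts):
    """Share of all predictions that were correct."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Made-up labels for ten held-out customers (not results from the tutorial).
actual    = [True, True,  False, False, True, False, False, True, False, False]
predicted = [True, False, False, False, True, True,  False, True, False, False]

counts = confusion_matrix(actual, predicted)
print(counts, accuracy(counts))
```

DSS shows the same four counts in its confusion matrix view without any of this code.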

Tuning might require a good look at a lot of the model results, maybe some aspects of the feature importance.  In DSS this is simple to do. See how we did it in this first phase.

Remember our goal: To predict if a new customer will become a high-value customer based on data around their first purchase.

In Phase 2: Model Scoring

This is where the rubber meets the road, so to speak. We will take the best model from the previous exercise and let it ‘score’ new customers based on what it learned from the customers that came before them. This is predictive analytics at its best: the model had a fairly high accuracy (almost 80%), and in business that is a pretty good percentage to bet on.

You will see how we ‘deploy’ a new prediction model given our ‘customers_labeled’ dataset and the ‘Random Forest’ model name.

Finally, we will score the new data and create a new output dataset from it. With the threshold on proba_True set to 0.625, we see two new columns, proba_True and proba_False, plus the ‘prediction’ column, which is the binary flag (True or False) for each customer.
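The relationship between the three output columns can be sketched as follows. The probabilities and customer IDs are invented, and treating a probability exactly at the threshold as True is an assumption. This sketch does not show how DSS resolves that edge case.

```python
THRESHOLD = 0.625  # the cut-off quoted in the tutorial


def score(proba_true, threshold=THRESHOLD):
    """Turn a positive-class probability into the three output columns."""
    return {
        "proba_True": proba_true,
        "proba_False": 1.0 - proba_true,
        # Assumption: a probability equal to the threshold counts as True.
        "prediction": proba_true >= threshold,
    }


# Invented model outputs for three new customers.
new_customers = {"c101": 0.75, "c102": 0.25, "c103": 0.625}
scored = {customer_id: score(p) for customer_id, p in new_customers.items()}

for customer_id, columns in scored.items():
    print(customer_id, columns)
```

Raising the threshold makes the True flag rarer but more trustworthy, and lowering it does the opposite. That trade-off is what the threshold setting controls.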

At the end of this phase and Tutorial 3, we have been through an end-to-end, practical data analysis project that any company with customers purchasing online (and most have them) would love to use.

Dataiku is making that more of a reality with Data Science Studio.

Give it a try.

