ML Experiment Management


Last updated 2 years ago


  1. Cnvrg.io

    1. Manage - easily navigate machine learning with dashboards, reproducible data science, dataset organization, experiment tracking and visualization, a model repository, and more.

    2. Build - run and track experiments at high speed, with the freedom to use any compute environment, framework, programming language, or tool; no configuration required.

    3. Automate - build more models and automate your machine learning from research to production using reusable components and a drag-and-drop interface.

  2. Comet.ml - Comet lets you track code, experiments, and results on ML projects. It’s fast, simple, and free for open source projects.

  3. Floyd - notebooks in the cloud, similar to Colab, Kaggle, etc.; GPU costs $4/h.

  4. MissingLink - RIP (discontinued).

  5. Spark

  6. Databricks

    1. Koalas - pandas API on Apache Spark

    2. , has some basic sklearn-like tool and other custom operations such as single-vector-based aggregator for using features as an input to a model

    3. (read me, has all libraries)

    4. , explains the three advantages of Databricks with examples of using native and non-native algorithms:

      1. Spark SQL

      2. MLflow

      3. Streaming

      4. SystemML DML using Keras models

    5. spark-sklearn, for grid searching with sklearn:

      1. from spark_sklearn import GridSearchCV

    6. How can we leverage our existing experience with modeling libraries like scikit-learn? We'll explore three approaches that make use of existing libraries, but still benefit from the parallelism provided by Spark.

These approaches are:

  • Grid Search

  • Cross Validation

  • Sampling (random, chronological subsets of data across clusters)
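The `spark_sklearn` `GridSearchCV` noted above is designed as a drop-in replacement for scikit-learn's own. A minimal sketch of the single-machine scikit-learn version it parallelizes (the dataset and parameter values here are illustrative, not from any of the linked examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# Plain scikit-learn: the whole grid search runs on a single machine.
gs = GridSearchCV(SVC(), param_grid, cv=3)
gs.fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))

# With spark-sklearn the only change is the import and passing a SparkContext:
#   from spark_sklearn import GridSearchCV
#   gs = GridSearchCV(sc, SVC(), param_grid, cv=3)
```

Because the API matches, existing sklearn grid-search code moves to a cluster with essentially no rewrite.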

GitHub (needs to be compared to what Spark has internally)

It's worth pausing here to note that the architecture of this approach is different than that used by MLlib in Spark. Using spark-sklearn, we're simply distributing the cross-validation run of each model (with a specific combination of hyperparameters) across each Spark executor. Spark MLlib, on the other hand, will distribute the internals of the actual learning algorithms across the cluster.

The main advantage of spark-sklearn is that it enables leveraging the very rich set of algorithms in scikit-learn. These algorithms do not run natively on a cluster (although they can be parallelized on a single machine) and by adding Spark, we can unlock a lot more horsepower than could ordinarily be used.
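To make the architectural distinction concrete, here is a local sketch of the unit of work that spark-sklearn distributes: one full cross-validation run per hyperparameter combination. The dataset, grid values, and the commented RDD call are illustrative, not the library's internals:

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

def evaluate(params):
    # One full cross-validation run: the unit of work shipped to an
    # executor, one hyperparameter combination per task.
    model = RandomForestClassifier(random_state=0, **params)
    return params, cross_val_score(model, X, y, cv=3).mean()

results = [evaluate(p) for p in combos]                      # local stand-in
# results = sc.parallelize(combos).map(evaluate).collect()   # on a Spark cluster
best_params, best_score = max(results, key=lambda r: r[1])
print(best_params, round(best_score, 3))
```

Note that each `evaluate` call fits complete models on one machine; MLlib, by contrast, would split the internals of a single fit across the cluster.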

Using spark-sklearn is a straightforward way to throw more CPU at any machine learning problem you might have. We used the package, together with sklearn random trees, to reduce the time spent searching and to reduce the error of our estimator.
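Dedicated trackers like Comet.ml (item 2 above) all implement the same core loop: record each run's parameters and metrics so experiments stay reproducible and comparable. A hand-rolled sketch of that pattern (the `Run` class, file layout, and method names are invented for illustration; they are not any tool's API):

```python
import json
import time
from pathlib import Path

class Run:
    """Toy experiment tracker: one JSON file per run (illustrative only)."""

    def __init__(self, root="runs"):
        self.path = Path(root)
        self.path.mkdir(exist_ok=True)
        self.record = {"started": time.time(), "params": {}, "metrics": []}

    def log_params(self, **params):
        # Hyperparameters for this run.
        self.record["params"].update(params)

    def log_metric(self, name, value, step=0):
        # Append a metric observation (e.g. accuracy per epoch).
        self.record["metrics"].append({"name": name, "value": value, "step": step})

    def finish(self):
        # Persist the run so it can be compared against later runs.
        out = self.path / f"run-{int(self.record['started'])}.json"
        out.write_text(json.dumps(self.record, indent=2))
        return out

run = Run()
run.log_params(lr=0.01, n_estimators=50)
run.log_metric("accuracy", 0.93, step=1)
print(run.finish())
```

Hosted tools add UIs, diffing, and code/environment capture on top of this same record-per-run idea.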

  • All the alternatives
  • Trains - open source
  • RDDs vs Datasets vs DataFrames
  • What are RDDs?
  • Keras, TF, Spark
  • Repartition vs coalesce
  • Best practices
  • Koalas
  • Intro to Databricks on Spark
  • pyspark.ml
  • Keras as a single node (no Spark)
  • Horovod for distributed Keras (and more)
  • Documentation
  • Medium tutorial
  • SystemML notebooks (didn't read)
  • Sklearn notebook example
  • Utilizing Spark nodes
  • Airbnb example using Spark and sklearn, cross_val & grid search comparison vs joblib
  • Sklearn example 2, TF-IDF
  • Tracking experiments example
  • Saving, loading, deployment
  • AWS SageMaker
  • Medium
  • How to productionalize your model using Databricks Spark 2.0 (YouTube)