
18. Cloud-Based AI


Machine Learning Pipeline

Problem Formulation

Business problem -> ML problem

  • ML model: start to think about the problem in terms of your ML model
  • Questions
    • Can machine learning solve the problem?
    • Would a traditional approach make more sense?
    • Is this problem a supervised or unsupervised machine learning problem?
    • Do you have labeled data to train a supervised model?
  • Problem + Intended outcome + Make it measurable

 

Collect and evaluate data

Questions

  • Which data do you need?
  • Do you have access to that data?
  • How much data do you have and where is it?
  • What solution can you use to bring all this data into one centralized repository?

Data Sources

  • Private Data
    • Data that you have in various existing systems
    • The data is often spread across many different systems
  • Commercial Data
    • Data that a commercial entity collected and made available
  • Open-Source Data
    • Comprises many different open-source datasets
    • usually available for use in research or for teaching purposes
    • you can find open-source datasets hosted by AWS, Kaggle ...

 

Data Consideration

  • Get an understanding of your data
  • Get a domain expert
  • Evaluate the quality of your data
    • good data contains a signal about the phenomenon that you're trying to model
  • Identify features and labels
    • data = features + labels
    • feature: attribute that can be used to help identify patterns and predict future answers
    • label: answer that you want your model to predict
    • data for which you already know the answer is called labeled data
  • Identify labeled data needs
    • supervised learning requires labeled data
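
With tabular data in Pandas, the label is one column and the features are the remaining columns. A minimal sketch (the customers.csv file and the churned column are made-up examples, not from the course):

```python
import pandas as pd

# Labeled data: every row contains features plus the known answer (label)
df = pd.read_csv("customers.csv")          # hypothetical labeled dataset

label = df["churned"]                       # the answer you want the model to predict
features = df.drop(columns=["churned"])     # attributes used to identify patterns

print(features.columns.tolist(), label.unique())
```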

Data Preparation and Preprocessing

  • extract data from one or more data sources
  • you need a subject matter expert or a functional expert to confirm the authenticity of the data

 

Feature Engineering

Feature Engineering

  • dealing with your data to make it usable
  • selecting or creating the features

Feature Selection

  • selecting the features that are most relevant and discarding the rest
  • prevent either redundancy or irrelevance in the existing features or to get a limited number of features to prevent overfitting
  • Selection methods
    • Wrapper methods: measure the usefulness of a subset of features by training a model on it and measuring the success of the model
    • Filter methods: use statistical methods to measure the relevance of features, faster and cheaper than wrapper methods because they don't involve training the models repeatedly
    • Embedded methods: algorithm-specific and might use a combination of both
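
A short scikit-learn sketch contrasting a filter method and a wrapper method on a synthetic dataset (the feature counts and k=5 are arbitrary choices, not values from the course):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter method: score each feature statistically, no repeated model training
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper method: repeatedly train a model and drop the least useful features
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = rfe.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)   # both reduced to 5 features
```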

Feature Extraction

  • building up valuable information from raw data by reformatting, combining, and transforming primary features into new ones
  • encoding data, finding missing data, handling outliers

Preparing Data

  • data might be dirty -> contains errors or duplicate data
    • using dirty data will lead to incorrect results or an inability to run the model effectively
  • Considerations for preparing your data
    • Encoding data
      • ML algorithms work best with numerical data
    • Cleaning data
      • before you encode the string data, you must make sure that the strings are all consistent
      • adjust variables to use a consistent scale
      • split items that capture more than one variable
    • Finding missing data
      • most ML algorithms can't deal with missing values automatically
      • find missing data and update with something meaningful and relevant to the problem
    • Handling outliers
      • outliers: points that lie at an abnormal distance from other values
      • outliers can add richness to your dataset but they can also make it more difficult to make accurate predictions
      • outliers affect accuracy because they skew values away from the other more normal values that are related to the feature
  • Use feature extraction to convert data into a usable form
  • During feature extraction, you will handle missing data, duplicates, inconsistencies, invalid values, and conversion of text data into numerical data
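
A minimal Pandas sketch of these preparation steps on a made-up toy table: cleaning inconsistent strings, filling missing values, encoding categories as numbers, and flagging outliers with the IQR rule (the columns and thresholds are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "red ", "BLUE", None],
    "price": [10.0, 12.5, None, 950.0],
})

# Cleaning: make string values consistent before encoding
df["color"] = df["color"].str.strip().str.lower()

# Missing data: replace with something meaningful (mode / median here)
df["color"] = df["color"].fillna(df["color"].mode()[0])
df["price"] = df["price"].fillna(df["price"].median())

# Encoding: ML algorithms work best with numerical data
df = pd.get_dummies(df, columns=["color"])

# Outliers: flag points that lie an abnormal distance from the rest (IQR rule)
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)])
```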

Select and Train model

Model Training

  • you don't use all the data to train your model
    • Training data: feeds into the algorithm to produce your model; the model is then used to make predictions over the validation dataset
    • Validation data: used to evaluate the model while tuning; you might notice things that you will want to tweak, tune, and change
    • Test data: held back and used after tuning to assess the unbiased accuracy of the final model

Benefit of splitting all available data into training (80%), validation (10%), and test (10%) subsets

  • The model can be evaluated on data that wasn't used for training, to assess how well it generalizes to new information
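
A minimal sketch of an 80/10/10 split with scikit-learn, done in two steps because train_test_split only produces two parts at a time (the synthetic dataset is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 20% as a holdout, then split the holdout in half (10% val, 10% test)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```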

 

Overfitting and Underfitting

  • The goal of machine learning is to build a model that generalizes well

What is happening if a model is overfitting?

  • The model performs well on the training data, but it doesn't perform well on the evaluation data
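
One way to see this is to compare the model's score on the training data with its score on held-out data; a sketch with a deliberately overfitting-prone, unpruned decision tree on noisy synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which a deep unpruned tree will happily memorize
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))   # near 1.0
print("val accuracy:  ", model.score(X_val, y_val))       # noticeably lower -> overfitting
```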

 

Choosing an ML algorithm

  • Supervised learning (지도학습)
  • Unsupervised learning (비지도학습)
  • Reinforcement learning (강화학습)

 

Tune and Evaluate model

Evaluate and Tune

  • validation data provides a biased evaluation of the model fit while you tune the model
    • the model can end up tuned to fit only the validation data
  • test data has known values and lets you assess the unbiased accuracy of the model after tuning
    • if the model performs well on the test data, it will likely perform well on new data with unknown target values
  • you should set aside enough test data and not use that data for training or validation

Success Metric

  • Evaluation relies on an appropriate metric
  • The model metric should be linked to the business metric as closely as possible

Tuning

  • goal of training is a model that is balanced and generalizes well
  • modify the model's data, features, or hyperparameters until you find the model that yields the best result
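
A hedged sketch of this tune-then-evaluate loop with scikit-learn: search over hyperparameters (cross-validation plays the role of the validation data here), then measure unbiased accuracy once on the untouched test set. The model and parameter grid are arbitrary examples, not prescribed choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,   # cross-validation folds stand in for the validation data
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))   # unbiased estimate after tuning
```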

Deploy model


Machine Learning Stack

Data Layer

  • where data that will feed directly into your ML model is stored

Model Layer

  • contains the model and the algorithms that build predictions based on the collected data

Deployment & Monitoring Layer

  • the model is working in a live environment and performing ML tasks

Machine Learning Tools

Tools for Machine Learning


  • Jupyter Notebook: open-source web application
  • JupyterLab: web-based interactive development environment for Jupyter Notebooks
  • Pandas: open-source Python library used for data handling and analysis
  • Matplotlib: Python library used to generate plots of your data
  • NumPy: scientific computing package for Python
  • Scikit-learn: open-source ML library supports supervised/unsupervised learning
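
A tiny example of these tools working together: pandas for data handling, NumPy for numerics, Matplotlib for the plot, and scikit-learn for the model (the synthetic x/y data is made up):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataset: a noisy line
df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 2.0 + np.random.randn(20)})

model = LinearRegression().fit(df[["x"]], df["y"])

plt.scatter(df["x"], df["y"])                               # raw data
plt.plot(df["x"], model.predict(df[["x"]]), color="red")    # fitted line
plt.show()
```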

Machine Learning Framework

  • provide tools and code libraries that you can use
  • AWS supports the following frameworks and they can be used from Amazon SageMaker
    • PyTorch, Caffe2, Torch, TensorFlow, Keras, Apache MXNet, etc.

 

Amazon Instances Designed for ML

AWS provides compute instances that are tuned for ML, both in the cloud and at the edge

  • EC2 C5 instance: delivers cost-effective high performance for running advanced compute-intensive workloads
    • help to speed up typical ML operations
  • EC2 P3 instance: speed up ML applications
    • fastest in the cloud for ML training
    • ideal for ML workloads that need massive parallel processing power
  • AWS IoT Greengrass makes it easy to bring intelligence to edge devices
  • AWS Elastic Inference is used to attach low-cost GPU-powered acceleration to Amazon EC2 and SageMaker instances

 

AWS offers many managed services that don't require any ML experience

  • Computer Vision
    • Amazon Rekognition: provides object and facial recognition for both image and video
    • Amazon Textract: extracts text from scanned documents and images
  • Chat bots
    • Amazon Lex: helps you build interactive conversational apps that use voice or text
  • Speech
    • Amazon Polly: converts text to audible speech
    • Amazon Transcribe: converts spoken audio to text
  • Forecasting
    • Amazon Forecast: use ML to combine time series data with additional variables to build forecasts
  • Language
    • Amazon Comprehend: use natural language processing to find insights and relationships in text
    • Amazon Translate: translates text into different languages
  • Recommendations
    • Amazon Personalize: helps you create individualized, personalized recommendations for customers
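
These services are called through plain API requests rather than by training anything yourself; a hedged sketch using Amazon Comprehend and Amazon Translate with boto3 (the region, credentials, and sample text are assumptions):

```python
import boto3

# Language: find sentiment in text with Amazon Comprehend
comprehend = boto3.client("comprehend", region_name="us-east-1")
sentiment = comprehend.detect_sentiment(Text="The new release is fantastic!", LanguageCode="en")
print(sentiment["Sentiment"])            # e.g. POSITIVE

# Language: translate text with Amazon Translate
translate = boto3.client("translate", region_name="us-east-1")
result = translate.translate_text(
    Text="Hello, world",
    SourceLanguageCode="en",
    TargetLanguageCode="ko",
)
print(result["TranslatedText"])
```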

 

Amazon SageMaker

Amazon SageMaker

  • AWS ML service with many capabilities
  • help data scientists and developers to Prepare > Build > Train & Tune > Deploy & Manage
  • it also provides the ability for developers to iterate the process until they get their model just right
  • every step of the ML pipeline can be conducted in this one service
  • first fully integrated development environment that is designed specifically for ML
  • brings everything that you need for ML under one unified, visual UI

 

 

SageMaker Features

  • provides tools for: labeling data, building models, training models, hosting trained models
  • can deploy ML instances that run Jupyter Notebook and JupyterLab
  • In SageMaker, the VMs that you create are your fully managed Amazon SageMaker ML instances
  • Data visualization
  • Model selection: built-in algorithm, write a script, AWS Marketplace, your own algorithm
  • Deployment: you can deploy a model by using a couple of different methods
    • deploy the model to a hosted endpoint to get one prediction at a time
    • use a batch transform to get predictions on an entire dataset
  • Marketplace integration: provides a large selection of ready-to-use model packages and algorithms from ML developers
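
A hedged sketch with the SageMaker Python SDK showing the two deployment paths above: a real-time endpoint for one prediction at a time, and a batch transform for an entire dataset. The IAM role, S3 paths, and the choice of the built-in XGBoost algorithm are placeholder assumptions:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder IAM role

# Train a built-in algorithm on data already staged in S3 (placeholder bucket/paths)
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")
estimator = Estimator(image_uri, role, instance_count=1, instance_type="ml.m5.large",
                      output_path="s3://my-bucket/model/", sagemaker_session=session)
estimator.fit({"train": "s3://my-bucket/train/"})

# Option 1: real-time hosted endpoint, one prediction at a time
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Option 2: batch transform, predictions over an entire dataset in S3
transformer = estimator.transformer(instance_count=1, instance_type="ml.m5.large")
transformer.transform("s3://my-bucket/batch-input/", content_type="text/csv")
```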