
18. Cloud-Based AI


Machine Learning Pipeline

Problem Formulation

Business problem -> ML problem

  • ML model: start to think about the problem in terms of your ML model
  • Questions
    • Can machine learning solve the problem?
    • Would a traditional approach make more sense?
    • Is this problem a supervised or unsupervised machine learning problem?
    • Do you have labeled data to train a supervised model?
  • Problem + Intended outcome + Make it measurable

 

Collect and evaluate data

Questions

  • Which data do you need?
  • Do you have access to that data?
  • How much data do you have and where is it?
  • What solution can you use to bring all this data into one centralized repository?

Data Sources

  • Private Data
    • Data that you have in various existing systems
    • The data is often spread across many different systems
  • Commercial Data
    • Data that a commercial entity collected and made available
  • Open-Source Data
    • Comprises many different open-source datasets
    • usually available for use in research or for teaching purposes
    • you can find open-source datasets hosted by AWS, Kaggle ...

 

Data Consideration

  • Get an understanding of your data
  • Get a domain expert
  • Evaluate the quality of your data
    • good data contains a signal about the phenomenon that you're trying to model
  • Identify features and labels
    • data = features + labels
    • feature: attribute that can be used to help identify patterns and predict future answers
    • label: answer that you want your model to predict
    • data for which you already know the answer is called labeled data
  • Identify labeled data needs
    • supervised learning requires labeled data
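
With tabular data in Pandas, the label is one column and the features are the remaining columns. A minimal sketch (the customers.csv file and the churned column are made-up examples, not from the course):

```python
import pandas as pd

# Labeled data: every row contains features plus the known answer (label)
df = pd.read_csv("customers.csv")          # hypothetical labeled dataset

label = df["churned"]                       # the answer you want the model to predict
features = df.drop(columns=["churned"])     # attributes used to identify patterns

print(features.columns.tolist(), label.unique())
```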

Data Preparation and Preprocessing

  • extract data from one or more data sources
  • you need a subject matter expert or a functional expert to confirm the authenticity of the data

 

Feature Engineering

Feature Engineering

  • dealing with your data to make it usable
  • selecting or creating the features

Feature Selection

  • selecting the features that are most relevant and discarding the rest
  • prevent either redundancy or irrelevance in the existing features or to get a limited number of features to prevent overfitting
  • Selection methods
    • Wrapper methods: measure the usefulness of a subset of features by training a model on it and measuring the success of the model
    • Filter methods: use statistical methods to measure the relevance of features, faster and cheaper than wrapper methods because they don't involve training the models repeatedly
    • Embedded methods: algorithm-specific and might use a combination of both
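
A short scikit-learn sketch contrasting a filter method and a wrapper method on a synthetic dataset (the feature counts and k=5 are arbitrary choices, not values from the course):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter method: score each feature statistically, no repeated model training
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper method: repeatedly train a model and drop the least useful features
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = rfe.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)   # both reduced to 5 features
```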

Feature Extraction

  • building up valuable information from raw data by reformatting, combining, and transforming primary features into new ones
  • encoding data, finding missing data, handling outliers

Preparing Data

  • data might be dirty -> contains errors or duplicate data
    • using dirty data will lead to incorrect results or an inability to run the model effectively
  • Considerations for preparing your data
    • Encoding data
      • ML algorithms work best with numerical data
    • Cleaning data
      • before you encode the string data, you must make sure that the strings are all consistent
      • adjust variables to use a consistent scale
      • split items that capture more than one variable
    • Finding missing data
      • most ML algorithms can't deal with missing values automatically
      • find missing data and update with something meaningful and relevant to the problem
    • Handling outliers
      • outliers: points that lie at an abnormal distance from other values
      • outliers can add richness to your dataset but they can also make it more difficult to make accurate predictions
      • outliers affect accuracy because they skew values away from the other more normal values that are related to the feature
  • Use feature extraction to convert data into a usable form
  • During feature extraction, you will handle missing data, duplicates, inconsistencies, invalid values, and conversion of text data into numerical data
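
A minimal Pandas sketch of these preparation steps on a made-up toy table: cleaning inconsistent strings, filling missing values, encoding categories as numbers, and flagging outliers with the IQR rule (the columns and thresholds are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["Red", "red ", "BLUE", None],
    "price": [10.0, 12.5, None, 950.0],
})

# Cleaning: make string values consistent before encoding
df["color"] = df["color"].str.strip().str.lower()

# Missing data: replace with something meaningful (mode / median here)
df["color"] = df["color"].fillna(df["color"].mode()[0])
df["price"] = df["price"].fillna(df["price"].median())

# Encoding: ML algorithms work best with numerical data
df = pd.get_dummies(df, columns=["color"])

# Outliers: flag points that lie an abnormal distance from the rest (IQR rule)
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)])
```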

Select and Train model

Model Training

  • you don't use all the data to train your model
    • Training data: feeds into the algorithm to produce your model; the model is then used to make predictions over the validation dataset
    • Validation data: used to evaluate the model while tuning; you might notice things that you will want to tweak, tune, and change
    • Test data: held back and used after tuning to assess the unbiased accuracy of the final model

Benefit of splitting all available data into training (80%), validation (10%), and test (10%) subsets

  • The model can be evaluated on data that wasn't used for training, to assess how well it generalizes to new information
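
A minimal sketch of an 80/10/10 split with scikit-learn, done in two steps because train_test_split only produces two parts at a time (the synthetic dataset is just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 20% as a holdout, then split the holdout in half (10% val, 10% test)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```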

 

Overfitting and Underfitting

  • The goal of machine learning is to build a model that generalizes well

What is happening if a model is overfitting?

  • The model performs well on the training data, but it doesn't perform well on the evaluation data
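
One way to see this is to compare the model's score on the training data with its score on held-out data; a sketch with a deliberately overfitting-prone, unpruned decision tree on noisy synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which a deep unpruned tree will happily memorize
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))   # near 1.0
print("val accuracy:  ", model.score(X_val, y_val))       # noticeably lower -> overfitting
```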

 

Choosing an ML algorithm

  • Supervised learning (지도학습)
  • Unsupervised learning (비지도학습)
  • Reinforcement learning (강화학습)

 

Tune and Evaluate model

Evaluate and Tune

  • validation data provides a biased evaluation of the model fit while you tune the model
    • the model can end up tuned to fit only the validation data
  • test data has known values and lets you assess the unbiased accuracy of the model after tuning
    • if the model performs well on the test data, it will likely perform well on new data with unknown target values
  • you should set aside enough test data and not use that data for training or validation

Success Metric

  • Evaluation relies on an appropriate metric
  • The model metric should be linked to the business metric as closely as possible

Tuning

  • goal of training is a model that is balanced and generalizes well
  • modify the model's data, features, or hyperparameters until you find the model that yields the best result
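
A hedged sketch of this tune-then-evaluate loop with scikit-learn: search over hyperparameters (cross-validation plays the role of the validation data here), then measure unbiased accuracy once on the untouched test set. The model and parameter grid are arbitrary examples, not prescribed choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,   # cross-validation folds stand in for the validation data
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))   # unbiased estimate after tuning
```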

Deploy model


Machine Learning Stack

Data Layer

  • where data that will feed directly into your ML model is stored

Model Layer

  • contains the model and the algorithms that build predictions based on the collected data

Deployment & Monitoring Layer

  • the model is working in a live environment and performing ML tasks

Machine Learning Tools

Tools for Machine Learning


  • Jupyter Notebook: open-source web application
  • JupyterLab: web-based interactive development environment for Jupyter Notebooks
  • Pandas: open-source Python library used for data handling and analysis
  • Matplotlib: Python library used to generate plots of your data
  • NumPy: scientific computing package for Python
  • Scikit-learn: open-source ML library supports supervised/unsupervised learning
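
A tiny example of these tools working together: pandas for data handling, NumPy for numerics, Matplotlib for the plot, and scikit-learn for the model (the synthetic x/y data is made up):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy dataset: a noisy line
df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) * 2.0 + np.random.randn(20)})

model = LinearRegression().fit(df[["x"]], df["y"])

plt.scatter(df["x"], df["y"])                               # raw data
plt.plot(df["x"], model.predict(df[["x"]]), color="red")    # fitted line
plt.show()
```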

Machine Learning Framework

  • provide tools and code libraries that you can use
  • AWS supports the following frameworks and they can be used from Amazon SageMaker
    • PyTorch, Caffe2, Torch, TensorFlow, Keras, Apache MXNet, etc.

 

Amazon Instances Designed for ML

AWS provides compute instances that are tuned for ML, both in the cloud and at the edge

  • EC2 C5 instance: delivers cost-effective high performance for running advanced compute-intensive workloads
    • help to speed up typical ML operations
  • EC2 P3 instance: speed up ML applications
    • fastest in the cloud for ML training
    • ideal for ML workloads that need massive parallel processing power
  • AWS IoT Greengrass makes it easy to bring intelligence to edge devices
  • AWS Elastic Inference is used to attach low-cost GPU-powered acceleration to Amazon EC2 and SageMaker instances

 

AWS offers many managed services that don't require any ML experience

  • Computer Vision
    • Amazon Rekognition: provides object and facial recognition for both image and video
    • Amazon Textract: extracts text from scanned documents and images
  • Chat bots
    • Amazon Lex: helps you build interactive conversational apps that use voice or text
  • Speech
    • Amazon Polly: converts text to audible speech
    • Amazon Transcribe: converts spoken audio to text
  • Forecasting
    • Amazon Forecast: use ML to combine time series data with additional variables to build forecasts
  • Language
    • Amazon Comprehend: use natural language processing to find insights and relationships in text
    • Amazon Translate: translates text into different languages
  • Recommendations
    • Amazon Personalize: helps you create individualized, personalized recommendations for customers
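
These services are called through plain API requests rather than by training anything yourself; a hedged sketch using Amazon Comprehend and Amazon Translate with boto3 (the region, credentials, and sample text are assumptions):

```python
import boto3

# Language: find sentiment in text with Amazon Comprehend
comprehend = boto3.client("comprehend", region_name="us-east-1")
sentiment = comprehend.detect_sentiment(Text="The new release is fantastic!", LanguageCode="en")
print(sentiment["Sentiment"])            # e.g. POSITIVE

# Language: translate text with Amazon Translate
translate = boto3.client("translate", region_name="us-east-1")
result = translate.translate_text(
    Text="Hello, world",
    SourceLanguageCode="en",
    TargetLanguageCode="ko",
)
print(result["TranslatedText"])
```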

 

Amazon SageMaker

Amazon SageMaker

  • AWS ML service with many capabilities
  • help data scientists and developers to Prepare > Build > Train & Tune > Deploy & Manage
  • it also provides the ability for developers to iterate the process until they get their model just right
  • every step of the ML pipeline can be conducted in this one service
  • first fully integrated development environment that is designed specifically for ML
  • brings everything that you need for ML under one unified, visual UI

 

 

SageMaker Features

  • provides tools for: labeling data, building models, training models, hosting trained models
  • can deploy ML instances that run Jupyter Notebook and JupyterLab
  • In SageMaker, the VMs that you create are your fully managed Amazon SageMaker ML instances
  • Data visualization
  • Model selection: built-in algorithm, write a script, AWS Marketplace, your own algorithm
  • Deployment: you can deploy a model by using a couple of different methods
    • deploy the model to a hosted endpoint to get one prediction at a time
    • use a batch transform to get predictions on an entire dataset
  • Marketplace integration: provides a large selection of ready-to-use model packages and algorithms from ML developers
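
A hedged sketch with the SageMaker Python SDK showing the two deployment paths above: a real-time endpoint for one prediction at a time, and a batch transform for an entire dataset. The IAM role, S3 paths, and the choice of the built-in XGBoost algorithm are placeholder assumptions:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder IAM role

# Train a built-in algorithm on data already staged in S3 (placeholder bucket/paths)
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")
estimator = Estimator(image_uri, role, instance_count=1, instance_type="ml.m5.large",
                      output_path="s3://my-bucket/model/", sagemaker_session=session)
estimator.fit({"train": "s3://my-bucket/train/"})

# Option 1: real-time hosted endpoint, one prediction at a time
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Option 2: batch transform, predictions over an entire dataset in S3
transformer = estimator.transformer(instance_count=1, instance_type="ml.m5.large")
transformer.transform("s3://my-bucket/batch-input/", content_type="text/csv")
```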