Machine Learning Pipeline
Problem Formulation
Business Problem -> ML problem
- ML model: start to think about the problem in terms of your ML model
- Questions
- Can machine learning solve the problem?
- Would a traditional approach make more sense?
- Is this problem a supervised or unsupervised machine learning problem?
- Do you have labeled data to train a supervised model?
- Problem + Intended outcome + Make it measurable
Collect and evaluate data
Questions
- Which data do you need?
- Do you have access to that data?
- How much data do you have and where is it?
- What solution can you use to bring all this data into one centralized repository?
Data Sources
- Private Data
- Data that you have in various existing systems
- Data is found in many different systems
- Commercial Data
- Data that a commercial entity collected and made available
- Open-Source Data
- Comprises many different open-source datasets
- usually available for use in research or for teaching purposes
- you can find open-source datasets hosted by AWS, Kaggle, etc.
Data Considerations
- Get an understanding of your data
- Get a domain expert
- Evaluate the quality of your data
- good data contains a signal about the phenomenon that you're trying to model
- Identify features and labels
- data = features + labels
- feature: attribute that can be used to help identify patterns and predict future answers
- label: answer that you want your model to predict
- data for which you already know the answer is called labeled data
- Identify labeled data needs
- supervised learning include labeled data
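The features/labels split above can be sketched with a small, hypothetical churn dataset (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical labeled dataset: each row is an example,
# the first three columns are features, "churned" is the label.
data = pd.DataFrame({
    "tenure_months":   [1, 24, 6, 48],
    "monthly_charges": [70.0, 20.0, 55.0, 25.0],
    "support_calls":   [5, 0, 3, 1],
    "churned":         [1, 0, 1, 0],   # the answer we want the model to predict
})

X = data.drop(columns=["churned"])   # features
y = data["churned"]                  # label

print(X.shape, y.shape)  # (4, 3) (4,)
```

Because every row already has a known answer in `churned`, this is labeled data suitable for supervised learning.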
Data Preparation and Preprocessing
- extract data from one or more data sources
- you need a subject matter expert or a functional expert to understand the authenticity of the data
Feature Engineering
Feature Engineering
- dealing with your data to make it usable
- selecting or creating the features
Feature Selection
- selecting the features that are most relevant and discarding the rest
- prevents redundancy or irrelevance among the existing features and limits the number of features to prevent overfitting
- Selection methods
- Wrapper methods: measure the usefulness of a subset of features by training a model on it and measuring the success of the model
- Filter methods: use statistical methods to measure the relevance of features, faster and cheaper than wrapper methods because they don't involve training the models repeatedly
- Embedded methods: algorithm-specific and might use a combination of both
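A minimal sketch of a filter method, assuming scikit-learn is available: `SelectKBest` scores each feature with a statistical test (here the ANOVA F-value) and keeps the top k, with no repeated model training involved.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Filter method: rank features by a statistical score, keep the best k.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(selector.get_support())           # boolean mask of the kept features
```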
Feature Extraction
- building up valuable information from raw data by reformatting, combining, and transforming primary features into new ones
- encoding data, finding missing data, handling outliers
Preparing Data
- data might be dirty -> contains errors or duplicate data
- using dirty data will lead to incorrect results or an inability to run the model effectively
- Considerations for preparing your data
- Encoding data
- ML algorithms work best with numerical data
- Cleaning data
- before you encode the string data, must make sure that the strings are all consistent
- adjust variables to use a consistent scale
- split items that capture more than one variable
- Finding missing data
- most ML algorithms can't deal with missing values automatically
- find missing data and update with something meaningful and relevant to the problem
- Handling outliers
- outliers: points that lie at an abnormal distance from other values
- outliers can add richness to your dataset but they can also make it more difficult to make accurate predictions
- outliers affect accuracy because they skew values away from the other more normal values that are related to the feature
- Encoding data
- Use feature extraction to convert data into a usable form
- During feature extraction, you will handle missing data, duplicates, inconsistencies, invalid values, and conversion of text data into numerical data
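The cleaning, missing-data, outlier, and encoding steps above can be sketched with pandas on a tiny made-up table (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["Seoul", "seoul", "Busan", None],
    "income": [52_000, 48_000, None, 1_000_000],  # None = missing, 1e6 = outlier
})

# Cleaning: make string values consistent before encoding them
df["city"] = df["city"].str.lower()

# Missing data: fill with something meaningful (mode for strings, median for numbers)
df["city"] = df["city"].fillna(df["city"].mode()[0])
df["income"] = df["income"].fillna(df["income"].median())

# Outliers: clip values that lie an abnormal distance from the rest
low, high = df["income"].quantile([0.05, 0.95])
df["income"] = df["income"].clip(low, high)

# Encoding: convert string data into numerical (one-hot) columns
df = pd.get_dummies(df, columns=["city"])
print(df.dtypes)
```

Which fill value or clipping rule is "meaningful" depends on the problem; mode, median, and quantile clipping here are just common defaults.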
Select and Train model
Model Training
- you don't use all the data to train your model
- Training data: feeds into the algorithm to produce your model; the model is then used to make predictions over the validation dataset
- Validation data: used while tuning; you might notice things that you will want to tweak, tune, and change
- Test data: held back for a final, unbiased evaluation after training and tuning
Benefit of splitting all available data into training (80%), validation (10%), and test (10%) subsets
- The model can be evaluated on data that wasn't used for training, to assess how well it generalizes to new information
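Assuming scikit-learn is available, an 80/10/10 split can be done with two calls to `train_test_split`: first carve off 80% for training, then split the remainder 50/50 into validation and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# 80% train, then split the remaining 20% into 10% validation + 10% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```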
Overfitting and Underfitting
- The goal of machine learning is to build a model that generalizes well
What is happening if a model is overfitting?
- The model performs well on the training data, but it doesn't perform well on the evaluation data
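Overfitting is easy to demonstrate with a sketch: an unconstrained decision tree trained on pure-noise labels memorizes the training data perfectly, yet fails on the validation data because there was no real signal to generalize (synthetic data, scikit-learn assumed available).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)  # pure-noise labels: nothing to generalize

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit

print("train:", model.score(X_train, y_train))  # 1.0: memorized the training data
print("val:  ", model.score(X_val, y_val))      # near chance: fails on unseen data
```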
Choosing an ML algorithm
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Tune and Evaluate model
Evaluate and Tune
- validation data provides a biased evaluation of the model fit while tuning the model
- the model can end up tuned to fit only the validation data
- test data has known values and lets you assess the unbiased accuracy of the model after tuning
- if the model performs well on the test data, it will likely perform well on new data with unknown target values
- you should set aside enough test data and not use that data for training or validation
Success Metric
- Evaluation relies on an appropriate metric
- Model metric should be linked to that business metric as closely as possible
Tuning
- goal of training is a model that is balanced and generalizes well
- modify the model's data, features, or hyperparameters until you find the model that yields the best result
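A minimal hyperparameter-tuning sketch, assuming scikit-learn: `GridSearchCV` tries each candidate setting, scores it with cross-validation, and keeps the best-performing one (the k-NN model and parameter grid here are arbitrary examples).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several hyperparameter settings; keep the one with the best
# cross-validated score.
grid = {"n_neighbors": [1, 3, 5, 9]}
search = GridSearchCV(KNeighborsClassifier(), grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```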
Deploy model
Machine Learning Stack
Data Layer
- where data that will feed directly into your ML model is stored
Model Layer
- contains the model and the algorithms that generate predictions based on the data that the model collects
Deployment & Monitoring Layer
- where the model works in a live environment and produces predictions
Machine Learning Tools
Tools for Machine Learning
Tools for Machine Learning
- Jupyter Notebook: open-source web application
- JupyterLab: web-based interactive development environment for Jupyter Notebooks
- Pandas: open-source Python library used for data handling and analysis
- Matplotlib: Python library used to generate plots of your data
- NumPy: scientific computing package in Python
- Scikit-learn: open-source ML library supports supervised/unsupervised learning
Machine Learning Framework
- provide tools and code libraries that you can use
- AWS supports the following frameworks and they can be used from Amazon SageMaker
- PyTorch, Caffe2, Torch, TensorFlow, Keras, Apache MXNet, etc.
Amazon Instances Designed for ML
AWS provides compute instances that are tuned for ML, both in the cloud and at the edge
- EC2 C5 instance: cost-effective, high performance for running advanced compute-intensive workloads
- help to speed up typical ML operations
- EC2 P3 instance: speed up ML applications
- fastest in the cloud for ML training
- ideal for ML workloads that need massive parallel processing power
- AWS IoT Greengrass makes it easy to bring intelligence to edge devices
- AWS Elastic Inference is used to attach low-cost GPU-powered acceleration to Amazon EC2 and SageMaker instances
AWS offers many managed services that don't require any ML experience
- Computer Vision
- Amazon Rekognition: provides object and facial recognition for both image and video
- Amazon Textract: extracts text from scanned images and documents
- Chat bots
- Amazon Lex: helps you build interactive conversational apps that use voice or text
- Speech
- Amazon Polly: converts text to audible speech
- Amazon Transcribe: converts spoken audio to text
- Forecasting
- Amazon Forecast: uses ML to combine time series data with additional variables to build forecasts
- Language
- Amazon Comprehend: uses natural language processing to find insights and relationships in text
- Amazon Translate: translates text into different languages
- Recommendations
- Amazon Personalize: helps you create individualized recommendations for customers
Amazon SageMaker
Amazon SageMaker
- AWS ML service with many capabilities
- helps data scientists and developers to Prepare > Build > Train & Tune > Deploy & Manage
- it also provides the ability for developers to iterate the process until they get their model just right
- every step of the ML pipeline can be conducted in this one service
- the first fully integrated development environment that is designed specifically for ML
- brings everything that you need for ML under one unified, visual UI
SageMaker Features
- provides tools for: labeling data, building models, training models, hosting trained models
- can deploy ML instances that run Jupyter Notebook and JupyterLab
- In SageMaker, the VMs that you create are your fully managed Amazon SageMaker ML instances
- Data visualization
- Model selection: built-in algorithm, write a script, AWS Marketplace, your own algorithm
- Deployment: you can deploy a model by using a couple of different methods
- a persistent, real-time endpoint returns one prediction at a time
- a batch transform job returns predictions for an entire dataset
- Marketplace integration: provides a large selection of ready-to-use model packages and algorithms from ML developers