Deploy machine learning models in production environments

This article describes best practices for deploying machine learning models to production environments by using Azure Machine Learning. Deploying models to production is a key step for organizations that use AI to enhance their operations, and it can be a complex process. This article walks you through the architectural considerations, deployment methods, and other factors to weigh along the way.

Architectural considerations

  • Choose the right deployment method. Each deployment method has advantages and disadvantages. It's important to choose the one that best suits your organization's needs. There are two main deployment methods:

    • Real-time (online) inference processes input data as it's received, often with a low-latency requirement. Low latency is important for applications that require immediate responses, such as fraud detection, speech recognition, or recommendation systems. Real-time inference is more complex and expensive to implement than batch inference because it requires a faster and more reliable infrastructure. The underlying compute for real-time inference usually runs continuously to service requests faster.

    • Batch (offline) inference processes a large batch of input data at once rather than processing each input data point individually in real time. Batch inference is well suited for scenarios that involve large data volumes and need efficient processing, where response time isn't critical. For example, you might use batch inference to process a large dataset of images, and the machine learning model makes predictions on all the images at once. Batch inference is less expensive and more efficient than real-time inference. The underlying compute for batch inference usually runs only during the batch job.

    Machine Learning uses endpoints to deploy models in both real-time and batch scenarios. Endpoints provide a unified interface to invoke and manage model deployments across compute types. Managed online endpoints serve, scale, secure, and monitor your machine learning models for inference.

    For more information, see the next section of this article, Deployment methods.

  • Ensure consistency. It's important to deploy your model consistently across environments, such as development, staging, and production. Use containerization or virtualization technologies, such as Machine Learning environments, to provide consistency and to encapsulate your environment.

  • Monitor performance. After your model is deployed to production, track metrics such as accuracy, latency, and throughput, and set up alerts to notify you if performance drops below acceptable levels. Use Application Insights and the built-in monitoring capabilities of managed endpoints to view metrics and create alerts.

  • Implement security measures. Protect your data and systems. You can set up authentication and access controls, encrypt data in transit and at rest, use network security, and monitor for suspicious activity.

  • Create an update plan. Machine learning models need updates as new data and new algorithms become available. It's important to create a process to test and validate an updated model before you deploy it to production. Blue/green deployment is a common strategy for updating machine learning models in production. With blue/green deployment, you deploy an updated model to a new environment, test it, and then switch traffic to the new model after it's validated. Blue/green deployment ensures that potential problems with the updated model don't affect your customers. For more information, see Native blue/green deployment. A minimal sketch of this pattern follows this list.
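
The following is a minimal sketch of the blue/green pattern by using the Azure Machine Learning Python SDK v2 (azure-ai-ml). The endpoint name, deployment names, model version, instance size, and traffic split are illustrative placeholders, and the sketch assumes an existing endpoint with a "blue" deployment and a model registered in MLflow format so that no scoring script or environment is needed.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

# Connect to the workspace. The IDs below are placeholders.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Deploy the updated model as a second ("green") deployment next to the
# existing "blue" deployment on the same managed online endpoint.
green = ManagedOnlineDeployment(
    name="green",
    endpoint_name="credit-endpoint",
    model="azureml:credit-model:2",  # updated model version (MLflow format)
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(green).result()

# Shift a small share of live traffic to the new model, validate it, and then
# increase the percentage after the green deployment is verified.
endpoint = ml_client.online_endpoints.get(name="credit-endpoint")
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```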

Deployment methods

To evaluate your model, compare the two deployment methods, and select the method that suits your model, consider how frequently predictions must be generated, how soon the results are needed, and how much data must be processed at a time.

A diagram of the real-time inference and batch inference decision tree.

Batch inference

Consider the following best practices for batch inference:

  • For improved performance, Machine Learning supports features that enable scalable processing. The number of compute nodes and the maximum concurrency per instance are defined when you create the batch deployment, and you can override these parameters for each job, which provides runtime flexibility and out-of-the-box parallelism. These features work with tabular and file-based inference. (A deployment sketch that sets these parameters follows this list.)
  • Batch inference challenges. Batch inference is a simpler way to use and deploy your model in production, but it does present its own set of challenges:
    • Depending on the frequency that the inference runs, the prediction that's generated might be irrelevant by the time it's accessed.
    • Deploying to many regions and designing the solution for high availability aren't critical concerns in a batch inference scenario because the model isn't deployed regionally. But the data store might need to be deployed with a high-availability strategy in many locations. The deployment should follow the application high-availability design and strategy.
    • Data that's generated during a batch inference job might partially fail. For example, if a scheduled pipeline triggers a batch inference job and the pipeline fails, the data that's generated by the batch inference job might be incomplete. Partial restarts are a common problem with batch inference. One solution is to use a staging area for the data and move the data to the final destination only after the batch inference job successfully completes. Another solution is to maintain a record, or transaction, of each file that's processed and compare that record to the input file list to avoid duplication. This method incorporates logic in the scoring script. This solution is more complex, but you can customize the failure logic if the batch inference job fails.

  • Security requirements. Use authentication and authorization to control access to the batch endpoint for enhanced security:
    • A batch endpoint with ingress protection only accepts scoring requests from hosts inside a virtual network. It doesn't accept scoring requests from the public internet. A batch endpoint that's created in a private link-enabled workspace has ingress protection. For more information, see Network isolation in batch endpoints.
    • Use Microsoft Entra tokens for authentication.
    • Use SSL encryption on the endpoint, which is enabled by default for Machine Learning endpoint invocation.
    • Batch endpoints ensure that only authorized users can invoke batch deployments, but individuals can use other credentials to read the underlying data. For a reference of the data stores and the credentials to access them, see the data access table.

  • Batch integration. Machine Learning batch endpoints use an open API. Batch inference can integrate with other Azure services, such as Azure Data Factory, Azure Databricks, and Azure Synapse Analytics, to form part of a larger data pipeline. For example, you can use:
    • Data Factory to orchestrate the batch inference process.
    • Azure Databricks to prepare the data for batch inference.
    • Machine Learning to run the batch inference process.
    • Azure Synapse Analytics to store the subsequent predictions.

    Batch endpoints support Microsoft Entra ID for authorization. The request to the API requires proper authentication. Azure services, such as Data Factory, support using a service principal or a managed identity to authenticate against batch endpoints. For more information, see Run batch endpoints from Data Factory.

  • To choose the best method for batch input and output processing, it's important to understand how data moves through the stages of your data pipelines. You can access Azure data services directly from the batch endpoint scoring script by using SDKs, but using Machine Learning registered datastores is simpler, more secure, and more auditable. For third-party data sources, use a data processing engine, such as Data Factory, Azure Databricks, or Azure Synapse Analytics, to prepare the data for batch inference and to apply post-inference processing.

  • MLflow. Use the open-source framework MLflow during model development. Machine Learning supports no-code deployment of models that you create and log with MLflow. When you deploy your MLflow model to a batch endpoint, you don't need to specify a scoring script or an environment.
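
As a sketch of the scalability settings and the MLflow no-code deployment that this list describes, the following example creates a batch endpoint and a batch deployment by using the Python SDK v2 (azure-ai-ml). The endpoint name, compute cluster, model, and parameter values are illustrative placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.constants import BatchDeploymentOutputAction
from azure.ai.ml.entities import BatchDeployment, BatchEndpoint, BatchRetrySettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Create the batch endpoint that receives scoring requests.
endpoint = BatchEndpoint(name="sales-batch", description="Nightly batch scoring")
ml_client.batch_endpoints.begin_create_or_update(endpoint).result()

# Create a deployment that defines the compute node count and the maximum
# concurrency per node. Because the model is in MLflow format, no scoring
# script or environment is specified.
deployment = BatchDeployment(
    name="default",
    endpoint_name="sales-batch",
    model="azureml:sales-model:1",
    compute="cpu-cluster",
    instance_count=4,                # number of compute nodes for the job
    max_concurrency_per_instance=2,  # parallel scoring processes per node
    mini_batch_size=10,              # files passed to each scoring call
    output_action=BatchDeploymentOutputAction.APPEND_ROW,
    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),
)
ml_client.batch_deployments.begin_create_or_update(deployment).result()
```

As described earlier in this list, you can override values such as the node count for a specific job when you invoke the endpoint.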

Real-time inference

Real-time inference is a method that enables you to trigger model inference at any time and provides an immediate response. Use this method to analyze streaming data or interactive application data.

Consider the following best practices for real-time inference:

  • Compute options. The best way to implement real-time inference is to deploy the model to an online endpoint, either a managed online endpoint or a Kubernetes online endpoint. Managed online endpoints deploy your machine learning models immediately by using CPU or GPU machines in Azure. This approach is scalable and fully managed. Kubernetes online endpoints deploy models and serve online endpoints on your fully configured and managed Kubernetes cluster. For more information, see Managed online endpoints vs. Kubernetes online endpoints.
  • Multiregional deployment and high availability. Unlike batch inference, regional deployment and high-availability design are critical concerns in real-time inference scenarios because latency and model availability directly affect the consuming application. The model and its supporting infrastructure should follow the application's high-availability and disaster recovery strategy.
  • Security requirements. Control access to the online endpoint for enhanced security:
    • Use Microsoft Entra tokens for control plane authentication. For data plane operations, key-based and token-based approaches are supported. The token-based approach is preferred because tokens expire. Use Azure role-based access control (RBAC) to restrict access and to retrieve the key or token for an online endpoint. (An example call that uses a Microsoft Entra token follows this list.)
    • Use SSL encryption on the endpoint, which is enabled by default for Machine Learning endpoint invocation.

  • Real-time integration. Integrate real-time inference with other Azure services by using SDKs for different languages and by invoking the endpoint through a REST API. You can invoke the online endpoint as part of an application's code.
  • MLflow. Use the open-source framework MLflow during model development. Machine Learning supports no-code deployment of models that you create and log with MLflow. When you deploy your MLflow model to an online endpoint, you don't need to specify a scoring script or an environment.
  • Safe rollout. Roll out updates to machine learning models in phases to ensure that the model performs as expected. Use the Machine Learning safe rollout strategy to deploy a model to an endpoint, perform testing against the model, and gradually increase the traffic to the new model. Take advantage of mirrored traffic to mirror a percentage of live traffic to the new model for extra validation. Traffic mirroring, also called shadowing, doesn't change the results that are returned to clients. All requests still flow to the original model. For more information, see Safe rollout for online endpoints.
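
The following sketch shows one way to call a managed online endpoint over REST by using a Microsoft Entra token, assuming that the endpoint is configured for Microsoft Entra token authentication. The scoring URI and the payload shape are illustrative placeholders that you replace with your endpoint's values and your model's input signature.

```python
import json
import urllib.request

from azure.identity import DefaultAzureCredential

# Request a token for the Azure Machine Learning data plane. This assumes the
# endpoint uses Microsoft Entra token authentication.
token = DefaultAzureCredential().get_token("https://ml.azure.com/.default").token

# Placeholder scoring URI and payload; the payload must match the model's
# expected input schema.
scoring_uri = "https://credit-endpoint.eastus2.inference.ml.azure.com/score"
payload = {"input_data": {"columns": ["age", "income"], "data": [[42, 55000]]}}

request = urllib.request.Request(
    scoring_uri,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    },
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```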


Other considerations

Keep these considerations in mind when you deploy machine learning models in production environments.

ONNX

To optimize the inference of your machine learning models, use Open Neural Network Exchange (ONNX). It can be a challenge to fully utilize hardware capabilities when you optimize models, particularly when you use different platforms (for example, cloud/edge or CPU/GPU). You can train a new model or convert an existing model from another format to ONNX.
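
As an illustration, the following sketch exports a small PyTorch model to ONNX and scores it with ONNX Runtime. The model architecture, file name, and input shape are placeholders; the same export-then-score flow applies to real models.

```python
import numpy as np
import onnxruntime as ort
import torch

# A placeholder model; in practice you export your trained model.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))
model.eval()

# Export to ONNX by tracing the model with a dummy input of the expected shape.
dummy_input = torch.randn(1, 4)
torch.onnx.export(
    model, dummy_input, "model.onnx", input_names=["input"], output_names=["output"]
)

# Score with ONNX Runtime. The execution provider is chosen to match the
# target hardware (CPU here; CUDA or other providers on GPU or edge targets).
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
prediction = session.run(None, {"input": np.random.rand(1, 4).astype(np.float32)})
print(prediction)
```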

Many-models scenario

A singular model might not capture the complex nature of real-world problems. For example, supermarkets have demographics, brands, SKUs, and other features that vary between regions, which makes it a challenge to create a single sales-prediction model. Similarly, regional variations can pose a challenge for a smart-meter predictive maintenance model. Use many models to capture regional data or store-level relationships to provide higher accuracy than a single model. The many-models approach assumes that enough data is available for this level of granularity.

A many-models scenario has three stages: data source, data science, and many models.

A diagram that shows the stages of the many-models scenario.

  • Data source. In the data source stage, it's important to segment data into only a few elements. For example, don't factor the product ID or barcode into the main partition because it produces too many segments and might inhibit meaningful models. The brand, SKU, or locality are more suitable elements. It's important to simplify the data by removing anomalies that might skew the data distribution.

  • Data science. In the data science stage, several experiments run parallel to each data segment. Many-models experimentation is an iterative process that evaluates models to determine the best one.

  • Many models. The best models for each segment or category are registered in the model registry. Assign meaningful names to the models to make them more discoverable for inference. Use tagging where necessary to group the models into specific categories.
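
As a sketch of this registration step, the following example registers one per-segment model with a meaningful name and tags by using the Python SDK v2 (azure-ai-ml). The path, model name, and tag values are illustrative placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Register the best model for one segment with a descriptive name and tags
# so that it's easy to find at inference time.
model = Model(
    path="./outputs/brand-a-west",       # trained artifacts for this segment
    type=AssetTypes.MLFLOW_MODEL,
    name="sales-forecast-brand-a-west",  # meaningful, segment-specific name
    tags={"brand": "brand-a", "region": "west", "granularity": "sku"},
)
ml_client.models.create_or_update(model)
```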

Batch inference for many models

For many models, predictions during batch inference run on a recurring schedule and can handle large volumes of data at the same time. Unlike a single-model scenario, inference for the many models occurs concurrently.

Many models for batch inference use multiple deployments against a single managed endpoint. To run batch inference against a specific model, invoke its deployment name in the REST or SDK call. For more information, see Deploy multiple models to one deployment.
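
The following sketch invokes one of several deployments on the same batch endpoint by passing the deployment name, by using the Python SDK v2 (azure-ai-ml). The endpoint name, deployment name, and data path are illustrative placeholders.

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Start a batch scoring job against a specific model by naming its deployment.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="sales-batch",
    deployment_name="sales-forecast-brand-a-west",  # selects the segment's model
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/salesdata/paths/2024-06/",
    ),
)
```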

Real-time inference for many models

You can deploy multiple models into a single managed online endpoint, which you can invoke via a REST API or SDK. When you create the deployments, register the multiple models as a single "registered model" on Azure. Include the multiple models in the same directory and pass that directory as the path of the single model. The models are loaded into a dictionary that's keyed on their names. When a REST request is received, the desired model is retrieved from the JSON payload, and the relevant model scores the payload.

Models that are loaded in a multi-model deployment by using this technique must share the same Python version and have no conflicting dependencies. Their libraries must be imported at the same time, even if they don't strictly have the same dependencies.
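
The following scoring script is a minimal sketch of this pattern. It assumes the registered model's directory contains a models subfolder of pickled scikit-learn models and that the request payload names the model to use in a "model" field; the folder layout and payload keys are illustrative assumptions.

```python
# score.py for a multi-model online deployment: load every model in the
# registered model directory into a dictionary keyed by name, then let each
# request select which model scores the payload.
import json
import os

import joblib


def init():
    global models
    # AZUREML_MODEL_DIR points to the root of the registered model directory.
    model_dir = os.path.join(os.environ["AZUREML_MODEL_DIR"], "models")
    models = {
        os.path.splitext(file_name)[0]: joblib.load(os.path.join(model_dir, file_name))
        for file_name in os.listdir(model_dir)
        if file_name.endswith(".pkl")
    }


def run(raw_data):
    payload = json.loads(raw_data)
    model = models[payload["model"]]  # model name taken from the JSON payload
    predictions = model.predict(payload["data"])
    return predictions.tolist()
```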

Next steps