Introducing self-managed data sources for Amazon OpenSearch Ingestion



Enterprise customers increasingly adopt Amazon OpenSearch Ingestion (OSI) to bring data into Amazon OpenSearch Service for various use cases. These include petabyte-scale log analytics, real-time streaming, security analytics, and searching semi-structured key-value or document data. OSI makes it simple, with straightforward integrations, to ingest data from many AWS services, including Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon DocumentDB (with MongoDB compatibility).

Today we’re announcing support for ingesting data from self-managed OpenSearch/Elasticsearch and Apache Kafka clusters. These sources can be either on Amazon Elastic Compute Cloud (Amazon EC2) or in on-premises environments.

In this post, we outline the steps to get started with these sources.

Solution overview

OSI supports the AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, the AWS Command Line Interface (AWS CLI), Terraform, AWS APIs, and the AWS Management Console to deploy pipelines. In this post, we use the console to demonstrate how to create a self-managed Kafka pipeline.

Prerequisites

To make sure OSI can connect and read data successfully, the following conditions should be met:

  • Network connectivity to data sources – OSI is generally deployed in a public network, such as the internet, or in a virtual private cloud (VPC). OSI deployed in a customer VPC can access data sources in the same or a different VPC, and on the internet with an attached internet gateway. If your data sources are in another VPC, common methods for network connectivity include direct VPC peering, using a transit gateway, or using customer managed VPC endpoints powered by AWS PrivateLink. If your data sources are in your corporate data center or another on-premises environment, common methods for network connectivity include AWS Direct Connect and using a network hub like a transit gateway. The following diagram shows a sample configuration of OSI running in a VPC and using Amazon OpenSearch Service as a sink. OSI runs in a service VPC and creates an elastic network interface (ENI) in the customer VPC. For self-managed data sources, these ENIs are used for reading data from the on-premises environment. OSI creates a VPC endpoint in the service VPC to send data to the sink.
  • Name resolution for data sources – OSI uses an Amazon Route 53 resolver. This resolver automatically answers queries to names local to a VPC, public domains on the internet, and records hosted in private hosted zones. If you’re using a private hosted zone, make sure you have a DHCP option set enabled, attached to the VPC using AmazonProvidedDNS as the domain name server. For more information, see Work with DHCP option sets. Additionally, you can use resolver inbound and outbound endpoints if you need a complex resolution scheme with conditions that go beyond a simple private hosted zone.
  • Certificate verification for data source names – OSI supports only SASL_SSL for transport for the Apache Kafka source. Within SASL, Amazon OpenSearch Service supports most authentication mechanisms, such as PLAIN, SCRAM, IAM, GSSAPI, and others. When using SASL_SSL, make sure you have access to the certificates needed for OSI to authenticate. For self-managed OpenSearch data sources, make sure verifiable certificates are installed on the clusters. Amazon OpenSearch Service doesn’t support insecure communication between OSI and OpenSearch. Certificate verification can’t be turned off. In particular, the “insecure” configuration option is not supported.
  • Access to AWS Secrets Manager – OSI uses AWS Secrets Manager to retrieve credentials and certificates needed to communicate with self-managed data sources. For more information, see Create and manage secrets with AWS Secrets Manager.
  • IAM role for pipelines – You need an AWS Identity and Access Management (IAM) pipeline role to write to data sinks. For more information, see Identity and Access Management for Amazon OpenSearch Ingestion.
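
The name resolution and certificate verification prerequisites can be checked ahead of time from a host in the same VPC. The following is a minimal sketch, not part of OSI itself: it uses only the Python standard library, and the function names and the optional `ca_file` parameter are assumptions made for illustration. It resolves the source host name and then opens a TLS connection with certificate and host name verification enabled, which is the behavior OSI requires.

```python
import socket
import ssl
from typing import Optional


def strict_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Return a client context that verifies the certificate chain and host name.

    ca_file is a hypothetical path to a private CA bundle; leave it as None
    to rely on the system trust store.
    """
    context = ssl.create_default_context(cafile=ca_file)
    # These are already the defaults; setting them documents the intent.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def check_endpoint(host: str, port: int, timeout: float = 5.0) -> Optional[str]:
    """Resolve host, open a verified TLS connection, and return the TLS version.

    Raises socket.gaierror if the name does not resolve and
    ssl.SSLCertVerificationError if the certificate cannot be verified.
    """
    socket.getaddrinfo(host, port)  # name resolution step
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with strict_tls_context().wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

For example, `check_endpoint("node-0.example.com", 9092)` run from an instance in the pipeline's VPC would surface DNS or certificate problems before the pipeline is created. A successful check from one host does not guarantee that the pipeline's own network path is open.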

Create a pipeline with self-managed Kafka as a source

After you complete the prerequisites, you’re ready to create a pipeline for your data source. Complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.
  3. Choose Streaming under Use case in the navigation pane.
  4. Select Self managed Apache Kafka under Ingestion pipeline blueprints and choose Select blueprint.

This will populate a sample configuration for this pipeline.

  5. Provide a name for this pipeline and choose the appropriate pipeline capacity.
  6. Under Pipeline configuration, provide your pipeline configuration in YAML format. The following code snippet shows a sample configuration in YAML for SASL_SSL authentication:
    version: '2'
    kafka-pipeline:
      source:
        kafka:
          acknowledgments: true
          bootstrap_servers:
            - 'node-0.example.com:9092'
          encryption:
            type: "ssl"
            certificate: '${{aws_secrets:kafka-cert}}'
          authentication:
            sasl:
              plain:
                username: '${{aws_secrets:secrets:username}}'
                password: '${{aws_secrets:secrets:password}}'
          topics:
            - name: on-prem-topic
              group_id: osi-group-1
      processor:
        - grok:
            match:
              message:
                - '%{COMMONAPACHELOG}'
        - date:
            destination: '@timestamp'
            from_time_received: true
      sink:
        - opensearch:
            hosts: ["https://search-domain-12345567890.us-east-1.es.amazonaws.com"]
            aws:
              region: us-east-1
              sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
            index: "on-prem-kafka-index"
    extension:
      aws:
        secrets:
          kafka-cert:
            secret_id: kafka-cert
            region: us-east-1
            sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'
          secrets:
            secret_id: secrets
            region: us-east-1
            sts_role_arn: 'arn:aws:iam::123456789012:role/pipeline-role'

  7. Choose Validate pipeline and confirm there are no errors.
  8. Under Network configuration, choose Public access or VPC access. (For this post, we choose VPC access.)
  9. If you chose VPC access, specify your VPC, subnets, and an appropriate security group so OSI can reach the outgoing ports for the data source.
  10. Under VPC attachment options, select Attach to VPC and choose an appropriate CIDR range.

OSI resources are created in a service VPC managed by AWS that is separate from the VPC you chose in the last step. This selection allows you to configure which CIDR ranges OSI should use inside this service VPC. The choice exists so you can make sure there is no address collision between the CIDR ranges in your VPC that is attached to your on-premises network and this service VPC. Many pipelines in your account can share the same CIDR ranges for this service VPC.
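
The address collision check can be done before choosing a range. The following sketch uses Python's standard `ipaddress` module; the function name and the example ranges are hypothetical and only illustrate the comparison between a candidate service VPC range and the ranges already in use in your VPC and on-premises network.

```python
import ipaddress
from typing import Iterable, List


def find_collisions(candidate_cidr: str, existing_cidrs: Iterable[str]) -> List[str]:
    """Return the existing CIDR ranges that overlap the candidate range."""
    candidate = ipaddress.ip_network(candidate_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if candidate.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical ranges: a customer VPC and an on-premises network.
in_use = ["10.0.0.0/16", "192.168.0.0/16"]

print(find_collisions("10.0.1.0/24", in_use))    # overlaps the customer VPC
print(find_collisions("10.99.20.0/24", in_use))  # no overlap
```

An empty result means the candidate range does not overlap any range in the list. It says nothing about ranges you did not include, such as networks reachable through a transit gateway.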

  11. Specify any optional tags and log publishing options, then choose Next.
  12. Review the configuration and choose Create pipeline.

You can monitor the pipeline creation and any log messages in the Amazon CloudWatch Logs log group you specified. Your pipeline should now be successfully created. For more information about how to provision capacity for the performance of this pipeline, see the section Recommended Compute Units (OCUs) for the MSK pipeline in Introducing Amazon MSK as a source for Amazon OpenSearch Ingestion.

Create a pipeline with self-managed OpenSearch as a source

The steps for creating a pipeline for self-managed OpenSearch are similar to the steps for creating one for Kafka. During the blueprint selection, choose Data Migration under Use case and select Self managed OpenSearch/Elasticsearch. OpenSearch Ingestion can source data from all versions of OpenSearch, and from Elasticsearch version 7.0 to version 7.10.
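
When checking a source cluster against the supported Elasticsearch range, compare version numbers numerically rather than as strings, because "7.10" sorts before "7.9" as text. The following is a small sketch of that check; the function names are hypothetical, and it assumes plain dotted numeric version strings such as "7.10.2" with no suffix.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '7.10.2' into (7, 10, 2)."""
    return tuple(int(part) for part in text.split("."))


def is_supported_elasticsearch(version: str) -> bool:
    """Return True if the major.minor version falls within 7.0 to 7.10."""
    major_minor = parse_version(version)[:2]
    return (7, 0) <= major_minor <= (7, 10)


for candidate in ["6.8.23", "7.9.3", "7.10.2", "7.11.0"]:
    print(candidate, is_supported_elasticsearch(candidate))
```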

The following blueprint shows a sample configuration YAML for this data source:

version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      acknowledgments: true
      hosts: [ "https://node-0.example.com:9200" ]
      username: "${{aws_secrets:secret:username}}"
      password: "${{aws_secrets:secret:password}}"
      indices:
        include:
          - index_name_regex: "opensearch_dashboards_sample_data*"
        exclude:
          # Skip system indexes, whose names start with a dot
          - index_name_regex: '\..*'
  sink:
    - opensearch:
        hosts: [ "https://search-domain-12345567890.us-east-1.es.amazonaws.com" ]
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
          region: "us-east-1"
        index: "on-prem-os"
extension:
  aws:
    secrets:
      secret:
        secret_id: "self-managed-os-credentials"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/pipeline-role"
        refresh_interval: PT1H
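
Both sample configurations pull credentials through `${{aws_secrets:...}}` references that must match a name declared under the extension section. The following sketch is a hypothetical pre-flight check, not an OSI feature: it scans the configuration text for references and reports any secret name that is not in a list you supply. The regular expression assumes only the two reference shapes shown in this post, `${{aws_secrets:name}}` and `${{aws_secrets:name:key}}`.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple

# Matches ${{aws_secrets:name}} and ${{aws_secrets:name:key}}.
SECRET_REF = re.compile(r"\$\{\{aws_secrets:([^:}]+)(?::([^}]+))?\}\}")


def secret_references(config_text: str) -> Set[Tuple[str, Optional[str]]]:
    """Return (secret_name, key) pairs; key is None when no key is given."""
    return {(m.group(1), m.group(2)) for m in SECRET_REF.finditer(config_text)}


def undeclared_secrets(config_text: str, declared: Iterable[str]) -> List[str]:
    """Return referenced secret names that are missing from declared."""
    referenced = {name for name, _ in secret_references(config_text)}
    return sorted(referenced - set(declared))


sample = """
username: "${{aws_secrets:secret:username}}"
password: "${{aws_secrets:secret:password}}"
certificate: '${{aws_secrets:kafka-cert}}'
"""

# Only "secret" is declared here, so "kafka-cert" is reported as missing.
print(undeclared_secrets(sample, ["secret"]))
```

The check only compares names. It does not confirm that the secret exists in AWS Secrets Manager or that the pipeline role can read it.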

Considerations for self-managed OpenSearch data source

Certificates installed on the OpenSearch cluster need to be verifiable for OSI to connect to this data source before reading data. Insecure connections are currently not supported.

After you’re connected, make sure the cluster has sufficient read bandwidth to allow OSI to read data. Use the Min and Max OCU settings to limit OSI read bandwidth consumption. Your read bandwidth will vary depending on data volume, number of indexes, and provisioned OCU capacity. Start small and increase the number of OCUs to balance available bandwidth against acceptable migration time.

This source is generally meant for one-time migration of data and not for continuous ingestion to keep data in sync between data sources and sinks.

OpenSearch Service domains support remote reindexing, but that consumes resources in your domains. Using OSI moves this compute out of the domain, and OSI can achieve significantly higher bandwidth than remote reindexing, resulting in faster migration times.

OSI doesn’t currently support deferred replay or traffic recording; refer to Migration Assistant for Amazon OpenSearch Service if your migration needs these capabilities.

Conclusion

In this post, we introduced self-managed sources for OpenSearch Ingestion that enable you to ingest data from corporate data centers or other on-premises environments. OSI also supports various other data sources and integrations. Refer to Working with Amazon OpenSearch Ingestion pipeline integrations to learn about these other data sources.


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.
