Wednesday, December 28, 2016

DynamoDB restore using Data Pipeline with a newer EMR version

We faced an interesting issue here.

We had a task to restore a DynamoDB table from a backup stored in S3. We had taken the backup using AWS Data Pipeline. But when we started the restore, something we had been doing for a long time, we found the cluster could not be provisioned and was failing during bootstrapping.
Digging deeper, we found that the AWS DynamoDB import template for Data Pipeline uses the default subnet under the default VPC. We don't have an Internet Gateway attached to that VPC, and since EMR needs an Internet Gateway to provision successfully, the pipeline was failing with an "internal error".
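As a quick sanity check on your own side (this is not part of the AWS template), the AWS CLI can tell you whether a VPC has any Internet Gateway attached; the VPC ID below is a placeholder:
--
# vpc-xxxxxxxx is a placeholder; an empty result means no Internet Gateway
# is attached, so the default EMR provisioning in that VPC will fail.
aws ec2 describe-internet-gateways \
    --filters Name=attachment.vpc-id,Values=vpc-xxxxxxxx \
    --query 'InternetGateways[].InternetGatewayId'
--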

We started talking to AWS support. After some discussion, we pointed the pipeline at one of our private subnets. Again the EMR provisioning failed, because the DynamoDB restore template provided by AWS uses AMI version 3.9.0, which does not support private subnets.

So we decided to change AMI version 3.9.0 to release label "emr-4.5.0", which we had been using for all our other EMR clusters so far. Again we failed, this time with the error:
Unable to create resource for @EmrClusterForLoad_2016-12-28T20:33:21 due to: The supplied bootstrap action(s): 'bootstrap-action.7237c1e1-31de-4c02-ae68-c546dd581732' are not supported by release 'emr-4.5.0'. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: e8be350e-cd3c-11e6-8e60-cb10b4c3228c)

That is, the template script provided by AWS does not support EMR release label 4.5.0. To overcome the problem, we had to modify the EmrCluster bootstrap action in the pipeline definition, which was:
s3://#{myDDBRegion}.elasticmapreduce/bootstrap-actions/configure-hadoop, --mapred-key-value,mapreduce.map.speculative=false
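On AMI 3.x, that action rides directly on the EmrCluster object via the bootstrapAction field. A minimal sketch of roughly what the template's cluster object looks like (field names from the Data Pipeline EmrCluster reference; the trailing fields are elided and illustrative):
--
{
    "name": "EmrClusterForLoad",
    "id": "EmrClusterForLoad",
    "type": "EmrCluster",
    "amiVersion": "3.9.0",
    "bootstrapAction": "s3://#{myDDBRegion}.elasticmapreduce/bootstrap-actions/configure-hadoop, --mapred-key-value,mapreduce.map.speculative=false",
    ...
}
--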

This bootstrap-action style is only supported on AMI 3.9.0. For release label emr-4.5.0, the same setting must be supplied as configuration properties, as follows:
--
        {
            "configuration": {
                "ref": "EmrConfigurationId_XXWNE"
            },
            "releaseLabel": "emr-4.5.0",
            "type": "EmrCluster",
            ...
        },
        {
            "property": {
                "ref": "PropertyId_3ghq7"
            },
            "type": "EmrConfiguration",
            "id": "EmrConfigurationId_XXWNE",
            "classification": "mapred-site",
            "name": "DefaultEmrConfiguration1"
        },
        {
            "key": "mapreduce.map.speculative",
            "type": "Property",
            "id": "PropertyId_3ghq7",
            "value": "false",
            "name": "DefaultProperty1"
        },
--
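Data Pipeline translates the EmrConfiguration/Property pair above into EMR's configuration API, so the mapred-site classification ends up in mapred-site.xml on the cluster. If you want to confirm the setting actually landed, one way (assuming you grab the ID of the cluster the pipeline spins up from the EMR console) is:
--
# j-XXXXXXXXXXXXX is a placeholder for the cluster launched by the pipeline.
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Configurations'
--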

Next, we exported the pipeline definition and added the above configuration. The final pipeline definition looked like this:

{
  "objects": [
    {
      "property": [
        {
          "ref": "PropertyId_3ghq7"
        }
      ],
      "name": "DefaultEmrConfiguration1",
      "id": "EmrConfigurationId_XXWNE",
      "type": "EmrConfiguration",
      "classification": "mapred-site"
    },
    {
      "output": {
        "ref": "DDBDestinationTable"
      },
      "input": {
        "ref": "S3InputDataNode"
      },
      "maximumRetries": "1",
      "name": "TableLoadActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbImport,#{input.directoryPath},#{output.tableName},#{output.writeThroughputPercent}",
      "runsOn": {
        "ref": "EmrClusterForLoad"
      },
      "id": "TableLoadActivity",
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "subnetId": "subnet-xxxxxxx",
      "name": "EmrClusterForLoad",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-4.5.0",
      "id": "EmrClusterForLoad",
      "masterInstanceType": "m3.xlarge",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster",
      "terminateAfter": "23 Hours",
      "configuration": {
                "ref": "EmrConfigurationId_XXWNE"
            }
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "PSS-BDP-QA-DataPipelineDefaultResourceRole",
      "pipelineLogUri": "s3://pss-bdp-qa-logfiles/datapipeline-logs/PSS-BDP-SQA-Dynamodb-Import-1/",
      "role": "PSS-BDP-DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "writeThroughputPercent": "#{myDDBWriteThroughputRatio}",
      "name": "DDBDestinationTable",
      "id": "DDBDestinationTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "directoryPath": "#{myInputS3Loc}",
      "name": "S3InputDataNode",
      "id": "S3InputDataNode",
      "type": "S3DataNode"
    },
    {
      "key": "mapreduce.map.speculative",
      "type": "Property",
      "id": "PropertyId_3ghq7",
      "value": "false",
      "name": "DefaultProperty1"
    }
  ],
  "parameters": [
    {
      "description": "Input S3 folder",
      "id": "myInputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Target DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB write throughput ratio",
      "id": "myDDBWriteThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "TABLE_TEST",
    "myDDBWriteThroughputRatio": "1",
    "myInputS3Loc": "s3://my-dynamobackup/TABLE_TEST_201609/2016-12-22-22-55-57"
  }
}
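With the definition saved locally, re-importing and re-running the pipeline can also be done from the AWS CLI; the pipeline ID and file name below are placeholders:
--
# df-XXXXXXXXXXXXX and pipeline.json are placeholders for the pipeline ID
# and the edited definition exported above.
aws datapipeline put-pipeline-definition \
    --pipeline-id df-XXXXXXXXXXXXX \
    --pipeline-definition file://pipeline.json
aws datapipeline activate-pipeline --pipeline-id df-XXXXXXXXXXXXX
--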