Big Data Platform Migration – Challenges

Much like a Data Center migration (P2V, V2V, lift & shift, etc.), a big data platform migration comes with many challenges, risks and issues that can affect its delivery.

Below are some of the main challenges I recently faced working on a Hadoop Data Lake Migration.

  1. Lack of Communication
    Preparing for a migration and running the event itself requires an excellent communication strategy and well-established channels.
    For instance, during a migration weekend it is critical to keep business testers updated in case of delays or technical difficulties, so that they know when they are expected to perform their tasks. This is even more important when you ask business resources to work out of hours or at the weekend, especially if they have to do it from the company office and therefore arrange travel on a Saturday, for instance.
    Remediation: contact testers at least 60 minutes before they are due to start their work, and again to update them on any delays.
  2. Connectivity 
    If you are migrating your cluster into an environment behind a firewall, a huge effort will go into network security and connectivity. The key activities are gathering connectivity requirements from the application and interface teams (inbound and outbound to the lake) and implementing the firewall rules required to allow traffic to flow. It might sound easy, but in reality it is not: such teams often do not know their own network requirements well, extra requirements can come in at the last minute, and other unexpected issues arise.
    Remediation: plan a connectivity test event where you verify in advance that all the firewall rules raised, CNAMEs and load balancer connections work as expected; you cannot afford to execute the migration cutover without first being sure that each application can connect to the new cluster (see the connectivity-check sketch after this list).
  3. HDFS: DistCp Data Copy
    Copying large volumes of data can take days and has to be planned and executed flawlessly.

    HDFS directories can contain a huge number of sub-directories and files. Attempting to copy them without a reliable tool that keeps track of progress and reports any copy failure will leave you with an incomplete data copy.
    Remediation: validate all directories and files before the migration and cutover event (see the HDFS count-comparison sketch after this list). If you are using DistCp with Falcon, implement a notification script that alerts on any failed data copy.

  4. Kafka: MirrorMaker Performance
    Kafka’s mirroring feature (MirrorMaker) makes it possible to maintain a replica of an existing Kafka cluster; however, you need to be careful about how you configure and size the number of MirrorMaker processes in relation to the number of topics and partitions. A wrong configuration, or a large volume of data being copied and replicated, can cause performance issues on the nodes (see the mirroring progress sketch after this list).

    [Figure: Kafka mirroring]

  5. Data Validation
    A prioritized list of tables should be verified during the migration window, for example by comparing row counts between the old and new clusters (see the row-count sketch after this list).
    Remediation: this list should be agreed with the Big Data team in advance.
  6. Data Rebuild
    In your migration runbook, do not forget to include a time buffer for data rebuilds and re-indexing (e.g. rebuilding Solr collections); a status check like the Solr sketch after this list helps confirm when the rebuild has completed.
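
Example Checks – Illustrative Sketches

The Python snippets below are rough sketches of the checks mentioned above, not production tooling; all hostnames, ports, paths, table names and collection names are hypothetical placeholders that you would replace with your own.

Connectivity check (challenge 2): a minimal TCP reachability test you can run before the cutover, assuming the new cluster endpoints are already behind their firewall rules, CNAMEs and load balancers.

```python
import socket

# Hypothetical endpoints that must be reachable once the firewall rules,
# CNAMEs and load balancer entries for the new cluster are in place.
ENDPOINTS = [
    ("hive-lb.newcluster.example.com", 10000),   # HiveServer2 via load balancer
    ("namenode1.newcluster.example.com", 8020),  # HDFS NameNode RPC
    ("broker1.newcluster.example.com", 9092),    # Kafka broker
]

def check_endpoint(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        status = "OK" if check_endpoint(host, port) else "FAILED"
        print(f"{host}:{port} -> {status}")
```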
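
HDFS count-comparison (challenge 3): a sketch that compares directory, file and byte counts between the source and target clusters using the standard hdfs dfs -count command. It assumes both NameNodes are reachable from the node running it, and it is only a coarse signal; DistCp's own update and checksum options remain the primary safeguard.

```python
import subprocess

# Hypothetical cluster URIs and a sample of the directories copied with DistCp.
SOURCE = "hdfs://old-namenode.example.com:8020"
TARGET = "hdfs://new-namenode.example.com:8020"
PATHS = ["/data/raw/sales", "/data/curated/customers"]

def hdfs_count(cluster_uri, path):
    """Return (dir_count, file_count, content_size) from 'hdfs dfs -count'."""
    out = subprocess.check_output(["hdfs", "dfs", "-count", cluster_uri + path], text=True)
    dirs, files, size, _ = out.split(None, 3)
    return int(dirs), int(files), int(size)

if __name__ == "__main__":
    for path in PATHS:
        src = hdfs_count(SOURCE, path)
        tgt = hdfs_count(TARGET, path)
        status = "MATCH" if src == tgt else "MISMATCH"
        print(f"{path}: source={src} target={tgt} -> {status}")
```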
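
Mirroring progress (challenge 4): a rough way to see whether MirrorMaker has caught up is to compare approximate message counts (end offset minus beginning offset per partition) on the two clusters. The sketch assumes the kafka-python client, that the target topic started empty and that source retention has not expired data in the meantime, so treat the numbers as an indication only.

```python
from kafka import KafkaConsumer, TopicPartition  # assumes the kafka-python package

# Hypothetical bootstrap servers and a topic replicated by MirrorMaker.
SOURCE_BOOTSTRAP = "old-broker1.example.com:9092"
TARGET_BOOTSTRAP = "new-broker1.example.com:9092"
TOPIC = "transactions"

def approx_message_count(bootstrap, topic):
    """Sum (end offset - beginning offset) over all partitions of the topic."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    begin = consumer.beginning_offsets(partitions)
    end = consumer.end_offsets(partitions)
    consumer.close()
    return sum(end[tp] - begin[tp] for tp in partitions)

if __name__ == "__main__":
    src = approx_message_count(SOURCE_BOOTSTRAP, TOPIC)
    tgt = approx_message_count(TARGET_BOOTSTRAP, TOPIC)
    print(f"{TOPIC}: source~{src} target~{tgt} behind~{src - tgt}")
```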
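
Row-count comparison (challenge 5): for the prioritized table list, a simple COUNT(*) comparison between the two clusters can be scripted around beeline. The JDBC URLs and table names are placeholders, and on a kerberized cluster the connection strings and the output parsing may need adjusting.

```python
import subprocess

# Hypothetical HiveServer2 JDBC URLs and the prioritized tables agreed with the Big Data team.
OLD_JDBC = "jdbc:hive2://old-hive.example.com:10000/default"
NEW_JDBC = "jdbc:hive2://new-hive.example.com:10000/default"
TABLES = ["sales.orders", "crm.customers"]

def row_count(jdbc_url, table):
    """Run a COUNT(*) through beeline and return it as an integer."""
    out = subprocess.check_output(
        ["beeline", "-u", jdbc_url, "--silent=true", "--showHeader=false",
         "--outputformat=tsv2", "-e", f"SELECT COUNT(*) FROM {table}"],
        text=True,
    )
    return int(out.strip().splitlines()[-1])

if __name__ == "__main__":
    for table in TABLES:
        old_count = row_count(OLD_JDBC, table)
        new_count = row_count(NEW_JDBC, table)
        status = "MATCH" if old_count == new_count else "MISMATCH"
        print(f"{table}: old={old_count} new={new_count} -> {status}")
```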
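
Solr rebuild status (challenge 6): when budgeting time for collection rebuilds, you can poll the Solr Collections API CLUSTERSTATUS action until every replica reports active. The Solr URL and collection name are placeholders.

```python
import json
import time
import urllib.request

# Hypothetical Solr endpoint and the collection being rebuilt after the migration.
SOLR_URL = "http://solr1.newcluster.example.com:8983/solr"
COLLECTION = "customer_index"

def all_replicas_active(solr_url, collection):
    """Query the Collections API CLUSTERSTATUS and check every replica state."""
    url = f"{solr_url}/admin/collections?action=CLUSTERSTATUS&collection={collection}&wt=json"
    with urllib.request.urlopen(url) as resp:
        status = json.load(resp)
    shards = status["cluster"]["collections"][collection]["shards"]
    return all(
        replica["state"] == "active"
        for shard in shards.values()
        for replica in shard["replicas"].values()
    )

if __name__ == "__main__":
    while not all_replicas_active(SOLR_URL, COLLECTION):
        print(f"{COLLECTION}: rebuild/recovery still in progress, waiting...")
        time.sleep(60)
    print(f"{COLLECTION}: all replicas active")
```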