Arun Kumar - Data Engineer
[email protected] | Location: Dallas, Texas, USA | Relocation: Yes | Visa: H1B
- Verifiable experience in leveraging data lakes, cloud platforms & BI layers
- Excellent track record of managing high-level engagements within the AI, ML and Data Science domains across geographies
- Solid exposure to leading legacy application revamps based on thorough assessments of new requirements
- Verifiable experience in ETL data processing through Hadoop native tools / NiFi / Informatica BDM
- Solid exposure to real-time data processing with Kafka & MQTT
- Handled many data migrations from multiple source systems (IBM mainframe, MySQL, Oracle) to Hadoop through Sqoop
- Solid experience with data orchestration tools such as Apache Oozie
- Automated many processes / data pipelines through scripting and Jenkins
- Supported ModelOp image deployments and created images for data science teams based on their requirements
- Awards: Hackathon Winner (Fidelity Investments, 2024) | Best Engineering Team Award (Emirates NBD, 2019) | Best Performer Award (Athena Technologies, 2015 & 2016)

CAREER TIMELINE
Data Engineer | Solution IT INC | Feb 2022 - Present
Senior Hadoop Developer | D4 Insight Tech | Jul 2018 - Feb 2022
Module Lead | MindTree | Feb 2018 - Jun 2018
Senior Software Engineer (Cloud) | WiSilica India Pvt. Ltd. | May 2017 - Nov 2017
Freelancer | Nov 2016 - May 2017
Software Engineer | Athena Technology Solutions | Aug 2014 - Nov 2016
Software Engineer | Avolot Technologies | Jul 2012 - Aug 2014

Projects Handled:
SMAP-LRC Fraud Analysis | DPS API (Solution IT INC)
Finacle Core Consolidation | Legacy Applications Archival Phase 2 (D4 Insight Tech)
Telecom Data Analytics (Freelancing)
Ghar Value (Athena Technology Solutions)
IoT Device Tracking (WiSilica India)

TECHNICAL PURVIEW
Languages: Python, SQL
Big Data / Hadoop technologies: Snowflake, Apache Hadoop (MR1, MR2), Apache Spark, Kafka, Apache NiFi, Flume, Knox, Ranger, Atlas, Hive, Impala, HBase, Impala-Kudu, Sqoop, Oozie workflows, HDFS, YARN, Informatica BDM
Cloud Services: GCP, AWS
Cloud Tools: Amazon EC2, Amazon EMR, ELB, AWS Lambda, Amazon Redshift & S3
Azure Services: Azure Data Factory, Azure SQL, Azure Data Lake
GCP Services: Cloud SQL, Cloud Functions, Cloud Storage, BigQuery
Message Broker Tools: Kafka, MQTT
Operating Systems: Mac, Linux & Windows
Databases: NoSQL (HBase), SQL (Hive, Impala, MySQL), Presto DB
CI/CD tools: Jenkins, Docker, Kubernetes

ACADEMIC CREDENTIALS
Bachelor (Computer Engineering), 65% | Indira Ganesan Engineering College, Anna University (Trichy, IN)
Diploma (Computer Technology), 81% | P.T. Lee. Cengal Varayan Polytechnic College, Anna University (Chennai, IN)
Certifications: AWS Certified Big Data - Specialty | Snowflake Pro certification | Kubernetes Certification (CKA)

PROFESSIONAL EXPERIENCE
Senior Data Engineer
Clients: Fidelity Investments (Feb 2023 - Present)
Key Deliverables:
- Developed a real-time generic alert structure process for easier analysis with Kafka, Python & Snowflake.
- Automated the rehydration process for Docker and ModelOps images through Jenkins and the SNOW API.
- Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 Parquet/text files into AWS Redshift.
- Experience building data pipelines for real-time streaming data and data analytics using Azure cloud components such as Azure Data Factory, HDInsight (Spark cluster), Azure ML Studio, Azure Stream Analytics, Azure Blob Storage, Microsoft SQL DB and Neo4j (graph DB).
- Managed MongoDB clusters for high availability and disaster recovery.
- Designed and implemented scalable data pipelines using AWS Glue and AWS Lambda to automate ETL processes.
- Designed and implemented robust data pipelines using DBT, improving data processing efficiency and reliability.
- Experienced in developing end-to-end automation using Selenium WebDriver/RC/IDE/Grid, Unittest/Pytest, Jenkins, Gherkin/Cucumber, Robot, Allure reporting, RESTful APIs and Postman.
- Proficient in designing and implementing data integration pipelines using StreamSets Data Collector to efficiently manage and process data flows.
- Proficient in writing and debugging complex Linux shell scripts for task automation.
- Collaborated with data scientists to integrate machine learning workflows into Dagster pipelines.
- Designed and implemented data workflows using Prefect, ensuring robust and reliable data pipelines.
- Conducted data cleaning, transformation and analysis using Pandas, PySpark and Databricks notebooks.
- Experience building ETL data pipelines on Azure Databricks leveraging PySpark and Spark SQL.
- Integrated Druid with Hive for high availability and provided data for SLA reporting on real-time data.
- Extensive experience developing Kafka producers and consumers that stream millions of events per minute using PySpark, Python & Spark Streaming (see the streaming sketch at the end of this client section).
- Proficient in Java programming for data processing and ETL pipeline development.
- Integrated Scala-based applications with Apache Spark to perform large-scale data processing and transformations.
- Proficient in utilizing data structures such as arrays, linked lists, stacks, queues, trees and graphs to optimize data storage and retrieval.
- Designed and implemented real-time data processing pipelines using Apache Flink, achieving sub-second latency for data ingestion and analysis.
- Hands-on experience with Unified Data Analytics on Databricks: the Databricks Workspace user interface, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.
- Coordinated with multiple stakeholders to ensure that test data needs are addressed proactively, testing critical business systems with production-like data that does not contain any PHI/PII.
- Curated data sourced from the lake into Databricks across environments and performed Delta loads; strong data engineering experience in Spark and Azure Databricks, running notebooks using ADF.
- Wrote and optimized complex SQL queries to retrieve and manipulate data from relational databases.
- Implemented Oracle PL/SQL packages, triggers and views to support business requirements.
- Leveraged Prefect's dynamic task mapping to handle complex data dependencies and workflows.
- Implemented real-time data streaming and processing solutions using Amazon Kinesis and AWS Lambda.
- Developed and maintained documentation for data architecture, including data flow diagrams and metadata repositories.
- Expertise in Core Java, data structures, algorithms and object-oriented design (OOD), including OOP concepts, the Collections Framework, exception handling, the I/O system and multithreading.
- Designed, developed, deployed and maintained large-scale Tableau dashboards for Product Insight, Devices and Networking, and Cox Premise Equipment.
- Developed the ETL and front-end modules for the Investigation Workbench project.
- Supported production deployments and Docker image promotions.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Wrote and optimized advanced SQL queries, including joins, subqueries and window functions, for PostgreSQL.
- Customized Dagster configurations to optimize resource usage and processing speed for large datasets.
- Designed, developed and maintained ETL processes using Informatica PowerCenter to integrate data from various source systems into data warehouses.
- Configured and managed Talend projects, repositories and job execution plans.
- Collaborated with cross-functional teams to integrate Databricks workflows with other AWS services for seamless data processing.
- Expertise in using Looker Studio to create insightful and interactive data visualizations and dashboards.
- Optimized MongoDB queries for performance improvements.
- Conducted performance benchmarking and optimization of data structures in large-scale distributed systems.
- Developed ETL pipelines using Python to extract, transform and load data from various sources.
- Developed complex SQL queries and stored procedures in Snowflake to support data analytics and reporting.
- Conducted thorough testing and debugging of DBT models to ensure high-quality data outputs and accurate analytics.
- Developed and implemented historical and incremental loads using Databricks & Delta Lake, run through ADF pipelines.
- Hands-on experience with Azure cloud services (PaaS & IaaS): Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault and Azure Data Lake.
- Working as a data engineer with a strong background in big data technologies such as Hive, Scala and Spark integrated with Java 8.
- Monitored and tuned MongoDB database performance using tools such as MMS and Ops Manager.
- Developed and optimized data processing jobs using Apache Spark, reducing ETL job runtimes significantly.
- Developed data pipelines using Talend Cloud ETL and AWS services such as Lambda and S3.
- Used PySpark jobs running on a Kubernetes cluster for faster data processing.
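Illustrative sketch only, not part of the original deliverables: a minimal PySpark Structured Streaming job of the kind described in the Kafka bullets above. The broker address, topic name, alert schema and output paths are assumed placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Minimal sketch of the Kafka -> Spark Structured Streaming pattern referenced above.
# Broker address, topic name, schema fields and output paths are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("generic-alert-stream").getOrCreate()

# Assumed generic alert schema (hypothetical fields).
alert_schema = StructType([
    StructField("alert_id", StringType()),
    StructField("source_system", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# Requires the spark-sql-kafka connector package on the Spark classpath.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "alerts.generic")               # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; parse the JSON value into columns.
alerts = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), alert_schema).alias("a"))
    .select("a.*")
)

# Land the parsed stream as Parquet with checkpointing for fault-tolerant file output.
query = (
    alerts.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/alerts/")                     # placeholder path
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/alerts/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```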
Clients: Johnson & Johnson (Feb 2022 - Feb 2023)
Key Deliverables:
- Created Robot Framework test cases for the DPS API at J&J.
- Working as a data engineer with a strong background in big data technologies such as Hive, Scala and Spark integrated with Java 8.
- Developed data pipelines using Talend Cloud ETL and AWS services such as Lambda and S3.
- Experience using Kafka tooling for data testing.
- Good understanding of Spark architecture with Databricks and Structured Streaming.
- Set up AWS and Microsoft Azure with Databricks: Databricks Workspace for business analytics, cluster management in Databricks, and managing the machine learning lifecycle.
- Monitored and optimized data warehouse performance, resolving issues related to data loading and processing.
- Experienced in developing and executing manual and automated tests on different platforms using Python, Pytest/Unittest/Robot and the Selenium library.
- Automated failure handling and retry logic in Prefect to ensure high data pipeline uptime.
- Utilized Databricks Delta Lake for managing and optimizing large-scale data lakes with ACID transactions and schema enforcement (see the incremental-load sketch at the end of this section).
- Developed and maintained shell scripts for system administration, data processing and application deployment.
- Configured and maintained MongoDB replica sets and sharding for distributed databases.
- Optimized data storage and retrieval in Iceberg by leveraging its partitioning and indexing features.
- Experience with tree data structures (e.g., binary trees, AVL trees) to maintain sorted data and support quick search, insertion and deletion operations.
- Designed and implemented enterprise data architecture strategies, ensuring alignment with business goals and IT standards.
- Conducted data profiling and data quality analysis using PowerCenter, ensuring data accuracy and consistency across multiple platforms.
- Experience connecting Looker Studio to various data sources such as SQL databases, cloud storage and APIs.
- Integrated Apache Flink with Apache Kafka for seamless real-time data streaming and processing.
- Skilled in configuring StreamSets pipelines to ingest, transform and route data from diverse sources including databases, cloud storage and real-time streams.
- Configured and managed Apache Kafka clusters for real-time data ingestion and streaming analytics.
- Implemented real-time monitoring and alerting for data workflows in Dagster, improving operational visibility.
- Developed and optimized complex ETL pipelines in Scala, leveraging its functional programming features for cleaner and more efficient code.
- Utilized the MongoDB Aggregation Framework for complex data analysis.
- Developed and optimized complex SQL queries, stored procedures and functions for Oracle databases.
- Developed multithreaded programs using Core Java to measure system performance.
- Collaborated with data analysts and business stakeholders to understand requirements and translate them into effective DBT models.
- Experience with various shell scripting languages including Bash, KornShell and C shell.
- Familiar with deploying Python applications in cloud environments such as AWS, Azure or GCP.
- Experienced in reverse engineering existing databases to create comprehensive data models in Erwin.
- Monitored and maintained CI/CD pipelines in the Azure environment for data loads from the lake to DBX and from DBX to SQL DW.
- Working knowledge of Delta Lake.
- Developed and maintained data pipelines leveraging efficient data structures to handle high-throughput data ingestion and processing.
- Experienced in creating complex data pipelines with StreamSets, utilizing built-in processors, transformations and connectors to handle various data formats and schemas.
- Integrated Prefect with cloud storage and database systems for seamless data management.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Spearheaded the integration of new data sources, expanding the data warehouse to support additional business functions.
- Worked on NoSQL databases such as MongoDB and DocumentDB, and graph databases such as Neo4j.
- Hands-on experience with Azure analytics services: Azure Data Lake Store (ADLS), Azure Data Lake Analytics (ADLA), Azure SQL DW, Azure Data Factory (ADF), Azure Databricks (ADB), etc.
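Illustrative sketch only: a minimal Delta Lake incremental (upsert) load of the kind described in the Delta Lake bullets above. The table paths and key column are assumed placeholders, and a Delta-enabled Spark session (for example a Databricks cluster, or the delta-spark package configured locally) is assumed.

```python
# Minimal sketch of an incremental (upsert) load into a Delta table.
# Paths and the join key are illustrative placeholders; the Spark session is
# assumed to already have Delta Lake support enabled.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-incremental-load").getOrCreate()

# Incoming batch of changed records (placeholder staging location).
updates = spark.read.parquet("s3a://example-bucket/staging/customers/")

target = DeltaTable.forPath(spark, "s3a://example-bucket/delta/customers/")

# MERGE gives ACID upserts: update matched keys, insert new ones.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```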
Senior Hadoop Developer
D4 Insight Tech | Jul 2018 - Feb 2022
Highlight: Best Engineering Team Award
Client: Emirates National Bank of Dubai | Projects Handled: Finacle Core Consolidation | Legacy Applications Archival Phase 2
Key Deliverables:
- A real-time data warehouse project for Emirates National Bank of Dubai.
- Developed ETLs to consume data from Finacle systems, apply the business logic and load the results into the Hadoop ecosystem.
- Upgraded the Talend Cloud remote server and Talend Cloud Studio.
- Migrated SAS workflows to Tableau for the PACE project.
- Used Spark and Scala to develop machine learning algorithms that analyze clickstream data.
- Designed solutions for high-volume data stream ingestion, processing and low-latency data provisioning using Hadoop ecosystem tools: Hive, Pig, Sqoop, Kafka, Python, Spark, Scala, NoSQL and NiFi.
- Implemented Iceberg's time travel capabilities to provide historical data views for auditing and analysis.
- Built and maintained end-to-end data pipelines on Databricks using Python, leveraging its intuitive syntax for rapid development.
- Stayed up to date with the latest Looker Studio features and best practices to continuously improve reporting solutions.
- Utilized Dagster's solid-based architecture for modular and reusable data transformations.
- Implemented CI/CD pipelines for DBT projects using tools such as GitHub Actions and CircleCI, facilitating seamless integration and deployment.
- Implemented data ingestion processes from various sources into a data lake, using Scala for efficient data parsing and validation.
- Implemented stateful stream processing with Flink, ensuring accurate and reliable data aggregation and analysis.
- Capable of developing custom processors and leveraging StreamSets' scripting capabilities to extend pipeline functionality and meet specific data processing requirements.
- Implemented data integration processes to load and process JSON data from external sources into JSONB columns.
- Worked with Oracle Enterprise Manager to monitor database performance and manage resources.
- Administered Oracle databases, including backup and recovery, performance tuning and space management.
- Implemented data storage solutions with Apache HBase, providing low-latency access to large datasets.
- Developed custom Prefect tasks to extend functionality and address specific data engineering needs.
- Designed custom data structures to solve specific business problems, ensuring optimal time and space complexity.
- Designed and implemented scalable data architecture solutions to support enterprise-level data management and analytics.
- Proficient in Core Java concepts including object-oriented programming, data structures and algorithms.
- Conducted data modeling and schema design in Snowflake, ensuring efficient data organization and performance.
- Automated data validation and reconciliation processes within PowerCenter to streamline data pipeline operations.
- Configured Selenium WebDriver, Unittest, Pytest, Robot and the pip tool, and created Selenium automation scripts in Python.
- Migrated data into the RV data pipeline using Databricks, Spark SQL and Scala.
- Expert in test case preparation, manual functional testing, defect management, regression and sanity testing, test plan building, test report generation, and test case review and maintenance.
- Designed schema evolution strategies in Iceberg to accommodate changing data requirements without downtime.
- Built and managed data pipelines using Dagster, enhancing data quality and processing efficiency.
- Maintained detailed documentation of data warehouse architecture, processes and best practices.
- Developed and implemented historical and incremental loads using Databricks & Delta Lake, run through ADF pipelines.
- Experienced in writing Spark applications in Scala and Python (PySpark).
- Utilized JSONB's capabilities to support flexible and dynamic data models in database design.
- Worked with RESTful APIs in Python to integrate external data sources into data pipelines (see the ingestion sketch at the end of this section).
- Experience building orchestration on Azure Data Factory for scheduling purposes.
- Revamped old architectures and migrated legacy systems' data to Hadoop.
- Wrote test automation scripts using Robot Framework.
- Strong knowledge of data warehousing concepts and methodologies (e.g., Kimball, Inmon).
- Competent in integrating data from various sources and systems into cohesive Erwin data models.
- Helped reduce infrastructure and maintenance costs by migrating to Hadoop; 20+ systems decommissioned.
- Used various AWS services including S3, EC2, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS and Kinesis.
- Coordinated with multiple vendors for platform building; the main vendors included Informatica, SAP and Hortonworks.
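Illustrative sketch only: a minimal Python helper for pulling a paginated REST source into a pipeline, matching the RESTful-API bullet above. The endpoint, pagination scheme, auth token and output file are assumed placeholders, not details from the engagement.

```python
# Minimal sketch of pulling an external REST source into a pipeline.
# Endpoint, auth header, pagination scheme and output file are illustrative placeholders.
import json

import requests


def fetch_records(url: str, token: str, page_size: int = 500):
    """Yield records from a paginated JSON API (hypothetical pagination scheme)."""
    page = 1
    while True:
        resp = requests.get(
            url,
            headers={"Authorization": f"Bearer {token}"},
            params={"page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        yield from batch
        page += 1


if __name__ == "__main__":
    # Write the pulled records as newline-delimited JSON for downstream Spark/Hive loads.
    with open("external_source.ndjson", "w", encoding="utf-8") as out:
        for record in fetch_records("https://api.example.com/v1/accounts", token="<redacted>"):
            out.write(json.dumps(record) + "\n")
```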
Module Lead (Data Engineering and Cloud team)
MindTree | Feb 2018 - Jun 2018
Highlights: Developed & implemented a user-friendly, UI-based file format conversion, saving license cost | Implemented NiFi architecture for data migration as part of the Hadoop & Cloud team
- Worked on NoSQL databases such as MongoDB and DocumentDB, and graph databases such as Neo4j.
- Developed generic Talend Cloud ETL frameworks.
- Involved in project design and development using Java, Scala, Go, Hadoop, Spring, Apache Spark, NiFi and Airflow technologies.
- Experience building distributed high-performance systems using Spark and Scala.
- Created AWS Glue crawlers for crawling source data in S3 and RDS.
- Utilized Apache Iceberg to manage large-scale tabular data in a data lake, enabling efficient data querying and management.
- Developed Tableau data visualizations using cross tabs, heat maps, bar charts, Gantt charts, waterfall charts, scatter plots, geographic maps, pie charts and donut charts.
- Documented DBT workflows and data models, providing clear and comprehensive guidelines for team members and stakeholders.
- Worked with Apache Airflow to schedule and monitor complex ETL workflows, ensuring data pipeline reliability (see the DAG sketch at the end of this section).
- Implemented data security best practices in Snowflake, including role-based access control and data encryption.
- Experience with cloud-based data warehousing solutions (e.g., AWS Redshift, Google BigQuery, Snowflake).
- Developed the overall test strategy and led testing of all impacted applications/infrastructure and post-production support activities.
- Experience building data pipelines for real-time streaming data and data analytics using Azure cloud components such as Azure Data Factory, HDInsight (Spark cluster), Azure ML Studio, Azure Stream Analytics, Azure Blob Storage, Microsoft SQL DB and Neo4j (graph DB).
- Wrote comprehensive documentation and provided training sessions for team members on best practices and advanced usage of Apache Flink.
- Created custom Python scripts for data migration, transformation and loading tasks.
- Collaborated with cross-functional teams to gather requirements and translate them into detailed Erwin data models.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Automated data workflows and scheduled jobs in Snowflake using Snowflake Tasks and Streams.
- Created multiple Apache NiFi custom templates with Spark to process data from different data sources and stored Spark's output in S3 buckets in Parquet format, implementing AWS Lambda and Step Functions to schedule the data flow.
- Implemented shell scripts to analyze Hive database audit logs in real time while handling a team of 4 members.
- Expertise in using Java, big data, HTML, DHTML, CSS, D3.js, JSON, Font Awesome, JavaScript, Bootstrap, Neo4j and KeyLines.
- Used PySpark jobs running on a Kubernetes cluster for faster data processing.
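Illustrative sketch only: a minimal Airflow DAG (Airflow 2.x style) of the kind referenced in the Airflow bullet above. The DAG id, schedule and task bodies are assumed placeholders.

```python
# Minimal sketch of an Airflow 2.x DAG with two dependent Python tasks.
# DAG id, schedule and task callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source system")


def load():
    print("write curated data to the warehouse")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Run the load only after the extract succeeds.
    extract_task >> load_task
```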
Senior Software Engineer (Cloud)
WiSilica | May 2017 - Nov 2017
Highlights: Reduced license cost by innovating & implementing an effective cloud-based architecture built on open-source tools | Implemented big data analytics solutions to effectively track and report IoT component status (tags, listeners, bridges, etc.)
- Created the end-to-end design of a centralized reporting system, including ingestion, cleansing, enrichment, processing, storage and visualization, based on AWS EC2 cloud servers.
- Created multiple Glue ETL jobs in Glue Studio, processed the data using different transformations and loaded it into S3, Redshift and RDS.
- Monitored and maintained Snowflake environments, performing regular performance tuning and troubleshooting.
- Provided guidance to the development team working on PySpark as the ETL platform.
- Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems (relational and unstructured) to meet business functional requirements.
- Applied efficient business logic using Spark (Java & Scala) to gather and analyze sensor data from the various sensors.
- Created a streaming pipeline using Kafka and Spark Streaming and showcased the Spark output through Apache Zeppelin charts.
- Architected a scalable VerneMQ message broker cluster in AWS with 5 nodes, creating each node with SSL/TLS authentication (see the MQTT publish sketch at the end of this section).
- Implemented load balancing using HAProxy & Amazon Elastic Load Balancing, performed load testing of the message broker cluster using MQTT-Malaria, MQTT-Bench & JMeter, and showcased real-time broker metrics through Prometheus & Grafana.
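Illustrative sketch only: publishing an IoT status message to an MQTT broker over TLS, matching the VerneMQ/MQTT work above. The broker host, topic, credentials and payload fields are assumed placeholders; the paho-mqtt client library is assumed.

```python
# Minimal sketch of publishing an IoT status reading to an MQTT broker over TLS.
# Host, port, topic, credentials and payload fields are illustrative placeholders.
import json

import paho.mqtt.client as mqtt

# paho-mqtt 1.x style constructor; 2.x additionally expects a CallbackAPIVersion argument.
client = mqtt.Client(client_id="tag-gateway-01")
client.username_pw_set("device", "secret")     # placeholder credentials
client.tls_set()                               # verify the broker certificate with the system CA bundle
client.connect("broker.example.com", 8883)     # placeholder broker, MQTT over TLS

payload = json.dumps({"tag_id": "T-1001", "rssi": -67, "battery": 0.82})
client.publish("iot/tags/status", payload, qos=1)
client.disconnect()
```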
Freelancer
Freelance | Nov 2016 - May 2017
Project Handled: Telecom Data Analytics
- Handled requirement analysis, coding, design, implementation, testing, problem analysis and resolution, and technical documentation.
- Managed technical design discussions and chose the right frameworks for the business solution.
- Oversaw developers through product design, planning, development, implementation, and product/system testing, acting as an interface between multiple teams.

Software Engineer
Athena Technology Solutions | Aug 2014 - Nov 2016
Project Handled: Ghar Value | Clients: ICICI Bank & HDFC Bank
Highlight: 2-time Best Performer Award (2015 & 2016)

Software Engineer
Avolot Technologies | Sep 2012 - Aug 2014
- Handled requirement analysis, coding, design, implementation, testing, problem analysis and resolution, and technical documentation.
- Managed technical design discussions and chose the right frameworks for the business solution.

PERSONAL DOSSIER
Date of Birth: 04-07-1991 | Nationality: Indian

PROJECTS ANNEXURE
Solution IT INC
Client: Fidelity Investments | Location: Westlake, Texas
Project Handled: SMAP-LRC Fraud Analytics
Project Deliverables:
- The project standardizes and processes alerts from various systems such as AML (Anti-Money Laundering), Surveillance, Trade systems, EFE, etc. These alerts are converted into a generic structure, making it easier to analyze and group alerts based on business logic. After alert grouping, the data is published to Kafka topics and used for alert scoring to determine whether fraud has occurred.
- Developed the generic alert processor in Python to handle alert conversion.
- Designed and implemented migration strategies for traditional systems on Azure (lift and shift / Azure Migrate).
- Defined alert type conversion as a single Python function: it handles incoming alerts in XML format, converts them to JSON and pushes them to the alert processor function (see the conversion sketch at the end of this project).
- Implemented database solutions in Azure SQL.
- Experience in writing distributed Scala code for efficient big data processing.
- Developed generic functions to handle various scenarios, where the input is Kafka, Snowflake, MongoDB or a file, and the converted output is written to Kafka, Snowflake or a file.
- Developed reconciliation, log handling and alert grouping based on business logic as part of this alert processor.
- Built the Python code into a deployable Docker container.
- Designed, set up and maintained Azure SQL, Azure Analysis Services and Azure Data Factory.
- Defined a YAML config file to hold all parameters and configurations.
- Interacted with different teams to obtain alert samples, understand the alerts and the data dictionary, and map the alerts to the new standard format.
- Worked with the architecture team to document the SMAP flows, code process and any bottlenecks/changes in Confluence.
Environment: Python, Jenkins, Docker, Snowflake, AWS EC2, Bitbucket, JIRA, Confluence, PyCharm, MongoDB, Kafka, Datadog
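Illustrative sketch only: the general shape of the alert-type conversion described above, mapping one XML alert onto a generic JSON structure and publishing it to a Kafka topic. Field names, the topic and the broker address are assumed placeholders rather than the actual SMAP-LRC schema; the kafka-python client is assumed.

```python
# Minimal sketch of an alert-type conversion step: XML in, generic JSON out, pushed to Kafka.
# Field names, topic and broker are illustrative placeholders, not the real SMAP-LRC schema.
import json
import xml.etree.ElementTree as ET

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",                        # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def convert_alert(xml_payload: str) -> dict:
    """Map one source-system XML alert onto a generic alert structure."""
    root = ET.fromstring(xml_payload)
    return {
        "alert_id": root.findtext("AlertId"),
        "source_system": root.findtext("Source"),
        "alert_type": root.findtext("Type"),
        "raw": xml_payload,                                   # keep the original for reconciliation
    }


def process_alert(xml_payload: str) -> None:
    """Convert a single alert and publish it to the generic alerts topic."""
    generic = convert_alert(xml_payload)
    producer.send("alerts.generic", generic)                  # placeholder topic


if __name__ == "__main__":
    sample = "<Alert><AlertId>A-42</AlertId><Source>AML</Source><Type>structuring</Type></Alert>"
    process_alert(sample)
    producer.flush()
```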
Client: Johnson & Johnson | Location: Remote
Project Handled: DPS API Management
Project Deliverables:
- The DPS API management project is a gateway tool that extracts data from various robots and loads the data into AWS for data processing and analysis.
- Developed numerous Python functions for API data handling as part of this project.
- Created Robot Framework test cases that validate API data.
- Created a Datadog dashboard for API host monitoring.
- Created multiple custom test cases for API response validation.
- Implemented Apache Airflow workflows for automation.
- Handled data migration with Databricks.
Environment: Python, Robot Framework, Jenkins

D4 Insight Tech
Client: Emirates NBD | Location: Dubai, UAE
Project Handled: Finacle Core Consolidation
Project Deliverables:
- As part of the Data Engineering team, handled data migrations, data modelling, raw data vault & transformation data vault structures, workflow scheduling, automation and support.
- Downloaded BigQuery data into pandas or Spark data frames for advanced ETL capabilities (see the BigQuery sketch at the end of this client's projects).
- Truncated & loaded 1 TB of data daily into the Hadoop system, deploying different Oozie scripts per source system (Calypso, ALM, AML, OGL, Advent, Extracts & Reports).
- Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
- Managed all ETL operations through Informatica BDM and handled the BDM workflows as well as email alerts for success/failure notifications using Oozie.
- Used the Cloud Shell SDK in GCP to configure services: Dataproc, Cloud Storage and BigQuery.
- Maintained two layers in the Hive database (RDV/TDV) to manage the source data and the ETL output data.
- Shared Control-M & SFTP extracts with downstream systems and created Enzone in HDFS to handle customer-related details.
- Created the necessary masking rules using Atlas on top of the data to hide customer details from other users.
- Integrated SAS, SAP HANA and MSBI connectors with the Hadoop platform to perform various analyses on top of the data.
- Handled the push-down of all migrated data to SAP HANA for FSDM modelling and report dashboards.
- Prepared the necessary documentation and conducted KT sessions to hand over the production jobs to the support team.

Project Handled: Legacy Applications Archival Phase 2 | Location: Dubai, UAE
Project Deliverables:
- Migrated multiple legacy source system applications into the Hadoop environment, developing Sqoop scripts to import data from Oracle, SQL Server, Sybase & AS/400 systems.
- Loaded data into dev, staging & production environments and created data masking rules to hide customer confidential details in the Hadoop environment using Apache Atlas.
- Created user policies for accessing production data using Apache Ranger, configured Active Directory-based authentication in the Hadoop cluster and took snapshots for data backup.
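Illustrative sketch only: pulling BigQuery data into a pandas frame, as mentioned in the Finacle Core Consolidation deliverables above. The project, dataset and table names are assumed placeholders; the google-cloud-bigquery client with pandas support and application-default credentials are assumed.

```python
# Minimal sketch of pulling BigQuery query results into pandas for ETL checks.
# Project, dataset and table names are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")           # placeholder project

query = """
    SELECT account_id, balance, as_of_date
    FROM `example-project.finance.daily_balances`              -- placeholder table
    WHERE as_of_date = CURRENT_DATE()
"""

# Requires the client's pandas extras (e.g. db-dtypes) to be installed.
df = client.query(query).to_dataframe()
print(df.head())
```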
FreeLance
Project Handled: Telecom Data Analytics
Project Deliverables:
- Developed Hadoop applications to solve problems related to the telecom network.
- Collected logs from various client servers (Huawei, Nokia and Cisco networks) using Apache NiFi and Apache Kafka.
- Performed cluster migration from Cloudera 5.8 to Cloudera 5.9 and added/removed nodes in an existing Hadoop cluster.
- Created an Impala JDBC pipeline in Java to retrieve data in real time, using Kudu for updates and inserts into Impala tables.
- Performed XML parsing to convert XML data into a structured format using Spark.

Athena Technology Solutions
Project Handled: Ghar Value | Clients: ICICI Bank & HDFC Bank | Location: Chennai, India
Project Deliverables:
- Developed a home price prediction platform from scratch, working closely with data scientists to develop the data models.
- Conducted extensive research to evaluate new open-source tools for the project while supporting project demos & go-live.
- Developed Hadoop applications to solve problems in the real estate domain / banking domain (loan process) while leading a team of 5 members, providing technical support and training freshers in the Hadoop domain.
- Scraped data from real estate websites using the C# WatiN framework and wrote Pig (version 0.12) scripts to transform raw data from several data sources into baseline data.
- Managed partitioning and bucketing in Hive and designed external tables in Hive to optimize performance.
- Developed Hive (version 1.1.0) scripts for various database and table creations, plus UDFs in Java used from the Pig scripts.
- Handled ORC, CSV and TSV file formats and defined shell scripts to automate the process.
- Created a linear regression model using R and Python for price prediction and developed Oozie workflows to automate the tasks.
- Handled database JDBC connectivity, configured web servers (Apache Tomcat), developed a RESTful web service application to fetch values from the UI to Hive in JSON response format, and created a cron job for data backup.
- Converted R code into a web service using the OpenCPU R library with a JSON response format, defined shell scripts to dump data from MySQL to HDFS, and used Cloudera Manager for monitoring and managing the Hadoop cluster.

Avolot Technologies
Location: Chennai, India
Project Deliverables:
- Developed & implemented a user-friendly, UI-based file format conversion, saving license cost.
- Implemented NiFi architecture for data migration as part of the data engineering team.
- Created multiple Apache NiFi custom templates with Spark to process data from different data sources and stored Spark's output in S3 buckets in Parquet format, implementing AWS Lambda and Step Functions to schedule the data flow.
- Implemented shell scripts to analyze Hive database audit logs in real time while handling a team of 4 members.
- Normalized the data according to business needs, including data cleansing, data type changes and various transformations, using Spark and Python (see the sketch below).
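Illustrative sketch only: a minimal PySpark cleansing/normalization step of the kind described in the last bullet above. The input path, column names and target types are assumed placeholders.

```python
# Minimal sketch of a Spark-based cleansing/normalization pass.
# Input path, column names and target types are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, trim

spark = SparkSession.builder.appName("normalize-source-feed").getOrCreate()

raw = spark.read.option("header", True).csv("/data/raw/source_feed/")   # placeholder input

clean = (
    raw.dropDuplicates()
    .withColumn("customer_name", trim(col("customer_name")))            # strip stray whitespace
    .withColumn("amount", col("amount").cast("decimal(18,2)"))          # fix the data type
    .withColumn("txn_date", to_date(col("txn_date"), "yyyy-MM-dd"))     # normalize dates
    .na.drop(subset=["customer_id"])                                    # drop rows missing the key
)

clean.write.mode("overwrite").parquet("/data/curated/source_feed/")     # placeholder output
```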