
Harish Nagamalla - Sr. Data Engineer
[email protected]
Location: Plano, Texas, USA
Relocation: Yes
Visa: GC
Phone: 972-646-6279
SUMMARY
Big Data professional with 9 years of combined experience in the fields of Data Applications, Big Data implementations and Java/J2EE technologies.
4 years of experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, with knowledge of Spark MLlib.
Designed and implemented scalable and reliable solutions on AWS, utilizing managed compute, storage, and application hosting services to meet business requirements.
Extensive knowledge of developing Spark Streaming jobs by building RDDs (Resilient Distributed Datasets) using Scala, PySpark, and the Spark shell.
Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
Conducted data migration and integration on AWS, transferring data from on-premises systems to the cloud or between AWS regions.
Developed and maintained infrastructure as code using Azure Resource Manager (ARM) templates or other automation tools, ensuring consistency and repeatability.
Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations before storing the data into HDFS.
Conducted cost optimization on AWS, monitoring usage patterns and adjusting resources for optimal cost efficiency.
Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.
Implemented security measures on Azure, such as Azure Active Directory, role-based access control (RBAC), and network security groups, to ensure data protection.
Good understanding of data Modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instance.
Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
Stayed updated with the latest AWS services, features, and best practices, actively exploring new capabilities to optimize cloud architecture and enhance applications.
Conducted cloud resource provisioning and management on Azure, including setting up virtual networks, storage accounts, and security configurations.
Hands-on experience in SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
Implemented continuous integration and continuous deployment (CI/CD) pipelines using AWS CodePipeline and AWS CodeDeploy, automating the release process.
Strong experience in working with UNIX/LINUX environments, writing shell scripts.
Excellent knowledge of J2EE architecture, design patterns, and object modeling using various J2EE technologies and frameworks, with comprehensive experience in web-based applications built on J2EE frameworks such as Spring, Hibernate, Struts, and JMS.
Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
Collaborated with cross-functional teams to design and implement disaster recovery strategies using AWS services like Amazon CloudWatch and AWS Backup.
Experienced in the software development lifecycle (SDLC), Agile and Waterfall methodologies, Jira, and data governance.
Strong analytical, data structures, presentation, feedback, collaboration, written communication, and problem-solving skills, with the ability to work independently or in a team while following the best practices and principles defined for the team.
TECHNICAL SKILLS
Operating Systems: UNIX, Linux, Windows, MVS z/OS
Cloud Platforms: AWS, Azure
Programming Languages: Java, R, Python, Scala, Node.js, JavaScript
Frameworks: Apache Spark, MapReduce, Mahout, Apache Lucene, J2EE
Databases: HDFS, MySQL, Oracle, SQL, HBase, YARN, Spark
Query Languages: Hive 0.9, Pig, Sqoop 1.4.4, Spark SQL
Streaming: Flume 1.6, Spark Streaming, Streaming Analytics
Marketing Tools: SAS, Tableau, Platfora, ELOQUA (SFDC), UNICA, BLUEKAI & GainSight (SFDC), Talend
Messaging Frameworks: Kafka using Event Hub
Orchestration Frameworks: Airflow
Distributions / Data Management: MapR, Cloudera, Hortonworks, Apache Sqoop, Cassandra
Reporting Platforms / Data Visualization: Tableau, Power BI, Platfora
Data Warehousing Platforms: Azure Data Warehouse, EDW
Education Details:
Bachelor's degree, SRM University, May 2013

Work Experience:
________________________________________

Walmart | CA, United States  Dec 2020 – Present
Sr. Data Engineer
Designed, developed, and maintained scalable and efficient data pipelines using EMR to ingest, transform, and load data from various sources.
Extensively used PySpark to improve the performance and optimization of existing algorithms/queries in Hadoop and Hive using Spark Context, Spark SQL (DataFrames and Datasets), and pair RDDs (see the sketch following this section).
Worked on developing, maintaining, and enhancing an ETL process on AWS components including EC2, S3, Lambda, RDS, DynamoDB, Redshift, and Spark.
Hands-on experience handling different file formats like JSON, AVRO, ORC and Parquet.
Developed serverless application using AWS Lambda, Step Functions, and AWS Elastic Beanstalk, building efficient and scalable applications.
Developed ETL pipelines across the AWS data ecosystem to analyze affluent customers' behavior.
Conducted monitoring and performance optimization on AWS Cloud, utilizing services like Amazon CloudWatch to ensure optimal application performance.
Extensively used ETL methodology to support data extraction, transformation, and loading in a corporate-wide ETL solution using SAP BW, with strong knowledge of OLAP, OLTP, extended star, star, and snowflake schema methodologies.
Implemented continuous integration and continuous deployment (CI/CD) pipelines, automating application deployments and releases using DevOps tooling.
Solid experience extracting data from multiple internal and external data sources using Python and SSIS.
Worked alongside data scientists to create an audience-targeting platform for campaign management; deployed a logistic regression model using Docker and Kubernetes on AWS.
Designed Cassandra and Oracle data model for microservices components. Analyzed partitioning and clustering keys for data models used across components.
Containerized and deployed ML models using Docker and Kubernetes to support various end to end apps.
Deployed and maintained automated CI/CD pipelines for code deployment using Jenkins.
Developed software to automate and monitor systems and services across the cloud using Python.
Maintained reporting data marts on RDBMS and Hive. Oversaw data refresh, automation, Schema changes, data validation and upload errors.
Migrated the team's recurring dashboards to Tableau, reducing manual effort by 80 hours per week.
Extensively worked on data integration with Adobe Analytics, Mailchimp, Buffer, and other marketing tools.
Integrated PostgreSQL as a data warehousing solution within the AWS ecosystem, leveraging its robust features for storage, data retrieval, and analysis.
Implemented complex data transformations using PostgreSQL's powerful SQL capabilities, allowing seamless manipulation and enrichment of data before loading it into analytical systems.
Designed efficient data loading strategies, utilizing PostgreSQL's bulk insert techniques, ensuring fast and optimized loading of large datasets into the database.
Environment: PySpark, Databricks, AWS Cloud, Airflow, Python, ML models, H2O.
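
The PySpark/Spark SQL bullet above refers to this kind of DataFrame work; the following is a minimal, illustrative sketch only, with hypothetical column names and S3 paths rather than the actual Walmart pipeline code.

```python
# Minimal PySpark sketch: DataFrame and Spark SQL aggregation (illustrative only).
# Column names and S3 paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-spend-etl").getOrCreate()

# Read raw transactions (Parquet) from a hypothetical S3 location.
transactions = spark.read.parquet("s3://example-bucket/raw/transactions/")

# Select only the needed columns early, then aggregate spend per customer per month.
monthly_spend = (
    transactions
    .select("customer_id", "txn_amount", "txn_date")
    .withColumn("txn_month", F.date_format("txn_date", "yyyy-MM"))
    .groupBy("customer_id", "txn_month")
    .agg(F.sum("txn_amount").alias("total_spend"),
         F.count("*").alias("txn_count"))
)

# The same aggregation expressed through Spark SQL on a temporary view.
transactions.createOrReplaceTempView("transactions")
monthly_spend_sql = spark.sql("""
    SELECT customer_id,
           date_format(txn_date, 'yyyy-MM') AS txn_month,
           SUM(txn_amount) AS total_spend,
           COUNT(*)        AS txn_count
    FROM transactions
    GROUP BY customer_id, date_format(txn_date, 'yyyy-MM')
""")

# Write the result back to S3, partitioned by month for downstream consumers.
monthly_spend.write.mode("overwrite").partitionBy("txn_month").parquet(
    "s3://example-bucket/curated/monthly_spend/")
```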


AT&T | TX, United States  Jan 2020 – Nov 2020
Sr. Data Engineer
Worked on developing, maintaining, and enhancing an ETL process on Amazon Web Services using EC2, EMR, S3, Lambda, Hive, Scala and PySpark to manage large data sets.
Maintained ETL pipelines using Python and Scala.
Hands on experience with data ingestion tools like Sqoop, Kafka, Flume and Oozie.
Hands on experience handling different file formats like JSON, AVRO, ORC and Parquet.
Engineered unstructured e-commerce product comment data for sentiment analysis using Spark; automated workflows and provided an end-to-end engineering framework for other data scientists to consume, using Python and PySpark on AWS for migration.
Developed ETL pipelines across the AWS data stack on HDFS, S3, and Redshift to analyze affluent customers' behavior.
Experience with Requests, Report Lab, NumPy, SciPy, Pytables, cv2, imageio, Python-Twitter, Matplotlib, HTTPLib2, Urllib2, Beautiful Soup, Data Frame and Pandas python libraries during development lifecycle.
Designed and deployed multiple microservices using Spring Boot, hibernate and Oracle.
Developed Python libraries to use for ETL, data analysis and data science transformations.
Extensively used ETL methodology to support data extraction, transformation, and loading in a corporate-wide ETL solution using SAP BW, with strong knowledge of OLAP, OLTP, extended star, star, and snowflake schema methodologies.
Solid experience extracting data from multiple internal and external data sources using Python and SSIS.
Worked alongside data scientists to create an audience-targeting platform for campaign management; deployed a logistic regression model using Docker and Kubernetes on AWS.
Monitored AWS Athena query performance and identified opportunities for optimization, such as data partitioning, data compression, and query tuning.
Automated data engineering tasks in AWS Athena using Python scripts, AWS Lambda functions, and AWS Step Functions, reducing manual effort and increasing efficiency (see the sketch following this section).
Designed Cassandra and Oracle data model for microservices components. Analyzed partitioning and clustering keys for data models used across components.
Containerized and deployed ML models using Docker and Kubernetes to support various end to end apps.
Developed Controller Classes using Spring MVC, Spring AOP, Spring Boot, Spring Batch, Spring Data modules and handled security using Spring Security.
Implemented Restful Services with Spring Boot and Micro Service Architecture.
Developed RESTful web services to retrieve JSON documents related to customer data.
Deployed and maintained automated CI/CD pipelines for code deployment using Jenkins.
Developed software to automate and monitor systems and services across the cloud using Python.
Served as a hands-on subject matter expert for DevOps and Automation in an AWS infrastructure environment.
Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, and replication.
Migrated an RDBMS data mart to MongoDB to support frequent data schema change requests.
Migrated the team's recurring dashboards to Tableau, reducing manual effort by 80 hours per week.
Extensively worked on data integration with Adobe Analytics, Mailchimp, Buffer, and other marketing tools.
Environment: AWS Services: S3, EMR, EC2, Step Functions, Glue, Athena, Lambda, CloudWatch, RDS, VPC, Subnets.
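
The Athena automation bullet above refers to this pattern; below is a hedged Python/boto3 sketch of a Lambda handler that kicks off an Athena maintenance query. The database, table, and bucket names are hypothetical placeholders, not the actual AT&T resources.

```python
# Illustrative AWS Lambda handler that automates an Athena maintenance query
# (for example, refreshing partitions after new data lands in S3).
# Database, table, and bucket names are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

DATABASE = "analytics_db"                          # hypothetical Athena/Glue database
OUTPUT_LOCATION = "s3://example-athena-results/"   # hypothetical query results bucket

def lambda_handler(event, context):
    """Triggered, for example, by an S3 event or a Step Functions task."""
    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE clickstream_events",  # hypothetical table
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    # Return the execution id so a Step Functions state machine can poll
    # get_query_execution until the query succeeds or fails.
    return {"query_execution_id": response["QueryExecutionId"]}
```
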
Corteva | IL, United States  Jan 2018 – Dec 2019
Data Engineer
Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
Implemented a proof of concept deploying this product in AWS S3 bucket and Snowflake.
Utilize AWS services with focus on big data architect /analytics / enterprise Data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, performance, and to provide meaningful and valuable information for better decision-making.
Developed Python scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing results back into the S3 bucket.
Experience in data cleansing and data mining.
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided to the data scientists for further analysis.
Prepared scripts to automate the ingestion process using Python and Scala as needed, from sources such as APIs, AWS S3, Teradata, and Snowflake.
Designed and Developed Spark workflows using Scala for data pull from AWS S3 bucket and Snowflake applying transformations on it.
Implemented Spark RDD transformations to Map business analysis and apply actions on top of transformations.
Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake.
Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB table or S3 bucket or to HTTP requests using Amazon API gateway.
Migrated data from an AWS S3 bucket to Snowflake by writing a custom read/write Snowflake utility function using Scala.
Worked on Snowflake schemas and data warehousing; processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake Confidential AWS S3 bucket.
Profiled structured, unstructured, and semi-structured data across various sources to identify patterns and implemented data quality metrics using queries or Python scripts depending on the source.
Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
Created DAGs using the Email, Bash, and Spark Livy operators to execute jobs on an EC2 instance (see the sketch following this section).
Deploy the code to EMR via CI/CD using Jenkins.
Extensively used Code cloud for code check-in and checkouts for version control.
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, PySpark, Airflow, JSON, Parquet, CSV, Codecloud, AWS.
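
The Airflow bullets above mention DAGs built from the Email, Bash, and Spark Livy operators; below is a minimal sketch of such a DAG using the Bash and Email operators (a Livy operator from the Apache Livy provider would slot in similarly). The DAG id, schedule, command, and address are hypothetical.

```python
# Illustrative Airflow DAG using the Bash and Email operators.
# DAG id, schedule, bash command, and email address are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="s3_to_snowflake_daily",      # hypothetical DAG id
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:

    # Run the ingestion/transformation script on the worker or a remote host.
    run_ingest = BashOperator(
        task_id="run_ingest_script",
        bash_command="python /opt/jobs/s3_to_snowflake.py --run-date {{ ds }}",
    )

    # Notify the team once the daily load completes.
    notify = EmailOperator(
        task_id="notify_success",
        to="data-team@example.com",
        subject="S3 to Snowflake daily load complete for {{ ds }}",
        html_content="The daily load finished successfully.",
    )

    run_ingest >> notify
```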

ServiceNow | Bangalore, IN  Feb 2014 – Nov 2017
Data Engineer

Responsibilities:
Functioned as a Data Engineer responsible for data modeling, data migration, design, and preparation of ETL pipelines for both the cloud and on-premises Exadata.
Good knowledge of Apache Spark components including Spark Core, Spark SQL, Spark Streaming, and Spark MLlib.
Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities like Apache Spark written in Scala.
Developed spark applications for performing large scale transformations and denormalization of relational datasets.
Gained real-time experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
Created reports for the BI team using Sqoop to export data into HDFS and Hive.
Performed analysis on unused user navigation data by loading it into HDFS and writing MapReduce jobs; the analysis provided inputs to the new APM front-end developers and the Lucent team.
Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
Created Hive queries to process large sets of structured, semi-structured, and unstructured data and store them in managed and external tables.
Developed complex HiveQL queries using the JSON SerDe.
Created HBase tables to load large sets of structured data.
Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
Performed Real-time event processing of data from multiple servers in the organization using Apache Storm by integrating with Apache Kafka.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Worked on PySpark APIs for data transformations.
Performed data ingestion to Hadoop (Sqoop imports) and carried out validations and consolidations on the imported data.
Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
Worked on Java technologies such as Hibernate, Spring, JSP, and Servlets, developing both server-side and client-side code for the web application.
Worked on AWS Glue, AWS EMR, AWS S3 as part of EDS transformation.
Hands-on experience in Big data analytics using Hadoop platform using HDFS, Sqoop, Flume, MapReduce, Spark, Scala.
Developed Spark applications using Spark SQL in EMR for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch following this section).
Responsible for estimating the cluster size, monitoring, and troubleshooting of the EMR cluster.
Created roles and access-level privileges and handled Snowflake administration activities end to end.
Converted 230 view queries from SQL Server to Snowflake compatibility.
Utilized GitHub for version control and leveraged it for continuous integration services.


Environment: AWS, JMeter, Kafka, Ansible, Jenkins, Docker, Maven, Linux, Red Hat, Git, CloudWatch, Python, Shell Scripting, Golang, WebSphere, Splunk, Tomcat, SoapUI, Kubernetes, PySpark, Terraform, PowerShell.
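
The EMR Spark SQL bullet above refers to multi-format extraction and aggregation; the snippet below is an illustrative PySpark sketch of that pattern (JSON and ORC inputs joined and aggregated with Spark SQL). Paths, schemas, and column names are hypothetical.

```python
# Illustrative PySpark job: extract from multiple file formats, denormalize,
# and aggregate with Spark SQL. Paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-pattern-etl").getOrCreate()

# Ingest two hypothetical sources stored in different formats.
events = spark.read.json("s3://example-bucket/raw/usage_events/")   # JSON event logs
users = spark.read.orc("s3://example-bucket/dim/users/")            # ORC dimension table

events.createOrReplaceTempView("events")
users.createOrReplaceTempView("users")

# Join (denormalize) and aggregate usage per plan with Spark SQL.
usage_by_plan = spark.sql("""
    SELECT u.plan_type,
           COUNT(DISTINCT e.user_id) AS active_users,
           SUM(e.duration_sec)       AS total_duration_sec
    FROM events e
    JOIN users u ON e.user_id = u.user_id
    GROUP BY u.plan_type
""")

# Persist the aggregate in a columnar format for downstream analysis.
usage_by_plan.write.mode("overwrite").parquet(
    "s3://example-bucket/curated/usage_by_plan/")
```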


Interglobal Enterprises | India  May 2013 – Jan 2014
Data Engineer

Responsibilities:
Point person for data extraction, transformation and loading for numerous modeling efforts.
Increased accuracy of e-commerce order attribution by 35% by implementing a true last-click model.
Collaborated with Marketing heads to implement search bidding algorithm.
Leveraged Azure Functions for serverless computing, designing event-driven applications that respond dynamically to triggers.
Collaborated with cross-functional teams to design and implement disaster recovery strategies using Azure Site Recovery and Azure Backup.
Developed several marketing and product segments for various product divisions.
Worked closely with product leads to plan quarterly OKRs, define and implement new KPIs.
Led analytics work to increase monthly active users, decrease churn, and improve average spend.
Stayed updated with the latest Azure services, features, and best practices, actively exploring new capabilities to optimize cloud architecture and enhance applications.
Engineered solutions to flag seller-biased product reviews using BERT and NLP packages.
Made forecast recommendations to ensure inventory of >108k items in the warehouse using R.
Managed and optimized Azure databases using services like Azure SQL Database, Cosmos DB, or Azure Database for PostgreSQL to ensure data integrity and performance.
Implemented IoT solutions using Azure IoT Hub and Azure IoT Central to collect, analyze, and visualize data from connected devices.
Increased MAC by 3% by targeting at-risk customers using a stacked ensemble churn model in R.
Modeled growth drivers, conducted cost/benefit analysis, and executed a campaign to generate 10M, using R.
Modeled affluent customers' behavior using internal and external census data with ensemble, KNN, and SVM techniques; suggested a sweepstakes campaign to generate 20M IGMV.
Improved Q3 forecast accuracy to 95% from initial 70% accuracy using time series analysis in R.
Led initiatives to build an ML repository for the team, focused on the company's product lines.
Participated in workshops on A/B testing, reporting, KPI definition, and behavioral segmentation.
Provided analytical support for a business unit; activities included ad-hoc, root-cause, trend, and ETL analysis.
Designed and maintained several marketing and product dashboards using Tableau and Hive.

Environment: Informatica PowerCenter, Azure, Oracle, UNIX Shell Scripting, Autosys.