
Harish - Sr. Data Engineer
Email: [email protected]
Phone: +14698440487
Location: Dallas, Texas, USA
Relocation: Yes
Visa: H1B
SUMMARY
Big Data professional with 8+ years of combined experience in the fields of Data Applications, Big Data implementations and Java/J2EE technologies.
4 years of experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
Strong exposure to Big Data technologies and the Hadoop ecosystem, with in-depth understanding of MapReduce and Hadoop infrastructure.
Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark and Hive.
Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames and RDDs, and knowledge of Spark MLlib.
Designed and implemented scalable and reliable solutions on Microsoft Azure, utilizing services like Azure Virtual Machines, Azure Storage, and Azure App Service to meet business requirements.
Extensive knowledge of developing Spark Streaming jobs and building RDDs (Resilient Distributed Datasets) using Scala, PySpark and the Spark shell.
Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
Conducted data migration and integration on Azure, transferring data from on-premises systems to the cloud or between Azure regions.
Developed and maintained infrastructure as code using Azure Resource Manager (ARM) templates or other automation tools, ensuring consistency and repeatability.
Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations before storing the data into HDFS.
Conducted cost optimization on AWS, monitoring usage patterns and making adjustments to resources for optimal cost efficiency.
Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured and unstructured data and storing them in HDFS (a minimal Hadoop Streaming sketch appears after this summary).
Implemented security measures on Azure, such as Azure Active Directory, role-based access control (RBAC), and network security groups, to ensure data protection.
Good understanding of data Modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift and EC2 for data processing.
Stayed updated with the latest AWS services, features, and best practices, actively exploring new capabilities to optimize cloud architecture and enhance applications.
Conducted cloud resource provisioning and management on Azure, including setting up virtual networks, storage accounts, and security configurations.
Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra and MongoDB.
Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
Conducted continuous integration and continuous deployment (CI/CD) pipelines using AWS CodePipeline and AWS CodeDeploy, automating the release process.
Strong experience in working with UNIX/LINUX environments, writing shell scripts.
Excellent knowledge of J2EE architecture, design patterns and object modeling using various J2EE technologies, with comprehensive experience in web-based applications using J2EE frameworks like Spring, Hibernate, Struts and JMS.
Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
Collaborated with cross-functional teams to design and implement disaster recovery strategies using AWS services like Amazon CloudWatch and AWS Backup.
Experienced in working in SDLC, Agile and Waterfall Methodologies.
Strong analytical, presentation, communication and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
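To illustrate the Python MapReduce work referenced above, here is a minimal Hadoop Streaming sketch; the tab-delimited layout, field position and script names are illustrative assumptions, not artifacts of any engagement listed below.

```python
# Hadoop Streaming sketch: count records per event type in tab-delimited input.
# The field layout (event type in the third column) is a hypothetical example.
import sys


def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            # Emit <event_type>\t1 for the reducer to aggregate.
            print(f"{fields[2]}\t1")


def reducer():
    # Hadoop Streaming delivers mapper output sorted by key,
    # so a simple running total per key is enough.
    current_key, current_count = None, 0
    for line in sys.stdin:
        key, count = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{current_count}")
            current_key, current_count = key, 0
        current_count += int(count)
    if current_key is not None:
        print(f"{current_key}\t{current_count}")


if __name__ == "__main__":
    # In practice these run as two separate scripts passed to hadoop-streaming
    # (-mapper / -reducer); here a flag selects the role for illustration:
    #   python mr_counts.py map   |   python mr_counts.py reduce
    mapper() if sys.argv[-1] == "map" else reducer()
```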
TECHNICAL SKILLS
Operating Systems: UNIX, Linux, Windows, MVS z/OS
Cloud Platforms: AWS, Azure
Programming Languages: Java, R, Python, Scala
Frameworks: Apache Spark, MapReduce, Mahout, Apache Lucene, J2EE
Databases: HDFS, MySQL, Oracle, SQL, HBase, YARN, Spark
Query Languages: Hive 0.9, Pig, Sqoop 1.4.4, Spark SQL
Streaming: Flume 1.6, Spark Streaming, Streaming Analytics
Marketing Tools: SAS, Tableau, Platfora, Eloqua (SFDC), Unica, BlueKai, GainSight (SFDC), Talend
Messaging Frameworks: Kafka (with Azure Event Hubs)
Orchestration Frameworks: Airflow
Distributions: MapR, Cloudera, Hortonworks
Reporting Platforms: Tableau, Power BI, Platfora
Data Warehousing Platforms: Azure Data Warehouse, EDW
Education Details:
Bachelor's in Electronics and Communication Engineering, SRM University, May 2013

Client: Walmart, CA Dec 2020 - Till Date
Role: Sr. Data Engineer
Responsibilities:
Extensively used Scala and Spark to improve the performance and optimization of existing algorithms/queries in Hadoop and Hive using SparkContext, Spark SQL (DataFrames and Datasets) and Pair RDDs.
Worked on development, maintenance and enhancement of an ETL process on GCP components including Dataproc, Workflows, BigQuery, Spanner, Hive and Spark.
Hands on experience with data ingestion tools like Sqoop, Kafka, Flume and Oozie.
Hands on experience handling different file formats like JSON, AVRO, ORC and Parquet.
Engineered unstructured e-commerce product comment data for sentiment analysis using Spark; automated the workflows and provided an end-to-end engineering framework for other data scientists to consume (see the PySpark sketch after this section).
Conducted serverless application development using Azure Functions, Logic Apps, and Azure App Service, building efficient and scalable applications.
Developed ETL pipelines across the GCP data stack to analyze affluent customers' behavior.
Redesigned a critical ETL pipeline handling 15 terabytes of data on Hive.
Migrated 120 on-prem scripts to GCP.
Conducted monitoring and performance optimization on Azure, utilizing services like Azure Monitor and Application Insights to ensure optimal application performance.
Extensively used ETL methodology for supporting data extraction, transformation and loading in a corporate-wide ETL solution using SAP BW, with strong knowledge of OLAP, OLTP, Extended Star, Star and Snowflake schema methodologies.
Conducted continuous integration and continuous deployment (CI/CD) pipelines using Azure DevOps or other tools, automating application deployments and releases.
Solid experience with extracting data from multiple internal and external data sources using Python and SSIS.
Worked alongside data scientists and created an audience-targeting platform for campaign management; deployed a logistic regression model using Docker and Kubernetes on GCP.
Designed Cassandra and Oracle data model for microservices components. Analyzed partitioning and clustering keys for data models used across components.
Containerized and deployed ML models using Docker and Kubernetes to support various end to end apps.
Deployed and maintained automated CI/CD pipelines for code deployment using Jenkins.
Built and deployed containers to move from a monolithic architecture to a microservices architecture.
Developed software to automate and monitor systems and services across the cloud using Python.
Maintained reporting data marts on RDBMS and Hive. Oversaw data refresh, automation, Schema changes, data validation and upload errors.
Migrated the team's recurring dashboards to Tableau, thereby reducing manual effort by 80 hours/week.
Extensively worked on data integration with Adobe Analytics, Mailchimp, Buffer and other marketing tools.
Environment: PySpark, Scala, Databricks, Azure, Airflow, Python, ML models, H2O.
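As a companion to the sentiment-analysis bullet above, the following is a minimal PySpark sketch of the comment-preparation step; the bucket paths, column names and cleaning rules are hypothetical placeholders, not the actual pipeline.

```python
# Minimal PySpark sketch: clean unstructured product comments for sentiment analysis.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("product-comment-prep").getOrCreate()

# Read raw, semi-structured product comments (path and schema are placeholders).
raw = spark.read.json("gs://example-bucket/raw/product_comments/")

# Normalize the free-text field so downstream sentiment models see clean tokens.
clean = (
    raw.filter(F.col("comment_text").isNotNull())
       .withColumn("comment_text", F.lower(F.col("comment_text")))
       .withColumn("comment_text", F.regexp_replace("comment_text", r"[^a-z0-9\s]", " "))
       .withColumn("comment_text", F.trim(F.regexp_replace("comment_text", r"\s+", " ")))
       .select("product_id", "comment_text")
)

# Persist as Parquet for the data-science consumers of the framework.
clean.write.mode("overwrite").parquet("gs://example-bucket/curated/product_comments/")
```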


Client: AT&T, FL Jan 2018 - Nov 2020
Role: Sr. Data Engineer
Responsibilities:
Worked on development, maintenance and enhancement of an ETL process on Amazon Web Services using EC2, EMR, S3, Lambda, Hive, Python and Spark to manage large data sets.
Maintained ETL pipelines using Python and Scala.
Hands on experience with data ingestion tools like Sqoop, Kafka, Flume and Oozie.
Hands on experience handling different file formats like JSON, AVRO, ORC and Parquet.
Engineered unstructured e-commerce product comment data for sentiment analysis using Spark; automated the workflows and provided an end-to-end engineering framework for other data scientists to consume, using Python and PySpark on AWS for the migration.
Developed ETL pipelines across the AWS data stack on HDFS, S3 and Redshift to analyze affluent customers' behavior (see the sketch after this section).
Experience with Requests, ReportLab, NumPy, SciPy, PyTables, cv2, imageio, python-twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, DataFrame and Pandas Python libraries during the development lifecycle.
Designed and deployed multiple microservices using Spring Boot, Hibernate and Oracle.
Developed Python libraries for ETL, data analysis and data science transformations.
Redesigned a critical ETL pipeline in Python and Scala handling 15 terabytes of data on Hive.
Extensively used ETL methodology for supporting data extraction, transformation and loading in a corporate-wide ETL solution using SAP BW, with strong knowledge of OLAP, OLTP, Extended Star, Star and Snowflake schema methodologies.
Solid experience with extracting data from multiple internal and external data sources using Python and SSIS.
Worked alongside data scientists and created an audience-targeting platform for campaign management; deployed a logistic regression model using Docker and Kubernetes on AWS.
Designed Cassandra and Oracle data model for microservices components. Analyzed partitioning and clustering keys for data models used across components.
Containerized and deployed ML models using Docker and Kubernetes to support various end to end apps.
Designed and developed the application using various Spring framework modules like Spring IOC; developed controller classes using Spring MVC, Spring AOP, Spring Boot, Spring Batch and Spring Data, and handled security using Spring Security.
Implemented the authentication and authorization of the application using Spring Security and Oauth2.
Implemented Restful Services with Spring Boot and Micro Service Architecture.
Developed RESTful web services to retrieve JSON documents related to customer data.
Deployed and maintained automated CI/CD pipelines for code deployment using Jenkins.
Built and deployed containers to move from a monolithic architecture to a microservices architecture.
Developed software to automate and monitor systems and services across the cloud using Python.
Served as a hands-on subject matter expert for DevOps and Automation in an Azure/AWS infrastructure environment.
Worked on MongoDB database concepts such as locking, transactions, indexes, Sharding, replication.
Migrated an RDBMS data mart to MongoDB to support frequent data schema change requests.
Maintained reporting data marts on RDBMS and Hive. Oversaw data refresh, automation, Schema changes, data validation and upload errors.
Migrated the team's recurring dashboards to Tableau, thereby reducing manual effort by 80 hours/week.
Extensively worked on data integration with Adobe Analytics, Mailchimp, Buffer and other marketing tools.
Environment: AWS services: S3, EMR, EC2, Step Functions, Glue, Athena, Lambda, CloudWatch, RDS, VPC, Subnets; Azure services: SQL Server (SSMS), SQL procedures, Data Factory, Pipelines; Python, Spark, API services, Vagrant, GitLab.
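For the S3-to-Redshift portion of the pipelines above, one hedged way to drive a Redshift COPY from Python is the Redshift Data API via boto3, sketched below; the cluster, database, IAM role, table and S3 path are all hypothetical placeholders (a JDBC/psycopg2 connection would be an equally valid route).

```python
# Sketch: submit a Redshift COPY of curated S3 data using the Redshift Data API.
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
    COPY analytics.customer_behavior          -- hypothetical target table
    FROM 's3://example-bucket/curated/customer_behavior/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
    FORMAT AS PARQUET;
"""

# Submit the COPY statement asynchronously and check its status.
response = client.execute_statement(
    ClusterIdentifier="example-cluster",   # placeholder cluster
    Database="analytics",                  # placeholder database
    DbUser="etl_user",                     # placeholder user
    Sql=copy_sql,
)

status = client.describe_statement(Id=response["Id"])["Status"]
print(f"COPY submitted, current status: {status}")
```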
Client: Corteva, IL Dec 2016 - Dec 2017
Role: Data Engineer
Responsibilities:
Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability and performance, and to provide meaningful and valuable information for better decision-making.
Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries and writing back into the S3 bucket.
Experience in data cleansing and data mining.
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided to the data scientists for further analysis.
Prepared scripts to automate the ingestion process using Python and Scala as needed from various sources such as APIs, AWS S3, Teradata and Snowflake.
Designed and Developed Spark workflows using Scala for data pull from AWS S3 bucket and Snowflake applying transformations on it.
Implemented Spark RDD transformations to Map business analysis and apply actions on top of transformations.
Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
Created scripts in Python to read CSV, JSON and Parquet files from S3 buckets and load them into AWS S3, DynamoDB and Snowflake.
Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway.
Migrated data from AWS S3 bucket to Snowflake by writing custom read/write snowflake utility function using Scala.
Worked on Snowflake schemas and data warehousing and processed batch and streaming data load pipelines using Snowpipe and Matillion from the data lake in the AWS S3 bucket.
Profiled structured, unstructured and semi-structured data across various sources to identify patterns in the data, and implemented data quality metrics using the necessary queries or Python scripts based on the source.
Installed and configured Apache Airflow for the S3 bucket and the Snowflake data warehouse, and created DAGs to run in Airflow.
Created a DAG using the EmailOperator, BashOperator and Spark Livy operator to execute jobs on an EC2 instance (a minimal DAG sketch follows this section).
Deployed the code to EMR via CI/CD using Jenkins.
Extensively used Codecloud for code check-ins and checkouts for version control.
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Codecloud, AWS.
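A minimal sketch of the kind of DAG described above, assuming Airflow 2.x operator module paths (they differ in 1.x); the DAG name, spark-submit command and alert address are placeholders.

```python
# Airflow DAG sketch: run a Spark ingestion job daily, then email the team.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="s3_to_snowflake_daily",          # hypothetical DAG name
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Run the Spark ingestion job via spark-submit on the EC2/EMR edge node.
    run_spark_ingest = BashOperator(
        task_id="run_spark_ingest",
        bash_command="spark-submit /opt/jobs/s3_to_snowflake.py",  # placeholder path
    )

    # Notify the team once the daily load finishes.
    notify_team = EmailOperator(
        task_id="notify_team",
        to="[email protected]",                  # placeholder address
        subject="Daily S3 to Snowflake load finished",
        html_content="The daily ingestion DAG completed successfully.",
    )

    run_spark_ingest >> notify_team
```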

Client: Service Now, Bangalore, IN Feb 2014 - Nov 2015
Role: Data Engineer

Responsibilities:
Functioned as a Data Engineer responsible for data modelling, data migration, design and preparation of ETL pipelines for both the cloud and on-premises Exadata.
Good knowledge of Apache Spark components including Spark Core, Spark SQL, Spark Streaming and Spark MLlib.
Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
Implemented advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
Developed spark applications for performing large scale transformations and denormalization of relational datasets.
Gained hands-on experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
Created reports for the BI team using Sqoop to export data into HDFS and Hive.
Performed analysis on unused user navigation data by loading it into HDFS and writing MapReduce jobs; the analysis provided inputs to the new APM front-end developers and the Lucent team.
Loaded data from multiple data sources (SQL, DB2 and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
Created HIVE Queries to process large sets of structured, semi-structured and unstructured data and store in Managed and External tables.
Developed complex HiveQL queries using the JSON SerDe.
Created HBase tables to load large sets of structured data.
Involved in importing real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
Performed Real time event processing of data from multiple servers in the organization using Apache Storm by integrating with Apache Kafka.
Managed and reviewed Hadoop log files.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Worked on PySpark APIs for data transformations (see the sketch after this section).
Performed data ingestion to Hadoop via Sqoop imports and carried out validations and consolidations on the imported data.
Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
Upgraded current Linux version to RHEL version 5.6
Expertise in hardening Linux servers and in compiling, building and installing Apache Server from source with minimal modules.
Worked on JSON, Parquet, Hadoop File formats.
Worked on different Java technologies like Hibernate, Spring, JSP and Servlets, and developed both server-side and client-side code for the web application.
Used GitHub for continuous integration services.
Worked on AWS Glue, AWS EMR, AWS S3 as part of EDS transformation.
Hands on experience in Big data analytics using Hadoop platform using HDFS, Sqoop, Flume, MapReduce, Spark, Scala.
Developed Spark applications using Spark-SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
Responsible for estimating the cluster size and for monitoring and troubleshooting the Databricks cluster.
Provided end-to-end business intelligence solutions to the sales and finance teams using AWS EMR.
Created roles and access-level privileges and handled Snowflake admin activity end to end.
Converted 230 view queries from SQL Server to Snowflake compatibility.

Environment: AWS, JMeter, Kafka, Ansible, Jenkins, Docker, Maven, Linux, Red Hat, Git, CloudWatch, Python, Shell Scripting, Golang, WebSphere, Splunk, Tomcat, SoapUI, Kubernetes, PySpark, Terraform, PowerShell.
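To illustrate converting a Hive query into Spark transformations with the PySpark API (as noted above), here is a minimal sketch; the web_logs table, its columns and the target table are hypothetical.

```python
# Sketch: rewrite a Hive aggregation as PySpark DataFrame transformations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-spark-conversion")
    .enableHiveSupport()
    .getOrCreate()
)

# Equivalent Hive query:
#   SELECT page, COUNT(1) AS hits, COUNT(DISTINCT user_id) AS users
#   FROM web_logs WHERE event_date = '2015-06-01' GROUP BY page;
daily_hits = (
    spark.table("web_logs")                        # hypothetical Hive table
         .filter(F.col("event_date") == "2015-06-01")
         .groupBy("page")
         .agg(
             F.count(F.lit(1)).alias("hits"),
             F.countDistinct("user_id").alias("users"),
         )
)

# Write the result back as a managed table for downstream reporting.
daily_hits.write.mode("overwrite").saveAsTable("reports.daily_page_hits")
```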


Client: Interglobal Enterprises, India May 2013 - Jan 2014
Role: Data Engineer

Responsibilities:
Point person for data extraction, transformation and loading for numerous modeling efforts.
Increased the accuracy of e-commerce order attribution by 35% by implementing a true last-click model.
Collaborated with Marketing heads to implement search bidding algorithm.
Leveraged Azure Functions for serverless computing, designing event-driven applications that respond dynamically to triggers (see the sketch after this section).
Collaborated with cross-functional teams to design and implement disaster recovery strategies using Azure Site Recovery and Azure Backup.
Developed several marketing and product segments for various product divisions.
Worked closely with product leads to plan quarterly OKRs, define and implement new KPIs.
Led analytics work to increase monthly active users, decrease churn and improve average spend.
Stayed updated with the latest Azure services, features, and best practices, actively exploring new capabilities to optimize cloud architecture and enhance applications.
Engineered solutions to flag seller-biased product reviews using BERT and NLP packages.
Made forecast recommendations to ensure inventory of >108k items in the warehouse using R.
Managed and optimized Azure databases using services like Azure SQL Database, Cosmos DB, or Azure Database for PostgreSQL to ensure data integrity and performance.
Conducted IoT solutions using Azure IoT Hub and Azure IoT Central to collect, analyze, and visualize data from connected devices.
Increased MAC by 3% by targeting at-risk customers using a stacked ensemble churn model in R.
Modelled growth drivers, conducted cost/benefit (C/B) analysis and executed a campaign to generate 10M, using R.
Modelled affluent customers' behavior using internal and external census data with Ensemble, KNN and SVM techniques; suggested a sweepstakes campaign to generate 20M IGMV.
Improved Q3 forecast accuracy to 95% from initial 70% accuracy using time series analysis in R.
Led initiatives to build an ML repository for the team, focused on the company's product lines.
Participated in workshops for A/B testing, reporting, KPI definition and behavioral segmentation.
Provided analytical support for a business unit; activities included ad-hoc, root-cause and trend analysis, and ETL.
Designed and maintained several marketing and product dashboards using Tableau and Hive.

Environment: Informatica PowerCenter, Azure, Oracle, UNIX Shell Scripting, Autosys.
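As an illustration of the event-driven Azure Functions work mentioned above, here is a hedged Python sketch using the v1 programming model; the blob path and downstream steps are placeholder assumptions, with the trigger binding itself declared in function.json.

```python
# Azure Functions sketch: event-driven handler fired when a new blob arrives.
import logging

import azure.functions as func


def main(inputblob: func.InputStream) -> None:
    # The runtime invokes this function for each new blob matching the trigger
    # path (e.g. "incoming/{name}") configured in function.json.
    logging.info(
        "Processing blob: name=%s, size=%s bytes", inputblob.name, inputblob.length
    )

    # Downstream steps (parsing, validation, loading to a database) would go here.
    payload = inputblob.read()
    logging.info("First 100 bytes: %r", payload[:100])
```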