
Sandeep Kumar
Sr Data Engineer
[email protected]
Location: Jersey City, New Jersey, USA
Relocation: New York & New Jersey
Visa: H1B
Professional Summary:
Nine years of experience developing, configuring, and maintaining Hadoop ecosystem components for web-based applications.
In-depth knowledge of Distributed Systems Architecture, Parallel Processing Frameworks, and the full Software Development Life Cycle.
Expertise in Big Data Ecosystem tools, including Apache Spark, Spark SQL, HDFS, DBFS, Hive, YARN, Sqoop, Oozie, Airflow, Snowflake, Druid, Azure Databricks, and AWS and GCP services.
Proficient in data analysis using Hive, Spark, Spark SQL, Python, and Scala.
Optimized Hive performance with partitioning and bucketing techniques (see the sketch at the end of this summary).
Experienced in automating data import/export between RDBMS and HDFS/Hive using Sqoop.
Managed batch job scheduling workflows with Oozie, Event Engine, and Airflow.
Engineered data workflow pipelines with Magellan ETL, Glue, Cloud Console, and Airflow.
Familiar with serialization formats like Sequence File, Avro, Parquet, and ORC.
Competent in handling various data sources, including flat files, XML files, and databases.
Monitored applications with Splunk for historical data analysis and trend reporting.
Experienced with Hadoop distributions, such as MapR, Cloudera (CDH 5 and CDH 6), Azure Databricks and Google Cloud Platform (GCP).
Hands-on experience creating pipelines using Azure Data Factory (ADF).
Handled and transformed data using Python/PySpark in Databricks and stored the results in Azure Data Lake Storage (ADLS).
Proficient in leveraging Delta Lake's time travel capabilities to access and query historical versions of data stored in Delta tables.
Configured AWS cloud environments and established connectivity between the Cloudera distribution and AWS for Python/PySpark jobs.
Proficient in using AWS Glue for automated data extraction, transformation, and loading (ETL) processes, streamlining data pipeline development.
Developed AWS Lambda functions for serverless event-driven data processing, improving data pipeline efficiency.
Proficient in using Amazon Athena, a serverless interactive query service, to analyze large datasets stored in Amazon S3.
Effectively used AWS CloudWatch to monitor AWS resources and applications, providing real-time visibility into system health and performance.
Migrated tables and applied transformations in Palantir Foundry and managed the projects created within it.
Created alerts in Palantir Foundry applications to notify users based on severity.
Managed relational databases on RDS, ensuring data integrity, availability, and scalability.
Employed Python Pandas DataFrames to clean, preprocess, and transform large datasets, making them suitable for analysis.
Analyzed and compared data in Power BI by reviewing visual dashboards and exploring KPIs.
Wrote cost-effective SQL queries, applying various performance-tuning methods to reduce the time taken to return results.
Knowledgeable in Kafka for high-speed data processing.
Implemented CI/CD and test-driven development using Jenkins.
Effectively utilized ServiceNow for incident management, problem resolution, knowledge articles, and change requests.
Proficient in UNIX scripting and in version control and CI tooling, including SVN, Jenkins, and GitHub.
Presented Hadoop use-cases at Show and Tell sessions to the business.
Proficient in Agile and Waterfall Software Development methodologies.
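
A minimal PySpark sketch of the Hive partitioning and bucketing approach mentioned above; the table and column names (sales_raw, sale_date, store_id) and the bucket count are illustrative assumptions, not details from any specific project.

```python
from pyspark.sql import SparkSession

# Illustrative sketch only: table and column names are hypothetical.
spark = (
    SparkSession.builder
    .appName("hive-partitioning-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("default.sales_raw")

# Partition by date so date-range filters prune whole directories, and bucket
# by store_id so store-level joins and aggregations touch fewer files.
(
    df.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .bucketBy(16, "store_id")
    .sortBy("store_id")
    .saveAsTable("default.sales_partitioned")
)
```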

Education:
Jawaharlal Nehru Technological University, Kakinada, AP, India Sep 2008 - Apr 2012
University of Central Missouri, Warrensburg, MO, USA Aug 2016 - Dec 2017
Technical Skills:

Programming Languages: Python, Scala, Java, C, C++, SQL
Big Data Technologies: Spark Core, Spark SQL, Hadoop, MapReduce, Hive, AWS Athena, YARN, Sqoop, Oozie, Airflow, Event Engine, AWS Glue, Flume, Zookeeper, Kafka
Scripting Languages: Shell Script, Python
Java Technologies: JDBC, JSP, JSON, Web Services, REST API
Databases: Oracle, MySQL, PostgreSQL
NoSQL Databases: HBase, Cornerstone
Application Servers: WebLogic, WebSphere, Tomcat
IDEs: Eclipse, IntelliJ, SQL Developer
Operating Systems: Windows, Unix, Linux
Version Control: Git, SVN, Rational ClearCase
Development Methodologies: Agile, Scrum, Waterfall
Hadoop Distributions & Cloud Platforms: Cloudera, MapR, AWS, Azure Databricks, Google Cloud Platform (GCP), Palantir Foundry

Professional Experience:

Lowe's (Remote) - Jersey City, NJ Aug 2022 - Present
Sr Data Engineer

Roles and Responsibilities:
Translated business requirements into code modules using various Big Data technologies, both in on-prem systems and on cloud platforms.
Collaborated with Data Scientists and Software Engineering teams, providing data based on requirements.
Designed job architectures to enhance existing processes and speed up execution.
Migrated tables from Alteryx and Teradata to Hadoop and GCS in the Data Ingestion Platform (DIP).
Developed and configured integrated business processes using Spark via Python/PySpark.
Enhanced existing code by optimizing Spark with PySpark and Hive configurations.
Validated Spark output data using Scala against data presented in Power BI dashboards.
Migrated data from existing platform to Palantir Foundry environment with Data Connection and Pipeline Builder.
Scheduled pipelines using Builds after designing the application path on top of the Ontology in Palantir Foundry.
Analyzed data in Fusion and checked code into the Code Repository in Palantir Foundry.
Exported cost reports for each project within Palantir Foundry using Resource Management.
Designed and optimized BigQuery tables over data stored in Google Cloud Storage (GCS).
Created pipelines in Google Cloud Platform (GCP) using Cloud Composer.
Automated jobs with Oozie and Apache Airflow using XML, YAML and Python.
Cleaned table data with Python Pandas before ingesting it into Hadoop.
Streamlined scripts into Airflow DAGs to perform various operations using Python (see the sketch after this list).
Coordinated with front-end UI teams to ensure output data could be consumed through Apache Druid.
Implemented and maintained complex software solutions for successful deployment.
Fixed code vulnerabilities identified by Snyk, helping to secure the codebase.
Collaborated with Business Analysts to meet specifications and architectural standards.
Coordinated and executed testing methodologies for error identification and software quality.
Followed the CI/CD cycle using version control tools like Bitbucket.
Monitored existing production jobs and implemented ad hoc fixes.
Managed tickets using JIRA, including CRQs/INCs for production activities.
Conducted knowledge transfer sessions to new team members and managed workload for junior developers.
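
A minimal sketch of how standalone scripts can be streamlined into an Airflow DAG, as referenced in the list above; the DAG id, schedule, task names, and commands are hypothetical placeholders rather than details of the actual pipelines.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def clean_with_pandas():
    # Placeholder for a Pandas cleaning step like the one described above.
    print("cleaning input tables with pandas")


# DAG id, schedule, and the shell command below are illustrative assumptions.
with DAG(
    dag_id="dip_ingestion_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean = PythonOperator(task_id="clean_data", python_callable=clean_with_pandas)
    load = BashOperator(task_id="load_to_hadoop", bash_command="echo 'hdfs load step'")

    # Run the cleaning step before the load step.
    clean >> load
```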

Environment/Tools: Linux, Hadoop, Hive, Alteryx, Teradata, Trino, Oozie, Airflow, Spark, Scala, Python, GIT, Snyk, Druid, Palantir Foundry, Cloud Console, Google Cloud Platform (GCP).


Dun and Bradstreet Corporation, Short Hills, NJ Sep 2019 - Jul 2022
Big Data Engineer

Roles and Responsibilities:
Designed, developed, tested, deployed, and supported Big Data applications on an EC2 Hadoop cluster in Amazon Web Services (AWS) and Azure Databricks environments.
Managed the collection and organization of all project requirements in Multi-Cloud environments.
Analyzed data sources and ensured they were productionized, automating ingestion into the platform wherever possible.
Took charge of spinning up the Cloudera EC2 cluster and scaling data nodes up and down.
Supervised the AWS instances employed in the project and visually represented cost estimates using graphs for each instance type.
Designed, built, and implemented end-to-end data pipelines using Azure technologies, including Azure Data Factory (ADF), Databricks, and Azure Data Lake Storage (ADLS).
Employed Delta Lake's time travel functionality for data recovery and rollbacks, ensuring data consistency and reliability (see the sketch after this list).
Integrated Apache Spark with REST APIs to process and analyze data in Databricks using PySpark/Python.
Applied performance-related tuning on top of default Hive/Spark settings.
Deployed and triggered Spark jobs in the Azure Databricks environment across different instance types, working against both DBFS and AWS S3.
Orchestrated workflows in Oozie and Apache Airflow to automate successive steps and incorporate appropriate quality checks.
Automated jobs that pulled data from file transfer servers into the required destinations via FTP/SFTP, STP, or mainframe deliverables.
Authored data pipelines that pull data from RESTful web services, perform transformations, and store the data in Amazon S3 using AWS Glue.
Created custom AWS Glue jobs and crawlers to extract and transform data from various sources, ensuring data consistency.
Integrated AWS Step Functions with AWS Lambda functions for seamless data processing automation.
Leveraged partitioning, data compression, and query performance tuning techniques to improve query execution times using AWS Athena.
Utilized AWS CloudWatch Metrics and AWS Auto Scaling to dynamically adjust resources based on custom-defined thresholds, optimizing resource utilization and cost savings.
Generated audit reports and metadata in the desired format, keeping all asset code in source control and following best-in-class release management and source code practices.
Wrote complex SQL queries in Snowflake, contributing to the efficiency of the DWaaS and handing results off to the business.
Designed efficient database schemas within Snowflake, considering factors like data organization and query performance.
Leveraged Time Travel feature in Snowflake within the DWaaS environment to enable historical data analysis, facilitating in-depth investigations into data states at specific time points.
Performed code reviews, bug fixes, and production activities when required.
Performed controlled releases to implement code/asset changes for standardized delivery to various end customers.
Developed and documented designs and implementations of requirements based on business needs.
Used the JIRA board and Confluence for tracking tickets and documentation updates.
Used Bitbucket as the version control tool.
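
A minimal PySpark sketch of the Delta Lake time travel usage described above (on Databricks, Delta support is built in); the ADLS path, version number, and timestamp are hypothetical values for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel-sketch").getOrCreate()

# Hypothetical ADLS location; any Delta table path works the same way.
delta_path = "abfss://container@storageaccount.dfs.core.windows.net/tables/customers"

# Read the current version of the table.
current_df = spark.read.format("delta").load(delta_path)

# Time travel: read an earlier snapshot by version number...
v5_df = spark.read.format("delta").option("versionAsOf", 5).load(delta_path)

# ...or by timestamp, e.g. to investigate or roll back a bad load.
snapshot_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2022-01-01 00:00:00")
    .load(delta_path)
)
```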

Environment/Tools: Linux, Hadoop, Hive, Oozie, SAS, Airflow, Spark, Python, Snowflake, GIT, CDH5, CDH6, AWS, Azure Databricks.


American Express, Phoenix, AZ Jan 2018 - Aug 2019
Software Developer / Big Data

Roles and Responsibilities:
Orchestrated a 600+ node Hadoop YARN cluster for data storage and analysis in Credit Fraud Risk Department.
Improved data security and encryption, safeguarding sensitive PII such as credit card information from fraudulent activity.
Developed and monitored Big Data applications on the Hadoop Cluster, leveraging technologies like Hadoop, Spark, Magellan, Hive, Sqoop, Oozie, Splunk, and HBase.
Wrote and executed long-running Spark jobs and monitored the flows in PySpark/Python.
Conducted Spark code reviews and resolved memory issues, enhancing performance with PySpark/Python (see the tuning sketch after this list).
Crafted high-impact Hive queries with the Hive Terminal and Magellan ETL tool.
Managed, accessed, and processed data in various formats including ORC and Parquet.
Engineered User-Defined Functions (UDFs) for complex transformations using Hive.
Automated data extraction and loading into Hive tables with Oozie.
Developed Shell scripts for use-case initiation and pre-validation.
Collaborated with Hadoop Admins to optimize performance for a more efficient cluster.
Managed and reviewed Hadoop log files to identify and resolve bugs.
Presented weekly status reports to the business on use-case progress and tracked issues.
Mentored the offshore team within the Production Support group, ensuring code quality.
Developed and documented design impacts based on system monitoring.
Utilized ServiceNow ITSM for Incident, Problem, and Change Request management.
Prioritized issues for efficient ticket management using JIRA, reducing client response time.
Leveraged GIT for version control to manage and collaborate on code and data projects.
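
A minimal sketch of the kind of PySpark tuning used when resolving memory issues, as noted above; the configuration values, input/output paths, and column names are illustrative assumptions, not settings from the actual jobs.

```python
from pyspark.sql import SparkSession

# Illustrative tuning knobs only; real values depend on cluster size and data volume.
spark = (
    SparkSession.builder
    .appName("spark-tuning-sketch")
    .config("spark.executor.memory", "8g")          # more heap per executor
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")  # right-size shuffle parallelism
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.read.parquet("/data/transactions")  # hypothetical input path

# Repartition on the aggregation key and persist reused data instead of
# recomputing it, a common fix for executor memory pressure and long runtimes.
tuned = df.repartition(400, "account_id").persist()
tuned.groupBy("account_id").count().write.mode("overwrite").parquet("/data/agg")
```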

Environment/Tools: Hadoop, Hive, Magellan, Spark, Python, Splunk, Oozie, Sqoop, Bash, GIT, HBase, ServiceNow, JIRA, MapR.

Vdevture Technologies, Hyderabad, Telangana, India Aug 2013 - Jul 2016
Software Engineer

Roles and Responsibilities:
Developed MapReduce pipeline jobs to process data and create the necessary files in Hadoop (see the sketch after this list).
Implemented a MapReduce-based, large-scale parallel relation-learning system.
Imported data from MySQL and Oracle into HDFS using Sqoop.
Developed Class diagrams, Sequence Diagrams using UML.
Designed various interactive front-end web pages using HTML, CSS, jQuery & Bootstrap.
Developed HTML and JSP pages for user interaction and data presentation.
Designed and developed moderately complex units/modules/products that meet requirements.
Maintained and upgraded existing programs (new features, refactoring, bug fixes) using MFC and the Win32 API where needed.
Actively participated in Unit Testing, User Acceptance Testing and Bug Fixing.
Collaborated with Quality Assurance team in creation of test plans and reviews.
Engaged in design and code reviews with other developers.
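
A minimal word-count-style sketch of a MapReduce job; the original pipeline jobs at this role were Java-based (see Environment/Tools), so this Python/Hadoop Streaming analogue is for illustration only, intended to be supplied as both mapper ("map" mode) and reducer ("reduce" mode) to the Hadoop Streaming jar.

```python
#!/usr/bin/env python
"""Illustrative Hadoop Streaming mapper/reducer pair (word count)."""
import sys


def mapper():
    # Emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Hadoop Streaming sorts mapper output by key, so counts for the same
    # word arrive contiguously and can be summed with a running total.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```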

Environment/Tools: Hadoop, HDFS, Java, Eclipse, MySQL, CSS, jQuery, Bootstrap, UML, MFC, Win32, Visual Studio.