Satvik Vandanam
Sr. Data Engineer | +1 (470) 785-7456 | [email protected] | Atlanta, Georgia, USA
Relocation: Yes | Visa: GC
http://www.linkedin.com/in/satvik-vandanam-210129240

PROFESSIONAL SUMMARY:
- AWS-certified Data Engineer with 7+ years of IT experience and exceptional expertise in the Big Data/Hadoop ecosystem and data analytics techniques.
- Hands-on experience with the Big Data/Hadoop ecosystem, including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, Kafka, Sqoop, and Oozie.
- Proficient in Python scripting; worked with statistical functions in NumPy, visualization with Matplotlib, and data organization with Pandas.
- Experience with Hadoop distributions such as Cloudera and Hortonworks Data Platform (HDP).
- In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, Resource Manager, Node Manager, Name Node, and Data Node.
- Hands-on experience importing and exporting data between RDBMS and HDFS using Sqoop.
- Experience with the Hive data warehouse tool: creating tables, distributing data via static and dynamic partitioning and bucketing, and applying Hive optimization techniques.
- Experience with Cassandra and NoSQL databases including MongoDB and HBase.
- Experience tuning and debugging Spark applications and applying Spark optimization techniques.
- Experience building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
- Hands-on experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL, and DataFrames.
- Extensive knowledge of implementing, configuring, and maintaining Amazon Web Services (AWS) such as EC2, S3, Redshift, Glue, and Athena.
- Experience with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Experienced in data manipulation using Python libraries such as Pandas, NumPy, SciPy, and Scikit-learn for data analysis, numerical computation, and machine learning.
- Experience writing SQL queries; experience in data integration and performance tuning.
- Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
- Actively involved in all phases of the data science project life cycle, including data collection, data pre-processing, exploratory data analysis, feature engineering, feature selection, and building machine learning model pipelines.
- Hands-on experience with visualization tools such as Tableau and Power BI.
- Experience with Git and Bitbucket version control systems.
- Extensive experience with Test-Driven Development and Agile/Scrum development; participated in daily Scrum meetings to discuss progress and helped make them more productive.
- Excellent communication, interpersonal, and problem-solving skills; a team player able to adapt quickly to new environments and technologies.

TECHNICAL SKILLS:
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, Zookeeper, Cloudera Manager, Kafka, Flume
ETL Tools: Informatica
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Monitoring and Reporting: Tableau, custom shell scripts
Hadoop Distributions: Hortonworks, Cloudera
Certifications: AWS Cloud Practitioner, AWS Solutions Architect
Programming & Scripting: Python, Scala, Java, SQL, Shell Scripting, C, C++
Databases: Oracle, MySQL, Teradata
Machine Learning & Analytics Tools: Supervised learning (linear regression, logistic regression, decision tree, random forest, SVM, classification), unsupervised learning (clustering, KNN, factor analysis, PCA), natural language processing, Google Analytics, Fiddler, Tableau
Version Control: Git, GitHub, SVN, CVS
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7
Cloud Computing: AWS, Azure
AWS Services: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon SageMaker, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon SQS, AWS Identity and Access Management (IAM), Amazon CloudWatch, Amazon EBS, AWS CloudFormation

PROFESSIONAL EXPERIENCE:

Client: Nomura Bank, Tampa, FL (March 2022 - Present)
Role: Sr. Data Engineer
Responsibilities:
- Developed Spark applications in Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Tuned Spark applications to set batch interval time and the correct level of parallelism, along with memory tuning.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that receives data from Kafka in real time and persists it to Cassandra.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for source-to-target transformations.
- Developed Kafka consumer APIs in Python for consuming data from Kafka topics.
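The per-message handling in such a Python consumer could be sketched as below; the polling loop itself (e.g. via kafka-python's KafkaConsumer) is omitted, and the field names are hypothetical:

```python
# Sketch: decode one Kafka message value (JSON bytes) into a record.
# Field names are hypothetical. The consumer loop would wrap this, e.g.:
#   for msg in KafkaConsumer("events"): record = to_record(msg.value)
import json

def to_record(msg_value: bytes) -> dict:
    """Decode a Kafka message value into a normalized dict."""
    event = json.loads(msg_value.decode("utf-8"))
    return {
        "user_id": event["user_id"],
        "event_type": event.get("type", "unknown"),
    }
```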
- Used Kafka to consume XML messages and Spark Streaming to process the XML files to capture UI updates.
- Gained valuable experience with practical implementations of cloud technologies including IAM and Amazon services such as Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Service (S3), CloudFormation, Virtual Private Cloud (VPC), Route 53, Lambda, Glue, and EMR.
- Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for small-data-set processing and storage.
- Loaded data into S3 buckets using AWS Lambda functions, AWS Glue, and PySpark; filtered data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
- Maintained and operated a Hadoop cluster on AWS EMR.
- Used an AWS EMR Spark cluster and Cloud Dataflow on GCP to compare the efficiency of a POC on a developed pipeline.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
- Created live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Worked on Amazon Redshift to consolidate all data warehouses into a single data warehouse.
- Designed column families in Cassandra; ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Designed, developed, deployed, and maintained MongoDB.
- Worked extensively with Hadoop components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, Spark, and MapReduce programming.
- Worked extensively with Sqoop to import and export data between HDFS and relational database systems (RDBMS).
- Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
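A Lambda function loading data into S3, as described above, might follow this shape (a sketch; the bucket and key names are hypothetical, and the S3 client is passed in so the logic can be shown without AWS credentials; in a real Lambda it would be boto3.client("s3")):

```python
# Sketch of an AWS Lambda handler that writes an incoming record to S3.
# Bucket/key names are hypothetical; s3_client stands in for
# boto3.client("s3"), whose put_object call has the same signature.
import json

def handler(event, s3_client, bucket="my-staging-bucket"):
    """Serialize the event payload and store it under a per-record key."""
    key = f"incoming/{event['record_id']}.json"
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(event))
    return {"status": "stored", "key": key}
```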
- Optimized Hive tables using techniques such as partitioning and bucketing to improve the performance of HiveQL queries.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
- Experience with Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.
- Worked on custom loaders and storage classes in Pig to handle several data formats such as JSON, XML, and CSV, and generated bags for processing with Pig.
- Generated various reports using Power BI and Tableau per client specifications.

Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, S3, EC2, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, Scala, PySpark, shell scripting, Linux, MySQL, NoSQL, Solr, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI, SOAP, Cassandra, and Agile methodologies.

Client: Caterpillar, Peoria, IL (January 2021 - March 2022)
Role: Big Data Engineer
Responsibilities:
- Analyzed large datasets to determine the optimal way to aggregate and report on them.
- Developed Spark applications in Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala and to NoSQL databases such as HBase and Cassandra.
- Used Kafka for live streaming data and performed analytics on it.
- Worked with Sqoop to transfer data between relational databases and Hadoop; loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS; implemented a Python-based distributed random forest via Python streaming.
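The random-forest work above ran as a distributed Python-streaming job; as a minimal single-node illustration of the same model family (synthetic data, scikit-learn assumed available):

```python
# Minimal single-node random-forest illustration with scikit-learn.
# The production version described above was distributed; data here is
# synthetic and the label rule is hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic label rule

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
train_acc = model.score(X, y)  # accuracy on the training data
```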
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc.
- Created AWS data pipelines using various AWS resources, including AWS API Gateway to receive responses from AWS Lambda; retrieved data from Snowflake via a Lambda function and converted the response into JSON format, using Snowflake, DynamoDB, AWS Lambda, and AWS S3.
- Migrated an existing on-premises application to AWS; used services such as EC2 and S3 for small-data-set processing and storage, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop machine learning algorithms, and applied algorithms such as linear regression, multivariate regression, naive Bayes, random forests, k-means, and KNN for data analysis.
- Developed Python code for tasks, dependencies, and time sensors for each job, for workflow management and automation using Airflow.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
- Created Glue jobs to process data from the S3 staging area to the S3 persistence area.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for source-to-target transformations.
- Built interactive Power BI dashboards and reports based on business requirements.
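The extraction-and-aggregation core of jobs like those above reduces, conceptually, to grouping and summing; a stdlib-only sketch (in the actual jobs this ran as MapReduce/Spark, and the column names here are hypothetical):

```python
# Sketch: aggregate a numeric column per key from CSV text, stdlib only.
# In the jobs described above this logic ran as MapReduce/Spark programs;
# column names are hypothetical.
import csv
import io
from collections import defaultdict

def totals_by_key(csv_text, key_col, value_col):
    """Sum value_col per key_col over CSV rows (first row is the header)."""
    sums = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sums[row[key_col]] += float(row[value_col])
    return dict(sums)
```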
Environment: AWS EMR, S3, EC2, Lambda, MapR, Apache Spark, Spark Streaming, Spark SQL, HDFS, Hive, Pig, Apache Kafka, Sqoop, Flume, Python, Scala, shell scripting, Linux, MySQL, HBase, NoSQL, DynamoDB, Cassandra, machine learning, Snowflake, Maven, Docker, AWS Glue, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI.

Client: Cummins, Columbus, IN (Jan 2019 - Dec 2020)
Role: Data Engineer
Responsibilities:
- Worked extensively with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created pipelines in Azure Data Factory (ADF) using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL DW, including the write-back tool and the reverse direction.
- Created an Application Interface Document for the downstream team to build a new interface to transfer and receive files through Azure Data Share.
- Designed and configured Azure cloud relational servers and databases, analyzing current and future business requirements.
- Wrote PowerShell scripts to automate Azure cloud system creation of resource groups, web applications, Azure Storage blobs and tables, and firewall rules.
- Worked on migration of data from on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Configured input and output bindings of Azure Functions with an Azure Cosmos DB collection to read and write data from the container whenever the function executes.
- Designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
- Developed Elastic Pool databases and scheduled Elastic Jobs to execute T-SQL procedures.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Ingested data in mini-batches and performed RDD transformations on them using Spark Streaming to run streaming analytics in Databricks.
- Created and provisioned the Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
- Created multiple Databricks Spark tasks using PySpark to perform several table-to-table operations.
- Created data pipelines for different events from Azure Blob Storage into Hive external tables.
- Used various Hive optimization techniques such as partitioning, bucketing, and map joins.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
- Developed automatic job flows, run daily through Oozie (and on demand), which execute MapReduce jobs internally.
- Extracted tables and exported data from Teradata through Sqoop and placed it in Cassandra.

Environment: Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Azure SQL, Snowflake, MongoDB, Cassandra, Teradata, Ambari, Flume, Tableau, Power BI, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, PySpark, Airflow, Hive, Sqoop, HBase, Oozie.

Client: Webster Bank, Stamford, CT (Feb 2017 - Dec 2018)
Role: Data Engineering Analyst
Responsibilities:
- The Enterprise Insurance data warehouse is a conversion project migrating existing data marts into one integrated place to gain the advantages of a corporate-wide data warehouse.
- The project involves rewriting/developing existing data marts and adding new subject areas to them, giving business users a platform to run queries across various subject areas using a single OLAP tool (Cognos).
- Created mapping design documents to transfer data from source systems to the data warehouse; built an ETL pipeline that simplified analysts' work and reduced patients' treatment expenses by up to 40%.
- Developed Informatica mappings, sessions, worklets, and workflows.
- Wrote shell scripts to monitor load on the database and Perl scripts to format data extracted from the data warehouse based on user requirements.
- Designed, developed, and delivered jobs and transformations over the data to enrich it and progressively elevate it for consumption in the delta lake layer.
- Managed multiple small projects with a team of 5; planned and scheduled project milestones and tracked project deliverables.
- Performed network traffic analysis using data mining, the Hadoop ecosystem (MapReduce, HDFS, Hive), and visualization tools, considering raw packet data, network flow, and Intrusion Detection Systems (IDS).
- Analyzed the company's expenses on software tools and devised a strategy that reduced those expenses by 30%.
- Created a chatbot to receive client complaints and provide an anticipated wait time for issue resolution.

Environment: Python, R, AWS EMR, Apache Spark, Hadoop ecosystem (MapReduce, HDFS, Hive), Scala, LogRhythm, OpenVAS, Informatica, Ubuntu.

Client: Quess Corp Limited, Bangalore, India (Sep 2015 - Feb 2017)
Role: ETL/SQL Developer
Responsibilities:
- Analyzed, designed, and developed databases using ER diagrams, normalization, and relational database concepts.
- Engaged in various systems' design, development, and testing.
- Developed SQL Server stored procedures and tuned SQL queries using indexes and execution plans.
- Developed user-defined functions and created views.
- Created triggers to maintain referential integrity.
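As an illustration of the stored-procedure and trigger work described above, a T-SQL sketch (all table, column, and object names here are hypothetical):

```sql
-- Illustrative T-SQL; all object names are hypothetical.
CREATE PROCEDURE dbo.GetOrdersByCustomer
    @CustomerId INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT OrderId, OrderDate, Amount
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId;
END;
GO

-- Trigger guarding referential integrity on delete.
CREATE TRIGGER dbo.trg_Customers_Delete
ON dbo.Customers
INSTEAD OF DELETE
AS
BEGIN
    IF EXISTS (SELECT 1 FROM dbo.Orders o
               JOIN deleted d ON o.CustomerId = d.CustomerId)
        THROW 50001, 'Customer has orders; delete blocked.', 1;
    ELSE
        DELETE c FROM dbo.Customers c
        JOIN deleted d ON c.CustomerId = d.CustomerId;
END;
```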
- Implemented exception handling.
- Worked on client requirements and wrote complex SQL queries to generate Crystal Reports.
- Created and automated regular jobs.
- Tuned and optimized SQL queries using execution plans and Profiler.
- Developed the controller component with servlets and action classes; business (model) components were developed using Enterprise Java Beans (EJB).
- Established schedules and resource requirements by planning, analyzing, and documenting development effort, including timelines, risks, test requirements, and performance targets.
- Analyzed system requirements and prepared system design documents.
- Developed dynamic user interfaces with HTML and JavaScript using JSP and servlet technology.
- Used JMS for sending and receiving messages.
- Created and executed test plans using Quality Center/TestDirector; mapped requirements to test cases in Quality Center.
- Supported system testing and user acceptance testing.
- Rebuilt indexes and tables as part of a performance tuning exercise.
- Involved in performing database backup and recovery.
- Worked on documentation using MS Word.

Environment: MS SQL Server, SSRS, SSIS, SSAS, DB2, HTML, XML, JSP, Servlet, JavaScript, EJB, JMS, MS Excel, MS Word.
________________________________________
Education Details:
Master's in Data Science and Analytics, Georgia State University, Atlanta, 2017
Bachelor's in Computer Science, Amrita University, Bangalore, 2013