Avijith - Data Engineer
[email protected]
Location: Seattle, Washington, USA
Relocation: yes
Visa: H1B
Data Engineer
PROFESSIONAL SUMMARY
Dynamic and motivated IT professional with 10+ years of experience as a Data Engineer, with expertise in designing data-intensive applications using cloud data engineering, data warehousing, the Hadoop ecosystem, big data analytics, data visualization, reporting, data quality solutions, and AI/ML technologies.
Hands-on experience across the Hadoop ecosystem, including Big Data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, NoSQL, Spark, Python, Scala, Sqoop, HBase, Hive, Oozie, Impala, Pig, Zookeeper, and Flume.
Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications; utilized Flume to analyze log files and write them into HDFS (a minimal sketch of this pattern follows this summary).
Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop with Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively with PySpark.
Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop/DevOps jobs.
Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for processing and storage of small data sets, and experienced in maintaining Hadoop clusters on AWS EMR.
Hands-on experience with Amazon EC2, S3, RDS (Aurora), IAM, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift, DynamoDB, and other services of the AWS family, as well as Microsoft Azure.
Proven expertise in deploying major software solutions for various high-end clients, meeting business requirements such as big data processing, ingestion, analytics, and cloud migration from on-premises to AWS.
Experience working with AWS databases such as ElastiCache (Memcached & Redis) and NoSQL databases (HBase, Cassandra & MongoDB), including database performance tuning and data modeling.
Established connectivity from Azure to the on-premises data center using Azure ExpressRoute for single- and multi-subscription setups.
Created Azure SQL databases and performed monitoring and restoration of Azure SQL databases; performed migration of Microsoft SQL Server to Azure SQL Database.
Experienced in data modeling and data analysis using dimensional and relational data modeling, Star schema/Snowflake modeling, fact and dimension tables, and physical and logical data modeling.
Expertise in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas such as Star and Snowflake schemas used in relational, dimensional, and multidimensional modeling.
Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance. Experience with file formats such as Avro, Parquet, ORC, JSON, and XML, and compression codecs such as Snappy and bzip2.
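For illustration, below is a minimal sketch of the Kafka-to-HDFS streaming pattern described above, using PySpark Structured Streaming; the broker address, topic name, and HDFS paths are placeholders rather than values from any specific engagement.

```python
from pyspark.sql import SparkSession

# Sketch only: broker, topic, and paths below are illustrative placeholders.
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Consume a Kafka topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "web_logs")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

# Append the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/web_logs")
         .option("checkpointLocation", "hdfs:///checkpoints/web_logs")
         .outputMode("append")
         .start())

query.awaitTermination()
```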
TECHNICAL SKILLS
Cloud Technologies: Azure (Data Factory (V2), Data Lake, Databricks, Blob Storage, Data Box), Amazon EC2, IAM, Amazon S3, Amazon RDS, Elastic Load Balancing, AWS Lambda, Amazon EMR, AWS Glue, Amazon Kinesis, Google Cloud Platform (GCP).
Tools: Azure Logic Apps, Crontab, Terraform, dbt.
Big Data: Hadoop, MapReduce, HDFS, Hive, Impala, Spark, Sqoop, HBase, Flume, Kafka, Oozie, Zookeeper, NiFi.
Code Repository Tools: Git, GitHub, Bitbucket.
Databases: MySQL, SQL Server Management Studio 18, MS Access, MySQL Workbench, Oracle Database 11g Release 1, Amazon Redshift, Azure SQL, Azure Cosmos DB, Snowflake, FACETS.
End User Analytics: Power BI, Tableau, Looker, QlikView.
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB.
Languages: Python, SQL, PostgreSQL, PySpark, PL/SQL, UNIX Shell Scripting, Perl, Java, C, C++.
ETL: Azure Data Factory, Snowflake, AWS Glue, Fivetran.
Operating Systems: Windows 10/7/XP/2000/NT/98/95, UNIX, Linux, DOS.

PROFESSIONAL EXPERIENCE

Client: Bloomberg, Remote Oct 2022 - Present
AWS Data Engineer
Designed and set up an enterprise data lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
Used data integration to manage data with speed and scalability using the Apache Spark engine and Databricks on AWS.
Used a SQL-based approach to create notebooks and the DHF_UI in DHF 2.1.
Converted code from Scala to PySpark in DHF (Data Harmonization Framework) and migrated the code and DHF_UI from DHF 1.0 to DHF 2.1.
Extracted structured data from multiple relational data sources as DataFrames in Spark SQL on Databricks.
Utilized Apache Airflow to orchestrate and manage workflow for scheduling Hadoop/DevOps jobs, ensuring efficient execution and monitoring.
Leveraged Kubernetes to orchestrate and manage containerized applications, ensuring scalability, reliability, and ease of deployment within the AWS environment.
Collaborated with DevOps teams to integrate Kubernetes into CI/CD pipelines, enabling continuous deployment and integration of cloud-native applications.
Responsible for loading data from the internal server and the Snowflake data warehouse into S3 buckets.
Performed the migration of large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from Oracle into Databricks.
Leveraged Oracle Analytics Cloud and Oracle Integration Cloud to develop robust data pipelines for seamless data ingestion, transformation, and analysis.
Created Databricks notebooks to streamline and curate the data for various business use cases.
Triggered and monitored the harmonization and curation jobs in the production environment; also scheduled several jobs using DHF jobs and ESP jobs.
Raised change requests and SNOW requests in ServiceNow to deploy changes into production.
Guided the development work of a team building PySpark (Python and Spark) jobs.
Used the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple sources, including loading nested JSON-formatted data into Snowflake tables (a minimal sketch follows this section).
Created AWS Lambda functions, provisioned EC2 instances in the AWS environment, implemented security groups, and administered Amazon VPCs.
Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
Implemented Lambda to configure the DynamoDB auto scaling feature and implemented a data access layer to access AWS DynamoDB data.
Developed Spark Applications for various business logics using Python.
Extracted, Transformed and Loaded (ETL) data from disparate sources to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and Azure Data Lake Analytics.
Worked with different file formats such as CSV, JSON, flat files, and Parquet to load data from sources into raw tables.
Implemented Triggers to schedule pipelines.
Designed and developed Power BI graphical and visualization solutions based on business requirement documents and plans for creating interactive dashboards.
Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
Knowledgeable about StreamSets pipelines used for ingesting data into the raw layer from Oracle sources.
Used Terraform scripts to automate provisioning of instances that had previously been launched manually.
Developed environments of different applications on AWS by provisioning on EC2 instances using Docker, Bash and Terraform.
Environment: Snowflake, Scala, PySpark, Python, SQL, AWS S3, Streamsets, Kafka 1.1.0, Sqoop, Spark 2.0, ETL, Power BI, Import and Export Data wizard, Terraform, GCP.
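As a minimal sketch of the nested-JSON load from S3 into Snowflake mentioned above (not the actual DHF job): the connection parameters, stage, and table/column names below are assumptions made for illustration only.

```python
import snowflake.connector

# Sketch only: all connection parameters and object names are illustrative placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()
try:
    # Land raw JSON documents from an external S3 stage into a table with a
    # single VARIANT column (assumed here to be EVENTS_JSON(v VARIANT)).
    cur.execute("""
        COPY INTO RAW.EVENTS_JSON
        FROM @RAW.S3_EVENTS_STAGE
        FILE_FORMAT = (TYPE = 'JSON')
    """)
    # Flatten the nested payload into a typed table for downstream curation.
    cur.execute("""
        INSERT INTO RAW.EVENTS_FLAT (event_id, user_id, event_ts)
        SELECT v:id::STRING, v:user.id::STRING, v:timestamp::TIMESTAMP_NTZ
        FROM RAW.EVENTS_JSON
    """)
finally:
    cur.close()
    conn.close()
```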
Client: Delta Airlines, St. Louis, MO Oct 2021 - Sep 2022
AWS Data Engineer
Developed Apache Presto and Apache Drill setups on an AWS EMR (Elastic MapReduce) cluster to combine multiple data stores such as MySQL and Hive; this enabled comparing results of operations like joins and inserts across various data sources controlled through a single platform.
Wrote AWS Lambda functions in Scala with cross-functional dependencies that generated custom libraries for delivering the Lambda functions in the cloud.
Wrote to the Glue metadata catalog so that the enriched data could be queried from Athena, resulting in a serverless querying environment.
Created PySpark DataFrames to bring data from DB2 into Amazon S3.
Worked on the Kafka backup index and Log4j appender to minimize logs, and pointed Ambari server logs to NAS storage.
Used the Curator API on Elasticsearch for data backup and restoration. Implemented Apache Airflow for orchestrating and scheduling ETL pipelines, improving automation and reliability in data processing tasks.
Created AWS RDS (Relational Database Service) instances to serve as the Hive metastore and combined the metadata of 20 EMR clusters into a single RDS instance, which prevents data loss even when an EMR cluster is terminated.
Built a full service catalog system with a complete workflow using Elasticsearch, Kinesis, and CloudWatch.
Leveraged cloud-provider services when migrating on-prem MySQL clusters to AWS RDS MySQL, provisioned multiple AWS AD forests with AD-integrated DNS, and utilized AWS ElastiCache for Redis.
Used AWS CodeCommit repositories to store programming logic and scripts and make them available again to new clusters.
Spun up EMR clusters of 30 to 50 nodes using memory-optimized instance types such as R2, R4, X1, and X1e with the autoscaling feature.
With Hive being the primary query engine on EMR, created external table schemas for the data being processed.
Mounted a local directory file path to Amazon S3 using s3fs-fuse so that KMS encryption was enabled on the data reflected in S3 buckets.
Designed and implemented ETL pipelines over S3 Parquet files in the data lake using AWS Glue (a minimal job sketch follows this section).
Migrated the data from Amazon Redshift data warehouse to Snowflake.
Used the AWS Glue catalog with crawlers to pick up data from S3 and perform SQL query operations, and used JSON schemas to define table and column mappings from S3 data to Redshift.
Applied auto scaling techniques to scale instances in and out based on memory usage over time. This helped reduce the instance count when the cluster was not actively in use, while accounting for the replication factor of 2 and leaving a minimum of 5 instances running.
Environment: Amazon Web Services, Elastic Map Reduce cluster, EC2s, CloudFormation, Amazon S3, Amazon Redshift, Hive, Scala, PySpark, Snowflake, Shell Scripting, Tableau, Kafka
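The Glue-based ETL work above can be sketched as a simple job script; the catalog database, table name, and S3 output path below are hypothetical, and the real pipelines are not reproduced here.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Sketch only: catalog, table, and S3 path names are illustrative placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled source table from the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw", table_name="events"
)

# Light cleanup with Spark, then write curated Parquet back to the data lake.
curated_df = raw.toDF().dropDuplicates().filter("event_ts IS NOT NULL")
curated = DynamicFrame.fromDF(curated_df, glue_context, "curated")

glue_context.write_dynamic_frame.from_options(
    frame=curated,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/curated/events/"},
    format="parquet",
)

job.commit()
```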
Client: Coforge, Hyderabad, India Sep 2018 - July 2021
Azure/Snowflake Python Data Engineer
Analyzed, developed, and built modern data solutions with Azure PaaS services to enable data visualization; understood the application's current production state and the impact of new installations on existing business processes.
Worked on migration of data from on-premises SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Created pipelines in Azure Data Factory using linked services, datasets, and pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
Used Azure ML to build, test, and deploy predictive analytics solutions based on data, demonstrating proficiency in AI/ML technologies.
Developed Spark applications with Azure Data Factory and Spark SQL for extraction, transformation, and aggregation of data from different file formats to uncover insights into customer usage patterns (a minimal Databricks sketch follows this section).
Applied technical knowledge to architect solutions that meet business and IT needs, created roadmaps, and ensured the long-term technical viability of new deployments, infusing key analytics and AI/ML technologies where appropriate (e.g., Azure Machine Learning, Machine Learning Server, Bot Framework, Azure Cognitive Services, Azure Databricks, etc.).
Managed the relational database service in which Azure SQL handles reliability, scaling, and maintenance.
Integrated data storage solutions with Spark, particularly with Azure Data Lake storage and Blob storage.
Configured Stream Analytics and Event Hubs and worked on managing IoT solutions with Azure.
Successfully completed a proof of concept for Azure implementation, with the larger goal of migrating on-premises servers and data to the cloud.
Responsible for estimating cluster size, monitoring, and troubleshooting the Spark Databricks cluster.
Experienced in tuning the performance of Spark applications for the proper batch interval time, parallelism level, and memory usage.
Extensively involved in analysis, design, and modeling; worked on Snowflake schema, data modeling and elements, source-to-target mappings, interface matrices, and design elements.
Wrote UDFs in Scala and PySpark to meet specific business requirements.
Analyzed large structured, semi-structured, and unstructured data sets using Hive queries.
Worked with structured data in Hive to improve performance through advanced techniques such as bucketing, partitioning, and optimizing self-joins.
Wrote and used complex data types for storing and retrieving data using HQL in Hive.
Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
Used the Snowflake cloud data warehouse for integrating data from multiple source systems, including nested JSON-formatted data, into Snowflake tables.
Hands-on experience in developing SQL scripts for automation purposes.
Environment: Azure Data Factory(V2), Snowflake, Azure Databricks, Azure SQL, Azure Data Lake, Azure Blob Storage, Hive, Azure ML, Scala, PySpark.
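A minimal sketch of the Spark extraction/aggregation pattern referenced above, as it might run on Azure Databricks; the ADLS Gen2 storage account, containers, and column names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Sketch only: storage account, containers, and columns are illustrative placeholders.
spark = SparkSession.builder.getOrCreate()

# Read raw CSV files landed in Azure Data Lake Storage Gen2.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/"))

# Aggregate daily usage metrics per region.
daily = (raw.withColumn("order_date", F.to_date("order_ts"))
         .groupBy("order_date", "region")
         .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue")))

# Write curated Parquet back to the lake, partitioned by day.
(daily.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales_daily/"))
```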

Client: Infogain, Hyderabad, India July 2014 - Aug 2018
Data Engineer
Performed multiple MapReduce jobs in Hive for data cleaning and pre-processing; loaded data from Teradata tables into Hive tables.
Experience in importing and exporting data between HDFS and RDBMS with Sqoop and migrating it according to client requirements.
Used Flume to collect, aggregate, and store the web log data from different sources like web servers and pushed to HDFS.
Developed Big Data solutions focused on pattern matching and predictive modeling.
Involved in Agile methodologies, Scrum meetings and Sprint planning.
Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
Handled resource management of the Hadoop cluster, including adding/removing cluster nodes for maintenance and capacity needs.
Involved in loading data from UNIX file system to HDFS.
Partitioned fact tables and materialized views to enhance performance; implemented Hive partitioning and bucketing on the collected data in HDFS.
Involved in integrating Hive queries into the Spark environment using Spark SQL.
Used Hive to analyze the partitioned and bucketed data to compute various metrics for reporting (a minimal sketch follows this section).
Improved performance of the tables through load testing using Cassandra stress tool.
Worked with the admin team to set up, configure, troubleshoot, and scale the hardware of a Cassandra cluster.
Created data models for customer data using Cassandra Query Language (CQL).
Developed and ran Map-Reduce Jobs on YARN and Hadoop clusters to produce daily and monthly reports as per user's need.
Experienced in connecting Avro sink ports directly to Spark Streaming for analysis of web logs.
Addressed the performance tuning of Hadoop ETL processes against very large data sets and worked directly with statisticians on implementing solutions involving predictive analytics.
Performed Linux operations on the HDFS server for data lookups, job changes if any commits were disabled, and rescheduling data storage jobs.
Created data processing pipelines for data transformation and analysis by developing Spark jobs in Scala.
Testing and validating database tables in relational databases with SQL queries, as well as performing Data Validation and Data Integration. Worked on visualizing the aggregated datasets in Tableau.
Migrated code to version control using Git commands for future use and to ensure a smooth development workflow.
Environment: Hadoop, Spark, MapReduce, Hive, HDFS, YARN, MobaXterm, Linux, Cassandra, NoSQL databases, Python, Spark SQL, Tableau, Flume, Spark Streaming.
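A minimal sketch of the Hive partitioning/bucketing and Spark SQL reporting pattern referenced above; the table name, columns, and bucket count are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Sketch only: table, columns, and bucket count are illustrative placeholders.
spark = (SparkSession.builder
         .appName("hive-reporting")
         .enableHiveSupport()
         .getOrCreate())

# Define a partitioned, bucketed Hive table for cleaned web log data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS web_logs_clean (
        user_id STRING,
        url STRING,
        response_ms INT
    )
    PARTITIONED BY (log_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Compute a daily reporting metric; the log_date filter enables partition pruning.
daily_latency = spark.sql("""
    SELECT log_date, AVG(response_ms) AS avg_response_ms
    FROM web_logs_clean
    WHERE log_date >= '2018-01-01'
    GROUP BY log_date
""")
daily_latency.show()
```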