Sai Jaswanth Kunku - Data Engineer
[email protected]
Location: Hilliard, Ohio, USA
Relocation:
Visa:
Over 8 years of professional experience spanning Data Engineering, Data Science, and Data Analysis, with proficiency in Statistical Analysis, Data Mining, and Machine Learning.
Demonstrated expertise across diverse industries, including Telecommunications, Financial Services, Healthcare, and Retail, attesting to adaptability and domain knowledge.
Proficient in using Cloudera, EMR, and Google Dataproc big data distributions and working with a wide range of components such as Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
Proven track record in managing the entire data engineering project lifecycle, encompassing Data Collection, Data Transformation, Data Preparation, Data Validation, Data Mining, and Data Visualization, and adept at handling both structured and unstructured data sources.
Experienced in creating and optimizing Spark jobs with Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and Pair RDDs, using the PySpark and Scala APIs.
Utilized Snowflake components such as role-based security, SnowSQL, Snowpipe, connectors, Data Sharing, Cloning, Time Travel, Tasks, and Data Loading and Unloading.
Skilled in building streaming data pipelines from diverse sources including logs, flat files, APIs, and databases using technologies such as Kafka, Google Pub/Sub, and AWS Kinesis.
Proven expertise in writing Hive User-Defined Functions (UDFs) and creating Hive external and internal tables with a wide array of file formats including Avro, Parquet, ORC, JSON, and XML.
Possess a deep understanding of data modeling principles, encompassing Star Schema, Snowflake Schema, and Slowly Changing Dimension (SCD) methodologies within data warehousing.
Proficient in using various SQL and NoSQL databases such as MySQL, PostgreSQL, MongoDB, and Cassandra, with a strong ability to design, query, and optimize databases for efficient data storage and retrieval.
Hands-on experience in performing data mappings, incremental loads, and building connections using ETL tools including Talend and Informatica for batch processing and reporting.
Profound expertise in Microsoft Business Intelligence (MSBI) tools, including SSIS (SQL Server Integration Services), SSRS (SQL Server Reporting Services), and SSAS (SQL Server Analysis Services), with a strong track record of designing and developing packages and OLAP cubes for effective data-driven decision-making.
Extensive hands-on experience with Apache Airflow, including designing, configuring, and orchestrating complex data pipelines, demonstrating proficiency in workflow automation and scheduling.
Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from on-premises servers, Access, logs, and CSV files using SSIS.
Highly skilled in data extraction, transformation, and loading (ETL) processes using BI tools such as Tableau, Power BI, and IBM Cognos, ensuring accurate and up-to-date data for analysis.
Extensive hands-on experience in implementing and managing the ELK stack (Elasticsearch, Logstash, and Kibana) for efficient log and data analysis.
Hands-on experience in performing data collection, data analytics, and data profiling in Python and R using statistical libraries such as Pandas, NumPy, SciPy, scikit-learn, Matplotlib, Seaborn, Beautiful Soup, NLTK, PyTorch, ggplot2, Caret, dplyr, and tidyverse (illustrative sketch at the end of this summary).
Strong knowledge of machine learning concepts such as supervised/unsupervised learning and Natural Language Processing (NLP), with experience implementing ML models on big data using frameworks such as PyTorch and TensorFlow.
Hands-on experience in using AWS services such as EC2, S3, RDS, VPC, IAM, CloudWatch, SNS, SQS, Lambda, Elasticsearch, Kinesis, and Glue for batch and stream processing.
Proficient in writing complex SQL queries, with extensive experience in utilizing advanced SQL functions, subqueries, joins, and window functions to extract and manipulate data from relational databases.
Extensive experience in handling multiple tasks, meeting deadlines, and creating deliverables in fast-paced environments while interacting with business and end users.
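The short Python sketch below illustrates the kind of profiling and outlier check referenced in the summary. It is a minimal example only; the column name, synthetic data, and 1.5x IQR fences are assumptions made purely for demonstration.

    # Minimal data-profiling sketch (illustrative; column name and data are hypothetical).
    import pandas as pd
    import numpy as np

    def profile_and_flag_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
        """Print summary statistics for a numeric column and flag IQR-based outliers."""
        print(df[column].describe())                 # quick profile of the column
        q1, q3 = df[column].quantile([0.25, 0.75])   # interquartile range bounds
        spread = 1.5 * (q3 - q1)                     # assumed 1.5x IQR fences
        df["is_outlier"] = ~df[column].between(q1 - spread, q3 + spread)
        return df

    # Usage with synthetic data: a roughly normal column plus two injected anomalies.
    sample = pd.DataFrame(
        {"claim_amount": np.append(np.random.normal(100, 10, 500), [900.0, -250.0])})
    flagged = profile_and_flag_outliers(sample, "claim_amount")
    print(flagged["is_outlier"].sum(), "outliers flagged")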
TECHNICAL SKILLS
Programming Languages: R, C/C++, C#, SQL, PL/SQL, Java, SAS, Scala
Scripting Languages: Python, JavaScript, UNIX Shell scripting, Bash scripting, PowerShell
Databases: MySQL, DB2, Oracle, MS SQL Server, PostgreSQL, Bigtable, MongoDB, HBase, Cassandra
Data Warehousing: Snowflake, AWS Redshift, BigQuery, Azure Synapse Analytics, Teradata
Data Visualization: Tableau, Power BI, Excel (Pivot Tables, Lookups), D3.js
Data Integration: Informatica PowerCenter, SSIS, Talend, IBM InfoSphere DataStage, Ab Initio, Pentaho
Big Data: Hadoop, Spark, Apache Hive, Apache HBase, Apache NiFi, Apache Kafka, Apache Airflow
Operating Systems: Linux, Ubuntu, macOS, CentOS, Windows
Version Control & CI/CD: Git, Bitbucket, GitHub, Docker, Jenkins, Terraform
Cloud Computing & Web Development: AWS, Azure, GCP, HTML, CSS, JavaScript
SDLC/Testing Methodologies: Agile, Waterfall, Scrum, A/B Testing, Unit Testing
Other Tools: VS Code, Visual Studio, Zeppelin Notebook, Anaconda, MS Office, Jupyter Notebook

PROFESSIONAL EXPERIENCE

Amazon - Seattle, Washington
Sr. Data Engineer, July 2022 - Present
Responsibilities:
Orchestrated end-to-end data pipelines by systematically collecting, loading, and transforming sales, product, and customer data from enterprise databases into a big data platform to facilitate downstream analytics and machine learning.
Utilized Kinesis Data Streams to capture real-time data changes in DynamoDB tables and efficiently distributed the raw JSON data through Kinesis Firehose to the data lake.
Created an operational data store (ODS) to perform real-time ad-hoc analysis on the OLTP data store using Amazon DMS.
Optimized S3 configurations with dynamic partitioning and compression mechanisms and created Lambda functions and Step Functions to seamlessly ingest data from Kinesis Firehose into S3 buckets.
Leveraged Amazon EMR to perform distributed data transformations and aggregations using Hive and Spark, uncovering valuable insights from the stored data.
Developed Hive User-Defined Functions (UDFs) for intricate data transformations and applied external table partitioning and bucketing techniques, enhancing query response times.
Created Spark SQL scripts in Apache Spark for robust data cleansing, transformation, and filtering operations on Hive tables and exported the data in Parquet format with Snappy compression to S3 for efficient read/write processing (see the brief sketch following this role).
Streamlined data flow from S3 into RDS for analytics by performing incremental loads using Glue Crawlers and Glue ETL, and integrated the output with Amazon QuickSight.
Crafted HQL queries to seamlessly retrieve, process, and migrate raw data from EMR to S3, streamlining data ingestion and storage within Redshift.
Created data marts within Redshift to enable machine learning teams to build models and improve product recommendations.
Environment: Python, Amazon EMR, Amazon DMS, DynamoDB, Elasticsearch, Hadoop, Hive, Linux, Amazon Kinesis, Amazon QuickSight, AWS Glue, Amazon Redshift, Spark SQL, Git
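The sketch below is a minimal PySpark illustration of the Spark SQL cleansing and Snappy-compressed Parquet export pattern mentioned above; the Hive table, column names, and S3 path are hypothetical placeholders rather than the production objects.

    # Illustrative PySpark sketch (table, columns, and bucket path are hypothetical).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("sales-cleansing-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a Hive table, drop malformed rows, and standardize a few fields.
    orders = spark.table("analytics.raw_orders")          # hypothetical Hive table
    cleaned = (orders
               .dropna(subset=["order_id", "customer_id"])
               .withColumn("order_ts", F.to_timestamp("order_ts"))
               .withColumn("order_date", F.to_date("order_ts"))
               .filter(F.col("order_total") > 0))

    # Export as Snappy-compressed Parquet, partitioned by date, to S3.
    (cleaned.write
        .mode("overwrite")
        .option("compression", "snappy")
        .partitionBy("order_date")
        .parquet("s3://example-bucket/curated/orders/"))   # hypothetical bucket

Snappy-compressed Parquet is a common default for S3-backed analytical tables because it keeps files compact while remaining splittable for parallel reads.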
Citizens Bank - Johnston, Rhode Island
Data Engineer, Nov 2021 - May 2022
Responsibilities:
Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
Performed the migration of large data sets to Databricks (Spark); created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 to Delta Lake using ADF.
Developed comprehensive data mapping documentation to map complex Oracle data structures to Azure Synapse schemas, enabling the accurate transfer of data from Azure Data Factory to Azure Synapse.
Worked on Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources with Azure Synapse.
Developed ELT processes using Azure Data Factory to load data from various sources to HDFS and performed structural modifications using MapReduce and Hive.
Created notebooks and performed RDD transformations to streamline and curate the data for various business needs using Spark Streaming in Databricks.
Developed and optimized PySpark jobs on Databricks clusters for large-scale data processing and analysis tasks to improve runtime performance.
Wrote complex SQL queries involving database joins, self-joins, aggregate functions, rank functions, and case statements to aggregate data as per business requirements.
Collaborated with cross-functional teams to set up the necessary infrastructure and environments for data migration, ensuring compatibility and smooth data flow.
Worked extensively on data transformations, Integration Runtimes, Azure Key Vault, triggers, and migrating Data Factory pipelines to higher environments using ARM templates.
Environment: Python, MySQL, Databricks, Azure SQL DW, Azure Synapse, Cosmos DB, ADF, Power BI, Azure Data Lake, ARM, Azure HDInsight, Apache Spark

Tata Consultancy Services - Hyderabad, India
Data Scientist / Data Engineer, Apr 2019 - Jul 2021
Responsibilities:
Loaded data from microservices to Google Cloud Storage using Kafka and performed structural modifications using Spark on top of Hive in Google Dataproc.
Implemented change data capture (CDC) pipelines with the Debezium Kafka Connector to process data from MSSQL in Google Cloud Dataflow (GCP).
Configured Kafka with the Spark Streaming API to fetch near real-time data from multiple sources, such as weblogs, for timely analysis and actionable insights.
Ingested billions of claim records into the Spark cluster and applied transformations and aggregations, resulting in a 30% reduction in data processing time.
Implemented data lineage tracking and metadata management within NiFi, enhancing data governance and providing end-to-end visibility into data movement and transformations.
Managed and maintained HBase clusters, including schema design, table optimization, and data modeling, and implemented row key strategies for optimal data retrieval performance.
Designed and implemented Snowflake database schemas, tables, and views to accommodate structured and semi-structured data.
Used Matplotlib and Seaborn in Python to visualize the data and performed feature engineering to detect outliers and perform normalization.
Developed a Random Forest model with tuned hyperparameters using Spark MLlib to identify different kinds of fraud in medical bill claims (sketch below).
Created data visualizations using D3.js and Tableau, enabling stakeholders to gain actionable insights and make informed decisions based on visually represented data.
Environment: Python, GCP, Snowflake, Linux, Spark, Databricks, Tableau, D3.js, SQL Server, Excel
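Below is a minimal Spark MLlib sketch of the Random Forest fraud-classification approach described above; the input path, feature columns, label column, and hyperparameter values are assumed for illustration only.

    # Illustrative Spark MLlib sketch (path, features, and label are hypothetical).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("claims-fraud-sketch").getOrCreate()
    claims = spark.read.parquet("gs://example-bucket/curated/claims/")   # hypothetical path

    # Assemble numeric features into a single vector column for MLlib.
    assembler = VectorAssembler(
        inputCols=["claim_amount", "provider_visits", "days_to_file"],
        outputCol="features")
    train, test = assembler.transform(claims).randomSplit([0.8, 0.2], seed=42)

    rf = RandomForestClassifier(labelCol="is_fraud", featuresCol="features",
                                numTrees=100, maxDepth=8)
    model = rf.fit(train)

    # Evaluate on the held-out split (area under the ROC curve by default).
    auc = BinaryClassificationEvaluator(labelCol="is_fraud").evaluate(model.transform(test))
    print(f"Test AUC: {auc:.3f}")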
Brainy n Bright Inc. - Hyderabad, India
Data Scientist / Data Engineer, Nov 2017 - Mar 2019
Responsibilities:
Created new features from millions of transaction records and trained models using machine learning techniques such as Gradient Boosting Trees and Deep Learning.
Developed SSIS packages and maintained SQL Server Agent jobs to perform initial loads and full loads into cloud storage.
Analyzed and determined a cutoff point for accepting/declining transactions to minimize fraud losses and improve customer experience, using machine learning algorithms such as Logistic Regression, Random Forests, and Clustering in SAS, R, and Python (see the sketch following this section).
Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to implement various machine learning algorithms.
Used SAS, SQL, Oracle, Teradata, and MS Office analysis tools to complete analysis requirements.
Created SAS data sets by extracting data from an Oracle database and flat files.
Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL.
Executed scheduled tasks for weekly and monthly data updates while managing and manipulating the data for efficient database management.
Wrote SQL queries to retrieve and validate data and prepared data mapping documents.
Created dashboards using SSAS and built matrix and tabular reports using Reporting Services.
Loaded data from multiple data sources (SQL Server, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
Involved in the development of web services using SOAP for sending and receiving data from the external interface in XML format.
Used Alteryx for data preparation so that the data was readily usable for developing reports and visualizations.
Worked on RDBMS such as MySQL and NoSQL databases such as MongoDB to capture data from enterprise databases for ML modeling.
Used the Agile Scrum methodology to build the different phases of the software development life cycle.
Environment: JIRA, SAS, Jupyter Notebook, Python, Oracle, Hadoop, MongoDB, Teradata, Spark, MySQL
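The following is a minimal scikit-learn sketch of the cutoff-point analysis described above, using synthetic data; the cost figures and threshold grid are illustrative assumptions rather than values from the actual project.

    # Illustrative threshold-selection sketch (synthetic data; costs are placeholders).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=10, weights=[0.95, 0.05],
                               random_state=7)                  # y = 1 marks fraud
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]                  # fraud probability

    # Sweep candidate cutoffs and keep the one with the lowest expected cost,
    # assuming a missed fraud costs far more than a wrongly declined transaction.
    COST_MISSED_FRAUD, COST_FALSE_DECLINE = 100.0, 5.0
    best_cutoff, best_cost = None, float("inf")
    for cutoff in np.linspace(0.05, 0.95, 19):
        declined = scores >= cutoff
        cost = (COST_MISSED_FRAUD * np.sum(~declined & (y_test == 1)) +
                COST_FALSE_DECLINE * np.sum(declined & (y_test == 0)))
        if cost < best_cost:
            best_cutoff, best_cost = cutoff, cost
    print(f"Chosen cutoff: {best_cutoff:.2f} (expected cost {best_cost:.0f})")

In practice the cost figures would come from the business (average fraud loss versus the margin lost on a declined good transaction), which makes the cutoff as much a business decision as a modeling one.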
LogicMatter Inc. - Hyderabad, India
Data Analyst, May 2015 - Nov 2017
Responsibilities:
Offered valuable analytical support for the Claims, Ancillary, and Medical Management departments, contributing to data-driven decision-making and process improvements.
Conducted data mapping and logical data modeling, creating class diagrams and Entity-Relationship (ER) diagrams to visualize data relationships effectively.
Employed SAS tools, including PROC FREQ, PROC MEANS, PROC UNIVARIATE, PROC RANK, and macros, to clean and enhance data quality by eliminating duplicates and inaccuracies, ensuring reliable datasets for analysis.
Optimized database access by converting various SQL statements into efficient stored procedures, reducing overhead and enhancing query performance.
Collaborated closely with Quality Control teams to develop comprehensive test plans and test cases, ensuring rigorous testing of systems and validating data accuracy.
Created custom reports in Power BI with KPIs, using complex DAX calculations and Power Query to perform ad-hoc reporting.
Handled and submitted change requests independently, from developing the Understanding Document (UD) and obtaining client sign-off to implementing the change request in SQL.
Created complex SQL queries with materialized views, temporary tables, common table expressions (CTEs), nested queries, and various forms of control structures, including CASE statements (brief sketch below).
Prepared detailed technical documentation and delivered end-user training to ensure seamless adoption of systems and tools.
Environment: MS Excel, MS Word, PowerPoint, Oracle, DB2, UNIX, SAS
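The snippet below sketches the CTE-plus-CASE query pattern referenced above; it runs against an in-memory SQLite database purely for portability, and the table, columns, and threshold are hypothetical.

    # Illustrative CTE sketch (SQLite used only so the example is self-contained).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE claims (claim_id INTEGER, member_id INTEGER, amount REAL);
        INSERT INTO claims VALUES (1, 10, 250.0), (2, 10, 90.0), (3, 11, 40.0);
    """)

    # The CTE aggregates per member, then a CASE statement labels members whose
    # total claim amount exceeds an assumed review threshold.
    query = """
    WITH member_totals AS (
        SELECT member_id, SUM(amount) AS total_amount, COUNT(*) AS claim_count
        FROM claims
        GROUP BY member_id
    )
    SELECT member_id,
           total_amount,
           CASE WHEN total_amount > 200 THEN 'review' ELSE 'ok' END AS status
    FROM member_totals;
    """
    for row in conn.execute(query):
        print(row)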