Vinay Rathnala
RE: Data Engineer/Modeler
Email: [email protected]
Location: Tampa, Florida, USA
Relocation:
Visa: H1B
Data Engineer

Summary:
- More than 9 years of professional experience developing data systems and business systems, with a primary focus on data engineering and data analysis.
- Experience using Ansible for cloud provisioning, configuration management, and application deployment in DevOps and IT environments.
- Proficient in all stages of the Software Development Life Cycle (SDLC), actively participating in daily scrum meetings with cross-functional teams.
- Demonstrated excellence and hands-on experience in AWS, particularly with AWS S3 and EC2.
- Experience in project deployment using Heroku/Jenkins and AWS services including EC2, S3, Auto Scaling, CloudWatch, and SNS.
- Wrote Golang code to pull data from Kinesis and load it into Prometheus, which Grafana uses for reporting and visualization (a hedged Python sketch of this pattern follows this summary).
- Expertise in constructing enterprise data warehouses and data warehouse appliances from scratch, employing both Kimball and Inmon approaches.
- Skilled in data import/export using stream processing platforms such as Flume and Kafka.
- Experience gathering business requirements, generating workflows for vehicle inspections, and writing specifications documenting business processes for driver, vehicle, business base, and Medallia licensing.
- Experience with version control systems including CVS, Git, and GitHub, as well as Amazon EC2 and deployment using Heroku.
- Utilized SSIS and SSRS for report creation and management within organizational settings.
- Extensive experience crafting Storm topologies for processing events from Kafka producers and emitting them into Cassandra DB.
- Hosted applications on GCP using Compute Engine, App Engine, Cloud SQL, Kubernetes Engine, and Cloud Storage.
- Developed and deployed various Lambda functions in AWS, incorporating both AWS Lambda libraries and custom Scala libraries.
- Implemented CI/CD pipelines with Jenkins for deploying microservices in AWS ECS, Python jobs in AWS Lambda, and containerized deployments of Java and Python.
- Skilled in leveraging Medallia's platform to analyze customer feedback and sentiment, driving actionable insights for business improvement.
- Proficient with Terraform for Infrastructure as Code, execution plans, change automation, and extensive use of auto-scaling.
- Hands-on experience with Kubernetes for multi-cloud cluster management in AWS.
- Extensive experience leveraging Google Cloud Platform (GCP) for big data applications, adept with tools such as BigQuery, Pub/Sub, Dataproc, and Dataflow.
- Proficient in data modeling, utilizing tools like Erwin, PowerDesigner, and ER/Studio to ensure effective representation of data structures.
- Robust background in big data tools, including Hadoop, HDFS, and Hive, with the ability to navigate and harness their capabilities.
- Demonstrated expertise in end-to-end data management, encompassing data migration, data cleansing, transformation, integration, data import, and data export.
- Technical and analytical skills for both OLTP design and OLAP dimensional modeling.
- Designed solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using the Hadoop ecosystem (Hive, Pig, Sqoop, Kafka), Python, Spark, Scala, NoSQL, NiFi, and Druid.
- Proficient in utilizing Prometheus for monitoring and alerting in complex, high-volume environments.
- In-depth knowledge of the Snowflake database, covering schema and table structures, with practical application using SnowSQL, Snowpipe, and Python/Java for big data techniques.
- Proven track record migrating data warehouses and databases onto Hadoop/NoSQL platforms.
- Proficient in the design and development of Oracle PL/SQL and shell scripts, specializing in data conversions and data cleansing.
- Hands-on experience with query optimization and ETL loading on Teradata.
- Strong expertise in data modeling, data warehousing, and database management, ensuring optimal performance and reliability of Medallia data.
- Used Azure DevOps CI/CD pipelines to build and release ETL and database objects.
- Comprehensive understanding of Apache Spark job execution components, coupled with practical experience working with NoSQL databases such as HBase and MongoDB.
- Proficient in creating insightful dashboards using visualization tools like Tableau and Power BI to derive valuable business insights.
- Utilized the Spark DataFrames API on the Cloudera platform for analytics on Hive data.
- Configured Elasticsearch for log collection, and Prometheus and CloudWatch for metric collection.
- Strong Python scripting skills, encompassing statistical functions with NumPy and visualization using Matplotlib and Pandas.
- Practical experience with Sqoop for importing/exporting data between RDBMS and HDFS/Hive.
- Knowledgeable about job workflow scheduling and coordination tools/services such as Oozie, Zookeeper, Airflow, and Apache NiFi.
- Solid understanding of AWS concepts, including the EMR and EC2 web services.
- Evaluated reporting requirements for diverse business units, with expertise in data preparation, modeling, and visualization using Power BI.
- Proficient in developing Analysis Services using DAX queries.
- Utilized Kubernetes and Docker to establish a streamlined CI/CD runtime environment.
- Proven ability to collaborate with cross-functional teams to understand business requirements and deliver data solutions aligned with organizational goals.
- Migrated Hive and MapReduce jobs to EMR and Qubole while automating workflows with Airflow.
- Proficient in UNIX file system operations through the command line, including server-to-server key encryption and accessibility in Unix environments.
- Delivered complex proofs of concept based on business requirements, including formulating and executing unit test cases.
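For illustration, the Kinesis-to-Prometheus-to-Grafana item above could be structured roughly as follows. The original work was done in Golang, so this is only a hedged Python sketch of the same pattern; the stream name, metric name, and port are hypothetical placeholders.

    # Hedged Python analogue of the Kinesis -> Prometheus flow described above.
    # The stream "events-stream", metric name, and port 8000 are illustrative assumptions.
    import time

    import boto3
    from prometheus_client import Counter, start_http_server

    RECORDS_READ = Counter("kinesis_records_total", "Records pulled from the Kinesis stream")

    def main(stream_name="events-stream"):
        kinesis = boto3.client("kinesis")
        # Read only the first shard, for brevity.
        shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name, ShardId=shard_id, ShardIteratorType="LATEST"
        )["ShardIterator"]

        start_http_server(8000)  # Prometheus scrapes :8000; Grafana reads from Prometheus.
        while True:
            response = kinesis.get_records(ShardIterator=iterator, Limit=100)
            RECORDS_READ.inc(len(response["Records"]))
            iterator = response["NextShardIterator"]
            time.sleep(1)

    if __name__ == "__main__":
        main()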
Technical Skills:
- Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, Zookeeper, Cloudera Manager, Kafka, Flume
- ETL Tools: Informatica
- NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
- Monitoring and Reporting: Tableau, custom shell scripts
- Hadoop Distributions: Hortonworks, Cloudera
- Build Tools: Maven
- Programming & Scripting: Python, Scala, Java, SQL, shell scripting, C, C++
- Databases: Oracle, MySQL, Prometheus, Teradata
- Machine Learning & Analytics Tools: supervised learning (linear regression, logistic regression, decision trees, random forests, SVM, classification), unsupervised learning (clustering, KNN, factor analysis, PCA), natural language processing, Google Analytics, Fiddler, Tableau
- Version Control: Git, GitHub, SVN, CVS
- Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10/8/7
- Cloud Platforms: Google Cloud Platform (GCP), Heroku, Amazon Web Services (AWS), Microsoft Azure
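As a brief illustration of the Kafka and Python skills listed above, the following is a minimal, hedged sketch of a Python consumer using the kafka-python package; the broker address and topic name are placeholders rather than details from any engagement below.

    # Minimal sketch of a Python Kafka consumer; broker and topic are hypothetical.
    import json

    from kafka import KafkaConsumer  # kafka-python package

    consumer = KafkaConsumer(
        "web-logs",
        bootstrap_servers=["localhost:9092"],
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        # Each value is a deserialized JSON event; downstream code could write
        # these to HDFS, S3, or a warehouse staging table.
        print(message.topic, message.partition, message.offset, message.value)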
Professional Experience:

CVS Health | Remote
Sr. Data Engineer | Dec 2022 - Present
Responsibilities:
- Developed Python-based Spark applications and executed an Apache Spark data processing project to manage data from various RDBMS and streaming sources.
- Played a key role in constructing scalable distributed data solutions using Hadoop.
- Designed, architected, and supported the Hadoop cluster: Hadoop, MapReduce, Hive, Sqoop, Ranger, Presto as a high-performance SQL query engine, and Druid for indexing.
- Created data pipelines using Airflow in Google Cloud Platform (GCP) for ETL tasks, leveraging different Airflow operators.
- Utilized Apache Airflow in the GCP Composer environment to construct data pipelines, employing Bash, Hadoop, Python callable, and branching operators.
- Deployed the project to Heroku using the Git version control system.
- Constructed NiFi dataflows to ingest data from Kafka, perform transformations, store results in HDFS, and expose ports to execute Spark Streaming jobs.
- Maintained the Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.
- Collaborated with Spark to enhance performance and optimize existing algorithms in Hadoop.
- Configured GCP services (Dataproc, Storage, BigQuery) using the Cloud Shell SDK.
- Knowledge of continuous deployment using Heroku and Jenkins, with experience in cloud innovations including Infrastructure as a Service.
- Utilized GCP Cloud Functions, event-based triggering, Cloud Monitoring, and alerting.
- Optimized the performance of the Medallia data infrastructure by fine-tuning ETL processes and database configurations, improving data processing efficiency.
- Employed a Google Cloud Function in Python to load newly arrived CSV files from GCS buckets into BigQuery (see the hedged sketch following this section).
- Worked extensively with Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming.
- Used Spark Streaming APIs for real-time transformations and actions to build common functionality.
- Developed a Python Kafka consumer API for consuming data from Kafka topics.
- Processed XML messages using Kafka and Spark Streaming to capture UI updates.
- Implemented a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded D-Stream data into Spark RDDs and performed in-memory data computations to generate output responses.
- Designed GCP Cloud Composer Directed Acyclic Graphs (DAGs) to load data from on-premises CSV files into GCP BigQuery tables, with scheduled DAGs for incremental loading.
- Wrote Golang code to pull data from Kinesis and load it into Prometheus, which Grafana uses for reporting and visualization.
- Configured Snowpipe to extract data from Google Cloud buckets into Snowflake tables.
- Demonstrated a solid understanding of Cassandra architecture, replication strategies, gossip, snitches, etc.
- Utilized HiveQL to analyze partitioned and bucketed data, executing Hive queries on Parquet tables.
- Integrated Druid with Hive for high availability and to provide data for SLA reporting on real-time data.
- Worked on the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of sources using Spark, SQL, HDFS, Hive, MapReduce, Druid, Python, Unix, Hue, and shell scripting.
- Served as an active member of both the product development and operations teams, promoting DevOps best practices and supporting applications with feasible approaches.
- Employed Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering purposes.
- Contributed to the implementation of Kafka security measures and enhanced its performance.
- Developed Oozie coordinators to schedule Hive scripts for creating data pipelines.
- Mentored junior team members on best practices for data engineering and Prometheus utilization.
- Conducted cluster testing of HDFS, Hive, Pig, and MapReduce for new user access to the cluster.
Environment: Spark, Spark Streaming, Spark SQL, GCP, Dataproc, Heroku, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Druid, Oracle, Git, Oozie, Tableau, SOAP, Cassandra, and Agile methodologies.
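The GCS-to-BigQuery loading item above could be implemented along these lines. This is a hedged sketch of a background Cloud Function using the google-cloud-bigquery client; the target table name is an assumed placeholder.

    # Hedged sketch of a GCS-triggered Cloud Function that loads a newly arrived CSV
    # into BigQuery; the table "analytics.daily_feed" is an illustrative assumption.
    from google.cloud import bigquery

    def load_csv_to_bq(event, context):
        """Background function triggered by a google.storage.object.finalize event."""
        uri = f"gs://{event['bucket']}/{event['name']}"
        client = bigquery.Client()
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = client.load_table_from_uri(uri, "analytics.daily_feed", job_config=job_config)
        load_job.result()  # Wait for the load job to finish.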
WALT DISNEY
Data Modeler/Engineer | Jun 2022 - Nov 2022
Responsibilities:
- Established a continuous delivery pipeline using Docker and GitHub.
- Developed and deployed solutions with Spark and Scala code on a Hadoop cluster running on Google Cloud Platform (GCP).
- Proficient with Google Cloud components, Google Container Builder, GCP client libraries, and Cloud SDKs.
- Utilized Google Cloud Functions with Python to load data into BigQuery for incoming CSV files in GCS buckets.
- Processed and loaded both bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (see the hedged pipeline sketch following this section).
- Applied Spark and Scala APIs hands-on to compare the performance of Spark with Hive and SQL, and employed Spark SQL to manipulate DataFrames in Scala.
- Stored data efficiently in the GCP BigQuery target data warehouse, catering to different business teams based on their specific use cases.
- Devised simple and complex SQL scripts to verify dataflow in various applications.
- Conducted data analysis, migration, cleansing, transformation, integration, import, and export using Python.
- Launched a multi-node Kubernetes cluster in Google Kubernetes Engine (GKE) and migrated a Dockerized application from AWS to GCP.
- Deployed applications to GCP using Spinnaker (RPM-based).
- Collaborated with data scientists and analysts to design and implement data models for analyzing customer sentiment and feedback trends using Medallia's platform.
- Developed proof-of-concept (POC) pipelines to compare performance and efficiency between AWS EMR Spark clusters and Cloud Dataflow on GCP.
- Architected multiple Directed Acyclic Graphs (DAGs) to automate ETL pipelines.
- Automated feature engineering mechanisms using Python scripts and deployed them on Google Cloud Platform (GCP) and BigQuery.
- Implemented monitoring solutions with Ansible, Terraform, Docker, and Jenkins; automated Datadog dashboards through Terraform scripts.
- Development done using Rails 3.0, RubyMine, Heroku, Sublime Text 2, Spork, RSpec, Ruby gems, RVM, and Git.
- Hands-on experience architecting ETL transformation layers and writing Spark jobs for processing.
- Gathered and processed raw data at scale using scripting, web scraping, API calls, SQL queries, and application development.
- Proficient in fact-dimensional modeling (star schema, snowflake schema), transactional modeling, and Slowly Changing Dimensions (SCD).
- Involved in building ETL on Kubernetes with Apache Airflow and Spark in GCP.
- Extensive hands-on experience with GCP: BigQuery, GCS buckets, Google Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.
- Worked with Confluence and Jira; skilled in data visualization using the Matplotlib and Seaborn libraries.
- Hands-on experience with big data tools such as Hadoop, Spark, and Hive.
- Implemented machine learning back-end pipelines with Pandas and NumPy.
Environment: GCP, BigQuery, GCS buckets, Google Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Heroku, Python, Pandas, Matplotlib, Seaborn, text mining, NumPy, scikit-learn, heat maps, bar charts, line charts, ETL workflows, linear regression, multivariate regression, Scala, Spark.
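The Pub/Sub-to-BigQuery item above maps to a Dataflow pipeline of roughly this shape. This is a hedged Apache Beam (Python) sketch; the topic, table, schema, and runner options are placeholders, not project details.

    # Illustrative Apache Beam pipeline reading from Pub/Sub and writing to BigQuery.
    # Topic, table, and schema are assumed placeholders.
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def run():
        # Dataflow runner/project/region/temp_location options omitted for brevity.
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as pipeline:
            (
                pipeline
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-gcp-project/topics/events")
                | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    "my-gcp-project:analytics.events",
                    schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )

    if __name__ == "__main__":
        run()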
PROGRESSIVE
Data Engineer | Apr 2021 - Apr 2022
Responsibilities:
- Extensive hands-on experience with the AWS cloud platform, including EC2, S3, EMR, Redshift, Lambda, and Glue.
- Proficient in Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, Spark Streaming, SQL, and MongoDB.
- Developed and deployed data pipelines in cloud environments, particularly on AWS.
- Maintained DevOps pipelines for integrating Salesforce code and managed continuous and manual deployments (ANT and SFDX) to lower and higher environments.
- Strong understanding of AWS components, with a focus on EC2 and S3.
- Implemented Spark applications using Python and R, executing Apache Spark data processing projects to manage data from various RDBMS and streaming sources.
- Utilized Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and implementing POCs with Scala, Spark SQL, and MLlib libraries.
- Specialized in data integration, employing traditional ETL tools and methodologies to ingest, transform, and integrate structured data into a scalable data warehouse platform.
- Designed and deployed multi-tier applications on AWS, leveraging services such as EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM, with a focus on high availability, fault tolerance, and auto-scaling using AWS CloudFormation.
- Expertise in Python and Scala, developing user-defined functions (UDFs) for Hive and Pig using Python.
- Extracted data from SQL Server, Teradata, Amazon S3 buckets, and internal SFTP, loading it into the data warehouse AWS S3 bucket.
- Developed PySpark POCs and deployed them on the YARN cluster, comparing the performance of Spark with Hive and SQL/Teradata.
- Created Spark jobs to process data, including instance and cluster creation, and loaded the data into AWS S3 buckets to create data marts (see the hedged sketch following this section).
- Utilized AWS EMR to process and transform data, assisting the data science team based on business requirements.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources (S3; ORC, Parquet, and text files) into AWS Redshift.
- Worked on both batch processing and real-time data processing with Spark Streaming using the Lambda architecture.
- Developed Spark applications to clean and validate data ingested into the AWS cloud.
- Developed simple to complex MapReduce jobs in Java for processing and validating data.
- Contributed to the continuous improvement of Medallia's data architecture and infrastructure, staying abreast of new technologies and methodologies in data engineering.
- Processed and loaded both bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Hands-on experience in GCP, including BigQuery, GCS buckets, Google Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
- Developed Python code for workflow management and automation using Airflow.
- Developed scripts to load data into Hive from HDFS and ingested data into the data warehouse using various data loading techniques.
- Utilized Spark Streaming APIs for real-time transformations and actions.
- Used Git, GitHub, and Amazon EC2 with deployment via Heroku; performed analysis and various mathematical operations on extracted data using the Python libraries NumPy and SciPy.
- Developed pre-processing jobs using Spark DataFrames to flatten JSON documents into flat files.
- Loaded D-Stream data into Spark RDDs and performed in-memory data computations to generate output responses.
- Used Kubernetes as the runtime environment of the CI/CD system for building, testing, and deployment.
- Collaborated with the DevOps team to implement a NiFi pipeline on EC2 nodes integrated with Spark, Kafka, and Postgres running on other instances, using SSL handshakes in QA and production environments.
- Currently implementing PowerApps DevOps using Microsoft Azure.
- Built Informatica mappings, sessions, and workflows, managing code changes through version control in Informatica.
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, Scala, MapReduce, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, BigQuery, SOLR, Jenkins, Eclipse, Dataflow, Oracle, Git, Oozie, Tableau, SOAP, Cassandra, and Agile methodologies.
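The cleanse-and-load item above (Spark jobs writing curated data marts to S3) could look roughly like this hedged PySpark sketch; the bucket paths, column names, and validation rules are illustrative assumptions.

    # Hedged PySpark sketch: read raw files from S3, drop invalid rows, and write a
    # partitioned Parquet data mart. All paths and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("campaign-cleanse").getOrCreate()

    raw = spark.read.parquet("s3://raw-bucket/campaigns/")

    cleaned = (
        raw.dropDuplicates(["campaign_id"])
           .filter(F.col("campaign_id").isNotNull())
           .withColumn("load_date", F.current_date())
    )

    (
        cleaned.write.mode("overwrite")
               .partitionBy("load_date")
               .parquet("s3://mart-bucket/campaign_mart/")
    )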
INFOSYS
Data Engineer | Feb 2019 - Dec 2020
Responsibilities:
- Operated within an Agile environment, using the Rally tool to manage user stories and tasks.
- Implemented ad-hoc analysis solutions using Data Lake Analytics/Store and HDInsight.
- Implemented Apache Sentry to control access to Hive tables at the group level.
- Expertise in MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Utilized the Tidal enterprise scheduler and Oozie operational services to coordinate the cluster and schedule workflows.
- Integrated Azure services such as Azure Data Factory, Azure Data Lake, Azure Data Warehouse, Azure Active Directory, Azure SQL Database, and Web Apps with AWS services to leverage the benefits of both cloud platforms and process unstructured data.
- Designed and implemented Kafka clusters by configuring topics in all environments.
- Developed multiple Tableau dashboards catering to various business needs.
- Collaborated with data scientists and analysts to design and implement data models for analyzing customer sentiment and feedback trends using Medallia's platform.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access (see the hedged sketch following this section).
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services, including Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, and NoSQL DB.
- Utilized the Avro format for all data ingestion for faster operation and lower space utilization.
- Designed SSIS packages for ETL operations, extracting, transferring, and loading existing data into SQL Server from different environments for SSAS cubes (OLAP).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed visualizations and dashboards using Power BI.
- Implemented Composite server for data virtualization needs and created multiple views for restricted data access using a REST API.
- Exported analyzed data to relational databases using Sqoop for visualization and report generation for the BI team using Tableau.
- Developed Apache Spark applications for processing data from various streaming sources.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, with DataFrames created and handled in Spark with Scala.
- Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
- Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
- Brought data from various sources into Hadoop and Cassandra using Kafka.
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF v1/v2).
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Environment: MapR, MapReduce, HDFS, Hive, Pig, Impala, Kafka, Cassandra, Spark, Scala, Azure (SQL, Databricks, Data Lake, Data Storage, HDInsight), Java, SQL, Tableau, Zookeeper, Sqoop, Teradata, Power BI.
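The Hive partitioning item above can be illustrated with the following hedged PySpark sketch, which enables dynamic partitioning and appends to a partitioned Hive table; the database, table, and column names are assumed placeholders.

    # Conceptual PySpark sketch of dynamic partitioning into a Hive table.
    # Table and column names are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive-dynamic-partitions")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    events = spark.table("staging.customer_events")

    # Append into a partitioned Hive table so queries can prune by event_date.
    (
        events.write.mode("append")
              .format("hive")
              .partitionBy("event_date")
              .saveAsTable("analytics.customer_events_partitioned")
    )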
Q LABS, INDIA
Jr. Data Engineer | Dec 2015 - Jan 2019
Responsibilities:
- Accountable for the development, support, and maintenance of ETL (Extract, Transform, Load) processes using Informatica PowerCenter.
- Designed and implemented numerous ETL scripts using Informatica and UNIX shell scripts.
- Analyzed source data from Oracle, flat files, and MS Excel, collaborating with the data warehouse team to develop a dimensional model.
- Established FTP, ODBC, and relational connections for sources and targets.
- Implemented the Slowly Changing Dimension Type 2 methodology to retain the complete history of account and transaction information (see the conceptual sketch following this section).
- Proficient in crafting complex SQL queries, unions, and multiple-table joins, with experience using views.
- Demonstrated expertise in database programming with PL/SQL, encompassing stored procedures, triggers, and packages.
- Scheduled sessions and batches on the Informatica server using Informatica Server Manager.
- Executed and validated test cases for data transformations within Informatica.
- Created JIL scripts and scheduled workflows using CA AutoSys.
- Utilized SQL scripts and queries for thorough data verification at the backend.
- Executed SQL queries and stored procedures and performed data validation as part of backend testing.
- Utilized SQL to test various reports and ETL job loads in development, testing, and production.
- Developed UNIX shell scripts to orchestrate the process flow for Informatica workflows handling high-volume data.
- Prepared test cases based on the Functional Requirements Document.
Environment: Informatica PowerCenter 9.x, Oracle 11g, SQL*Plus, PL/SQL, Oracle SQL Developer, UNIX.
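The SCD Type 2 item above was implemented in Informatica PowerCenter; purely as a conceptual illustration of that logic, the following is a hedged pandas sketch of expiring changed dimension rows and inserting new current versions, with all column names assumed.

    # Conceptual pandas sketch of SCD Type 2: expire changed rows, insert new current rows.
    # Columns (account_id, balance, eff_start, eff_end, is_current) are assumptions.
    import pandas as pd

    def apply_scd2(dim: pd.DataFrame, incoming: pd.DataFrame, today: str) -> pd.DataFrame:
        current = dim[dim["is_current"]]
        merged = incoming.merge(current, on="account_id", how="left", suffixes=("", "_old"))

        # Rows whose tracked attribute changed (or which are brand new).
        changed = merged[merged["balance"] != merged["balance_old"]]

        # Expire the old versions of changed accounts (mutates dim in place).
        expire_mask = dim["account_id"].isin(changed["account_id"]) & dim["is_current"]
        dim.loc[expire_mask, ["eff_end", "is_current"]] = [today, False]

        # Insert new current rows for changed or new accounts.
        new_rows = changed[["account_id", "balance"]].assign(
            eff_start=today, eff_end=None, is_current=True
        )
        return pd.concat([dim, new_rows], ignore_index=True)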