Pranathi - Sr. Data Engineer |
[email protected] |
Location: Fairborn, Ohio, USA |
Relocation: Yes (flexible to relocate; no dependencies) |
Visa: H1 |
Pranathi Gannamaneni
Sr. Data Engineer
Phone: (+1) 940-448-0365 | Email: [email protected] | LinkedIn: www.linkedin.com/in/pranathig

PROFESSIONAL SUMMARY
- 11 years of IT experience in designing and developing Data Warehousing projects across the Finance, Sales, and Marketing industries.
- Proficient in building data pipelines using Azure Data Factory (ADF V1, V2), transforming data, and monitoring data sets for accuracy and completeness.
- Developed PySpark jobs in AWS Glue to merge data from multiple tables and used Glue Crawlers to populate the Glue Data Catalog with metadata definitions.
- Designed and optimized scalable cloud infrastructure on AWS and Azure to support training, deployment, and monitoring of General AI models.
- Used Python Pandas and NumPy for data manipulation and analysis on large datasets.
- Applied machine learning libraries such as scikit-learn and TensorFlow for predictive modeling.
- Developed and optimized ETL pipelines using PySpark to process large-scale datasets efficiently and perform data transformations (see the sketch after the Technical Skills section).
- Built microservices in Scala for real-time data processing.
- Ensured graph databases could scale to accommodate growing data and increasing query demands.
- Automated cloud infrastructure provisioning with Terraform, ensuring consistent setups for data environments.
- Managed data migration and integration tasks between RDBMS and other data sources, ensuring seamless data flow and consistency.
- Used PL/SQL dynamic SQL to build and execute statements at runtime for flexible query generation and complex operations.
- Created and managed Spark DataFrames in PySpark for structured data processing.
- Wrote Python scripts for data extraction, transformation, and loading tasks.
- Used Databricks to build and deploy scalable data pipelines, processing large volumes of data efficiently with Apache Spark.
- Implemented Java-based data validation and error-handling mechanisms to ensure data integrity.
- Used Airbyte to automate the extraction and integration of data from various sources into data warehouses and lakes.
- Managed resources in AWS and Azure with Terraform, unifying cloud infrastructure management.
- Implemented continuous integration and continuous deployment (CI/CD) practices to automate software delivery pipelines.
- Designed and deployed scalable database solutions using Amazon Aurora, ensuring high availability and performance.
- Integrated Python with big data tools such as PySpark for scalable processing.
- Developed and maintained accurate data models in graph databases to represent entities and their connections.
- Designed and implemented scalable streaming data architectures using Kafka within AWS, ensuring high-throughput data ingestion and processing.
- Built automated ETL pipelines in Python to handle data extraction, transformation, and loading.

TECHNICAL SKILLS
Data Ecosystem: Hadoop, HDFS, MapReduce, Sqoop, Pig, Hive, HBase, Zookeeper, Impala, Apache Spark, Flume, Informatica, YARN, Kafka, NiFi, Cosmos
Distribution: Cloudera, Hortonworks
Databases: MS SQL Server, MySQL, Snowflake, Oracle, PostgreSQL, MongoDB, MS Excel, Cassandra
Application Server: Apache Tomcat, WebLogic
Languages: Scala, Python, PySpark, Java, HTML, CSS, SQL, HiveQL, UNIX Shell Script
Version Control & Tools: Git, GitHub, Erwin, TOAD, AQT, TFS, Hive, Jenkins
IDE and Build Tools: Eclipse, Visual Studio, R-Studio, JIRA, IntelliJ IDEA
Build Automation Tools: Ant, Maven
Operating Systems: Linux, CentOS, Windows
Visualization/Reporting: Power BI, Tableau, SSRS, Crystal Reports
Orchestration Tools: Apache Airflow, AWS Step Functions
Cloud Skills: AWS (S3, EMR, EC2, Kafka, Cassandra, Impala, NiFi, Redshift, Athena, Glue, Kinesis), Azure Data Lake
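The following is a minimal, illustrative PySpark ETL sketch of the kind of pipeline described in the summary above; the bucket paths, column names, and business rules are hypothetical placeholders, not details of any specific project.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales_etl_sketch").getOrCreate()

# Extract: read raw CSV files (path is a placeholder)
raw = spark.read.option("header", True).csv("s3://example-bucket/raw/sales/")

# Transform: deduplicate, enforce types, and derive a partition column
cleaned = (
    raw.dropDuplicates(["order_id"])
       .filter(F.col("amount").isNotNull())
       .withColumn("amount", F.col("amount").cast("double"))
       .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
)

# Load: write partitioned Parquet for downstream analytics
cleaned.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/sales/"
)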
PROFESSIONAL EXPERIENCE

FIS, OH (Jan 2023 - Present)
Sr. Data Engineer
Responsibilities:
- Designed and set up an Enterprise Data Lake on AWS to support storing, processing, analytics, and reporting of large and dynamic datasets using services such as S3, EC2, ECS, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS, and Kinesis.
- Wrote and deployed PySpark jobs for batch and stream processing.
- Chose suitable ETL frameworks such as Apache Spark or AWS Glue based on project needs and data volume.
- Designed and set up data warehouses for centralized data storage and efficient analytics.
- Monitored and improved pipeline performance to ensure timely and accurate data processing.
- Optimized complex SQL queries and stored procedures to improve data retrieval efficiency.
- Employed Python for integrating and managing data workflows across multiple cloud platforms and services.
- Deployed and managed RDBMS appliances for high-performance, scalable relational database solutions, optimizing hardware and software configurations.
- Used PySpark's DataFrame and RDD APIs for data manipulation and analysis.
- Developed ETL pipelines to automate data extraction, transformation, and loading, ensuring data accuracy.
- Automated data preparation and transformation tasks with AWS Glue ETL jobs (see the sketch at the end of this section).
- Designed and managed complex database schemas and optimized SQL queries to enhance performance and ensure data integrity in PostgreSQL.
- Applied data normalization techniques to design efficient and scalable database schemas, reducing data redundancy and improving data integrity.
- Implemented AWS Lambda-based serverless solutions to automate data processing workflows, optimizing performance and scalability for real-time data operations.
- Designed and optimized SQL queries using DML operations to support data transformation and reporting requirements in large-scale data warehousing projects.
- Developed and deployed SCD Type 2 strategies using AWS Glue and Amazon Redshift to manage historical data and accurately reflect changes in dimension attributes.
- Designed and implemented database management systems (DBMS) to handle large volumes of data efficiently, ensuring reliable data storage and retrieval.
- Configured and managed Amazon DynamoDB tables for high-performance, scalable NoSQL database solutions.
- Applied Software Development Life Cycle (SDLC) methodologies to manage project phases, ensuring structured development from requirements gathering to deployment and maintenance.
- Created and maintained indexes on frequently queried columns to speed up data retrieval and improve query performance.
- Implemented custom functions in Scala for data analysis and ETL.
- Managed the Software Development Life Cycle (SDLC) to ensure systematic and efficient development from planning to deployment.
- Integrated Git with CI/CD pipelines for automated testing and deployment.
- Performed data aggregation to summarize and compute metrics from large datasets, enabling insightful reporting and analysis.
- Utilized advanced DBMS features, including indexing, partitioning, and query optimization, to enhance database performance and scalability.
- Set up and managed Amazon Kinesis streams for real-time data ingestion and processing from various sources.
- Implemented real-time data streaming solutions using Apache Kafka and AWS Kinesis to handle high-throughput data ingestion and processing.
- Deployed and managed virtual machines on AWS EC2 to meet application needs and optimize resources.
- Utilized AWS Glue ETL jobs to implement SCD Type 1 techniques for updating existing records and handling new incoming data efficiently in the data warehouse.
- Implemented batch DML operations to handle high-volume data updates and inserts, improving the performance of data processing workflows.
- Developed and optimized SQL queries using Impala for fast and efficient querying of large datasets.
- Collaborated with Development and Operations teams to design and implement CI/CD pipelines, automating the deployment of infrastructure and applications.
- Utilized Snowflake Streams and Tasks to capture and process data changes in real time and automate scheduled data transformations.
- Created PySpark scripts for ETL processes, integrating various data sources.
- Optimized Spark SQL queries for performance, including partitioning, caching, and tuning configurations.
Technologies: AWS Glue, S3, IAM, EC2, RDS, Redshift, ECS, Lambda, Boto3, DynamoDB, Apache Spark, Kinesis, Athena, Hive, Sqoop, Python.
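Illustrative skeleton of an AWS Glue PySpark job of the kind referenced above; the database, table, and output path names are hypothetical, and the transformation is a simplified sketch rather than the production job.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog by a crawler (names are placeholders)
orders_dyf = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="orders"
)

# Switch to the DataFrame API for transformations
orders = orders_dyf.toDF().dropDuplicates(["order_id"])
daily = orders.groupBy("order_date").agg(F.sum("amount").alias("daily_amount"))

# Write curated output to S3 for downstream querying (e.g., Athena)
daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_orders/")

job.commit()

Reading from the catalog rather than hard-coded S3 paths keeps the job aligned with the crawler-maintained schema.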
BNY Mellon, Pittsburgh, PA (Feb 2021 - Dec 2022)
Sr. Azure Data Engineer
Responsibilities:
- Designed and configured Azure Cloud relational servers and databases, optimizing infrastructure based on business requirements.
- Developed CI/CD processes using Azure DevOps to streamline software delivery and deployment.
- Utilized Java libraries and frameworks to streamline data manipulation and transformation processes.
- Implemented data pipelines and workflows in Databricks to automate and streamline analytics processes, ensuring data accuracy and timeliness.
- Collaborated on designing and optimizing data pipelines with Azure Data Factory, enabling seamless integration across different data sources.
- Utilized advanced DML techniques, including conditional updates and multi-table joins, to efficiently manage and synchronize data across disparate systems.
- Utilized Terraform modules to create reusable and standardized infrastructure components for consistency and scalability.
- Used Azure Key Vault to ensure secure storage and transmission of sensitive information.
- Designed and implemented Azure Event Hubs for real-time data ingestion and processing, supporting large-scale event streaming and analytics (see the sketch at the end of this section).
- Utilized GRANT and REVOKE commands to define and adjust user roles and privileges, maintaining proper data security and adhering to organizational policies.
- Designed and built data pipelines in Azure Data Factory to automate data movement and transformation across various sources and destinations.
- Managed Kubernetes resources and namespaces to optimize resource utilization and ensure secure, isolated environments for different application workloads.
- Designed and developed scalable ETL pipelines in Azure Databricks to handle large datasets and complex transformations.
- Conducted regular NIST-based audits and risk assessments to identify vulnerabilities and ensure robust security practices.
- Coordinated with cross-functional teams to define project requirements, design solutions, and implement SDLC best practices for efficient project delivery.
- Utilized Azure SQL Database Managed Instance for mission-critical data workloads, ensuring high availability and disaster recovery.
- Administered DBMS platforms to optimize performance, manage database security, and perform regular backups and recovery operations.
- Gained comprehensive understanding of Azure Database technologies, including both Infrastructure as a Service (IaaS) and Platform as a Service (PaaS), to support seamless data migration and management.
- Implemented advanced PostgreSQL features such as indexing, partitioning, and stored procedures to support high-performance data operations and analytics.
- Managed data concurrency to handle simultaneous data access and modifications, ensuring data consistency and integrity.
- Created custom applications with MS Power Apps to streamline business processes and improve user interactions with data.
- Developed data models incorporating data interchange formats such as JSON, enhancing data integration and interoperability.
- Configured and managed Azure Data Lake Storage (ADLS) Gen2 to handle large-scale data storage and provide efficient access to big data analytics.
- Integrated data from multiple sources into the data warehouse, ensuring data consistency, accuracy, and quality.
- Configured and managed Azure Data Lake Storage (ADLS) to store and organize large volumes of data for scalable analytics and big data processing.
- Implemented Terraform workspaces for isolating and managing different environments, such as development, staging, and production.
- Configured event processing with Azure Stream Analytics to analyze and transform streaming data in real time.
- Created and managed Spark SQL views and tables for efficient data retrieval and reporting.
- Monitored and tuned DML query performance to enhance the efficiency of data manipulation processes and reduce execution time in production environments.
- Used PySpark's DataFrame API and SQL capabilities to perform complex data transformations and analytics on big data.
- Monitored and optimized data pipelines and APIs to ensure reliable performance and minimize downtime.
- Developed and executed DQL scripts for data validation and verification, ensuring data accuracy and consistency across various reporting and analytical platforms.
- Designed and optimized streaming data pipelines with Azure Event Hubs and Stream Analytics to support real-time analytics and data integration.
- Utilized Azure Data Factory for data ingestion and ETL processes, ensuring efficient loading from on-premises and cloud sources.
- Leveraged Trust Center resources to assess and enhance the security and compliance posture of Azure environments.
- Integrated Enterprise GitHub with CI/CD pipelines to streamline version control, code reviews, and collaboration, enhancing deployment efficiency and team coordination.
- Managed infrastructure changes with Terraform, making updates controlled and predictable.
- Integrated Azure Machine Learning models for predictive analytics, enhancing data-driven decision-making.
- Monitored and maintained CI/CD pipelines to ensure smooth operation and troubleshoot issues.
- Implemented CI/CD pipelines integrated with AKS for automated deployment, scaling, and management of containerized applications.
- Developed custom SQL solutions on Azure SQL Server and Azure Synapse, enhancing data processing capabilities.
- Configured data flow activities in Azure Data Factory to clean, aggregate, and prepare data for analysis and reporting.
- Monitored and managed data pipelines in Azure Data Factory, including setting up alerts and troubleshooting issues for reliable data processing.
- Leveraged Apache Spark within Databricks for high-performance data processing and real-time analytics.
- Integrated Terraform with CI/CD pipelines to automate infrastructure deployments and updates in coordination with application code changes.
- Optimized ETL workflows for better performance by tuning settings and handling large datasets efficiently.
- Integrated Azure Databricks with Azure Data Lake and other Azure services for seamless data flow and storage.
Technologies: Azure Data Factory, Azure Data Lake, Azure Synapse Analytics (DW), Azure DevOps, Snowflake, Power BI, SharePoint, Windows 10.
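A small, hypothetical example of publishing events to Azure Event Hubs with the azure-eventhub Python SDK, illustrating the kind of real-time ingestion work described above; the namespace, hub name, and payloads are placeholders.

import json

from azure.eventhub import EventData, EventHubProducerClient

# Connection details are placeholders; in practice, secrets would come from Azure Key Vault
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://example-namespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=<redacted>",
    eventhub_name="example-hub",
)

# Batch a few JSON events and publish them to the hub
with producer:
    batch = producer.create_batch()
    for record in [{"id": 1, "amount": 42.5}, {"id": 2, "amount": 13.0}]:
        batch.add(EventData(json.dumps(record)))
    producer.send_batch(batch)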
Hackett Group, India (Nov 2019 - Feb 2021)
Azure Data Engineer
Responsibilities:
- Built pipelines using Azure Logic Apps to extract and manipulate data from SharePoint, optimizing data workflows for efficiency and reliability.
- Conducted research to identify data sources and requirements for ETL solutions using Azure Databricks, ensuring comprehensive data integration and management.
- Used Apache Spark SQL to query structured data and integrate with data sources such as Hadoop and cloud storage.
- Supported business intelligence by providing clean, aggregated data for reporting in data warehousing.
- Implemented data consistency and partitioning strategies in NoSQL databases to balance performance and reliability.
- Scheduled and monitored pipeline runs in ADF to ensure data was processed and handled on time.
- Integrated Databricks with Azure services to create end-to-end data solutions, enabling seamless data ingestion, transformation, and visualization.
- Integrated ADLS with various data processing and analytics tools, such as Azure Data Factory and Azure Synapse, to streamline data workflows and analytics.
- Deployed and managed containerized applications using Azure Kubernetes Service (AKS) for scalable and efficient orchestration of microservices.
- Automated Azure Data Lake Storage tiering for cost optimization, reducing cloud storage expenses by implementing lifecycle management policies.
- Used Azure DevOps for CI/CD, integrating version control and automated testing to speed up development.
- Applied refactoring strategies to optimize database performance and scalability in Azure, leveraging native features and technologies to meet evolving business requirements.
- Leveraged Azure Synapse Analytics' unified analytics service to streamline big data and data warehousing solutions within Azure cloud environments.
- Designed and optimized PostgreSQL databases in Azure for scalable, high-performance relational data storage, supporting various business applications.
- Leveraged Azure Data Factory's monitoring tools and dashboards to track pipeline performance and ensure efficient data processing.
- Utilized Azure Cloud services for scalable data storage and processing solutions, leveraging Blob Storage for large data sets.
- Used Azure Data Factory's built-in connectors to integrate a wide range of data sources, both within Azure and in external systems.
- Integrated Event Hubs with Azure Functions for serverless processing and automation of incoming event data.
- Created dynamic and parameterized DQL queries to support ad-hoc reporting and data exploration, enabling flexible and responsive data analysis for end users.
- Configured AKS clusters to handle high-traffic loads and ensure application reliability, utilizing Azure Monitor for performance and health insights.
- Implemented data processing and machine learning workflows using Databricks' built-in libraries and features.
- Utilized Apache Spark Streaming for processing real-time data, enabling real-time analytics and dashboard updates (see the sketch at the end of this section).
- Used PostgreSQL in Azure to manage complex queries and data relationships, ensuring reliable and efficient data processing.
Technologies: Cosmos, Scope Studio, U-SQL, C#, Azure Data Factory, Azure Data Lake, Azure Databricks, GitHub, Snowflake, Iris Studio, Cauce, Kensho, SharePoint, Windows 10.
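Illustrative Spark Structured Streaming sketch of the kind of real-time processing mentioned above; the ADLS paths, schema, and window settings are hypothetical placeholders.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("streaming_sketch").getOrCreate()

# Schema of incoming JSON events (fields are placeholders)
schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read a stream of JSON files landing in ADLS Gen2 (path is a placeholder)
events = spark.readStream.schema(schema).json(
    "abfss://[email protected]/landing/"
)

# Windowed aggregation with a watermark for near-real-time dashboards
agg = (
    events.withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "device_id")
          .agg(F.avg("reading").alias("avg_reading"))
)

# Write finalized windows incrementally with checkpointing; call query.awaitTermination() to keep it running
query = (
    agg.writeStream.outputMode("append")
       .format("parquet")
       .option("path", "abfss://[email protected]/aggregates/")
       .option("checkpointLocation", "abfss://[email protected]/checkpoints/aggregates/")
       .start()
)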
HSBC, India (Aug 2017 - Oct 2019)
Big Data Developer
Responsibilities:
- Created Hive tables, loaded data into them, and optimized Hive queries to improve project cost-effectiveness.
- Utilized Java libraries and frameworks (e.g., Apache Kafka, Apache Flume) for real-time data ingestion and stream processing in big data environments.
- Designed, debugged, scheduled, and monitored ETL batch processing jobs using Airflow to load data into Snowflake for analytical processes (see the sketch at the end of this section).
- Monitored and scaled Event Hubs to handle large volumes of event data efficiently.
- Automated and scheduled Sqoop jobs for incremental data transfers, ensuring timely and accurate updates to data repositories.
- Wrote complex queries and operations using Spark DataFrame APIs and SQL syntax to address business challenges effectively.
- Designed ETL pipelines that move data between Snowflake and Databricks to facilitate data transformation and analysis.
- Documented Kafka configurations and processes comprehensively, providing detailed reports on performance and incidents for management review.
- Implemented Spark serialization and compression techniques, such as block-level compression and off-heap storage, to optimize data storage and processing efficiency.
- Conducted proof-of-concepts (POCs) using Apache Spark SQL to explore and validate data processing solutions and strategies.
Technologies: Hadoop, HDFS, Hive, Sqoop, Pig, MapReduce, Snowflake, Hive SQL, MySQL, HBase, Spark SQL, Scala, Linux, Cloudera.
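A minimal, hypothetical sketch of an Airflow DAG for the kind of scheduled Snowflake loads described above; the DAG id, schedule, and load logic are placeholders, and a real job would stage files and run COPY INTO / MERGE statements against Snowflake.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_into_snowflake(**context):
    # Placeholder: the real task would stage files and run COPY INTO / MERGE statements in Snowflake
    print("loading batch for", context["ds"])


with DAG(
    dag_id="daily_snowflake_load",   # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = PythonOperator(
        task_id="load_into_snowflake",
        python_callable=load_into_snowflake,
    )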
IBM, India (July 2013 - Aug 2017)
Big Data Developer
Responsibilities:
- Applied SQL-based transformations based on business logic outlined in mapping sheets, enhancing data integration and alignment with business requirements.
- Modified existing procedures and ETL workflows using Microsoft SQL Server Management Studio to accommodate new business needs, ensuring operational efficiency.
- Wrote DDL and DML scripts to transform data and populate HDFS, ensuring efficient data management and processing.
- Implemented data ingestion pipelines using Java to extract data from various sources such as databases, APIs, and streaming platforms.
- Created technical and design documents for ETL processes, providing comprehensive guidelines and documentation for each module.
- Used Apache Spark with Hadoop for real-time data processing and analytics, improving performance for large datasets.
- Designed and implemented scalable data pipelines and ETL processes with Spark and Hadoop to efficiently handle and analyze substantial volumes of data.
- Developed robust and scalable Java applications to process and analyze large volumes of data within big data ecosystems.
- Implemented Sqoop-based data synchronization solutions to maintain consistency between Hadoop and external databases, enhancing data integrity and reliability.
- Orchestrated seamless data migrations between Oracle and MongoDB, minimizing downtime and preserving referential integrity across platforms.
- Engineered a MongoDB sharding strategy that improved read/write performance by 15% and supported scalable data growth, ensuring robust data management capabilities.
- Integrated Hive tables with other big data technologies such as Hadoop and HBase, facilitating comprehensive data processing and analysis workflows.
Technologies: Hadoop, ETL, HDFS, Hive, Sqoop, Pig, MapReduce, Hive SQL, MySQL, MongoDB, Linux, Python, Eclipse, Cloudera.

Certifications:
- AWS Certified Solutions Architect - Professional: designing, deploying, and operating scalable cloud solutions using Amazon Web Services (AWS) such as EC2, S3, VPC, and RDS, along with an understanding of advanced services like AWS Lambda and Amazon ECS.

Education:
- Bachelor's degree, Lovely Professional University, 2013