Sravani S - Data Engineer (AWS, Azure and GCP)
[email protected]
Location: Overland Park, Kansas, USA
Relocation: yes |
Visa: GC |
PROFESSIONAL SUMMARY:
- Around 10 years of experience as a Data Engineer implementing Big Data/Cloud engineering, data warehouse, data modeling, data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Experienced in data engineering and data pipeline design, development and implementation as a Data Engineer, Data Developer and Data Modeler.
- Hands-on experience in designing and implementing data engineering pipelines and analysing data using the AWS stack, including AWS S3, EMR, RDS, Glue, EC2, Lambda, Athena, Redshift, CloudWatch and AWS CloudTrail.
- Proficient in leveraging APIs to integrate and orchestrate data workflows across diverse cloud platforms, including AWS, Azure, and GCP, ensuring efficient data transformation and preparation.
- Proficient in cloud services and orchestration, including AWS IAM policies, Docker, Kubernetes, ARM templates, and Infrastructure as Code (IaC) tools like Terraform.
- Worked on the Azure Cloud IaaS stack with components such as Delta Lake, Azure Blob Storage, Notebooks, DBFS, Data Factory and Cosmos DB.
- Extensively worked with Azure services including Azure Databricks, Data Factory, Data Lake Storage, Azure Synapse Analytics, NoSQL DB and Azure HDInsight.
- Extensive use of the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Proficient in using Spark SQL, PySpark, and Data Lake in Databricks to develop applications that extract, process, and collect data from various file types, analysing and manipulating the data to reveal patterns of consumer usage.
- Skilled in using Informatica Data Quality (IDQ) for thorough data profiling, cleansing, and enrichment, ensuring high-quality and reliable data throughout the entire data lifecycle.
- Proficient in programming languages such as Python, SQL, Scala and shell scripting.
- Extensive experience in designing and implementing scalable ETL/ELT processes using Python and R, ensuring seamless data integration and transformation.
- Strong expertise in SQL Server tools such as SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
- Expertise in designing and implementing NoSQL database solutions, such as MongoDB and Cassandra, to efficiently handle large volumes of diverse data.
- Strong understanding of PostgreSQL security best practices, implementing access controls and ensuring data confidentiality and integrity.
- Hands-on experience in developing and deploying enterprise applications using major Hadoop ecosystem components such as MapReduce, YARN, Hive, HBase, Flume, Sqoop, MLlib, GraphX, Spark SQL and Kafka.
- Experienced in performing real-time analytics on big data using HBase in Kubernetes and Hadoop clusters.
- Proficient in Apache Spark, with hands-on experience in developing and optimizing Spark applications for large-scale data processing.
- Proficient in performance tuning of Spark jobs and Apache Kafka streams, employing best practices to optimize resource utilization, reduce latency, and enhance overall system throughput.
- Experienced in managing work in an agile environment using Azure Boards and Confluence, ensuring the smooth integration of data engineering tasks into the broader development lifecycle.
- Practical understanding of dimensional and relational data modeling concepts such as star schema modeling, snowflake schema modeling, and fact and dimension tables.
- Proficient in creating custom calculations, parameters, and complex data visualizations to meet specific business requirements in both Tableau and Power BI environments.
- Proficient in leveraging Tableau Desktop and Tableau Server to develop interactive dashboards and reports that empower data-driven decision-making.
- Strong understanding of data modeling concepts, relationships, and optimization within Power BI.
- Proficient in version control and collaboration tools such as Git, Bitbucket, GitHub and Jira, ensuring efficient teamwork and project management.

TECHNICAL SKILLS:
Programming Languages: Python, SQL, Scala, Java, JavaScript, PowerShell, Bash, Linux, Unix
Data Warehousing/Modeling: Data Warehouse, Data Modeling, Data Mart
SQL Server Tools: SSAS, SSRS
AWS Services: AWS S3, EMR, RDS, Glue, EC2, Lambda, Athena, Redshift, CloudWatch, CloudTrail, AWS Kinesis
Azure Cloud Services: Delta Lake, Blob Storage, Notebooks, DBFS, Data Factory, Cosmos DB, Azure Databricks, Synapse Analytics, NoSQL DB, HDInsight
Clouds: AWS, Azure, GCP
BI & Data Visualization: Tableau, Power BI
Data & Analytics Tools: PySpark, Apache Pig, ZooKeeper, PyTorch, Keras, Scikit-learn, TensorFlow, OpenCV
Big Data Ecosystem: Hadoop, MapReduce, YARN, Hive, HBase, Flume, Sqoop, MLlib, GraphX, Spark SQL, Apache Kafka, Impala, Airflow
NoSQL Databases: MongoDB, Cassandra, HBase
Data Quality Tools: Informatica Data Quality (IDQ)
Version Control Tools: Git, Bitbucket, GitHub, Jira
CI/CD Pipelines: Jenkins, Ansible, Docker, Azure DevOps

PROFESSIONAL EXPERIENCE:

Client: WellCare, Tampa, FL.  Jul 2021 - Present
Role: Sr. Data Engineer
Responsibilities:
- Built and managed Azure-based big data and analytics solutions using technologies such as Azure HDInsight, Data Lake Storage and Azure Stream Analytics.
- Implemented data quality checks and validation processes within Azure Data Factory to ensure the accuracy, completeness, and reliability of data throughout the ETL (Extract, Transform and Load) workflow.
- Implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Stream Analytics, SQL DW, Databricks and NoSQL DB).
- Integrated and enforced data governance policies within Informatica workflows, promoting data quality, compliance, and accountability throughout the data lifecycle.
- Implemented proactive performance monitoring and tuning strategies in Informatica PowerCenter, ensuring optimal resource utilization and responsiveness.
- Performed exploratory data analysis (EDA) using Python libraries such as Matplotlib and Seaborn to create visualizations that help stakeholders understand data patterns and trends.
- Used Apache Impala to read, write and query Hadoop data in HDFS and HBase.
- Developed and maintained ETL pipelines using Python and Talend to extract, transform, and load data from various sources into the Hadoop Distributed File System (HDFS).
- Gained proficiency in NoSQL databases such as MongoDB for storing and querying unstructured data, enhancing data storage and retrieval capabilities.
- Managed SQL, PostgreSQL and Cassandra databases, optimizing data storage and retrieval while ensuring data consistency and integrity.
- Designed and developed Scala workflows to pull data from cloud-based systems and apply transformations to it.
- Developed strategies for archiving and offloading data from Kafka topics to long-term storage solutions.
- Used Scala in conjunction with version control systems like Git to maintain codebase integrity, enable collaborative development, and manage changes in data engineering projects.
- Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats (an illustrative sketch follows below).
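A minimal PySpark sketch of the extract-transform-aggregate pattern described in the preceding bullet. The original applications were written in Scala; this Python version, including the file paths, column names, and aggregation, is an illustrative assumption rather than project code.

```python
# Minimal PySpark sketch: extract from multiple file formats, align schemas,
# aggregate, and write a curated output. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi_format_etl").getOrCreate()

# Extract: the same logical feed delivered as CSV, JSON, and Parquet.
csv_df = spark.read.option("header", True).csv("/mnt/landing/claims_csv/")
json_df = spark.read.json("/mnt/landing/claims_json/")
parquet_df = spark.read.parquet("/mnt/landing/claims_parquet/")

# Transform: keep a common column set and union the three sources.
cols = ["member_id", "claim_amount", "service_date"]
claims = (
    csv_df.select(cols)
    .unionByName(json_df.select(cols))
    .unionByName(parquet_df.select(cols))
)

# Aggregate: total and average claim amount per member.
summary = (
    claims.withColumn("claim_amount", F.col("claim_amount").cast("double"))
    .groupBy("member_id")
    .agg(
        F.sum("claim_amount").alias("total_amount"),
        F.avg("claim_amount").alias("avg_amount"),
    )
)

summary.write.mode("overwrite").parquet("/mnt/curated/claim_summary/")
```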
- Established automated deployment processes for Spark and Kafka applications, using tools such as Ansible and Docker to streamline deployment, reduce errors, and enhance overall system reliability.
- Managed dependencies within Spark projects using tools such as Apache Maven and SBT to handle libraries and dependencies, ensuring reproducibility and consistency across environments.
- Conducted regular backups and disaster recovery testing for Kafka clusters.
- Developed and implemented robust data quality checks and validation processes in Snowflake, ensuring the accuracy and reliability of the information stored.
- Maintained detailed data lineage documentation in Snowflake, providing transparency into the flow of data and facilitating compliance and audit requirements.
- Implemented data archiving and purging strategies in Power BI to manage historical data efficiently, optimizing performance and ensuring compliance with data retention policies.
- Implemented dynamic parameterization in Power BI, allowing users to customize reports and dashboards based on specific criteria, enhancing flexibility and user engagement.
- Implemented global filters in Tableau to allow users to dynamically control multiple visualizations simultaneously, enhancing the overall user experience.
- Incorporated Tableau visualizations into PowerPoint presentations, ensuring a smooth transition between data analysis and executive reporting.
- Maintained well-organized repositories on GitHub, employing clear directory structures and naming conventions, enhancing code discoverability and project navigation.
Environment: Azure (HDInsight, Data Lake, Stream Analytics, Data Factory, Databricks, SQL DW, NoSQL DB), Python, Informatica, Matplotlib, Seaborn, Impala, Hadoop, HBase, Talend, NoSQL, MongoDB, SQL, PostgreSQL, Cassandra, Scala, Git, Spark-SQL, Maven, SBT, Ansible, Docker, Kafka, Snowflake, Power BI, Tableau, GitHub.

Client: AMEX, New York, NY.  May 2019 - Jun 2021
Role: Sr. Data Engineer
Responsibilities:
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Configured data loads from AWS S3 to Redshift using AWS Data Pipeline.
- Utilized AWS Glue and Elastic MapReduce (EMR) for Extract, Transform, Load (ETL) processes, optimizing data transformation and preparation.
- Designed and deployed multi-tier applications using AWS services such as S3, Lambda, CloudWatch, RDS, SNS, SQS and IAM, focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Developed comprehensive data validation and testing procedures to verify the accuracy and completeness of data transformations within Informatica PowerCenter workflows using Selenium.
- Used Informatica PowerCenter features to create visual representations of data lineage, allowing easy identification of dependencies and impact analysis in case of changes or issues.
- Designed and implemented robust Extract, Transform, Load (ETL) processes for financial data, streamlining data integration from diverse sources and ensuring accuracy and efficiency in handling large-scale datasets.
- Implemented comprehensive data quality assurance measures to validate financial data integrity, conducting thorough data profiling, cleansing, and validation to minimize errors and enhance the reliability of financial reporting.
- Created interactive dashboards and reports using Python libraries such as Plotly and Dash to provide real-time insights to business stakeholders (an illustrative sketch follows below).
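A minimal Plotly Dash sketch in the spirit of the preceding dashboard bullet. The sample data, layout, and callback are illustrative assumptions, not taken from the client engagement described.

```python
# Minimal Plotly Dash sketch: a one-page dashboard with a dropdown-driven chart.
# The dataset, column names, and layout are illustrative assumptions.
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"] * 2,
    "region": ["East"] * 4 + ["West"] * 4,
    "spend": [120, 135, 150, 160, 90, 95, 110, 125],
})

app = Dash(__name__)
app.layout = html.Div([
    html.H3("Monthly Spend by Region"),
    dcc.Dropdown(options=sorted(df["region"].unique().tolist()), value="East", id="region"),
    dcc.Graph(id="spend-chart"),
])

@app.callback(Output("spend-chart", "figure"), Input("region", "value"))
def update_chart(region):
    # Re-filter the frame and redraw the bar chart for the selected region.
    filtered = df[df["region"] == region]
    return px.bar(filtered, x="month", y="spend", title=f"Spend - {region}")

if __name__ == "__main__":
    app.run(debug=True)  # app.run is available in Dash 2.7+; older releases use run_server
```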
- Maintained clear and concise documentation for Python scripts and the codebase, facilitating collaboration among team members and ensuring code transparency.
- Conducted data cleansing and validation processes using SQL to maintain data integrity and quality across databases.
- Tuned SQL queries and database operations for improved query response times and system efficiency.
- Developed and optimized stored procedures and functions in PostgreSQL for streamlined data processing.
- Worked with NoSQL databases such as MongoDB to store and manage unstructured and semi-structured data.
- Integrated Scala applications with various data storage solutions such as relational databases, NoSQL databases, and distributed file systems for seamless data access and management.
- Monitored Kafka brokers and topics for performance and scalability, taking proactive measures when necessary.
- Wrote real-time processing jobs using Spark Streaming with Kafka as the data pipeline system.
- Implemented CI/CD pipelines for Scala applications, automating testing, building, and deployment to accelerate development cycles.
- Developed Spark scripts using Scala shell commands, ensuring efficient and customized data processing based on specific requirements.
- Managed and maintained the Cloudera Distribution of Hadoop (CDH) cluster, overseeing Hadoop and Spark jobs for big data processing.
- Used Hadoop ecosystem tools such as Apache Ambari and Cloudera Manager for real-time monitoring and performance tuning, ensuring optimal system health and responsiveness.
- Used Snowflake's built-in security features, such as role-based access control and data encryption, to protect sensitive data and ensure compliance.
- Implemented rigorous security measures in Snowflake to safeguard sensitive data, ensuring compliance with industry standards and regulations and managing access controls effectively.
- Designed and developed user-friendly dashboards in Tableau, tailored to the needs of different user groups within the organization.
- Created custom calculations and leveraged scripting languages within Tableau to perform advanced analytics and meet specific business requirements.
- Implemented automated reporting processes in Power BI, reducing manual effort and ensuring timely delivery of accurate and up-to-date insights to stakeholders.
- Developed algorithms within Power BI to detect and highlight data anomalies, supporting data quality assurance and ensuring the reliability of analytical results.
Environment: AWS (Lambda, S3, Glue, EMR, CloudWatch, RDS, SNS, SQS, IAM), Informatica, Java, Python, Plotly, Dash, SQL, PostgreSQL, MongoDB, Scala, Kafka, Spark, CI/CD, Ambari, Hadoop, Snowflake, Tableau, Power BI.

Client: Citrix, Bangalore, India.  Aug 2017 - Jan 2019
Role: Data Engineer
Responsibilities:
- Developed ETL pipelines in the Azure data ecosystem, using Azure Databricks to build workflows that handle data extraction and loading into Azure data stores such as Azure Data Lake Storage and Synapse Analytics (an illustrative sketch follows below).
- Implemented data pipelines in Azure Data Factory, enabling seamless data movement and transformation in the Azure cloud environment.
- Worked with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (SQL DW).
- Designed ETL processes using Informatica to load data from flat files and Oracle sources into a target Oracle data warehouse database.
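To accompany the Azure Databricks bullet above, a minimal PySpark sketch of reading raw files from ADLS Gen2 and writing a curated Delta table. The storage account, container, paths, columns, and table name are hypothetical, and the sketch assumes a Databricks runtime where Delta Lake is available.

```python
# Minimal Databricks / PySpark sketch: read raw CSVs from ADLS Gen2, apply light
# cleansing, and write a partitioned Delta table for downstream analytics.
# Storage account, container, paths, columns, and table name are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls_to_delta").getOrCreate()

raw_path = "abfss://[email protected]/sales/orders/"

orders = (
    spark.read.option("header", True).csv(raw_path)
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .dropDuplicates(["order_id"])
)

# Curated layer: a Delta table partitioned by date, queryable from SQL endpoints.
(
    orders.write.format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .saveAsTable("curated.orders")
)
```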
- Developed and communicated findings and insights through storytelling techniques, using Python's Jupyter Notebooks and R Markdown to create narratives that resonate with diverse audiences.
- Used Python to connect to relational and NoSQL databases such as PostgreSQL for data extraction and analysis.
- Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Created interactive and visually appealing reports and dashboards in Power BI, enabling stakeholders to access critical insights effortlessly.
- Integrated Spark with data warehousing solutions such as Apache Hive and Apache HBase to facilitate seamless data storage, retrieval, and analysis within a data warehouse architecture.
- Implemented data transformations and enrichments within Kafka streams for analytics.
- Integrated Scala applications with monitoring and logging tools to proactively identify issues, analyze performance, and streamline troubleshooting processes.
- Conducted performance tuning using Scala, optimizing query execution and improving overall system efficiency for data-intensive workloads.
- Implemented data quality checks within Spark and Kafka pipelines to ensure the integrity and reliability of processed data, adhering to defined quality standards and business rules.
- Wrote MapReduce programs to analyze and transform data, ensuring effective parallel processing across Hadoop clusters.
- Implemented and optimized data processing workflows on Hadoop and Spark clusters, ensuring efficient data ingestion, storage, and retrieval on the Hortonworks Data Platform (HDP).
- Participated in the evaluation and implementation of Snowflake features and enhancements to improve data warehousing and analytics capabilities.
- Involved in reviewing business requirements and analyzing data sources from Oracle SQL Server for design, development, testing, and production rollover of reporting and analysis projects within Tableau Desktop.
- Maintained thorough documentation of Tableau configurations, data models, and workflows for knowledge sharing and future reference.
- Implemented multi-language support in Power BI reports, catering to diverse user bases and enabling effective communication of insights across international teams.
- Worked with advanced Power BI topics such as complex calculations, table calculations, geographic mapping and performance optimization.
Environment: Azure (Databricks, Data Factory, Blob Storage, Data Lake, Synapse Analytics, SQL Data Warehouse), Informatica, Java, Python, Jupyter Notebooks, R, SQL, NoSQL, HBase, PostgreSQL, Spark, Hive, Kafka, Scala, MapReduce, Hadoop, Snowflake, Tableau, Power BI.

Client: Merck, Mumbai, India.  Apr 2014 - Jul 2017
Role: SQL Developer
Responsibilities:
- Developed streamlined deployment processes for AWS S3 configurations and AWS Lambda functions across multiple environments, ensuring consistency and reproducibility.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as AWS S3 and text files into AWS Redshift.
- Created comprehensive documentation for AWS S3, RDS and AWS EMR configurations and best practices.
- Used Python and R to clean, preprocess, and analyze large datasets, ensuring data quality and accuracy.
- Leveraged Python libraries such as Pandas, NumPy and SciPy for data manipulation and statistical analysis, extracting valuable insights from complex datasets (an illustrative sketch follows below).
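A minimal pandas/NumPy/SciPy sketch matching the preceding bullet. The input file, column names, and the specific statistical test are illustrative assumptions, not project artifacts.

```python
# Minimal pandas / NumPy / SciPy sketch: clean a dataset, summarize it, and run
# a simple statistical comparison. File and column names are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("lab_results.csv")

# Cleaning: drop exact duplicates and fill missing numeric values with the median.
df = df.drop_duplicates()
df["result_value"] = df["result_value"].fillna(df["result_value"].median())

# Manipulation: a derived feature and per-group summary statistics.
df["log_result"] = np.log1p(df["result_value"])
summary = df.groupby("study_group")["result_value"].agg(["count", "mean", "std"])
print(summary)

# Statistical analysis: Welch's t-test between two groups.
control = df.loc[df["study_group"] == "control", "result_value"]
treatment = df.loc[df["study_group"] == "treatment", "result_value"]
t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```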
- Created custom T-SQL procedures to read data from flat files and load it into a SQL Server database using the SQL Server Import and Export Data wizard.
- Designed and deployed data table structures, reports, and queries in SQL Server.
- Received and imported data from various formats into SQL Server.
- Created and maintained NoSQL data models, defining schemas, indexing strategies, and data access patterns to support efficient data retrieval and storage.
- Automated routine database tasks, including backups, updates, and monitoring, to ensure the reliability and availability of PostgreSQL databases.
- Used PostgreSQL's advanced features, such as window functions and common table expressions, for complex data analysis.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
- Developed and maintained real-time data processing pipelines using Hadoop ecosystem tools, ensuring the timely availability of insights from streaming data sources.
- Used Hive and Pig to write and optimize queries for data analysis, enhancing the performance of data retrieval from Hadoop.
- Created scorecards and dashboards using stacked bars, bar graphs, scatter plots, geographical maps and Gantt charts in Tableau for different requirements.
- Used visualization tools such as Power View for Excel and Tableau for visualizing data and generating reports.
- Implemented security measures in the Power BI Service through authentication and authorization methods.
- Designed and deployed Power BI apps, streamlining the distribution of reports and dashboards across different departments within the organization.
Environment: AWS (S3, Lambda, Glue, Redshift, RDS, EMR), ETL, Python, R, Pandas, NumPy, SciPy, SQL, Hive, NoSQL, PostgreSQL, Scala, Spark, Spark-SQL, Hadoop, Pig, Tableau, Power BI.

EDUCATION:
Jawaharlal Nehru Technological University, Hyderabad, TS, India
BTech in Computer Science and Engineering, June 2011 - March 2015
Major in Computer Science