Harshitha K
Location: Alba, Missouri, USA | Relocation: Yes | Visa: H1B
Sr. Data Engineer
[email protected] | 346-336-3990

PROFESSIONAL SUMMARY:
- Around 9 years of professional experience in information technology as a Data Engineer, with expertise in database development, ETL development, data modeling, report development, and big data technologies.
- Experience in data integration and data warehousing using ETL tools including Informatica PowerCenter, AWS Glue, SQL Server Integration Services (SSIS), and Talend.
- Experience designing business intelligence solutions with Microsoft SQL Server using SSIS, SQL Server Reporting Services (SSRS), and SQL Server Analysis Services (SSAS).
- Extensively used Informatica PowerCenter and Informatica Data Quality (IDQ) as ETL tools for extracting, transforming, loading, and cleansing data from various sources to various targets, in batch and in real time.
- Experience working with Amazon Web Services (AWS) and its services, including Snowflake, EC2, S3, RDS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, Auto Scaling, CloudFront, CloudWatch, Data Pipeline, DMS, Aurora, and other AWS services.
- Hands-on experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON data into Snowflake tables (see the sketch after this summary).
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Extensive experience integrating Informatica Data Quality (IDQ) with Informatica PowerCenter.
- Extensive experience in data mining solutions for various business problems and in generating data visualizations using Tableau, Power BI, and Alteryx.
- Strong knowledge of and experience with the Cloudera ecosystem (HDFS, Hive, Sqoop, HBase, Kafka), including data pipelines, data analysis, and processing with Hive SQL, Impala, Spark, and Spark SQL.
- Worked with scheduling tools such as Talend Administration Center (TAC), UC4/Automic, Tidal, Control-M, Autosys, cron, and TWS (Tivoli Workload Scheduler).
- Experienced in design, development, unit testing, integration, debugging, implementation, and production support, as well as client interaction and understanding of business applications, data flows, and data relationships.
- Used Flume, Kafka, and Spark Streaming to ingest real-time or near-real-time data into HDFS.
- Analyzed data and provided insights with Python and Pandas.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Worked on data migration from Teradata to Snowflake on AWS using Python and BI tools such as Alteryx.
- Experience moving data between GCP and Azure using Azure Data Factory.
- Developed Python scripts to parse flat files, CSV, XML, and JSON files, extract data from various sources, and load the data into the data warehouse.
- Developed automated migration scripts using Unix shell scripting, Python, Oracle/Teradata SQL, and Teradata macros and procedures.
- Good knowledge of NoSQL databases such as HBase and Cassandra.
- Expert in designing and developing complex mappings to extract data from diverse sources, including flat files, RDBMS tables, legacy system files, XML files, applications, COBOL sources, and Teradata.
- Worked with JIRA for defect and issue logging and tracking, and documented work in Confluence.
- Experience with ETL workflow management tools such as Apache Airflow, with significant experience writing Python scripts to implement workflows.
- Experience identifying bottlenecks in ETL processes and performance-tuning production applications using database tuning, partitioning, index usage, aggregate tables, session partitioning, load strategies, commit intervals, and transformation tuning.
- Worked on performance tuning of user queries by analyzing explain plans, recreating user driver tables with the right primary index, scheduling statistics collection, and applying secondary and join indexes.
- Experience with scripting languages such as PowerShell, Perl, and shell.
- Expert knowledge of and experience in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Created clusters in Google Cloud and managed them using Kubernetes (k8s).
- Used Jenkins to deploy code to Google Cloud, create new namespaces, and build Docker images and push them to the Google Cloud container registry.
- Excellent interpersonal and communication skills; experienced in working with senior-level managers, business stakeholders, and developers across multiple disciplines.
- Strong problem-solving and analytical skills, with the ability to work both independently and as part of a team.
- Highly enthusiastic and self-motivated; rapidly assimilates new concepts and technologies.
- Proficient in leveraging cloud platforms such as AWS, Azure, and GCP (BigQuery, Dataflow) to architect scalable and efficient data pipelines and analytics solutions.
- Expertise in big data and distributed computing frameworks, including Hadoop, Spark, and Kafka, and in data integration tools such as Informatica and Apache NiFi.
- Skilled in database management, with a strong focus on data modeling techniques and platforms such as Erwin and Snowflake.
- Adept at containerization (Docker, Kubernetes) and CI/CD automation (Jenkins, Ansible); proficient in data visualization tools such as Tableau, Power BI, and Looker for actionable business insights.
- Experienced in machine learning frameworks (TensorFlow, Scikit-Learn) and in ensuring data security and compliance through IAM roles, encryption techniques, and JWT authentication.
- Proven track record in Agile project management (Scrum, Kanban), collaborating effectively with cross-functional teams using tools such as JIRA and Confluence.
- Known for problem-solving acumen, analytical thinking, and strong communication skills, delivering high-quality solutions with attention to detail and adherence to timelines.
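Illustrative sketch of the nested-JSON-to-Snowflake load pattern referenced in the summary above (not taken from any employer codebase): the warehouse, database, schema, stage, table, and column names are hypothetical, and credentials are assumed to come from environment variables.

```python
# Illustrative only: load nested JSON landed in S3 into a Snowflake VARIANT
# column via an external stage, then flatten it for downstream modeling.
# Stage, table, and column names are hypothetical.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="LOAD_WH",      # hypothetical warehouse
    database="RAW",           # hypothetical database
    schema="LANDING",         # hypothetical schema
)
cur = conn.cursor()
try:
    # Raw landing table with a single VARIANT column for the nested JSON.
    cur.execute("CREATE TABLE IF NOT EXISTS ORDERS_RAW (payload VARIANT)")

    # COPY from an external S3 stage that points at the JSON files.
    cur.execute("""
        COPY INTO ORDERS_RAW
        FROM @S3_ORDERS_STAGE              -- hypothetical external stage
        FILE_FORMAT = (TYPE = 'JSON')
        ON_ERROR = 'CONTINUE'
    """)

    # Flatten the nested line items into rows for downstream use.
    cur.execute("""
        CREATE OR REPLACE VIEW ORDERS_FLAT AS
        SELECT payload:order_id::STRING    AS order_id,
               payload:customer.id::STRING AS customer_id,
               item.value:sku::STRING      AS sku,
               item.value:qty::NUMBER      AS qty
        FROM ORDERS_RAW,
             LATERAL FLATTEN(input => payload:items) item
    """)
finally:
    cur.close()
    conn.close()
```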
TECHNICAL SKILLS:
Cloud Platforms and Services:
  AWS: S3, EC2, Lambda, Glue, Data Pipeline, Redshift, EMR, Kinesis, IAM, KMS, CloudWatch, CloudFront, CloudTrail, CloudFormation
  Azure: Data Lake Storage, Event Hubs, Azure Data Factory (ADF), Databricks, Azure DevOps, Azure AD
  Google Cloud Platform (GCP): BigQuery, Dataprep, Dataflow, Pub/Sub
Big Data and Distributed Computing: Hadoop, HDFS, MapReduce, Hive, HBase, Pig, Sqoop, CDH, Spark, Spark SQL, PySpark, Spark MLlib, Flink, Kafka, Apache NiFi
Database Management Systems: SQL Server, MySQL, Oracle, MongoDB, Cosmos DB
Data Integration and ETL Tools: Informatica, Apache NiFi, Kafka Connect, AWS Glue, ADF
Programming Languages: Python, SQL, Scala
Data Warehousing and Data Modeling: Erwin, dimensional modeling techniques, Snowflake
Containerization and Orchestration: Docker, Kubernetes, Helm
CI/CD and Automation Tools: Jenkins, Ansible, GitLab CI, Maven, Git
Data Visualization: Tableau, Power BI, Looker, Google Data Studio
Machine Learning and Data Analysis: Pandas, NumPy, SciPy, TensorFlow, Scikit-Learn, Mahout
Security and Compliance: IAM roles and policies, KMS, encryption techniques, JWT authentication
Monitoring and Logging: CloudWatch, ELK Stack, Nagios, Datadog
Agile and Project Management: Agile (Scrum, Kanban), JIRA, Confluence, Rally, Bugzilla
Other Tools and Technologies: XML, XSD, XSLT, JSON, RESTful APIs, GraphQL, OAuth

WORK EXPERIENCE

Walmart, San Bruno, CA | Jan 2023 - Present
Sr. Data Engineer
Responsibilities:
- Managed the full SDLC for data engineering projects, from requirements gathering to deployment and maintenance, ensuring alignment with business objectives.
- Designed and implemented scalable AWS architectures using S3, EC2, Lambda, Glue, and Data Pipeline, optimizing data workflows by 30%.
- Orchestrated message queues and notifications with SQS and SNS, ensuring 99.9% reliability in data delivery and event-driven architectures.
- Implemented data warehousing solutions using Redshift, achieving a 50% improvement in data storage and query performance for analytical workloads.
- Used AWS EMR and Spark for distributed data processing, leveraging Spark SQL and PySpark for advanced analytics and ETL tasks, increasing processing efficiency by 40%.
- Designed and implemented real-time data streaming solutions using Kinesis, ensuring data ingestion within 5 seconds for timely processing.
- Managed IAM roles and policies and implemented KMS for data encryption, ensuring compliance with security best practices with zero incidents.
- Implemented monitoring and logging with CloudWatch, CloudFront, and CloudTrail, improving operational visibility by 50% and enabling compliance auditing.
- Automated infrastructure deployment and management using CloudFormation, improving deployment speed and consistency.
- Used Erwin for data modeling and schema design, ensuring data integrity and efficient database structures.
- Implemented MapReduce jobs and used Hive, HBase, Pig, and Sqoop for data processing and integration in Hadoop environments.
- Applied machine learning techniques using Mahout, Pandas, NumPy, Scala, and Spark MLlib to derive insights from large datasets.
- Managed XML data interchange and transformation using XSD and XSLT, ensuring system compatibility and data integrity.
- Containerized data applications with Docker and orchestrated them with Kubernetes, optimizing resource utilization and deployment flexibility.
- Implemented CI/CD pipelines using Jenkins and Ansible, achieving 95% automation of testing, deployment, and configuration management.
- Created interactive data visualizations and reports using Tableau, enabling data-driven decision-making with 80% faster insights.
- Practiced Agile and Scrum methodologies, leading sprint planning and retrospectives to ensure iterative development and on-time delivery.
- Managed project tasks and tracked issues using JIRA, ensuring 100% transparency and alignment with project milestones.
- Implemented robust data security measures, including encryption, access controls, and governance policies, ensuring data privacy and compliance.
- Conducted performance tuning and applied indexing and partitioning strategies, optimizing query performance and resource utilization.
- Designed and implemented data lakes and data marts, enabling efficient data storage, retrieval, and analytics.
- Developed and maintained scalable data pipelines handling data ingestion, profiling, cleansing, and enrichment.
- Led on-call rotations and monitored production data pipelines, ensuring 99.99% system availability and timely issue resolution.
- Implemented disaster recovery strategies and conducted security audits and vulnerability assessments, ensuring data resilience and compliance.
- Produced technical documentation and mentored junior team members, facilitating knowledge sharing and achieving a 90% improvement in team skills.
Environment: AWS, S3, EC2, Lambda, Glue, Data Pipeline, SQS, SNS, Redshift, EMR, Spark, Spark SQL, PySpark, Kinesis, IAM, KMS, CloudWatch, CloudFront, CloudTrail, CloudFormation, Erwin, MapReduce, Hive, HBase, Pig, Sqoop, Mahout, Pandas, NumPy, Scala, Spark MLlib, XML, XSD, XSLT, Docker, Kubernetes, Jenkins, Ansible, Tableau, Agile, Scrum, JIRA.

Cigna, Birmingham, AL | Oct 2021 - Dec 2022
Data Engineer
Responsibilities:
- Implemented PowerShell scripts to automate and orchestrate data workflows, delivering an 80% efficiency gain in data processing and system operations.
- Managed SQL Server and Cosmos DB databases, optimizing data storage and retrieval and achieving a 30% performance improvement for high-demand applications.
- Used Azure Data Lake Storage and Azure Event Hubs for scalable data storage and real-time data streaming, enabling responsive data pipelines for timely analytics.
- Designed and orchestrated data pipelines using Azure Data Factory (ADF), ensuring 95% reliability in data integration and transformation across cloud environments.
- Implemented data processing and analytics using Databricks and Apache Spark, leveraging Flink for real-time stream processing and increasing processing speed by 50%.
- Deployed and managed infrastructure on Azure using Azure DevOps, Azure AD, and Terraform, achieving 60% faster resource provisioning and greater management efficiency.
- Integrated REST APIs for data exchange between systems, ensuring seamless data flow and platform integration with zero integration failures.
- Applied dimensional modeling techniques to design efficient data warehouses on Hortonworks and Snowflake platforms.
- Managed and optimized Hadoop clusters, using MapReduce for distributed data processing and analysis.
- Containerized data applications using Docker and orchestrated them with Kubernetes, ensuring scalable and resilient deployment of data solutions.
- Used Pandas and NumPy for data manipulation and analysis, supporting data-driven insights and decision-making with 95% accuracy.
- Applied TensorFlow for machine learning model development and integration into data pipelines, enhancing predictive analytics capabilities with 90% model accuracy.
- Managed code versioning and collaboration using GitHub, ensuring code integrity and efficient team collaboration.
- Developed and maintained data workflows and pipelines using the Eclipse IDE, ensuring robust data processing and application development.
- Coordinated project management and issue tracking using JIRA, facilitating agile development and timely project delivery.
- Implemented monitoring and performance optimization using Datadog, ensuring system reliability and efficiency.
- Developed interactive dashboards and reports with Power BI, delivering actionable insights to business stakeholders and cutting decision-making time by 50%.
- Automated data ingestion, processing, and integration tasks, improving operational efficiency by 70% and reducing manual effort by 80%.
- Designed and implemented real-time data streaming solutions, enabling data analytics and decision-making within seconds.
Environment: PowerShell, SQL Server, Cosmos DB, Azure, ADF, Databricks, Spark, Azure DevOps, Azure AD, REST APIs, Hortonworks, Snowflake, Flink, Hadoop, MapReduce, Docker, Kubernetes, Pandas, NumPy, TensorFlow, GitHub, Terraform, Eclipse IDE, JIRA, Datadog, Power BI.

FusionCharts, India | Sep 2018 - Jul 2021
Data Engineer
Responsibilities:
- Developed and maintained data ingestion and transformation pipelines using Python and Apache NiFi, ensuring seamless data flow and integration.
- Leveraged Sqoop for efficient data transfer between relational databases and HDFS, ensuring data accessibility and integrity.
- Implemented real-time data processing and analytics using Apache Spark and Scala, enhancing data processing capabilities.
- Used JSON for data interchange and integration across platforms, ensuring consistent data formats and structures.
- Managed big data environments using Cloudera Distribution for Hadoop (CDH), optimizing performance and scalability.
- Conducted data analysis and querying with BigQuery and MySQL, providing actionable insights and supporting data-driven decisions.
- Integrated MongoDB for NoSQL data storage and retrieval, enabling flexible data modeling and access.
- Employed Hive for efficient querying and data warehousing, facilitating large-scale data analytics.
- Deployed data solutions on Google Cloud Platform (GCP) using Dataprep, Dataflow, and Pub/Sub, ensuring robust and scalable cloud infrastructure.
- Used NumPy, Pandas, and Scikit-Learn for data processing and machine learning, enhancing predictive analytics capabilities.
- Managed version control and CI/CD pipelines with GitLab CI and Maven, ensuring seamless integration and deployment.
- Implemented Kafka for real-time data streaming and messaging, supporting low-latency data processing (see the sketch following these responsibilities).
- Created interactive visualizations and reports using Google Analytics and Google Data Studio, providing insights for business stakeholders.
- Employed JWT authentication for secure data access and user authentication, ensuring data privacy and security.
- Used the ELK stack (Elasticsearch, Logstash, Kibana) for log management and data visualization, enhancing system monitoring and troubleshooting.
- Applied Agile and Scrum methodologies for project management, ensuring timely delivery and continuous improvement.
- Developed and maintained CI/CD pipelines, streamlining development and deployment processes.
- Used the Visual Studio IDE for efficient code development and debugging, ensuring high-quality software solutions.
- Coordinated project management and issue tracking using JIRA, ensuring efficient workflow and collaboration.
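Illustrative sketch of the Kafka-to-Spark streaming pattern described above (not taken from any employer codebase): the broker address, topic, event schema, and HDFS paths are hypothetical.

```python
# Illustrative only: read JSON events from a Kafka topic with Spark Structured
# Streaming and write them to HDFS as Parquet. Broker, topic, schema, and
# path names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Assumed schema for the JSON payload carried in the Kafka message value.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "clickstream")                 # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream")            # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```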
Environment: Python, Sqoop, HDFS, Apache Spark, Scala, JSON, BigQuery, MySQL, MongoDB, Hive, Google Cloud Platform (GCP), NumPy, Pandas, Scikit-Learn, Apache NiFi, GitLab CI, Maven, Kafka, JWT, ELK stack, Agile, Scrum, Visual Studio, JIRA.

Webgen Technologies, India | Jan 2016 - Aug 2018
Data Engineer
Responsibilities:
- Developed, tested, and maintained Python scripts for extract, transform, and load (ETL) processes, ensuring efficient data flow and integration (see the sketch at the end of this resume).
- Implemented AWS services, including S3, EC2, RDS, and Lambda, to manage and deploy scalable data solutions, enhancing system performance and reliability.
- Used Oracle databases and SQL/PL-SQL for data modeling, performance tuning, and ensuring data integrity across applications.
- Designed and maintained Hadoop clusters, leveraging HDFS for distributed data storage and Spark for real-time data processing and analytics.
- Employed Sqoop for efficient data transfer between Hadoop and relational databases, ensuring seamless data integration and accessibility.
- Developed and automated data pipelines using Informatica, ensuring timely and accurate data movement across the data architecture.
- Created and optimized data visualizations and dashboards using Looker, providing stakeholders with actionable insights and improved decision-making.
- Managed data storage and processing using Hive, ensuring efficient querying and retrieval from large datasets.
- Used Bugzilla for tracking and resolving data-related issues, ensuring high data quality and system reliability.
- Conducted version control and collaborative development using Git, ensuring code integrity and facilitating cross-functional collaboration.
- Implemented system performance monitoring and alerting with Nagios, proactively addressing potential issues and ensuring system uptime.
- Coordinated project management and issue tracking using JIRA, streamlining workflows and ensuring timely delivery.
- Developed data governance policies and procedures, ensuring regulatory compliance and data security across the organization.
- Improved SQL query performance through optimization techniques, enhancing database efficiency and reducing execution times.
- Automated scheduled jobs and data workflows, ensuring consistent and timely data processing and reducing manual intervention.
- Collaborated with diverse teams to design and deploy data integration solutions, supporting business intelligence and analytics goals.
Environment: Python, AWS, Oracle, Hadoop, HDFS, Spark, Sqoop, Looker, Informatica, Hive, Bugzilla, Git, Nagios, JIRA.

EDUCATION:
Bachelor's in Electronics and Communication Engineering, Andhra University, India.
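Illustrative sketch of the kind of Python ETL scripting described in the Webgen Technologies role above (not taken from any employer codebase): the file paths, column names, table name, and SQLite connection are hypothetical stand-ins for the actual sources and warehouse.

```python
# Illustrative only: parse a CSV extract and a JSON feed, apply light
# cleansing, and load the result into a warehouse staging table.
# File paths, column names, table name, and connection are hypothetical.
import json
import sqlite3  # stand-in for the warehouse connection in this sketch

import pandas as pd


def extract() -> pd.DataFrame:
    """Read the two hypothetical source files and join them."""
    orders = pd.read_csv("data/orders.csv", parse_dates=["order_date"])
    with open("data/customers.json") as fh:
        customers = pd.json_normalize(json.load(fh))
    return orders.merge(customers, on="customer_id", how="left")


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleansing: drop duplicates, standardize text, fill defaults."""
    df = df.drop_duplicates(subset=["order_id"])
    df["customer_name"] = df["customer_name"].str.strip().str.title()
    df["amount"] = df["amount"].fillna(0.0)
    return df


def load(df: pd.DataFrame) -> None:
    """Write to a staging table; a real job would target Oracle or Redshift."""
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("stg_orders", conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract()))
```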