Sai Ahalya Challa - Data Engineer
[email protected]
Location: Dallas, North Carolina, USA
Relocation: Any
Visa: H1B
Sr. Data Engineer
Name: Sai Ahalya Challa | [email protected] | (832) 429-8756

PROFESSIONAL SUMMARY:
- 8+ years of professional IT experience in gathering requirements, analysis, architecture, design, documentation, and implementation of applications using big data technologies.
- Experienced in programming languages such as Scala, Java, Python, and SQL.
- Experienced in Spark technologies including Spark Core, Spark DataFrame, Spark SQL, and Spark Streaming.
- Worked with Spark to improve the efficiency of existing applications by analyzing execution graphs, identifying bottlenecks, and performing configuration tuning and code-level improvements.
- Expertise in using Hadoop infrastructure such as YARN, ZooKeeper, HBase, Sqoop, Oozie, Flume, and MapReduce for data storage and analysis.
- Experience converting Hive/SQL queries into Spark transformations using Spark RDD and PySpark concepts (see the sketch after this summary).
- Experience implementing Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Experience importing and exporting data between databases such as MySQL, Oracle, Netezza, Teradata, and DB2 and HDFS using Sqoop and Talend.
- Experience with the Hadoop ecosystem, including Spark, Kafka, HBase, Apache Iceberg, Impala, Mahout, Tableau, and Talend big data technologies.
- Experienced in Python data manipulation for loading and extraction, and with Python libraries such as NumPy, SciPy, and pandas for data analysis and numerical computation.
- Experience with big data on AWS cloud services: EC2, S3, EMR, Glue, Athena, RDS, VPC, SQS, ELK, Kinesis, DynamoDB, and CloudWatch.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience implementing big data engineering, cloud data engineering, data warehouse, data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Experience developing and scheduling ETL workflows in Hadoop using Oozie.
- Experience importing and exporting data using Sqoop between HDFS and relational database systems.
- Experience with relational databases such as SQL Server, MySQL, Oracle, and DB2, and NoSQL databases such as MongoDB, HBase, DynamoDB, Cosmos DB, and Cassandra.
- Experienced in designing and developing automation frameworks using Python and shell scripting.
- Hands-on experience in machine learning, big data, data visualization, Python development, Java, Linux, Windows, SQL, and Git/GitHub.
- Experienced in data analysis, design, development, implementation, and testing using data conversion, extraction, transformation, and loading (ETL) with SQL Server, Oracle, and other relational and non-relational databases.
- Experience building, deploying, and integrating applications with Ant and Maven.
- Experience with big data technologies on Cloudera and Hortonworks distributions.
- Experience creating test cases for JUnit testing.
- Experience developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and aggregation; detailed knowledge of the MapReduce framework.
- Automated resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
- Experience instantiating, creating, and maintaining CI/CD (continuous integration and deployment) pipelines and applying automation to environments and applications using Jenkins, Docker, and Kubernetes.
- Expertise in Snowflake for creating and maintaining tables and views.
- Experience developing applications using Java and J2EE technologies: Servlets, JSP, Java Web Services, JDBC, XML, Cascading, Spring, and Hibernate.
- Experience with Amazon Redshift, the AWS data warehouse product, and with configuring servers for Auto Scaling and Elastic Load Balancing.
- Experienced in creating pipelines that move, transform, and analyze data from a wide variety of sources using multiple methods, such as the Azure PowerShell utility.
- Experience with various SDLC methodologies such as Agile, Scrum, and Waterfall.
- Experience with BigQuery, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Built an ETL framework for data migration from on-premises data sources such as Hadoop and Oracle to AWS using Apache Airflow, Apache Sqoop, and Apache Spark.
- Experience using source code management tools such as Git, SVN, and Perforce.
- Developed and maintained data ingestion and transformation processes using tools such as Apache Beam and Apache Spark.
- Created and managed data storage solutions using GCP services such as BigQuery, Cloud Storage, and Cloud SQL.
- Implemented data security and access controls using GCP's Identity and Access Management (IAM) and Cloud Security Command Center.
- Monitored data pipelines and storage solutions using GCP's Stackdriver and Cloud Monitoring.
- Collaborated with data scientists and analysts to understand their data requirements and provided solutions to meet their needs.
- Automated data processing tasks using scripting languages such as Python and Bash.
- Participated in code reviews and contributed to the development of best practices for data engineering on GCP.
- Stayed up to date with the latest GCP services and features and evaluated their potential use in the organization's data infrastructure.
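For illustration, a minimal PySpark sketch of the Hive-to-Spark conversion pattern referenced in the summary above. The orders table, its columns, and the date filter are hypothetical placeholders, not taken from any specific project.

```python
# Minimal sketch: rewrite a Hive aggregation as equivalent DataFrame
# transformations. Table and column names (orders, customer_id, amount,
# order_date) are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-spark-example")
    .enableHiveSupport()   # lets Spark read existing Hive tables
    .getOrCreate()
)

# Original Hive/SQL form of the query.
hive_version = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    WHERE order_date >= '2021-01-01'
    GROUP BY customer_id
""")

# Equivalent DataFrame API form.
orders = spark.table("orders")
dataframe_version = (
    orders
    .filter(F.col("order_date") >= "2021-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)
```

Both forms are planned by the same Catalyst optimizer; the DataFrame form is simply easier to compose and unit test inside a pipeline.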
TECHNICAL SKILLS:
Hadoop Distributions: Cloudera, AWS EMR, Azure Data Factory
Languages: Scala, Python, SQL, HiveQL
IDE Tools: Eclipse, IntelliJ, PyCharm
Cloud Platforms: AWS, Azure, GCP Cloud Storage
AWS Services: S3, Redshift, Lambda, Kinesis, DynamoDB, Glue, Athena
Databases: Oracle, SQL Server, MySQL, MS Access; NoSQL (HBase, Cassandra, MongoDB)
Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, ZooKeeper, Flume, Airflow, Snowflake, Databricks, Kafka, Cloudera
Containerization: Docker, Kubernetes
CI/CD Tools: Jenkins, Bamboo, GitLab CI, uDeploy, Travis CI, Octopus
Operating Systems: UNIX, Linux, Ubuntu, CentOS
Other Software: Control-M, Eclipse, PyCharm, Jupyter, Apache, Jira, PuTTY, Advanced Excel
PROFESSIONAL EXPERIENCE:

Client: MetLife, Cary, NC | March 2021 to Present
Sr. Cloud Big Data Engineer
Responsibilities:
- Worked on building a centralized data lake on AWS Cloud utilizing primary services such as S3, EMR, Redshift, and Athena.
- Responsible for ingesting large volumes of user behavioral data and customer profile data into the analytics data store.
- Worked on migrating datasets and ETL workloads from on-prem to AWS Cloud services.
- Built a series of Spark applications and Hive scripts to produce various analytical datasets.
- Worked on troubleshooting Spark applications to make them more error tolerant.
- Worked on fine-tuning Spark applications to improve the overall processing time of the pipelines.
- Worked closely with business teams and data science teams and ensured all requirements were translated accurately into our AWS data pipelines.
- Used Python with PySpark to build data pipelines and wrote Python scripts to automate them.
- Built a scalable web application using AWS Step Functions to orchestrate microservices and manage state across multiple Lambda functions.
- Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis.
- Built a real-time streaming pipeline utilizing Kinesis, Spark Streaming, and Amazon Redshift.
- Worked with different formats such as text, Avro, Parquet, Delta Lake, JSON, and flat files using Spark.
- Worked with user reporting tools such as Tableau connected to Athena to generate daily data reports.
- Skilled in performance tuning in Snowflake to optimize query performance and improve data processing times.
- Expertise in leveraging AWS services such as Amazon Elastic Kubernetes Service (EKS) for container orchestration, simplifying cluster management and deployment workflows.
- Developed various Spark applications using PySpark to perform enrichments of user behavioral data (clickstream data) merged with user profile data.
- Used Snowflake SQL to analyze large datasets and provide insights to stakeholders.
- Strong understanding of Airflow operators and related Python libraries for data ingestion and data orchestration.
- Worked on utilizing AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore.
- Created and implemented triggers using Spark SQL.
- Experience with CI/CD integration using Kubernetes in AWS, automating application deployment through continuous integration pipelines.
- Good hands-on experience with NoSQL databases such as Cassandra, HBase, and MongoDB.
- Familiar with Snowflake integrations with data engineering and BI tools such as Tableau, Python, and Airflow to build end-to-end data solutions.
- Proficient in shell scripting, including Bash, to automate tasks and processes in data engineering workflows.
- Proficient in using cron to schedule and automate the execution of shell scripts and data processing tasks at specified intervals.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators, both legacy and newer.
- Built an ETL framework for data migration from on-premises data sources such as Hadoop and Oracle to AWS using Apache Airflow, Apache Sqoop, and Apache Spark.
- Developed multi-cloud strategies to make better use of GCP and Azure.
- Involved in writing Python scripts to load customer data and online shopping event data from external sources such as SFTP servers, DB2, and Amcat into S3 buckets.
- Performed ETL with Python/SQL Server pipelines and frameworks to perform data analytics and visualization using the Python, NumPy, SciPy, pandas, and MATLAB stack.
- Migrated the existing ETL data pipeline from AWS Redshift to EMR clusters to decrease cost.
- Expertise in writing AWS Lambda functions to watch CloudWatch events and trigger various jobs (see the sketch after this section).
- Worked on SNS integration to enable real-time data processing with AWS services such as Lambda and Glue.
Environment: AWS EMR, Spark, Hive, S3, Athena, Kinesis, Scala, Redshift, Airflow, GCP Cloud Storage, Cloud SQL, Snowflake, Kafka, Spark Streaming, PySpark, AWS Glue, Kubernetes, NoSQL, IAM, Tableau, Data Pipelines.
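For illustration, a minimal sketch of the CloudWatch-event-to-Lambda pattern noted in the MetLife section: a scheduled CloudWatch/EventBridge rule invokes the handler, which starts a downstream Glue job. The Glue job name is a hypothetical placeholder, not an actual project resource.

```python
# Minimal sketch: Lambda handler invoked by a scheduled CloudWatch/EventBridge
# rule that kicks off a Glue ETL job. The job name is a hypothetical placeholder.
import json
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Start the downstream Glue job when the scheduled event fires.
    response = glue.start_job_run(JobName="example-daily-enrichment-job")
    # Log the triggering event and the run id for traceability in CloudWatch Logs.
    print(json.dumps({"event": event, "job_run_id": response["JobRunId"]}, default=str))
    return {"statusCode": 200, "jobRunId": response["JobRunId"]}
```

The same handler shape also serves SNS-triggered invocations mentioned above; only the event payload differs.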
Client: Cloudflare, Austin, TX | Nov 2018 to Feb 2021
Sr. Data Engineer
Responsibilities:
- Worked extensively with Azure Databricks clusters for real-time analytics, streaming, and batch jobs.
- Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.
- Worked on Python and SQL scripts to load data from the data lake to the data warehouse.
- Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.
- Implemented Spark using Python, utilizing Spark Core, Spark Streaming, and Spark SQL for faster data processing instead of MapReduce in Java.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau.
- Worked on Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
- Involved in working with big data tools such as Hadoop, Spark, and Hive.
- Worked on architecting the ETL transformation layers and writing Spark jobs to do the processing.
- Worked on building models using Python and PySpark to predict the probability of attendance for various campaigns and events.
- Built and deployed machine learning models using GCP's AI Platform and TensorFlow.
- Used Pentaho Data Integration to create all ETL transformations and jobs.
- Worked on creating Spark clusters and configuring high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Involved in writing PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process.
- Involved in querying data using Spark SQL on top of the Spark engine and implementing Spark RDDs in Scala.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to get notifications on failures.
- Created concurrent access for Hive tables with shared/exclusive locks enabled by implementing ZooKeeper in the cluster.
- Used Tableau extracts to perform offline investigation.
- Responsible for data services and data movement infrastructure; good experience with ETL concepts, building ETL solutions, and data modeling.
- Implemented ad hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Worked on Azure Blob Storage, Azure Data Lake, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Analytics, PolyBase, Azure HDInsight, and Azure Databricks.
- Developed pipelines to move data from Azure Blob Storage/file shares to Azure SQL Data Warehouse and Blob Storage.
- Worked extensively on the migration of different data products from Oracle to Azure.
- Worked on developing a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns (see the sketch after this section).
- Worked on NoSQL databases such as HBase and Cassandra.
Environment: Azure, Agile, Spark, Hadoop, Scala, Hive, ZooKeeper, ETL, GCP, Kafka, SQL, NoSQL, Cassandra, Oracle, Linux, Tableau.
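For illustration, a minimal sketch of the column-hashing PySpark approach described in the Cloudflare section. The input/output paths and the email column are hypothetical placeholders; in practice any client-specified columns would be substituted.

```python
# Minimal sketch: hash a client-specified column with SHA-256 via a PySpark UDF,
# then drop the raw value. Paths and column names are hypothetical placeholders.
import hashlib
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("hash-client-columns-example").getOrCreate()

@F.udf(returnType=StringType())
def sha256_hash(value):
    # Hash one value; pass nulls through unchanged.
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

raw = spark.read.parquet("/mnt/raw/customers")          # hypothetical source path
masked = raw.withColumn("email_hashed", sha256_hash(F.col("email"))).drop("email")
masked.write.mode("overwrite").parquet("/mnt/curated/customers")  # hypothetical target
```

For plain SHA-2 hashing, the built-in F.sha2 function avoids UDF serialization overhead; a UDF is shown here because the resume specifically calls out writing PySpark UDFs.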
Client: COX Communications, Virginia | Jan 2017 to Oct 2018
Big Data Engineer
Responsibilities:
- Participated in the full software development lifecycle, covering requirements, solution design, development, QA implementation, and product support, using Scrum and Agile methodologies.
- Worked with the Hadoop ecosystem covering HDFS, HBase, YARN, and MapReduce.
- Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment for big data resources.
- Responsible for the design and development of advanced Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
- Implemented a Python-based distributed random forest via Python streaming.
- Developed various shell and Python scripts to address production issues.
- Worked on Apache Spark, writing Python applications to convert and parse TXT and XLS files.
- Developed Spark code using Scala and Spark SQL for faster testing and data processing.
- Involved in the design, development, and testing of ETL processes using Informatica.
- Worked on package configuration to set up automated ETL load processing for one-time and incremental data loads.
- Migrated ETL processes from RDBMS to Hive to test easier data manipulation.
- Responsible for the logical dimensional data model and used ETL skills to load the dimensional physical layer from various sources, including DB2, SQL Server, Oracle, and flat files.
- Designed and developed custom data flows using Apache NiFi to fully automate the ETL process, taking various worst-case scenarios into account.
- Deep understanding of monitoring and troubleshooting mission-critical Linux machines.
Environment: Hive, Pig, YARN, Hadoop, Git, HBase, EC2, CloudWatch, Apache NiFi, Oracle, SQL, Glue, MongoDB, Spark.

Narvee Technologies Pvt Ltd, Hyderabad | June 2014 to Nov 2016
Big Data Analyst
Responsibilities:
- Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
- Installed and configured Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
- Experienced in SQL programming and creation of relational database models.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed Spark scripts using Scala per requirements on the Spark 1.5 framework.
- Developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Developed simple to complex MapReduce jobs using Hive.
- Involved in loading data from the UNIX file system to HDFS.
- Performed query optimization, execution-plan analysis, and performance tuning of queries for better performance in SQL.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and Spark.
- Good experience handling data manipulation using Python scripts (see the sketch after this section).
- Developed business intelligence solutions using SQL Server Data Tools 2015 and 2017 and loaded data to SQL and Azure cloud databases.
- Analyzed large data sets to determine optimal ways to aggregate and report on them.
- Responsible for building scalable distributed data solutions using Hadoop.
- Wrote SQL queries, stored procedures, triggers, and functions for MySQL databases.
- Migrated ETL processes from Oracle to Hive to test easier data manipulation.
- Optimized Pig scripts and Hive queries to increase efficiency and added new features to existing code.
- Worked on creating tabular models in Azure Analysis Services to meet business reporting requirements.
- Created Hive tables and was involved in data loading and writing Hive UDFs.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Developed Hive queries to process data for visualization and reporting.
Environment: Apache Hadoop, Cloudera Manager, CDH2, Python, CentOS, Java, MapReduce, Pig, Hive, Sqoop, Oozie, SQL.
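For illustration, a minimal pandas sketch of the Python data-manipulation and aggregation work described in the Narvee section. The file names and columns are hypothetical placeholders.

```python
# Minimal sketch: clean a raw CSV extract and aggregate it for reporting.
# File and column names are hypothetical placeholders.
import pandas as pd

# Load the raw extract, parse dates, and drop incomplete rows.
df = pd.read_csv("daily_extract.csv", parse_dates=["event_date"])
df = df.dropna(subset=["account_id", "amount"])

# Aggregate per account and month for the downstream report.
report = (
    df.assign(month=df["event_date"].dt.to_period("M").astype(str))
      .groupby(["account_id", "month"], as_index=False)["amount"]
      .sum()
      .rename(columns={"amount": "monthly_amount"})
)
report.to_csv("monthly_report.csv", index=False)
```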