Akhilesh - Data Analyst |
jessica@daticsinc.com |
Location: Detroit, Michigan, USA |
Relocation: Yes |
Visa: |
Data Engineer
Akhilesh Reddy Goli
Phone: 717-4674-609 | Email: akhileshg1605@gmail.com | LinkedIn: linkedin.com/in/akhilesh-reddy-goli-b31924267
________________________________________
PROFILE
Practical Microsoft Azure Certified Data Engineer with in-depth knowledge of data manipulation techniques and computer programming, paired with expertise in integrating and implementing new software packages and products into existing systems. Nine years of developing, testing, and troubleshooting ETL/ELT projects have made me a proficient data engineer.
Overall 9+ years of experience as a Data Engineer with a major focus on Big Data technologies: the Hadoop ecosystem, HDFS, MapReduce, HBase, Hive, Sqoop, Kafka, Oozie, Spark, and Teradata.
Implement data quality checks and validation rules to ensure the accuracy and consistency of healthcare data in the Next Gen v8 environment.
Involved in all phases of the Software Development Life Cycle (SDLC): requirements gathering, analysis, design, development, testing, production, and post-production support.
Hands-on experience installing, configuring, and using Hadoop ecosystem components such as MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Zookeeper, Oozie, Kafka, Spark, Storm, Airflow, and Flume.
Hands-on experience with Unified Data Analytics on Databricks: the Databricks workspace user interface, managing Databricks notebooks, and Delta Lake with Spark SQL.
Knowledge of installation and administration of multi-node virtualized clusters using Cloudera Hadoop and Apache Hadoop.
Develop, implement, and optimize ETL (Extract, Transform, Load) pipelines to integrate data from disparate healthcare systems (EHR, EMR, HL7, CCD, FHIR).
Ensure seamless integration of Next Gen v8 data with internal and external data systems, including claims, patient records, and lab results.
Experience implementing data analysis with analytic tools such as Anaconda 4.0, Jupyter Notebook 4.x, and Alteryx.
Proficient in creating data pipelines, ETL pipelines, and data streaming solutions with Apache Kafka, Apache Spark, Apache Storm, and Amazon Kinesis.
Strong experience using Apache Spark, Spark SQL, and other data processing tools and languages.
Strong knowledge of Hive (architecture, Thrift servers), HQL, Beeline, and third-party JDBC connectivity services for Hive.
Regularly reviewed and updated policies to ensure ongoing compliance with HIPAA and other relevant healthcare regulations.
Coordinated with the ETL testing team to fix defects; comfortable operating in the UNIX file system through the command-line interface.
Developed Python code to gather data from HBase and designed the solution for implementation in PySpark.
Maintained and optimized AWS infrastructure (EMR, EC2, S3, EBS, Redshift, and Elasticsearch).
Experience writing MapReduce programs for analyzing Big Data in different file formats, both structured and unstructured.
Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Designed and developed data pipelines to ingest, process, and store healthcare data in FHIR format.
Ensured that all data handling, storage, and processing activities comply with HIPAA regulations to protect patient privacy and data security.
Worked closely with healthcare providers, insurers, and other stakeholders to understand their data needs and ensure seamless integration of FHIR standards.
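For context on the Kafka, Spark Streaming, and Delta Lake items listed above, a minimal PySpark Structured Streaming sketch of a Kafka-to-Delta ingestion job; the broker, topic, schema, and storage paths are hypothetical placeholders, not taken from any specific project, and a Databricks/Delta runtime is assumed:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka_to_delta_sketch").getOrCreate()

    # Expected JSON layout of the incoming events (assumed for this sketch).
    event_schema = StructType([
        StructField("patient_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Read a live stream from Kafka and parse the JSON payload.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "healthcare-events")           # placeholder topic
           .load())

    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*"))

    # Append parsed events to a Delta table for downstream Spark SQL queries.
    query = (events.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/healthcare-events")
             .outputMode("append")
             .start("/mnt/delta/healthcare_events"))

    query.awaitTermination()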
Experience in moving data between GCP and Azure using Azure Data Factory.
Maintain comprehensive documentation for data engineering processes, data models, and system configurations related to Next Gen v8.
Skilled in Python, SQL, R, and Object-Oriented Programming (OOP) concepts such as inheritance, polymorphism, abstraction, and encapsulation.
Experienced in data integration validation and data quality controls for ETL processes and data warehousing, with reporting in Power BI.
Implemented best practices for data ingestion, data transformation, and data quality.
Designed, developed, and maintained robust and scalable data pipelines using ETL processes to ingest, transform, and store data from various sources.
Built predictive analytics models and forecasting data sets in Power BI; good experience in dashboard design with large data sets.
Participate in cross-functional teams to optimize Next Gen v8 data workflows, supporting the transition to value-based care and improving patient care outcomes.
Integrated these technologies with Snowflake and Azure Cloud services to build scalable and efficient data processing pipelines.
Developed Pig Latin scripts for handling business transformations.
Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
Advanced knowledge of MS Office, MS Power Query, Power BI, and Microsoft SQL Server.
Hands-on experience with SCM tools like Git and SVN for merging and branching.
Support data analytics, machine learning, and data-driven applications by providing cleansed, enhanced, and reliable datasets, using technologies such as Power BI, Azure Machine Learning, and Azure Synapse Analytics.

Technical Skills
Programming Skills: Scala, Python, Next Gen v8
Frameworks: Spark, Hadoop
Data Ingestion Tools: Hive
Streaming Tools: Kafka, Spark Streaming, Structured Streaming
Job Scheduling Tools: Airflow
Databases: Oracle, SQL Server, MongoDB, Cassandra, PL/SQL, T-SQL
Version Control: Git
Parallel Programming: CUDA, PyCUDA, OpenCL, cuDNN, RAPIDS, Ray, Dask, Celery, doParallel
CI Tools: Jenkins
Defect Tracking: JIRA, ServiceNow
Query Languages: SQL, PL/SQL
ETL/BI: Informatica, Talend, SSIS, Power BI, Tableau, Alteryx
Cloud Technologies: Azure; AWS (S3, EMR)
Azure: MS Azure Fundamentals Certified professional, Power BI

WORK EXPERIENCE

Sr. Azure Data Engineer
Nanthealth, Morrisville, NC
December 2021 to Present
Job description: Lead comprehensive big data initiatives, proficiently utilizing a suite of big data analytic tools including Spark, Hive, Sqoop, Pig, Flume, Apache Kafka, PySpark, Oozie, HBase, Python, and Scala. Orchestrated system software development, integrating Hadoop ecosystem components such as HBase, Sqoop, Oozie, and Hive. Translated SQL Server and Oracle procedures into Hadoop using Spark SQL, Scala, and Java to enhance efficiency. Implemented robust ETL processes leveraging AWS Glue, ensuring seamless data loading, and generated impactful reports using Athena and QuickSight.
Responsibilities
Processed large volumes of data using big data analytic tools including Spark, Hive, Sqoop, Pig, Flume, Apache Kafka, PySpark, Oozie, HBase, Python, and Scala.
Developed and maintained end-to-end ETL data pipelines and worked with large data sets in Azure Data Factory.
Implement data quality checks and validation rules to ensure the accuracy and consistency of healthcare data in the Next Gen v8 environment.
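A minimal PySpark sketch of the kind of rule-based data-quality checks described in the item above; the table name, columns, and rules are hypothetical examples, not taken from the actual environment:

    from functools import reduce
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dq_checks_sketch").getOrCreate()

    claims = spark.table("staging.claims")  # hypothetical source table

    # Rule name -> boolean column that is True when a row violates the rule.
    checks = {
        "missing_patient_id":    F.col("patient_id").isNull(),
        "negative_claim_amount": F.col("claim_amount") < 0,
        "future_service_date":   F.col("service_date") > F.current_date(),
    }

    total = claims.count()
    for rule, violation in checks.items():
        bad = claims.filter(violation).count()
        pct = (bad / total * 100) if total else 0.0
        print(f"{rule}: {bad} violations ({pct:.2f}% of {total} rows)")

    # Rows that pass every rule move downstream; violations are quarantined.
    violation_any = reduce(lambda a, b: a | b, checks.values())
    passed = claims.filter(~violation_any)
    quarantined = claims.filter(violation_any)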
Researched and implemented various components like pipelines, activities, mapping data flows, data sets, linked services, triggers, and control flow.
Performed extensive debugging, data validation, error handling, transformation-type analysis, and data clean-up within large datasets.
Develop, implement, and optimize ETL (Extract, Transform, Load) pipelines to integrate data from disparate healthcare systems (EHR, EMR, HL7, CCD, FHIR).
Ensure seamless integration of Next Gen v8 data with internal and external data systems, including claims, patient records, and lab results.
Designed and managed scalable infrastructure to support high volumes of data and complex AI computations.
Collaborated with DevOps engineers to develop automated CI/CD and test-driven development pipelines using Azure, per client standards.
Used infrastructure-as-code tools like Terraform and Azure Resource Manager (ARM) to automate the provisioning and management of cloud resources.
Worked on converting multiple SQL Server and Oracle stored procedures into Hadoop using Spark SQL, Hive, Scala, and Java.
Integrate Power BI reports with other Microsoft tools (Excel, SharePoint, Teams) and third-party applications.
Leverage DAX (Data Analysis Expressions) for custom calculations and measures, and optimize report performance by implementing best practices for visuals and filters.
Developed an interactive Power BI dashboard to track sales performance across regions; visualized revenue, profit, and customer segments and enabled real-time monitoring of key BI metrics.
Designed efficient data models in Power BI Desktop.
Created and managed data pipelines for data ingestion, processing, and integration using tools such as Azure Data Factory, Azure Databricks, and Apache Spark.
Designed, implemented, and maintained data solutions on the Microsoft Azure cloud platform using various Azure data services and frameworks.
Developed and optimized data storage systems, such as Azure SQL Database, Azure Data Lake Storage, Azure Cosmos DB, and Azure Blob Storage, to meet the scalability, performance, and cost requirements of the organization.
Created and maintained source-to-target mapping documents, including Data Object Mapping (DOM D), to outline and track data flow from source systems to target destinations.
Collaborated with data architects and business analysts to ensure mappings align with business objectives.
Understood and analyzed source data models to design Level 1 (L1) and Level 1+ (L1+) data models.
Developed and maintained comprehensive documentation for L1 and L1+ objects, including definition-of-done criteria to standardize processes and ensure the completeness and quality of deliverables.
Diagnosed and resolved data-related issues efficiently by leveraging debugging tools and techniques in Azure environments.
Developed and implemented ETL processes using tools like Azure Data Factory and Azure Synapse.
Optimized pipeline implementation and maintenance through Databricks workspace configuration, cluster tuning, and notebook optimization.
Worked on Snowflake schemas, data modeling, data elements, issue/question resolution logs, source-to-target mappings, interface matrices, and design elements.
Developed tables and views in Snowflake for end customers who consume the final data results.
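To make the source-to-target mapping and L1 modeling items above concrete, a small PySpark sketch that applies a mapping dictionary to rename and cast source columns into an L1-style target layout; the paths, column names, and types are hypothetical, not drawn from the actual mapping documents:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stm_sketch").getOrCreate()

    # Hypothetical source extract.
    source_df = spark.read.parquet("/mnt/raw/members")

    # Source-to-target mapping: source column -> (target column, target type).
    mapping = {
        "mbr_id":    ("member_id", "string"),
        "dob":       ("date_of_birth", "date"),
        "enroll_dt": ("enrollment_date", "date"),
        "plan_cd":   ("plan_code", "string"),
    }

    # Apply the mapping: keep only mapped columns, rename, and cast.
    target_df = source_df.select(
        [F.col(src).cast(dtype).alias(tgt) for src, (tgt, dtype) in mapping.items()]
    )

    target_df.write.mode("overwrite").parquet("/mnt/curated/l1_members")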
Proficient use of Sqoop to import and export data between relational databases/Teradata and HDFS/Hive.
Using Azure Databricks and Data Factory, create and manage an ideal data pipeline architecture on the Microsoft Azure cloud.
Create and maintain data models within Snowflake, ensuring efficient storage and retrieval of data.
Extensive experience working with Snowflake, a cloud-based data warehousing platform, for storing and managing large-scale datasets.
Designed and implemented data models in Snowflake, ensuring optimal schema structures for efficient data storage and retrieval.
Participate in cross-functional teams to optimize Next Gen v8 data workflows, supporting the transition to value-based care and improving patient care outcomes.
Tuned Snowflake queries and optimized data warehouse performance for faster query response times.
Implemented Snowflake security features, including role-based access control and data encryption, and implemented data security policies and procedures within Snowflake to ensure data security and compliance.
Developed tables and views in Snowflake for end customers who consume the final data results.
Integrated Snowflake with Informatica PowerCenter for seamless data movement between on-premises and cloud environments.
Architected and managed the Snowflake data warehouse to ensure high availability, scalability, and performance.
Developed and implemented Snowflake-specific solutions, including data modeling, schema design, and optimization.
Collaborated with data scientists and AI engineers to integrate generative AI models into data pipelines.
Optimized data flows to support AI/ML model training, validation, and deployment.
Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
Created Databricks notebooks using SQL and Python and automated notebooks using jobs.
Designed and implemented Kafka topics in the new Kafka cluster across all environments.
Consumed data in various file formats (JSON, CSV, Parquet) from platforms such as S3, Kafka, and Teradata.
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse.
Created AWS Glue crawlers for crawling source data in S3 and RDS.
Created multiple Glue ETL jobs in Glue Studio, processed the data with different transformations, and loaded it into S3, Redshift, and RDS.
Created multiple recipes in Glue DataBrew and used them in various Glue ETL jobs.
Designed and developed ETL processes in AWS Glue to migrate data from external sources like S3 (Parquet/text files) into AWS Redshift.
Utilized Azure IoT, Azure HDInsight with Spark, Kafka, and Azure Stream Analytics to build real-time data solutions that process and analyze streaming data from devices and sensors, enabling fast and actionable insights.
Leveraged modern data platform technologies, such as Snowflake, Matillion, and BryteFlow, to build scalable, flexible, and cost-effective data solutions that handle diverse and complex data sources and scenarios.
Created Terraform scripts to automate deployment of EC2 instances, S3, EFS, EBS, IAM roles, snapshots, and a Jenkins server.
Implemented an Enterprise Data Lake in Google Cloud Storage and a data warehouse in Google BigQuery using Informatica and Cloud Data Fusion ETL tools.
Built batch pipelines with homogeneous and heterogeneous sources using Cloud Data Fusion.
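A minimal sketch, in Python with the snowflake-connector-python library, of loading staged Parquet files from S3 into a Snowflake table, in the spirit of the Snowflake and S3 items listed above; the account, warehouse, stage, table, and columns are placeholders and the external stage is assumed to already exist:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",      # placeholder
        user="etl_user",           # placeholder
        password="********",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="CURATED",
    )

    try:
        cur = conn.cursor()
        # Create the target table if it does not exist yet.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS member_claims (
                claim_id STRING,
                member_id STRING,
                claim_amount NUMBER(12,2),
                service_date DATE
            )
        """)
        # Bulk-load Parquet files from a hypothetical external S3 stage.
        cur.execute("""
            COPY INTO member_claims
            FROM @claims_s3_stage/parquet/
            FILE_FORMAT = (TYPE = PARQUET)
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
        print(cur.fetchall())  # per-file load results
    finally:
        conn.close()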
Developed Cloud Data Fusion workflows to parse raw data, join it with other tables, and store the refined data in BigQuery tables.
Extensive ETL testing experience using Informatica 9.x/8.x, Talend, and Pentaho.
Design, implement, and maintain data solutions on the Microsoft Azure cloud platform using various Azure data services and frameworks, such as Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Data Lake Storage, Azure Cosmos DB, and more.
Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
Experience in moving data between GCP and Azure using Azure Data Factory.
Created monitors, alarms, notifications, and logs for Lambda functions and Glue jobs using CloudWatch.
Used REST APIs with Python to ingest data from external sites into BigQuery.
Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
Built a Scala- and Spark-based configurable framework to connect to common data sources like MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver for all environments.
Hands-on experience in Python/PySpark programming on Cloudera, Hortonworks, and MapR Hadoop clusters, AWS EMR clusters, AWS Lambda functions, and CFTs.
Extracted files from Cassandra and MongoDB through Sqoop, placed them in HDFS, and processed them.
Used Athena extensively to run queries on data processed by Glue ETL jobs, and used QuickSight to generate reports for business intelligence.
Environments: Hadoop, Hive, Impala, Amazon S3, Beeline, PySpark, Accelerator, NiFi/StreamSets, Bamboo, Control-M, iCEDQ, UNIX, Cloudera Navigator, Apache Airflow, Java, Athena, QuickSight, BigQuery, Spark SQL, SharePoint, Confluence, AWS, Bitbucket, JIRA, Flume, Oozie, Zookeeper, Cassandra, MongoDB, Spark, Kafka, MapReduce, S3, EC2, EMR

Sr. Data Engineer
Express Scripts, St. Louis, MO
May 2017 to November 2021
Job description: Led end-to-end software solutions, managing requirements, design, development, and unit testing. Engineered a Scala and Spark framework for data ingestion, transformation, and aggregation. Expertise in optimizing Spark pipelines and performing unit tests. Proficient in Spark Core and Spark SQL. Orchestrated AWS security groups, focusing on high availability and auto scaling using Terraform.
Responsibilities
Gathered requirements and designed, developed, unit tested, and implemented technical software solutions.
Developed a framework using Scala and Spark to ingest data for the analytics team.
Created Scala/Spark jobs for data transformation and aggregation; well versed in optimizing Spark jobs and data pipelines.
Produced unit tests for Spark transformations and helper methods.
Extensive working experience with Spark Core and Spark SQL.
Ingested data from an Oracle database through Oracle GoldenGate into the Hadoop data lake with the help of Kafka.
Created Athena data sources on S3 buckets for ad hoc querying and business dashboarding using the QuickSight and Tableau reporting tools.
Copied fact/dimension and aggregate output from S3 to Redshift for historical data analysis using Tableau and QuickSight.
Scheduled load jobs using Cloud Composer and monitored BigQuery and Data Fusion jobs via Stackdriver/Cloud Logging.
Built Glue jobs for technical data cleansing such as deduplication, NULL-value imputation, and removal of redundant columns.
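A rough sketch of the core PySpark logic such a cleansing job might contain (AWS Glue jobs execute PySpark, but the Glue-specific boilerplate is omitted here); the paths, key columns, and default values are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cleansing_sketch").getOrCreate()

    raw = spark.read.parquet("s3://example-bucket/raw/orders/")  # placeholder path

    cleansed = (
        raw
        # Deduplicate on the business key, keeping one row per order.
        .dropDuplicates(["order_id"])
        # Impute NULLs with sensible defaults.
        .fillna({"quantity": 0, "discount": 0.0, "channel": "UNKNOWN"})
        # Drop a redundant technical column carried over from the source extract.
        .drop("source_row_hash")
        # Standardize a string column.
        .withColumn("channel", F.upper(F.col("channel")))
    )

    cleansed.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")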
Also built Glue jobs for standard data transformations (date/string and math operations) and business transformations required by business users.
Involved heavily in setting up CI/CD pipelines using Jenkins, Terraform, and AWS.
Modeled complex ETL processes that change data visually using data flows, Azure Databricks, and SQL Database.
Responsible for setting up a MemSQL cluster on AWS EC2 instances.
Maintain comprehensive documentation for data engineering processes, data models, and system configurations related to Next Gen v8.
Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
Designed and implemented partitioning (static and dynamic) and bucketing in Hive on AWS.
Developed Hive and Pig queries on different datasets to prepare reports.
Used Sqoop for data transfer between RDBMS and HDFS while migrating historical data.
Created automated visualization dashboards using Power BI, enabling stakeholders to easily access and interpret insights and reducing report generation time by 40%.
Implemented automated anomaly detection algorithms to identify and flag data inconsistencies, enhancing data integrity and reducing error resolution time by 60%.
Wrote Hive queries to read and write data from/into HDFS.
Developed and implemented Snowflake-specific solutions, including data modeling, schema design, and optimization.
Utilized Power Query in Power BI to pivot and un-pivot the data model for data cleansing and data massaging.
Performed query optimization to achieve faster indexing and make the system more scalable.
Experience in building highly scalable systems to ingest large datasets.
Worked on ad hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
Designed different Airflow workflows with Python based on input data availability.
Designed a Git project structure that enables effective maintainability of source code.
Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Implemented enterprise-level Azure solutions, including Azure Databricks, Azure Machine Learning, Azure Kubernetes Service (AKS), Azure Data Factory, Logic Apps, Azure Storage Accounts, and Azure SQL Database.
Utilized Azure Databricks and Apache Spark to handle and analyze large volumes of data efficiently.
Managed a high volume of everyday migrations of Informatica workflows.
Worked with Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Ingested data to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Implemented robust security measures to protect sensitive data and ensure compliance with data protection regulations (e.g., GDPR, CCPA); utilized encryption, access controls, and other security practices to safeguard data throughout its lifecycle.
Leveraged high-performance computing resources to accelerate AI model training and inference.
Orchestrated and automated application workflows using Airflow.
Created and validated the checklist to migrate on-prem applications to the cloud.
Developed initial libraries and tools that can be reused in the process.
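A minimal Airflow DAG sketch illustrating the kind of availability-driven Python workflow referenced above ("Designed different Airflow workflows with Python based on input data availability"); the DAG id, file path, schedule, and task callable are hypothetical:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.filesystem import FileSensor

    def load_daily_extract(**context):
        # Placeholder for the real ingestion logic (e.g., a Spark job or COPY command).
        print("loading extract for", context["ds"])

    with DAG(
        dag_id="daily_extract_load_sketch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 6 * * *",   # run daily at 06:00
        catchup=False,
    ) as dag:
        # Wait until the upstream system has dropped the day's file.
        wait_for_file = FileSensor(
            task_id="wait_for_input_file",
            filepath="/data/incoming/extract_{{ ds_nodash }}.csv",
            poke_interval=300,
            timeout=6 * 60 * 60,
        )

        load = PythonOperator(
            task_id="load_daily_extract",
            python_callable=load_daily_extract,
        )

        wait_for_file >> load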
Environments: Hadoop, PySpark, Cloudera, HDFS, Hive, AWS, Azure Data Factory, Azure Storage, Spark & Spark SQL, HBase, Sqoop, Kafka, HP ALM/Quality Center, Agile, SQL, Teradata, XML, UNIX, Shell Scripting, WinSQL, HBase, MySQL, MongoDB, Oozie

Data Engineer
Careington, Frisco, TX
October 2016 to April 2017
Job description: Developed Data Lake applications based on client requirements, ensuring data exposure to clients. Engineered scripts for data movement across zones on the Data Fabric platform. Created applications like Claims Sweep WGS using Spark, Hive, and Python for data transformation. Automated validation using scripts and applied Spark (Scala and PySpark) for fast data processing. Transformed Hive/SQL queries into Spark transformations and utilized Apache Airflow for workflow management.
Responsibilities
Responsible for developing applications on the data lake per client requirements and exposing that data to clients.
Developed code to move data from one zone to another on the Data Fabric platform.
Created applications like Claims Sweep WGS for transforming data per client requirements using Spark, Hive, and Python.
Developed automation scripts to perform validations such as record counts and schema checks and to load data into the corresponding partitions.
Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
Used PySpark to write the code for all the Spark use cases; extensive experience with Scala for data analytics on a Spark cluster, and performed map-side joins on RDDs.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience using Spark Shell and Spark Streaming.
Built near-real-time pipelines that operate efficiently and handle huge volumes of incoming business activity.
Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
Used Apache Airflow to manage the system workflows.
Developed programs to validate the data after ingesting it into the data lake using UNIX.
Developed scripts to generate reconciliation reports using Python.
Involved in moving data from different source systems like Oracle, SQL Server, and DB2 to the data lake.
Loaded data from Teradata to HDFS using the Teradata Hadoop connectors.
Ingested data from an Oracle database through Oracle GoldenGate into the Hadoop data lake with the help of Kafka.
Responsible for providing design and architecture guidance to the team for developing applications.
Responsible for reviewing code and bringing it in line with client standards.
Created data models for the data to be ingested for each table and identified the appropriate file formats to retrieve the data faster.
Environments: Kafka, REST, Amazon Web Services, Scala, Hive, Jira, StreamSets, HDFS, Control-M, Spark, Teradata, Hortonworks, Scrum, Pig, Tez, Oozie, HBase, Scala, PySpark, Spark SQL, Kafka, Python, Linux, Cassandra

ETL Developer
Ideal Invent, India
September 2015 to August 2016
Responsibilities
Performed data extraction, aggregation, and log analysis on real-time data using Spark Streaming.
Experience working on projects with data visualization, R and Python development, Unix, and SQL.
Performed exploratory data analysis using NumPy, matplotlib, and pandas.
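A short illustrative pandas/NumPy/matplotlib sketch of the kind of exploratory analysis mentioned in the item above; the CSV file and column names are hypothetical:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    df = pd.read_csv("transactions.csv", parse_dates=["txn_date"])  # placeholder file

    # Basic profiling: shape, dtypes, summary statistics, and missing values.
    print(df.shape)
    print(df.dtypes)
    print(df.describe(include="all"))
    print(df.isna().sum())

    # Example of a lambda-style map used to filter rows, plus a log transform.
    high_value = df[df["amount"].map(lambda x: x > 1000)]
    df["log_amount"] = np.log1p(df["amount"])

    # Simple trend plot of daily transaction volume.
    daily = df.groupby(df["txn_date"].dt.date)["amount"].sum()
    daily.plot(kind="line", title="Daily transaction amount")
    plt.xlabel("date")
    plt.ylabel("total amount")
    plt.tight_layout()
    plt.savefig("daily_amount.png")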
Developed and maintained ETL/ELT processes to ensure accurate and timely data availability for analytics and reporting; used tools like Azure Data Factory, SSIS, and custom scripts for ETL workflows.
Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
Experience using lambda functions with filter(), map(), and reduce() on pandas DataFrames to perform various operations.
Working experience with unit testing/test-driven development (TDD) and load testing; worked on the Celery task queue with RabbitMQ as the message broker.
Implemented Principal Component Analysis and Linear Discriminant Analysis.
Used the pandas API for analyzing time series.
Created a regression test framework for new code.
Documented data workflows, ETL processes, data models, and operational procedures.
Implemented CI/CD pipelines for data engineering workflows to ensure rapid deployment and updates.
Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear and concise specifications and queries.
Eliminated incomplete or unusable data.
Leveraged Azure Cloud services (such as Azure Data Factory, Azure Synapse, and Azure Databricks) to build and deploy data solutions.
Created informative and visually appealing charts, graphs, and dashboards to communicate data insights effectively using visualization tools like Power BI and Tableau.
Performed exploratory data analysis and data visualization using R and Tableau.
Environments: R, Tableau, AWS, Azure, SQL, Python, RabbitMQ, Linux, Spark, Excel, T-SQL, Oracle SQL, SSIS, SSMS

Data Analyst
Zensar Technologies, Bangalore, India
May 2014 to August 2015
Responsibilities:
Acquired expertise in producing Tableau dashboards for reporting on analyzed data.
Knowledge of NoSQL databases such as HBase.
Cleaned staged input record files and validated their data before loading them into the data warehouse.
Automated the extraction of numerous flat/Excel files from many sources, including FTP and SFTP (Secure FTP).
Used Jenkins for continuous integration and GitHub as a repository for committing and retrieving code.
Worked on various data flow and control flow tasks, loop and sequence containers, script tasks, Execute SQL tasks, and package configuration.
Developed SSIS packages to export data from SQL Server to Excel spreadsheets and to import data from Excel spreadsheets.
Developed SSIS packages to decrypt, transform, and move files to a data warehouse with suitable error handling and alerting; these files may be fetched from remote locations using FTP and SFTP.
Environments: SSIS, SSRS, Report Builder, Office, Excel, Flat Files, T-SQL, MS SQL Server, and SQL Server Business Intelligence Development Studio

Education:
Bachelor's degree in Electronics and Communication Engineering (ECE)
Sree Dattha Institute of Technology and Sciences, Hyderabad, India
Year of Graduation: 2015