Lakshmikanth Reddy - Data Engineer
[email protected]
Location: Fort Collins, Colorado, USA
Relocation: Yes
Visa: GC EAD
LAKSHMIKANTH
[email protected] 720-310-8189 PROFESSIONAL SUMMARY Around 10+ years of IT experience in Analysis, design, development, implementation, maintenance, and support with expe-rience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing require-ments. Experience working with Amazon Web Services (AWS), Microsoft Azure, Cloudera and Hortonworks Good Understanding of Azure Big data technologies like Azure Data Lake Analytics, Azure Data Lake Store, Azure Data Facto-ry, Azure Databricks, and created POC in moving the data from flat files and SQL Server using U-SQL jobs. Expertise with Big data on AWS cloud services i.e., EC2, S3, Auto Scaling, Glue, Lambda, Cloud Watch, Cloud Formation, Athe-na, DynamoDB and RedShift. AWS Data Lakes have used for integrating wide range of AWS services, such as Amazon S3, Amazon EMR, and Amazon Athe-na, as well as third-party tools and services. Applied Data Governance Best Practices Confidential UNUM, insurance company to achieve Data Governance Business, Func-tional and IT goals. Integrate Collibra DGC using Collibra Connect (MuleESB) with third-party tools such as Ataccama, IBM IGC and Tableau to ap-ply DQ rules, import technical lineage and to create reports using the MetaData in Collibra DGC Experience in Amazon Web Services (AWS) and Microsoft Azure, such as AWS EC2, S3, RD3, Azure HDInsight, Machine Learn-ing Studio, Azure Storage, and Azure Data Lake. Good understanding of Spark Architecture with Databricks, Structured Streaming, and Setting Up AWS and Microsoft Azure with Databricks. In-depth knowledge about Data Warehousing (gathering requirements, design, development, implementation, testing, and documentation), Data Modeling (analysis using Star Schema and Snowflake for FACT and Dimensions Tables), Data Processing, Data Acquisition and Data Transformations (Mapping, Cleansing, Monitoring, Debugging, Performance Tuning and Trouble-shooting Hadoop clusters). Leveraged DBT's functionality to create and manage data transformation pipelines, enabling the transformation of raw data from diverse sources into structured and actionable formats for analytics and reporting purposes. Implemented version control for DBT models and configurations using tools like Git, and set up Continuous Integra-tion/Continuous Deployment pipelines to automate testing and deployment of DBT projects. By using DBT I have performed data transformations, data modeling, and building analytics-ready datasets. I employ DBT to create efficient and scalable data pipelines, optimize data schemas, and ensure data quality and consistency. Hands on experience with programming languages such as Python, PySpark, Spark, Scala and querying languages such as SQL, PL/SQL. Experience in Java, J2ee, JDBC, Collections, Servlets, JSP, Struts, Spring, Hibernate, JSON, XML, REST, SOAP Web services, Groovy, MVC, Eclipse, Weblogic, Websphere, and Apache Tomcat severs. Developed custom Kafka producer and consumer for different publishing and subscribing to Kafka topics. Good working experience on Spark (spark streaming, spark SQL) with Scala and Kafka. Worked on reading multiple data for-mats on HDFS using Scala. Integrate Collibra DGC using Collibra Connect (MuleESB) with third-party tools such as Ataccama, IBM IGC and Tableau to ap-ply DQ rules, import technical lineage and to create reports using the MetaData in Collibra DGC Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills. 
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streams.
Used the Kafka platform to design and implement data pipelines that meet requirements and performance objectives.
Strong experience in Python, PySpark, Spark, Scala, SQL, PL/SQL, and RESTful web services.
Strong experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses and into Azure Data Lake.
Experience with ETL workflow management tools such as Apache Airflow and significant experience writing Python scripts to implement workflows.
In-depth experience with a variety of Tableau reporting components, including facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters; experience using NiFi and Flume to load log files into Hadoop.
Strong experience working with databases such as Oracle, MySQL, Teradata, and Netezza, and proficiency in writing complex SQL queries.
Used AWS EMR to build data processing pipelines that ingest, transform, and store large volumes of data; also used EMR to build data warehouses on AWS and to run Hadoop clusters that process large volumes of data and store the results in a data warehouse.
Experience in extracting source data from Sequential, XML, CSV, JSON, and Parquet files and then transforming and loading it into the target data warehouse.
Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as Cassandra and MongoDB.
Experience with Airflow for creating DAGs and automating data pipelines on a schedule.
Experienced in troubleshooting errors in HBase Shell/API, Pig, Hive, and MapReduce.

TECHNICAL SKILLS
Hadoop Technologies: HDFS, MapReduce, YARN, Hive, Pig, HBase, Impala, Zookeeper, Sqoop, Oozie, Apache Cassandra, Flume, Spark, AWS, EC2
Cloud Technologies: AWS, Azure
Programming Languages: Python, PySpark, Spark, SQL, Java, Groovy, PL/SQL, Scala, Shell Scripts
Databases: NoSQL, Oracle, DB2, MySQL, SQL Server, MS Access, HBase
Data Modeling: Erwin R9.x, ER Studio, Snowflake
Application Servers: WebLogic, WebSphere, Apache Tomcat, JBoss
IDEs: Eclipse, NetBeans, JDeveloper, IntelliJ IDEA
Version Control: TFS, SVN, Git
Reporting Tools: Jaspersoft, Qlik Sense, Tableau, JUnit
ETL Tools: Apache NiFi, Apache Airflow, Talend, Informatica, SSIS
Big Data Tools: Apache Hadoop, Apache Spark, Apache Kafka, Apache Hive, Apache Cassandra, Apache Flink, Apache Pig

PROFESSIONAL EXPERIENCE

CBRE, Dallas, TX | March 2023 - Till Date
Sr Data Engineer
Responsibilities:
Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.
Responsible for ensuring data quality and compliance; used the catalog to manage metadata, enforce data policies, and monitor data lineage to ensure adherence to regulations and organizational standards.
Used the AWS Data Catalog to store and process patient data in various formats, including structured, semi-structured, and unstructured data ranging from a few terabytes to petabytes.
Built ETL pipelines to efficiently load Medicare and Medicaid claims data from various sources into data warehouses (see the sketch below).
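A hedged PySpark sketch of the multi-format extract-transform-load pattern referenced in the summary and in the claims-loading work above; the paths, column names, and target table are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi_format_etl").getOrCreate()

# Extract: read the same entity from three landing-zone formats (placeholder paths).
csv_df = spark.read.option("header", True).csv("s3://bucket/landing/claims_csv/")
json_df = spark.read.json("s3://bucket/landing/claims_json/")
parquet_df = spark.read.parquet("s3://bucket/landing/claims_parquet/")

# Transform: align columns, union the sources, cast types, and de-duplicate.
cols = ["claim_id", "amount", "service_date"]
combined = (
    csv_df.select(*cols)
    .unionByName(json_df.select(*cols))
    .unionByName(parquet_df.select(*cols))
    .withColumn("amount", F.col("amount").cast("double"))
    .dropDuplicates(["claim_id"])
)

# Load: write the cleansed data into the target warehouse layer.
combined.write.mode("overwrite").saveAsTable("dw.claims_clean")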
Ensured the accuracy, consistency, and quality of healthcare claims data by implementing data quality checks, validation processes, and adherence to healthcare regulations and compliance standards (HIPAA, HITECH).
Optimized data processing workflows and queries to handle large volumes of healthcare claims data efficiently, employing techniques such as indexing, partitioning, and query optimization.
Leveraged data models to facilitate analysis, reporting, and visualization of Medicare and Medicaid claims data, considering the specific requirements and complexities of healthcare data structures.
Collaborated with data analysts, data scientists, and business stakeholders to provide the infrastructure and support needed to generate insights, create reports, and develop dashboards related to Medicare and Medicaid claims.
Partnered with ETL developers to ensure that data is well cleaned and that the data warehouse is kept up to date for reporting purposes using Pig.
Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
Used MongoDB to manage complex and diverse datasets in a NoSQL environment; performed query optimization and aggregation, utilizing MongoDB's features for data analysis and reporting.
Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
Used AWS EMR for machine learning tasks such as training and deploying models, pairing EMR with tools such as Apache Spark MLlib to build and train models on large datasets.
Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse (see the sketch below).
Identified and resolved connectivity issues during integration, ensuring seamless data flow into the catalog; this involved testing, error handling, and debugging to maintain data integrity.
Worked on ETL processes to extract data from different systems, transform it into a suitable format, and load it into MongoDB, ensuring data flow between MongoDB and other databases and applications in the ecosystem.
Configured and optimized Kafka clusters to ensure efficient data flow and distribution across the ecosystem; monitored Kafka data pipelines for reliability, fault tolerance, and timely delivery of streaming data.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
Leveraged Glue's visual interface to design and schedule ETL jobs, implementing custom transformations to ensure data quality and consistency.
Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
Integrated Ataccama with Collibra using the Mule ESB connector and published DQ rule results to Collibra using REST API calls.
Developed and validated machine learning models, including Ridge and Lasso regression, for predicting total trade amounts.
Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, and MongoDB) into HDFS.
Used MongoDB's aggregation framework and query language to perform complex data analysis tasks, aggregating data, deriving insights, and generating reports to support decision-making processes within the organization.
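An illustrative sketch of the moving-average and RSI computation mentioned above, combining a PySpark window function with a pandas RSI helper; the table, symbol, and column names are assumptions.

import pandas as pd
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("stock_indicators").getOrCreate()
prices = spark.table("dw.stock_prices")   # assumed columns: symbol, trade_date, close

# 20-day moving average per symbol over a sliding row window.
w = Window.partitionBy("symbol").orderBy("trade_date").rowsBetween(-19, 0)
with_ma = prices.withColumn("ma_20", F.avg("close").over(w))

# 14-period RSI computed in pandas from simple rolling average gains and losses.
def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

pdf = with_ma.filter(F.col("symbol") == "AAPL").toPandas().sort_values("trade_date")
pdf["rsi_14"] = rsi(pdf["close"])
# The enriched frame can then be written back to the warehouse layer.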
Developed Spark scripts by writing custom RDDs in Spark for data transformations and performed actions on RDDs.
Developed highly complex, maintainable, and easy-to-use Python and Spark code that satisfies application requirements, performing data processing and analytics with built-in libraries.
Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
Developed predictive analytics using Apache Spark APIs.
Created and optimized SSIS packages with a focus on efficiency, reliability, and maintainability, employing best practices for error handling, logging, and package configurations; used SSIS to build, design, and deploy ETL into a data warehouse.
Utilized Agile and Scrum methodology for team and project management.
Implemented effective data modeling within Power BI for optimal storage and retrieval of information; implemented data governance policies and ensured that sensitive information is secured within Power BI reports and dashboards.
Environment: Spark (PySpark, Spark SQL, Spark MLlib), Python (scikit-learn, NumPy, Pandas), PySpark, SQL, Talend, Tableau, MySQL, GitHub, AWS EMR/EC2/S3/Redshift, Snowflake, AWS Glue, Pig, Oracle, Power BI, SSIS.

Alameda Health System, Remote | Oct 2021 - Feb 2023
Data Engineer
Responsibilities:
Developed an ETL pipeline to source datasets and transmit calculated ratio data from Azure to a Datamart (SQL Server) and Credit Edge.
Designed and implemented Kafka topic configurations in a new Kafka cluster for various scenarios.
Established and maintained best practices and standards for data pipelining and integration within Snowflake in the Azure environment.
Constructed robust data warehousing solutions within Snowflake to consolidate and analyze diverse financial data sources, such as transaction records, market data, customer information, and compliance-related data.
Integrated Snowflake with machine learning algorithms and analytics tools to create predictive models for risk assessment, fraud detection, and investment forecasting within the financial sector in Azure.
Integrated Ataccama with Collibra using the Mule ESB connector and published DQ rule results to Collibra using REST API calls.
Facilitated collaborative data sharing and reporting within Snowflake, allowing different departments such as finance, risk management, and compliance to access and analyze data securely while ensuring accuracy.
Optimized the speed of both external and managed Hive tables within the Azure ecosystem.
Developed and managed data pipelines utilizing Azure services, orchestrating seamless data flow and workflows.
Utilized Pyramid's rich ecosystem of libraries and extensions to facilitate tasks such as handling data pipelines and integrating with analytics frameworks within Azure.
Created Communities, Domains, and Assets within Collibra.
Integrated Azure services with Power BI for impactful data visualization, crafting insightful dashboards and reports that let stakeholders derive actionable insights.
Created a system using Kafka to collect data from multiple portals and processed it using Spark within Azure.
Developed Python scripts for exploratory analysis of data retrieved from databases such as Redshift and Snowflake.
Scheduled Airflow DAGs to run multiple Hive and Pig jobs independently based on time and data availability (see the sketch below).
Conducted exploratory data analysis and data visualization using Python and Tableau.
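A minimal Airflow DAG sketch for the time-based Hive and Pig scheduling described above, using BashOperator for illustration (Airflow 2.x style); the schedule, script paths, and task names are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hive_pig_daily",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",   # run daily at 02:00
    catchup=False,
) as dag:
    run_hive = BashOperator(
        task_id="run_hive_job",
        bash_command="hive -f /opt/etl/claims_aggregate.hql",
    )
    run_pig = BashOperator(
        task_id="run_pig_job",
        bash_command="pig -f /opt/etl/clean_logs.pig",
    )
    run_hive >> run_pig   # run the Pig job only after the Hive job completes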
Collaborated with stakeholders to translate data requirements into Azure-based solutions aligned with business objectives and compliance standards.
Built code values, code lists, and attributes in Collibra DGC.
Employed Splunk's machine learning and predictive analytics capabilities to build models and algorithms that automatically analyze and predict data patterns.
Leveraged Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database for efficient management and storage of structured, semi-structured, and unstructured data, ensuring accessibility and security.
Identified key attributes and fields facilitating data correlation across different sources for integrity and efficient retrieval during the integration process.
Loaded information into the data warehouse and other systems such as SQL Server using ETL tools such as SQL Loader and external tables.
Designed and implemented Hadoop-based data processing architectures to handle large-scale distributed data storage and processing requirements efficiently within Azure.
Collaborated with stakeholders to understand reporting requirements and translated them into Power BI solutions; developed ETL processes to extract, transform, and load data into Power BI from different sources.
Built scalable and high-performing data warehousing solutions using Azure Synapse Analytics (formerly Azure SQL Data Warehouse) for effective data analysis and reporting.
Developed Spark code using Scala and Spark SQL for accelerated testing and data processing in the Azure environment.
Employed Python's pandas and NumPy libraries for data cleaning, feature scaling, feature engineering, and predictive analytics to create models within Azure (see the sketch below).
Designed, implemented, and optimized Azure-based ETL (Extract, Transform, Load) processes to source, transform, and load data from diverse sources into Azure services.
Applied Apache Airflow and CRON scripts on the UNIX operating system, developing Python scripts to automate the ETL process within Azure.
Loaded data into different schema tables using SQL Loader and control files in Azure.
Participated in the design and architecture of Master Data Management (MDM) and data lakes, utilizing Cloudera Hadoop to create a data lake within Azure.
Contributed to data integration by defining information needs across functional domains and performing scripting/data migration using the SQL Server Export Utility within Azure.
Environment: ETL Development, Kafka Configuration, Snowflake Data Warehousing, Machine Learning Integration, Azure Services Management, Power BI Integration, Spark Data Processing, Python Scripting, Airflow DAGs, Tableau Data Visualization, Splunk Analytics, Azure Storage Management, Hadoop Architecture, Azure Synapse Analytics, Scala Programming, Pandas, NumPy, Data Integration Tools, Cloudera Hadoop, SQL Server Utilities.

Verizon, Atlanta, GA | June 2018 - Sept 2021
Azure Data Engineer
Project Description: Designing, building, and maintaining the data infrastructure for a price optimization system. The goal of the project is to build a model that can predict future sales of products in a retail store, using historical sales data and other relevant information such as weather, promotions, and holidays. The model is trained on a large dataset of sales data and then used to make informed decisions about inventory management, staffing, and marketing.
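A hypothetical pandas/NumPy sketch of the data cleaning and feature engineering referenced above, framed around a retail sales dataset like the one just described; the file and column names are assumptions.

import numpy as np
import pandas as pd

# Placeholder input: one row per store, product, and day.
sales = pd.read_csv("daily_sales.csv", parse_dates=["date"])

# Cleaning: drop duplicate rows and fill missing promotion/holiday flags.
sales = sales.drop_duplicates(subset=["store_id", "product_id", "date"])
sales["on_promotion"] = sales["on_promotion"].fillna(0).astype(int)
sales["is_holiday"] = sales["is_holiday"].fillna(False).astype(int)

# Feature engineering: calendar features and a one-week lag of sales.
sales["day_of_week"] = sales["date"].dt.dayofweek
sales["sales_lag_7"] = (
    sales.sort_values("date")
    .groupby(["store_id", "product_id"])["units_sold"]
    .shift(7)
)

# Feature scaling: z-score the numeric columns with NumPy.
for col in ["units_sold", "sales_lag_7"]:
    sales[f"{col}_z"] = (sales[col] - np.mean(sales[col])) / np.std(sales[col])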
Responsibilities:
Implemented data pipelines in Azure Databricks to process and analyze inventory data, optimize stocking levels, predict demand, and manage supply chain operations effectively, reducing stockouts and overstock situations.
Developed recommendation engines using Azure Databricks to suggest personalized products or services to customers based on their historical purchases, browsing behavior, and trends, enhancing upselling and cross-selling opportunities.
Used Azure Databricks to process operational data, such as point-of-sale (POS) data and store foot traffic, to optimize store layouts, staff allocation, and overall operational efficiency.
Evaluated tools using Gartner's Magic Quadrants along with functionality metrics and product scores: Informatica MDM, Ataccama, IBM, Semarchy, Contentserv, etc.
Participated in Data Governance working group sessions to create data governance policies.
Configured Communities, Domains, the Asset Model, and Relations per the MS Data Governance approach and requirements.
Designed and implemented Azure security solutions for data storage, processing, and analysis, ensuring compliance.
Designed and implemented a service-oriented architecture underpinned by ingress and egress using Azure Data Lake Store and Azure Data Factory, adding blobs to lakes for analytic results and pulling data from Azure Data Lake to the blobs.
Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.
Designed and managed databases in SQL Server to store and manage financial data such as transaction records, customer information, account details, and market data.
Created reports, dashboards, and visualizations using SQL Server Reporting Services (SSRS) and Power BI, enabling financial analysts and stakeholders to gain insights into financial performance, risk analysis, and regulatory compliance.
Integrated SQL Server Analysis Services with SQL Server for business intelligence, forecasting, and predictive analytics in areas such as investment strategies, customer behavior, and market trends.
Designed data models using Erwin; developed physical data models and created DDL scripts to create the database schema, star schema, and database objects.
Worked with Terraform templates to automate Azure IaaS virtual machines using Terraform modules and deployed virtual machine scale sets.
Worked on migrating MapReduce programs into Spark transformations; developed and optimized MapReduce programs for analyzing and extracting insights from vast datasets in Hadoop clusters.
Designed and developed Power BI dashboards and visualizations to effectively communicate insights from complex datasets.
Optimized Kafka cluster performance for high throughput and low-latency data processing; implemented and maintained security measures in Kafka, including authentication and authorization.
Utilized Power BI features such as DAX (Data Analysis Expressions) for advanced calculations and analytics.
Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format (see the sketch below).
Used PySpark to store streaming data to HDFS as Avro files and implemented Spark for faster processing of data.
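A hedged sketch of the Avro raw layer to ORC service layer flow described above: read Avro, shape it with Spark SQL, and persist ORC into an internal table. The database, path, and column names are illustrative assumptions, and the spark-avro package is assumed to be available.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("raw_to_service_layer")
    .enableHiveSupport()          # assumes a Hive metastore for internal tables
    .getOrCreate()
)

# Read the Avro-formatted raw layer (placeholder HDFS path).
raw = spark.read.format("avro").load("hdfs:///data/raw/pos_transactions/")
raw.createOrReplaceTempView("raw_pos")

# Spark SQL transformation into the service-layer shape.
service = spark.sql("""
    SELECT store_id,
           product_id,
           CAST(amount AS DOUBLE)   AS amount,
           TO_DATE(transaction_ts)  AS transaction_date
    FROM raw_pos
    WHERE amount IS NOT NULL
""")

# Persist as ORC in a data service layer internal table.
service.write.mode("append").format("orc").saveAsTable("service_layer.pos_transactions")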
Configured and installed setup documents that allow Airflow to communicate with its PostgreSQL database and developed Airflow DAGs in Python by importing the Airflow libraries.
Environment: Erwin, SQL, MySQL, Kafka, Apache Airflow, Agile, HDFS, OLAP, Teradata, Hive, SSRS, Sqoop, Power BI, MapReduce, Terraform, Azure.

Change Healthcare, Nashville, TN | Jan 2016 - May 2018
Data Engineer
Responsibilities:
Created reports based on SQL queries using Business Objects; executive dashboard reports provide the most recent financial data from the company, broken down by business unit and product.
Conducted data analysis and mapping, as well as database normalization, performance tuning, query optimization, data extraction, transfer, and loading (ETL), and clean-up; used Informatica as the ETL tool.
Utilized SSIS to schedule and automate data integration workflows, creating scheduled jobs and packages that run at predefined intervals or in response to specific events.
Incorporated SSIS package configurations and parameters to make the integration processes flexible and configurable, allowing easy customization based on changing business requirements.
Developed and optimized Hive and Pig scripts for efficient querying and transformation of data stored in Hadoop.
Integrated Kafka with ETL processes to facilitate real-time data integration across the data infrastructure.
Responsible for gathering requirements, status reporting, developing various KPIs, and producing project deliverables.
Assisted with the migration of the warehouse database from Oracle 9i to Oracle 10g; improved report performance by rewriting SQL statements and utilizing Oracle's new built-in functions.
Created BO full-client reports and Web Intelligence reports in 6.5 and XI R2, and built universes with contexts and loops in 6.5 and XI R2.
Used Erwin extensively for data modeling and Erwin's dimensional data modeling.
Built HBase tables to load enormous amounts of structured, semi-structured, and unstructured data from UNIX, NoSQL, and several portfolios.
Used Pytest to simplify writing and executing tests, creating tests for data pipelines to ensure that transformations, data manipulations, and ETL processes produce the expected results.
Used Pylint to adhere to coding standards, ensuring consistent code formatting and improving the overall quality of code used in data processing scripts, data pipelines, and applications.
Developed reports, interactive drill charts, balanced scorecards, and dynamic dashboards using Teradata RDBMS analysis with Business Objects.
Collaborated with stakeholders to gather and understand data requirements for creating interactive and impactful Tableau reports.
Created a NoSQL database in MongoDB using CRUD, indexing, replication, and sharding.
Environment: SQL, Informatica, SSIS, Hive, Pig, Hadoop, Kafka, Oracle DB, Erwin, UNIX, NoSQL, Teradata, Tableau.

Knack Systems, India | Aug 2013 - Dec 2015
Big Data Engineer
Responsibilities:
Performed T-SQL tuning and optimized queries for SSIS packages.
Developed SQL queries to perform data extraction from existing sources and check format accuracy.
Implemented a wide range of data transformations using SSIS, such as data cleansing, aggregation, merging, and enrichment, leveraging the rich set of SSIS components and transformations to perform tasks like data type conversions, lookups, conditional splits, and data validation.
Used Akka for smooth data exchange between different components of the data ecosystem, including databases, data lakes, and analytics platforms.
Imported legacy data from SQL Server and Teradata into Amazon S3.
As part of the data migration, wrote many SQL scripts to reconcile mismatched data and worked on loading the history data.
Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata.
Troubleshot and resolved issues related to Hadoop cluster performance, stability, and job execution.
Collaborated with cross-functional teams to integrate Tableau solutions into comprehensive data architectures.
Worked on retrieving data from the file system to S3 using Spark commands.
Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
Co-developed the SQL Server database system to maximize performance benefits for clients.
Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text files and CSV files, and generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
Environment: SQL, AWS S3, T-SQL, SQL Server, Hadoop, Tableau, AWS Glacier, CSV, Spark SQL.