Venkata Lakshmi Peri
Senior Data Engineer
E-mail: [email protected]
Phone: 870-384-5816
Location: Austin, Texas, USA
Relocation: Yes
Visa: Green Card (GC)


Career Prologue:
Data Engineer with over 10 years of experience specializing in data-intensive application design using the Hadoop ecosystem, Big Data analytics, cloud data engineering and data warehouse/data mart solutions, with additional expertise in data visualization, reporting and data quality.
Strong command of Python, Java, Scala and SQL for efficiently manipulating, processing and querying data across diverse data processing requirements.
Deep understanding of SDLC enables effective management of the development, maintenance and enhancement of data systems, ensuring high data quality and system reliability.
Proficient in Python frameworks such as Django, Flask and FastAPI for building efficient APIs and web applications that streamline data processing and improve user interfaces.
Skilled with libraries such as Pandas, NumPy and Scikit-learn for tasks ranging from routine data manipulation and analysis to complex machine learning implementations.
Design efficient, scalable database schemas and models that optimize data storage and retrieval in support of organizational needs.
Proficient in Hadoop architecture; led the development of enterprise-level solutions using Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, NiFi, Kafka and YARN. Also experienced in building Power BI reports on Azure Analysis Services for optimal performance.
Proficient with cloud platforms including AWS, Azure and GCP for managing scalable data storage, compute resources and cloud-based data engineering tooling.
Design and implement ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes that integrate data from multiple sources into cohesive, functional datasets.
Expertise in collaborating with data scientists to design optimized data models for analytics and machine learning, enhancing query performance and analysis.
Proficient in implementing Python and SQL-based data quality frameworks to automate anomaly detection and ensure data integrity through continuous monitoring.
Skilled in leading data governance initiatives, aligning with GDPR and CCPA compliance and developing strategies for secure and accessible data handling across departments.
Familiarity with operationalizing machine learning models for production using Docker and Kubernetes, streamlining updates and monitoring for effectiveness.
Expert in relational databases such as PostgreSQL and MySQL and in data warehousing technologies such as Amazon Redshift and Snowflake for efficient data storage and retrieval.
Implement real-time data pipelines with tools such as Apache Kafka and Apache NiFi to enable continuous data processing and integration.
Transform complex data sets into understandable, actionable insights through interactive dashboards and reports built with Tableau, Grafana, Apache Superset and Power BI.
Ensure the accuracy, consistency and security of data by implementing robust data governance policies and conducting thorough quality checks.
Apply machine learning algorithms for predictive analysis and to extract valuable insights from data, contributing significantly to data science initiatives.
Use CI/CD tools to automate data pipeline workflows, ensuring efficient and reliable deployment of data applications and systems.
Experienced with version control tools such as Git, GitHub, Bitbucket and SVN for source code management and team collaboration.
Comfortable with Linux/Unix and shell scripting to automate data engineering tasks, improving efficiency and consistency.
Proficient with reporting tools such as SSRS and Crystal Reports for producing periodic, insightful reports and dashboards that support business decision-making and strategy.

Technical Skills:

Programming Languages: Python, Java, Scala, SQL
Frameworks & Libraries: Django, Flask, FastAPI, Pandas, NumPy, Scikit-learn
Big Data Technologies: Hadoop Ecosystem (Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, NiFi, Kafka, YARN)
Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP)
Data Warehousing Technologies: Amazon Redshift, Snowflake, Azure Analysis Services
Data Visualization Tools: Tableau, Power BI, Grafana, Apache Superset
Database Management Systems: PostgreSQL, MySQL, Microsoft SQL Server, Oracle
ETL/ELT Tools: AWS Glue, Azure Data Factory, Informatica, Talend
Data Processing & Integration Tools: Apache Kafka, Apache NiFi, Apache Spark, Talend
Data Governance & Quality: Python/SQL-based data quality frameworks, GDPR and CCPA compliance practices
Version Control Systems: Git, GitHub, Bitbucket, SVN
Operating Systems & Scripting: Linux/Unix, Shell Scripting
Reporting Tools: SQL Server Reporting Services (SSRS), Crystal Reports

Learning & Credentials:

Bachelor of Computer Science from VIT, 2013

Qualification Badges:

Microsoft Certified: Azure Data Engineer Associate.
AWS Certified Developer - Associate.

Professional Track Record:

Client: Goldman Sachs, NY Nov 2022 - Present
Senior Data Engineer

Mission & Contributions:
Developed robust and scalable web applications and microservices using Django's Model-View-Template architecture and Flask's flexible framework.
Performed data cleaning, transformation and complex mathematical and statistical operations on large datasets using Pandas DataFrames and NumPy arrays.
Successfully spearheaded the migration of cluster data to Azure Cloud, a critical step in enabling the incorporation of real-time features and enhancing system responsiveness.
Expertly utilized Application Insights in Azure to craft custom dashboards, employing the Application Insights Query Language to process and visualize metrics, thereby providing actionable insights from complex data sets.
Engineered a specialized Kafka message consumer that forwards messages to both Azure Service Bus and Event Hub, enhancing system communication and data flow (illustrative sketch at the end of this role).
Implemented robust Spark ETL jobs within Azure HDInsight, optimizing and streamlining ETL operations, significantly improving data processing efficiency in the Azure Cloud environment.
Designed and constructed auto-scalable functions, effectively managing data transfer between Azure Service Bus or Event Hub and Cosmos DB, resulting in improved scalability and operational efficiency.
Authored intricate JSON scripts for deploying pipelines within Azure Data Factory (ADF), focusing on seamless data processing through Cosmos Activity. This implementation enabled more efficient handling and transformation of data, particularly in real-time applications.
Developed dynamic, high-performance streaming dashboards, which provide instant insights and enhanced visualization of data streams, greatly aiding in real-time decision-making and analysis.
Configured Azure Synapse Analytics to serve as a sophisticated platform for data warehousing, integration and business intelligence reporting, thereby ensuring efficient and comprehensive data management and analysis.
Advanced the capabilities of Azure Data Factory, focusing on the design and implementation of scalable data pipelines. This initiative facilitated efficient data ingestion, storage and exploitation of advanced features such as data partitioning, replication and data warehousing, greatly enhancing data management capabilities.
Managed the ingestion of real-time data from Event Hub into a custom message consumer, involving the implementation of robust and efficient data ingestion pipelines, crucial for timely data processing and analysis.
Developed and deployed sophisticated machine learning models using Azure Machine Learning, focusing on predictive analytics and other data science tasks, thereby enabling more accurate forecasting and strategic decision-making.
Implemented Azure Kubernetes Service (AKS), focusing on container orchestration and management. This allowed for scalable and reliable deployment of data processing and analysis applications, enhancing overall system performance and reliability.
Established comprehensive end-to-end job automation using Airflow and Oozie. Designed and developed automated workflows that incorporated advanced scheduling and dependency management, enabling efficient orchestration of complex data processing tasks.
Utilized advanced data modeling, warehousing and visualization techniques to generate accurate, timely and actionable business reports, derived from various data streams, thus supporting informed decision-making and strategic planning.
Developed extensive Azure Databricks notebooks for in-depth data exploration, preprocessing and transformation tasks, facilitating more effective data analysis and insights generation.
Performed detailed exploratory data analysis (EDA) on healthcare datasets using statistical techniques and Tableau.
Implemented Azure DevOps for efficient management of code repositories, continuous integration/continuous deployment (CI/CD) pipelines and release management within data engineering projects, thereby enhancing overall project efficiency and effectiveness.
Developed and managed SQL Server Reporting Services (SSRS) reports to enable data-driven decision-making.

Environment & Setup: Django, Flask, Pandas, NumPy, Azure Cloud, Application Insights in Azure, Kafka, Azure Service Bus, Azure Event Hub, Spark, Azure HDInsight, Cosmos DB, Azure Data Factory (ADF), Azure Synapse Analytics, Azure Machine Learning, Azure Kubernetes Service (AKS), Airflow, Oozie, Tableau, Azure Databricks, Azure DevOps, SQL Server Reporting Services (SSRS).
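
The Kafka-to-Azure messaging work above can be illustrated with a minimal, hedged sketch: a consumer that reads records from a Kafka topic and forwards them to Azure Event Hub in batches. This assumes the kafka-python and azure-eventhub client libraries; the topic, broker, consumer group and Event Hub names are placeholders, and the production consumer also routed messages to Azure Service Bus with retry and monitoring logic.

# Minimal sketch: forward Kafka records to Azure Event Hub.
# All names and connection strings below are placeholders, not production values.
from kafka import KafkaConsumer                                # kafka-python
from azure.eventhub import EventData, EventHubProducerClient   # azure-eventhub

consumer = KafkaConsumer(
    "ingest-events",                         # hypothetical topic
    bootstrap_servers=["broker-1:9092"],     # hypothetical broker
    group_id="eventhub-forwarder",
)
producer = EventHubProducerClient.from_connection_string(
    "<event-hub-connection-string>",         # injected from a secret store in practice
    eventhub_name="ingest-hub",              # hypothetical Event Hub
)

with producer:
    batch = producer.create_batch()
    for record in consumer:                  # record.value is the raw message bytes
        try:
            batch.add(EventData(record.value))
        except ValueError:                   # current batch is full: flush and restart
            producer.send_batch(batch)
            batch = producer.create_batch()
            batch.add(EventData(record.value))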

Client: American Credit Acceptance Sept 2020 - Oct 2022
Senior Data Engineer

Mission & Contributions:
Customized Django's user model to suit specific application requirements, ensuring a robust and secure user management system.
Leveraged Pandas time-series functionality to handle and analyze time-stamped data, essential for trend analysis and forecasting in various domains.
Efficiently managed end-to-end data processing workflows, encompassing data ingestion, cleansing and transformation, by adeptly employing AWS Lambda, AWS Glue and Step Functions, ensuring smooth and effective data flow across the pipeline.
Architected and deployed a highly scalable, serverless infrastructure using API Gateway, Lambda functions and DynamoDB, optimized for performance and scalability.
Streamlined Lambda code deployment from S3 buckets, significantly improving the reliability and efficiency of service executions.
Set up comprehensive monitoring systems for Lambda functions and Glue Jobs, including alarms, notifications and logs via CloudWatch, for enhanced operational oversight.
Integrated Lambda with SQS and DynamoDB through Step Functions, facilitating efficient message processing and real-time status updates.
Extracted and refined data from diverse sources like S3, Redshift and RDS, establishing Glue Catalog tables/databases using Glue Crawlers.
Engineered and executed Glue ETL jobs in Glue Studio, optimizing data processing and enabling sophisticated transformations before loading into S3, Redshift and RDS (illustrative sketch at the end of this role).
Conducted thorough architectural and implementation assessments of key AWS services, including Amazon EMR, Redshift and S3.
Leveraged AWS EMR for large-scale data processing and integration with S3, orchestrating S3 event notifications, SNS topics, SQS queues and Lambda functions for streamlined Slack message alerts.
Automated data integration into AWS data lakes (S3, Redshift, RDS) using AWS Kinesis Data Firehose for streaming data sources.
Developed Terraform scripts for the automated deployment of AWS infrastructure, encompassing EC2, S3, EFS, EBS, IAM Roles, Snapshots and Jenkins Server.
Authored PL/SQL packages, database triggers and user procedures, along with comprehensive user documentation for new programs.
Designed and implemented interactive, real-time dashboards using Grafana to monitor system performance, data processing pipelines and key business metrics, enhancing operational visibility and decision-making.
Executed complex data transformations and aggregations using Scala with Apache Spark libraries and frameworks.
Integrated streaming ingestion services, merging batch and real-time data processing with Spark Streaming and Kafka.
Specialized in constructing data lakes and pipelines using Big Data technologies such as Apache Hadoop, Cloudera, HDFS, MapReduce, Spark, YARN, Delta Lake and Hive.
Performed seamless data loading from HDFS to Hive using Hive Load Queries and facilitated data exchange between Hive and Netezza via Sqoop.
Implemented Parquet files and ORC format in PySpark and Spark Streaming, enhancing DataFrame operations.
Applied dimensional data modeling techniques, including Star and Snowflake schemas, and leveraged frameworks like Lambda Architecture and Oozie.
Employed CI/CD tools like Jenkins and Bitbucket for streamlined code repository management, build automation and deployment of Python codebases.
Leveraged Crystal Reports to craft and manage detailed, dynamic reports that visualize complex datasets.

Environment & Setup: AWS Lambda, AWS Glue, Step Functions, API Gateway, DynamoDB, S3, CloudWatch, SQS, Redshift, RDS, Glue Crawlers, Glue Studio, Amazon EMR, SNS, AWS Kinesis Data Firehose, Terraform, EC2, EFS, EBS, IAM Roles, Snapshots, Jenkins, PL/SQL, Grafana, Scala, Apache Spark, Spark Streaming, Kafka, Apache Hadoop, Cloudera, HDFS, MapReduce, YARN, Delta Lake, Hive, Sqoop, Parquet, ORC, PySpark, Bitbucket, Crystal Reports.
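
As a concrete illustration of the Glue ETL work above, here is a hedged, minimal PySpark script in the shape Glue Studio generates: read a crawler-built catalog table, retype a few columns and land curated Parquet in S3. The database, table, column and bucket names are assumptions for illustration, not the client's actual catalog.

# Minimal AWS Glue ETL sketch; catalog, column and bucket names are hypothetical.
import sys

from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered by a Glue Crawler.
source = glue_context.create_dynamic_frame.from_catalog(
    database="acceptance_raw", table_name="loan_applications"
)

# Light transformation: keep and retype only the columns downstream jobs need.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("application_id", "string", "application_id", "string"),
        ("applied_at", "string", "applied_at", "timestamp"),
        ("requested_amount", "double", "requested_amount", "double"),
    ],
)

# Land curated Parquet files in S3 for Redshift / downstream loads.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://acceptance-curated/loan_applications/"},
    format="parquet",
)

job.commit()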

Client: JetBlue Nov 2017 - Aug 2020
Senior Data Engineer

Mission & Contributions:
Masterfully utilized Google Cloud components, Google Container Builder, GCP client libraries and the Cloud SDK to architect and deploy data-intensive applications.
Successfully migrated an Oracle SQL ETL process to run on Google Cloud Platform (GCP) using Cloud Dataproc and BigQuery.
Stored data files in Google Cloud Storage (GCS) buckets daily, leveraging DataProc and BigQuery to develop and maintain GCP-based solutions.
Compared self-hosted Hadoop infrastructure with GCP's Dataproc, evaluating performance, and explored use cases and performance characteristics of Bigtable (managed, HBase-compatible storage).
Developed BigQuery authorized views to ensure row-level security and facilitate data sharing with other teams.
Leveraged Cloud Pub/Sub to trigger Airflow jobs within the GCP environment, enabling seamless orchestration and execution of data pipelines.
Built data pipelines in Airflow on GCP's Composer, utilizing various Airflow operators such as Bash, Hadoop, Python callable and branching operators (illustrative sketch at the end of this role).
Played a pivotal role in prototyping a NiFi-based big data pipeline, showcasing end-to-end data ingestion and processing scenarios.
Developed Informatica mappings to load data from multiple sources into the Data Warehouse, utilizing transformations such as Source Qualifier, Expression, Lookup, Aggregate, Update Strategy and Joiner.
Implemented ETL pipelines using Spark and Hive for seamless data ingestion from multiple sources.
Worked extensively with Presto, Hive, Spark SQL and BigQuery, harnessing Python client libraries to develop efficient and interoperable analytics programs.
Led migration of the MapReduce jobs to Spark RDD transformations using Python, driving improved performance and scalability.
Developed and executed SQL queries and scripts to ensure data integrity, encompassing checks for duplicates, null values, truncated values and accurate data aggregations.
Leveraged Apache Superset for creating and sharing comprehensive data visualizations and dashboards, facilitating deeper insights into data analytics and supporting strategic business decisions with interactive reporting.
Leveraged Apache Parquet with Hive to optimize data storage and retrieval within the Hadoop ecosystem.
Utilized Kafka message broker and custom Java code to transfer incoming log files to the Parser, loading the data into HDFS and HBase.
Employed GitHub for version control, while utilizing Jira and Confluence for streamlined documentation.
Created interactive dashboards in Tableau, establishing ODBC connections to various data sources including the Presto SQL engine.

Environment & Setup: Google Cloud components, Google Container Builder, GCP client libraries, Cloud SDK, Cloud Dataproc, BigQuery, Google Cloud Storage (GCS), Bigtable, Hadoop, Cloud Pub/Sub, Airflow, GCP Composer, NiFi, Informatica, Spark, Hive, Presto, Spark SQL, Apache Superset, Python client libraries, MapReduce, Apache Parquet, Kafka, Java, GitHub, Jira, Confluence, Tableau and ODBC.
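
A minimal, hedged sketch of a Composer/Airflow DAG in the style referenced above, combining Bash, Python callable and branching operators. The DAG id, schedule, bucket paths and branch condition are illustrative assumptions rather than the actual production pipeline.

# Minimal Airflow DAG sketch (Airflow 2.x module paths); all names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def _load_to_bigquery(**context):
    # Placeholder for the real load step (e.g. a BigQuery client or operator call).
    print("loading partition", context["ds"])


def _choose_path(**context):
    # Branch on the run date: weekdays load, weekends skip.
    run_date = datetime.strptime(context["ds"], "%Y-%m-%d")
    return "load_to_bigquery" if run_date.weekday() < 5 else "skip_weekend"


with DAG(
    dag_id="gcs_to_bigquery_daily",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage_files = BashOperator(
        task_id="stage_files",
        bash_command="gsutil cp gs://raw-bucket/{{ ds }}/*.csv gs://staging-bucket/{{ ds }}/",
    )
    choose_path = BranchPythonOperator(task_id="choose_path", python_callable=_choose_path)
    load_to_bigquery = PythonOperator(task_id="load_to_bigquery", python_callable=_load_to_bigquery)
    skip_weekend = BashOperator(task_id="skip_weekend", bash_command="echo 'weekend - skipping'")

    stage_files >> choose_path >> [load_to_bigquery, skip_weekend]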

Client: HCA Healthcare Nov 2015 - Jul 2017
Data Engineer

Mission & Contributions:
Leveraged a deep understanding of the SDLC to manage the development, maintenance and enhancement of complex data systems, ensuring high data quality and system reliability through rigorous testing and validation processes.
Utilized Python frameworks including Django, Flask and FastAPI to develop efficient APIs and web applications, streamlining data processing workflows and enhancing user interfaces to improve user experience and system functionality (illustrative sketch at the end of this role).
Demonstrated expertise in leveraging libraries such as Pandas, NumPy and Scikit-learn for a wide range of data engineering tasks, from routine data manipulation and analysis to the implementation of complex machine learning algorithms, significantly contributing to data-driven decision-making processes.
Designed and implemented efficient, scalable database schemas and models to optimize data storage and retrieval operations, supporting the organization's data management needs and ensuring the integrity and accessibility of data across various systems.
Led the development of enterprise-level solutions within the Hadoop ecosystem using Apache Spark, MapReduce and HDFS, enhancing data processing capabilities and system scalability.
Expertly built Power BI reports on Azure Analysis Services, delivering optimal performance and insightful data visualizations to support strategic business analysis and reporting requirements.
Administered Oracle database systems, focusing on performance tuning, schema design and security to support critical business applications, ensuring high availability and data integrity in a high-volume transaction environment.
Utilized Bitbucket for version control and team collaboration in software development projects, ensuring efficient management of branches, pull requests and automated CI/CD pipelines for timely and reliable application deployments.
Enhanced data operations and system deployments by automating tasks and workflows with Linux/Unix operating systems, boosting efficiency and reliability.

Environment & Setup: SDLC, Python, Django, Flask, FastAPI, Pandas, NumPy, Scikit-learn, Hadoop, Apache Spark, MapReduce, HDFS, Power BI, Azure Analysis Services, Oracle, Bitbucket, Linux/Unix.
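
A small, hedged sketch of the FastAPI-plus-Pandas pattern mentioned above: an endpoint that serves summary statistics from a tabular extract. The file name and column names are hypothetical stand-ins for the governed datasets actually used.

# Minimal FastAPI + Pandas sketch; file and column names are hypothetical.
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI()
df = pd.read_csv("encounters.csv")  # hypothetical extract with department and length_of_stay columns


@app.get("/metrics/{department}")
def department_metrics(department: str) -> dict:
    subset = df[df["department"] == department]
    if subset.empty:
        raise HTTPException(status_code=404, detail="unknown department")
    return {
        "department": department,
        "encounters": len(subset),
        "avg_length_of_stay": float(subset["length_of_stay"].mean()),
    }

Served under an ASGI server such as uvicorn, an endpoint of this shape lets reporting tools pull pre-aggregated figures instead of raw tables.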

Client: Sagarsoft, India Dec 2013 - July 2015
Data Analyst

Mission & Contributions:
Designed and developed interactive web dashboards using Django, complemented by HTML, CSS and JavaScript, to present analytical findings and insights in an accessible manner to stakeholders.
Managed and manipulated data from relational databases such as PostgreSQL, including designing schemas and performing advanced SQL queries for data analysis purposes, ensuring data accuracy and integrity.
Utilized SQL queries for data extraction, filtering and aggregation, supporting comprehensive analysis and meeting specific reporting requirements for data-driven decision-making (illustrative sketch at the end of this role).
Enhanced data retrieval performances by optimizing SQL queries and implementing indexing strategies, resulting in improved database efficiency and user experience in reporting tools.
Designed and utilized RESTful APIs, developed with Django Rest Framework, to streamline data flow and integration across various data analysis tools and platforms, facilitating seamless data exchange.
Applied Python libraries like Pandas, NumPy and SciPy for sophisticated data processing, cleansing and statistical analysis, translating raw data into actionable insights.
Created visual representations of complex data sets using Matplotlib and Seaborn, enabling stakeholders to grasp insights through intuitive visualizations and dashboards.
Ensured the accuracy and reliability of data analysis by developing unit tests and performing integration testing, identifying discrepancies and enhancing the data validation process.
Managed project codebases and collaborated on data analysis projects using the version control system Git, fostering team collaboration and efficient code management.

Environment & Setup: Python, Django, HTML, CSS, JavaScript, PostgreSQL, SQL, Django Rest Framework, Pandas, NumPy, SciPy, Matplotlib, Seaborn and Git.
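
A brief, hedged sketch of the PostgreSQL-to-Pandas-to-Matplotlib workflow described above: aggregate in SQL, finish shaping in Pandas, then plot for stakeholders. The connection string, table and column names are illustrative only.

# Minimal sketch: aggregate in PostgreSQL, shape in Pandas, plot with Matplotlib.
# Connection details, table and column names are placeholders.
import matplotlib.pyplot as plt
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://analyst:secret@localhost:5432/salesdb")

# Aggregate in SQL, then finish shaping in Pandas.
monthly = pd.read_sql(
    """
    SELECT date_trunc('month', order_date) AS month, SUM(amount) AS revenue
    FROM orders
    GROUP BY 1
    ORDER BY 1
    """,
    engine,
)
monthly["revenue_3mo_avg"] = monthly["revenue"].rolling(3).mean()

# Plot the trend for the stakeholder dashboard.
ax = monthly.plot(x="month", y=["revenue", "revenue_3mo_avg"], figsize=(8, 4))
ax.set_title("Monthly revenue with 3-month rolling average")
plt.tight_layout()
plt.savefig("monthly_revenue.png")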