Vakul Reddy - Big Data Engineer
[email protected]
Location: Assaria, Kansas, USA
Relocation: Open
Visa:
Vakul Reddy Pannala
(234) 219-2017

SUMMARY OF EXPERIENCE
- 10+ years of IT industry experience, working in a Big Data capacity with the Hadoop ecosystem across internal and cloud-based platforms.
- Good understanding of distributed systems, HDFS architecture, and the internal working details of the MapReduce and Spark processing frameworks.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala (see the sketch following this summary).
- Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
- Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra, as well as PostgreSQL and Azure database services.
- Understanding of structured data sets, data pipelines, ETL tools, and data reduction, transformation and aggregation techniques; knowledge of tools such as dbt and DataStage.
- Hands-on experience with CQL (Cassandra Query Language) for data querying and manipulation in a distributed environment.
- Extensive experience in developing strategies for extraction, transformation and loading of data from various sources into data warehouses and data marts using DataStage.
- Solid experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Excellent hands-on experience in business requirement analysis and in designing, developing, testing and maintaining complete data management and processing systems, process documentation, and ETL technical and design documents.
- Experience in importing and exporting data using Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
- Good knowledge in writing MapReduce jobs through Pig, Hive and Sqoop.
- Responsible for data engineering functions including, but not limited to, data extraction, transformation, loading and integration in support of enterprise data infrastructure: data warehouses, operational data stores and master data management.
- Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files and XML files.
- Skilled in using columnar file formats like RCFile, ORC and Parquet.
- Good understanding of compression techniques used in Hadoop processing, such as gzip, Snappy and LZO.
- Optimized query speed by reducing data transport and using efficient data distribution and sorting algorithms in Amazon Redshift.
- Built end-to-end data processing pipelines and improved data architecture by integrating Amazon Redshift with other AWS services, including Amazon S3 and AWS Lambda.
- Developed and maintained automated build and deployment pipelines using Jenkins, ensuring efficient software delivery and reducing deployment time by 50%.
- Collaborated with cross-functional teams to configure and manage Jenkins agents for distributed builds and scalability.
- Strong experience in writing scripts using the Python API, PySpark API and Spark API for analyzing data.
- Proficient in using Celonis process mining software to analyze and optimize business processes.
- Demonstrated ability to identify process bottlenecks and implement effective solutions.
- Skilled in analyzing and interpreting large datasets to drive process optimization.
- Experience in generating reports and dashboards to visualize process performance and KPIs.
- Extensively used Python libraries including PySpark, Pytest, Pymongo, Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
- Designed, deployed, and managed highly available and scalable Confluent Kafka clusters to support real-time data streaming for a large-scale enterprise application.
- Implemented Kafka Connect connectors to integrate Kafka with various external systems, enabling seamless data ingestion and delivery.
- Expertise working with AWS cloud services like EMR, S3, Redshift and CloudWatch for big data development.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
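Illustrative sketch (not taken from any client codebase): a minimal PySpark example of the Hive/SQL-to-DataFrame conversion pattern described in the summary above. The table and column names (sales, region, amount) are hypothetical, and PySpark is shown here in place of the Scala mentioned.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Equivalent of: SELECT region, SUM(amount) AS total_amount
    #                FROM sales WHERE amount > 0 GROUP BY region
    sales_df = spark.table("sales")
    totals_df = (sales_df
                 .filter(F.col("amount") > 0)
                 .groupBy("region")
                 .agg(F.sum("amount").alias("total_amount")))
    totals_df.show()
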
IT SKILLS
- Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, Zookeeper, Hue, Ambari Server
- Languages: Python, R, SQL, Java, Scala, JavaScript
- NoSQL Databases: Cassandra, HBase, MongoDB
- Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML
- Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans
- Public Cloud: AWS, Azure
- Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
- Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI
- Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
- Databases: Microsoft SQL Server, MySQL, Oracle, DB2, Teradata, Netezza
- Operating Systems: All versions of Windows, UNIX, Linux, Macintosh HD, Sun Solaris

PROJECT EXPERIENCE

Client: Citizen Bank, Johnston, RI    Feb 2023 - Present
Role: Senior Big Data Engineer
Responsibilities:
- Designed, built and managed ELT data pipelines leveraging Airflow, Python and dbt.
- Developed and maintained automated build and deployment pipelines using Jenkins, ensuring efficient software delivery and reducing deployment time by 50%.
- Collaborated with cross-functional teams to configure and manage Jenkins agents for distributed builds and scalability.
- Conducted Jenkins training sessions for team members, improving their proficiency in continuous integration and deployment practices.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Deployed the initial Azure components like Azure Virtual Networks, Azure Application Gateway, Azure Storage and affinity groups.
- Responsible for managing data coming from different sources through Kafka.
- Worked with big data technologies like Spark, Scala, Hive and the Hadoop cluster (Cloudera platform).
- Built data pipelines with Data Fabric jobs using Sqoop, Spark, Scala and Kafka.
- Worked in parallel on the data side with Oracle and MySQL Server for source-to-target data design.
- Managed PostgreSQL databases, ensuring they were set up properly and optimized for maximum performance and dependability.
- Efficiently stored data by designing and implementing PostgreSQL data models that ensured adequate normalization.
- Enhanced database performance, decreased query execution times, and optimized SQL queries.
- Secured PostgreSQL databases by implementing authentication, authorization and encryption to prevent unauthorized access to sensitive data.
- Designed, deployed, and managed highly available and scalable Confluent Kafka clusters to support real-time data streaming for a large-scale enterprise application.
- Proficient in Apache Cassandra, a highly scalable and distributed NoSQL database management system.
- Architected and optimized cloud-based data solutions, leveraging cloud computing technologies and platforms (e.g., Snowflake) to store and process large volumes of data.
- Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS and PowerShell.
- Designed and implemented a product search service using Apache Solr PDVs.
- Wrote programs using Spark to move data from a storage input location to an output location, running data loading, validation and transformation on the data.
- Extensive knowledge of and experience in the functional Healthcare Insurance domain.
- Designed a highly efficient data model for optimizing large-scale queries, utilizing Hive complex data types and the Parquet file format.
- Used Cloudera Manager for continuous monitoring and management of the Hadoop cluster, working with application teams to install operating system and Hadoop updates, patches and version upgrades as required.
- Developed data pipelines using Sqoop, Pig and Hive to ingest customer member, clinical, biometrics, lab and claims data into HDFS to perform data analytics.
- Analyzed Teradata procedures and imported data from Teradata to a MySQL database for HiveQL analysis, developing Hive queries with UDFs where default Hive functions were not available.
- Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems.
- Managed and reviewed Hadoop log files.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Used Scala functions, dictionaries and data structures (arrays, lists, maps) for better code reusability, and performed unit testing based on the development work.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Primarily involved in the data migration process using Azure, integrating with a GitHub repository and Jenkins.
- Used Spark DataFrame operations to perform required validations on the data and to perform analytics on the Hive data (see the sketch after this section).
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Used the Hadoop Resource Manager to monitor jobs run on the Hadoop cluster; monitored the Spark cluster using Log Analytics and the Ambari Web UI.
- Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
- Worked extensively on Azure Data Factory, including data transformations, integration runtimes, Azure Key Vault, triggers, and migrating data factory pipelines to higher environments using ARM templates.
- Worked on developing ETL processes (DataStage Open Studio) to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
- Wrote multiple MapReduce jobs using the Java API, Pig and Hive for data extraction, transformation and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV and ORC, with compression codecs like gzip, Snappy and LZO.
- Implemented a data lake to consolidate data from multiple source databases such as Exadata and Teradata using Hadoop stack technologies (Sqoop, Hive/HQL).
- Developed real-time streaming applications integrated with Kafka and NiFi to handle large-volume, high-velocity data streams in a scalable, reliable and fault-tolerant manner for Confidential campaign management analytics.
Environment: Spark, Kafka, dbt (data build tool), DataStage, DB2, Snowflake, MapReduce, Python, Hadoop, Hive, Pig, PySpark, Spark SQL, Azure SQL DW, Databricks, Azure Synapse, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Oracle 12c, Cassandra, Git, Zookeeper, Oozie, Confluent Kafka.
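Illustrative sketch of the DataFrame-based validation and Hive load pattern described in the section above. This is a minimal example under assumed inputs: the ADLS path, schema and table names are hypothetical, not from the client environment.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("ingest-validate-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical raw-zone path in ADLS Gen2; any Parquet source would do.
    raw_df = spark.read.parquet("abfss://raw@examplestorage.dfs.core.windows.net/claims/")

    # Basic DataFrame validations: drop records missing the key, flag bad amounts,
    # and stamp a load date used as the Hive partition column.
    valid_df = (raw_df
                .filter(F.col("member_id").isNotNull())
                .withColumn("amount_is_valid", F.col("claim_amount") >= 0)
                .withColumn("load_date", F.current_date()))

    # Append into a partitioned Hive table (hypothetical name).
    (valid_df.write
             .mode("append")
             .partitionBy("load_date")
             .saveAsTable("analytics.claims_validated"))
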
Client: Stifel, St Louis, MO    Sep 2021 - Jan 2023
Role: Big Data Engineer
Responsibilities:
- Used Spark Streaming APIs to perform necessary transformations and actions on data received from Kafka.
- Implemented Kafka Connect connectors to integrate Kafka with various external systems, enabling seamless data ingestion and delivery.
- Developed workflows in Oozie to automate the tasks of loading data into NiFi and pre-processing with Pig.
- Worked on Apache NiFi to decompress and move JSON files from local storage to HDFS.
- Worked with sources like Access, Excel, CSV, Oracle and flat files using the connectors, tasks and transformations provided by AWS Data Pipeline.
- Worked with JSON format files using the XML and Hierarchical DataStage stages.
- Extensively used parallel stages like Row Generator, Column Generator, Head and Peek for development and debugging purposes.
- Mentored and guided analysts on building purposeful analytics tables in dbt for cleaner schemas.
- Developed a Python script to transfer data from on-premises systems to AWS (see the sketch after this section).
- Worked on tuning SQL queries to bring down run time by working on indexes and execution plans.
- Reduced analytical query response times and improved query speed by implementing query optimization techniques in Amazon Redshift.
- Created and executed comprehensive recovery and backup strategies for Amazon Redshift, protecting data and reducing the likelihood of loss.
- Strong understanding of AWS components such as EC2 and S3.
- Designed and implemented a configurable data delivery pipeline for scheduled updates to customer-facing data stores, built with Python.
- Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue and Step Functions.
- Developed Hive UDFs to incorporate external business logic into Hive scripts, and developed join data set scripts using Hive join operations.
- Created various Hive external tables and staging tables and joined the tables as per the requirements.
- Implemented static partitioning, dynamic partitioning and bucketing.
- Developed custom Kafka producers and consumers for publishing and subscribing to Kafka topics.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning and debugging.
- Used the DataStage Director and its run-time engine to schedule the solution, test and debug its components, and monitor the resulting executable versions on an ad hoc or scheduled basis.
- Stored data in AWS S3, used like HDFS, and ran EMR programs on the stored data.
- Used the AWS CLI to suspend an AWS Lambda function.
- Used the AWS CLI to automate backups of ephemeral data stores to S3 buckets and EBS.
- When data was not available on the HDFS cluster, used Sqoop to bring the data from Netezza onto the HDFS cluster.
- Transferred data using the Informatica tool from AWS S3 to AWS Redshift.
- Worked on Hive UDFs; due to security privilege restrictions, the task had to be stopped midway.
- Implemented a continuous delivery pipeline with Docker, GitHub and AWS.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Set up the whole application stack, and set up and debugged Logstash to send Apache logs to AWS Elasticsearch.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Implemented a variety of AWS computing and networking services to meet the needs of applications.
- Wrote HiveQL as per the requirements, processed data in the Spark engine and stored it in Hive tables.
- Imported existing datasets from Oracle to the Hadoop system using Sqoop.
- Brought data from various sources into Hadoop and Cassandra using Kafka.
- Experienced in using the Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
- Modeled, lifted and shifted custom SQL and transposed LookML into dbt for materializing incremental views.
- Applied Spark Streaming for real-time data transformation.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
- Implemented Composite server for data virtualization needs and created multiple views for restricted data access using a REST API.
- Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, enabling automatic suggestions, using Kinesis Firehose and an S3 data lake.
- Allotted permissions, policies and roles to users and groups using AWS Identity and Access Management (IAM).
Environment: Hadoop (HDFS, MapReduce), dbt, DataStage, Scala, Spark, Spark SQL, DB2, Snowflake, Impala, Hive, MongoDB, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, Oracle, AWS services (Lambda, EMR, Auto Scaling), MySQL, Python.
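Illustrative sketch of an on-premises-to-AWS transfer script of the kind described above, using boto3; the bucket, key and file path are hypothetical.

    import boto3

    def upload_extract(local_path: str, bucket: str, key: str) -> None:
        """Upload one local extract file to S3; credentials come from the environment."""
        s3 = boto3.client("s3")
        s3.upload_file(local_path, bucket, key)

    if __name__ == "__main__":
        # Hypothetical file, bucket and key.
        upload_extract("/data/exports/orders_20220131.csv",
                       "example-raw-zone",
                       "orders/2022/01/orders_20220131.csv")
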
Client: Dana Farber Cancer Institute, Boston, MA    Jun 2019 - Aug 2021
Role: Data Engineer
Responsibilities:
- Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows.
- Worked with and learned a great deal from AWS cloud services like EC2, S3, EBS, RDS and VPC.
- Developed a Python script to transfer data from on-premises systems to AWS S3.
- Used Spark Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time (see the sketch after this section).
- Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism and memory tuning.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Helped maintain and troubleshoot UNIX and Linux environments.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
- Built pipelines to move hashed and un-hashed data from XML files to the data lake.
- Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python) and Impala.
- Experienced in writing live real-time processing using Spark Streaming with Kafka.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Developed a Pig program for loading and filtering streaming data into HDFS using Flume.
- Experienced in handling data from different datasets, joining them and pre-processing using Pig join operations.
- Developed an HBase data model on top of HDFS data to perform real-time analytics using the Java API.
Environment: Spark, Kafka, Hadoop, HDFS, Spark SQL, AWS, Python, MapReduce, Pig, Hive, Oracle 11g, MySQL, MongoDB, HBase, Oozie, Zookeeper, Tableau.
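Illustrative sketch of near-real-time ingestion from Kafka, in the spirit of the Spark Streaming work described above. This minimal example uses Spark Structured Streaming; the broker, topic, output and checkpoint paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Subscribe to a hypothetical topic; Kafka delivers key/value as binary.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "learner-events")
              .load())

    # Cast the binary value to string for downstream parsing.
    parsed = events.select(F.col("value").cast("string").alias("payload"))

    # Land the stream as Parquet; checkpointing makes the query fault tolerant.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/data/lake/learner_events/")
             .option("checkpointLocation", "/data/checkpoints/learner_events/")
             .start())
    query.awaitTermination()
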
Client: Reinsurance Group of America, Minneapolis, MN    Jan 2016 - May 2019
Role: Hadoop Developer
Responsibilities:
- Experience in writing stored procedures and complex SQL queries using relational databases like Oracle, SQL Server and MySQL.
- Developed various mappings with a collection of sources, targets and transformations using Informatica Designer.
- Used Hive to implement the data warehouse and stored data in HDFS.
- Stored data in Hadoop clusters set up in AWS EMR.
- Performed data preparation using Pig Latin to get the right data format needed.
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats like text and CSV files.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Responsible for data extraction and data ingestion from different data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
- Involved in designing the row key in HBase to store text and JSON as key values in HBase tables, designing the row key so it can be retrieved and scanned in sorted order.
- Created Hive schemas using performance techniques like partitioning and bucketing (see the sketch at the end of this resume).
- Used Hadoop YARN to perform analytics on data in Hive.
- Developed and maintained batch data flows using HiveQL and Unix scripting.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Involved in creating Hive tables and loading and analyzing data using Hive queries.
- Developed Hive queries to process the data and generate data cubes for visualization.
Environment: Hadoop, MapReduce, HBase, JSON, Spark, Kafka, Hive, Pig, Hadoop YARN, Spark Core, Spark SQL, Scala, Python, Java, Sqoop, Impala, Oracle, Linux, Oozie.

Client: Healthnet India Pvt Ltd - Delhi, India    July 2013 - Oct 2015
Role: Junior Hadoop Developer
Responsibilities:
- Implemented Avro and Parquet data formats for Apache Hive computations to handle custom business requirements.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed MapReduce programs for data analysis and data cleaning.
- Extensively used SSIS transformations such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, Script Task and Send Mail Task.
- Implemented Apache Pig scripts to load data from and store data into Hive.
Environment: Hive, Hadoop, Cassandra, Pig, Sqoop, Oozie, Python, MS Office.
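Illustrative sketch of the Hive partitioning and bucketing techniques referenced in the Hadoop Developer roles above, expressed through Spark SQL; the database, table and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partition-bucket-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical table: partitioned by event_date, bucketed on policy_id.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS warehouse.policy_events (
            policy_id STRING,
            event_type STRING,
            event_amount DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        CLUSTERED BY (policy_id) INTO 16 BUCKETS
        STORED AS ORC
    """)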