Chilumula Sanjeeva Rao - Senior Data Engineer
[email protected] | (475)-209-3157
LinkedIn: https://www.linkedin.com/in/engineer-sanjeev/
Location: Remote, USA
Relocation: Yes
Visa: GC
PROFESSIONAL SUMMARY
Overall 10+ years of experience in the IT industry, including 6 years as a Spark/Hadoop developer working with big data technologies such as Spark and the Hadoop ecosystem, and around 2 years in Java technologies and SQL.
Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming.
Hands-on experience with Spark using Java; expertise in creating Spark RDDs (Resilient Distributed Datasets) and performing transformations and actions.
Expertise in using Spark SQL with various data sources such as CSV and JSON files, applying transformations, and saving results in different file formats (illustrated in the sketch at the end of this summary).
Expertise in loading and reading data into Hive using Spark SQL.
Developed Spark scripts using Java and shell commands as per requirements.
Hands-on experience installing, configuring, and using Hadoop ecosystem components such as HDFS, MapReduce, Hive, Pig, HBase, Sqoop, and Flume.
Working experience importing and exporting data with Sqoop between HDFS and relational database systems for further processing.
Experienced with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new features.
Experience developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
Experience moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop.
Expertise with Hive data warehouse infrastructure: creating tables, distributing data with partitioning and bucketing, and developing and tuning HQL queries.
Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
Involved in creating Hive tables, loading data, and writing ad-hoc Hive queries that run internally on MapReduce and Tez; replaced existing MR jobs and Hive scripts with Spark SQL and Spark transformations for efficient data processing.
Experience developing Kafka producers and consumers for streaming millions of events per second.
Strong understanding of real-time streaming technologies such as Spark and Kafka.
Knowledge of job workflow management and coordination tools such as Oozie.
Strong experience building end-to-end data pipelines on the Hadoop platform.
Experience with NoSQL database technologies, including MongoDB, Cassandra, and HBase.
Strong understanding of logical and physical database models and entity-relationship modeling.
Experience with software development tools such as JIRA, Play, and Git.
Good understanding of data modeling (dimensional and relational) concepts such as star schema modeling, snowflake schema modeling, and fact and dimension tables.
Experience manipulating and analyzing large datasets and finding patterns and insights within structured and unstructured data.
Experienced with Agile methodologies, including Extreme Programming, Scrum, and Test-Driven Development (TDD).
Excellent analytical, communication, and interpersonal skills.
Assisted in troubleshooting and resolving issues related to data integration and MDM.
Developed documentation, including design documents, operational manuals, and training materials.
Well versed in API integration.
Stored data files in Google Cloud Storage (GCS) buckets on a daily basis; used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
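Illustrative note: the item above mentions landing daily data files in GCS buckets and building on Dataproc and BigQuery. Below is a minimal Python sketch of such a daily land-and-load step, assuming the google-cloud-storage and google-cloud-bigquery client libraries; the bucket, dataset, table, and file paths are placeholders, not details taken from these projects.

```python
from datetime import date
from google.cloud import storage, bigquery

RUN_DATE = date.today().isoformat()
BUCKET = "daily-data-files"            # hypothetical bucket name
LOCAL_FILE = "/data/out/events.csv"    # hypothetical extract produced upstream

# 1. Land the daily extract in a GCS bucket.
gcs = storage.Client()
gcs.bucket(BUCKET).blob(f"landing/{RUN_DATE}/events.csv").upload_from_filename(LOCAL_FILE)

# 2. Load the landed file into a BigQuery table with schema auto-detection.
bq = bigquery.Client()
load_job = bq.load_table_from_uri(
    f"gs://{BUCKET}/landing/{RUN_DATE}/events.csv",
    "analytics.daily_events",          # hypothetical dataset.table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    ),
)
load_job.result()                      # block until the load job completes
```

In a fuller pipeline a Dataproc (Spark) job or a scheduled BigQuery query would typically transform the landed data before it is served to reporting tools.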
Started working with AWS storage to handle terabytes of data for customer BI reporting tools.
Experience in fact/dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
Devised PL/SQL stored procedures, functions, triggers, views, and packages; made use of indexing, aggregation, and materialized views to optimize query performance.
Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil and bq command-line utilities, Dataproc, and Stackdriver.
Good exposure to data quality, data mapping, and data filtration using data warehouse ETL tools such as Talend, Informatica, DataStage, and Ab Initio.
Good exposure to creating dashboards in reporting tools such as SAS, Tableau, Power BI, BusinessObjects, and QlikView, using filters and sets while dealing with huge volumes of data.
Experience with various databases such as Oracle, Teradata, Informix, and DB2.
Experience with NoSQL stores such as MongoDB and HBase, and with the PostgreSQL-based Greenplum.
Good knowledge of Cloudera distributions and of Amazon Simple Storage Service (Amazon S3), AWS Redshift, Lambda, Amazon EC2, and Amazon EMR.
Excellent understanding of Hadoop architecture and good exposure to Hadoop components such as MapReduce, HDFS, HBase, Hive, Sqoop, Cassandra, Kafka, and Amazon Web Services (AWS).
Tested, documented, and monitored APIs with Postman, which integrates the tests easily into build automation.
Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and Gzip; performed transformations on the imported data and exported it back to the RDBMS.
Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
Worked through the complete Software Development Life Cycle (SDLC) of analysis, design, development, testing, implementation, and support using Agile and Waterfall methodologies.
Demonstrated a full understanding of the fact/dimension data warehouse design model, including star and snowflake design methods.
Experienced with security protocols, including OS hardening, firewalls, iptables, and working with Infosec.
Worked with databases in Kubernetes in a high-availability and disaster-recovery environment.
Experienced with operational processes and scripts for smooth operation of Postgres and MongoDB.
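Illustrative note: the summary mentions using Spark SQL against CSV and JSON sources and loading results into Hive in various file formats. The PySpark sketch below shows that general pattern only; it is not code from the engagements listed here, the paths, columns, and table name are assumptions, and it presumes a Spark build with Hive support.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("csv-json-to-hive")
         .enableHiveSupport()          # assumes a configured Hive metastore
         .getOrCreate())

# Hypothetical source feeds: a CSV extract and a JSON event dump.
orders = spark.read.option("header", "true").csv("hdfs:///landing/orders/*.csv")
events = spark.read.json("hdfs:///landing/events/*.json")

# Simple DataFrame/Spark SQL transformations: daily counts from each feed, joined.
daily_orders = (orders.withColumn("activity_date", F.to_date("order_ts"))
                      .groupBy("activity_date")
                      .agg(F.count("*").alias("order_count")))
daily_events = (events.withColumn("activity_date", F.to_date("event_ts"))
                      .groupBy("activity_date")
                      .agg(F.count("*").alias("event_count")))
daily = daily_orders.join(daily_events, "activity_date", "left")

# Persist as a Parquet-backed Hive table so HiveQL and BI tools can query it.
daily.write.mode("overwrite").format("parquet").saveAsTable("analytics.daily_activity")
```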
TECHNICAL SKILLS
Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS
Hadoop Ecosystem: Hadoop, MapReduce, YARN, HDFS, Pig, Oozie, Zookeeper
Big Data Ecosystem: Spark, Spark SQL, Spark Streaming, Hive, Impala, Hue
Data Ingestion: Sqoop, Flume, NiFi, Kafka
NoSQL Databases: HBase, Cassandra, MongoDB
Programming Languages: C, Scala, Core Java, J2EE (Servlets, JSP, JDBC, JavaBeans, EJB)
Frameworks: MVC, Struts, Spring, Hibernate
Web Technologies: HTML, CSS, XML, JavaScript, Maven
Scripting Languages: JavaScript, UNIX shell, Python, R
Databases: Oracle 11g, MS Access, MySQL, SQL Server 2000/2005/2008/2012, Teradata
SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export & Import (DTS)
IDE: Eclipse, Visual Studio, IDLE, IntelliJ
Web Services: RESTful, SOAP
Tools: Bugzilla, Quick Test Pro (QTP) 9.2, Selenium, Quality Center, Test Link, TWS, SPSS, SAS, Documentum, Tableau, Mahout
Methodologies: Agile, UML, Design Patterns

Professional Experience

Senior Data Engineer
Client: Kohl's, VA | May 2019 to Present
Responsibilities:
Developed Hive and Bash scripts for source data validation and transformation.
Automated data loading into HDFS and Hive for pre-processing the data using One Automation.
Gathered data from data warehouses in Teradata and Snowflake.
Developed Spark (Scala) and Python code for a regular-expression project in the Hadoop/Hive environment.
Designed and implemented an ETL framework to load data from multiple sources into Hive and from Hive into Teradata.
Worked with dbt, which lets data engineers write modular, reusable SQL that can be version controlled and tested like any other software code, and which provides built-in features for data modeling such as automatic type inference, schema management, and data lineage tracking.
Generated reports using Tableau.
Built big data applications using Cassandra and Hadoop.
Utilized Sqoop, ETL, and Hadoop filesystem APIs to implement data ingestion pipelines.
Worked on batch data of different granularities, ranging from hourly and daily to weekly and monthly.
Performed hands-on Hadoop administration and support activities, installing and configuring Apache big data tools and Hadoop clusters using Cloudera Manager.
Handled Hadoop cluster installations in various environments, including Unix, Linux, and Windows.
Assisted in upgrading, configuring, and maintaining Hadoop infrastructure components such as Ambari, Pig, and Hive.
Developed and wrote SQL and stored procedures in Teradata.
Loaded data into Snowflake and wrote SnowSQL scripts.
Wrote TDCH scripts for full and incremental refreshes of Hadoop tables.
Optimized Hive queries by parallelizing with partitioning and bucketing (see the sketch following this section).
Worked with various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet, and ORC.
Worked extensively on Teradata, Hadoop/Hive, Spark, SQL, PL/SQL, and SnowSQL.
Designed and published visually rich and intuitive Tableau dashboards and Crystal Reports for executive decision making.
Experienced with SQL, T-SQL, and PL/SQL scripts, views, indexes, stored procedures, and other components of database applications.
Worked with Hadoop from the Hortonworks Data Platform and ran services through Cloudera Manager.
Used the Agile Scrum methodology (Scrum Alliance) for development.
Worked hands-on with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil and bq command-line utilities, Dataproc, and Stackdriver; stored data files in GCS buckets on a daily basis and used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
Applied knowledge of Cloudera distributions and of Amazon S3, AWS Redshift, Lambda, Amazon EC2, and Amazon EMR, along with Hadoop architecture and components such as MapReduce, HDFS, HBase, Hive, Sqoop, Cassandra, and Kafka.
Tested, documented, and monitored APIs with Postman, integrating the tests into build automation.
Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and Gzip; performed transformations on the imported data and exported it back to the RDBMS.
Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Spark, Hive, MySQL, Kerberos, Maven, StreamSets.
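Illustrative note: one Kohl's item above references optimizing Hive queries with partitioning and bucketing. The sketch below uses PySpark's native partitionBy/bucketBy when writing a table (the HiveQL equivalent would be a PARTITIONED BY ... CLUSTERED BY ... INTO n BUCKETS DDL); the table names, columns, and bucket count are assumptions, not the actual project schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("partition-bucket-demo")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical staging table produced by an upstream ingestion job.
sales = spark.table("retail.sales_staging")

# Partitioning by date lets queries that filter on sale_date prune whole partitions;
# bucketing on store_id pre-organizes data so joins/aggregations on store_id shuffle less.
(sales
 .withColumn("sale_date", F.to_date("sale_ts"))
 .write
 .mode("overwrite")
 .partitionBy("sale_date")
 .bucketBy(32, "store_id")
 .sortBy("store_id")
 .saveAsTable("retail.sales_curated"))

# A query that benefits from partition pruning: only the matching date partition is scanned.
spark.sql("""
    SELECT store_id, SUM(amount) AS revenue
    FROM retail.sales_curated
    WHERE sale_date = '2024-01-01'
    GROUP BY store_id
""").show()
```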
Sr. Hadoop/Big Data Engineer
Client: Sony, NY | August 2017 to April 2019
Responsibilities:
Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
Created shell scripts to process the raw data and load it into AWS S3 and Redshift databases.
Planned and designed the data warehouse in a star schema; designed table structures and documented them.
Designed and implemented an end-to-end big data platform on a Teradata appliance.
Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Hadoop and Spark.
Involved in developing the architecture solution for the data migration project.
Developed Python and Bash scripts to automate processing and provide control flow.
Moved data from Teradata to the Hadoop cluster using TDCH/FastExport and Apache NiFi.
Worked with PySpark to perform ETL and generate reports.
Worked with dbt, which lets data engineers write modular, reusable SQL that can be version controlled and tested like any other software code, and which provides built-in features for data modeling such as automatic type inference, schema management, and data lineage tracking.
Wrote regression SQL to merge validated data into the production environment.
Developed Python, PySpark, and Bash scripts to transform and load data and logs across on-premise and cloud platforms.
Wrote UDFs in PySpark to perform transformations and loads (see the sketch following this section).
Used NiFi to load data into HDFS as ORC files.
Wrote TDCH scripts and Apache NiFi flows to load data from mainframe DB2 into the Hadoop cluster.
Worked with Google Cloud Storage; researched and developed strategies to minimize cost in Google Cloud.
Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
Used Apache Solr for search operations on data.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
Worked with multiple sources; migrated tables from Teradata and DB2 to the Hadoop cluster.
Performed source analysis, tracing data back to its roots through Teradata, DB2, etc., and identifying and documenting the jobs that load the source tables.
Was an active part of the Agile Scrum process with two-week sprints; worked with Jira and Microsoft Planner to track project progress.
Stored data files in Google Cloud Storage buckets on a daily basis; used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
Good exposure to data quality, data mapping, and data filtration using data warehouse ETL tools such as Talend, Informatica, DataStage, and Ab Initio.
Good exposure to creating dashboards in reporting tools such as SAS, Tableau, Power BI, BusinessObjects, and QlikView, using filters and sets while dealing with huge volumes of data.
Experience with various databases such as Oracle, Teradata, Informix, and DB2.
Experience with NoSQL stores such as MongoDB and HBase, and with the PostgreSQL-based Greenplum.
Worked through the complete Software Development Life Cycle (SDLC) of analysis, design, development, testing, implementation, and support using Agile and Waterfall methodologies.
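Illustrative note: the Sony section above mentions PySpark UDFs used during transform-and-load steps and NiFi landing data in HDFS as ORC. A minimal sketch of that kind of UDF is shown below; the column name, cleansing rule, and paths are hypothetical, not the actual mainframe feed.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

spark = SparkSession.builder.appName("udf-cleanse").getOrCreate()

# Hypothetical cleansing rule: normalize free-text product codes before curation.
@F.udf(returnType=T.StringType())
def clean_product_code(raw):
    return raw.strip().upper().replace("-", "") if raw is not None else None

# Read ORC files landed by NiFi (placeholder path), apply the UDF, write back as ORC.
raw = spark.read.orc("hdfs:///raw/db2/products")
(raw.withColumn("product_code", clean_product_code("product_code"))
    .write.mode("overwrite")
    .orc("hdfs:///curated/products"))
```

Built-in functions such as upper and regexp_replace would usually be preferred over a Python UDF for performance; the UDF form is shown because it is what the bullet describes.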
Big Data Engineer
Client: Mutual of Omaha, TX | Nov 2015 to July 2017
Responsibilities:
Responsible for building scalable distributed data solutions using Hadoop.
Used PySpark Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra (a minimal streaming sketch follows this section).
Loaded data into PySpark DataFrames and Spark RDDs and performed advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities to generate the output response.
Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in PySpark, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
Developed Scala and PySpark scripts using both the DataFrame/SQL and RDD/MapReduce APIs for data aggregation and queries, writing data back into the OLTP system through Sqoop.
Worked with Impala and Kudu, creating a Spark-to-Impala/Kudu data ingestion tool.
Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory settings.
Optimized existing algorithms in Hadoop using SparkSession, Spark SQL, DataFrames, and pair RDDs.
Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting, and grouping.
Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
Wrote Sqoop scripts for importing and exporting data between RDBMS and HDFS; ingested data from the RDBMS, performed data transformations, and exported the transformed data to Cassandra for data access and analysis.
Created Hive tables for loading and analyzing data; implemented partitions and buckets and developed Hive queries to process the data and generate data cubes for visualization.
Implemented schema extraction for Parquet and Avro file formats in Hive.
Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
Worked hands-on with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil and bq command-line utilities, Dataproc, and Stackdriver.
Worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, in order to implement the former in the project.
Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
Experience in job management using the Fair Scheduler; developed job processing scripts using Oozie workflows.
Experience with NoSQL column-oriented databases such as Cassandra and their integration with the Hadoop cluster.
Worked with the BI team to create various reports in Tableau based on the client's needs.
Queried Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
Troubleshot problems arising during batch data processing jobs.
Extracted data from Teradata into HDFS and dashboards using Spark Streaming.
Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elasticsearch, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux, Shell scripting.
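Illustrative note: the Mutual of Omaha section mentions Spark streaming jobs that read learner events from Kafka in near real time and persist them to Cassandra. The sketch below uses Structured Streaming (rather than the DStream API named above) and a console sink as a stand-in, since the Cassandra write would require the DataStax spark-cassandra-connector; the brokers, topic, and schema are assumptions, and the spark-sql-kafka package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F, types as T

spark = SparkSession.builder.appName("kafka-learner-events").getOrCreate()

# Hypothetical event schema carried in the Kafka message value as JSON.
schema = T.StructType([
    T.StructField("learner_id", T.StringType()),
    T.StructField("event_type", T.StringType()),
    T.StructField("event_ts", T.TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
          .option("subscribe", "learner-events")               # placeholder topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Console sink stands in for the Cassandra write done via the spark-cassandra-connector.
query = (events.writeStream
         .format("console")
         .outputMode("append")
         .option("checkpointLocation", "/tmp/chk/learner-events")
         .start())
query.awaitTermination()
```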
Data Engineer
Client: HSBC, Hyderabad, India | June 2013 to October 2015
Responsibilities:
Developed Pig scripts for source data validation and transformation.
Automated data loading into HDFS and Pig for pre-processing the data using Oozie.
Designed and implemented an ETL framework using Java and Pig to load data from multiple sources into Hive and from Hive into Vertica.
Expert design and coding skills, unit testing methodologies, and techniques.
Demonstrated competency in all phases of business intelligence and data warehousing projects, from inception to production deployment.
Solid understanding of data warehousing principles, concepts, and best practices (e.g., ODS, data marts, staging).
Strong understanding of the release management process and required applications.
Familiar with business intelligence tools such as MicroStrategy, Tableau, or similar.
Understanding of data modeling principles for data warehousing (normalization and star schema).
Solid understanding of the MS Office suite (Visio, MS Project, and others).
Developed Spark scripts using Python in the PySpark shell during development.
Experienced in Hadoop production support tasks, analyzing application and cluster logs.
Created Hive tables, loaded them with data, and wrote Hive queries to process the data; created partitions and used bucketing on Hive tables with the required parameters to improve performance.
Developed Pig and Hive UDFs per business use cases.
Created data pipelines for different ingestion, aggregation, and load events, moving consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as the feed for Tableau dashboards (see the sketch following this section).
Worked with various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet, and XML.
Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), and Athena.
Used Apache NiFi to automate data movement between different Hadoop components and to convert raw XML data into JSON and Avro.
Designed and published visually rich and intuitive Tableau dashboards and Crystal Reports for executive decision making.
Experienced with SQL, T-SQL, and PL/SQL scripts, views, indexes, stored procedures, and other components of database applications.
Worked with Hadoop from the Cloudera Data Platform and ran services through Cloudera Manager.
Used the Agile Scrum methodology (Scrum Alliance) for development.
Environment: Hadoop, HDFS, Hive, Scala, Tez, Teradata, Teradata Studio, TDCH, Snowflake, MapReduce, YARN, Drill, Spark, Pig, Java, MySQL, Kerberos.
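Illustrative note: the HSBC section above lists Airflow among the AWS components used and describes a pipeline that lands consumer response data in S3 and exposes it through Hive external tables for Tableau. The Airflow 2.x-style DAG below is only a minimal sketch of how such a pipeline could be scheduled; the bucket, paths, and job script names are placeholders, and it assumes the AWS CLI and spark-submit are available to the workers.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="consumer_response_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Land the day's consumer response extract in S3 (placeholder paths).
    ingest_to_s3 = BashOperator(
        task_id="ingest_to_s3",
        bash_command=("aws s3 cp /data/out/consumer_response.csv "
                      "s3://consumer-response-bucket/landing/{{ ds }}/"),
    )

    # Aggregate the landed data and refresh the Hive external table feeding Tableau.
    build_hive_feed = BashOperator(
        task_id="build_hive_feed",
        bash_command="spark-submit /jobs/build_consumer_response.py --run-date {{ ds }}",
    )

    ingest_to_s3 >> build_hive_feed
```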
ETL Developer
Client: HDFC, Hyderabad, India | June 2012 to May 2013
Responsibilities:
Solid understanding of data warehousing principles, concepts, and best practices (e.g., ODS, data marts, staging).
Strong understanding of the release management process and required applications.
Familiar with business intelligence tools such as MicroStrategy, Tableau, or similar.
Understanding of data modeling principles for data warehousing (normalization and star schema).
Solid understanding of the MS Office suite (Visio, MS Project, and others).
Developed Pig scripts for source data validation and transformation.
Automated data loading into HDFS and Pig for pre-processing the data using Oozie.
Designed and implemented an ETL framework using Java and Pig to load data from multiple sources into Hive and from Hive into Vertica.
Expert design and coding skills, unit testing methodologies, and techniques.
Demonstrated competency in all phases of business intelligence and data warehousing projects, from inception to production deployment.
Created data pipelines for different ingestion, aggregation, and load events, moving consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as the feed for Tableau dashboards.
Worked with various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet, and XML.
Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), and Athena.
Designed and published visually rich and intuitive Tableau dashboards and Crystal Reports for executive decision making.
Experienced with SQL, T-SQL, and PL/SQL scripts, views, indexes, stored procedures, and other components of database applications.
Worked with Hadoop from the Cloudera Data Platform and ran services through Cloudera Manager.
Used the Agile Scrum methodology (Scrum Alliance) for development.
Environment: Hadoop, HDFS, Hive, Scala, Tez, Teradata, Teradata Studio, TDCH, Snowflake, MapReduce, YARN, Drill, Spark, Pig, Java, MySQL, Kerberos.

ADDITIONAL SKILLS
Machine Learning: Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP)
IDE: IntelliJ, Eclipse, Spyder, Jupyter
Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, AutoML (Scikit-Learn, MLjar), etc.
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBase
Programming / Query Languages: Java, SQL, Python (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala
Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, GCP, Google Cloud Shell, Linux, PuTTY, Bash Shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design