Sai Ganesh - Hadoop/Spark Developer
[email protected]
Location: Dallas, Texas, USA
Relocation: Yes
Visa: H1B
Sai Ganesh Vasam
Email: [email protected] | Ph No: 430-221-2036

PROFESSIONAL SUMMARY:
- Overall 10+ years of professional IT experience, with 7+ years in Big Data Hadoop ecosystems covering ingestion, storage, querying, processing, and analysis of big data.
- Hands-on experience architecting and implementing Hadoop clusters on Amazon Web Services (AWS) using EMR, EC2, S3, Redshift, Cassandra, MongoDB, Cosmos DB, SimpleDB, Amazon RDS, DynamoDB, PostgreSQL, SQL, and MS SQL.
- Experience in Hadoop administration activities such as installation, configuration, and management of clusters in Cloudera (CDH4, CDH5) and Hortonworks (HDP) distributions using Cloudera Manager and Ambari.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as HDFS, MapReduce, Hive, Impala, Sqoop, Pig, Oozie, Zookeeper, Spark, Solr, Hue, Flume, Storm, Kafka, and YARN.
- Set up Hadoop clusters on AWS, including configuring the different Hadoop components.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Designed a data analysis pipeline in Python using AWS services such as S3, EC2, and Elastic MapReduce; maintained a Hadoop cluster on AWS EMR.
- Used AWS services such as EC2 and S3 for processing and storing small data sets.
- Experienced in importing and exporting data between HDFS and relational database management systems using Sqoop, and in troubleshooting related issues.
- Exposure to Data Lake implementation using Apache Spark; developed data pipelines, applied business logic using Spark, and used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Extensively worked on Spark with Scala on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Expert in the big data ecosystem using Hadoop, Spark, and Kafka with column-oriented big data systems on cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
- Used Spark to design and perform both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.
- Involved in designing and monitoring Kafka for a multi-data-center cluster.
- Responsible for importing real-time data from source systems into Kafka clusters.
- Developed Spark Streaming jobs in Scala to consume data from Kafka topics, transform the data, and insert it into HBase.
- Very good knowledge of and experience with Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of big data.
- Experienced in performance tuning of YARN, Spark, and Hive, and in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
- Developed predictive analytics using Apache Spark Scala APIs.
- Worked on a migration POC to move data from existing on-prem Hive to Snowflake.
- Experienced in extending Hive and Pig core functionality by writing custom UDFs and MapReduce scripts using Java and Python.
- Good understanding of and experience with NameNode HA architecture, and experience monitoring cluster health using Ambari, Nagios, Ganglia, and cron jobs.
- Experienced in cluster maintenance and commissioning/decommissioning of data nodes; good understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Experienced in implementing security controls using Kerberos principals, ACLs, and data encryption with DM-Crypt to protect entire Hadoop clusters.
- Well-versed in Spark components such as Spark SQL, MLlib, Spark Streaming, and GraphX.
- Expertise in installation, administration, patching, upgrades, configuration, performance tuning, and troubleshooting of Red Hat Linux, SUSE, CentOS, AIX, and Solaris.
- Experienced in scheduling recurring Hadoop jobs with Apache Oozie, and in Jumpstart, Kickstart, infrastructure setup, and installation methods for Linux.
- Good troubleshooting skills and understanding of system capacity, bottlenecks, and the basics of memory, CPU, OS, storage, and networking.
- Experience in administration of RDBMS databases such as MS SQL Server.
- Experienced with the Hadoop Distributed File System and ecosystem (MapReduce, Pig, Hive, Sqoop, YARN, MongoDB, and HBase), with knowledge of NoSQL databases such as HBase, Cassandra, and MongoDB.
- Major strengths include familiarity with multiple software systems, the ability to learn new technologies quickly and adapt to new environments, and excellent interpersonal, technical, and communication skills.

TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Impala, YARN, Hue, Oozie, Zookeeper, Apache Spark, Apache Storm, Apache Kafka, Sqoop, Flume, PySpark
Operating Systems: Windows, Ubuntu, Red Hat Linux, Unix
Programming Languages: C, C++, Java, Python, Scala
Scripting Languages: Shell scripting, JavaScript
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, SQL, PL/SQL, Teradata
NoSQL Databases: HBase, Cassandra, MongoDB
Hadoop Distributions: Cloudera, Hortonworks
Build Tools: Ant, Maven, sbt
Development IDEs: NetBeans, Eclipse
Web Servers: WebLogic, WebSphere, Apache Tomcat 6
Cloud: AWS
Version Control Tools: SVN, Git, GitHub
Packages: Microsoft Office, PuTTY, MS Visual Studio

Professional Experience:

Client: Capital One, Dallas, TX    Nov 2020 - Present
Role: Hadoop / Spark Developer
Responsibilities:
- Involved in analysis, design, system architecture design, process interface design, and design documentation.
- Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Used Spark SQL to process large amounts of structured data.
- Worked on NoSQL databases including HBase and MongoDB.
- Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation in PySpark.
- Used Spark to design and perform both batch processing (similar to MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.
- Involved in designing and monitoring Kafka for a multi-data-center cluster.
- Responsible for importing real-time data from source systems into Kafka clusters.
- Developed Spark Streaming jobs in Scala to consume data from Kafka topics, transform the data, and insert it into HBase.
- Developed predictive analytics using Apache Spark Scala APIs.
- Worked on a migration POC to move data from existing on-prem Hive to Snowflake.
- Set up a Hadoop cluster on AWS, including configuring the different Hadoop components.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Designed a data analysis pipeline in Python using AWS services such as S3, EC2, and Elastic MapReduce; maintained a Hadoop cluster on AWS EMR.
- Used AWS services such as EC2 and S3 for processing and storing small data sets.
- Created Oozie workflows and coordinator jobs for recurrent triggering of Hadoop jobs such as Java MapReduce, Pig, Hive, and Sqoop, as well as system-specific jobs (such as Java programs and shell scripts), by time (frequency) and data availability.
- Used Spark as a fast, general-purpose processing engine compatible with Hadoop data.
- Developed a Python script for starting and ending a job cleanly within a UC4 workflow.
- Analyzed large data sets by running Hive queries and Pig scripts; introduced Cascade jobs to make the data analysis more efficient as per the requirement.
- Built reusable Hive UDF libraries that enabled business analysts to use these UDFs in Hive queries.
- Developed Python scripts to clean the raw data.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Developed Spark scripts in Scala as per the requirement; loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Applied MapReduce framework jobs in Java for data processing by installing and configuring Hadoop and HDFS.
- Performed data analysis in Hive by creating tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Worked on analyzing the Hadoop cluster and different big data analytic tools including Pig, HBase, NoSQL databases, and Sqoop.
- Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Involved in creating Hive tables and loading and analyzing data using Hive queries.
- Used Flume to export application server logs into HDFS.
Environment: Hadoop, HDFS, Sqoop, Hive, Pig, MapReduce, Spark, Scala, Kafka, AWS, HBase, MongoDB, Cassandra, Python, NoSQL, Flume, Oozie.

Client: Target, Minneapolis, MN    Aug 2019 - Nov 2020
Role: Spark Developer
Responsibilities:
- Worked with product owners, designers, QA, and other engineers in an Agile development environment to deliver timely solutions as per customer requirements.
- Transferred data from different data sources into HDFS using Kafka producers, consumers, and Kafka brokers.
- Developed highly optimized Spark applications to perform various data cleansing, validation, transformation, and summarization activities according to the requirements.
- Built a data pipeline consisting of Spark, Hive, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python.
- Used Oozie for automating end-to-end data pipelines and Oozie coordinators for scheduling the workflows.
- Involved in creating Hive tables, loading data, and writing Hive queries and views, and worked on them using HiveQL.
- Optimized Hive queries using map-side joins, dynamic partitions, and bucketing.
- Applied Hive queries to perform data analysis on HBase using SerDe tables to meet the data requirements of downstream applications.
- Responsible for executing Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and query data in HBase.
- Implemented MapReduce secondary sorting to get better performance for sorting results in MapReduce programs.
- Loaded and transformed large sets of structured and semi-structured data, including Avro and sequence files.
- Worked on migrating all existing jobs to Spark to improve performance and decrease execution time.
- Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Hive join queries to join multiple tables of a source system and load them into Elasticsearch tables.
- Experience with the ELK Stack in building quick search and visualization capability for data.
- Experience with different data formats such as JSON, Avro, Parquet, and ORC, and compression codecs such as Snappy and bzip2.
- Coordinated with the testing team on bug fixes and created documentation for recorded data, agent usage, and release cycle notes.
Environment: Hadoop, Big Data, HDFS, Scala, Python, Oozie, Hive, HBase, NiFi, Impala, Spark, AWS, Linux.

Client: Change Health, Nashville, TN    Jan 2018 - Aug 2019
Role: Hadoop Developer
Responsibilities:
- Developed an EDW solution: a cloud-based EDW and data lake that supports data asset management, data integration, and continuous data analytic discovery workloads.
- Developed and implemented real-time data pipelines with Spark Streaming, Kafka, and Cassandra to replace an existing lambda architecture without losing its fault-tolerant capabilities.
- Created a Spark Streaming application to consume real-time data from Kafka sources and applied real-time data analysis models that can be updated on new data as it arrives in the stream.
- Worked on importing and transforming large sets of structured, semi-structured, and unstructured data.
- Used Spark Structured Streaming to perform the necessary transformations and data modeling, reading data from Kafka in real time and persisting it to HDFS (see the sketch after this section).
- Implemented workflows using the Apache Oozie framework to automate tasks; used Zookeeper to coordinate cluster services.
- Created various Hive external and staging tables and joined the tables as per the requirement.
- Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables; created map-side joins and parallel execution for optimizing Hive queries.
- Developed and implemented Hive and Spark custom UDFs involving date transformations, such as date formatting and age calculations, as per business requirements.
- Wrote programs in Spark using Scala and Python for data quality checks.
- Wrote transformations and actions on DataFrames and used Spark SQL on DataFrames to access Hive tables in Spark for faster data processing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Used Spark optimization techniques such as caching/refreshing tables, broadcast variables, coalesce/repartitioning, increasing memory overhead limits, tuning parallelism, and modifying the Spark default configuration variables for performance tuning.
- Performed various benchmarking steps to optimize the performance of Spark jobs and improve overall processing.
- Worked in an Agile environment, delivering the agreed user stories within the sprint time.
Environment: Hadoop, HDFS, Hive, Sqoop, Oozie, Spark, Scala, Kafka, Python, Cloudera, Linux.
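The following is a minimal PySpark sketch of the Kafka-to-HDFS Structured Streaming pattern described in the Change Health engagement above, assuming the spark-sql-kafka connector is available. The broker address, topic name, schema fields, and HDFS paths are illustrative placeholders, not details from the actual project.

# Minimal sketch: consume JSON events from Kafka with Structured Streaming and
# persist them to HDFS as Parquet. All names below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Assumed event schema, for illustration only
event_schema = StructType([
    StructField("record_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
       .option("subscribe", "events-topic")                # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value and parse the JSON payload
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Write the parsed stream to HDFS as Parquet with checkpointing for fault tolerance
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")            # placeholder output path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()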
Client: NBC, Los Angeles, CA    Aug 2016 - Dec 2017
Role: Hadoop Developer
Responsibilities:
- Responsible for importing data into HDFS using Sqoop from different RDBMS servers and exporting aggregated data back to the RDBMS servers with Sqoop for other ETL operations.
- Involved in converting Hive/SQL queries into Spark transformations and actions using Spark SQL (RDDs and DataFrames) in Python (see the sketch at the end of this document).
- Implemented Spark SQL queries with Python for faster testing and processing of data.
- Involved in creating Hive tables and loading and analyzing data using Hive queries.
- Involved in running Hadoop jobs to process millions of records of text data.
- Developed multiple MapReduce jobs in Python for data cleaning and preprocessing.
- Experienced in analyzing data using HiveQL, Informatica, and custom Ab Initio and MapReduce programs in Python.
- Created partitioning, bucketing, map-side joins, and parallel execution for optimizing Hive queries.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed applications using Eclipse, with Maven as the build and deployment tool.
- Hands-on experience with various AWS cloud services such as EC2, EMR, S3, and RDS.
- Spun up different AWS instances, including EC2-Classic and EC2-VPC, using CloudEndure.
Environment: Hadoop, Spark, Python, HBase, HDFS, Sqoop, Hive, Oozie, UNIX shell scripting, Hue, Avro, Parquet, Informatica, Amazon S3, Maven, SBT, IntelliJ, Spotfire.

Client: Daffodil, Delhi, IN    March 2013 - July 2015
Role: Java Developer
Responsibilities:
- Designed Java Servlets and objects using J2EE standards.
- Designed use cases, activities, states, objects, and components.
- Developed UI pages using HTML, DHTML, JavaScript, Ajax, jQuery, JSP, and tag libraries.
- Developed front-end screens using JSP and tag libraries.
- Performed validations between various users.
- Coded HTML, JSP, and Servlets.
- Developed an internal application using Angular and Node.js, connecting to Oracle on the backend.
- Coded XML validation and file segmentation classes for splitting large XML files into smaller segments using a SAX parser.
- Created new connections through application code for better access to the DB2 database and was involved in writing SQL and PL/SQL: stored procedures, functions, sequences, triggers, cursors, object types, etc.
- Involved in testing and deployment on the development server.
- Wrote Oracle stored procedures (PL/SQL) and called them using JDBC.
- Involved in designing the database tables in Oracle.
Environment: Java 1.6, J2EE, Apache Tomcat, CVS, JSP, Servlets, Struts, PL/SQL, and Oracle.
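As a closing illustration of the Hive/SQL-to-Spark conversion pattern referenced in several of the engagements above, the sketch below shows one way a simple HiveQL aggregation might be rewritten as PySpark DataFrame transformations. The table and column names are hypothetical and not taken from any of the projects.

# Minimal sketch: rewrite a HiveQL aggregation as Spark DataFrame transformations.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()          # allows Spark to read existing Hive tables
         .getOrCreate())

# Equivalent HiveQL, for reference:
#   SELECT region, COUNT(*) AS views
#   FROM page_views
#   WHERE view_date >= '2017-01-01'
#   GROUP BY region
#   ORDER BY views DESC;

views_by_region = (spark.table("page_views")              # hypothetical Hive table
                   .filter(F.col("view_date") >= "2017-01-01")
                   .groupBy("region")
                   .agg(F.count("*").alias("views"))
                   .orderBy(F.col("views").desc()))

views_by_region.show()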