Gowtham - Big Data Developer
[email protected]
Location: Alpharetta, Georgia, USA
Relocation:
Visa: H1B
Gowtham
Sr. Big Data Engineer
Mobile: 9084284149 | E-Mail: [email protected]

Professional Summary:
- Over 10 years of strong experience in application development using Azure Data Factory, PySpark, Java, Python, Scala, and R, with an in-depth understanding of distributed systems architecture and parallel processing frameworks.
- Provisioned and managed resources on GCP; designed and implemented scalable and secure GCP solutions.
- Strong experience using PySpark, HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, and HBase.
- Experienced with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new features.
- Experience moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop.
- Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server.
- Deep understanding of MapReduce with Hadoop and Spark, and good knowledge of the Big Data ecosystem, including Hadoop 2.0 (HDFS, Hive, Pig, Impala) and Spark (Spark SQL, Spark MLlib, Spark Streaming).
- Integrated Kafka with Spark Streaming for real-time data processing (a minimal sketch follows this summary).
- Experienced with Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
- Expert in designing server jobs using stages such as Sequential File, ODBC, Hashed File, Aggregator, Transformer, Sort, Link Partitioner, and Link Collector.
- Proficient in Big Data practices and technologies such as HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, Flume, Spark, Kafka, Impala, and PySpark.
- Solid knowledge of AWS services such as EMR, Redshift, S3, and EC2, including configuring servers for auto-scaling and elastic load balancing.
- Experience integrating data in AWS with Snowflake.
- Experienced with distributed version control systems such as GitHub, GitLab, and Bitbucket to keep code versions and configurations organized.
- Working knowledge of Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
- Experienced in building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling database access.
- Extensive experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, and Storage Explorer.
- Wrote Python modules to extract and load asset data from a MySQL source database; designed and implemented a dedicated MySQL database server to drive the web apps and report on daily progress.
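A minimal sketch of the Kafka and Spark Streaming integration described above, using PySpark Structured Streaming; the broker address, topic name, schema, and output paths are hypothetical placeholders, not the actual production values.

    # Minimal sketch: consume JSON events from Kafka and write them to Parquet.
    # Broker, topic, schema, and paths below are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

    schema = (StructType()
              .add("event_id", StringType())
              .add("event_time", TimestampType())
              .add("payload", StringType()))

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
              .option("subscribe", "events_topic")                # placeholder topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "/data/events")               # placeholder output path
             .option("checkpointLocation", "/chk/events")  # placeholder checkpoint
             .outputMode("append")
             .start())
    query.awaitTermination()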
Technical Skills:
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, Storm
Hadoop Distributions: Cloudera, Hortonworks
Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting
Script Languages: JavaScript, jQuery, Python, Palantir
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL, HBase, MongoDB
Cloud Platforms: AWS, Azure, GCP
Distributed Messaging System: Apache Kafka
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL, Matillion
Batch Processing: Hive, MapReduce, Pig, Spark
Operating Systems: Linux (Ubuntu, Red Hat), Microsoft Windows
Reporting/ETL Tools: Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI

Professional Experience:

Client: Citizens, Phoenix (Remote)
Senior Big Data Engineer | Feb 2021 - Present
Responsibilities:
- Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
- Provisioned and managed resources on GCP; designed and implemented scalable and secure GCP solutions; implemented and maintained cloud-based solutions.
- Redesigned the cloud-based data warehouse to enhance security and improve performance.
- Designed, developed, tested, and deployed ETL mappings, sessions, and workflows using Informatica PowerCenter.
- Understood business requirements and translated them into ETL solutions; designed and implemented data models to support ETL processes and collaborated with ETL developers to ensure the models aligned with business requirements.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (a sketch of such a Glue job follows this role).
- Automated Datadog dashboards for the stack through Terraform scripts.
- Designed and developed solutions for applications migrating to Azure.
- Utilized Python libraries such as Boto3 and NumPy for AWS.
- Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Created external tables with partitions using Hive, AWS Athena, and Redshift.
- Developed PySpark code for AWS Glue jobs and for EMR.
- Good understanding of other AWS services such as S3, EC2, IAM, and RDS.
- Experience with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.
- Good exposure to Snowflake cloud architecture, SnowSQL, and Snowpipe for continuous data ingestion.
- Utilized AWS services such as CodePipeline, CodeCommit, and CloudFormation to automate the deployment of Databricks notebooks, code, and infrastructure.
- Developed and maintained CI/CD pipelines using DevOps best practices to streamline the data engineering process, reduce errors, and improve efficiency.
- Developed automated testing and monitoring strategies to ensure high-quality data pipelines and machine learning models.
Environment: Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Code Cloud, AWS.
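As a hedged illustration of the AWS Glue ETL work above, a Glue job that reads cataloged campaign data from S3 and loads it into Redshift might look roughly like this; the catalog database, table, connection, column mappings, and S3 paths are hypothetical placeholders.

    # Rough AWS Glue job sketch: read campaign data from the Glue Data Catalog
    # (backed by S3 Parquet/ORC files) and load it into Redshift.
    # Database, table, connection, mappings, and paths are placeholders.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: catalog table crawled from S3
    campaigns = glue_context.create_dynamic_frame.from_catalog(
        database="campaign_db",          # placeholder catalog database
        table_name="raw_campaigns")      # placeholder catalog table

    # Map source columns onto the Redshift target schema
    mapped = ApplyMapping.apply(
        frame=campaigns,
        mappings=[("campaign_id", "string", "campaign_id", "string"),
                  ("spend", "double", "spend", "double"),
                  ("event_date", "string", "event_date", "date")])

    # Target: Redshift via a Glue catalog connection, staged through S3
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",                # placeholder connection
        connection_options={"dbtable": "campaigns", "database": "analytics"},
        redshift_tmp_dir="s3://my-temp-bucket/redshift/")  # placeholder temp dir
    job.commit()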
Client: AT&T (Remote)
Sr. Big Data Engineer - Azure | Aug 2019 - Jan 2021
Responsibilities:
- Involved in requirement gathering, business analysis, design, development, testing, and implementation of business rules.
- Designed and implemented scalable and secure GCP solutions; defined infrastructure as code using tools like Terraform and Deployment Manager; ensured compliance with security best practices; implemented and maintained cloud-based solutions.
- Understood business use cases and integration needs; wrote business and technical requirements documents, logic diagrams, process flow charts, and other application-related documents.
- Used Pandas in Python for data cleansing and validating the source data.
- Designed and developed an ETL pipeline in the Azure cloud that pulls customer data from an API and processes it into Azure SQL DB (a sketch follows this role).
- Orchestrated all data pipelines using Azure Data Factory and built a custom alerts platform for monitoring; created custom alert queries in Log Analytics and used webhook actions to automate custom alerts.
- Used Azure Data Factory extensively for ingesting data from disparate source systems and as an orchestration tool for integrating data from upstream to downstream systems.
- Automated jobs using different triggers (event, scheduled, and tumbling window) in ADF.
- Used Cosmos DB for storing catalog data and for event sourcing in order-processing pipelines.
- Took initiative and ownership to provide business solutions on time.
- Created high-level technical design documents and application design documents as per the requirements, and delivered clear, well-communicated, and complete design documents.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
- Created pipelines with the GUI in Azure Data Factory V2, scheduled pipelines, monitored data movement from source to destination, and transformed data with ADF transformations.
- Experience with Azure DevOps Wiki, Repos, deployment agents, and build and release pipelines.
- Used T-SQL to construct user functions, views, indexes, user profiles, relational database models, data dictionaries, and data integrity constraints.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modelling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.
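A simplified sketch of the API-to-Azure-SQL pipeline described above, using PySpark on Databricks; the API URL, JDBC server, database, table, and credentials are hypothetical placeholders (secrets would normally come from Key Vault).

    # Sketch: fetch customer records from a REST API and write them to Azure SQL DB.
    # URL, server, database, table, and credentials below are placeholders.
    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-to-azure-sql").getOrCreate()

    # Pull the raw records (assumes the API returns a JSON array of objects)
    records = requests.get("https://api.example.com/customers", timeout=30).json()
    df = spark.createDataFrame(records)

    # Light cleanup before loading
    df = df.dropDuplicates(["customer_id"]).na.drop(subset=["customer_id"])

    # Write to Azure SQL Database over JDBC
    jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
                "database=customerdb")                    # placeholder server/db
    (df.write.format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.customers")                # placeholder table
       .option("user", "etl_user")                        # placeholder credentials
       .option("password", "<secret-from-key-vault>")
       .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
       .mode("append")
       .save())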
Client: Ford Motors, Dearborn, Michigan
Hadoop/Big Data Engineer | Oct 2017 - Jul 2019
Responsibilities:
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
- Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, with compression codecs such as Gzip, Snappy, and LZO.
- Strong understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (a sketch follows this role).
- Interacted with business partners, business analysts, and product owners to understand requirements and build scalable distributed data solutions using the Hadoop ecosystem.
- Developed Spark Streaming programs to process near-real-time data from Kafka, using both stateless and stateful transformations.
- Implemented a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka and Zookeeper based log collection platform.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Worked with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HQL queries.
- Built and implemented automated procedures to split large files into smaller batches to facilitate FTP transfer, which reduced execution time by 60%.
- Developed Pig UDFs for manipulating data according to business requirements and developed custom Pig loaders.
- Developed ETL pipelines in and out of data warehouses using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.
- Transformed data using AWS Glue dynamic frames with PySpark, cataloged the transformed data using crawlers, and scheduled the jobs and crawlers using the workflow feature.
- Worked on cluster installation, commissioning and decommissioning of data nodes, name node recovery, capacity planning, and slots configuration.
- Developed data pipeline programs with the Spark Scala APIs, performed data aggregations with Hive, and formatted data (JSON) for visualization.
Environment: Apache Spark, MapReduce, Snowflake, Apache Pig, Python, Java, SSRS, HBase, AWS, Cassandra, PySpark, Apache Kafka, Hive, Sqoop, Flume, Apache Oozie, Zookeeper, ETL.
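A rough illustration of the Hive partitioning and bucketing work above, written from Spark; the table, column, and path names are hypothetical placeholders.

    # Sketch: write a partitioned, bucketed Hive table from Spark.
    # Table, column, and path names below are placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-demo")
             .enableHiveSupport()
             .getOrCreate())

    orders = spark.read.parquet("/data/raw/orders")   # placeholder source path

    # Partition by a low-cardinality column and bucket by the join key so that
    # partition pruning and bucketed joins can reduce query time.
    (orders.write
        .partitionBy("order_date")
        .bucketBy(16, "customer_id")
        .sortBy("customer_id")
        .format("parquet")
        .mode("overwrite")
        .saveAsTable("sales.orders_bucketed"))

    # Queries that filter on the partition column read only the matching partitions.
    spark.sql("SELECT count(*) FROM sales.orders_bucketed "
              "WHERE order_date = '2019-01-01'").show()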
Client: Accenture, Hyderabad, India
Big Data Engineer | Mar 2015 - Aug 2017
Responsibilities:
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
- Worked with Apache NiFi to decompress and move JSON files from local storage to HDFS.
- Maintained BigQuery, PySpark, and Hive code by fixing bugs and providing enhancements required by the business users.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks.
- Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and loading it into the data warehouse servers.
- Performed data extraction, transformation, loading, and integration in the data warehouse, operational data stores, and master data management.
- Strong understanding of AWS components such as EC2 and S3.
- Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.
- Ingested data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join data set scripts using Hive join operations.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Built data pipelines using the Apache Beam framework in GCP for ETL jobs with different Airflow operators.
- Worked on the Hortonworks HDP distribution.
- Wrote Flume configuration files for importing streaming log data into HBase with Flume.
- Used a variety of AWS computing and networking services to meet application needs.
- Wrote HiveQL as per requirements, processed data in the Spark engine, and stored it in Hive tables.
- Imported existing datasets from Oracle into Hadoop using Sqoop.
- Experienced in using the Tidal enterprise scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.
- Applied Spark Streaming for real-time data transformation.
- Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item so it can be suggested automatically, using Kinesis Firehose and an S3 data lake.
Environment: Hadoop (HDFS, MapReduce), PySpark, Scala, NiFi, GCP, MongoDB, Databricks, Hortonworks, Cassandra, PostgreSQL, Spark, Impala, Hive, Pig, DevOps, HBase, Oozie, Hue, Sqoop, Flume, Oracle, AWS Services (Lambda, EMR), MySQL, SQL Server.

Client: Cyient Solutions, Hyderabad, India
Jr. Hadoop Developer | Jul 2013 - Feb 2015
Responsibilities:
- Developed Pig scripts for source data validation and transformation; automated data loading into HDFS and Pig pre-processing using Oozie.
- Designed and implemented an ETL framework using Java and Pig to load data from multiple sources into Hive and from Hive into Vertica.
- Handled Hadoop cluster installations in various environments such as Unix, Linux, and Windows.
- Assisted in upgrading, configuring, and maintaining various Hadoop infrastructure components such as Pig, Hive, and HBase.
- Developed Spark scripts using Python in the PySpark shell during development.
- Experienced in Hadoop production support tasks, analyzing application and cluster logs.
- Created Hive tables, loaded them with data, and wrote Hive queries to process the data; created partitions and used bucketing in Hive tables with the required parameters to improve performance.
- Developed Pig and Hive UDFs as per business use cases.
- Used Apache NiFi to automate data movement between different Hadoop components and to convert raw XML data into JSON and Avro.
- Designed and published visually rich and intuitive Tableau dashboards and Crystal Reports for executive decision making.
- Experienced in working with SQL, T-SQL, and PL/SQL scripts, views, indexes, stored procedures, and other components of database applications.
- Used the Agile Scrum methodology (Scrum Alliance) for development.
Environment: Hadoop, HDFS, Hive, Scala, Tez, Teradata, Teradata Studio, TDCH, Snowflake, MapReduce, YARN, Drill, Spark, Pig, Java, MySQL, Kerberos.

Education:
Bachelor's degree in Computer Science, JNTU, Hyderabad | Aug 2010 - Jun 2014