Krishna - Data Engineer
[email protected]
Location: Bloomington, Idaho, USA
Relocation: YES
Visa: H1B

PROFESSIONAL SUMMARY:
10+ years of IT industry experience as a Data Engineer, with hands-on experience installing and configuring Hadoop ecosystem components such as MapReduce, HDFS, HBase, ZooKeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, Flink, and Spark.
Experienced in Agile methodologies, including Extreme Programming, Scrum, and Test-Driven Development (TDD).
Excellent knowledge of Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
Experienced with Snowflake, a SaaS platform that powers the data cloud with an intelligent infrastructure, optimized storage, and an elastic performance engine.
Played a key role in migrating Teradata objects into the Snowflake environment.
Experience tuning Spark jobs for storage and processing efficiency.
Experience creating and executing data pipelines on the GCP, AWS, and Azure platforms.
Hands-on experience with GCP: BigQuery, GCS, Cloud Composer, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, and Dataproc.
Experience developing big data applications and services on the Amazon Web Services (AWS) platform using EMR, Kinesis, Glue, Lambda, S3, EC2, CloudWatch, and Redshift.
Used Qlik Replicate for real-time data replication, capturing and replicating data changes as they occur so that target systems stay continuously updated.
Worked on Azure Blob Storage and Azure Data Lake Storage, loading data into Azure Synapse Analytics.
Experienced in managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services.
Experienced in integrating Hadoop with Kafka and in uploading clickstream data from Kafka to HDFS.
Used the Spark Streaming API, which resembles the batch API but operates on micro-batches of data (see the streaming sketch at the end of this summary).
Experienced in loading datasets into Hive for ETL (Extract, Transform, Load) operations.
Good experience with Kanban, a subset of the Agile methodology that operates within the broader Agile mindset.
Worked with Apache Druid queries against segments stored only in deep storage; such queries are slower than queries against segments loaded on Historical processes.
Used Apache Druid's time-based partitioning, with optional secondary partitioning on other fields; time-based queries read only the partitions that match the query's time range, which yields significant performance improvements.
Worked with Druid lookup data sources, which correspond to Druid's key-value lookup objects and reside in the lookup schema in Druid SQL.
Collaborated with DataStage developers to define data requirements, create data mappings, and ensure the integrity and consistency of data throughout the ETL processes.
Good experience writing PySpark scripts using Jupyter notebooks.
Experience in analysing data using HQL, Pig Latin and custom MapReduce programs in Python.
Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
Hands-on experience developing ETL jobs in the Hadoop ecosystem using Oozie and StreamSets.
Proficient in tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms, Reports, SQL*Plus, Toad, and Crystal Reports.
Experienced in logical data modeling, capturing business requirements and creating a conceptual representation of the data.
Experience in importing and exporting data between relational database systems and HDFS using Sqoop.
Worked with Apache Kudu, which stores structured, schema-based data and supports a variety of data types and complex structures.
Extensive experience designing, developing, and deploying various kinds of reports in SSRS using relational and multidimensional data.
Developed Apache Spark jobs using Scala and Python for faster data processing and used Spark Core and Spark SQL libraries for querying.
Extensive experience importing and exporting data using stream processing platforms such as Apache Flume and Apache Beam.
Built a program with Python and Apache Beam and executed it on Cloud Dataflow to run data validation between raw source files and BigQuery tables.
Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying them on public or private clouds.
Worked with HBase to perform quick lookups (updates, inserts, and deletes) in Hadoop.
Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
Extensive experience using Maven as a build tool to produce deployable artifacts from source code.
Experience in data integration and data warehousing using ETL tools such as Informatica PowerCenter, AWS Glue, SQL Server Integration Services (SSIS), and Talend.
Set up and built AWS infrastructure resources such as VPC, EC2, S3, EMR, Kinesis, IAM, EBS, Simple Notification Service (SNS), Simple Queue Service (SQS), security groups, Auto Scaling, and RDS using CloudFormation JSON templates.
Configured ServiceNow's Change Management module to align with the organization's change management processes and policies, defining custom change request forms, workflows, and approval processes.
Expertise in relational database systems (RDBMS) such as MySQL, Oracle, and MS SQL Server, and NoSQL databases such as HBase, MongoDB, and Cassandra.
Collaborated closely with front-end developers to integrate user interfaces seamlessly with back-end Spring Boot services.
Experience with Software development tools such as JIRA, GIT, SVN.
Worked with JIRA for defect/issue logging and tracking, and documented work in Confluence.
Experience developing storytelling dashboards for data analytics, designing reports with visualization solutions in Tableau Desktop, and publishing them to Tableau Server.
Experience in text analytics, developing statistical machine learning and data mining solutions for various business problems, generating data visualizations using R, SAS, and Python, and creating dashboards with Qlik Sense and Tableau.
Created reports using visualizations such as Bar chart, Clustered Column Chart, Waterfall Chart, Gauge, Pie Chart, Tree map etc. in Power BI.
Flexible working across operating systems such as Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
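The following is a minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS clickstream pattern referenced above; the broker address, topic name, schema, and output paths are hypothetical placeholders, and it assumes the Spark Kafka connector package is available on the cluster.

```python
# Minimal sketch of the Kafka -> HDFS clickstream pattern described above.
# Broker address, topic name, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

click_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

# Spark Streaming consumes the topic as a series of micro-batches.
clicks = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "clickstream")                  # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), click_schema).alias("e"))
    .select("e.*")
)

query = (
    clicks.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/clickstream/")         # placeholder output path
    .option("checkpointLocation", "hdfs:///chk/clickstream/")
    .trigger(processingTime="1 minute")                  # micro-batch interval
    .start()
)
query.awaitTermination()
```

Each trigger interval produces one micro-batch, which is why the streaming code reads much like a batch job.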

TECHNICAL SKILLS:
Languages: SQL, PL/SQL, Python, Java, Scala, C, HTML, Unix/Linux shell scripting
ETL Tools: Matillion (AWS Redshift), Alteryx, Informatica PowerCenter, Ab Initio, DataStage, SQL Server Integration Services (SSIS), AWS Glue
Big Data: HDFS, Map Reduce, Spark, Yarn, NiFi, HBase, Hive, Pig, Flume, Sqoop, Kafka, Oozie, Hadoop, Zookeeper, Spark SQL, Apache Airflow.
RDBMS: Oracle 9i/10g/11g/12c, Teradata, PostgreSQL, MySQL, MS SQL Server, Neo4j, Cosmos DB, DynamoDB
NoSQL: MongoDB, HBase, Cassandra
Cloud Platform: AWS (Amazon Web Services), GCP, AZURE, Kubernetes
Concepts and Methods: Business Intelligence, Data Warehousing, Data Modeling, Requirement Analysis
Data Modeling Tools: ERwin, Power Designer, Embarcadero ER Studio, IBM Rational Software Architect, MS Visio, ER Studio, Star Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables
Application Servers: Apache Tomcat, WebSphere, Sonatype, WebLogic, JBoss
Other Tools: Azure Databricks, Azure Data Explorer, Azure HDInsight, Qlik Sense, Jira, ClearQuest, QlikView, Power BI, Veeva API, Angular, Tableau, Looker
Operating Systems: UNIX, Windows, Linux

CERTIFICATIONS: GCP Cloud Certified

Professional Experience:
Mayo Clinic, Rochester, MN March 2021 to Present
Lead GCP Data Engineer
Responsibilities:
Built multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.
Designed and implemented the various layers of the data lake and designed star schemas in BigQuery.
Created, managed, and configured Dataproc clusters in GCP.
Created real-time data streaming pipelines using GCP Pub/Sub and Dataflow.
Used GCP's secure storage services for healthcare data: Google Cloud Storage for scalable, durable object storage, and Google Cloud Bigtable and Firestore as NoSQL databases for structured and semi-structured data such as electronic health records (EHRs), medical imaging data, and genomics data.
Used BigQuery, a fully managed, serverless data warehouse, for analytics and reporting on healthcare (EHR) data.
Proficient in utilizing Apache Druid for real-time processing and analysis of healthcare data, ensuring immediate access to critical patient information.
Knowledge of maintaining HIPAA compliance and securing real-time healthcare data within Apache Druid, ensuring patient privacy and data integrity.
Proficient in integrating Apache Druid with EHR systems to provide real-time access to patient records and support seamless patient care.
Used Google Cloud Functions with Python to load data into BigQuery for CSV files arriving in a GCS bucket.
Worked with GCP services such as Cloud Storage, Compute Engine, App Engine, Cloud SQL, Cloud Functions, Cloud Run, Cloud Dataflow, Cloud Composer, Cloud Bigtable, and Pub/Sub to process data for downstream customers.
Configured Looker to connect to GCP data sources such as BigQuery, Cloud Storage, and other relevant services.
Used Looker, which is platform-independent, to connect to data in Google Cloud's BigQuery, Dataproc, and other public clouds.
Used Pub/Sub for real-time streaming ingestion, with Dataflow processing the data from a GCS bucket into BigQuery.
Created a data pipeline using Dataflow to import all the UDF files and Parquet files and load the data into BigQuery.
Used Dataflow templates for processing batch and streaming data.
Worked with product teams to create various store-level metrics and supporting data pipelines written on GCP's big data stack.
Worked on partitioning and clustering high-volume tables on fields in BigQuery to make queries more efficient.
Used Cloud Pub/Sub to build event-driven architectures in healthcare applications, ingesting, transforming, and analyzing real-time healthcare (EHR) data.
Used Apache Beam, which provides built-in connectors for various data storage and processing systems, making it easy to read from and write to different data sources.
Used Apache Beam on GCP to read and write data across GCP storage services, including Google Cloud Storage and BigQuery.
Executed Apache Beam pipelines using Google Cloud Dataflow as the runner (see the sketch at the end of this section).
Created and deployed Kubernetes clusters to Google Cloud, Created Docker images, and pushed to Google Cloud container registry using Jenkins
Used DataStage to integrate data from multiple sources such as databases, flat files, and applications.
Developed a real-time streaming data pipeline using Spring Boot and Spring Cloud Stream.
Used DataStage to create workflows that define the sequence and dependencies of ETL tasks.
Built DataStage jobs from stages and connectors representing data sources, transformations, and targets, defining the flow of data and its transformations.
Proficient in designing and implementing RESTful APIs and microservices using Spring Boot, ensuring optimal performance and security.
Parallelized DataStage jobs to take advantage of parallel processing, enabling high performance for large datasets.
Created a Dataproc cluster and used Apache Kafka for processing real-time streaming data.
Optimized and tuned Snowflake performance for efficient query execution.
Provide support for production DataStage jobs as needed.
Developed Streaming applications using PySpark to read from Kafka and persist the data in NoSQL databases such as HBase and Cassandra.
Created a Dataflow cluster and a temporary table in BigQuery. Then, imported all the files from the GCS bucket into the temporary table.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Dataflow, Pub/Sub, and BigQuery.
Worked on writing Terraform scripts from scratch for building Dev, Staging, Prod, and DR environments.
Implemented Spark SQL to connect to Hive to read data, and distributed the processing to make it highly scalable.
Developed and deployed the outcome using Spark and Python code on the Hadoop cluster running on GCP.
Migrated previously written cron jobs to Airflow/Composer in GCP.
Worked on querying data with Spark SQL on top of PySpark jobs to perform data cleansing and validation, applied transformations, and executed the programs using the Python API.
Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
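As referenced above, a hedged Apache Beam sketch of the GCS-to-BigQuery pattern, submitted with the Dataflow runner and writing to a day-partitioned, clustered table; the project, bucket, dataset, table, and CSV layout are hypothetical placeholders.

```python
# Sketch of a GCS -> BigQuery Beam pipeline run on Dataflow, writing to a
# day-partitioned, clustered table. Project, bucket, dataset, table, and the
# CSV layout are hypothetical placeholders.
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line):
    # Assumed CSV layout: event_date,patient_id,metric,value
    event_date, patient_id, metric, value = next(csv.reader([line]))
    return {"event_date": event_date, "patient_id": patient_id,
            "metric": metric, "value": float(value)}

options = PipelineOptions(
    runner="DataflowRunner",
    project="example-project",              # placeholder project
    region="us-central1",
    temp_location="gs://example-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadCSV" >> beam.io.ReadFromText("gs://example-bucket/raw/*.csv",
                                         skip_header_lines=1)
     | "Parse" >> beam.Map(parse_csv)
     | "WriteBQ" >> beam.io.WriteToBigQuery(
           "example-project:analytics.events",   # placeholder table
           schema="event_date:DATE,patient_id:STRING,metric:STRING,value:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           additional_bq_parameters={
               "timePartitioning": {"type": "DAY", "field": "event_date"},
               "clustering": {"fields": ["patient_id"]},
           }))
```

Partitioning on the date field and clustering on a frequently filtered column keeps BigQuery scans limited to the relevant partitions, in line with the partitioning and clustering work noted above.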
Environment: GCP, GKE, PySpark, DataStage, Looker, Bigtable, Scala, Kotlin, Tableau, Java, Apache Druid, Terraform, Qlik, Cloud SQL, Apache Airflow, Apache Beam, BigQuery, Cloud Dataflow, Cloud Composer, Dataproc, Pub/Sub, Kerberos, Snowflake, Jira, Confluence, Python, Git, Kafka, CI/CD (Jenkins), Kubernetes.

Ascena Retail Group, Patskala, Ohio December 2018 to February 2021
Senior GCP Data Engineer
Responsibilities:
Involved in analysing business requirements and prepared detailed specifications that follow project guidelines required for project development.
Lead data engineer for developing ETLs using Informatica Cloud services.
Used PySpark for DataFrames, ETL, data mapping, transformation, and loading in a complex, high-volume environment.
Ingested data into Druid through a variety of methods, including real-time streaming, batch ingestion, and integration with other data sources.
Worked with Apache Druid, which stores ingested data in a distributed, columnar format optimized for fast querying and aggregation, taking advantage of indexing and compression techniques to store and retrieve data efficiently.
Modified existing dimension data model by adding required dimensions and facts as per business process.
Worked on implementing scalable infrastructure and platform for large amounts of data ingestion, aggregation, integration, and analytics in Hadoop using Spark and Hive.
Used the PowerCenter Designer interface to design data transformations.
Involved in migrating the on-prem Hadoop system to GCP (Google Cloud Platform).
Worked on developing streamlined workflows using high-performance API services dealing with large amounts of structured and unstructured data.
Developed Spark jobs in Python to perform data transformation, creating Data Frames and Spark SQL.
Worked on processing unstructured data in JSON format into structured data in Parquet format by performing several transformations using PySpark.
Migrated previously written cron jobs to Airflow/Composer in GCP.
Built Scalding jobs to migrate the revenue data from BigQuery and HDFS; used the cloud replicator to run the BQMH jobs on a GCP Hadoop cluster and replicate the data to on-prem HDFS.
Developed Spark applications using spark libraries to perform ETL transformations, eliminating the need for ETL tools.
Developed the end-to-end data pipeline in Spark using Python to ingest, transform, and analyze data.
Integrated Tableau with ETL platforms to streamline data workflows.
Created Hive tables using HiveQL, then loaded the data into Hive tables and analyzed the data by developing Hive queries.
Worked on GCP for migrating data from an Oracle database to GCP.
Used Qlik Replicate to transfer data, applying transformations during replication, including mapping data from source to target, cleansing, filtering, and enrichment.
Experience working with product teams to create various store-level metrics and supporting data pipelines written on GCP's big data stack.
Experience with GCP Dataproc, Dataflow, Pub/Sub, GCS, Cloud Functions, BigQuery, Stackdriver, Cloud Logging, IAM, Data Studio for reporting, etc.
Created and executed Unit test cases to validate transformations and process functions are working as expected.
Wrote Python DAGs in Airflow that orchestrate end-to-end data pipelines for multiple applications (see the sketch at the end of this section).
Used an Azure Cosmos DB account as the basic unit of global distribution and high availability.
Globally distributed data and throughput across multiple Azure regions, adding or removing Azure regions from the Azure Cosmos DB account at any time.
Involved in setting up the Apache Airflow service in GCP.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Worked on scheduling the Control-M workflow engine to run multiple jobs.
Wrote shell scripts to automate application deployments.
Implemented solutions to switch schemas based on the dates so that the transformation would be automated.
Developed custom functions and UDFs in Python to incorporate methods and functionality of Spark.
Implemented Spark RDD transformations to map business logic and applied actions on top of the transformations.
Worked on data serialization formats for converting complex objects into sequences of bits using Parquet, Avro, JSON, and CSV formats.
Built business applications and data marts for reporting; involved in different phases of the development lifecycle, including analysis, design, coding, unit testing, integration testing, review, and release per the business requirements.
Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
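A minimal Airflow DAG sketch (Airflow 2.x style) of the ETL orchestration referenced above; the DAG id, schedule, and the callables each task wraps are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch for the ETL orchestration referenced above.
# DAG id, schedule, and the callables each task wraps are hypothetical.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # e.g. pull files from GCS or query a source system
    pass

def transform(**context):
    # e.g. run a PySpark/Dataproc or Beam/Dataflow job
    pass

def load(**context):
    # e.g. load the transformed data into BigQuery
    pass

default_args = {"retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="example_etl_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_transform >> t_load
```

In a Cloud Composer environment, a DAG like this is deployed by copying the file into the environment's DAGs bucket.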
Environment: Kafka, Impala, Spark, Sybase, GCP, BigQuery, Dataproc, Cosmos DB, DataStage, Scala, Informatica PowerCenter, Apache Airflow, Apache Beam, Cloud Shell, Tableau, Flink, SQL, Python, Hive, Spark SQL, MongoDB, TensorFlow, Jira.

Early Warning's, Scottsdale, AZ April 2016 to November 2018
Role: AWS Data Engineer
Responsibilities:
Loaded data into Spark RDDs and performed advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities in Scala to generate the output response.
Wrote Terraform scripts to automate AWS services, including ELB, CloudFront distributions, RDS, EC2, database security groups, Route 53, VPC, IAM, EBS, Lambda, ECS Fargate, API Gateway, subnets, security groups, AWS Glue, Auto Scaling, Kinesis, EMR, and S3 buckets in CloudFormation JSON templates, and converted existing AWS infrastructure to AWS Lambda deployed via Terraform and AWS CloudFormation.
Worked in AWS environment for development and deployment of custom Hadoop applications.
Developed serverless Python AWS Lambda functions with concurrency and multi-threading to speed up processing and execute callables asynchronously (see the sketch at the end of this section).
Experienced in data ingestion and query optimization using Apache Druid, with the ability to handle high data throughput, retain extensive datasets, and ensure low-latency query responses.
Proficient in real-time stream processing and analytics through the integration of Apache Flink with Apache Druid, enabling real-time data insights and decision-making capabilities.
Proficient in Apache Druid for versatile data ingestion, modeling, and optimization, capable of handling both batch and real-time data sources to support diverse use cases.
Ingested real-time streaming data using Apache Kafka as a data source for Apache Druid, providing a reliable and scalable message streaming platform.
Used Apache Druid for real-time analytics, while relying on traditional data warehouses such as Snowflake and Amazon Redshift for complex reporting and ad-hoc queries on structured data.
Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.
Experience writing scripts in Python (and Go), with familiarity with tools such as AWS Lambda, S3, EC2, Redshift, and PostgreSQL on AWS.
Developed Talend Bigdata jobs to load heavy volume of data into S3 data lake and then into Snowflake.
Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming.
Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time and persists into Cassandra.
Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
Built machine learning models to showcase big data capabilities using PySpark and MLlib.
Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in Maintaining the Hadoop cluster on AWS EMR.
Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
Collaborated with the data architect and Ab Initio developers to design the overall data model and data structures used in the ETL processes, defining data requirements, creating data mappings, and ensuring the integrity and consistency of the data throughout.
Set up and built AWS infrastructure resources such as VPC, EC2, S3, IAM, EBS, security groups, Auto Scaling, Kinesis, EMR, and RDS using CloudFormation JSON templates.
Used Spark SQL in Scala to read Parquet and Avro data and load Hive tables into Spark.
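A hedged sketch of the concurrent Lambda pattern mentioned above: a Python handler that fans I/O-bound S3 reads out across a thread pool; the bucket name and event shape are hypothetical placeholders.

```python
# Sketch of a concurrent Python Lambda handler: fetches a batch of S3 objects
# in parallel with a thread pool. Bucket name and event shape are hypothetical.
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import boto3

s3 = boto3.client("s3")
BUCKET = "example-raw-data"   # placeholder bucket

def process_object(key):
    """Download one object and return a small summary (line count here)."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return {"key": key, "lines": body.count(b"\n")}

def lambda_handler(event, context):
    # Keys to process could come from an S3 event, an SQS batch, etc.
    keys = [r["s3"]["object"]["key"] for r in event.get("Records", [])]

    results = []
    # Threads let the I/O-bound S3 calls overlap instead of running serially.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(process_object, k) for k in keys]
        for fut in as_completed(futures):
            results.append(fut.result())

    return {"statusCode": 200, "body": json.dumps(results)}
```

Threads help here because the work is dominated by network I/O; CPU-bound work would instead be split across multiple Lambda invocations.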
Environment: AWS, S3, Jenkins, Spark Core, Spark Streaming, SQL, Jira, Lambda, ECS Fargate, API Gateway, Ab Initio, DataStage, Scala, AWS Kinesis, Apache Airflow, VPC, Python, Kafka, Hive, EC2, Elasticsearch, Impala, Cassandra, Tableau, Talend, ETL, Linux

Grapesoft Solutions Hyderabad, India October 2014 to January 2016
Role: Data Engineer
Responsibilities:
Worked on understanding and analyzing client and business requirements.
Developed tools for monitoring and notification using Python (see the sketch at the end of this section).
Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modelling and data mining, machine learning and advanced data processing.
Experience in optimizing ETL workflows.
Good experience working with SerDes for Avro- and Parquet-format data.
Good experience with Hadoop data warehousing tools such as Hive and Pig, and involved in extracting data from these tools on the cluster using Sqoop.
Skilled in executing programming code for intermediate to complex modules following development standards; planning and conducting code reviews for changes and enhancements that ensure standards compliance and systems interoperability.
Hands-on experience in working on Job Tracker, Task Tracker, Name Node, Data Node, Resource Manager, Node Manager, Application Master, YARN, and MapReduce Concepts.
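A small sketch of the kind of Python monitoring and notification tooling mentioned above: poll a health endpoint and email an alert on failure; the endpoint URL, SMTP relay, and addresses are hypothetical placeholders.

```python
# Sketch of a simple monitoring/notification tool: poll a health endpoint and
# email an alert on failure. URL, SMTP relay, and addresses are placeholders.
import smtplib
import time
import urllib.request
from email.message import EmailMessage

HEALTH_URL = "http://example-service:8080/health"   # placeholder endpoint
SMTP_HOST = "smtp.example.com"                      # placeholder SMTP relay
ALERT_FROM = "alerts@example.com"
ALERT_TO = "oncall@example.com"

def check_health(url, timeout=5):
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def send_alert(subject, body):
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, ALERT_FROM, ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    while True:
        if not check_health(HEALTH_URL):
            send_alert("Service health check failed",
                       f"{HEALTH_URL} did not return HTTP 200")
        time.sleep(60)   # poll once a minute
```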
Environment: Python 3, Hadoop, HDFS, Java, HTML/CSS, Bigtable, Scala, Hive, HBase, ZooKeeper, Sqoop, MySQL, MapReduce, Tableau, Cassandra, YARN, XML, Jira, JSON, Lambda, ECS Fargate, API Gateway, RESTful web services, JavaScript, Flink, Pig scripts, Apache Spark, Linux, Git, Amazon S3, Jenkins, MongoDB, T-SQL, Eclipse.

Ceequence Technologies Hyderabad, India May 2013 to September 2014
Role: Data Engineer/ ETL Developer
Responsibilities:
Identified business, functional, and technical requirements through meetings, interviews, and sessions.
Defined the ETL mapping specifications and designed the ETL process to source data from source systems and load it into DWH tables.
Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion.
Developed a server-based web-traffic statistical analysis tool with RESTful APIs using Flask and Pandas (see the sketch at the end of this section).
Analyzed various types of raw files such as JSON, CSV, and XML with Python using Pandas, NumPy, etc.
Extensive experience integrating Informatica Data Quality (IDQ) with Informatica PowerCenter.
Experience with ETL workflow management tools such as Apache Airflow, with significant experience writing Python scripts to implement workflows.
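A compact sketch of the Flask-and-Pandas web-traffic statistics API noted above; the CSV file and its column names are hypothetical placeholders.

```python
# Sketch of a RESTful web-traffic statistics service built with Flask and
# Pandas. The CSV path and column names are hypothetical placeholders.
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed columns: timestamp, page, response_ms
traffic = pd.read_csv("web_traffic.csv", parse_dates=["timestamp"])

@app.route("/stats/summary")
def summary():
    """Overall hit count and latency statistics."""
    return jsonify({
        "hits": int(len(traffic)),
        "avg_response_ms": float(traffic["response_ms"].mean()),
        "p95_response_ms": float(traffic["response_ms"].quantile(0.95)),
    })

@app.route("/stats/by-page")
def by_page():
    """Hits per page, optionally limited with ?top=N."""
    top = int(request.args.get("top", 10))
    counts = traffic["page"].value_counts().head(top)
    return jsonify(counts.to_dict())

if __name__ == "__main__":
    app.run(port=5000)
```

For example, GET /stats/by-page?top=5 would return hit counts for the five busiest pages.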
Environment: AWS Glue, MS SQL, Teradata, ETL, Python, Hive, Scala, DataStage, Talend.

Educational Qualifications: Bachelor's in Computer Science from JNTUH, Hyderabad, 2013