Venkatesh M - Sr Data Engineer
[email protected]
Location: Dallas, Texas, USA
Relocation:
Visa: GC
Name: Venkatesh M
Email: [email protected] | Phone: 479 437 5798
Sr Big Data Engineer / Big Data Engineer / Sr Data Engineer / Data Engineer

BACKGROUND SUMMARY:
- Around 11+ years of IT experience in the design, development, maintenance and support of Big Data applications.
- Exposure to Spark, Spark Streaming, Spark MLlib and Scala, including creating and handling DataFrames in Spark with Scala.
- Hands-on experience with Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving results to output directories in HDFS (a PySpark sketch of this pattern follows the summary).
- Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming and Spark SQL.
- Experience developing and deploying IBM DataPower policies and configurations to enable secure and efficient connectivity between applications and systems.
- Hands-on experience installing and configuring Cloudera Apache Hadoop ecosystem components such as Flume, HBase, ZooKeeper, Oozie, Hive, Sqoop and Pig.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS in both directions using Sqoop.
- Stored data in AWS S3 (used like HDFS) and ran EMR programs on the stored data.
- Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS and Blob storage.
- Strong understanding of technologies used with DataPower such as Web Services Security (WSS), Simple Object Access Protocol (SOAP), Representational State Transfer (REST), and XML-related technologies.
- Hands-on experience in multiple domains such as Retail and Healthcare.
- Hands-on experience with CCL for retrieving and manipulating data within the Cerner Millennium suite of applications, which includes electronic health records (EHR) and clinical decision support.
- Experienced in implementing a log producer in Scala that watches for application logs, transforms incremental logs and sends them to a Kafka and ZooKeeper based log collection platform.
- Developed Python code to gather data from HBase and designed solutions implemented with Spark.
- Experience with job workflow scheduling and monitoring tools like Oozie, and good knowledge of ZooKeeper.
- Hands-on experience installing, configuring and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper and Flume.
- Strong experience and knowledge of real-time data analytics using Spark, Kafka and Flume.
- Hands-on experience capturing data from existing relational databases (Oracle, MySQL, SQL Server and Teradata) that provide SQL interfaces, using Sqoop.
- Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
- Experience working in cloud environments like GCP (Google Cloud Platform) and AWS.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics).
- Proficient in designing, developing and maintaining Ab Initio graphs and plans for ETL workflows.
- Demonstrated expertise in designing and executing data validation tests using QuerySurge to compare data between source and target systems, ensuring data accuracy and integrity.
- Experience creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Implemented a Continuous Delivery pipeline with Docker, GitHub and AWS.
- Implemented techniques in Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and Spark on YARN.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Leveraged QuerySurge's data masking capabilities to ensure data privacy and compliance with data protection regulations.
- Extensive usage of the Azure Portal, Azure PowerShell, Storage Accounts, Certificates and Azure Data Management.
- Virtualized servers using Docker for the test and dev environments, and automated configuration using Docker containers.
- Developed Scala scripts using DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Worked on the implementation of a log producer in Scala that watches for application logs, transforms incremental logs and sends them to a Kafka and ZooKeeper based log collection platform.
- Productionized models in a cloud environment, including automated processes and CI/CD pipelines.
- Worked with teams in setting up AWS EC2 instances using different AWS services like S3, EBS, Elastic Load Balancer, Auto Scaling groups, VPC subnets and CloudWatch.
- Proficient in writing and optimizing Kusto queries using KQL to extract, analyze and visualize data from large datasets.
- Experience working with SAN infrastructure to store and manage large volumes of data used in data pipelines and analytical processes.
- Developed Spark applications integrated with Python, Hive, GCP (BigQuery) and Hadoop on the backend.
- Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS) and MongoDB.
- Proficient in using Datagaps ETL Validator to conduct end-to-end data testing and validation of ETL pipelines.
- Created and maintained various shell and Python scripts for automating processes; optimized MapReduce code and Pig scripts and performed performance tuning and analysis.
- Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing with Pig and Hive.
- Developed cloud services including Jenkins and Nexus on Docker using Terraform.
- Designed and architected integration solutions using IBM App Connect Enterprise/IIB to connect disparate systems, applications and data sources, enabling seamless information flow and business process automation.
- Implemented an ESB pattern using IBM App Connect Enterprise/IIB, enabling the decoupling and mediation of services, simplifying integration complexity and facilitating service reuse.
- Developed business components using Spring Boot, Spring IoC, Spring AOP, Spring Annotations and Spring Cloud, and the persistence layer using Hibernate/JPA along with RESTful web services.
- Deployed Spring Boot based microservices in Docker containers using the Amazon EC2 Container Service on AWS.
- Thorough knowledge of the Software Development Life Cycle (SDLC) with a deep understanding of phases like requirements gathering, analysis, design, development and testing.
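A minimal PySpark sketch of the read-transform-write pattern referenced in the summary above (importing data from a source, applying Spark SQL/DataFrame transformations, and saving results to HDFS). The table, columns and paths are hypothetical placeholders, not taken from any specific project on this resume.

# Illustrative only: source layout, columns and HDFS paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-daily-aggregation").getOrCreate()

# Import data from a source system (here: a CSV landing zone on HDFS).
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///landing/orders/"))

# Transformations with the DataFrame API / Spark SQL functions.
daily_revenue = (orders
                 .filter(F.col("status") == "COMPLETED")
                 .withColumn("order_date", F.to_date("order_ts"))
                 .groupBy("order_date", "store_id")
                 .agg(F.sum("amount").alias("revenue"),
                      F.countDistinct("order_id").alias("order_count")))

# Save the results to an output directory in HDFS (Parquet, partitioned by date).
(daily_revenue.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("hdfs:///warehouse/analytics/daily_revenue/"))

spark.stop()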
TECHNICAL SKILLS:
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, Flink, YARN, Kafka, Flume, Sqoop, Impala, CI/CD, Oozie, ZooKeeper, Spark 2.0, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet and Snappy
Hadoop Distributions: Cloudera (CDH3, CDH4 and CDH5), Hortonworks, MapR and Apache
Languages: Java, Python, JRuby, SQL, HTML, DHTML, Scala, JavaScript
NoSQL Databases: Cassandra, MongoDB and HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts
XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB
Development Methodology: Agile, Waterfall
Development/Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j
Frameworks: Struts, Spring and Hibernate
App/Web Servers: WebSphere, WebLogic, JBoss and Tomcat
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
Cloud Technologies: AWS, Azure

WORK EXPERIENCE:

Client: ICF SemanticBits, Herndon, VA    Sep 2023 - Present
Role: Lead Big Data Engineer
Responsibilities:
- Developed Spark applications using Scala and Java and implemented an Apache Spark data processing project to handle data.
- Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
- Developed ADF pipelines to load data from on-premises sources to Azure cloud storage and databases; created pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Used Scala functional programming concepts to develop business logic.
- Developed programs in Java and Scala/Spark for data reformatting after extraction from HDFS for analysis.
- Developed Step Functions state machines for orchestrating complex ETL workflows, handling retries and error handling.
- Worked with product teams to create various store-level metrics and supporting data pipelines written in GCP's big data stack.
- Implemented CodeBuild projects with custom build environments and caching mechanisms for efficient builds.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Implemented applications with Scala along with Akka and the Play framework.
- Implemented advanced procedures like text analytics and processing using in-memory computing capabilities with Apache Spark written in Scala.
- Wrote Jenkins pipeline scripts (Groovy) and configured Jenkins jobs for data applications.
- Developed and maintained Jenkins pipelines for building, testing and deploying data applications using Scala and Spark.
- Built data pipelines using Azure Data Factory and Azure Databricks, loaded data to Azure Data Lake, Azure SQL Database and Azure SQL Data Warehouse, and controlled and granted database access.
- Used Databricks widgets to pass parameters at run time from ADF to Databricks, and used Azure Databricks to run Spark/Python notebooks through ADF pipelines (a notebook sketch follows this section).
- Created a star schema for drill-down reporting and created PySpark procedures, functions and packages to load data.
- Implemented automated testing, code quality checks and data quality validations as part of the Jenkins build process.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Moved data between GCP and Azure using Azure Data Factory.
- Migrated previously written cron jobs to Airflow/Composer in GCP.
- Performed dry runs for pipelines to test out PY23 config updates.
- Improved data quality and caught regressions early by integrating unit tests and data validation checks in the build process.
- Worked with Azure DevOps, interpreting Azure ARM template scripts and making required changes for project requirements.
- Developed analytical components using Scala, Spark, Apache Mesos and Spark Streaming.
- Used the Docker container system with Kubernetes integration.
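A minimal sketch of the ADF-to-Databricks parameter pattern noted above: the notebook reads run-time values through Databricks widgets and uses them in a small PySpark load step. The widget names, paths and target table are hypothetical illustrations, not the actual project notebook.

# Databricks notebook cell (illustrative only; `spark` and `dbutils` are provided
# by the Databricks runtime, and the widget names, path and table are assumptions).
from pyspark.sql import functions as F

# Widgets receive values passed from the ADF notebook activity at run time.
dbutils.widgets.text("run_date", "")
dbutils.widgets.text("source_path", "")

run_date = dbutils.widgets.get("run_date")
source_path = dbutils.widgets.get("source_path")

# Load the day's partition, stamp it, and append to a staging table.
df = (spark.read.format("parquet")
      .load(f"{source_path}/ingest_date={run_date}")
      .withColumn("processed_ts", F.current_timestamp()))

(df.write.format("delta")
 .mode("append")
 .saveAsTable("staging.daily_loads"))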
Client: Best Buy, San Jose, CA    Nov 2020 - Aug 2023
Role: Big Data Engineer
Responsibilities:
- Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.
- Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes.
- Responsible for architecting, designing, implementing and supporting cloud-based infrastructure and solutions in Amazon Web Services (AWS) and Microsoft Azure.
- Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Configured Ansible to manage AWS environments and automate the build process for core AMIs used by all application deployments, including Auto Scaling and CloudFormation scripts.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Used Ansible to manage systems configuration to facilitate interoperability between existing infrastructure and new infrastructure in alternate physical data centres or the cloud (AWS).
- Demonstrated experience in the design and implementation of statistical and predictive models.
- Developed a queryable state for Flink in Scala to query streaming data and enriched the functionality of the framework.
- Developed scripts using PySpark to push data from GCP to third-party vendors using their API framework.
- Worked with product teams to create various store-level metrics and supporting data pipelines written in GCP's big data stack.
- Hands-on experience installing, configuring and using Hadoop ecosystem components like Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Pig, ZooKeeper and Flume.
- Developed advanced PromQL queries to retrieve and aggregate metrics for analysis and troubleshooting.
- Implemented data profiling and data quality assessment processes to identify and remediate data issues.
- Designed and maintained data catalogs and dictionaries to provide a centralized view of data assets and their usage.
- Created implementation plans, infrastructure management, architecture plans and pricing based on Infrastructure as a Service or Platform as a Service in AWS and Azure.
- Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse (a window-function sketch follows this section).
- Worked with MapReduce programs, Pig scripts and Hive commands to deliver the best results.
- Designed and implemented efficient data models and schemas in Cosmos DB, optimizing data storage and query performance.
- Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
- Experience in moving data between Azure services using Azure Data Factory.
- Integrated Datagaps ETL Validator with ETL tools (e.g., Informatica, Talend, DataStage) to validate data transformations and mappings.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods including classification, regression and dimensionality reduction.
- Designed and developed Flink pipelines to consume streaming data from Kafka, applied business logic to messages, and transformed and serialized raw data.
- Responsible for building scalable distributed data solutions using Hadoop; involved in job management using the Fair Scheduler and developed job processing scripts using Oozie workflows.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend and pair RDDs.
- Used HBase/Phoenix to support front-end applications that retrieve data using row keys.
- Captured data daily from OLTP systems and various sources such as XML, Excel and CSV and loaded it using Talend ETL jobs.
- Generated reports and charts by connecting to different data sources like Excel and SQL Server and implemented drill-down functionality using Power BI.
- Expertise in using Docker to run and deploy applications in multiple containers with Docker Swarm and Weave.
- Involved in extensive data validation by writing several complex SQL queries, and involved in back-end testing and data quality issues.
- Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
- Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
- Developed merge scripts to UPSERT data into Snowflake from an ETL source.
Environment: HDFS, Hive, Spark, Kafka, Linux, Python, NumPy, Pandas, Tableau, GitHub, AWS EMR/EC2/S3/Redshift, Lambda, Pig, MapReduce, Cassandra, Snowflake, Unix, Shell Scripting, Git.
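A minimal sketch of the kind of moving-average calculation described above, done with a PySpark window function; the schema, 20-row window and S3 paths are hypothetical, and the RSI step is omitted for brevity.

# Illustrative only: column names, window size and S3 paths are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("moving-average-sketch").getOrCreate()

prices = spark.read.parquet("s3://example-bucket/stock_prices/")  # ticker, trade_date, close

# 20-row trailing window per ticker, ordered by trading date.
w = (Window.partitionBy("ticker")
     .orderBy("trade_date")
     .rowsBetween(-19, 0))

with_ma = prices.withColumn("ma_20", F.avg("close").over(w))

# Persist the enriched data for downstream loading into the warehouse.
with_ma.write.mode("overwrite").parquet("s3://example-bucket/stock_prices_ma/")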
Client: UBS, Weehawken, NJ    Mar 2018 - Oct 2020
Role: Big Data Engineer
Responsibilities:
- Developed a Crawlers Java ETL framework to extract data from the Cerner client's database and ingest it into HDFS and HBase for long-term storage.
- Created Oozie workflows to manage the execution of the Crunch jobs and vertical pipelines.
- Developed Kafka producers and Kafka consumers for streaming millions of events per second of streaming data.
- Optimized StreamSets pipelines for performance by tuning batch sizes, parallelism settings and resource allocation, resulting in reduced processing times and improved overall system efficiency.
- Implemented data governance policies and security measures within StreamSets pipelines to ensure data privacy, compliance with regulatory standards (such as GDPR or CCPA), and adherence to company data governance guidelines.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Designed, architected and supported the Hadoop cluster: Hadoop, MapReduce, Hive, Sqoop, Ranger, Presto as a high-performance SQL query engine, Druid for indexing, etc.
- For log analytics and better query response, used Kusto Explorer and created alerts using the Kusto Query Language.
- Used ZooKeeper to provide coordination services to the cluster.
- Experienced in managing and reviewing Hadoop log files.
- Experienced Python developer working with multiple Python frameworks and OOP (web, HTTP(S), testing, built-ins and various third-party modules).
- Used Databricks for encrypting data using server-side encryption.
- Implemented data governance policies, access controls and encryption mechanisms in CDP to ensure data security and compliance.
- Created a Kubernetes regional cluster for both Jenkins and Druid, integrated with Okta and Active Directory.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and pre-processing.
- Implemented monitoring solutions to track the health and performance of CDP clusters and conducted performance tuning for optimal resource utilization.
- Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Utilized the Cosmos DB change feed to enable real-time processing of data changes and updates.
- Reviewed the logical model with business users, the ETL team, DBAs and the testing team to provide information about the data model and business requirements.
- Designed and developed real-time data ingestion frameworks to fetch data from Kafka to Hadoop.
- Developed Airflow DAGs in Python by importing the Airflow libraries (a minimal DAG sketch follows this section).
- Ensured data quality by implementing data validation and cleansing processes in CDP, and maintained data lineage information.
- Performed data purging and applied changes using Databricks and Spark data analysis.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
- Used broadcast variables in Spark, effective and efficient joins, caching, and other capabilities for data processing.
- Extensive expertise in data warehousing on different databases, as well as logical and physical data modeling with tools like Erwin, PowerDesigner and ER/Studio.
- Involved in continuous integration of the application using Jenkins.
Environment: Ubuntu, Hadoop, Spark, PySpark, NiFi, Jenkins, Talend, Spark SQL, Spark MLlib, Pig, Python, Tableau, GitHub, AWS EMR/EC2/S3, and OpenCV.
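A minimal Airflow DAG sketch along the lines of the Airflow bullet above: it imports the Airflow libraries and wires two small tasks together. The dag_id, schedule and task logic are hypothetical examples (written with Airflow 2-style imports), not the production DAGs.

# Illustrative Airflow DAG; dag_id, schedule, paths and task logic are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def validate_landing_files(**context):
    # Placeholder validation step; a real check would inspect HDFS paths or row counts.
    print("validating landing files for", context["ds"])


default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="kafka_to_hadoop_ingest",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    load_to_hdfs = BashOperator(
        task_id="load_to_hdfs",
        bash_command="hdfs dfs -put /data/landing/{{ ds }} /raw/events/{{ ds }}",
    )
    validate = PythonOperator(
        task_id="validate_landing_files",
        python_callable=validate_landing_files,
    )

    load_to_hdfs >> validate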
Client: Gilead Sciences, Foster City, CA    Nov 2016 - Feb 2018
Role: Data Engineer
Responsibilities:
- Developed Spark scripts using Scala shell commands as per the requirements.
- Worked on analyzing the Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Spark and Kafka.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Built a Big Data analytical framework for processing healthcare data for medical research using Python, Java, Hadoop, Hive and Pig, and integrated R scripts with MapReduce jobs.
- Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
- Leveraged cloud and GPU computing technologies such as AWS and GCP for automated machine learning and analytics pipelines.
- Worked with product teams to create various store-level metrics and supporting data pipelines written in GCP's big data stack.
- Involved in functional testing, integration testing, regression testing, smoke testing and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig and Hive.
- Used a Customer Health Record (CHR) repository, a database of patient information collected from various clinical IT systems. A CHR is centralized and allows healthcare providers to quickly access patient information at the point of care; a CHR designed to hold data specifically for analytics is a clinical data warehouse.
- Worked on the CI/CD pipeline, integrating code changes into the Git repository and building with Jenkins.
- Utilized Kafka to capture and process near real-time streaming data (a structured streaming sketch follows this section).
Environment: AWS services, S3, EMR, Spark, Oozie, Teradata, Unix, TDCH, Python, PySpark, Scala.
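A minimal sketch of consuming near real-time data from Kafka with Spark Structured Streaming, in the spirit of the Kafka and Spark Streaming bullets in this resume; broker addresses, the topic name and output paths are hypothetical.

# Illustrative only: brokers, topic and paths are placeholders; requires the
# spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-events-stream").getOrCreate()

# Subscribe to a Kafka topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "clinical-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka keys/values arrive as bytes; cast to strings and keep the event timestamp.
parsed = events.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_ts"),
)

# Write the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///raw/clinical_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/clinical_events/")
         .outputMode("append")
         .start())

query.awaitTermination()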
Client: Valley Infosystems Pvt Ltd, Bangalore, India    Jun 2014 - Aug 2016
Role: Data Engineer
Responsibilities:
- Worked on the development of data ingestion pipelines using the Talend ETL tool and bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark and Kafka.
- Experience developing scalable and secure data pipelines for large datasets.
- Gathered requirements for the ingestion of new data sources, including life cycle, data quality checks, transformations and metadata enrichment.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS (a small pyspark.ml sketch appears at the end of this document).
- Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
- Managed the S3 data lake.
- Responsible for maintaining and handling data inbound and outbound requests through the big data platform.
- Used Sqoop to transfer data between relational databases and Hadoop.
Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.

Client: SumTotal Systems, Hyderabad, India    Jun 2012 - May 2014
Role: Hadoop Developer
Responsibilities:
- Installed, configured and maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, ZooKeeper and Sqoop.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Installed and configured Sqoop to import and export data into Hive from relational databases.
- Used Python and SAS to extract, transform and load source data from transaction systems and generated reports, insights and key conclusions.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS.
Environment: Hadoop YARN, Spark, Spark Streaming, Spark SQL, Scala, Pig, Python, Hive, Sqoop, MapReduce, NoSQL, HBase, Tableau, Oracle, Linux
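An illustrative predictive-analytics sketch with pyspark.ml, in the spirit of the machine learning modules mentioned under Valley Infosystems; the input path, feature columns and label are hypothetical, not from an actual client dataset.

# Illustrative only: paths, feature names and the label column are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("predictive-model-sketch").getOrCreate()

# Training data with numeric features and a binary "label" column.
data = spark.read.parquet("hdfs:///analytics/training_data/")

assembler = VectorAssembler(
    inputCols=["tenure_days", "monthly_spend", "support_tickets"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Evaluate on the held-out split (area under the ROC curve by default).
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")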