Vishal - Sr. Data Engineer
[email protected]
Location: Plano, Texas, USA
Relocation: Yes
Visa: GC |
Ahalya
Bench Sales Recruiter | Vdrive IT Solutions Inc. | 800 E Campbell Road, Suite # 157 | Richardson, TX 75081
Office: +1 (469)-988-5899 | [email protected]
NAME - VISHAL | VISA - GC | CURRENT LOCATION - TX

Profile
Experienced professional with over 9 years of expertise in data preprocessing, data collection, market research, data visualization, and a wide range of data analytics tools. Skilled across the software development cycle, including build processes, testing, machine learning, data mining, and foundational coding.
------------------------------------------------------------------------------------------------------------------------------------------
Professional Experience
Over 9 years of IT experience in Data Analytics and Engineering built on Data Science tools.
Hands-on experience with components such as Spark, Kafka, Flume, Sqoop, Oozie, Zookeeper, and Solr.
Knowledge of migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, including managing and provisioning database access and transferring on-premises databases to Azure Data Lake Store using Azure Data Factory.
Experience developing Spark applications on Databricks (AWS) using Spark SQL to extract, transform, and aggregate data from diverse file formats and surface insights into customer usage patterns.
Proficient in designing and implementing solutions using the ELK stack (Elasticsearch, Logstash, Kibana), with a strong emphasis on Kibana for building smart dashboards.
Demonstrated expertise in using the ELK stack and Kibana to gather, analyze, and visualize data for actionable insights.
Hands-on experience integrating tools through Web Services/APIs, enabling seamless data flow and enhancing overall data analysis capabilities.
Proven ability to perform data mining and apply advanced analytics techniques for effective decision-making and problem-solving.
Practical experience troubleshooting database issues and connecting to SQL databases through the installation and configuration of various Python packages.
Experienced in Python programming for data analysis and machine learning models using libraries such as Pandas, NumPy, scikit-learn, the SciPy stack, Matplotlib, and Seaborn.
Experience designing and building ETL pipelines for data processing.
Used Spark SQL and Spark Streaming contexts to run actions and transformations on RDDs, DataFrames, and Datasets.
Experience working with Google Cloud Platform IAM to provide access to APIs and GCP applications within the enterprise.
Set up Spark Streaming to consume real-time data from messaging systems such as Apache Kafka.
Worked on data models and dimensional modeling for OLAP using 3NF, Star, and Snowflake schemas.
Strong grasp of data warehousing concepts, fact tables, dimension tables, and Star and Snowflake schema modeling.
Solid understanding of CI/CD pipelines and tools such as Jenkins and Git for releasing code changes more frequently to improve collaboration and quality.
Installed and configured container and orchestration tools such as Docker and Kubernetes.
Familiar with machine learning algorithms and R for statistics, including regression models such as linear and logistic regression and statistical methodologies such as hypothesis testing, ANOVA, and time series analysis.
Worked with deep learning techniques such as LSTM and CNN, as well as TensorFlow, Keras, and OpenCV.
Alteryx platform experience, including data preparation, data blending, and building data models and data sets.
Designed and built interactive models in Tableau, including reports and dashboards over large data volumes.
Knowledge of analyzing data from many sources and developing reports with interactive dashboards in Power BI.
Extensive experience building Power BI reports, scorecards, and dashboards.
Experience with analysis and visualization technologies such as Kibana and Splunk.
Proven expertise in designing web applications using Model View Controller (MVC) architecture and Python web frameworks such as Django, Flask, and Pyramid.
Extensive familiarity with Informatica PowerCenter, SAS, and SSIS.
Solid understanding of Amazon AWS services such as EMR and EC2, which enable fast and efficient Big Data processing.
A strong team player with the capacity to quickly learn and adapt to new technologies.
------------------------------------------------------------------------------------------------------------------------------------------
Education
Jawaharlal Nehru Technological University | Hyderabad, India | May 2014
Bachelor of Technology in Computer Science and Engineering | 87%
------------------------------------------------------------------------------------------------------------------------------------------
Skills
Programming Languages: Python (NumPy, Pandas, scikit-learn, Matplotlib, SciPy, Seaborn), Java, C, C++, Scala, JavaScript
Web Technologies: JDBC, HTML5, DHTML, XML, CSS3, Web Services, WSDL
Hadoop Ecosystem: HDFS, MapReduce, YARN, DataNode, NodeManager, Pig, Sqoop, HBase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala, ZooKeeper
Apache Tools: Apache Spark, Apache Kafka, Apache Airflow
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform
AWS Cloud: Amazon EC2, Amazon EMR, AWS Lambda, AWS Glue, Amazon S3, Amazon Athena, Amazon Redshift
Data Warehouse / Data Lake: Microsoft SQL Server Management Studio, MySQL, Amazon Redshift, Snowflake, Informatica PowerCenter, PowerMart, Data Quality, Big Data
Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL, HBase, Cassandra, MongoDB, PostgreSQL
ETL & BI Tools: AWS Glue, Azure Data Factory, Pentaho ETL, Informatica, Talend, Power BI, Tableau, advanced MS Excel
Agile Software Management Tools: JIRA & Google Workspace
Version Control Tools: SVN, GitHub, Bitbucket
Operating Systems: Windows, Linux, Unix & macOS
------------------------------------------------------------------------------------------------------------------------------------------
Work Experience
Citi | Jacksonville, Florida | Remote | Jan 2023 - Present
Data Engineer
Held bi-weekly group meetings with the supervisor to resolve blockers, resulting in a 70% improvement in application throughput time.
Identified field- and column-level data quality issues in the source systems to drive data quality, data cleansing, and error checking in the ETL packages.
Designed and developed ETL packages using SQL Server Integration Services (SSIS) 2008 to load data from different source systems.
Generated reports and charts using SQL Server Reporting Services (SSRS).
Designed, built, and maintained large-scale big data solutions using Azure services, PySpark, and Spark SQL.
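To illustrate the kind of PySpark and Spark SQL work described in the bullet above, the snippet below is a minimal, hypothetical sketch rather than project code: it reads Parquet data from an ADLS Gen2 path, runs a Spark SQL aggregation, and writes the result back. The storage account, container, paths, and column names are invented for illustration.

# Minimal, hypothetical PySpark / Spark SQL sketch; storage account,
# container, paths, and column names are invented for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Read raw events from an ADLS Gen2 location (abfss path, as used on Databricks).
events = spark.read.parquet(
    "abfss://raw@examplestorage.dfs.core.windows.net/events/")
events.createOrReplaceTempView("events")

# Aggregate daily usage per customer with Spark SQL.
daily_usage = spark.sql("""
    SELECT customer_id,
           to_date(event_ts) AS event_date,
           COUNT(*)          AS event_count
    FROM events
    GROUP BY customer_id, to_date(event_ts)
""")

# Write the curated result back to the lake as Parquet.
daily_usage.write.mode("overwrite").parquet(
    "abfss://curated@examplestorage.dfs.core.windows.net/daily_usage/")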
Extracted, transformed, and loaded data across heterogeneous sources and destinations such as Access, Excel, CSV, Avro, Parquet, and flat files using the connectors, tasks, and transformations provided by Azure Synapse and ADF.
Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization.
Created complex data transformations and data cleansing processes within Matillion to prepare data for analysis.
Developed orchestration workflows and job scheduling within Matillion to automate ETL processes, ensuring data stays up to date and accurate.
Ensured data security and compliance by implementing encryption, access controls, and data masking within Matillion.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
Worked on data ingestion to Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools, and back again.
Extensive experience with the ELK stack and Kibana, designing and implementing solutions and creating smart dashboards.
Proficient in real-time log aggregation, analysis, and querying using the ELK stack and Kibana.
Skilled in integrating tools through Web Services/APIs to enhance data analytics capabilities.
Hands-on expertise in data analytics, including data mining and test data management.
Migrated SQL databases to Azure Data Lake Analytics, Azure Data Lake, and Azure SQL Database, managed and granted database access, and migrated on-premises data sets to Azure Data Lake Store using Azure Data Factory.
Environment: Apache Hadoop, CDH 4.7, HDFS, MapReduce, AWS, Azure, Sqoop, Flume, Pig, Hive, HBase, Oozie, Scala, Spark, Spark Streaming, Kafka, Linux
------------------------------------------------------------------------------------------------------------------------------------------
FIS | Jacksonville, Florida | On-Site | Sept 2020 - Dec 2022
Big Data Engineer
Worked as a Data Engineer; designed and modified database tables and used HBase queries to insert and fetch data from tables.
Developed automated job flows that run daily (and on demand) through Oozie, which executes MapReduce jobs internally.
Used SSIS and T-SQL stored procedures to transfer data from OLTP databases to a staging area and finally into the data mart.
Worked on Amazon AWS services such as EMR and EC2 for fast and efficient processing of Big Data.
Experience designing and building RESTful APIs using Java and Python.
Proficient in API development tools such as Swagger and Postman.
Experience developing web applications on Java application servers such as WebLogic, WebSphere, and Tomcat.
Experience using JavaScript toolkits such as Dojo for developing rich client-side applications.
Proficient in Apache Spark for data processing and analysis.
Experience using the Scala programming language to build scalable data applications.
Designed DataStage ETL jobs to extract data from heterogeneous source systems, transform it, and load it into the data warehouse.
Created data models for AWS Redshift and Hive from dimensional data models.
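As a small illustration of translating a dimensional model into a Hive-side table like the bullet above describes, the sketch below issues a hypothetical fact-table DDL through Spark SQL with Hive support; the database, table, and column names are invented and do not come from the project.

# Hypothetical sketch: materializing one fact table from a dimensional
# model as a partitioned, Parquet-backed Hive table via Spark SQL.
# Database, table, and column names are invented for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dimensional-model")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.fact_transactions (
        transaction_id BIGINT,
        customer_key   INT,
        product_key    INT,
        amount         DECIMAL(18,2)
    )
    PARTITIONED BY (transaction_date DATE)
    STORED AS PARQUET
""")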
Executed change management processes surrounding new releases of SAS functionality.
Worked with Sqoop to transfer data between HDFS and relational databases such as MySQL in both directions, with experience using Talend for the same purpose.
Collaborated with ETL and DBA teams to analyze and resolve data issues and other challenges while implementing the OLAP model.
Designed and developed PL/SQL procedures, functions, and packages to create summary tables.
Worked with OLTP systems to identify daily transactions, the types of transactions occurring, and the resources consumed.
Environment: Hadoop, HDFS, HBase, SSIS, SSAS, OLAP, Hortonworks, OLTP, ETL, Java, AWS, T-SQL, MySQL, Sqoop, Cassandra, MongoDB, Hive, SQL, PL/SQL, Oracle 11g, Teradata
------------------------------------------------------------------------------------------------------------------------------------------
Zaniboni Lighting | Clearwater, Florida | On-Site | Nov 2018 - Aug 2020
PySpark Data Engineer
Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
Experience setting up, configuring, and managing Hadoop clusters.
Proficient in the Hadoop Distributed File System (HDFS) and the MapReduce programming model.
Experience with Hive QL for querying and analyzing data.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL Activity.
Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
Worked on Snowflake schemas and data warehousing.
Enabled Python scripts to explode, parse, and de-dupe JSON from Kafka and land it in HDFS.
Built Kafka monitoring scripts to monitor Kafka loads into the Hadoop cluster.
Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala, as well as in NoSQL databases such as HBase and Cassandra.
Environment: Hadoop, Cloudera, MapReduce, Kafka, Impala, Spark, Snowflake, Pig, Hive, Sqoop, Java, Scala, Cassandra, SQL, Tableau, Zookeeper, Teradata, Red Hat Linux, and Oracle 12c.
------------------------------------------------------------------------------------------------------------------------------------------
Menlo Technologies | Hyderabad, India | Sept 2016 - Oct 2018
Data Analyst/Modeler
Performed root cause analysis on smaller, self-contained data analysis tasks related to assigned data processes.
Collaborated with a team of Business Analysts to ensure all requirements were captured.
Worked on data profiling, data cleansing, data mapping, and data quality.
Worked with data investigation, discovery, and mapping tools to scan every data record from many sources.
Used ER/Studio to create conceptual, logical, and physical data models.
Designed the Physical Data Model (PDM) using ER/Studio and Oracle PL/SQL.
Coordinated with different data providers to source data, built Extraction, Transformation, and Loading (ETL) modules to load data from source to stage, and performed source data analysis.
Supported end-user reporting requirements by developing reporting solutions on Cognos.
Performed data cleaning and data manipulation activities using the NZSQL utility.
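The profiling and cleansing work in this role was done with NZSQL and ER/Studio; purely as a generic illustration of the kinds of steps involved, the hypothetical pandas sketch below reports null and duplicate counts, trims text fields, normalizes a date column, and drops duplicates. The file path and column names are invented.

# Illustrative, hypothetical pandas sketch of routine profiling and
# cleansing steps (the actual work in this role used NZSQL / ER/Studio).
# File path and column names are invented.
import pandas as pd

records = pd.read_csv("customer_records.csv")

# Profile: row count, null counts per column, and duplicate key count.
print(len(records))
print(records.isna().sum())
print(records.duplicated(subset=["customer_id"]).sum())

# Cleanse: trim text fields, normalize the date type, drop duplicate keys.
records["customer_name"] = records["customer_name"].str.strip()
records["signup_date"] = pd.to_datetime(records["signup_date"], errors="coerce")
cleaned = records.drop_duplicates(subset=["customer_id"])

cleaned.to_csv("customer_records_clean.csv", index=False)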
Created SSIS packages to load data from flat files, Excel, and Access into SQL Server using connection managers.
Developed and maintained a data dictionary to create metadata reports for technical and business purposes.
Handled performance requirements for databases in OLAP models.
Used MS Power BI as a reporting and visualization tool.
Implemented the Oracle partitioning feature for large tables and indexes.
Produced various types of reports using SQL Server Reporting Services (SSRS).
Created stored procedures using PL/SQL and tuned the databases and backend processes.
Imported and cleansed high-volume data from various sources.
Designed and developed a data warehouse using T-SQL and SQL.
Used SQL to query the database in a UNIX environment.
Environment: ER/Studio, Oracle, SQL, PL/SQL, SSIS, T-SQL, UNIX, Tableau.
------------------------------------------------------------------------------------------------------------------------------------------
GreyCampus | Hyderabad, India | June 2014 - Aug 2016
Hadoop Developer
Overwrote Hive data with HBase data daily to keep the data fresh, and used Sqoop to load data from DB2 into the HBase environment.
Implemented automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive, and Pig.
Created Hive, Phoenix, and HBase tables, as well as HBase-integrated Hive tables, per the design, using the ORC file format and Snappy compression.
Extensively used the DB2 database to support SQL workloads.
Performed data validation and transformation using Python and Hadoop Streaming (a brief mapper sketch follows this section).
Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and into partitioned Hive tables.
Performed data transformations such as filtering, sorting, and aggregation using Pig.
Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
Created Sqoop jobs to import data from SQL, Oracle, and Teradata into HDFS.
Designed a highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and the Parquet file format.
Created Hive tables to push data to MongoDB.
Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop Data Lake.
Environment: Hadoop, HDFS, Hive, Pig, DB2, Java, Python, Oracle 9i, SQL, Splunk, Unix, Shell Scripting.
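Purely as an illustration of the Hadoop Streaming validation/transformation pattern referenced in this section, below is a minimal, hypothetical Python mapper; the field layout and validation rules are invented. In practice such a script would be passed to the hadoop-streaming jar as the mapper, with a companion reducer (omitted here) completing the job.

# Minimal, hypothetical Hadoop Streaming mapper sketch: validates
# pipe-delimited records read from stdin and emits cleaned, tab-separated
# key/value pairs. Field layout and validation rules are invented.
import sys

EXPECTED_FIELDS = 4  # e.g. record_id|customer_id|amount|event_date

for line in sys.stdin:
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        continue  # drop malformed records
    record_id, customer_id, amount, event_date = fields
    try:
        amount_value = float(amount)  # validate the numeric field
    except ValueError:
        continue  # drop records with a non-numeric amount
    # Emit customer_id as the key and a normalized value payload.
    print(f"{customer_id}\t{record_id},{amount_value:.2f},{event_date}")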