
Data Engineer with Informatica and Control M, 100% remote, US citizen / H4 EAD, at Remote, Remote, USA
Email: [email protected]
From:

Muskan,

InfoTech Spectrum

[email protected]

Reply to: [email protected]

Data Engineer (2 roles)

100% remote

6+ months

No sponsorship available

Must Have

SQL Development

Informatica

Control M development

Environment

Development work and data engineering work

Understand the basics of data engineering

Can train on this:

Data warehousing and finance details

Knowing what ETL does

What a Type 2 table (SCD) is

Cannot train on this:

SQL development (past candidates were missing SQL development skills)

Control M

Used to schedule programs

Informatica

Used for ETL processes

Role

Need someone to hit the ground running.

SQL Server coding and Control M

If you open Control M, what do you see?

Snowflake and Python would be great

Using Snowflake in the future, but mostly SQL

Sources are flat files or SQL Server

Nice to Have

Snowflake

Kafka

Python

Interview Process

Divy and Vasant interviewing; CC Sujata

Interview Times

Block 1 hour

Tues/Thurs afternoon

3-4pm

Can have this block for 4 weeks

Past candidates

Missing fundamentals

Can't answer basic questions

Suggested Screening Questions:

Can you describe a complex SQL query you've written in the past to solve a data engineering problem? Walk me through your thought process, including how you optimized the query for performance and handled potential pitfalls like deadlocks or inefficient joins.

A complex SQL query I worked on involved aggregating large datasets from multiple tables to generate monthly financial reports. The query included several JOINs across tables with millions of rows, subqueries, and window functions.

To optimize the query, I:

Analyzed Execution Plan: I reviewed the execution plan to identify slow-performing parts, such as unnecessary full table scans or inefficient JOINs.

Indexing: I added indexes on the columns used in WHERE, JOIN, and GROUP BY clauses. This significantly reduced the query execution time.

Query Restructuring: I rewrote subqueries to use JOINs where appropriate, reducing the number of nested queries.

Handling Deadlocks: To avoid deadlocks, I ensured that transactions were short and that resources were accessed in a consistent order. I also used the SET TRANSACTION ISOLATION LEVEL to minimize lock contention.

The optimized query reduced the execution time from over 10 minutes to under 2 minutes, making the reports much more responsive for end-users.
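As a concrete illustration of the kind of query and indexing described above, a minimal sketch follows. The table and column names (fact_transactions, dim_account, txn_date, amount) are assumptions for illustration, not details from the posting.

-- Minimal sketch only: table and column names are assumed for illustration.
-- Monthly totals per account with a running year-to-date figure (JOIN + GROUP BY + window function).
SELECT
    a.account_id,
    a.account_name,
    DATEFROMPARTS(YEAR(t.txn_date), MONTH(t.txn_date), 1) AS month_start,
    SUM(t.amount) AS monthly_total,
    SUM(SUM(t.amount)) OVER (
        PARTITION BY a.account_id
        ORDER BY DATEFROMPARTS(YEAR(t.txn_date), MONTH(t.txn_date), 1)
        ROWS UNBOUNDED PRECEDING
    ) AS running_total
FROM dbo.fact_transactions AS t
JOIN dbo.dim_account AS a
    ON a.account_key = t.account_key
WHERE t.txn_date >= '2024-01-01'
  AND t.txn_date <  '2025-01-01'
GROUP BY a.account_id, a.account_name,
         DATEFROMPARTS(YEAR(t.txn_date), MONTH(t.txn_date), 1);

-- Supporting index on the filter/join columns, as described above.
CREATE NONCLUSTERED INDEX ix_fact_transactions_date_account
    ON dbo.fact_transactions (txn_date, account_key)
    INCLUDE (amount);

Keeping the date filter sargable (no function wrapped around txn_date in the WHERE clause) is what lets the index above serve the range scan.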

When you open Control M, what key components or elements do you typically see, and how would you go about scheduling a new ETL job using this tool? How would you handle dependencies and error handling?

When I open Control M, the key components I typically see include the Job Definitions, Calendars, Folders, Dependencies, and Alerts.

To schedule a new ETL job:

Create a New Job: I would start by defining a new job, specifying the script or executable that runs the ETL process.

Set Job Parameters: I would configure job parameters, such as the execution time, frequency (e.g., daily, weekly), and any environment variables.

Manage Dependencies: I would set up dependencies to ensure the ETL job only runs after all prerequisite jobs have completed successfully. This might include upstream data ingestion jobs or database availability checks.

Error Handling: I would configure error handling by setting up alerts to notify the team if the job fails. Additionally, I would define recovery actions, such as restarting the job from a specific step or triggering a secondary job to handle failures.

This approach ensures that the ETL job runs smoothly, with minimal manual intervention required in case of errors.

Explain a scenario where you had to use Informatica to transform large datasets from multiple sources into a data warehouse. How did you design the ETL process, and what challenges did you face with data integration, transformations, or performance?

In one project, I used Informatica to extract data from multiple sources, including flat files, SQL Server, and an API. The goal was to load this data into a central data warehouse.

ETL Process Design:

Extraction: I configured Informatica to extract data from each source, handling different formats and connection types.

Transformation: I applied several transformations, such as data cleansing, type conversions, and aggregations. For example, I used the Expression and Filter transformations to standardize date formats and remove invalid records.

Loading: Finally, I loaded the transformed data into the data warehouse, ensuring it adhered to the required schema.

Challenges:

Data Integration: Merging data from heterogeneous sources was challenging due to differences in data types and structures. I resolved this by using Informatica's transformation capabilities to normalize the data.

Performance: Processing large volumes of data posed performance issues. I optimized the process by using partitioning, parallel processing, and tuning the Informatica server parameters, which reduced the ETL time by 30%.
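Informatica mappings are built graphically in the Designer rather than written as code, but roughly equivalent cleansing logic, expressed in SQL purely for illustration, would look like the sketch below. The staging table and column names are assumptions.

-- Illustration only: stg.transactions_raw and its columns are assumed names.
-- Same intent as the Expression/Filter transformations above:
-- standardize dates, cast amounts, and drop invalid records.
SELECT
    LTRIM(RTRIM(s.customer_id))               AS customer_id,
    TRY_CONVERT(date, s.txn_date_raw, 101)    AS txn_date,   -- assumes mm/dd/yyyy source text
    TRY_CONVERT(decimal(18, 2), s.amount_raw) AS amount
FROM stg.transactions_raw AS s
WHERE TRY_CONVERT(date, s.txn_date_raw, 101) IS NOT NULL      -- filter out unparseable dates
  AND TRY_CONVERT(decimal(18, 2), s.amount_raw) IS NOT NULL;  -- filter out unparseable amounts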

Given that this role involves a significant amount of data engineering work, can you explain what a Type 2 Slowly Changing Dimension (SCD) is and how you would implement it in an ETL process? Why is it important in a data warehousing context?

A Type 2 Slowly Changing Dimension (SCD) is a method used in data warehousing to track historical changes in dimension data. Unlike a Type 1 SCD, which simply overwrites old data, a Type 2 SCD keeps a history of changes, enabling you to see how a dimension has evolved over time.

Implementation in an ETL Process:

Source Data Extraction: Extract the current data from the source system.

Comparison: Compare the extracted data with the existing records in the dimension table to identify new or changed records.

Insert/Update Logic:

Insert: For new records, insert them into the dimension table with a new surrogate key, along with start and end dates.

Update: For changed records, update the existing record's end date and insert the new version of the record with a new surrogate key and a new start date.

Load: Finally, load the transformed data into the dimension table.

Importance in Data Warehousing:

Type 2 SCD is crucial for maintaining a history of changes, allowing analysts to perform time-based analyses. For instance, in financial reporting, it's important to know not just the current status of a customer but also how their status has changed over time.

By implementing Type 2 SCD, we ensure that our data warehouse can provide accurate and comprehensive insights into historical trends and patterns.
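As a sketch of the insert/update logic above, one common SQL Server pattern for a Type 2 dimension is shown below. The names dim_customer, stg_customer, and their columns are assumptions, the surrogate key is assumed to be an IDENTITY column, and '9999-12-31' is used as the open-ended end date by convention.

-- Sketch only: table and column names are assumptions for illustration.

-- Step 1 (Update): close out the current version of any customer whose tracked attributes changed.
UPDATE d
SET    d.end_date   = CAST(GETDATE() AS date),
       d.is_current = 0
FROM   dw.dim_customer  AS d
JOIN   stg.stg_customer AS s
       ON s.customer_id = d.customer_id
WHERE  d.is_current = 1
  AND (s.customer_status <> d.customer_status OR s.segment <> d.segment);

-- Step 2 (Insert): add brand-new customers and the new version of changed customers.
-- The surrogate key (assumed IDENTITY column on dim_customer) is generated automatically.
INSERT INTO dw.dim_customer
    (customer_id, customer_status, segment, start_date, end_date, is_current)
SELECT s.customer_id, s.customer_status, s.segment,
       CAST(GETDATE() AS date), '9999-12-31', 1
FROM   stg.stg_customer AS s
LEFT JOIN dw.dim_customer AS d
       ON d.customer_id = s.customer_id
      AND d.is_current  = 1
WHERE  d.customer_id IS NULL;  -- after Step 1, changed customers have no current row, so they are re-inserted here along with new customers

Running the expiry update before the insert keeps the logic to two set-based statements and sidesteps the limitation that a single MERGE cannot both update the old row and insert the new version for the same source record.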


Thank You

Best Regards

Muskan Sharma

InfoTech Spectrum Inc

2060 Walsh Ave #120, Santa Clara, CA 95050

Email: [email protected]

Web: www.infotechspectrum.com
