Data Engineering Questions & Training

General
1. What scales of data have you worked with in the past?
2. How do you generally work with the departments that make use of your data?
3. Tell me about a time you had performance issues with an ETL. How did you identify this as a performance issue and how did you fix it?
4. Describe a time when you found a new use case for an existing database.
5. Describe the most challenging project you’ve worked on. What was your role?
6. Think back to a project you’re proud of. What was it that gave you that sense of pride and accomplishment?
7. What do you consider to be one of the biggest mistakes you’ve ever made in your previous job?
8. Explain the differences between stream processing and batch processing, with one caveat: pretend that I’m not familiar with data at all.
9. What are the considerations when choosing a method for ingesting data into BigQuery?
10. Have you worked with data science teams? What were your responsibilities?
11. What are the considerations when choosing between Spark and BigQuery?
12. What are the differences between ETL and ELT? (See the articles by Guru99 / David Taylor and by Mark Smallcombe.)

CAP Theorem
1. What Is CAP Theorem?
2. Can you 'get around' or 'beat' the CAP Theorem?
3. Name some types of Consistency patterns
4. What Do You Mean By High Availability (HA)?
5. What are A and P in CAP and the difference between them?
6. What does the CAP Theorem actually say?
7. How does it affect real-world applications? (In the real world, latency is availability.)
8. Another great resource for CAP questions
Explain the difference between NoSQL databases {MongoDB | DynamoDB | ...} and relational databases {PostgreSQL | MySQL}, and the reasons to choose one over the other. Give an example of a project where you had to make this choice, and walk through your reasoning.

(This question can be adapted to the relevant technologies.)

Streaming vs Batch “Explain the difference and the reasons to choose Streaming over Batch and vice versa. Give an example of a project where you had to make this choice, and walk through your reasoning.”
Job vs Service “Explain the difference and the reasons to choose a Job over a Service and vice versa. Give an example, in the context of ML pipelines, of a project where you had to make this choice, and walk through your reasoning.”
Athena
1. What is the engine behind Athena?
2. How is Presto different from Spark? How does it affect your query planning?
3. Performance tuning - Top 10: partitioning, bucketing, compression, optimize file sizes, optimize columnar data store generation, query tuning, optimize order by, optimize group by, use approx functions, column selection. What are the tradeoffs (time vs cost)?
4. What is the cost composed of?
5. How can you calculate cost?
6. How can you optimize your queries (partitions, join order, limit tricks, etc.)?
7. What options do you have to limit the cost of Athena?
8. When would you use Athena vs. Spark?
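For the cost questions above, a rough estimator can make the discussion concrete. This is a minimal sketch, assuming on-demand billing by data scanned; the price per TB, the per-query minimum, and the uniform-distribution estimate are all assumptions for illustration and should be checked against the current AWS pricing page for your region.

```python
# Assumptions (not from the question list): on-demand price per TB scanned
# and a per-query minimum. Verify both against current AWS pricing.
PRICE_PER_TB_USD = 5.00
MIN_BYTES_PER_QUERY = 10 * 1024**2
BYTES_PER_TB = 1024**4


def estimate_scanned_bytes(table_bytes: int,
                           partitions_read: int, partitions_total: int,
                           columns_read: int, columns_total: int) -> int:
    """Rough bytes scanned, assuming data is spread evenly across
    partitions and columns (a simplification; real tables are skewed)."""
    return (table_bytes * partitions_read * columns_read
            // (partitions_total * columns_total))


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of one query, applying the assumed per-query minimum."""
    billed = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billed / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    table = BYTES_PER_TB  # hypothetical 1 TB table, 365 daily partitions, 50 columns
    full = estimate_query_cost(table)
    pruned = estimate_query_cost(estimate_scanned_bytes(table, 7, 365, 5, 50))
    print(f"full scan: ${full:.2f}, one week and 5 columns: ${pruned:.4f}")
```

The sketch shows why partitioning and column selection appear in the tuning list: both reduce the bytes scanned, which is the quantity being billed under this assumption.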
Spark
1. Several Spark articles by Sivaprasad Mandapati that can be used as candidate questions.
2. Join strategies #1, Join strategies #2 - how? Pros and cons. (broadcast hash, shuffle hash, shuffle sort merge, cartesian).
3. What’s the difference between a DataFrame and a Dataset?
4. Sort merge vs broadcast
  1. A broadcast join can be around 4 times faster when one of the tables is small enough to fit in memory, because the large table does not have to be shuffled.
  2. Is broadcasting always a good solution? Absolutely not. If you are joining two data sets that are both very large, broadcasting either table can exhaust memory on the cluster and fail your job.
5. Shuffle & AQE
  1. Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan.
  2. Dynamically coalescing shuffle partitions
  3. Dynamically switching join strategies
  4. Dynamically optimizing skew joins
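The trade-off in questions 4 and 5 can be sketched without a cluster. The following is a minimal pure-Python illustration, not Spark code: `broadcast_hash_join`, `sort_merge_join`, and `choose_join` are hypothetical names, and the row-count threshold is a stand-in for the size-based threshold Spark applies when it picks or switches a join strategy.

```python
from collections import defaultdict


def broadcast_hash_join(large, small, key):
    """Hash the small side once, then stream the large side past it.
    Only works well when `small` fits in memory on every worker."""
    index = defaultdict(list)
    for row in small:
        index[row[key]].append(row)
    return [{**l, **s} for l in large for s in index.get(l[key], [])]


def sort_merge_join(left, right, key):
    """Sort both sides by key, then merge with two cursors.
    Neither side needs to fit in a hash table."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row carrying this key with that run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


def choose_join(left, right, key, broadcast_threshold_rows=1000):
    """Pick a strategy from observed sizes (hypothetical row threshold)."""
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    if len(small) <= broadcast_threshold_rows:
        return "broadcast_hash", broadcast_hash_join(large, small, key)
    return "sort_merge", sort_merge_join(left, right, key)
```

Choosing the strategy from sizes observed at run time, rather than from estimates made before execution, is the idea behind "dynamically switching join strategies" in the AQE list.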
BigQuery
1. What is the difference in the implementation between partitions and clustering in BQ?
2. What ways do you know to reduce query cost in BigQuery?
3. What is the BigQuery cost composed of? How can you reduce storage cost?
4. Did you ever encounter a memory error when running BigQuery? Why does it happen, and how is it related to the Dremel implementation?
5. How can you control the access to sensitive data in BigQuery?
6. What options do you have to limit the cost of BigQuery?
7. When using BigQuery ML to train TF models - what happens in the background?
Airflow
1. What is Airflow?
2. How do you transfer information between tasks in Airflow?
3. Please give me a real-world example of using Spark and Airflow together.
Data Validation
1. How can you protect yourself from bad data? Data validation, TDDA, monitoring.
2. Tools:
  1. Type validation: typeguard
  2. Data validation: pydantic
  3. Test driven: tdda
  4. Data quality: great expectations
  5. SaaS: Superconductive, by GE (Great Expectations)
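As a talking point for question 1, here is a minimal sketch of row-level validation at ingestion using only the standard library. It is a stand-in for the idea, not the API of pydantic, tdda, or great expectations; the `Order` schema and the `validate_batch` helper are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    """Hypothetical schema for one ingested row."""
    order_id: int
    amount: float
    day: date

    def __post_init__(self):
        # Value checks that type coercion alone would not catch.
        if self.order_id <= 0:
            raise ValueError("order_id must be positive")
        if not math.isfinite(self.amount) or self.amount < 0:
            raise ValueError("amount must be a finite, non-negative number")


def validate_batch(raw_rows):
    """Split raw rows into parsed records and rejects with a reason,
    so bad data is quarantined instead of failing the whole load."""
    good, rejected = [], []
    for row in raw_rows:
        try:
            good.append(Order(order_id=int(row["order_id"]),
                              amount=float(row["amount"]),
                              day=date.fromisoformat(row["day"])))
        except (KeyError, TypeError, ValueError) as exc:
            rejected.append((row, f"{type(exc).__name__}: {exc}"))
    return good, rejected
```

Counting the rejects per batch gives a simple signal that can feed the monitoring mentioned in the question.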
File formats
1. Can you explain the Parquet file format? https://parquet.apache.org/documentation/latest/
2. How is this leveraged by Spark? https://databricks.com/session/spark-parquet-in-depth
3. What are the shortcomings of Parquet, and how are they addressed by table formats like Hudi, Delta Lake, and Iceberg? https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
Julien Simon on AWS Glue DataBrew vs. Data Wrangler
What is CDC and why do you need it, or how do you use it? - Change data capture (CDC) is the process of recognising when data has changed in a source system so that a downstream process or system can act on the change. A common use case is replication: reflecting the change in a different target system so that the data in the two systems stays in sync.
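The replication use case can be illustrated with a small sketch. It assumes the simplest form of capture, diffing two snapshots keyed by primary key; production CDC tools often read the database's change log instead, and the function names and event shape here are hypothetical.

```python
def capture_changes(before, after):
    """Diff two snapshots ({primary_key: row}) into change events."""
    events = []
    for pk, row in after.items():
        if pk not in before:
            events.append(("insert", pk, row))
        elif before[pk] != row:
            events.append(("update", pk, row))
    for pk in before:
        if pk not in after:
            events.append(("delete", pk, None))
    return events


def apply_changes(target, events):
    """Replay change events on a target store to keep it in sync."""
    for op, pk, row in events:
        if op == "delete":
            target.pop(pk, None)
        else:
            target[pk] = dict(row)  # copy so the target owns its rows
    return target


if __name__ == "__main__":
    before = {1: {"name": "ann"}, 2: {"name": "bob"}}
    after = {1: {"name": "anna"}, 3: {"name": "cy"}}
    replica = {pk: dict(row) for pk, row in before.items()}
    apply_changes(replica, capture_changes(before, after))
    print(replica == after)
```

Only the changed rows travel downstream, which is the main reason to prefer CDC over reloading the full table.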
Outage handling and the differences between stream-based processing vs concurrent isolated worker-based processing using

Q: You have a real-time stream. In the context of recovering from an outage, which is better: a stream-based processing system, or a worker-based system that can be triggered on different time ranges?
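To make the worker-based option concrete, here is a minimal sketch of an outage backfill: the missed interval is split into time ranges, and each range is handled by an isolated worker. It does not settle which approach is better; the event shape (`id`, `ts`, `value`), the function names, and the in-memory sink are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def backfill_windows(start, end, window):
    """Split [start, end) into contiguous half-open ranges of at most `window`."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    windows, cursor = [], start
    while cursor < end:
        nxt = min(cursor + window, end)
        windows.append((cursor, nxt))
        cursor = nxt
    return windows


def process_window(events, window_start, window_end):
    """Isolated worker: touches only events inside its own time range."""
    return {e["id"]: e["value"] for e in events
            if window_start <= e["ts"] < window_end}


def recover(events, start, end, window, max_workers=4):
    """Run one worker per time range and merge results into a keyed sink."""
    sink = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda w: process_window(events, *w),
                           backfill_windows(start, end, window))
        for result in results:
            sink.update(result)  # keyed writes make a re-run idempotent
    return sink
```

Because each range is independent and writes are keyed by event id, a failed range can be retried on its own without reprocessing the rest of the outage.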