The Ops Compendium

Data Engineering Questions & Training


Last updated 2 years ago


Data Engineering Questions / Training

  1. General

    1. What are the considerations when choosing a method for ingesting data into BigQuery?

    2. Have you worked with data science teams? What were your responsibilities?

    3. What are the considerations when choosing between Spark and BigQuery?

    4. What are the differences between ETL and ELT? See the articles by Guru99/David Taylor and by Mark Smallcombe.

  1. CAP Theorem

    1. How does it affect real-world applications? (In the real world, latency is a form of availability.)

  2. Explain the differences between NoSQL databases {MongoDB | DynamoDB | ...} and relational databases {Postgres | MySQL}, and the reasons to choose one over the other. Give an example of a project where you had to make this choice, and walk through your reasoning.

(This question can be adapted to the relevant technologies.)

  1. Streaming vs. Batch: “Explain the difference and the reasons to choose streaming over batch and vice versa. Give an example of a project where you had to make this choice, and walk through your reasoning.”

  2. Job vs. Service: “Explain the difference and the reasons to choose a job over a service and vice versa. Give an example of a project where you had to make this choice in the context of ML pipelines, and walk through your reasoning.”

  3. Athena

    1. What is the engine behind Athena?

    2. How is Presto different from Spark? How does it affect your query planning?

    3. What is the cost composed of?

    4. How can you calculate cost?

    5. How can you optimize your queries (partitions, join order, limit tricks, etc.)?

    6. What options do you have to limit the cost of Athena?

    7. When would you use Athena vs. Spark?

  4. Spark

    1. What’s the difference between a DataFrame and a Dataset?

      1. A broadcast join can be about 4 times faster when one of the tables is small enough to fit in memory.

      2. Is broadcasting always a good solution? Absolutely not. If you are joining two very large datasets, broadcasting either table can kill your Spark cluster and fail your job.

      1. Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that uses runtime statistics to choose the most efficient query execution plan.

      2. Dynamically coalescing shuffle partitions

      3. Dynamically switching join strategies

      4. Dynamically optimizing skew joins

  5. BigQuery

    1. What is the difference in the implementation between partitions and clustering in BQ?

    2. What ways do you know to reduce query cost in BigQuery?

    3. What is the BigQuery cost composed of? How can you reduce storage cost?

    4. Did you ever encounter a memory error when running BigQuery? Why does it happen, and how is it related to the Dremel implementation?

    5. How can you control the access to sensitive data in BigQuery?

    6. What options do you have to limit the cost of BigQuery?

    7. When using BigQuery ML to train TF models - what happens in the background?

  6. Airflow

    1. What is Airflow?

    2. How do you transfer information between tasks in Airflow?

    3. Please give me a real-world example of using Spark and Airflow together.

  7. Data Validation

    1. How can you protect yourself from bad data? Data validation, TDDA, monitoring.

    2. Tools:

  8. File formats

  9. Outage handling and the differences between stream-based processing vs concurrent isolated worker-based processing using

By Ilai Malka (Nielsen)

  • How would you design and implement an API rate limiter?
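One common way to approach the rate limiter question is a token bucket per client. The sketch below is only one possible answer, not a reference design: the class names, the `rate` and `capacity` parameters, and the injectable `clock` are all illustrative assumptions, and the code is single-process and not thread-safe.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full so short bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client id (hypothetical in-memory store)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

A good follow-up in an interview is how this changes when the limiter runs on several API servers: the in-memory dictionary would have to move to a shared store, and the refill-and-take step would have to be atomic.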

References:

- Top 10: partitioning, bucketing, compression, optimize file sizes, optimize columnar data store generation, query tuning, optimize order by, optimize group by, use approx functions, column selection. What are the tradeoffs (time vs cost)?
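To make the time vs. cost tradeoff concrete, here is a rough cost model for a scan-priced engine such as Athena. It is a sketch under stated assumptions: the price per TB, the per-query minimum, and the binary definition of a terabyte are placeholders to check against current AWS pricing, and `scanned_bytes` assumes evenly sized partitions and columns, which real tables rarely have.

```python
TB = 1024 ** 4
PRICE_PER_TB_USD = 5.0                # assumption: verify for your region
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumption: minimum billable scan


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of one query when billing is based on bytes scanned."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / TB * price_per_tb


def scanned_bytes(table_bytes: int,
                  partitions_total: int, partitions_read: int,
                  columns_total: int, columns_read: int) -> int:
    """Very rough estimate of bytes scanned after pruning.

    Partition pruning skips whole partitions; column pruning only helps
    on columnar formats such as Parquet or ORC.
    """
    partition_fraction = partitions_read / partitions_total
    column_fraction = columns_read / columns_total
    return int(table_bytes * partition_fraction * column_fraction)


if __name__ == "__main__":
    full_scan = estimate_query_cost(TB)
    pruned = estimate_query_cost(scanned_bytes(TB, 365, 1, 20, 2))
    print(f"full scan: ${full_scan:.4f}, one day and two columns: ${pruned:.6f}")
```

The same function can be used to discuss cost limits: a per-query byte cap is just an upper bound on `bytes_scanned`.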

By Sivaprasad Mandapati.

Join strategies #1, #2: how do they work? Pros and cons (broadcast hash, shuffle hash, shuffle sort merge, cartesian).
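Two of the strategies named above can be sketched in plain Python to show where their costs come from. This is a single-machine illustration of the ideas, not Spark's implementation; the function names and the `(key, payload)` row shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, str]           # (join_key, payload)
Joined = Tuple[str, str, str]   # (join_key, left_payload, right_payload)


def broadcast_hash_join(large: Iterable[Row],
                        small: Iterable[Row]) -> Iterator[Joined]:
    """Build a hash table from the small side and stream the large side.

    In a cluster the hash table is what gets copied to every executor,
    so it must fit in memory. The large side is never shuffled or sorted.
    """
    lookup: Dict[str, List[str]] = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    for key, value in large:
        for match in lookup.get(key, []):
            yield key, value, match


def sort_merge_join(left: Iterable[Row],
                    right: Iterable[Row]) -> Iterator[Joined]:
    """Sort both sides by key, then walk them with two cursors.

    Neither side has to fit in a hash table, but both must be sorted,
    which in a cluster implies a shuffle of both inputs.
    """
    l, r = sorted(left), sorted(right)
    i = j = 0
    while i < len(l) and j < len(r):
        if l[i][0] < r[j][0]:
            i += 1
        elif l[i][0] > r[j][0]:
            j += 1
        else:
            key = l[i][0]
            j_end = j
            while j_end < len(r) and r[j_end][0] == key:
                j_end += 1          # find the run of equal keys on the right
            while i < len(l) and l[i][0] == key:
                for k in range(j, j_end):
                    yield key, l[i][1], r[k][1]
                i += 1
            j = j_end
```

Both functions compute the same inner join; the interview discussion is about which side pays for memory (broadcast) and which pays for sorting and data movement (sort merge).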

Type validation: typeguard

Data validation: pydantic

Test driven: tdda

Data quality: great expectations

SaaS: SuperConductive by GE
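As a small illustration of protecting a pipeline from bad data, the sketch below validates rows with only the standard library and routes failures to a reject list. Libraries such as pydantic or typeguard automate the type checks; the `Event` schema, its field rules, and `validate_batch` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    user_id: int
    country: str
    amount: float

    def __post_init__(self) -> None:
        # Type validation: every field must match its annotated type.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )
        # Value validation: simple business rules (assumed for the example).
        if len(self.country) != 2:
            raise ValueError("country must be a 2-letter code")
        if self.amount < 0:
            raise ValueError("amount must be non-negative")


def validate_batch(
    rows: List[Dict[str, Any]]
) -> Tuple[List[Event], List[Tuple[Dict[str, Any], str]]]:
    """Split a batch into valid events and (row, reason) rejects."""
    good: List[Event] = []
    rejected: List[Tuple[Dict[str, Any], str]] = []
    for row in rows:
        try:
            good.append(Event(**row))   # missing or extra keys raise TypeError
        except (TypeError, ValueError) as exc:
            rejected.append((row, str(exc)))
    return good, rejected
```

Keeping the rejected rows together with a reason, rather than dropping them, is what makes monitoring possible: the reject rate per batch is a natural metric to alert on.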

Can you explain the Parquet file format?

How is this leveraged by Spark?

What are the shortcomings of Parquet, and how are they solved by table formats like Hudi, Delta, and Iceberg?

Julien simon on

- Change data capture (CDC) is the process of recognising when data has changed in a source system so that a downstream process or system can act on the change. A common use case is replication: reflecting the change in a different target system so that the data in the two systems stays in sync.
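The replication use case can be sketched with a snapshot comparison. This is a simplified stand-in: production CDC tools commonly read the database's transaction log instead of diffing snapshots, and the `Snapshot` shape and function names here are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Snapshot = Dict[int, Row]                  # primary key -> row
Change = Tuple[str, int, Optional[Row]]    # (operation, primary key, new row)


def capture_changes(before: Snapshot, after: Snapshot) -> List[Change]:
    """Derive insert, update and delete events from two snapshots."""
    changes: List[Change] = []
    for pk, row in after.items():
        if pk not in before:
            changes.append(("insert", pk, row))
        elif before[pk] != row:
            changes.append(("update", pk, row))
    for pk in before:
        if pk not in after:
            changes.append(("delete", pk, None))
    return changes


def apply_changes(target: Snapshot, changes: List[Change]) -> None:
    """Replay captured changes on a replica so it matches the source."""
    for op, pk, row in changes:
        if op == "delete":
            target.pop(pk, None)
        else:
            target[pk] = dict(row or {})   # upsert covers insert and update
```

Applying the changes as upserts and idempotent deletes means the same change list can be replayed after a failure without corrupting the target.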

Q: You have a real-time stream. In the context of recovery from an outage, which is better: a stream-based processing system, or a worker-based system that can be triggered on different time ranges?
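One argument for the worker-based option is that recovery becomes a backfill over the missed time ranges. The sketch below illustrates that idea only; `missed_windows`, `backfill`, and the `process` callback are hypothetical names, and it assumes the source data for each window can still be read after the outage.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterator, Tuple

Window = Tuple[datetime, datetime]


def missed_windows(last_processed_end: datetime, now: datetime,
                   window: timedelta) -> Iterator[Window]:
    """Yield every complete window between the last checkpoint and now."""
    start = last_processed_end
    while start + window <= now:
        yield start, start + window
        start += window


def backfill(last_processed_end: datetime, now: datetime, window: timedelta,
             process: Callable[[datetime, datetime], int],
             results: Dict[datetime, int]) -> datetime:
    """Run the worker once per missed window and return the new checkpoint.

    Results are keyed by window start and overwritten, so re-running a
    window after a partial failure does not double count.
    """
    checkpoint = last_processed_end
    for start, end in missed_windows(last_processed_end, now, window):
        results[start] = process(start, end)
        checkpoint = end
    return checkpoint
```

A stream processor recovers differently, by resuming from its last committed offset, so the comparison usually comes down to how far back the source retains data and whether processing is idempotent.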


What Is CAP Theorem?
Can you 'get around' or 'beat' the CAP Theorem?
Name some types of Consistency patterns
What Do You Mean By High Availability (HA)?
What are A and P in CAP and the difference between them?
What does the CAP Theorem actually say?
Another great resource for CAP questions
Performance tuning
Several spark articles that can be used as candidate questions
Join strategies #
Join strategies
Sort merge vs broadcast
Shuffle & AQE
typeguard
pydantic
tdda
great expectations
SuperConductive by GE
https://parquet.apache.org/documentation/latest/
https://databricks.com/session/spark-parquet-in-depth
https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
AWS Glue DataBrew vs. Data Wrangler
What is a CDC and why do you need it, or how do you use it?
How to design a
The Twitter question
More than 2000 questions for data engineers
More data engineering questions
Even more qs
DS leads
System design interview q’s with solutions
CAP theorem
2
3
ACID
CAP
PACELC
Why do we need Data engineering?
What scales of data have you worked with in the past?
How do you generally work with the departments that make use of your data?
Tell me about a time you had performance issues with an ETL. How did you identify this as a performance issue and how did you fix it?
Describe a time when you found a new use case for an existing database.
Describe the most challenging project you’ve worked on. What was your role?
Think back to a project you’re proud of. What was it that gave you that sense of pride and accomplishment?
What do you consider to be one of the biggest mistakes you’ve ever made in your previous job?
Explain the differences between stream processing and data processing, with one caveat: pretend that I’m not familiar with data at all.