11 Data Engineer Interview Questions and Answers

how to answer Data Engineer Interview Questions
Summary:

Interviewing as a data engineering candidate? Here are essential data engineer interview questions to know how to answer well before you go!

When looking to hire data engineer, you want to test your candidate about building and maintaining a data pipeline.

By asking the interview questions below, you can gain insights to how much knowledge and experience your candidates have with data engineering.

(And if you’re preparing for a data engineer interview, feel free to use these questions and answers as a guide.)

Looking to hire the best remote developers? Explore HireAI to see how you can:

⚡️ Get instant candidate matches without searching
⚡️ Identify top applicants from our network of 300,000+ devs with no manual screening
⚡️ Hire 4x faster with vetted candidates (qualified and interview-ready)

Try HireAI and hire top developers now →

What is the Difference Between a Data Scientist and a Data Engineer?

The main responsibility of a data scientist is to analyze data and produce suggestions for actions to take to improve a business metric, and then monitor the results of implementing those actions.

In contrast, a data engineer is responsible for implementing the data pipeline to gather and transform data for data scientists to analyze. While a data engineer needs to understand the business value of the data being collected and analyzed, their daily tasks will be more oriented around implementing the gathering, filtering, and transformation of data.

Interview Questions for Data Engineers

1. Using the following SQL table definitions and data, how would you construct a query that shows…

A data engineer needs to be able to construct and execute queries in order to understand the existing data, and to verify data transformations that are part of the data pipeline.

You can ask a few questions covering SQL to ensure the data engineer candidate can handle the query language well.

Here are some examples:

  • With a product table defined with a name, SKU, and price, how would you construct a query that shows the lowest priced item?
  • With an order table defined with a date, a product SKU, price, quantity, tax rate, and shipping rate, how would you construct a query that shows the average order cost?

You can use the SQL below to setup the examples above:

CREATE TABLE products (
  sku INT NOT NULL,
  name VARCHAR(50) NOT NULL,
  price DECIMAL(10, 2) NOT NULL,
  PRIMARY KEY(sku)
);

CREATE TABLE orders (
  product_sku INT NOT NULL,
  price DECIMAL(10, 2) NOT NULL,
  quantity INT NOT NULL,
  tax_rate DECIMAL(3, 2) NOT NULL,
  shipping_rate DECIMAL(3, 2) NOT NULL,
  FOREIGN KEY(product_sku) REFERENCES products(sku)
);

INSERT INTO products VALUES (1, 'shirt', 25.99);
INSERT INTO products VALUES (2, 'sweater', 34.99);
INSERT INTO orders VALUES (1, 25.99, 1, 15.0, 3.0);
INSERT INTO orders VALUES (2, 34.99, 3, 13.0, 2.0);

2. Which Python libraries would you use for efficient data processing?

Python is one of the most popular languages used for data engineering. This question lets you know if the candidate knows the basics of Python. The answer should include NumPy and pandas.

NumPy is used for efficient processing of arrays of numbers, and pandas is great for stats, which are the bread and butter of data science work. Pandas is also good for preparing data for machine learning work. You can ask a candidate why they would use those libraries, and also ask for examples of situations where they would not use them.

From here, you can ask specific Python coding questions, such as:

  • How would you transpose this data?
  • How would you filter out outliers?

Here are two examples based on the Pandas library documentation:

# Here is the data
df = pd.DataFrame(data={'col1': [-10, 1, 2, 8], 'col2': [-5, 0.2, 0.4, 7]})
# How to transpose it?
df_transposed = df.T
# How to filter out the outliers?
df_no_outliers = df.ge(-3).le(3)

3. In this example web app, what data points would you collect?

The example web app could be a calendar similar to Outlook or Google calendar. In that case, collecting data on which calendar views are being used would be worthwhile.

This question asks the data engineer candidate to analyze and understand the domain they’re working in. They need to understand the domain they’re working with because collecting data from a calendar web app can differ vastly from collecting data from IoT (Internet of Things) devices. If you’re dealing with complex IoT data, you might benefit from Internet of Things engineering services to ensure the efficient and accurate collection of data from various devices.

While a data engineer does not need to implement the code that records data points in the web app, they should be able to understand the needs of developers who do need to implement that code. You can guide the candidate to a more specific answer by asking additional questions such as: what data would we need to collect to find out how users are using certain features?

Ask this question if you want to know if a data engineer candidate understands that problem domain.

4. How would you deal with duplicate data points in an SQL query?

This is a good question to ask a candidate because it should get them to ask you questions in return. For instance, they should ask you what kind of data you are working with, and what columns or values would likely be duplicated.

They should also suggest using the SQL keywords, UNIQUE and DISTINCT, for reducing duplicate data points. After that they should also suggest other ways to deal with duplicate data points, such as grouping the data using GROUP BY and filtering it further.

If you want to know if a candidate has a good grasp of SQL, ask this question.

5. What is a memorable data pipeline performance issue that you solved?

This question will give you insight into the candidate’s past experiences with data pipeline implementation and how they were able to improve performance. Performance issues in a data pipeline can not only slow down the gathering of data, but can disrupt and slow down data analysis. This can have a direct impact on business decisions.

Here are some examples of experiences candidates could discuss:

  • how they improved the performance of a specific SQL query
  • how they upgraded a database from one type to another
  • how they reduced the time it took to run a set of queries
  • how they improved the performance of importing or exporting of data (as an example, importing CSV files or exporting JSON or XML or CSV)
  • how they improved retrieval of data from a backup system (for instance, Amazon Glacier or moving data from S3 storage into a faster data storage system)

If you want to know if the candidate has ideas on improving your data pipeline’s performance, also ask this as a question!

You can also ask a candidate how they have solved issues with malformed data and incorrect taxonomies.


Check out our entire set of software development interview questions to help you hire the best developers you possibly can.

If you’re a developer, familiarize yourself with the non-technical interview questions commonly asked in the first round by HR recruiters and the questions to ask your interviewer!

Arc is the radically different remote job search platform for developers where companies apply to you. We’ll feature you to great global startups and tech companies hiring remotely so you can land a great remote job in 14 days. We make it easier than ever for software developers and engineers to find great remote jobs. Sign up today and get started.


6. Given an expected increase in data volume, how would you plan to add more capacity to the data processing architecture?

When asking this question, you can come up with a simple scenario. For example, you could be collecting data from IoT devices and are planning a rollout of thousands more devices (which will send back sensor data to the data pipeline). The data would be processed in two ways and stored in three ways: a data warehouse for analysis, a database, and caching layer for the interaction between a control panel web app and a backup system. How would they add more capacity in that situation?

This question asks the candidate to formulate a plan for the data pipeline to handle more data. They should be able to tell you what would be needed for this, such as needing more database instances in the cloud on Amazon Web Services, Microsoft Azure, or Google Cloud Platform. Or they could suggest better data compression, or removing old sets of data, or redirecting subsets of data to other parts of the architecture.

The candidate should be able to point to the various components and give you ideas about preparing those pieces for an increase in data volume.

For startups this can be particularly important. As startups start rapidly gaining more customers, they need to handle a dramatic increase in data collection volume and need to be able to view business reports and analysis quickly to reorient their direction.

7. How would you validate a data migration from one database to another?

The data engineer candidate should be concerned with the validity of data and ensuring that no data is dropped. They should be able to explain how validation of data would happen. In some cases, a comparison between hashes or timestamps can be used; in other cases, a more thorough comparison of data is needed to be able to validate.

The candidate should be able to give you an idea of which type of validation is appropriate in different scenarios. For example, the validation could occur continuously as the data flows into both databases, or the validation could occur once after a complete data migration happens. There could also be other approaches that the candidate suggests. The validation could also be a simple comparison, or more involved (in the case of more complex data structures).

8. How would you prepare for the migration of a dataset that’s 1GB from a NoSQL database to an SQL-based database?

This is an interesting question to ask a candidate. There are a few startups that have started out with MongoDB or Couch (or some other NoSQL database) and found that it didn’t suit their needs as they grew. The NoSQL database may contain a lot of duplicate data and may not have validation of all data fields, and could be schema-less. This makes migration more difficult. A good candidate will ask for more information about this and inquire about the details around the NoSQL and SQL databases, and will also ask about performance requirements.

The data engineer candidate should be able to tell you what steps are needed for migrating from NoSQL to SQL. For instance, they should recommend ways to understand the existing data schema. The candidate should give ideas on designing the new database schema to accommodate that data.

If you want to know whether a candidate understands how to prepare database backups and how to design schemas, ask this question.

You’re Ready to Interview Data Engineers

In conclusion, the eight questions above are a great starting point for interviewing data engineers.

Each question should deliver insights about a data engineer job candidate’s knowledge and experience with the various skills and technology required to produce efficient and performant data pipelines.

Written by
Arc Team