Introduction

In the past 2 years, the data ecosystem has been evolving rapidly. New tools have been emerging every month in the modern data stack. In a hype cycle, it becomes hard to distinguish the signal from the noise. Which of these tools will end up as simple features, and which will become actual products that we will still be using in a few years?

In addition to the growing number of tools, we've seen a few new trends, such as declarative approaches appearing everywhere (from Kubernetes, where we have infrastructure as code, to orchestration as code and even integration as code).

Other trends include the rise of the Semantic Layer, Rust becoming the future of performance-intensive applications in data (potentially replacing Spark eventually), and even data modeling coming back with the explosion of the modern data stack. All this is without mentioning AI, vectorized engines used for small data such as DuckDB, and the newer vector databases supporting the AI wave behind the scenes, such as Pinecone and Qdrant.

So much is going on that we had to take a step back and look at the evolution of the data engineer.

To make sense of it all, we need all the data we can get. Fortunately, this State of Data 2023 survey is the largest data engineering survey conducted to date. It will help us take a step back and understand what the community is using and excited about, and what is signal versus noise in the modern data stack.

This report first details the demographics of the survey participants. Then it goes through the data stack, as well as the blogs, podcasts, and newsletters that we follow most.

The best insights are usually discovered when using the filters at your disposal, per company size and per experience, so you can drill down on the information that matters most to you.

Now, let's see what we discovered with the survey.

Demographics

This is the largest data engineering survey made to date.

886 respondents

Geography

Respondents were evenly split across geography

Where do you currently reside?

Source: Airbyte

Experience

Respondents were evenly split across experience

How many years of experience in your current field do you have?

Source: Airbyte

Company size

Respondents were evenly split across team and company size

If you're currently employed, how large is your company?

Source: Airbyte

Team size

Respondents were evenly split across team and company size

If you're currently employed, how large is the data team at your company?

Source: Airbyte

The job titles of people touching data are extremely diverse

Among job titles, Data Engineer obviously was #1 at 38%, but surprisingly 20% of respondents were in Management and 11% identified as Software Engineer. Analytics Engineer, Data Analyst, and Data Scientist all came in at 5% each.
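To put these shares in perspective, the short sketch below converts the percentages quoted above into approximate respondent counts. It assumes that all 886 respondents answered the role question and that the published percentages are rounded, so the counts are estimates only. The names `role_share_percent`, `approximate_counts`, and `unlisted_share` are illustrative and not part of the survey.

```python
# Approximate head counts behind the job-title percentages.
# Assumption: every one of the 886 respondents answered this question.
TOTAL_RESPONDENTS = 886

# Percentages as quoted in the text (already rounded by the survey).
role_share_percent = {
    "Data Engineer": 38,
    "Management": 20,
    "Software Engineer": 11,
    "Analytics Engineer": 5,
    "Data Analyst": 5,
    "Data Scientist": 5,
}


def approximate_counts(shares, total):
    """Convert percentage shares into rounded respondent counts."""
    return {role: round(total * pct / 100) for role, pct in shares.items()}


def unlisted_share(shares):
    """Percentage of respondents whose role is not named in the text."""
    return 100 - sum(shares.values())


counts = approximate_counts(role_share_percent, TOTAL_RESPONDENTS)
other = unlisted_share(role_share_percent)

for role, count in counts.items():
    print(f"{role}: ~{count} respondents")
print(f"Other roles: {other}% of respondents")
```

Under these assumptions, the six named roles cover 84% of respondents, which leaves roughly one in six respondents holding some other title.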

Which option best describes your current role?

Source: Airbyte

Hiring environment

Half of the respondents are not hiring, but 17% are aggressively hiring

Is your data team currently hiring?

Source: Airbyte

Data Tooling Insights

Data ingestion

Insight 1: Airbyte and Fivetran are the clear leaders for the Data Ingestion layer

Brand recognition and adoption - Data Ingestion

Source: Airbyte

Extra poll - people care most about Correctness, Stability, and Performance for data integration

Source: Airbyte

Extra poll - more than 30% of teams maintain more than 10 connectors

Source: Airbyte

Data transformation

Insight 2: dbt has the most positive sentiment for Data Transformation, but pandas is actually the most used

Brand recognition and adoption - Data Transformation

Source: Airbyte

There are a few things that I find surprising and exciting about the State of the Data survey. Firstly, I’m surprised, for example, that pandas is still leading the pack for data transformations. This is also exciting because it points to a need for continued education and development around new tooling like Polars, which has a lot to offer. I find it surprising that Databricks isn’t used more, but this also has a bright side: there is a lot of room for growth toward those tools. That is true from an education perspective as a content creator, and it also points to exciting times still ahead for the Data Engineering community as more people and teams adopt new technologies.

Daniel Beach

Senior Data Engineer, Rippleshot

Data warehouses

Insight 3: Snowflake and BigQuery are clearly at the top for Data Warehouses; Azure Synapse is lagging far behind

Brand recognition and adoption - Data Warehouses

Source: Airbyte

Data orchestration

Insight 4: For Data Orchestration, most people are still using self-hosted Airflow, but Dagster is coming up the ranks

Brand recognition and adoption - Data Orchestration

Source: Airbyte

It's unsurprising to find data quality as the number one concern of data engineers. Yet no one seems to own it across companies, which will become an increasingly important issue. Also, I found it interesting to see that most people are widely using self-hosted Airflow, while we heard that Airflow was very difficult to set up. I believe the choice between self-hosted and managed Airflow in the future will depend more on what those managed solutions bring to quickly onboard teams, give a better dev experience, and help solve quality/observability issues.

Marc Lamberti

Head of Customer Education, Astronomer

Business intelligence

Insight 5: For Business Intelligence, the giants Looker and Tableau are still ruling the roost, but there is also significant churn from Tableau to the newer set of solutions

Brand recognition and adoption - Business Intelligence

Source: Airbyte

Data quality

Insight 6: For Data Quality, Great Expectations and Monte Carlo are leading the pack, but more people have not yet tried or explored the tools than have

Brand recognition and adoption - Data Quality

Source: Airbyte

I am particularly happy to see the growth of Data Quality tools that have evolved for good. This signals that maturity is coming along. It's not a shocker to me that Airbyte is still leading the way for the Data Ingestion layer.

Ravit Jain

founder & host, The Ravit Show

Reverse ETL

Insight 7: For Reverse ETL, Hightouch and Census are neck and neck, but the vast majority of the market is still up for grabs

Brand recognition and adoption - Reverse ETL

Source: Airbyte

Data catalogs

Insight 8: For Data Catalogs, DataHub, Atlan and Amundsen are leading for now, but the vast majority of the market is also up for grabs

Brand recognition and adoption - Data Catalog

Source: Airbyte

Amazing Data Engineering survey! I highly recommend checking out the insights into the adoption of engineering tools, from data ingestion and transformation to reverse ETL and data catalogs. That section was my highlight. Congratulations to Airbyte for leading the Data Ingestion section.

Andreas Kretz

founder, Learn Data Engineering

Data Community Survey

Top newsletters

Newsletters

Source: Airbyte

What is most exciting for me in the data engineering space is the continual growth in the number of people sharing and helping grow the community. Whether it be people like Xinran and Data Engineering Things or Daniel Beach and Data Engineering Central, there are so many people putting out great content in the DE space.

Benjamin Rogojan

host & editor, Seattle Data Guy

Podcasts

Top Podcasts

Source: Airbyte

YouTube channels

Top YouTube Channels

Source: Airbyte

The data engineering community stands out for its open-mindedness and collaborative spirit. Every day, I'm impressed by how we've created a culture of learning and sharing that transcends organizational boundaries and geographical constraints.

Ananth Packkildurai

Editor, Data Engineering Weekly

Communities

Top Communities

Source: Airbyte