Introduction
In the past 2 years, the data ecosystem has been evolving rapidly. New tools have been emerging every month in the modern data stack. In a hype cycle, it becomes hard to distinguish the signal from the noise. Which of those tools will eventually become simple features, and which will become actual products that we use in a few years?
In addition to the growing number of tools, we've seen a few new trends, such as declarative approaches appearing everywhere (from Kubernetes, where we have infrastructure as code, to orchestration as code and even integration as code).
Other trends include the rise of the Semantic Layer, Rust becoming the future of performance-intensive applications in data (potentially replacing Spark eventually), or even data modeling coming back with the exposing of the modern data stack. All this is without mentioning AI and vector-based engines being used for small data, such as DuckDB, along with newer ones supporting the AI wave behind the curtains, such as Pinecone, Qdrant, etc.
So much is going on that we had to take a step back on the evolution of the data engineer.
To make sense of it all, we need all the data we can get. Fortunately, this State of Data 2023 survey is the largest data engineering survey made to date. It will help us take a step back and understand what the community is using and feeling excited about, and what is noise versus signal in the modern data stack.
The research will first give details on the demographics of the survey participants. Then, it will go through the data stack, as well as the blogs, podcasts, and newsletters that we follow most.
The best insights are usually discovered when using the filters at your disposal, per company size and per experience, so you can drill down on the information that matters most to you.
Now, let's see what we discovered with the survey.
Demographics
This is the largest data engineering survey made to date.
886 respondents
Geography
Respondents were evenly split across geography
Where do you currently reside?
Source: Airbyte
Experience
Respondents were evenly split across experience
How many years of experience in your current field do you have?
Source: Airbyte
Company size
Respondents were evenly split across team and company size
If you're currently employed, how large is your company?
Source: Airbyte
Team size
Respondents were evenly split across team and company size
If you're currently employed, how large is the data team at your company?
Source: Airbyte
The job titles of people touching data are extremely diverse
Among job titles, Data Engineer obviously was #1 at 38%, but surprisingly 20% of respondents were in Management and 11% identified as Software Engineer. Analytics Engineer, Data Analyst, and Data Scientist all came in at 5% each.
Which option best describes your current role?
Source: Airbyte
Hiring environment
Half of the respondents are not hiring, but 17% are aggressively hiring
Is your data team currently hiring?
Source: Airbyte
Compensation trends
Experience
More experience correlates with more pay.
Cash compensation ($K USD) vs Years of experience
Source: Airbyte
Company size
Larger companies correlate with more pay, and North America pays far more than anywhere else
Cash compensation ($K USD) vs Company Size
Source: Airbyte
Cash compensation ($K USD) vs Location
Source: Airbyte
Data Tooling Insights
Data ingestion
Insight 1: Airbyte and Fivetran are the clear leaders for the Data Ingestion layer
Brand recognition and adoption - Data Ingestion
Source: Airbyte
Extra poll - people care most about Correctness, Stability, and Performance for data integration
Source: Airbyte
Extra poll - more than 30% of teams maintain more than 10 connectors
Source: Airbyte
Data transformation
Insight 2: dbt has the most positive sentiment for Data Transformation, but pandas is actually the most used
Brand recognition and adoption - Data Transformation
Source: Airbyte
There are a few things that I find surprising and exciting about the State of the Data survey. Firstly, I’m surprised, for example, that Pandas is still leading the pack for data transformations. This is also exciting because it points to a need for continued education and development around new tooling like Polars, which has a lot to offer. I find it surprising that Databricks isn’t used more, but this also has a bright side in that there is a lot of room for growth towards those tools. That is true from an education perspective as a content creator, and it also points to still exciting times ahead for the Data Engineering community as more people and teams adopt new technologies.
Senior Data Engineer, Rippleshot
Data warehouses
Insight 3: Snowflake and BigQuery clearly at the top for Data Warehouses; Azure Synapse lagging behind badly
Brand recognition and adoption - Data Warehouses
Source: Airbyte
Data orchestration
Insight 4: For Data Orchestration, most people are still using self-hosted Airflow, but Dagster is coming up the ranks
Brand recognition and adoption - Data Orchestration
Source: Airbyte
It's unsurprising to find data quality as the number one concern of data engineers. Yet no one seems to own that across companies, which will become an increasingly important issue. Also, I found it interesting to see that most people are widely using self-hosted Airflow, while we heard that Airflow was very difficult to set up. I believe the choice between self-hosted and managed Airflow in the future will depend more on what those managed solutions bring to quickly onboard teams, give a better dev experience, and help solve quality/observability issues.
Head of Customer Education, Astronomer
Business intelligence
Insight 5: For Business Intelligence, the giants Looker and Tableau are still ruling the roost, but there is also significant churn from Tableau to the newer set of solutions
Brand recognition and adoption - Business Intelligence
Source: Airbyte
Data quality
Insight 6: For Data Quality, Great Expectations and Monte Carlo are leading the pack, but more people have not yet tried or explored the tools than have
Brand recognition and adoption - Data Quality
Source: Airbyte
I am particularly happy to see the growth of Data Quality tools that have evolved for good. This signals that maturity is coming along. It's not a shocker to me that Airbyte is still leading the way for the Data Ingestion Layer.
Founder & Host, The Ravit Show
Reverse ETL
Insight 7: For Reverse ETL, Hightouch and Census are neck and neck, but the vast majority of the market is still up for grabs
Brand recognition and adoption - Reverse ETL
Source: Airbyte
Data catalogs
Insight 8: For Data Catalogs, DataHub, Atlan and Amundsen are leading for now, but the vast majority of the market is also up for grabs
Brand recognition and adoption - Data Catalog
Source: Airbyte
Amazing Data Engineering survey! I highly recommend checking out the insights into the adoption of engineering tools, from Data Ingestion and transformation to reverse ETL and Data Catalogs. That section was my highlight. Congratulations to Airbyte for leading the Data Ingestion section.
Founder, Learn Data Engineering
Data Community Survey
Podcasts
Top Podcasts
Source: Airbyte
YouTube channels
Top YouTube Channels
Source: Airbyte
The data engineering community stands out for its open-mindedness and collaborative spirit. Every day, I'm impressed by how we've created a culture of learning and sharing that transcends organizational boundaries and geographical constraints.
Editor, Data Engineering Weekly
Communities
Top Communities
Source: Airbyte