Counterpoint (7 Years Later) - 20 Questions to Detect Fake Data Scientists


Gfesser via Midjourney: "Anime painting of horse wearing a sweater and working on a laptop, sitting in a corner office overlooking the Chicago River."

Erik on Software anniversary post #500.

Almost exactly seven years ago, I published what was for a long time one of my most popular posts: a counterpoint to a KDNuggets essay entitled "20 Questions to Detect Fake Data Scientists". The original essay was making the rounds in some of the forums I was frequenting at the time, and it rubbed me the wrong way. The author had suggested 20 questions intended to help separate the real from the fake when it comes to data science, and the commentary provided alongside the list didn't seem to add any clarity.

In particular, they had made the following statement: "A 'real' data scientist knows how to apply mathematics, statistics, how to build and validate models using proper experimental designs. Having IT skills without statistics skills makes you a data scientist as much as it makes you a surgeon to know how to build a scalpel." My response to this statement hasn't changed: "What are 'IT skills', exactly?".

While many definitions of "data scientist" still exist, I commented that the best definition at the time was probably the one which originated from Josh Wills, Director of Data Science at Cloudera: "Person who is better at statistics than any software engineer and better at software engineering than any statistician." This definition is clean, emphasizing a combined need for the skills of both the software engineer and the statistician within a single individual, while also speaking to the relative scarcity of individuals who can fill this need. 

Interestingly, KDNuggets published a follow-up to the original essay about a month ago, but rather than tackling the challenge of revisiting the relevance of the editors' original questions and answers, the follow-up simply submitted the questions to ChatGPT, a chat-based interface which currently makes use of the GPT-3.5 large language model from OpenAI. If you're already familiar with my original post, you'll know that the intention of my (somewhat) tongue-in-cheek counterpoint was to tackle the software engineering side of data science. So of course I felt compelled to submit to ChatGPT my own counterpoint list of software engineering questions to detect fake data scientists. And what's this? The value of my questions was arguably made even more pronounced in the process.

For reader convenience, the original 20 questions from the essay cited above are reproduced below:

  1. Explain what regularization is and why it is useful.
  2. Which data scientists do you admire most? which startups?
  3. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression.
  4. Explain what precision and recall are. How do they relate to the ROC curve?
  5. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
  6. What is root cause analysis?
  7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
  8. What is statistical power?
  9. Explain what resampling methods are and why they are useful. Also explain their limitations.
  10. Is it better to have too many false positives, or too many false negatives? Explain.
  11. What is selection bias, why is it important and how can you avoid it?
  12. Give an example of how you would use experimental design to answer a question about user behavior.
  13. What is the difference between “long” and “wide” format data?
  14. What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject?
  15. Explain Edward Tufte’s concept of “chart junk.”
  16. How would you screen for outliers and what should you do if you find one?
  17. How would you use either the extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  18. What is a recommendation engine? How does it work?
  19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
  20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimension in a chart (or in a video)?

And here's my counterpoint list of software engineering questions to detect fake data scientists:

  1. As software engineers, we often need to balance short-term and long-term needs, and to distinguish between needs and wants. Tell me about the philosophy you used with a client to address this subject, and how you approached implementation of related decisions.
  2. What types of technology-minded meetups do you generally attend?
  3. How do you test code which you have written?
  4. What is your philosophy with regard to project planning? Tell me about a time when results did not meet your expectations, and the factors which led to these results.
  5. When you choose to improve code written by a colleague (or an open source committer), what general guidelines do you follow? Tell me about a recent scenario in which you were able to improve previously written code not written by yourself. To which "-ities" did this improvement contribute?
  6. Since we are having this discussion, you already realize that some level of hands-on programming is essential as a data scientist. How else do you spend your time on a given project?
  7. Tell me about the development processes you have used on projects, and how the process steps change with regard to the type of project.
  8. What is premature optimization?
  9. With which database models have you worked? With which database products were these database models associated? What effect did interfacing with these database products have on your development work?
  10. Let's go to the whiteboard and walk through the application architecture of one of your current projects. Don't worry about the notation you use.
  11. Some developers describe themselves as "full-stack" developers. What does "full-stack" mean to you in the context of being a data scientist?
  12. Tell me about your experiences with user interface design. What is the difference between UI and UX?
  13. Many software developers do not value data because they take it as a given as being available, think of it as "the easy part", do not understand it, or do not wish to work on it. Tell me how you view data within the context of an organization, from the perspective of both the software engineer and statistician. Who do you expect to do a bulk of the data work prior to analyses?
  14. Provide an example of a time when you made a technology choice for an application architecture you designed, which you determined was the wrong decision during implementation. How did you rectify this decision?
  15. In thinking about the code base for a software application, what do you view as the equivalent of "chartjunk"? How have you prevented it in your code? (Note: the original data scientist question #15 incorrectly referred to "chart junk".)
  16. Are there any areas of software engineering in which you feel you are especially adept? How do you balance these areas with other tasks that are needed, especially those which you do not particularly enjoy?
  17. Let's go back to the whiteboard and talk about the data persistence layer in the application architecture you drew. What were your options? Now walk me through some portion of the data model with which you are intimately familiar, and discuss the change that it might need to undergo over time. To what degree does the chosen database model support change?
  18. Risks are inherent in the software development process. Provide some examples of how you mitigated risk on a recent project.
  19. Which programming languages do you currently use in your daily work? Why? How has your use of languages changed over the course of your career?
  20. Tell me about the development tools you use on a regular basis. Do you generally prefer commercial or open source tooling? Why?

(1) Gfesser: As software engineers, we often need to balance short-term and long-term needs, and to distinguish between needs and wants. Tell me about the philosophy you used with a client to address this subject, and how you approached implementation of related decisions.

ChatGPT: As a language model, I do not have personal experience working with clients. However, I can provide you with a general approach to balancing short-term and long-term needs when working with a client:
  1. Clearly communicate the trade-offs: It's important to be transparent with the client about the trade-offs between short-term and long-term needs, and to help them understand how their decisions will impact both.

  2. Prioritize needs: Identify the most important needs of the client and prioritize them accordingly. This will help ensure that the most critical requirements are met while still considering the long-term goals of the project.

  3. Use a flexible development approach: Agile methodologies such as Scrum and Kanban are well suited for balancing short-term and long-term needs because they allow for flexibility and adaptability. This means that as the project progresses, new requirements can be added or existing ones can be changed to better align with the client's evolving needs.

  4. Identify Minimum Viable Product (MVP): Identifying the MVP, the bare minimum that needs to be delivered to the client, can help balance short-term and long-term needs. This helps in focusing on delivering the most important features and capabilities first and then build on top of that.

  5. Regular check-ins and reviews: Regular check-ins and reviews with the client can help ensure that the project is on track and that any issues or concerns can be addressed in a timely manner.

  6. Keep in mind the bigger picture: While focusing on the short-term needs, It's important to keep in mind the bigger picture, making sure that the project is aligning with the long-term goals of the client.

By following this approach, it should help ensure that the client's short-term and long-term needs are both considered and addressed in a balanced way.


Grade: D

 
ChatGPT's response displays several issues. ChatGPT doesn't provide a philosophy, i.e. an attitude held by a person or organization that serves as a guiding principle for behavior. Instead, ChatGPT provides a laundry list, seemingly leaving it up to the user to derive a philosophy from its response. Furthermore, ChatGPT doesn't even try to distinguish between needs and wants, consistently referring only to needs in its response. ChatGPT also assumes that the context of my question is a project rather than a product or broader ecosystem. And ChatGPT's response doesn't give me the impression that it understands the concept of "long-term". In point #3, for example, ChatGPT states that Scrum is well-suited for balancing short-term and long-term needs, but Scrum isn't intended to address long-term needs. Point #6 perhaps gets closest to the concept of "long-term" by mentioning the "bigger picture", but the bigger picture of a given client oftentimes only involves project scope rather than the lifecycle of a given product. And with respect to lifecycle, ChatGPT misconstrues the purpose of an MVP in point #4. Yes, an MVP provides the bare minimum, but the point of this minimum is to let early users provide feedback that informs future product development; it isn't intended to address long-term needs on its own. ChatGPT's answer likely reflects its training data, much of which misperceives what an MVP is all about. I've personally needed to remind myself to revisit the source material on this topic, but this information should be readily at hand for ChatGPT.

ChatGPT's straw man argument
It's arguably worth pointing out that, because ChatGPT isn't human, it can't be expected to directly respond to questions with humanness in mind. As such, I've adjusted my initial evaluations of its responses to take this fact into account. That said, I don't particularly care for how ChatGPT addresses this topic. For example, in software engineering question #1 I've referred to a "client" (i.e. a person or organization using the services of a professional person or company). As a chatbot, ChatGPT can obviously interact with clients, so its representing itself as *solely* a language model is a straw man argument. Because ChatGPT presented a similar disclaimer at the outset of many of its responses to my questions, I've chosen not to comment on this refrain each time it appears throughout this article.

(2) Gfesser: What types of technology-minded meetups do you generally attend?

ChatGPT: As a language model, I do not attend meetups or events. However, there are many different types of technology-minded meetups that individuals can attend, some examples include:

  1. Programming languages meetups: Meetups focused on specific programming languages such as Python, JavaScript, and Java, where developers can learn about new developments, share their knowledge and network with other developers in the community.

  2. Open-source meetups: Meetups focused on open-source projects and technologies where developers can collaborate, share their experiences, and learn about new open-source tools and technologies.

  3. DevOps meetups: Meetups focused on the practices, tools, and technologies related to software development and operations, such as continuous integration and deployment, containerization, and infrastructure as code.

  4. Artificial Intelligence/Machine Learning meetups: Meetups focused on AI and Machine Learning technologies where individuals can learn about the latest developments, share their knowledge and network with other experts in the field.

  5. Cybersecurity meetups: Meetups focused on the security of software and hardware systems where individuals can learn about the latest threats and vulnerabilities, and share knowledge on how to protect against them.

  6. Cloud-computing meetups: Meetups focused on cloud computing and infrastructure where individuals can learn about the latest cloud-based technologies and services, such as AWS, Azure, and GCP, share their experiences and network with other cloud experts.

  7. Internet of Things (IoT) meetups: Meetups focused on IoT technologies, where individuals can learn about the latest developments in IoT, share their knowledge and network with other IoT experts.

These are just a few examples of the many different types of technology-minded meetups that are available. The best way to find meetups in your area is to search online for technology meetups in your city or region.


Grade: A

As mentioned in my original piece, this is a giveaway question.


(3) Gfesser: How do you test code which you have written?

ChatGPT: As a language model, I do not write code, so I do not have a personal experience with testing code. However, I can provide you with a general overview of some common techniques for testing code:

  1. Unit Testing: Unit testing is a technique used to test individual units or components of code, usually functions or methods. Unit tests are typically written using a framework such as JUnit, TestNG, or NUnit, and they are typically automated to make it easy to run them as part of a continuous integration pipeline.

  2. Integration Testing: Integration testing is a technique used to test how different units or components of code interact with each other. This can include testing how different parts of an application interact, as well as testing how an application integrates with external systems, such as databases or web services.

  3. Functional Testing: Functional testing is a technique used to test how well an application meets its functional requirements. This can include testing the application's user interface, as well as testing its underlying functionality, such as testing how the application handles different inputs or how it responds to different events.

  4. Performance Testing: Performance testing is a technique used to test how well an application performs under different loads or conditions. This can include testing how well an application scales, as well as testing how well it performs in terms of response time and throughput.

  5. Acceptance Testing: Acceptance testing is a technique used to test whether an application meets the requirements of the stakeholders, such as customers or users. Acceptance testing is typically done by the stakeholders themselves and it can include both functional and non-functional requirements.

  6. Security Testing: Security testing is a technique used to test how well an application is protected against security threats, such as hacking, data breaches, or malware. Security testing can include testing the application's authentication and authorization mechanisms, as well as testing how well the application handles sensitive data.

These are just a few examples of the many different types of testing that can be used to test code. The choice of testing technique will depend on the nature of the application, the development process, and the stakeholders' requirements.


Grade: C

ChatGPT's response displays several issues. ChatGPT arguably explains the "what", not the "how", with the exception of its explanation for unit testing in point #1. Periodically injecting the word "how" in its response doesn't mean that ChatGPT is actually explaining the how. For example, point #2 mentions "testing how an application integrates with external systems." But this statement just explains the what, in contrast with point #1 which explains that the how is accomplished by writing unit tests using particular frameworks. ChatGPT also doesn't always answer from the perspective of the individual who has written the code. For example, point #5 states that acceptance testing is performed by stakeholders such as customers or users.
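
To make the contrast between the "what" and the "how" concrete, here is a minimal sketch of how a developer might test code they have written themselves, using pytest; the trimmed_mean function and its test data are hypothetical, chosen only for illustration.

# test_stats_utils.py: a minimal unit testing sketch using pytest.
import pytest

def trimmed_mean(values, trim_fraction=0.1):
    """Return the mean after dropping the top and bottom trim_fraction of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    trimmed = ordered[k:len(ordered) - k] or ordered
    return sum(trimmed) / len(trimmed)

def test_trimmed_mean_drops_extremes():
    # The outlier 1000 should be dropped before averaging.
    assert trimmed_mean([1, 2, 3, 4, 1000], trim_fraction=0.2) == pytest.approx(3.0)

def test_trimmed_mean_rejects_empty_input():
    with pytest.raises(ValueError):
        trimmed_mean([])

Running pytest against this file executes both tests, and the same command can be wired into a continuous integration pipeline so the tests run on every change.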


(4) Gfesser: What is your philosophy with regard to project planning? Tell me about a time when results did not meet your expectations, and the factors which led to these results.
 
ChatGPT: As a language model, I don't have personal philosophy or experiences with project planning. However, I can provide you with a general overview of some common project planning techniques and philosophies:
  1. Agile Methodology: Agile methodology is a project management approach that emphasizes flexibility, collaboration, and rapid iteration. Agile methodology is often used in software development, and it is characterized by a focus on delivering working software quickly, and then iterating on that software based on feedback from stakeholders.

  2. Waterfall Methodology: Waterfall methodology is a project management approach that is characterized by a linear, sequential flow. Waterfall methodology is often used in projects where the requirements are well-defined, and it is characterized by a focus on delivering a complete, working product at the end of the project.

  3. Scrum: Scrum is an Agile framework for managing and completing complex projects. Scrum is characterized by a focus on delivering working software in short iterations, and it is often used in software development, product development, and other projects where requirements are not well-defined.

  4. Kanban: Kanban is a project management approach that is characterized by a focus on visualizing workflows and limiting work in progress. Kanban is often used in manufacturing and other industries where there are many tasks that need to be completed in parallel.

In any project, it is possible that the results do not meet the expectations due to various reasons, for example:

  • Lack of proper planning and estimation: When project planning is not thorough, it can lead to unrealistic expectations and missed deadlines.
  • Changing requirements: Changing requirements can cause delays and added costs, and can lead to a final product that does not meet the stakeholders' needs.
  • Lack of communication and collaboration: Lack of communication and collaboration among team members can lead to misunderstandings, delays, and misaligned expectations.
  • Technical challenges: Technical challenges can cause delays and added costs, and can lead to a final product that does not meet the stakeholders' needs.
  • Unforeseen events: Unforeseen events, such as natural disasters, pandemics, or other disruptions, can cause delays and added costs, and can lead to a final product that does not meet the stakeholders' needs.

It is important to identify the factors that led to unexpected results, evaluate the root causes and learn from them, in order to prevent similar issues from happening in the future, and to improve the project planning process.


Grade: C

As with its response to question #1, ChatGPT doesn't provide a philosophy, i.e. an attitude held by a person or organization that serves as a guiding principle for behavior. But in contrast with its response to question #1 which simply ignores my request for a philosophy, ChatGPT uses its being a language model as an excuse not to provide one. The main problem with ChatGPT's response, however, is that it provides two disjointed laundry lists as its answer to the question. For example, would use of agile correspond to all its examples as to why results might not meet expectations? Does use of agile or Kanban make use of deadlines? Since one of the key benefits of agile is to provide flexibility, permitting stakeholders to shift direction downstream, why would a final product not meet the needs of stakeholders?


(5) Gfesser: When you choose to improve code written by a colleague (or an open source committer), what general guidelines do you follow? Tell me about a recent scenario in which you were able to improve previously written code not written by yourself. To which "-ities" did this improvement contribute?
 
ChatGPT: As a language model, I don't have personal experience in choosing to improve code written by a colleague or an open source committer. However, I can provide you with a general overview of some common guidelines that are followed when improving code:
  1. Understand the code: Before making any changes, it is important to understand how the code works and why it was written the way it was. This can involve reading through the code, testing it, and talking to the original author.

  2. Follow the project's style guide: If a project has a style guide, it's important to follow it when making changes to the code. This can help to ensure that the code is consistent and easy to read.

  3. Test the code: Before and after making changes, it's important to test the code to ensure that it still works correctly and that the changes don't introduce any new bugs.

  4. Document the changes: It's important to document the changes that were made, and why they were made, so that other developers can understand what was done.

  5. Follow best practices: When making changes, it's important to follow best practices for writing good, maintainable code.

  6. Communicate and seek feedback: Before making any significant changes, it's good practice to communicate with the original author and seek their feedback.

In terms of improving the code, the changes can contribute to various "-ities" such as:

  • Readability: By making the code more readable, it's easier for other developers to understand how the code works and why it was written the way it was.

  • Maintainability: By making the code more maintainable, it's easier for other developers to make changes to the code in the future.

  • Testability: By making the code more testable, it's easier to test the code and ensure that it works correctly.

  • Performance: By making the code more performant, the code will execute faster and use fewer resources.

  • Security: By making the code more secure, it's less likely that the code will be vulnerable to hacking, data breaches, or malware.

It's important to note that code improvement can be a subjective matter and different developers may have different opinions on how the code should be improved.


Grade: B

Unlike its response to question #4, ChatGPT doesn't provide a disjointed answer, although this was likely easier to avoid here because the topic of discussion is arguably singular. Question #4 asked for a philosophy followed by a real-life example, and ChatGPT answered each of these in a standalone manner, not relating its answers to each other. In its response to question #5, ChatGPT is seemingly thrown off by the term "choose", likely because it acts in accordance with user requests; but since it has shown that it can generate code, and is presumably making choices between code alternatives before doing so, its response is arguably misleading.
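
To make the "-ities" concrete, here is a minimal before-and-after sketch, entirely hypothetical, in which a colleague's function is restructured to improve readability, maintainability, and testability without changing its behavior.

# A hypothetical before-and-after refactoring sketch; behavior is unchanged.
import csv

# Before: parsing, filtering, and business rules are tangled together,
# which makes the function hard to read and hard to unit test.
def process(path):
    out = []
    with open(path) as f:
        for row in csv.DictReader(f):
            if row["status"] == "active" and float(row["amount"]) > 0:
                out.append({"id": row["id"], "amount": round(float(row["amount"]) * 1.07, 2)})
    return out

# After: the business rules are separated from file handling, so they can be
# tested without touching the file system (testability), and each function
# expresses a single idea (readability, maintainability).
TAX_RATE = 1.07  # hypothetical constant, named for clarity

def read_rows(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def is_billable(row):
    return row["status"] == "active" and float(row["amount"]) > 0

def to_invoice_line(row):
    return {"id": row["id"], "amount": round(float(row["amount"]) * TAX_RATE, 2)}

def process(path):
    return [to_invoice_line(row) for row in read_rows(path) if is_billable(row)]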



Gfesser via Midjourney: "Anime painting of horse wearing a sweater and working on a laptop, sitting in a corner office overlooking the Chicago River."

(6) Gfesser: Since we are having this discussion, you already realize that some level of hands-on programming is essential as a data scientist. How else do you spend your time on a given project?
 
ChatGPT: As a language model, I don't have personal experience working on projects as a data scientist, but I can provide some general information on the types of tasks that data scientists might spend time on during a project.
  1. Data collection and cleaning: Data scientists often spend a significant amount of time collecting and cleaning data. This can involve scraping data from the web, collecting data from APIs, or working with data provided by clients.

  2. Exploratory data analysis: Data scientists often spend a significant amount of time exploring and analyzing data. This can involve visualizing data, identifying patterns, and building models.

  3. Feature engineering: Data scientists often spend time creating new features from existing data. This can involve combining multiple features, creating new features based on domain knowledge, or creating new features based on existing features.

  4. Modeling: Data scientists often spend time building and fine-tuning models. This can involve selecting appropriate algorithms, training models, and evaluating models.

  5. Communicating results: Data scientists often spend time communicating their results to stakeholders. This can involve creating visualizations, writing reports, and presenting results to clients or management.

  6. Deployment: Data scientists often spend time deploying models into production environments. This can involve creating web services, integrating models into mobile apps, or deploying models to the cloud.

  7. Monitoring: After the model is deployed, data scientists often spend time monitoring the performance of the model in production, and updating the model as necessary.

  8. Continual learning: Data scientists often spend time keeping up-to-date with new techniques and technologies, attending conferences and workshops, and reading research papers to stay current in the field.

It's important to note that the specific tasks that a data scientist will spend time on will depend on the specific project, and the stage of the project.


Grade: A

As mentioned in my original piece, the original KDNuggets question was not directly related to data science, so I made sure not to make the same mistake.
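
To ground the hands-on programming side of the tasks listed above, here is a minimal sketch of exploratory analysis, feature engineering, modeling, and evaluation using pandas and scikit-learn; the file and column names are hypothetical.

# A minimal sketch of hands-on data science work: explore, engineer features, model, evaluate.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")            # data collection (hypothetical file)
print(df.describe())                         # exploratory data analysis
df = df.dropna(subset=["tenure_months", "monthly_spend", "churned"])   # data cleaning

# Feature engineering: derive a new feature from existing ones.
df["spend_per_tenure_month"] = df["monthly_spend"] / (df["tenure_months"] + 1)

features = ["tenure_months", "monthly_spend", "spend_per_tenure_month"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["churned"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)  # modeling
print(classification_report(y_test, model.predict(X_test)))            # communicating results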


Software engineering applies to both "data science" *and* "data engineering"
Interestingly, the term "data science" was coined in 1974, proposed as an alternative to the term "computer science". Some dispute exists as to the origin of its modern usage, however, with most pointing to 2008, and consensus is still lacking with respect to its definition. I mentioned in my introduction to this piece that the intention of my original 2016 counterpoint was to tackle the software engineering side of data science, and seven years later it is remarkable how quickly the term "data engineering" has caught on. That said, the software engineering side of data science isn't restricted to data engineering, because software engineering is needed for both data science *and* data engineering.
Curiously, the original KDNuggets piece now mentions the term "data engineer" once in its introduction, but the associated link (from 2022) is dead. In checking the Wayback Machine, I've confirmed that this blurb did not exist across any of the versions of the piece from 2015 to 2017, helping explain why I hadn't cited it. Unlike the term "data science", the coinage of "data engineering" cannot be pinpointed, with most simply saying that its usage started in the "early 2010s", and I personally did not use it until the mid-2010s, around the time when I started using the term "data pipeline". 

(7) Gfesser: Tell me about the development processes you have used on projects, and how the process steps change with regard to the type of project.
 
ChatGPT: There are many different development processes that can be used on projects. Some of the most common include:
  1. Waterfall: The Waterfall model is a linear, sequential process in which development proceeds through distinct phases, such as requirements gathering, design, implementation, testing, and maintenance. This process is often used for projects with well-defined requirements and a fixed scope.

  2. Agile: The Agile process emphasizes flexibility and rapid iteration. Agile development is done in short sprints, with requirements and solutions emerging through the collaborative effort of self-organizing cross-functional teams. Agile methodologies such as Scrum, Kanban and Lean are often used for software development projects, but also used in other fields

  3. Iterative: Iterative development is a process in which the project is broken down into smaller, incremental stages. Each stage includes the full development process, from requirements gathering to testing and deployment. This process is often used for projects that are too complex to be completed in a single pass.

  4. DevOps: DevOps is a process that emphasizes collaboration between development and operations teams. The goal of DevOps is to automate the software development process, from development to deployment, in order to make the process more efficient and reliable.

  5. Lean: Lean development is a process that emphasizes speed, efficiency, and minimal waste. It is based on the principles of Lean manufacturing and is often used in the manufacturing and software development fields

Each of these processes has its own strengths and weaknesses and the choice of which process to use will depend on the specific project, the team and the organization. For example, a project with a fixed scope and well-defined requirements might be a good fit for the Waterfall process, while a project that requires rapid iteration and flexibility might be a better fit for an Agile process. A project that needs to be deployed quickly and have a continuous delivery pipeline might be a good fit for a DevOps process.

It is important to note that different projects, teams, and organizations may have different needs, and it is important to evaluate the strengths and weaknesses of each process and choose the one that is the best fit for a specific project.


Grade: B

 
Unlike in prior responses, ChatGPT doesn't excuse itself from answering on the grounds that it is a language model. ChatGPT speaks to potential project approaches, but interestingly summons a different set of methodologies than it did in its answer to question #4. Both responses mention agile and waterfall, but the descriptions are slightly different, and the earlier response additionally mentions Scrum and Kanban, versus iterative, DevOps, and Lean here. I don't think DevOps belongs in this list, and in any case DevOps doesn't involve separate development and operations teams. Ideally, the same individuals perform both development and operations, and barring this, individuals with either area of expertise work alongside each other on the same teams.

(8) Gfesser: What is premature optimization?
 
ChatGPT: Premature optimization is a term used to describe the practice of optimizing a software system before it is necessary. This can happen when a developer or engineer spends a significant amount of time optimizing code or algorithms before the system has been proven to have a performance bottleneck.

Optimizing code before it is necessary can be problematic for a number of reasons:

  1. It can lead to overly complex code that is difficult to maintain and understand.

  2. It can cause delays in development, as the developer spends more time on optimization than on other important tasks such as requirements gathering, design, and testing.

  3. It can lead to over-optimization, where the optimization is too aggressive, and the system's performance is worse than it would be without the optimization.

  4. It can result in wasted time and effort if the optimization is not needed, or if the problem is solved in a different way.

It's important to note that optimization is still important and necessary, but it should be done at the right time, based on data and measurements.

A common approach to avoid premature optimization is to first write the code to be clear and easy to understand, and then, once the performance bottlenecks have been identified through profiling and benchmarking, optimize only the specific parts of the code that are causing problems. This way, the optimization is targeted and effective, rather than a general and unnecessary.


Grade: B

Generally speaking, ChatGPT provided a decent response, albeit one that is arguably a bit evasive in its use of the word "can" in its second sentence. In other words, is its second sentence intended to be just an example? The issue here is that ChatGPT frames the scope of a given optimization problem as a "software system", but seems to focus only on code for the remainder of its response. Optimization for a given software system often involves components outside the code, such as the infrastructure on which the code executes and the data which the code processes, and the needed optimization oftentimes lives outside the code as well, such as in configuration. And it's arguably worth challenging ChatGPT's statement that a performance bottleneck must first be proved. Optimization is justified, for example, if foreseeable workloads are already known knowns. Functional correctness is the first priority, but any expected workloads should obviously be taken into account because these will have architectural impact.
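
To illustrate the "measure first" approach, here is a minimal sketch using only the Python standard library; the slow function and its workload are hypothetical.

# Profile first, then optimize only what the measurements implicate.
import cProfile
import pstats

def count_distinct(records):
    # Hypothetical hot spot: membership tests against a list are O(n) each.
    seen = []
    for record in records:
        if record not in seen:
            seen.append(record)
    return len(seen)

records = [i % 500 for i in range(200_000)]

profiler = cProfile.Profile()
profiler.enable()
count_distinct(records)
profiler.disable()

# The profile shows where time is actually spent before any optimization is attempted.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

# Only after the profile implicates the membership test would the function be rewritten:
def count_distinct_optimized(records):
    return len(set(records))  # O(1) membership via a set instead of O(n) via a list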


Performance and "premature optimization"
Many of us were exposed to the maxim "premature optimization is the root of all evil" either during our computer science coursework or early in our software development careers. This maxim is just as important (perhaps even more so) for modern data work, which has been converging with software development in recent years. Some engineers, however, take this quote out of context. The broader quote by Donald Knuth states the following: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
In a community post a few months ago about data normalization, I elaborated a bit about the importance of considering data workloads when tackling performance, mainly because of the prevalence of inexperienced professionals: "Performance for analytical operations typically needs to be handled differently than performance for transactional operations. Why is this? Because when analytics is performed, scanning many records is often needed to calculate aggregates etc, unlike targeted queries which are looking for specific records. From my experience, it's still relatively common to hear development teams in the workplace talk about the number of records involved when it comes to performance. But there are many other additional factors. How many fields does each record contain? What queries will be run against these records? What are the data types of the field values being queried? What are the cardinalities of the field values being queried? and What database product type is being used to store this data, or conversely, What query engine / compute layer is being used, if storage is kept separate in cases of data lakes etc? All of these are important considerations."
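
To illustrate the distinction between these workloads, here is a minimal sketch using the standard library sqlite3 module; the table and column names are hypothetical. The first query is a targeted transactional lookup, while the second is an analytical aggregate that must consider many records.

# Targeted (transactional) lookup versus analytical aggregate over the same table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("midwest" if i % 2 else "northeast", i * 0.5) for i in range(1, 100_001)],
)

# Transactional: fetch one specific record by its key; the primary key index makes this cheap.
print(conn.execute("SELECT amount FROM orders WHERE order_id = ?", (12345,)).fetchone())

# Analytical: aggregate across the whole table; every record in each group must be scanned,
# so record width, the cardinality of region, and the underlying storage engine matter far
# more here than they do for the targeted lookup above.
print(conn.execute("SELECT region, COUNT(*), AVG(amount) FROM orders GROUP BY region").fetchall())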

(9) Gfesser: With which database models have you worked? With which database products were these database models associated? What effect did interfacing with these database products have on your development work?
 
ChatGPT: I have worked with several different database models, including:
  1. Relational Model: The relational model is the most commonly used database model. It organizes data into tables, with rows representing individual records and columns representing fields or attributes. The relational model is associated with several popular database products such as MySQL, PostgreSQL, and Oracle. Working with these products has allowed me to easily store and retrieve data in a structured and organized way.

  2. Document Model: The document model is a non-relational database model that stores data in the form of documents, such as JSON or XML. Document databases are associated with products such as MongoDB and Couchbase. working with these databases allowed me to store semi-structured data and perform complex queries easily.

  3. Graph Model: Graph databases are non-relational databases that use graph structures with nodes, edges, and properties to represent and store data. They are associated with products such as Neo4j and ArangoDB. Working with these products allowed me to store and retrieve data in a way that is optimized for graph-based queries, such as social networks and recommendation systems.

  4. Key-Value Model: The key-value model is a non-relational database model that stores data in the form of key-value pairs. Key-value databases are associated with products such as Redis and Riak. Working with these products allowed me to store and retrieve data in a way that is optimized for high-speed, low-latency access.

Interfacing with these different database products has had a positive effect on my development work by allowing me to choose the right database model and product for the specific needs of a project. Each model has its own strengths and weaknesses, and choosing the right one can greatly affect the performance and scalability of a system.


Grade: B

 
The community often comments that the term "model" is excessively overloaded, and ChatGPT chose to respond using a safe definition for it, but the database model categorizations ChatGPT uses are a bit rigid because some database products span more than one of them. While database products offer other models as well, the question makes use of the word "you", so unless the creators of ChatGPT (i.e. OpenAI) divulge which database models ChatGPT has actually been used with, it is not possible to verify whether these might apply.
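
The effect that a database model has on development work is easiest to see in code, so here is a minimal sketch contrasting relational access via the standard library sqlite3 module with key-value access, using an in-memory dict as a stand-in for a product such as Redis; all names are hypothetical.

# Relational model: the schema is explicit and ad hoc queries are expressed in SQL.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (user_id, name, city) VALUES (1, 'Ada', 'Chicago')")
print(conn.execute("SELECT name FROM users WHERE city = 'Chicago'").fetchall())

# Key-value model: access is by key only, so access paths must be decided up front;
# a plain dict stands in here for a product such as Redis.
kv_store = {}
kv_store["user:1"] = json.dumps({"name": "Ada", "city": "Chicago"})
print(json.loads(kv_store["user:1"])["name"])
# Finding all users in Chicago now requires scanning every key or maintaining a
# secondary structure such as kv_store["city:Chicago"] = ["user:1"].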

(10) Gfesser: Let's go to the whiteboard and walk through the application architecture of one of your current projects. Don't worry about the notation you use.
 
ChatGPT: I am sorry, but I am just a software model and I do not have physical abilities, so I am unable to go to a whiteboard. However, I can describe the architecture of a project in plain text.

The project I am going to describe is a web-based e-commerce application. The architecture is based on microservices and follows a client-server pattern.

The front-end is built using React, a JavaScript library for building user interfaces. The front-end communicates with the back-end through a set of RESTful APIs.

The back-end is composed of several microservices, each one responsible for a specific functionality.

  • The Catalog Service handles all the product-related functionality, such as fetching product data, handling product search, and managing product categories.
  • The Order Service handles all the order-related functionality, such as creating and managing orders, handling payments, and sending confirmation emails.
  • The User Service handles all the user-related functionality, such as creating and managing user accounts, handling authentication, and managing user profiles.

All the microservices communicate with each other through a message broker, such as RabbitMQ, which allows for loose coupling and asynchronous communication.

The data is stored in a MySQL database, which is responsible for storing all the data related to products, orders, and users.

The application is deployed on a cloud platform, such as AWS or GCP, and is designed to be scalable and highly available.

The architecture is designed to be modular, so new services can be added or removed easily. This allows for flexibility and makes the application easy to maintain and evolve.

It's important to note that this is just a high-level overview, and there may be other components and considerations involved depending on the specific requirements of the project.


Grade: C

 
ChatGPT doesn't currently generate images, so this question is a bit unfair. I even made multiple unsuccessful attempts to get DALL-E and Midjourney to generate plausible images of people drawing architecture diagrams on whiteboards. While the first sentence of the response seems to understand what use of a whiteboard entails – at least use of a physical whiteboard – the second sentence mentions "the architecture of a project" rather than an "application architecture" as mentioned by the question, leading me to doubt that ChatGPT understands the question. The answer is generally reasonable, but while it mentions loose coupling between microservices it doesn't seem to adequately take the data into account, due to its mentioning of a presumably singular MySQL database which stores "all" the data for the listed microservices. And the caveat that ChatGPT provides in its closing sentence does not provide comfort.


Let's try that again: ChatGPT can actually output markup
For the dialogue presented as part of this sidebar, I entered all Mermaid markup generated by ChatGPT into the Mermaid Live Editor. As noted, I needed to modify the generated markup on a few occasions due to editor bugs rather than incorrectly generated markup. While the generated sequence diagrams can be impressive at first glance, after several exchanges it's clear that ChatGPT shows no true understanding, resulting in my going in circles with ChatGPT to repeatedly fix issues, only to discover that issues continue to get reintroduced over time. As such, ChatGPT's deficiencies are magnified over time, not boosting the grade I granted for the initial non-markup response.
Note that I've included some inline commentary in parentheses to highlight where these scenarios surface.
Gfesser: Using markdown format for Mermaid, draw a sequence diagram of a web-based e-commerce application that uses AWS native services and microservice hosted components.
ChatGPT:
 
sequenceDiagram
    participant Frontend
    participant API Gateway
    participant Microservice1
    participant Microservice2
    participant AmazonDynamoDB
    participant AmazonSNS
    participant AmazonSQS
    participant AmazonCognito
    participant User
    Frontend ->> API Gateway: Add Item to Cart
    API Gateway ->> Microservice1: Check Stock Availability
    Microservice1 ->> AmazonDynamoDB: Query Product Inventory
    AmazonDynamoDB -->> Microservice1: Return Inventory Count
    alt Sufficient Inventory
        Microservice1 ->> AmazonDynamoDB: Decrement Product Inventory
        AmazonDynamoDB -->> Microservice1: Return Success
        Microservice1 ->> AmazonSQS: Send Order to Fulfillment Queue
        AmazonSQS -->> Microservice1: Return Success
        Microservice1 ->> Frontend: Order Confirmed
    else Insufficient Inventory
        Microservice1 ->> Frontend: Out of Stock
    end
    Frontend ->> API Gateway: Checkout
    API Gateway ->> AmazonCognito: Authenticate User
    AmazonCognito -->> API Gateway: Return User Credentials
    API Gateway ->> Microservice2: Process Payment
    Microservice2 ->> AmazonSNS: Send Shipping Notification
    AmazonSNS ->> User: Send Shipping Notification
 
[Sequence diagram as rendered in the Mermaid Live Editor]
 

This sequence diagram shows an example of a web-based e-commerce application that uses a combination of AWS native services and microservice hosted components:

  • Frontend: This is the user-facing part of the application that sends requests to the API Gateway.
  • API Gateway: This serves as the entry point for all incoming requests and forwards them to microservices.
  • Microservice1: This is a hosted microservice that provides functionality related to user authentication, product management, and order management. It checks stock availability and decrements product inventory by sending a message to the Amazon SQS fulfillment queue.
  • Microservice2: This is a hosted microservice that provides functionality related to payment processing, shipping, and notifications. It processes the payment and sends a notification to the user using Amazon SNS.
  • AmazonDynamoDB: This is a NoSQL database service that stores and retrieves data for Microservice1.
  • AmazonSNS: This is a managed service that provides pub/sub messaging for Microservice2.
  • AmazonSQS: This is a managed message queue service that decouples and scales microservices for Microservice1.
  • AmazonCognito: This is a managed service that provides user sign-up, sign-in, and access control.
  • User: This is the end-user of the application.

The sequence of events in the diagram is as follows:

  1. The user adds an item to their cart on the frontend of the application, which sends a request to the API Gateway.
  2. The API Gateway forwards the request to Microservice1 to check the stock availability of the item.
  3. Microservice1 queries the AmazonDynamoDB database for the inventory count of the item and receives the count in response.
  4. If the inventory count is sufficient, Microservice1 decrements the inventory count in AmazonDynamoDB and sends a message to the AmazonSQS fulfillment queue to process the order. It then sends a confirmation message to the frontend.
  5. If the inventory count is insufficient, Microservice1 sends a message to the frontend that the item is out of stock.
  6. The user checks out on the frontend of the application, which sends a request to the API Gateway.
  7. The API Gateway authenticates the user with AmazonCognito and receives the user credentials in response.
  8. The API Gateway forwards the

Gfesser: Why did you wait to authenticate until after checkout?

ChatGPT: The reason for waiting to authenticate until after checkout in the sequence diagram is for simplicity and clarity of the diagram. In a real e-commerce application, the user would likely need to be authenticated before they could add items to their cart or initiate the checkout process. However, including these additional steps in the sequence diagram would make it more complex and harder to read, so they were omitted for the sake of clarity.

(The reasoning here makes absolutely no sense: reduced complexity of a sequence diagram doesn't necessarily make it clearer to understand, especially when it comes to authentication.)

Gfesser: Why did you send Shipping Notification to User instead of Front End?

ChatGPT: In the sequence diagram, the shipping notification is sent from Microservice2 to the user using Amazon SNS, rather than to the frontend of the application. This is because the notification is specific to the user and not relevant to the frontend of the application. By sending the notification directly to the user, it ensures that they receive the notification even if they are not currently interacting with the application.

Additionally, the frontend of the application may not have access to the user's contact information or preferences, such as their email address or phone number, which are needed to send the notification. By sending the notification directly to the user using Amazon SNS, the application can ensure that the notification is sent to the correct contact information and in the preferred format of the user.

(ChatGPT's initial explanation doesn't make any sense: how can a notification be sent to a user non-programmatically? It seems to assume programmatic notification in the second portion of its response, as it mentions email address and phone number, but neither of these are depicted in the generated sequence diagram.)

Gfesser: How does User consume SNS notifications?

ChatGPT: When a user signs up to receive notifications using Amazon SNS, they must provide their preferred contact information and format for the notification, such as email or SMS. Once the notification is sent from the e-commerce application using SNS, it is delivered to the user through their preferred channel.

For example, if the user chose to receive notifications by email, the notification would be sent to the user's email address provided during sign-up. The user can then read the notification in their email inbox.

Alternatively, if the user chose to receive notifications by SMS, the notification would be sent to the user's phone number provided during sign-up. The user can then view the notification in their text messages.

Overall, the user consumes SNS notifications through their preferred communication channel, which is specified during sign-up.

Gfesser: What would happen if Product Inventory was decremented, and an Order was not sent to Fulfillment Queue?

ChatGPT: If the product inventory was decremented and an order was not sent to the fulfillment queue, it would indicate a problem with the application's logic or the integration between its components.

For example, if the inventory was decremented but the order was not sent to the fulfillment queue, it might indicate an error in the code that handles the checkout process. Alternatively, it could indicate a problem with the communication between the microservices, such as a failure in the message queue that connects them.

As a result of this error, the customer's order may not be processed and fulfilled, resulting in a negative experience for the customer. The inventory levels may also become inaccurate, causing issues with future orders or inventory management. It is essential to monitor the application for such errors and take steps to address them promptly to ensure a seamless customer experience and accurate inventory management.

(Here ChatGPT seems to show some understanding, but it doesn't demonstrate this understanding in the generated sequence diagram.)

Gfesser: How would you modify this architecture to ensure that Product Inventory is not decremented unless a given Order is sent to Fulfillment Queue?

ChatGPT: To ensure that the product inventory is not decremented unless an order is sent to the fulfillment queue, we can modify the architecture to introduce an additional step that confirms the order's placement in the fulfillment queue before reducing the product inventory.

One way to achieve this is to modify Microservice1 to publish a message to a new queue, such as the "Order Confirmation" queue, after the order is successfully placed in the fulfillment queue. Microservice3, which is responsible for managing the product inventory, can then consume messages from the "Order Confirmation" queue. When Microservice3 receives an order confirmation message, it can decrement the product inventory accordingly.

Here is an updated diagram that illustrates this modification:

sequenceDiagram
    participant User
    participant Frontend
    participant Microservice1
    participant FulfillmentQueue
    participant OrderConfirmationQueue
    participant Microservice3

    User->>+Frontend: Place order
    Frontend->>+Microservice1: Submit order
    Microservice1->>+FulfillmentQueue: Send order
    FulfillmentQueue-->>-Microservice1: Order received
    Microservice1->>+OrderConfirmationQueue: Send confirmation
    OrderConfirmationQueue-->>-Microservice3: Order confirmed
    Microservice3->>Microservice1: Decrement inventory
    Microservice1-->>Frontend: Order confirmed
    Microservice3-->>Frontend: Inventory updated
    Microservice1->>-User: Send shipping notification
 
[Sequence diagram as rendered in the Mermaid Live Editor]

Gfesser: Manually edited highlighted markdown by removing trailing "-" of each line due to the following error when submitting to Mermaid: "Error: Trying to inactivate an inactive participant". Edited markdown is technically incorrect UML notation, but conveys same information in spirit.
With this modification, the product inventory will only be decremented after the order is successfully placed in the fulfillment queue and confirmed through the "Order Confirmation" queue. This ensures that the product inventory stays accurate and prevents potential issues that could arise if the inventory was decremented but the order was not sent to the fulfillment queue.
 
Gfesser: How would you modify this architecture to ensure that Product Inventory is not decremented unless a given Order is successful?
 
ChatGPT: To ensure that the product inventory is not decremented unless a given order is successful, we can modify the architecture to introduce an additional step that confirms the order's success before reducing the product inventory.

One way to achieve this is to modify Microservice1 to publish a message to a new queue, such as the "Order Success" queue, after the order is successfully processed. Microservice3, which is responsible for managing the product inventory, can then consume messages from the "Order Success" queue. When Microservice3 receives an order success message, it can decrement the product inventory accordingly.

Here is an updated diagram that illustrates this modification:

sequenceDiagram
    participant User
    participant Frontend
    participant Microservice1
    participant FulfillmentQueue
    participant OrderSuccessQueue
    participant Microservice3

    User->>+Frontend: Place order
    Frontend->>+Microservice1: Submit order
    Microservice1->>+FulfillmentQueue: Send order
    FulfillmentQueue-->>-Microservice1: Order received
    Microservice1->>+OrderSuccessQueue: Send order success
    OrderSuccessQueue-->>-Microservice3: Order success
    Microservice3->>Microservice1: Decrement inventory
    Microservice1-->>Frontend: Order confirmed
    Microservice3-->>Frontend: Inventory updated
    Microservice1->>-User: Send shipping notification
 
[Sequence diagram as rendered in the Mermaid Live Editor]

Gfesser: Manually edited highlighted markdown by removing trailing "-" of each line due to the following error when submitting to Mermaid: "Error: Trying to inactivate an inactive participant". Edited markdown is technically incorrect UML notation, but conveys same information in spirit.
With this modification, the product inventory will only be decremented after the order is successfully processed and confirmed through the "Order Success" queue. This ensures that the product inventory stays accurate and prevents potential issues that could arise if the inventory was decremented but the order was not successful.
 
Gfesser: Explain how Microservice3 is able to send a message to Frontend after Microservice1 sends a message to Frontend.
 
ChatGPT: In the original sequence diagram, Microservice3 is shown sending a message directly to the Frontend after it updates the product inventory. However, in practice, it is more likely that Microservice3 would send a message to Microservice1, which would then send a message to the Frontend.

In this modified sequence, Microservice1 is responsible for sending a message to the Frontend to confirm that the order has been successfully processed. After Microservice3 receives an order success message from the "Order Success" queue, it decrements the product inventory and sends a message to Microservice1. Microservice1 then sends a message to the Frontend to confirm the order.

This decoupled architecture, where each microservice has a specific responsibility and communicates with other microservices using messages or events, allows for greater scalability, flexibility, and fault tolerance.

(ChatGPT attempts to compensate for the content of its last sequence diagram by falling back to the expression "in practice", but it should have just explained what is "more likely" at the outset.)

Gfesser: Update your sequence diagram to show this modification.
 
ChatGPT: Here's the updated sequence diagram that reflects the modification to show Microservice3 sending a message to Microservice1, which then sends a message to the Frontend after the product inventory is updated:
 
sequenceDiagram
    participant User
    participant Frontend
    participant Microservice1
    participant FulfillmentQueue
    participant OrderSuccessQueue
    participant Microservice3

    User->>+Frontend: Place order
    Frontend->>+Microservice1: Submit order
    Microservice1->>+FulfillmentQueue: Send order
    FulfillmentQueue-->>-Microservice1: Order received
    Microservice1->>+OrderSuccessQueue: Send order success
    OrderSuccessQueue-->>-Microservice3: Order success
    Microservice3->>Microservice1: Decrement inventory
    Microservice1-->>+Microservice3: Inventory updated
    Microservice1-->>-Frontend: Order confirmed
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-4

Gfesser: Manually edited highlighted markdown by removing trailing "-" of each line due to the following error when submitting to Mermaid: "Error: Trying to inactivate an inactive participant". Edited markdown is technically incorrect UML notation, but conveys same information in spirit.
With this modification, the sequence diagram shows that Microservice3 sends a message to Microservice1 after the inventory is updated, and Microservice1 sends a message to the Frontend to confirm the order. This decoupled architecture allows for greater scalability, flexibility, and fault tolerance, as each microservice has a specific responsibility and communicates with other microservices using messages or events.
 
(The following exchange is separate from the first, asking ChatGPT to tackle data engineering rather than traditional software engineering.)
 
Gfesser: Using markdown format for Mermaid, draw a sequence diagram of an example batch data engineering pipeline that ingests and processes this data for downstream analytical consumption.
 
ChatGPT: Here's a sequence diagram of an example batch data engineering pipeline that ingests and processes data for downstream analytical consumption using Apache Airflow and Apache Spark:
 
sequenceDiagram
    participant Scheduler
    participant AirflowTask
    participant SparkJob
    participant DataWarehouse

    Scheduler->>+AirflowTask: Trigger DAG
    AirflowTask->>+SparkJob: Submit job
    SparkJob->>+DataWarehouse: Read raw data
    SparkJob->>+DataWarehouse: Write processed data
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-5
 
In this sequence diagram, the process starts with a scheduler, which triggers a DAG (Directed Acyclic Graph) in Apache Airflow. An Airflow task then submits a Spark job for processing data. The Spark job reads the raw data from a data source (e.g., a file system or database), processes the data, and writes the processed data to a data warehouse.

This pipeline provides a scalable and maintainable way to process batch data for downstream analytical consumption, by using tools like Apache Airflow and Apache Spark. With Airflow, we can schedule and manage complex data processing workflows, while Spark provides a distributed and scalable way to process large amounts of data. The processed data can then be stored in a data warehouse for downstream analytical consumption.
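(For readers unfamiliar with this tooling, a minimal Airflow DAG that submits a Spark job along the lines of the diagram might look like the following sketch. The DAG id, schedule, and application path are hypothetical, and the SparkSubmitOperator assumes the Apache Spark provider package for Airflow is installed; this illustrates the general pattern rather than anything ChatGPT produced.)

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical daily batch pipeline: the scheduler triggers the DAG, and the
# single task submits a PySpark script that reads, processes, and writes the data.
with DAG(
    dag_id="batch_pipeline_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    process_data = SparkSubmitOperator(
        task_id="process_data",
        application="/opt/jobs/process_data.py",  # hypothetical PySpark job
    )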

(At the outset, ChatGPT already makes a mistake: what data is being read and processed in the sequence diagram?)

Gfesser: Where is the data source in your diagram?

ChatGPT: I apologize for the confusion. In my previous sequence diagram, I didn't explicitly include a data source in the pipeline. Here's an updated sequence diagram that includes a data source:

sequenceDiagram
    participant Scheduler
    participant AirflowTask
    participant SparkJob
    participant DataSource
    participant DataWarehouse

    Scheduler->>+AirflowTask: Trigger DAG
    AirflowTask->>+SparkJob: Submit job
    DataSource->>+SparkJob: Read raw data
    SparkJob->>+DataWarehouse: Write processed data
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-6

In this updated sequence diagram, the data source is represented as a separate participant (i.e., DataSource) in the pipeline. The data source could be any system that contains the raw data that needs to be processed, such as a file system or database. The Spark job reads the raw data from the data source, processes it, and writes the processed data to a data warehouse.

I hope this clarifies any confusion. Please let me know if you have any further questions.

(Not bad, but how is the Spark job able to execute?)

Gfesser: What serves as the compute for the spark job in your diagram?

ChatGPT: Thank you for the question. In my previous diagram, I didn't specify the compute resource used for the Spark job. Typically, Spark jobs require a cluster of compute resources to perform distributed processing of data.

Here's an updated sequence diagram that shows a Spark cluster as the compute resource for the Spark job:

sequenceDiagram
    participant Scheduler
    participant AirflowTask
    participant SparkJob
    participant DataSource
    participant SparkCluster
    participant DataWarehouse

    Scheduler->>+AirflowTask: Trigger DAG
    AirflowTask->>+SparkJob: Submit job
    DataSource->>+SparkCluster: Read raw data
    SparkCluster->>+SparkJob: Process data
    SparkJob->>+DataWarehouse: Write processed data
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-7

In this updated diagram, the Spark job is performed by a Spark cluster, which is a set of interconnected computing nodes that work together to process data in parallel. The Spark cluster reads the raw data from the data source and processes it, and the processed data is written to a data warehouse.

I hope this clarifies any confusion. Please let me know if you have any further questions.

(ChatGPT adds a Spark cluster to the sequence diagram, but if it's the cluster that is reading the data what is the Spark job used for?)

Gfesser: Is it the spark cluster or the spark job in your diagram that processes data from the data source?

ChatGPT: Thank you for pointing that out. In my previous diagram, I incorrectly implied that the Spark cluster processes data from the data source. In reality, it's the Spark job that processes data from the data source using the resources provided by the Spark cluster.

Here's an updated sequence diagram that shows the Spark job as the component that processes data from the data source:

sequenceDiagram
    participantScheduler
    participantAirflowTask
    participantSparkJob
    participantDataSource
    participantSparkCluster
    participantDataWarehouse

    Scheduler->>+AirflowTask: Trigger DAG
    AirflowTask->>+SparkCluster: Request resources
    DataSource->>+SparkJob: Read raw data
    SparkJob->>+SparkCluster: Process data
    SparkCluster->>-SparkJob: Return resources
    SparkJob->>+DataWarehouse: Write processed data

 

Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-8

In this updated diagram, the Spark cluster is responsible for providing the compute resources needed to execute the Spark job. The Spark job reads the raw data from the data source, processes it using the resources provided by the Spark cluster, and writes the processed data to a data warehouse.

I apologize for any confusion my previous diagram may have caused. Please let me know if you have any further questions.

(Again, ChatGPT seemingly understands the issue it previously introduced, but it introduces other issues, the most noticeable of which is its simply flipping what is reading and processing the data.)

Gfesser: Modify your diagram to show that you understand the relationship between the spark cluster and spark job.

ChatGPT: Certainly, here's an updated sequence diagram that shows the interaction between the Spark cluster and Spark job:

sequenceDiagram
    participant Scheduler
    participant AirflowTask
    participant SparkJob
    participant DataSource
    participant SparkCluster
    participant DataWarehouse

    Scheduler->>+AirflowTask: Trigger DAG
    AirflowTask->>+SparkCluster: Request resources
    SparkCluster->>+SparkJob: Start job
    DataSource->>+SparkJob: Read raw data
    SparkJob->>+SparkCluster: Process data
    SparkCluster->>-SparkJob: Return resources
    SparkJob->>+DataWarehouse: Write processed data

Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-9

In this updated diagram, the Spark cluster is responsible for providing the compute resources needed to execute the Spark job. The Spark job is submitted to the Spark cluster for execution, and reads the raw data from the data source. The Spark cluster then provides the necessary resources to process the data and returns the resources to the cluster once the job is complete. Finally, the Spark job writes the processed data to a data warehouse.

I hope this clarifies any confusion. Let me know if you have any further questions.

Gfesser: Explain the difference between reading and processing data in your diagram.

ChatGPT: Certainly. In the context of the batch data engineering pipeline diagram, reading data refers to the process of extracting raw data from a data source, such as a file or database, and loading it into memory for further processing.

On the other hand, processing data involves transforming the raw data into a format that is suitable for downstream analytical consumption. In the diagram, this process is performed by the Spark job using the resources provided by the Spark cluster. The specific data processing tasks performed by the Spark job will depend on the requirements of the downstream analytics use case, but could include tasks such as filtering, aggregating, or joining data.

In summary, reading data is the initial step in the pipeline that retrieves raw data from the data source, while processing data is the subsequent step that transforms and prepares the data for downstream analytical consumption.
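(ChatGPT's distinction between reading and processing is reasonable, and it maps directly onto PySpark code. A minimal sketch follows; the paths and column names are hypothetical.)

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch_pipeline_sketch").getOrCreate()

# Reading: extract the raw data from the data source into a DataFrame.
raw = spark.read.csv("/data/raw/orders.csv", header=True, inferSchema=True)

# Processing: filter, aggregate, and reshape the data for downstream consumption.
daily_totals = (
    raw.filter(F.col("status") == "complete")
       .groupBy("order_date")
       .agg(F.sum("amount").alias("total_amount"))
)

# Writing: land the processed data where the warehouse can consume it.
daily_totals.write.mode("overwrite").parquet("/data/processed/daily_totals")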

Gfesser: Modify your diagram to show that it is the spark job that both reads and processes the data.
 
ChatGPT: Certainly, here's an updated sequence diagram that reflects the fact that the Spark job performs both the data reading and processing:

sequenceDiagram
    participant Scheduler
    participant AirflowTask
    participant SparkJob
    participant DataSource
    participant SparkCluster
    participant DataWarehouse

    Scheduler->>+AirflowTask: Trigger DAG
    AirflowTask->>+SparkCluster: Request resources
    SparkCluster->>+SparkJob: Start job
    SparkJob->>+DataSource: Read raw data
    SparkJob->>+SparkCluster: Process data
    SparkCluster->>-SparkJob: Return resources
    SparkJob->>+DataWarehouse: Write processed data

Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-10

In this updated diagram, the Spark job is responsible for both reading and processing the data. The job is submitted to the Spark cluster for execution, and reads the raw data from the data source using resources provided by the Spark cluster. The Spark cluster then provides the necessary resources to process the data and returns the resources to the cluster once the job is complete. Finally, the Spark job writes the processed data to a data warehouse.

I hope this modification helps clarify the data reading and processing steps in the pipeline. Let me know if you have any further questions.

(After a few more exchanges, ChatGPT makes some slight improvements.)

Gfesser: Why did you change the direction of the message that represents reading the data?

ChatGPT: You are correct, in the original diagram, the direction of the message between the Spark job and the data source was incorrect. The message should have been sent from the Spark job to the data source to represent the act of reading data.

In the updated diagram, the direction of the message has been corrected to reflect that the Spark job is responsible for reading the raw data from the data source. The message arrow now points from the Spark job to the data source, indicating that the Spark job is the one initiating the communication.

Thank you for bringing this to my attention.

Gfesser: In your diagram, where is the data stored while processing it after reading it?

ChatGPT: In the diagram, the data is stored in memory and on disk within the Spark cluster while it is being processed. Once the Spark job completes processing, it writes the results to a data warehouse, which is represented by the DataWarehouse participant in the diagram.

While the data is being processed by the Spark job, it is distributed across the nodes in the Spark cluster and is stored in the cluster's memory and on disk. Spark divides the data into partitions, which are processed in parallel across the nodes in the cluster. The processing results are then combined by the Spark job and written to the data warehouse.

I hope this clarifies where the data is stored while it is being processed. Let me know if you have any other questions.

Gfesser: How does the spark job in your diagram know how to structure the processed data for the data warehouse?

ChatGPT: The structure of the processed data is typically defined by the schema of the data warehouse table where the results will be stored. The schema specifies the columns and data types of the table, which defines the structure of the data that will be written to it.

In the example diagram, the Spark job is assumed to have prior knowledge of the schema of the DataWarehouse table, which it uses to structure the processed data before writing it to the data warehouse.

Alternatively, the Spark job could dynamically infer the schema of the processed data and then use that schema to write the data to the data warehouse. Spark provides functions for automatically inferring the schema of data based on its content, such as inferSchema() in PySpark.

Overall, the schema of the data warehouse table is an important consideration for data processing pipelines, as it defines the structure of the data that will be stored and analyzed downstream.
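(One nitpick before moving on: in PySpark, inferSchema is an option on the DataFrame reader rather than a standalone inferSchema() function, although the underlying idea is sound. A minimal sketch, with a hypothetical path:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Schema inference is requested as a reader option, not called as a function.
inferred = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("/data/raw/orders.csv")
)
inferred.printSchema()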

Gfesser: Explain how the Spark job could have prior knowledge of the target table schema.
 
ChatGPT: There are a few different ways that the Spark job could have prior knowledge of the target table schema:

  1. Manual schema definition: The schema for the target table could be manually defined and included as part of the Spark job code or configuration. This would typically involve defining the column names, data types, and other schema properties in a schema definition file or configuration file that is used by the Spark job.

  2. Schema discovery: The schema for the target table could be discovered dynamically by querying the data warehouse system to retrieve the schema metadata. For example, the Spark job could query a system like Apache Hive or Apache HBase to retrieve the schema of the target table.

  3. Collaborative schema design: The schema for the target table could be designed collaboratively between the data engineering and data analysis teams. The data engineering team could provide guidance on the schema properties required for efficient data processing and storage, while the data analysis team could provide input on the data types and structures required for their analytical use cases. This collaborative process could result in a shared understanding of the schema that is used by both teams to build the data pipeline.

Overall, having prior knowledge of the target table schema is an important consideration for building efficient and effective data processing pipelines. It enables the Spark job to structure the processed data correctly for downstream analysis and can help avoid schema conflicts and other errors in the data pipeline.
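(The first two approaches in this list translate directly to PySpark. The sketch below shows a manually defined schema alongside schema discovery from a metastore-backed catalog; the database, table, and column names are hypothetical, and the second approach assumes Hive support is configured.)

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Approach 1: a manually defined schema shipped with the job code.
target_schema = StructType([
    StructField("order_date", DateType(), nullable=False),
    StructField("region", StringType(), nullable=True),
    StructField("total_amount", DoubleType(), nullable=True),
])

# Approach 2: discover the schema of the target table from the catalog itself.
discovered_schema = spark.table("analytics.orders_daily").schema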

Gfesser: Explain how the spark job could have prior knowledge of which data warehouse table to target.
 
ChatGPT: Similar to how the Spark job can have prior knowledge of the target table schema, it can also have prior knowledge of which data warehouse table to target. Here are a few approaches:

  1. Configuration file: The Spark job could have a configuration file that specifies the target table name or the target database and table name. This file can be passed as a parameter to the job when submitting it, and the job can then read the configuration to determine the target table.

  2. Job parameters: The Spark job can be submitted with parameters that specify the target table or database and table name. These parameters can be passed to the job through the command line or through a job scheduler that supports parameterized jobs.

  3. Metadata store: The Spark job can query a metadata store, such as a data catalog, to retrieve information about the target table. This metadata store can contain information about the table name, location, schema, and other properties that the Spark job needs to know.

  4. Collaborative approach: The data engineering and data analysis teams can collaborate to determine the target table for the Spark job. The data engineering team can provide guidance on the database and table structure that is most efficient for data processing, while the data analysis team can provide input on the table structure that is most useful for their analytical needs. This collaborative process can result in a shared understanding of the target table that is used by both teams to build the data pipeline.

Having prior knowledge of the target table can help ensure that the Spark job outputs data to the correct location in the data warehouse and avoids data loss or overwriting. It also helps to ensure that the processed data is available for downstream analysis.
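(Translating the second approach above into code, a Spark job can receive its target table as a submit-time parameter. The argument names, paths, and table names below are hypothetical, and the processing step is omitted for brevity.)

import argparse

from pyspark.sql import SparkSession

# Hypothetical invocation:
#   spark-submit process_data.py --source-path /data/raw/orders --target-table analytics.orders_daily
parser = argparse.ArgumentParser()
parser.add_argument("--source-path", required=True)
parser.add_argument("--target-table", required=True)
args = parser.parse_args()

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

processed = spark.read.parquet(args.source_path)  # processing omitted for brevity
processed.write.mode("append").saveAsTable(args.target_table)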

(Here, ChatGPT mentions a metadata store for the first time, and I immediately wonder whether it will misrepresent what a Hive metastore is all about, as I've personally interviewed a fair number of individuals who do not understand basic data engineering.)

Gfesser: Modify your diagram to show use of a Hive metastore.
 
ChatGPT: Here is the modified diagram that includes the use of a Hive Metastore:
 
sequenceDiagram
    participant DataIngestion
    participant SparkCluster
    participant DataWarehouse
    participant HiveMetastore
 
    DataIngestion->>+SparkCluster: Send data
    SparkCluster->>-SparkJob: Start Spark Job
    SparkJob->>+SparkCluster: Read data
    SparkJob->>+SparkCluster: Process data
    SparkJob->>+HiveMetastore: Get metadata
    HiveMetastore-->>-SparkJob: Metadata
    SparkJob->>+DataWarehouse: Write data
    DataWarehouse-->>-SparkJob: Acknowledge write
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-11
 
In this modified diagram, the Spark job reads metadata from the Hive Metastore before writing data to the Data Warehouse. The Hive Metastore serves as a central repository of metadata about data stored in the Data Warehouse, including schema information. The Spark job can use this metadata to ensure that the processed data is structured in a way that is compatible with the target Data Warehouse table.
 
(In fixing the sequence diagram to depict use of a Hive metastore, ChatGPT decides to throw out other portions of its prior diagram.)
 
Gfesser: Modify your diagram to show a scheduler, airflow task, and data source like your prior diagram.
 
ChatGPT: Here is the modified diagram that includes a scheduler, an Airflow task, and a data source:
 
sequenceDiagram
    participant DataIngestion
    participant Scheduler
    participant AirflowTask
    participant SparkCluster
    participant DataWarehouse
    participant HiveMetastore
    participant DataSource
 
    Scheduler->>+AirflowTask: Schedule task
    AirflowTask->>+DataIngestion: Trigger ingestion
    DataIngestion->>+SparkCluster: Send data
    SparkCluster->>-SparkJob: Start Spark Job
    SparkJob->>+SparkCluster: Read data
    SparkJob->>+SparkCluster: Process data
    SparkJob->>+HiveMetastore: Get metadata
    HiveMetastore-->>-SparkJob: Metadata
    SparkJob->>+DataWarehouse: Write data
    DataWarehouse-->>-SparkJob: Acknowledge write
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-12
 
In this modified diagram, the Data Ingestion process is triggered by an Airflow task scheduled by a scheduler. The data is then processed by a Spark job as before, with metadata about the Data Warehouse table obtained from the Hive Metastore. The data source is represented as a separate participant in the sequence diagram, indicating that it is distinct from the Data Ingestion process.
 
(ChatGPT is starting to look like a disaster, with many issues here, including the fact that the data ingestion it depicts is detached from the data source, which isn't even referenced in the sequence diagram.)
 
Gfesser: You made the same mistake as you did earlier. Where is the data coming from if the data source is not being read?
 
ChatGPT: I apologize for the confusion in my previous response. You are correct, the data source needs to be read in order to process it, so the data source being read should be included in the sequence diagram.

Here is an updated sequence diagram that includes the data source being read as well as a scheduler, Airflow task, and a Hive Metastore:

sequenceDiagram
    participant Scheduler
    participant AirflowTask
    participant DataPipelineJob
    participant SparkCluster
    participant HiveMetastore
    participant DataWarehouse

    Scheduler->>AirflowTask: Trigger job
    AirflowTask->>DataPipelineJob: Start job
    DataPipelineJob->>DataPipelineJob: Read data from data source
    DataPipelineJob->>SparkCluster: Process data
    SparkCluster->>DataPipelineJob: Processed data
    DataPipelineJob->>HiveMetastore: Write processed data
    HiveMetastore->>DataWarehouse: Persist processed data
 
In this updated diagram, the scheduler triggers the Airflow task to start the data pipeline job. The data pipeline job reads data from the data source, processes the data on the Spark cluster, and writes the processed data to the Hive Metastore. Finally, the Hive Metastore persists the processed data to the data warehouse.
 
Chatgpt-mermaid-sequence-diagram-2-erikonsoftware-13
 
(ChatGPT continues to dig itself into a hole with its reasoning. A data source is implied, and the Spark job is now generically referred to as a "data pipeline job", which is fine, but what's this about data being written to the Hive metastore?)
 
Gfesser: How is the processed data being written to the Hive metastore?
 
ChatGPT: In the modified diagram, the processed data is being written to the Hive Metastore using Spark SQL. After the Spark job has processed the data, it writes the results to a temporary table in the Hive Metastore using Spark SQL's INSERT INTO statement. The temporary table is then merged into the target table using another SQL statement. This approach allows Spark to write data to the Hive Metastore in parallel, which can significantly improve performance.
 
(No – this makes absolutely no sense: a Hive metastore consists solely of metadata, thus the name.)
 
Gfesser: Is the data pipeline spark job actually writing data to the hive metastore?
 
ChatGPT: Yes, the data pipeline Spark job is writing the processed data to the Hive Metastore. After the Spark job has finished processing the data, it writes the results to a temporary table in the Hive Metastore using Spark SQL's INSERT INTO statement. The temporary table is then merged into the target table using another SQL statement. This approach allows Spark to write data to the Hive Metastore in parallel, which can significantly improve performance.
 
Gfesser: But is the storage for the processed data the hive metastore or somewhere else?
 
ChatGPT: The processed data is temporarily stored in the memory of the Spark cluster during the execution of the Spark job. Once the job has completed, the processed data is written to the Hive Metastore. Therefore, the Hive Metastore is the final storage location for the processed data in this pipeline. However, the Hive Metastore itself does not store the data. Instead, it stores metadata about the location, schema, and format of the data, while the actual data is stored in a distributed file system such as Hadoop Distributed File System (HDFS) or Amazon S3.
 
(After a couple of exchanges, ChatGPT eventually states correctly that the Hive metastore itself doesn't actually store the processed data, but it *still* leads with the same misunderstanding. At this point, I stopped my dialogue with ChatGPT.)
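(To make the distinction concrete: when a Spark job with Hive support writes a managed table, the metastore records only the table's metadata, while the data files themselves land under the warehouse location on a file system such as HDFS or S3. A minimal sketch, with hypothetical database and table names:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

processed = spark.read.parquet("/data/processed/daily_totals")

# The metastore records the table name, schema, format, and location; the rows
# themselves are written as files under the warehouse directory
# (spark.sql.warehouse.dir), typically on HDFS or S3.
processed.write.mode("overwrite").saveAsTable("analytics.orders_daily")

# Querying the metastore returns metadata about tables, not the data itself.
for table in spark.catalog.listTables("analytics"):
    print(table.name, table.tableType)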
 

Midjourney 2023-02-02 Anime painting of a horse wearing a sweater sitting in a corner office and working on a laptop overlooking the Chicago River 2

Gfesser via Midjourney: "Anime painting of horse wearing a sweater and working on a laptop, sitting in a corner office overlooking the Chicago River."

(11) Gfesser: Some developers describe themselves as "full-stack" developers. What does "full-stack" mean to you in the context of being a data scientist?
 
ChatGPT: "Full-stack" in the context of a data scientist typically means that the individual has experience and expertise across the entire data science stack, from data collection and cleaning, to model building and deployment. This includes the following areas:
  1. Data Collection: Full-stack data scientists are able to collect data from a variety of sources, including APIs, web scraping, and databases. They are also able to clean and prepare data for analysis.

  2. Data Analysis: They have experience using statistical and machine learning techniques to analyze data, as well as being able to use visualization libraries such as matplotlib and seaborn to create visual representations of data.

  3. Model Building: They can build and train machine learning models, and have the knowledge to choose the appropriate model for a given problem. They also have experience with deep learning frameworks like TensorFlow and Keras.

  4. Deployment: They can deploy models in production environments, including deploying models as web services, and have knowledge of distributed systems, cloud computing and containerization technologies such as Docker.

  5. Software Development: They have experience with programming languages such as Python, R, and SQL and are familiar with software development principles, such as version control and testing,

Being a full-stack data scientist is important because it allows you to tackle problems end-to-end, from data collection to deployment, and to understand the entire data science pipeline.


Grade: C

ChatGPT confuses "full-stack" with the activities performed on a given project, the scope of question #6. And interestingly, the activities listed in its answer to question #6, while similar in spirit, are different. As I explained in another piece, "full-stack" means all layers of a given solution. In the case of an application, the subject of my earlier piece, this means not only the front-end and back-end, but all the wiring that enables these layers to work together, the scripts to configure and deploy, as well as the associated data modeling used by the persistence layer. While I personally have significant experience building all of these layers, typically when building a new product, I also explained that developers often work in "clusters" of expertise, for specific reasons such as skill set, desire, or perceived company value. And of course, as complexity grows for a given stack, the ability to effectively program across all of these layers becomes more challenging.


(12) Gfesser: Tell me about your experiences with user interface design. What is the difference between UI and UX?
 
ChatGPT: As a language model, I do not have any personal experiences with user interface design, but I can provide information on the subject.

User Interface (UI) design refers to the visual and interactive elements of a software application or website. It includes things like buttons, text fields, drop-down menus, and other elements that users interact with directly. UI designers are responsible for creating the look and feel of the application, and making sure that the interface is visually pleasing and easy to use.

User Experience (UX) design, on the other hand, is a broader field that encompasses the entire experience of a user with a product or service. This includes things like how easy it is to navigate the application, how quickly users can find the information they need, and how well the application meets their needs. UX designers conduct user research, testing, and analysis to improve the overall satisfaction and effectiveness of the product.

In short, UI is the visual and interactive design of the application, while UX is the overall experience of using the application. Both are important for creating a successful and user-friendly product, but they have distinct focuses and require different skillsets.


Grade: A

 
As with question #2 in the original KDNuggets piece, this is a giveaway question, but the original question #12 wasn't, and as such, this is one of the questions I would replace were I to update my list of software engineering questions. Generally speaking, "experimental design" relates to organizing an experiment in a manner that provides the proper data to answer questions. While traditional software engineering might not typically assess the effects of multiple inputs, or factors, on responses or measures of performance, experimental design would ideally work very similarly, in contrast to trial-and-error, one-factor-at-a-time approaches which lead to increased testing effort.

(13) Gfesser: Many software developers do not value data because they take its availability as a given, think of it as "the easy part", do not understand it, or do not wish to work on it. Tell me how you view data within the context of an organization, from the perspective of both the software engineer and statistician. Who do you expect to do the bulk of the data work prior to analyses?
 
ChatGPT: Data is a critical component of any organization, and it is important for both software engineers and statisticians to understand and value its role in decision making.

From the perspective of a software engineer, data is the foundation upon which the software is built. Without accurate and relevant data, the software will not be able to deliver meaningful insights or solve problems. Software engineers should be aware of the importance of data quality and should be able to work with data scientists to ensure that the data is accurate and relevant.

From the perspective of a statistician, data is the raw material that is used to generate insights and inform decision making. Statisticians use statistical techniques to analyze data and extract insights, which are then used to inform business decisions. They also help to ensure data quality and work with software engineers to design and build systems that can handle large amounts of data.

In terms of who does the bulk of data work prior to analyses, it depends on the specific organization and the role of the team members. In some organizations, data scientists and statisticians may be responsible for the bulk of data work, while in others, software engineers may be more involved in data collection and cleaning. It is important for all team members to understand the role of data and to work together to ensure that the data is accurate and relevant.


Grade: A

Good answer. 


Data and the software engineer (7 years later)
In the aforementioned follow-up piece on "full-stack" development, I elaborated on the four points I made about software engineers and data in question #13. Do these points still hold true? Some readers might question whether they were *ever* true, but my original points weren't an attempt to address universal truth; they were an attempt to capture what I was seeing as the prevailing zeitgeist from the perspective of a consultant on the ground working with a variety of clients (the type I mentioned in the above "straw man argument" sidebar).
Gfesser (2016): The data is already available. For relatively less involved application development projects, data is often already available. An example of such a project is a pure portal project in which an application is built to access this already available data (already available by other means, perhaps by only another set of users). Because the data is already available (at least most of the data is already available, as a small amount of data may need to be added for areas such as security), expertise in implementing the data store is not needed, just interpretation of some subset of the data for user consumption.
Gfesser (2016): The data is "the easy part". Unlike the first point above where data is already available, this second point is subjective. The terms "easy" and "hard" are always subjective, regardless of source, because these are relative terms. In my experience, it is often the case that if data is not already available and a developer determines that it is the easy part, there is usually some other underlying reason for their confidence that falls under one of the two points below. Data is the lifeblood of most organizations, and decisions made with regard to data typically have longer-term effects than decisions made in the programmatic code of higher layers, because application code is typically easier to rewrite relative to restructuring data.
Gfesser (2016): The data is not understood. This third point is very common. Developers in the first cluster of expertise explained earlier in this blog post typically fall into this category, depending on lower layers to interpret data for them. This reason for not valuing data is distinct from the other 3 because no judgement is being placed on whether the data is available, whether working with the data is difficult, or whether desire to work with the data exists. The data is simply misunderstood, and so the developer seeks to distance themselves from it. On one project earlier in my career, I recall mentioning to a developer that it is common that some developers do not understand data and database concepts. Their response was simply that data specialists do not understand object oriented concepts in the programmatic code. While not a good defense, this way of looking at things speaks to some of the data resistance that lives in some organizations.
Gfesser (2016): A desire to work on the data does not exist. The verbiage of this fourth point is a diplomatic way of saying that a particular individual just doesn't want anything to do with the data. On a number of occasions, for example, I've unfortunately heard developers quip "I don't care about that" in regard to a number of areas, but the most common area is the data. In some sense, we've come full circle in this essay. In my introductory comments, I described use of my fingers, hands, and arms to talk about the place of a given technology within a given project in the enterprise. When it comes to data, sometimes zero digits is what expresses developer desire to work with data.
Gfesser (2023): While a divide between data-centric and application-centric organizations still exists, as firms have increasingly grown to be data-centric these four points have become less and less accurate. It's much more common for software engineers to understand data and database concepts. Data specific to a particular domain will continue to pose some barriers to entry with respect to understanding, but the difference now is that software engineers seem more interested in gaining this understanding rather than avoiding it. It's much more common for software engineers to be interested in working with data, and I think this is due in large part to the tooling that has increasingly been made available over the last several years, which has helped facilitate the convergence of what tended to be separate worlds in many organizations. It would be interesting to revisit the specific developers I mentioned to determine whether they still hold the same attitude, but my guess is that they're likely more open to working with data, even if solely due to marketability.

(14) Gfesser: Provide an example of a time when you made a technology choice for an application architecture you designed, which you determined was the wrong decision during implementation. How did you rectify this decision?
 
(Interestingly, whereas all other questions took a few seconds for ChatGPT to answer, it took about a minute for it to answer this question.)

ChatGPT: As a language model, I do not have personal experiences to provide examples, but I can provide a hypothetical scenario.

Let's say that a team was building a web application and they initially chose to use a monolithic architecture. However, during implementation, they realized that the monolithic architecture was not suitable for their needs because it made it difficult to scale and update the application.

To rectify this decision, the team could have chosen to migrate to a microservices architecture. This would involve breaking the monolithic application into smaller, independent services that can be deployed and scaled independently. This approach would have allowed the team to more easily update and scale the application, and would have been more suitable for their needs.

In general, rectifying the wrong decision in technology choices could be done by conducting a thorough evaluation of the current architecture and identifying the areas where it is falling short. Then, comparing it with other architectures that could fit better and choosing the one that addresses the shortcoming with minimal changes to the existing architecture. Another approach could be to pilot the new approach alongside the existing one and evaluate the results before committing to a full migration.


Grade: B

 
Decent answer, although monolithic and microservices architectures are mentioned to the neglect of componentized monoliths, not to mention hybrid architectures that make use of a combination of these options. And piloting multiple approaches alongside one another can be seen as a form of "canary testing".


(15) Gfesser: In thinking about the code base for a software application, what do you view as the equivalent of "chartjunk"? How have you prevented it in your code? (Note: the original data scientist question #15 incorrectly referred to "chart junk".)
 
ChatGPT: In the context of software development, "chartjunk" is equivalent to "code smell" or "technical debt". It refers to the parts of the codebase that are difficult to understand, maintain, or modify. This could include things like duplicated code, complex and deeply nested conditional statements, and poorly named variables or functions.

To prevent "code smell" or "chartjunk" in my code, I follow some best practices such as:

  • Keeping the codebase clean and well-organized. This includes using consistent indentation, naming conventions, and commenting conventions throughout the codebase.

  • Using appropriate design patterns and principles like SOLID, DRY and KISS to keep the codebase modular, extensible and easy to understand.

  • Regularly refactoring the codebase to remove duplicated code, simplify complex logic, and improve the overall structure of the code.

  • Writing automated tests to ensure that the codebase is functional and easy to maintain over time.

  • Reviewing code with other developers to catch potential issues and improve the overall quality of the codebase.

By following these best practices, I can help keep the codebase clean and maintainable, and prevent the buildup of "chartjunk" or "code smell" that can make the codebase difficult to understand or modify over time.


Grade: B

 
Again a decent answer, although the fourth and last sentences in ChatGPT's response display a lack of understanding, because the term "chartjunk" should only be used when referencing visualizations, not code. Chartjunk refers to those visual elements in charts and graphs which are either unnecessary or distracting with respect to understanding the presented information. And the last sentence refers to "buildup" in the context of chartjunk, something that isn't typically associated with visualizations, although the code that generates a given visualization might be prone to code smell.


Midjourney 2023-02-02 Anime painting of a horse wearing a sweater sitting in a corner office and working on a laptop overlooking the Chicago River 3

Gfesser via Midjourney: "Anime painting of horse wearing a sweater and working on a laptop, sitting in a corner office overlooking the Chicago River."

(16) Gfesser: Are there any areas of software engineering in which you feel you are especially adept? How do you balance these areas with other tasks that are needed, especially those which you do not particularly enjoy?
 
ChatGPT: As a large language model, I am adept in a wide range of software engineering topics, such as algorithms, data structures, programming languages, and software development methodologies. However, I am particularly adept in the areas of natural language processing and machine learning, as these are areas that I have been trained on extensively.

In terms of balancing these areas with other tasks that are needed, I rely on my ability to quickly learn and adapt to new technologies and programming languages. I also try to stay organized and prioritize tasks based on their importance and urgency.

For tasks that I do not particularly enjoy, I try to focus on the bigger picture and the end goal of the project. This helps to keep me motivated and focused on the task at hand. Additionally, I try to take breaks and step away from the task if I am feeling burnt out or unmotivated. Sometimes, a fresh perspective can help me to approach the task in a new way or find a solution that I had not considered previously.

Furthermore, I try to collaborate and communicate effectively with my team members and stakeholders. This helps me to understand the context of the task and the reasons why it is important. It also allows me to share my expertise and help others with tasks that they may not be as comfortable with.


Grade: A

 
Another decent answer. My original intention with this question was to allude back to software engineering question #13 and the subsequent "data and the software engineer" sidebar. However, as discussed in the first sidebar above following software engineering question #1 ("ChatGPT's straw man argument"), ChatGPT again inappropriately applies a human quality (in this case, "enjoyment") to itself here.


(17) Gfesser: Let's go back to the whiteboard and talk about the data persistence layer in the application architecture you drew. What were your options? Now walk me through some portion of the data model with which you are intimately familiar, and discuss the change that it might need to undergo over time. To what degree does the chosen database model support change?
 
ChatGPT: As a language model, I am not able to go to a whiteboard and draw the data persistence layer of a specific application architecture. However, I can provide a general overview of data persistence layer options and how they support change in the data model.

When it comes to data persistence, there are several options to choose from such as relational databases, NoSQL databases, in-memory databases, and file-based storage. Each option has its own set of advantages and disadvantages and the choice of database depends on the specific requirements of the application and the data being stored.

For example, let's say that I am intimately familiar with a relational database that is being used to store customer information for an e-commerce application. The data model includes tables for customers, orders, and products. Over time, the business requirements may change and the data model may need to evolve to accommodate new features or data points. For example, a new requirement may be added to store customer loyalty information, or the application may need to store information about customer preferences.

Relational databases are designed to handle such changes, and they provide several mechanisms to evolve the data model without disrupting the existing data. One option is to add new columns to existing tables, another option is to create new tables and establish relationships between them, and yet another option is to use views to combine data from multiple tables.

However, in some cases, the changes required may be too extensive and may require a complete redesign of the data model. In such cases, the relational database should be able to handle the change with minimal data loss.

The chosen database model should also have the ability to handle the changing data volume and the increasing number of concurrent connections. This can be achieved by using database sharding, horizontal partitioning, and load balancing.

To sum up, the chosen database model should support change and evolution of the data model to accommodate the changing business requirements. It should also have the ability to handle the increasing data volume and number of concurrent connections.
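(Grading aside for a moment, the schema evolution mechanisms ChatGPT lists are real enough and can be demonstrated with any relational database. The sketch below uses SQLite purely because it requires no setup; the table and column names are hypothetical.)

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original data model (simplified): customers only.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# 1. Add a new column to an existing table.
cur.execute("ALTER TABLE customers ADD COLUMN loyalty_tier TEXT")

# 2. Create a new table and relate it to an existing one.
cur.execute("""
    CREATE TABLE customer_preferences (
        customer_id INTEGER REFERENCES customers(id),
        preference  TEXT
    )
""")

# 3. Use a view to combine data from multiple tables.
cur.execute("""
    CREATE VIEW customer_profile AS
    SELECT c.id, c.name, c.loyalty_tier, p.preference
    FROM customers c
    LEFT JOIN customer_preferences p ON p.customer_id = c.id
""")

conn.commit()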


Grade: D

 
While there's no point in spending time going in circles again after seeing ChatGPT's responses to the software engineering question #10 follow-up ("let's try that again: ChatGPT can actually output markup"), ChatGPT makes a fundamental error here by equating a given data persistence layer with the underlying database itself. Additionally, ChatGPT later states that "minimal data loss" should be incurred if a "complete redesign of the data model" were performed, when instead it should have made it clear that *zero* data loss should be incurred (or at least be specific about irrelevant portions of the original data that might no longer be needed, such as foreign keys, etc.). ChatGPT also mentions "database model" for the first time in its second-to-last paragraph, but acts as though it has already made use of this term through its use of the phrase "should also". Some readers might think I'm being too much of a stickler, but I would disagree, given ChatGPT's confusion over the term "persistence layer".

(18) Gfesser: Risks are inherent in the software development process. Provide some examples of how you mitigated risk on a recent project.
 
ChatGPT: Risks can come in many forms in software development, such as project delays, cost overruns, or technical challenges. Some strategies I've used to mitigate risk on recent projects include:
  • Using agile development methodologies, which allow for more frequent check-ins and adjustments to the project plan as needed. This can help identify and address potential issues early on, before they become major problems.
  • Conducting thorough requirements gathering and design phases, to ensure that the project scope is well-defined and all stakeholders are on the same page about what is expected.
  • Building in testing and quality assurance processes throughout the development cycle, to catch and fix bugs early on and ensure that the final product meets the necessary specifications.
  • Using version control systems and testing in multiple environments to ensure that the code is stable and that any bugs or issues that arise can be tracked and fixed quickly.
  • Keeping a close eye on the project budget, and making sure that all costs are tracked and accounted for, to minimize the risk of cost overruns.

In a recent project, I was working on a large-scale web application. To mitigate risk, I divided the project into smaller modules and assigned different teams to work on each module. This helped us to identify and resolve issues early on in the development process and also helped us to ensure that the development process was on track. Additionally, we made sure that testing and QA were integrated into the development process from the start and we used version control systems to keep track of changes in the codebase.


Grade: C

 
As with software engineering question #16, ChatGPT again refers to itself as though it were human in its team assignments. Now, arguably, these assignments could be automated. But if ChatGPT understood its statements, it would also realize that not only can the "testing and QA" it mentions in its closing sentence be automated, these *should* be automated! ChatGPT also makes other debatable statements, such as its assertion that agile development makes use of "project plans", and that "requirements gathering and design phases" should be used to ensure well-defined scope.


(19) Gfesser: Which programming languages do you currently use in your daily work? Why? How has your use of languages changed over the course of your career?
 
ChatGPT: As a language model AI, I can work with many different programming languages and have knowledge of many more. I can use whatever language my client wants to use. My use of languages changes depending on the requirements of the projects I work on. For example, if a client wants to build a web application, I would recommend using languages such as JavaScript, Python, Ruby, or Java. If a client wants to build a machine learning model, I would recommend using languages such as Python or R. I am capable of working with many programming languages and selecting the right one for the job is a crucial part of my work as a data scientist and software engineer.
 

Grade: C

 
As with my commentary on ChatGPT's response to question #17, some readers might think I'm being too much of a stickler, but its initial response of "I can use whatever language my client wants to use" quickly changes to, in effect, "I can use whatever the dependencies of my projects require" and "I can make recommendations on what to use based on the scenario", which are arguably three different responses. If I were interviewing a human candidate for a software engineering role, I would spend time exploring what they meant by this mixed response. Additionally, since I haven't technically mentioned the concept of an architect role yet, I would spend time exploring whether the individual typically deferred to their clients to determine technology selection or whether they have actually been an active participant in driving the selection process, regardless of any negative connotations the candidate might think exist for an architect role.


Open source wins.
My reasoning for valuing open source so highly was likely best articulated during my 2020 interview with the CEO of Unite.AI, which took place roughly midway between my original counterpoint article and the one you are reading right now; he had commented that he "learned a lot about open source" from me:
Unite.AI: You’re a huge open source advocate, could you discuss why open source is so important?

Gfesser: One aspect about open source that I've needed to explain to executives over the years is that the primary benefit of open source is not that use of such software is made available without monetary cost, but that the source code is made freely available.

Additionally, developers making use of this source code can modify it for their own use, and if suggested changes are approved, make these changes available to other developers using it. In fact, the movement behind open source software started due to developers waiting at length for commercial firms to make changes to products they licensed, so developers took it upon themselves to write software with the same functionality, opening it up to be improved upon by other developers.

Commercialized open source takes advantage of these benefits, the reality being that many modern products make use of open source under the covers, even whilst commercial variants of such software typically provide additional components not available as part of a given open source release, providing differentiators as well as support if this is needed.

My first experiences with open source took place while building the healthcare product I mentioned earlier, making use of tooling such as Apache Ant, used to build software, and an early DevOps product at the time called Hudson (the code base of which later became Jenkins). The primary reason behind our decisions to use these open source products was that these either provided better solutions than commercial alternatives, or were innovative solutions not even offered by commercial entities, not to mention that the commercial licensing of some of the products we had been using was overly restrictive, leading to excessive red tape when it came time to acquire more licenses, due to the costs involved.

Over time, I've seen open source offerings continue to evolve, providing much needed innovation. For example, many of the issues with which my colleagues and I wrestled building this healthcare product were later solved by an innovative open source Java product we started using called Spring Framework, which is still going strong after more than a decade, the ecosystem of which now stretches far beyond some of the innovations that it initially provided, now seen as commonplace, such as dependency injection.

Unite.AI: While reviewing the book "Machine Learning at Enterprise Scale" you stated that "use of open source products, frameworks, and languages alongside an agile architecture composed of a mix of open source and commercial components provides the nimbleness that many firms need but don't immediately realize at the outset". Could you go into some details as to why you believe that firms which use open source are more nimble?

Gfesser: Many commercial data products use key open source components under the covers, and enable developers to use popular programming languages such as Python. The firms which build these products know that the open source components they've chosen to incorporate give them a jump start when these are already widely used by the community.

Open source components with strong communities are easier to sell, due to the familiarity that these bring to the table. Commercially available products which consist mainly of closed source, or even open source that is largely only used by specific commercial products, often require either training by these vendors, or licenses in order to make use of the software.

Additionally, documentation for such components is largely not made publicly available, forcing the continued dependency of developers on these firms. When widely accepted open source components such as Apache Spark are the central focus, as with products such as Databricks Unified Analytics Platform, many of these items are already made available in the community, minimizing the portions on which development teams need to depend on commercial entities to do their work.

Additionally, because components such as Apache Spark are broadly accepted as de facto industry standard tooling, code can also be more easily migrated across commercial implementations of such products. Firms will always be inclined to incorporate what they view as competitive differentiators, but many developers don't want to use products that are completely novel because this proves challenging to move between firms, and tends to cut their ties with the strong communities they have come to expect.

From personal experience, I've worked with such products in the past, and it can be challenging to get competent support. And this is ironic, given that such firms sell their products with the customer expectation that support will be provided in a timely manner. I've had the experience submitting a pull request to an open source project, with the fix incorporated into the build that same day, but cannot say the same about any commercial project with which I have worked.

Unite.AI: Something else that you believe about open source is that it leads to "access to strong developer communities." How large are some of these communities and what makes them so effective?

Gfesser: Developer communities around a given open source product can reach into the hundreds of thousands. Adoption rates don't necessarily point to community strength, but are a good indicator that this is the case due to their tendency to produce virtuous cycles. I consider communities to be strong when these produce healthy discussion and effective documentation, and where active development is taking place.

When an architect or senior developer works through the process to choose which such products to incorporate into what they are building, many factors typically come into play, not only about the product itself and what the community looks like, but about the development teams who will be adopting these, whether these are a good fit for the ecosystem being developed, what the roadmap looks like, and in some cases whether commercial support can be found in the case this may be needed. However, many of these aspects fall by the wayside in the absence of strong developer communities.


(20) Gfesser: Tell me about the development tools you use on a regular basis. Do you generally prefer commercial or open source tooling? Why?
 
ChatGPT: As a language model AI, the development tools I use vary depending on the project and the client's preferences. Generally, I use a combination of commercial and open-source tools. Some examples of development tools that I commonly use include:
  • Integrated development environments (IDEs) such as Visual Studio Code, Eclipse, PyCharm, and IntelliJ IDEA.
  • Version control systems such as Git and SVN
  • Project management tools such as Trello, Asana, and Jira
  • Code review tools such as GitHub, GitLab, and Gerrit
  • Continuous integration and deployment (CI/CD) tools such as Jenkins, Travis CI, and CircleCI
  • Debugging and profiling tools such as GDB, Valgrind, and perf
  • Dependency management tools such as pip, npm, and maven
  • Testing frameworks such as JUnit, TestNG, and pytest

I prefer open source tools because they are generally free and have a large community of users and contributors. This means that there is a wealth of documentation and support available, and it is easy to find solutions to problems. Additionally, open-source tools are usually more customizable and flexible, which is useful for my work as a data scientist and software engineer. Commercial tools tend to be more powerful and user-friendly, but they can be more expensive and may not be as customizable.


Grade: A

 
As with its responses to questions #1 and #4, ChatGPT's response here is largely a laundry list, and at this point in the interview there's not much left to say, except that ChatGPT chose to provide an answer to "why?" only for the second half of my question and not the first. That said, ChatGPT's answer to my second "why?" largely aligns with the "open source wins" sidebar above.


Gfesser via Midjourney: "Anime painting of horse wearing a sweater and working on a laptop, sitting in a corner office overlooking the Chicago River."

In my opening paragraphs to this piece, I mentioned that the "hidden" value of my questions was perhaps discovered in the process of submitting them to ChatGPT. And what is this value? Questions that explore *personal* experience are heavily represented in my software engineering list, whereas the data science questions typically explore theory only. Only 7 of the data science questions (#3, #5, #12, #14, #16, #17, #20) explore personal experience, in contrast with my software engineering questions, of which only 3 (#8, #11, #13) do *not* explore personal experience.

My original piece commented in its closing paragraphs that my questions were intended to be practical in nature, exploring whether a given candidate is able to succeed in the workplace. Understanding what a candidate can already do stands in sharp contrast to seeking textbook answers to theoretical questions, which should typically be asked only to seed follow-up questions that are practical in nature. Perhaps an argument can be made for asking junior candidates more theoretical questions, but even junior candidates should be expected to explain what they can do based on their work inside and outside the classroom.
