Educating ChatGPT on Data Lakehouse

lohitnath.453

September 23, 2023

Posted in Enterprise | March 17, 2023 | 4 min read

As the use of ChatGPT becomes more prevalent, I frequently encounter customers and data users citing ChatGPT’s responses in their discussions. I love the enthusiasm surrounding ChatGPT and the eagerness to learn about modern data architectures such as data lakehouses, data meshes, and data fabrics. ChatGPT is an excellent resource for gaining high-level insights and building awareness of any technology. However, caution is necessary when delving deeper into a particular technology. ChatGPT is trained on historical data and, depending on how one phrases a question, it may offer inaccurate or misleading information.

I took the free version of ChatGPT for a test drive (in March 2023) and asked some simple questions about the data lakehouse and its components. Here are some responses that weren’t exactly right, and our explanation of where and why it went wrong. Hopefully this blog will give ChatGPT an opportunity to learn and correct itself while counting towards my 2023 contribution to social good.

I thought this was a fairly comprehensive list. The one key component that is missing is a common, shared table format that can be used by all analytic services accessing the lakehouse data. When implementing a data lakehouse, the table format is a critical piece because it acts as an abstraction layer, making it easy for any engine or tool to access all the structured and unstructured data in the lakehouse, concurrently. Using a schema or metadata definition, the table format provides the structure that is missing in a data lake and brings it closer to a data warehouse. Some of the popular table formats are Apache Iceberg, Delta Lake, Hudi, and Hive ACID.
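To make the "abstraction layer" idea concrete, here is a minimal sketch in Python. It is a toy model, not the actual metadata layout of Apache Iceberg, Delta Lake, Hudi, or Hive ACID: the class names, the `plan_scan` helper, the engine names, and the file paths are all hypothetical. It only illustrates the principle that every engine reads the same schema and the same committed file list from shared table metadata instead of interpreting raw files in the lake on its own.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One committed version of the table: the data files that belong to it."""
    snapshot_id: int
    data_files: tuple[str, ...]


@dataclass
class TableMetadata:
    """Toy table metadata (hypothetical layout): a schema plus snapshot history."""
    name: str
    schema: dict[str, str]  # column name -> type name
    snapshots: list[Snapshot] = field(default_factory=list)

    def commit(self, added_files: list[str]) -> Snapshot:
        """Record a new snapshot that adds files on top of the current one."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, current + tuple(added_files))
        self.snapshots.append(snap)
        return snap

    def current_files(self) -> tuple[str, ...]:
        """Files any engine should read for the latest table state."""
        return self.snapshots[-1].data_files if self.snapshots else ()


def plan_scan(engine: str, table: TableMetadata) -> dict:
    """Each engine plans its scan from the shared metadata, not by listing the lake."""
    return {
        "engine": engine,
        "columns": list(table.schema),
        "files": list(table.current_files()),
    }


# Two commits to one table, then two different (hypothetical) engines plan a read.
orders = TableMetadata("orders", {"order_id": "long", "amount": "double"})
orders.commit(["lake/orders/part-0001.parquet"])
orders.commit(["lake/orders/part-0002.parquet"])

sql_plan = plan_scan("sql-engine", orders)
ml_plan = plan_scan("ml-notebook", orders)
```

Both plans end up with the same columns and the same two files, which is the point: the shared metadata, rather than each tool’s own view of the storage layer, defines what the table is.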

Also, the data lake layer is not limited to cloud object stores. Many companies still have massive amounts of data on premises, and data lakehouses are not limited to public clouds. They can be built on premises or as hybrid deployments leveraging private clouds, HDFS stores, or Apache Ozone.

At Cloudera, we also provide machine learning as part of our lakehouse, so data scientists get easy access to reliable data in the data lakehouse to quickly launch new machine learning projects and build and deploy new models for advanced analytics.

I like how ChatGPT started this answer, but it quickly jumps into features and even gives an incorrect response on the feature comparison. Features are not the only way of deciding which is the better table format. The choice also depends on compatibility, openness, versatility, and other factors that guarantee broader usage for different data users, guarantee security and governance, and future-proof your architecture.

Here is a high-level feature comparison chart if you want to go into the details of what is available on Delta Lake versus Apache Iceberg.

This response is a little dangerous because of its incorrectness, and it demonstrates why I feel these tools are not ready for deeper analysis. At first glance it may look like a reasonable response, but its premise is wrong, which makes you doubt the entire response and other responses as well. Saying “Delta Lake is built on top of Apache Iceberg” is incorrect, as the two are completely different, unrelated table formats and one has nothing to do with the conception of the other. They were created by different organizations to solve common data problems.
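One way to see that the formats are independent is to look at how each lays out its own metadata on storage. The sketch below is a rough heuristic, assuming the commonly seen conventions: Delta Lake keeps its transaction log under a `_delta_log/` directory, Apache Iceberg keeps `*.metadata.json` files under a `metadata/` directory, and Hudi keeps its timeline under `.hoodie/`. The function name `detect_table_format` and the sample table directories are hypothetical, and this is not how any engine or catalog actually resolves a table.

```python
import tempfile
from pathlib import Path


def detect_table_format(table_dir: Path) -> str:
    """Guess a table's format from its metadata layout (heuristic sketch only)."""
    # Delta Lake: transaction log directory at the table root.
    if (table_dir / "_delta_log").is_dir():
        return "delta"
    # Hudi: timeline directory at the table root.
    if (table_dir / ".hoodie").is_dir():
        return "hudi"
    # Iceberg: table metadata JSON files under metadata/.
    metadata_dir = table_dir / "metadata"
    if metadata_dir.is_dir() and any(metadata_dir.glob("*.metadata.json")):
        return "iceberg"
    return "unknown"


# Build two empty, fake table directories just to exercise the heuristic.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    delta = root / "sales_delta"
    (delta / "_delta_log").mkdir(parents=True)
    (delta / "_delta_log" / "00000000000000000000.json").write_text("{}")

    iceberg = root / "sales_iceberg"
    (iceberg / "metadata").mkdir(parents=True)
    (iceberg / "metadata" / "v1.metadata.json").write_text("{}")

    detected = {p.name: detect_table_format(p) for p in (delta, iceberg)}
```

Neither layout contains or depends on the other, which fits the point above that Delta Lake is not built on top of Apache Iceberg.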

I’m impressed that ChatGPT got this one right, although it made a few mistakes with our product names and missed a few that are critical for a lakehouse implementation.

CDP’s components that support a data lakehouse architecture include:

Apache Iceberg table format, which is integrated into CDP to provide structure to the massive amounts of structured and unstructured data in your data lake.
Data services, including a cloud native data warehouse called CDW, a data engineering service called CDE, a data streaming service called data in motion, and a machine learning service called CML.
Cloudera Shared Data Experience (SDX), which provides a unified data catalog with automatic data profilers, unified security, and unified governance over all your data in both the public and private cloud.

ChatGPT is a great tool for getting a high-level view of new technologies, but I would say use it carefully, validate its responses, and use it only for the awareness stage of the buying cycle. As you go into the consideration or comparison stage, it’s not reliable yet.

Also, answers on ChatGPT keep updating, so hopefully it corrects itself before you read this blog.

To learn more about Cloudera’s lakehouse, visit the webpage, and if you are ready to get started, watch the Cloudera Now demo.
