<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Piyush’s Substack: Apache Spark]]></title><description><![CDATA[Lets Learn Apache Spark!]]></description><link>https://piyushagarwal441.substack.com/s/apache-spark</link><image><url>https://substackcdn.com/image/fetch/$s_!VUqH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d907ed-5e54-477c-8d47-29a75e950c07_96x96.png</url><title>Piyush’s Substack: Apache Spark</title><link>https://piyushagarwal441.substack.com/s/apache-spark</link></image><generator>Substack</generator><lastBuildDate>Tue, 26 May 2026 18:29:04 GMT</lastBuildDate><atom:link href="https://piyushagarwal441.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Piyush Agarwal]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[piyushagarwal441@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[piyushagarwal441@substack.com]]></itunes:email><itunes:name><![CDATA[Piyush Agarwal]]></itunes:name></itunes:owner><itunes:author><![CDATA[Piyush Agarwal]]></itunes:author><googleplay:owner><![CDATA[piyushagarwal441@substack.com]]></googleplay:owner><googleplay:email><![CDATA[piyushagarwal441@substack.com]]></googleplay:email><googleplay:author><![CDATA[Piyush Agarwal]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Python Chapter 1]]></title><description><![CDATA[Preparing for interviews]]></description><link>https://piyushagarwal441.substack.com/p/python-chapter-1</link><guid isPermaLink="false">https://piyushagarwal441.substack.com/p/python-chapter-1</guid><dc:creator><![CDATA[Piyush 
Agarwal]]></dc:creator><pubDate>Thu, 03 Oct 2024 13:32:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!u1uU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u1uU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u1uU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 424w, https://substackcdn.com/image/fetch/$s_!u1uU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 848w, https://substackcdn.com/image/fetch/$s_!u1uU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!u1uU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u1uU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg" width="400" height="525" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Fluent Python, 2nd Edition&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Fluent Python, 2nd Edition" title="Fluent Python, 2nd Edition" srcset="https://substackcdn.com/image/fetch/$s_!u1uU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 424w, https://substackcdn.com/image/fetch/$s_!u1uU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 848w, https://substackcdn.com/image/fetch/$s_!u1uU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!u1uU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55d93c3e-f577-48bf-a516-8fa9d2c711e6_400x525.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While I believe I have a good grasp of Pythonic concepts, I don&#8217;t want to mess up my interview; I want to make sure that I do well. So, I am going through the textbook:</p><ul><li><p><a href="https://learning.oreilly.com/library/view/fluent-python-2nd/9781492056348/">Fluent Python, 2nd edition </a></p></li></ul><p>I just want to get the hang of the textbook definitions, so that I don&#8217;t talk nonsense or get lost for words.</p><p></p><ol><li><p>What is &#8216;self&#8217; in Python?</p><ol><li><p><code>self</code> refers to the instance of a class on which a method is called. It is used to access the attributes and methods of that instance.</p></li></ol></li><li><p>What is &#8216;__init__&#8217; in Python?</p><ol><li><p>&#8216;__init__&#8217; is an instance method in a class used to initialise newly created objects in Python. 
It is commonly called the constructor, although strictly speaking it only initialises an instance that <code>__new__</code> has already created.</p></li></ol></li><li><p>What are special methods in Python?</p><ol><li><p>The Python interpreter invokes special methods to perform basic object operations, often triggered by special syntax. The special method names are always written with leading and trailing double underscores. For example, the syntax <code>obj[key]</code> is supported by the <code>__getitem__</code> special method. In order to evaluate <code>my_collection[key]</code>, the interpreter calls <code>my_collection.__getitem__(key)</code>.</p></li><li><p>The first thing to know about special methods is that they are meant to be called by the Python interpreter, and not by you. You don&#8217;t write <code>my_object.__len__()</code>. You write <code>len(my_object)</code> and, if <code>my_object</code> is an instance of a user-defined class, then Python calls the <code>__len__</code> method you implemented.</p></li></ol></li><li><p>Why <code>len</code> Is Not a Method [copied from the book]</p><p>I asked this question to core developer Raymond Hettinger in 2013, and the key to his answer was a quote from <a href="https://fpy.li/1-8">&#8220;The Zen of Python&#8221;</a>: &#8220;practicality beats purity.&#8221; In <a href="https://learning.oreilly.com/library/view/fluent-python-2nd/9781492056348/ch01.html#how_special_used">&#8220;How Special Methods Are Used&#8221;</a>, I described how <code>len(x)</code> runs very fast when <code>x</code> is an instance of a built-in type. No method is called for the built-in objects of CPython: the length is simply read from a field in a C struct. Getting the number of items in a collection is a common operation and must work efficiently for such basic and diverse types as <code>str</code>, <code>list</code>, <code>memoryview</code>, and so on.</p><p>In other words, <code>len</code> is not called as a method because it gets special treatment as part of the Python Data Model, just like <code>abs</code>. 
But thanks to the special method <code>__len__</code>, you can also make <code>len</code> work with your own custom objects. This is a fair compromise between the need for efficient built-in objects and the consistency of the language. Also from &#8220;The Zen of Python&#8221;: &#8220;Special cases aren&#8217;t special enough to break the rules.&#8221;</p></li></ol><p>I had a good time reading the 1st chapter, and definitely enjoyed it.</p><div class="pullquote"><p>I work on the unceded traditional Coast Salish lands including those of the Tsleil-Waututh (s&#601;l&#787;ilw&#787;&#601;ta&#660;&#620;), Kwikwetlem (k&#695;ik&#695;&#601;&#411;&#787;&#601;m), Squamish (S&#7733;wx&#817;w&#250;7mesh &#218;xwumixw) and Musqueam (x&#695;m&#601;&#952;k&#695;&#601;y&#787;&#601;m) Nations.</p></div><p></p>]]></content:encoded></item><item><title><![CDATA[Basic Spark Dataframe Structure]]></title><description><![CDATA[What does a Spark DataFrame look like?]]></description><link>https://piyushagarwal441.substack.com/p/basic-spark-dataframe-structure</link><guid isPermaLink="false">https://piyushagarwal441.substack.com/p/basic-spark-dataframe-structure</guid><dc:creator><![CDATA[Piyush Agarwal]]></dc:creator><pubDate>Wed, 25 Sep 2024 18:23:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FzEj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Schema</h3><p>For any table, the schema defines the column names and the data type of each column. The schema of any DataFrame can be inspected with the <code>printSchema()</code> method.</p><pre><code>from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.json("data/flight-data/json/2015-summary.json")
df.printSchema()

myManualSchema = StructType([
  StructField("DEST_COUNTRY_NAME", StringType(), True),
  StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
  StructField("count", LongType(), False, metadata={"hello":"world"})
])
df = spark.read.format("json").schema(myManualSchema)\
  .load("data/flight-data/json/2015-summary.json")</code></pre><p>A schema is a <strong>StructType</strong> composed of fields called <strong>StructFields</strong>. Each field consists of the column name, its data type and a Boolean flag which indicates whether the column may contain null (missing) values. Users can also attach metadata to a column.</p><h3>Columns and Expressions</h3><h4>Columns</h4><p>Columns in Spark only exist in the context of rows, which in turn only exist in the context of DataFrames. So, the upshot is that Spark has no equivalent of <strong>pd.Series</strong>.</p><h4>So what is the use of the col() function in Spark?</h4><p>This is explained well by a Stack Overflow answer, available <a href="https://stackoverflow.com/questions/64076200/pyspark-what-is-the-real-use-of-col-function">here</a>. The upshot is that col() lets you refer to a column by its name alone, as a string, without going through a particular DataFrame variable.</p><h4>Expressions</h4><p>An expression is a set of transformations on one or more values in a record/row in a DataFrame. Expressions are evaluated by forming a logical tree which specifies the order of operations. 
For example:</p><pre><code>(((col("someCol") + 5) * 200) - 6) &lt; col("otherCol")</code></pre><p>has a logical tree which looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FzEj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FzEj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 424w, https://substackcdn.com/image/fetch/$s_!FzEj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 848w, https://substackcdn.com/image/fetch/$s_!FzEj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 1272w, https://substackcdn.com/image/fetch/$s_!FzEj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FzEj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png" width="591" height="518" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:591,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!FzEj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 424w, https://substackcdn.com/image/fetch/$s_!FzEj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 848w, https://substackcdn.com/image/fetch/$s_!FzEj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 1272w, https://substackcdn.com/image/fetch/$s_!FzEj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d95c7f0-2727-4e48-a94d-86cef6cf644d_591x518.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Logical tree compiled by Spark from the above expression</figcaption></figure></div><h3>Records (rows)</h3><p>Each row in a Spark DataFrame is a single record, which Spark represents internally as an array of bytes. Row objects are manipulated using column expressions. To access the first row of any DataFrame in Spark, the following can be executed:</p><pre><code>df.first()</code></pre><p>Note that transformations are always performed on a DataFrame. They cannot be applied to individual rows or individual columns outside the context of a DataFrame.</p><p>So, DataFrames are the key objects to work with in Spark. They are composed of records, and these records are manipulated by specifying transformations on the columns.</p><p>This is all for this piece. I am going to keep updating this <a href="https://piyushagarwal441.substack.com/s/apache-spark">Spark Section</a> on the blog. 
If you are a beginner enthusiast like me and want to learn Apache Spark, feel free to reach out and we can do it together!</p><div class="pullquote"><p><em>I work on the unceded traditional Coast Salish lands including those of the Tsleil-Waututh (s&#601;l&#787;ilw&#787;&#601;ta&#660;&#620;), Kwikwetlem (k&#695;ik&#695;&#601;&#411;&#787;&#601;m), Squamish (S&#7733;wx&#817;w&#250;7mesh &#218;xwumixw) and Musqueam (x&#695;m&#601;&#952;k&#695;&#601;y&#787;&#601;m) Nations.</em></p></div><p></p>]]></content:encoded></item><item><title><![CDATA[Logical Plan, Optimizers, Physical Plan]]></title><description><![CDATA[How does Spark execute code?]]></description><link>https://piyushagarwal441.substack.com/p/logical-plan-optimizers-physical</link><guid isPermaLink="false">https://piyushagarwal441.substack.com/p/logical-plan-optimizers-physical</guid><dc:creator><![CDATA[Piyush Agarwal]]></dc:creator><pubDate>Thu, 19 Sep 2024 13:59:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!93Ec!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For any language or framework, one of the fundamental things to understand is how user-written code gets executed. Here, we are curious about Spark. What happens behind the scenes? How does Spark execute user-written code?</p><p>Spark follows a rather simple plan to execute user-written code. 
The agents involved in this execution are:</p><ul><li><p>Catalog</p></li><li><p>Catalyst Optimizer</p></li><li><p>Resilient Distributed Datasets</p></li></ul><p>There are two major steps in executing user code:</p><ol><li><p>Converting user code to a logical plan.</p></li><li><p>Converting the logical plan to a physical plan.</p></li></ol><h2>User code to logical plan</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!93Ec!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!93Ec!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 424w, https://substackcdn.com/image/fetch/$s_!93Ec!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 848w, https://substackcdn.com/image/fetch/$s_!93Ec!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 1272w, https://substackcdn.com/image/fetch/$s_!93Ec!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!93Ec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png" width="1250" height="343" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!93Ec!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 424w, https://substackcdn.com/image/fetch/$s_!93Ec!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 848w, https://substackcdn.com/image/fetch/$s_!93Ec!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 1272w, https://substackcdn.com/image/fetch/$s_!93Ec!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbc918c3-2ea2-4d4e-8956-a85ab15a8466_1250x343.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">User code to logical plan</figcaption></figure></div><ol><li><p>The user code is first converted into a logical plan, which is only a set of abstract transformations. It does not refer to executors or drivers yet. The plan is unresolved at this stage because even syntactically valid code may refer to tables or columns that do not exist.</p></li><li><p>Spark then uses the Catalog, which acts as the repository of all table and DataFrame information, to resolve the logical plan. The resolved logical plan then passes through the Catalyst optimizer, a collection of rules that optimize the plan by rewriting its transformations, for example by pushing filters closer to the data source. 
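</p><p>As a rough illustration of these stages (a minimal sketch, not part of the original walkthrough), PySpark&#8217;s <code>DataFrame.explain(extended=True)</code> prints the parsed, analyzed and optimized logical plans, followed by the physical plan covered in the next section. The app name, the variable names and the two in-memory sample rows below are assumptions made purely for illustration.</p><pre><code>from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Hypothetical app name; any name works.
spark = SparkSession.builder.appName("plan-demo").getOrCreate()

# Tiny made-up DataFrame so the sketch needs no input files.
flights = spark.createDataFrame(
    [("United States", "Romania", 15), ("United States", "Croatia", 1)],
    ["DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count"],
)

# Transformations only describe the work; nothing is executed here.
result = (
    flights
    .filter(col("ORIGIN_COUNTRY_NAME") == "Romania")
    .select("DEST_COUNTRY_NAME", "count")
)

# Print the parsed, analyzed and optimized logical plans, then the physical plan.
result.explain(extended=True)</code></pre><p>The exact plan text varies between Spark versions, but the parsed plan typically still shows unresolved column references, while the analyzed plan shows them resolved against the schema. 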
</p></li></ol><h2>Logical plan to physical plan </h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vctX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vctX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 424w, https://substackcdn.com/image/fetch/$s_!vctX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 848w, https://substackcdn.com/image/fetch/$s_!vctX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 1272w, https://substackcdn.com/image/fetch/$s_!vctX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vctX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png" width="1367" height="449" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:449,&quot;width&quot;:1367,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!vctX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 424w, https://substackcdn.com/image/fetch/$s_!vctX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 848w, https://substackcdn.com/image/fetch/$s_!vctX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 1272w, https://substackcdn.com/image/fetch/$s_!vctX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa359dc1d-8818-4109-baa5-2fcec998a28d_1367x449.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Logical plan to physical plan</figcaption></figure></div><ol><li><p>The physical plan specifies how the logical plan will be executed on the cluster, so it directly involves the driver and executors. Spark generates several candidate physical plans and compares them through a cost model.</p></li><li><p>Spark then runs the selected physical plan, further optimizing it at runtime by generating Java bytecode. The results are then returned to the user.</p></li></ol><p></p><p>This is all for this piece. I am going to keep updating this <a href="https://piyushagarwal441.substack.com/s/apache-spark">Spark Section</a> on the blog. 
If you are a beginner enthusiast like me and want to learn Apache spark, feel free to reach out and we can do it together!</p><div class="pullquote"><p><em>I work on the unceded traditional Coast Salish lands including those of the Tsleil-Waututh (s&#601;l&#787;ilw&#787;&#601;ta&#660;&#620;), Kwikwetlem (k&#695;ik&#695;&#601;&#411;&#787;&#601;m), Squamish (S&#7733;wx&#817;w&#250;7mesh &#218;xwumixw) and Musqueam (x&#695;m&#601;&#952;k&#695;&#601;y&#787;&#601;m) Nations.</em></p></div>]]></content:encoded></item><item><title><![CDATA[Lazy Evaluation, Drivers and Executors]]></title><description><![CDATA[A quick look at the Core Spark API]]></description><link>https://piyushagarwal441.substack.com/p/lazy-evaluation-drivers-and-executors</link><guid isPermaLink="false">https://piyushagarwal441.substack.com/p/lazy-evaluation-drivers-and-executors</guid><dc:creator><![CDATA[Piyush Agarwal]]></dc:creator><pubDate>Tue, 17 Sep 2024 14:56:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AzZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iT6D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iT6D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 424w, 
https://substackcdn.com/image/fetch/$s_!iT6D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 848w, https://substackcdn.com/image/fetch/$s_!iT6D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 1272w, https://substackcdn.com/image/fetch/$s_!iT6D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iT6D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png" width="275" height="180.5066079295154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:149,&quot;width&quot;:227,&quot;resizeWidth&quot;:275,&quot;bytes&quot;:12723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iT6D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 424w, 
https://substackcdn.com/image/fetch/$s_!iT6D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 848w, https://substackcdn.com/image/fetch/$s_!iT6D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 1272w, https://substackcdn.com/image/fetch/$s_!iT6D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85928bbd-64ef-4408-9f8d-c5d934ff1a95_227x149.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Earlier, we saw that <a href="https://piyushagarwal441.substack.com/p/looking-for-software-engineering-73c">Apache Spark</a> is a unified computing engine with a set of libraries for parallel data processing on computer clusters. The natural questions that arise are:</p><ul><li><p>How does Apache Spark handle parallel processing?</p></li><li><p>Which processes work underneath the higher-level Python and SQL calls to make it possible to compute on and transform large amounts of data?</p></li></ul><p>Today, we will take a sneak peek at some of the behind-the-scenes management performed by Spark. 
I am using the textbook <strong><a href="https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/">Spark: The Definitive Guide</a></strong> as my reference focusing on the 2nd chapter, <a href="https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/ch02.html">A Gentle Introduction to Spark </a></p><h2>Core Spark Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AzZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AzZh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 424w, https://substackcdn.com/image/fetch/$s_!AzZh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 848w, https://substackcdn.com/image/fetch/$s_!AzZh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 1272w, https://substackcdn.com/image/fetch/$s_!AzZh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AzZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png" width="789" height="539" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:539,&quot;width&quot;:789,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!AzZh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 424w, https://substackcdn.com/image/fetch/$s_!AzZh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 848w, https://substackcdn.com/image/fetch/$s_!AzZh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 1272w, https://substackcdn.com/image/fetch/$s_!AzZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F444e3a3f-3755-460e-9ac7-4751d90c533d_789x539.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Spark Architecture Application</figcaption></figure></div><p>At the heart of every Spark application are two kinds of processes:</p><ol><li><p>The driver process, which is responsible for:</p><ol><li><p>maintaining information about the Spark application</p></li><li><p>responding to the user&#8217;s program or input</p></li><li><p>analyzing, distributing and scheduling work across the executors</p></li></ol></li><li><p>The executor processes, which are responsible for:</p><ol><li><p>executing code assigned by the driver</p></li><li><p>reporting the state of the computation back to the driver node</p></li></ol></li></ol><p>There is a single driver process and multiple executor processes. The driver typically runs on one node of the cluster, while the executors run on the other nodes. The cluster manager only allocates the resources needed to complete the work. The user controls a Spark application through the SparkSession, which lives in the driver process. 
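</p><p>To make the driver and executor roles concrete, here is a toy sketch in plain Python. It is only an analogy and does not use the Spark API: a thread pool stands in for the executors, and the helpers <code>run_driver</code> and <code>executor_task</code> are hypothetical names made up for illustration. The "driver" splits the work into chunks, hands each chunk to an "executor", and combines the results that are reported back.</p>

```python
from concurrent.futures import ThreadPoolExecutor


def executor_task(chunk):
    """Hypothetical stand-in for an executor: run the assigned work on one chunk."""
    return sum(value * value for value in chunk)


def run_driver(data, num_executors):
    """Hypothetical stand-in for the driver: split the work, schedule it, gather results."""
    # Deal the data into one chunk per executor (round-robin slicing).
    chunks = [data[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # Each executor reports its partial result back to the driver.
        partial_results = list(pool.map(executor_task, chunks))
    # The driver combines the partial results into the final answer.
    return sum(partial_results)


if __name__ == "__main__":
    print(run_driver(list(range(10)), num_executors=3))  # prints 285
```

<p>In a real cluster the executors are separate JVM processes, usually on other machines, and the cluster manager decides where they run. The sketch only mirrors the division of labour described above.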
</p><p>The core Spark architecture is simple and elegant, which helps it scale to petabytes of data. The driver allocates tasks to the executors by dividing the data into partitions. Next, we discuss partitions, transformations, actions and lazy evaluation.</p><h2>Actions, Transformations, Partitions and <br>Lazy Evaluation</h2><h3>Partitions</h3><p>To work in parallel and use the available executors, Spark breaks up the data into chunks called partitions. Each partition is a collection of rows that sits on one physical machine in the cluster. Together, the number of partitions and the number of available executors control the degree of parallelism: with many partitions but a single executor, or many executors but a single partition, the effective parallelism is still one. The user can set the number of partitions but, for the most part, does not manipulate individual partitions by hand. The user only has to specify the high-level <strong>transformations</strong> on the data.</p><h3>Transformations</h3><p>Any modification of the data is expressed through transformations. Note that, unlike in pandas, transformations are not applied immediately; Spark only records them. 
These transformations (and only the necessary ones) are performed only when an <strong>Action</strong> is specified.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gtmk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gtmk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 424w, https://substackcdn.com/image/fetch/$s_!Gtmk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 848w, https://substackcdn.com/image/fetch/$s_!Gtmk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 1272w, https://substackcdn.com/image/fetch/$s_!Gtmk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gtmk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png" width="1166" height="654" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:654,&quot;width&quot;:1166,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:175126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gtmk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 424w, https://substackcdn.com/image/fetch/$s_!Gtmk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 848w, https://substackcdn.com/image/fetch/$s_!Gtmk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 1272w, https://substackcdn.com/image/fetch/$s_!Gtmk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64905d3f-24a9-4cdf-acd3-c5525dd8fb78_1166x654.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Narrow and Wide Transformations</figcaption></figure></div><p>There are two general kinds of transformations: narrow and wide. As the figure shows, under a narrow transformation each input partition contributes to only one output partition, whereas under a wide transformation one input partition contributes to many output partitions, which requires Spark to shuffle data across the cluster. map() and filter() are simple examples of narrow transformations, whereas groupby() followed by an aggregation is an example of a wide transformation.</p><h3>Lazy Evaluation</h3><p>Transformations are only specified, not applied to the data immediately. This is because Spark evaluates lazily: it waits until the last possible moment to execute code. 
The goal is to let Spark compile the whole list of transformations into a streamlined physical plan that runs as efficiently as possible across the cluster.</p><h3>Actions</h3><p>Actions are the logical end step that triggers the computation specified by the list of transformations. No computation actually happens until an action is called. There are three general kinds of actions:</p><ul><li><p>Actions to view data in the console</p></li><li><p>Actions to collect data to native objects</p></li><li><p>Actions to write to output data sources</p></li></ul><p></p><p>This is all for this piece. I will keep updating this <a href="https://piyushagarwal441.substack.com/s/apache-spark">Spark Section</a> on the blog. If you are a beginner enthusiast like me and want to learn Apache Spark, feel free to reach out and we can do it together!</p><div class="pullquote"><p>I work on the unceded traditional Coast Salish lands including those of the Tsleil-Waututh (s&#601;l&#787;ilw&#787;&#601;ta&#660;&#620;), Kwikwetlem (k&#695;ik&#695;&#601;&#411;&#787;&#601;m), Squamish (S&#7733;wx&#817;w&#250;7mesh &#218;xwumixw) and Musqueam (x&#695;m&#601;&#952;k&#695;&#601;y&#787;&#601;m) Nations.</p></div>]]></content:encoded></item><item><title><![CDATA[Looking for Software Engineering/ Data Engineering / Data Analytics jobs - DAY 3]]></title><description><![CDATA[What is Apache Spark?]]></description><link>https://piyushagarwal441.substack.com/p/looking-for-software-engineering-73c</link><guid isPermaLink="false">https://piyushagarwal441.substack.com/p/looking-for-software-engineering-73c</guid><dc:creator><![CDATA[Piyush Agarwal]]></dc:creator><pubDate>Thu, 12 Sep 2024 07:41:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9cc13ac6-ee16-44ca-9907-9657557199fe_338x149.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!xPEH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xPEH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 424w, https://substackcdn.com/image/fetch/$s_!xPEH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 848w, https://substackcdn.com/image/fetch/$s_!xPEH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 1272w, https://substackcdn.com/image/fetch/$s_!xPEH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xPEH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png" width="338" height="149" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:149,&quot;width&quot;:338,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Apache Spark on Hadoop: Learn, Try and 
Do&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Apache Spark on Hadoop: Learn, Try and Do" title="Apache Spark on Hadoop: Learn, Try and Do" srcset="https://substackcdn.com/image/fetch/$s_!xPEH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 424w, https://substackcdn.com/image/fetch/$s_!xPEH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 848w, https://substackcdn.com/image/fetch/$s_!xPEH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 1272w, https://substackcdn.com/image/fetch/$s_!xPEH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6069ce87-5e41-4f8c-8118-407dc84ad2df_338x149.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>I am in the software/data job market. I know the market is in a slump and opportunities are scarce, so one needs to apply to as many jobs and reach out to as many people as possible during this period. I have solid programming experience in Python and R and a good understanding of data structures and algorithms, and I have been developing my own Python package for a bioinformatics workflow. I am looking to get a foothold in the software/tech domain and am documenting my job-hunt process. Learning something relevant daily and documenting it seems like one way to attract employers and get some interviews. 
So, every day I want to learn something mentioned in the job postings and document it. Today's topic is:</p><h2>What is Apache Spark?</h2><div class="pullquote"><p>Apache Spark is a <strong>unified computing engine</strong> and <strong>a set of libraries</strong> for <strong>parallel data processing</strong> on <strong>computer clusters</strong>.</p></div><p>The above definition comes from Spark: The Definitive Guide by Bill Chambers &amp; Matei Zaharia, the latter of whom started the Spark project. The key themes behind Spark's design are:</p><ol><li><p>Having a single (<strong>unified</strong>) tool for a range of analytical tasks, from simple data loading and SQL queries to machine learning and graph processing.</p></li><li><p>Focusing on computation rather than storage. Spark works with many kinds of data sources but is primarily a <strong>computing engine</strong>: a data processing tool.</p></li><li><p>Providing a unified API for common data analysis tasks. Spark <strong>libraries</strong> serve that purpose, with different libraries providing different functionality: Spark SQL for SQL and structured data, MLlib for machine learning, and GraphX for graph analytics.</p></li></ol><p></p><p>References:</p><ol><li><p>Spark: The Definitive Guide by Bill Chambers &amp; Matei Zaharia</p></li></ol><div class="pullquote"><p>I work on the unceded traditional Coast Salish lands including those of the Tsleil-Waututh (s&#601;l&#787;ilw&#787;&#601;ta&#660;&#620;), Kwikwetlem (k&#695;ik&#695;&#601;&#411;&#787;&#601;m), Squamish (S&#7733;wx&#817;w&#250;7mesh &#218;xwumixw) and Musqueam (x&#695;m&#601;&#952;k&#695;&#601;y&#787;&#601;m) Nations.</p></div><p></p>]]></content:encoded></item></channel></rss>