[{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"AI","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":" Introduction # Suppose you are working in NLP or AI like half of the world including myself. You want to do some text cleaning to prepare training data for a multi-language model. Your goal is to remove text with duplicate meaning within each language. So you \u0026ldquo;write\u0026rdquo; a quick cleaning function, even dusting off the ancient language of regex that was once known to humans but only spoken by LLMs these days.\npreprocess.py 1 2 3 4 5 6 7 8 9 10 import re def clean_text(text: str) -\u0026gt; str: \u0026#34;\u0026#34;\u0026#34;Normalize text for dedup.\u0026#34;\u0026#34;\u0026#34; text = text.strip() text = text.lower() text = re.sub(r\u0026#34;\\s+\u0026#34;, \u0026#34; \u0026#34;, text) # collapse whitespace text = re.sub(r\u0026#34;[^\\w\\s]\u0026#34;, \u0026#34; \u0026#34;, text) # replace punctuation return text You run the function on an example input.\n$ python -c \u0026#34;from preprocess import clean_text; print(clean_text(\u0026#39; The woman was walking \\t\\t \\n \\n Down a busy street \u0026#39;))\u0026#34; the woman was walking down a busy street Looks like it\u0026rsquo;s working as expected. But there is a bug in this function. Can you spot it? 
I\u0026rsquo;ll give you a hint: it\u0026rsquo;s not regex related.\npreprocess.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 import re def clean_text(text: str) -\u0026gt; str: \u0026#34;\u0026#34;\u0026#34;Normalize text for dedup.\u0026#34;\u0026#34;\u0026#34; text = text.strip() text = text.lower() text = re.sub(r\u0026#34;\\s+\u0026#34;, \u0026#34; \u0026#34;, text) # collapse whitespace text = re.sub(r\u0026#34;[^\\w\\s]\u0026#34;, \u0026#34; \u0026#34;, text) # replace punctuation return text assert clean_text( \u0026#34;Die Frau ging langsam durch eine belebte Straße\u0026#34; ) == clean_text( \u0026#34; Die Frau ging langsam durch\\neine belebte Strasse \u0026#34; ) Can you spot the bug now?\nThe whitespace and newlines are handled fine by strip and regex. The bug is caused by the str.lower() method. In German, \u0026ldquo;Straße\u0026rdquo; and \u0026ldquo;Strasse\u0026rdquo; both mean \u0026ldquo;street.\u0026rdquo; But str.lower() does not map them to the same lowercase string, so duplicates slip through our dedup.\n$ python preprocess.py Traceback (most recent call last): File \u0026#34;preprocess.py\u0026#34;, line 13, assert clean_text(...) == clean_text(...) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError The fix is to use the str.casefold() method. 
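To see the difference directly, here is a quick check (a minimal sketch):

```python
text = "Straße"

# str.lower() keeps the sharp s, so the two spellings stay distinct
assert text.lower() == "straße"

# str.casefold() folds ß to ss
assert text.casefold() == "strasse"
```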
It applies Unicode case folding rules designed for caseless matching.\nimport re def clean_text(text: str) -\u0026gt; str: \u0026#34;\u0026#34;\u0026#34;Normalize text for dedup.\u0026#34;\u0026#34;\u0026#34; text = text.strip() text = text.casefold() text = re.sub(r\u0026#34;\\s+\u0026#34;, \u0026#34; \u0026#34;, text) # collapse whitespace text = re.sub(r\u0026#34;[^\\w\\s]\u0026#34;, \u0026#34; \u0026#34;, text) # replace punctuation return text assert clean_text( \u0026#34;Die Frau ging langsam durch eine belebte Straße\u0026#34; ) == clean_text( \u0026#34; Die Frau ging langsam durch\\neine belebte Strasse \u0026#34; ) print(\u0026#34;Assertion passed!\u0026#34;) $ python preprocess.py Assertion passed! \u0026quot;Straße\u0026quot;.casefold() gives \u0026quot;strasse\u0026quot;, and now both versions match.\nOther languages have similar cases. For example, Turkish has a dotless variant of lowercase i: ı, and a dotted capital I: İ. Greek has two lowercase sigmas, σ and ς, the same letter in different positions within a word. Sometimes these differences are easy to overlook.\n$ python -c \u0026#34;assert \u0026#39;ﬁnance\u0026#39; == \u0026#39;finance\u0026#39;\u0026#34; Traceback (most recent call last): assert \u0026#39;ﬁnance\u0026#39; == \u0026#39;finance\u0026#39; ^^^^^^^^^^^^^^^^^^^^^ AssertionError In a wall of text, these two spellings look identical, but they are not the same. The first uses ﬁ, a single Unicode ligature; the second uses fi as two characters. OCR software and PDF extractors produce these frequently.\nRemember when I said the bug? I lied; it was actually just a bug, one of several. There are more bugs in the code we thought we just fixed. But before we go hunting for them, let\u0026rsquo;s talk about what we can do about these bugs.\nWhat we just did is called manual testing. We wrote the code, gave it some input, and checked whether the result was what we expected. The problem with our approach is that we did it all by hand. 
We ran the function a couple of times with different inputs, fixed the code to make sure it worked, and moved on. What happens when we or someone else later changes the code back to using str.lower()? In the best case we find out by doing some manual testing again. In the worst case we get silent failures. Imagine spending hours on compute to train your model only to realize that your training data had duplicates the whole time.\nManual testing does not scale. The better way is writing test code that checks your source code automatically. When you hear people say \u0026ldquo;testing\u0026rdquo; this almost always refers to automated testing, so we lose \u0026ldquo;automated\u0026rdquo; and just say \u0026ldquo;testing.\u0026rdquo; So, instead of eyeballing some output every time, the test code does it for you, and runs every time you change your code. If something breaks, you know immediately.\nToday we will see how testing ensures that our code works correctly and continues to work correctly as we make changes. It gives us the confidence to refactor and improve any codebase without fear of breaking things. The same confidence carries over to letting AI make changes to our code, which matters as more and more coding is done with AI. We will also see how AI can accelerate the testing process itself.\nTesting has a long history in software, but in data roles it is not emphasized as much as it should be. When a web app has a bug, the app crashes and someone notices. But a bad join in a pipeline produces plausible but wrong numbers, and nobody catches it for months. Traditional software validates data on the way in and trusts it on the way out. Data work is different: you are frequently reading data you did not produce, from sources you do not control. 
The prevalence of silent failures in data work is what makes testing even more important.\nThe primary goal of keeping data in traditional software is to serve the application, not to build a prediction model or train LLMs on it (fingers crossed on this last one). But if you are working as a Data Analyst, Data Engineer, Data Scientist, ML Engineer, or AI Engineer, your primary job is to analyze, interpret, and model messy and rarely structured data. We will see why testing matters a lot for data quality and pipelines, and for any kind of modeling, whether it is statistical, econometric, ML, or AI.\nUnit Tests # We start with the same example we used at the end of the typing chapter.\npricing.py 1 2 3 4 5 6 7 8 9 10 11 12 13 from enum import Enum class Tier(float, Enum): DIAMOND = 1.4 PLATINUM = 0.3 GOLD = 0.2 SILVER = 0.1 BRONZE = 0.05 def apply_discount(price: float, tier: Tier) -\u0026gt; float: return price * (1 - tier) Here, the story is that a developer added a new customer tier but accidentally entered 1.4 instead of 0.4 for the discount value. From the type checker\u0026rsquo;s point of view, everything is valid. But from a business point of view, a 140% discount does not make sense, and if we apply it, we get a negative price. So the code is type-correct, but the behavior is still incorrect. This is a bug that type checking alone cannot protect us from. Now let\u0026rsquo;s see how testing helps us catch errors like this, not just this time, but also as the codebase evolves.\nYour first instinct might be to test apply_discount. Let\u0026rsquo;s start by writing a happy-path test.\ntest_pricing.py 1 2 3 4 5 from pricing import Tier, apply_discount def test_apply_discount_happy_path(): assert apply_discount(100.0, Tier.GOLD) == 80.0 Python includes a built-in testing library, unittest, and the most popular third-party testing framework is pytest. This will not be a tutorial on either of them, except for one small pytest feature we will use. 
Let\u0026rsquo;s run our test with pytest.\n$ pytest test_pricing.py ======================== test session starts ========================= configfile: pyproject.toml collected 1 item test_pricing.py . [100%] ========================= 1 passed in 0.01s ========================== This is a good starting point. The test checks that the function behaves correctly in a normal case. But it doesn\u0026rsquo;t protect us from the bug. In order to catch the bug, we add another test.\ntest_pricing.py 1 2 3 4 5 6 7 8 9 10 from pricing import Tier, apply_discount def test_apply_discount_happy_path(): assert apply_discount(100.0, Tier.GOLD) == 80.0 def test_discounted_price_is_positive(): for tier in Tier: assert apply_discount(100.0, tier) \u0026gt; 0.0 ======================================= test session starts ======================================== configfile: pyproject.toml collected 2 items test_pricing.py .F [100%] ============================================ FAILURES ============================================== ___________________________ test_discounted_price_is_positive _____________________________ def test_discounted_price_is_positive(): for tier in Tier: \u0026gt; assert apply_discount(100.0, tier) \u0026gt; 0.0 E assert -39.99999999999999 \u0026gt; 0.0 E + where -39.99999999999999 = apply_discount(100.0, \u0026lt;Tier.DIAMOND: 1.4\u0026gt;) test_pricing.py:10: AssertionError ====================================== short test summary info ===================================== FAILED test_pricing.py::test_discounted_price_is_positive - assert -39.99999999999999 \u0026gt; 0.0 ===================================== 1 failed, 1 passed in 0.02s ================================== The test failed, and that is a good thing. 
pytest is telling us that there is a problem with the DIAMOND tier, which makes it easier to debug and fix.\nBut notice that pytest says \u0026ldquo;collected 2 items.\u0026rdquo; The second function loops over all five Tier members, so it actually makes five assertions inside a single test. If DIAMOND fails, the loop stops there, and we never learn whether the other tiers would also have problems.\npytest has a very simple way to break this down with the parametrize decorator. Instead of using a for loop, we tell pytest to run the test function once for each Tier member as a separate test case.\ntest_pricing.py 1 2 3 4 5 6 7 8 9 10 11 12 import pytest from pricing import Tier, apply_discount def test_apply_discount_happy_path(): assert apply_discount(100.0, Tier.GOLD) == 80.0 @pytest.mark.parametrize(\u0026#34;tier\u0026#34;, Tier) def test_discounted_price_is_positive(tier): assert apply_discount(100.0, tier) \u0026gt; 0.0 $ pytest test_pricing.py ========================== test session starts =========================== configfile: pyproject.toml collected 6 items test_pricing.py .F.... [100%] E assert -39.999... \u0026gt; 0.0 test_pricing.py:12: AssertionError ====================== 1 failed, 5 passed in 0.01s ======================= Now we can see there are 6 test cases: 1 happy-path test plus 5 parametrized ones, one for each tier. 
With the verbose flag, we can see each tier tested individually.\n$ pytest test_pricing.py -v ============================ test session starts ============================== configfile: pyproject.toml collected 6 items test_pricing.py::test_apply_discount_happy_path PASSED [ 16%] test_pricing.py::...discounted_price_is_positive[Tier.DIAMOND] FAILED [ 33%] test_pricing.py::...discounted_price_is_positive[Tier.PLATINUM] PASSED [ 50%] test_pricing.py::...discounted_price_is_positive[Tier.GOLD] PASSED [ 66%] test_pricing.py::...discounted_price_is_positive[Tier.SILVER] PASSED [ 83%] test_pricing.py::...discounted_price_is_positive[Tier.BRONZE] PASSED [100%] E assert -39.999... \u0026gt; 0.0 test_pricing.py:12: AssertionError ========================== short test summary info ============================= FAILED test_pricing.py::test_discounted_price_is_positive[Tier.DIAMOND] ======================== 1 failed, 5 passed in 0.01s =========================== Now, our code has some tests, and that is really important. There are still two things we can improve, though.\nIn our codebase, it is very likely that other functions or methods also use the Tier enum we defined for discounts. That means they can have problems similar to the one we have here. Should we go ahead and write more tests for those cases to catch the same bug? We could, but that would be extra work we can avoid. If we step back for a moment, the real bug is not in the apply_discount logic. The apply_discount function is doing exactly what we told it to do: it multiplies the price by one minus the discount rate. The real problem is in the data. DIAMOND has a value of 1.4, and that does not make sense as a discount rate. 
In this case, the better test is to go directly to the source of the problem and test the Tier values themselves.\ntest_pricing.py 1 2 3 4 5 6 7 8 9 10 11 12 import pytest from pricing import Tier, apply_discount def test_apply_discount_happy_path(): assert apply_discount(100.0, Tier.GOLD) == 80.0 @pytest.mark.parametrize(\u0026#34;tier\u0026#34;, Tier) def test_tier_is_less_than_one(tier): assert tier \u0026lt; 1.0 $ pytest test_pricing.py =============== test session starts ================ configfile: pyproject.toml collected 6 items test_pricing.py .F.... [100%] E assert \u0026lt;Tier.DIAMOND: 1.4\u0026gt; \u0026lt; 1.0 test_pricing.py:12: AssertionError =========== 1 failed, 5 passed in 0.01s ============ This test catches the bug where it was introduced, and it does not depend on any downstream function. That gives us a useful general principle: test the thing that is wrong, as close to the source as possible. Let\u0026rsquo;s fix the bug now so our tests pass.\nAt this point, you might ask: are these enough tests? Are two tests too few? How would we know?\nCoverage # Coverage tells us which parts of the code were exercised by our tests. In Python, the coverage library reports statement coverage by default, and it can also measure branch coverage.\nLet\u0026rsquo;s run coverage on our tests.\n$ coverage run --source=pricing -m pytest test_pricing.py $ coverage report Name Stmts Miss Cover -------------------------------- pricing.py 9 0 100% -------------------------------- TOTAL 9 0 100% We have 100% coverage, which means every statement in our code was executed while the tests ran. Surely, that means we are well covered. You cannot get more than 100% coverage, right? What are you going to do, write tests for code that has not been written yet? Well, actually, that is a thing, and it is called Test-Driven Development. 
But TDD is a topic for another time.\nDespite having 100% coverage, we are actually still not well covered, which brings us to the second issue we mentioned earlier. What if a developer entered a negative discount? The discounted price would become larger than the original price, which is a bug. But our tests currently do not catch this. Like before, the source of the problem is the Tier data. So we update our test function so that it catches this case.\ntest_pricing.py 1 2 3 4 5 6 7 8 9 10 11 12 import pytest from pricing import Tier, apply_discount def test_apply_discount_happy_path(): assert apply_discount(100.0, Tier.GOLD) == 80.0 @pytest.mark.parametrize(\u0026#34;tier\u0026#34;, Tier) def test_tier_is_valid_discount_rate(tier): assert 0.0 \u0026lt; tier \u0026lt; 1.0 $ pytest test_pricing.py =============== test session starts ================ configfile: pyproject.toml collected 6 items test_pricing.py ...... [100%] ================ 6 passed in 0.01s ================= This helps us understand that coverage is good for finding untested code, but it is not very useful as a single number that tells you how good your tests are. In a way, we have translated the question, \u0026ldquo;How many tests do we need?\u0026rdquo; into something a little more tangible, like, \u0026ldquo;What percentage of statements should be covered by our tests?\u0026rdquo; That is still useful progress. In many codebases, you will see something like 80% coverage used as a minimum requirement.\nThere are also great teams that do not enforce a lower bound, because a coverage target can be easy to game, especially now that boilerplate tests are easy to generate with LLMs. What matters more is the quality of the tests, since 100% coverage may not be enough. There is no single ideal number. Even at 100% coverage, we can still miss edge cases, so testing needs to be thought through carefully. 
Teams should choose targets that fit their risk level and business needs.\nMutation Testing # So, if coverage alone is not enough, and \u0026ldquo;you should think about testing carefully\u0026rdquo; is not exactly easy to act on, is there any low-hanging fruit we can pick? The next thing people often reach for is mutation testing. Mutation testing asks a tougher question: if I make a small change to the code, do my tests notice? There are libraries such as mutmut that can do this for you.\nFor example, imagine a mutant changes the apply_discount logic:\nreturn price * (1 - tier) into this:\nreturn price * (1 + tier) If your tests still pass, that is a red flag. This means that your tests were not really checking the behavior closely enough. In our case, the happy-path test catches it.\n$ pytest test_pricing.py ========================== test session starts ============================== configfile: pyproject.toml collected 6 items test_pricing.py F..... [100%] ================================ FAILURES =================================== _____________________ test_apply_discount_happy_path ________________________ def test_apply_discount_happy_path(): \u0026gt; assert apply_discount(100.0, Tier.GOLD) == 80.0 E assert 120.0 == 80.0 E + where 120.0 = apply_discount(100.0, \u0026lt;Tier.GOLD: 0.2\u0026gt;) test_pricing.py:6: AssertionError ========================= short test summary info =========================== FAILED test_pricing.py::test_apply_discount_happy_path - assert 120.0 == 80.0 ======================== 1 failed, 5 passed in 0.03s ======================== The mutant was killed. So, our tests are sensitive enough to detect this kind of change.\nProperty-Based Testing # Earlier, we stated a property that the discount tiers should satisfy, and tested it. Similarly, apply_discount has properties it should satisfy. 
Since our tier values are valid discounts between zero and one, for any positive price the discounted result should also stay strictly between zero and the original price. Because the discount tier has a small, fixed set of members, we could test the property for every value. But apply_discount also takes a price argument, which makes the input space too big to test exhaustively. That is where we can go one step further with the Hypothesis library, or more generally, property-based testing. You describe your hypothesis or a property that should hold for many inputs, including edge cases you may not have thought of.\ntest_pricing.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 import pytest from hypothesis import given, strategies as st from pricing import Tier, apply_discount def test_apply_discount_happy_path(): assert apply_discount(100.0, Tier.GOLD) == 80.0 @pytest.mark.parametrize(\u0026#34;tier\u0026#34;, Tier) def test_tier_is_valid_discount_rate(tier): assert 0.0 \u0026lt; tier \u0026lt; 1.0 @given( price=st.floats(min_value=0, exclude_min=True), tier=st.sampled_from(Tier), ) def test_discounted_price_stays_between_zero_and_original(price, tier): assert 0.0 \u0026lt; apply_discount(price, tier) \u0026lt; price This test does not hardcode a single example. It expresses a rule that should always be true. 
Let\u0026rsquo;s run it.\n$ pytest test_pricing.py -v --tb=short =============================== test session starts ================================ configfile: pyproject.toml plugins: hypothesis-6.151.11 collected 7 items test_pricing.py::test_apply_discount_happy_path PASSED [ 14%] test_pricing.py::test_tier_is_valid_discount_rate[Tier.DIAMOND] PASSED [ 28%] test_pricing.py::test_tier_is_valid_discount_rate[Tier.PLATINUM] PASSED [ 42%] test_pricing.py::test_tier_is_valid_discount_rate[Tier.GOLD] PASSED [ 57%] test_pricing.py::test_tier_is_valid_discount_rate[Tier.SILVER] PASSED [ 71%] test_pricing.py::test_tier_is_valid_discount_rate[Tier.BRONZE] PASSED [ 85%] test_pricing.py::test_discounted_price_stays_between_zero_and_original FAILED [100%] E assert 5e-324 \u0026lt; 5e-324 E + where 5e-324 = apply_discount(5e-324, \u0026lt;Tier.DIAMOND: 0.4\u0026gt;) ============================= 1 failed, 6 passed in 0.11s ========================== It failed. And we did not write that test case. Hypothesis generated it. The price 5e-324 is a subnormal float, the smallest positive value IEEE 754 can represent. At that scale, apply_discount(5e-324, 0.4) rounds back to 5e-324 because the result is too small to represent with any more precision. 
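We can reproduce the rounding outside of pytest (a minimal sketch):

```python
# 5e-324 is the smallest positive subnormal double in IEEE 754
tiny = 5e-324

# the exact product is smaller than tiny but not representable,
# so it rounds back up to tiny itself
assert tiny * (1 - 0.4) == tiny
```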
So the strict \u0026lt; price assertion fails since the discounted price equals the original.\nWe can disallow subnormal floats in the strategy.\n@given( price=st.floats(min_value=0, exclude_min=True, allow_subnormal=False), tier=st.sampled_from(Tier), ) def test_discounted_price_stays_between_zero_and_original(price, tier): assert 0.0 \u0026lt; apply_discount(price, tier) \u0026lt; price $ pytest test_pricing.py --tb=short ========================== test session starts =========================== configfile: pyproject.toml collected 7 items test_pricing.py ......F [100%] E assert inf \u0026lt; inf E + where inf = apply_discount(inf, \u0026lt;Tier.DIAMOND: 0.4\u0026gt;) ======================== 1 failed, 6 passed in 0.12s ===================== It failed again, with a different input. This time Hypothesis tried infinity. Infinity times anything is infinity, so apply_discount(inf, 0.4) returns inf, and inf \u0026lt; inf is false. We disallow infinity too.\n@given( price=st.floats(min_value=0, exclude_min=True, allow_subnormal=False, allow_infinity=False), tier=st.sampled_from(Tier), ) def test_discounted_price_stays_between_zero_and_original(price, tier): assert 0.0 \u0026lt; apply_discount(price, tier) \u0026lt; price $ pytest test_pricing.py ========================== test session starts =========================== configfile: pyproject.toml collected 7 items test_pricing.py ....... [100%] ========================= 7 passed in 0.10s ============================= pytest reports seven tests here, but that last Hypothesis test is actually exercising many generated inputs under the hood. Hypothesis kept finding edge cases in the float space until we constrained the strategy enough. This reveals a real problem in our code: accepting any float as a price without validation is dangerous. We will talk about parsing and validation in the next chapter.\nMC/DC # So far, everything we have talked about comes from the world of everyday software development. 
It is worth noting that safety-critical systems use a much stricter standard called MC/DC, which stands for Modified Condition/Decision Coverage. If you are building flight control software, medical devices, or autonomous defense systems, regulators may require MC/DC. So if you plan to work in defense tech or at a \u0026ldquo;dual-use\u0026rdquo; company like Anduril, Palantir, or OpenAI, take notes.\nThe idea is that executing every statement or branch is not enough. You need to show that each individual condition in a decision independently affects the outcome.\nHere is an example:\ndef can_launch(is_armed: bool, safety_key_inserted: bool) -\u0026gt; bool: if is_armed and safety_key_inserted: return True return False Now suppose you write these two parametrized tests:\ntest_launch.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 import pytest from launch import can_launch @pytest.mark.parametrize( \u0026#34;is_armed, safety_key_inserted, expected\u0026#34;, [ (True, True, True), (False, False, False), ], ) def test_can_launch(is_armed, safety_key_inserted, expected): assert can_launch(is_armed, safety_key_inserted) is expected Those tests give you 100% statement and branch coverage, because both the true and false paths run. But they do not satisfy MC/DC. Why not? Because both conditions changed at the same time. Neither condition has been shown to independently flip the outcome.\nTo satisfy MC/DC, you add the missing test vectors:\ntest_launch.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 import pytest from launch import can_launch @pytest.mark.parametrize( \u0026#34;is_armed, safety_key_inserted, expected\u0026#34;, [ (True, True, True), (False, True, False), (True, False, False), ], ) def test_can_launch(is_armed, safety_key_inserted, expected): assert can_launch(is_armed, safety_key_inserted) is expected Now each condition has been shown to independently flip the result. 
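The independence requirement can also be checked mechanically: hold one condition fixed, flip the other, and the outcome must change. A minimal sketch:

```python
def can_launch(is_armed: bool, safety_key_inserted: bool) -> bool:
    # same decision logic as the example above
    return is_armed and safety_key_inserted

# is_armed independently flips the outcome (safety key held at True)
assert can_launch(True, True) != can_launch(False, True)

# safety_key_inserted independently flips the outcome (armed held at True)
assert can_launch(True, True) != can_launch(True, False)
```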
MC/DC also requires completeness: each condition must take both true and false values across the test suite. Our three vectors satisfy that too, since is_armed and safety_key_inserted each appear as both True and False.\nEvery test we wrote so far was for small pieces of our code in isolation, which are called unit tests. How do we test our code at larger scales, for example to check how different parts of our system work together? To answer that, let\u0026rsquo;s look at the different kinds of testing and how they relate to each other.\nTesting Pyramid # The different kinds of testing are usually shown as a pyramid with three layers.\nAt the base are unit tests. As we saw, these test individual functions or methods in isolation, or more generally, the smallest units of code. What counts as the smallest unit is not always obvious. A useful rule of thumb is to think of it as the smallest piece of logic that does one coherent thing, such as making a decision, transforming data, or applying a rule, and can still be tested on its own. Unit tests are fast, small, and relatively cheap to write and run, so they usually make up the largest portion of your test suite.\nThe middle layer is integration tests. These verify that multiple components work together correctly. For example, that might mean testing a function that reads from a database, a service that calls an external API, or a pipeline that pulls data, applies transformations, and writes the resulting features to a feature store. These tests are slower and broader than unit tests, but they are valuable because many production failures happen at the boundaries between components.\nAt the top are end-to-end tests, or e2e for short. These simulate real user workflows across the entire system. For example, that might mean testing that a user can complete a checkout flow from start to finish. 
In an LLM application, it could mean testing the full path from a user question to document retrieval, prompt construction, answer generation, and finally showing that answer in the UI. These tests are the slowest, broadest, and most expensive to maintain, so you usually want fewer of them.\nThere are many other types of tests, including component tests, functional tests, acceptance tests, performance tests, regression tests, and security tests. We focus on unit, integration, and end-to-end because they map to increasing scope and cost, and they build the right intuition for everything else.\nIntegration and E2E Tests # Let\u0026rsquo;s see what integration testing looks like by returning to clean_text. Earlier, we said there were more bugs hiding in it.\nThe ﬁ ligature is not actually a bug: casefold() already decomposes it. The real issues are subtler. \u0026ldquo;café\u0026rdquo; can be encoded as a single character or as \u0026ldquo;e\u0026rdquo; plus a combining accent, and our function treats those differently. The fix is unicodedata.normalize(\u0026quot;NFKC\u0026quot;, text). There is also an ordering bug: collapsing whitespace before removing punctuation leaves double spaces, and punctuation removal can leave leading and trailing spaces. Swap the two regex lines and move strip() to the end to fix that.\nHere is the corrected version:\npreprocess.py 1 2 3 4 5 6 7 8 9 10 11 12 import re import unicodedata def clean_text(text: str) -\u0026gt; str: \u0026#34;\u0026#34;\u0026#34;Normalize text for dedup.\u0026#34;\u0026#34;\u0026#34; text = unicodedata.normalize(\u0026#34;NFKC\u0026#34;, text) text = text.casefold() text = re.sub(r\u0026#34;[^\\w\\s]\u0026#34;, \u0026#34; \u0026#34;, text) # replace punctuation text = re.sub(r\u0026#34;\\s+\u0026#34;, \u0026#34; \u0026#34;, text) # collapse whitespace text = text.strip() return text In a real system, clean_text does not exist in isolation. 
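Here is a quick sanity check that the corrected function handles both café encodings and the ordering issue. The function body is a condensed, self-contained copy of clean_text above, repeated only so the snippet runs on its own:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Condensed copy of the corrected clean_text above."""
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation first
    text = re.sub(r"\s+", " ", text)      # then collapse whitespace
    return text.strip()

# composed é (U+00E9) vs e + combining accent (U+0301)
assert clean_text("café") == clean_text("cafe\u0301")

# the ordering fix: punctuation removal no longer leaves double spaces
assert clean_text("hello , world") == "hello world"
```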
For example, cleaned text could flow into an encoder, which would then write vectors to a database. Integration testing checks that these components work together across boundaries, not just individually. If an integration test fails, you know the pipeline is broken, but not which component caused it.\nTake two strings that look different but should produce the same result after cleaning:\n\u0026#34; Die Straße , cafe\\u0301\\n\\n machine-learning. \u0026#34; \u0026#34;die strasse café machine learning\u0026#34; Both go through clean_text, then through an encoder. If cleaning works correctly, both produce identical vectors.\ntest_pipeline_integration.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 import pytest from preprocess import clean_text @pytest.mark.integration def test_equivalent_inputs_produce_same_vector(): raw_a = \u0026#34; Die Straße , cafe\\u0301\\n\\n machine-learning. \u0026#34; raw_b = \u0026#34;die strasse café machine learning\u0026#34; cleaned_a = clean_text(raw_a) cleaned_b = clean_text(raw_b) vec_a = encode(cleaned_a) # your embedding function vec_b = encode(cleaned_b) assert vec_a == vec_b # same meaning, same vector When an integration test depends on external systems like a database or an API, you can use mocks to simulate them. Python\u0026rsquo;s unittest.mock library, with Mock and MagicMock, is the standard tool for this.\nAn end-to-end test has the same idea but goes further. For example, you might test the entire flow: raw input comes in, gets cleaned and encoded, is stored in a database, the vectors are used for RAG, and the user sees the result in the UI. One difference from integration tests is that you would do this with no mocks, unless absolutely necessary, to test the real system as closely as possible.\nPractical Tips # Test behavior, not implementation. Do not test that casefold() was called; someone can implement the same behavior differently. 
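To make the earlier mocking idea concrete: at an external boundary, a mock lets you assert on the contract with the outside system, what your code sends across the boundary, rather than on internals. A minimal sketch with a hypothetical encoder boundary (the embed_cleaned helper is illustrative, not from the pipeline above):

```python
from unittest.mock import Mock

def embed_cleaned(text: str, encoder) -> list[float]:
    # toy pipeline step: normalize, then call the encoder boundary
    return encoder.encode(text.casefold().strip())

# simulate the external embedding service
encoder = Mock()
encoder.encode.return_value = [0.1, 0.2, 0.3]

vec = embed_cleaned("  Hello  ", encoder)
assert vec == [0.1, 0.2, 0.3]
# the boundary received exactly the cleaned text
encoder.encode.assert_called_once_with("hello")
```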
Think of it as Given-When-Then, without referring to the implementation: given some setup, when I run the function, then I expect this output.\nNo need to test built-in or popular third-party libraries. They already have large test suites and are battle-tested by millions of users. Test how your code uses the libraries and integrates with them. We did this with clean_text by testing the output of our function, not whether re.sub() or casefold() work correctly on their own.\nWhen you start working on an untested codebase, write a couple of end-to-end tests first. They catch big breakages immediately. Then, as you work on specific parts, add integration and unit tests before modifying the existing code. This flips the pyramid, but it is a pragmatic approach.\nLabel your tests by type. Unit tests should run frequently: on every save, every commit, every push. Integration and end-to-end tests are slower and often depend on external systems, so you usually run those only in CI. pytest markers like @pytest.mark.integration or @pytest.mark.e2e make it easy to select which tests to run locally and which to leave for the pipeline.\nUse AI to accelerate testing. Developers have always hated writing tests; for most, it is boring, repetitive work. Now AI can draft a solid first set of tests pretty quickly. Tell it what kind of testing you want: if you ask for coverage targets, mutation testing, or property-based testing by name, the results are much better than just saying \u0026ldquo;write tests.\u0026rdquo;\nBe careful when testing floating-point values. We saw Hypothesis catch subnormal floats and infinity in our pricing tests. 
The same issues come up constantly in data work: model metrics, aggregations, financial calculations.\n$ python -c \u0026#34;print(0.1 + 0.2 == 0.3)\u0026#34; False Use approximate comparisons instead of exact equality with float values; pytest provides pytest.approx for this.\ntest_float_approx.py 1 2 3 4 5 import pytest def test_floating_point_equality(): assert 0.1 + 0.2 == pytest.approx(0.3) NumPy has numpy.testing.assert_allclose for arrays; PyTorch has torch.testing.assert_close. If you do numerical work, approximate comparisons should be your default.\nConclusion # Previously, in the typing chapter, we added type annotations to catch bugs that testing alone would miss. Today, we learned about adding tests to verify correctness and catch the kinds of bugs that type checkers cannot see. Together, typing and testing catch most issues in our own code before they reach production.\nBut in real systems, a lot of the data comes from sources we cannot fully control: API requests, config files, CSV uploads, user input. Nothing stops someone from sending {\u0026quot;price\u0026quot;: 0, \u0026quot;tier\u0026quot;: \u0026quot;DIAMOND\u0026quot;} to your endpoint. We spent an entire chapter hardening clean_text, but how do we even know the file we read to clean is a valid string? What if it is binary, or truncated, or in an encoding our code does not expect?\nIn the next chapter, we will talk about parsing and validation to complete what I call the typing, testing, and parsing holy trinity. We will see how to take raw, untrusted input and turn it into data our code can actually rely on.
See you there.\nResources # Tooling # Testing Frameworks # unittest pytest Test Quality # coverage.py mutmut Hypothesis Mocking # unittest.mock Numerical Testing # pytest.approx numpy.testing.assert_allclose torch.testing.assert_close Testing Concepts # Mutation Testing Modified Condition/Decision Coverage Python Standard Library # str.casefold() unicodedata.normalize Standards # NIST IR 8397 Unicode Case Folding IEEE 754 ","date":"8 April 2026","externalUrl":null,"permalink":"/blog/posts/test-python-code/","section":"Blog","summary":"Even 100% test coverage can give you a false sense of confidence. Unit tests, coverage, mutation testing, property-based testing, and more. Here’s how to test Python code properly.","title":"How to Test Python Code","type":"blog"},{"content":"Senior Machine Learning Engineer at The Burning Glass Institute, building ML and NLP systems for labor market analytics.\nI have comb-shaped skills: deeper experience in Python, Software Engineering, ML/NLP, and Game Theory, with exposure to Cloud Architecture and Engineering, AI Systems Design, Data Engineering, and MLOps.\nIn a previous life, I was a quant in finance. I went into the tech side of quantitative finance with a goal of picking up transferable engineering and machine learning skills. I worked on high-frequency trading and market-neutral hedge fund strategies, where I was lucky enough to be surrounded by great engineers and learned the trade (pun intended) by doing. Once I\u0026rsquo;d done that, I went back to being surrounded by great economists at The Burning Glass Institute. I find it refreshing now to be working at a nonprofit after years where profit was the only objective.\nI\u0026rsquo;ve also been doing my PhD on the side in Economics. 
My research includes game theory, matching theory, and decision theory.\n","date":"8 April 2026","externalUrl":null,"permalink":"/","section":"Oral Ersoy Dokumaci","summary":"Senior Machine Learning Engineer at The Burning Glass Institute, building ML and NLP systems for labor market analytics.\nI have comb-shaped skills: deeper experience in Python, Software Engineering, ML/NLP, and Game Theory, with exposure to Cloud Architecture and Engineering, AI Systems Design, Data Engineering, and MLOps.\n","title":"Oral Ersoy Dokumaci","type":"page"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/production/","section":"Tags","summary":"","title":"Production","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/software-engineering/","section":"Tags","summary":"","title":"Software-Engineering","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"8 April 2026","externalUrl":null,"permalink":"/tags/testing/","section":"Tags","summary":"","title":"Testing","type":"tags"},{"content":"","date":"18 March 2026","externalUrl":null,"permalink":"/tags/typing/","section":"Tags","summary":"","title":"Typing","type":"tags"},{"content":" Introduction # simple_example.py 1 2 3 my_variable = \u0026#34;Python In Production\u0026#34; my_variable = 2026 print(my_variable) This Python code assigns a string to a variable, then assigns an integer to the same variable, and finally prints its value. What do you think will happen if we run this? Will it run without errors?\nIf you\u0026rsquo;re familiar with Python even at a very basic level, this is a simple question. It will run without errors and just print 2026. 
In this example, the second assignment overwrites the first one, making my_variable an integer. You can also confirm this by printing type(my_variable) to see \u0026lt;class 'int'\u0026gt;.\nNow, let\u0026rsquo;s add a str type hint in the first line:\nsimple_example_typed.py 1 2 3 my_variable: str = \u0026#34;Python In Production\u0026#34; my_variable = 2026 print(my_variable) We annotated my_variable as a string, but in the next line we assigned an integer. This time, do you think it will still run without errors? Think about it for a moment.\nAgain, yes: nothing magical will happen just because we added a type hint. Python is a dynamically typed language, which means the type of a variable is determined at runtime. Other well-known examples of dynamically typed languages include PHP, Ruby, Lua, R, and JavaScript. This contrasts with statically typed languages, where the type of a variable is fixed at compile time, such as in C, C++, Java, Go, Rust, and TypeScript.\nWhere it gets interesting is that we can use static type checkers like mypy, pyright, and ty to catch potential runtime errors even before running the code!\nIf we run mypy\n$ mypy simple_example_typed.py we get an error:\nerror: Incompatible types in assignment (expression has type \u0026#34;int\u0026#34;, variable has type \u0026#34;str\u0026#34;) [assignment] But how is this helpful? Why are we getting an error? After all, the above code is perfectly valid Python and Python lets us do this. Why should we care if a type checker complains about it?\nToday we\u0026rsquo;re going to answer these questions and talk about static typing, or simply typing, in Python. This is one of the most important modern approaches to writing high-quality, production-ready Python code.
The dynamic nature of Python is good for development, scripting, and quick prototyping, but we will see that for production code we want to be explicit about the types of our variables and do not want to change their types at runtime. Using typing, your code becomes less prone to bugs, easier to understand, and easier to maintain. We will also see how you can utilize AI coding tools together with typing to generate much better Python code.\nOur approach is not to provide a tutorial on typing syntax, but rather to help you understand what typing actually does and why it is important. As AI coding assistants take on more of the actual code writing, understanding the essential ideas behind typing matters even more than learning syntax. We will also not go into the details of different type checkers and how they compare, but below you will find links to some of them frequently utilized in production workflows. Because the tooling around Python and software development is constantly evolving, our north star is to focus on the core ideas that will stay with you for years to come. Given the current pace of advancements in AI, years is an eternity. So, you will greatly benefit from understanding these ideas if your work involves using Python. Whether you\u0026rsquo;re a data scientist, ML engineer, data engineer, or software engineer (whatever this title will be in the future), type hints will make your Python code noticeably better and can directly or indirectly help you move into more senior roles.\nStatic Typing in a Codebase # In a codebase with many variables and functions, it quickly becomes difficult to keep track of types manually. It\u0026rsquo;s time-consuming, it adds cognitive load, and it\u0026rsquo;s error-prone. This is true not only for us humans but also for language models.
For example, if a string variable in our codebase is later changed to an integer and we call a string method on it, we will get an error at runtime.\nAs we just saw, static typing catches these errors that would otherwise be deferred to runtime, making it one of the most effective tools for Python production code. We do not need to run Python to get the error; we just need to run a static type checker. In fact, in production workflows, a Continuous Integration pipeline will run a static type checker to catch these bugs automatically before running, merging, or even committing the code. In a future lesson, we will talk about CI in more detail.\nA Latent Bug Example # Here\u0026rsquo;s an example where static typing makes a real difference.\nSuppose we have three membership tiers: gold, silver, and bronze, and a function that applies a discount based on a user\u0026rsquo;s membership tier.\nlatent_bug.py 1 2 3 4 5 6 7 8 9 10 11 12 def apply_discount(price, tier): discounts = { \u0026#34;gold\u0026#34;: 0.2, \u0026#34;silver\u0026#34;: 0.1, \u0026#34;bronze\u0026#34;: 0.05, } discount = discounts[tier] return price * (1 - discount) discounted_price = apply_discount(100.0, \u0026#34;gold\u0026#34;) print(discounted_price) If you test this code, it works as expected.\n$ python latent_bug.py 80.0 Suppose a new \u0026ldquo;platinum\u0026rdquo; tier is added.\n$ python -c \u0026#34;from latent_bug import apply_discount; \\ discounted_price = apply_discount(100.0, \u0026#39;platinum\u0026#39;)\u0026#34; KeyError: \u0026#39;platinum\u0026#39; We get a runtime crash, on a payment path!\nYou may be wondering: how come we added a new tier without updating this function? Isn\u0026rsquo;t it our job to update the code? Yes, but this is a very common mistake, especially in large codebases. Otherwise, we would never have type errors in production, and the world would be a much better place.\nUsing Literal Types # Now let\u0026rsquo;s add type hints. Price is a float.
We use Literal to restrict the tier parameter to exactly the values we support. The return type is also a float.\nlatent_bug_literal.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 from typing import Literal def apply_discount( price: float, tier: Literal[\u0026#34;gold\u0026#34;, \u0026#34;silver\u0026#34;, \u0026#34;bronze\u0026#34;], ) -\u0026gt; float: discounts = { \u0026#34;gold\u0026#34;: 0.2, \u0026#34;silver\u0026#34;: 0.1, \u0026#34;bronze\u0026#34;: 0.05, } discount = discounts[tier] return price * (1 - discount) discounted_price = apply_discount(100.0, \u0026#34;platinum\u0026#34;) print(discounted_price) The code would still crash at runtime, but a static type checker catches this.\n$ mypy latent_bug_literal.py error: Argument 2 to \u0026#34;apply_discount\u0026#34; has incompatible type \u0026#34;Literal[\u0026#39;platinum\u0026#39;]\u0026#34;; expected \u0026#34;Literal[\u0026#39;gold\u0026#39;, \u0026#39;silver\u0026#39;, \u0026#39;bronze\u0026#39;]\u0026#34; [arg-type] The type checker sees that \u0026quot;platinum\u0026quot; is not one of the allowed values and flags it immediately. This is the kind of bug that is trivial for a machine to find but surprisingly easy for a human to miss.\nBut notice that this only works because we used Literal to restrict the type. If we had hinted tier as just str, the type checker would see nothing wrong. The string \u0026quot;platinum\u0026quot; is a perfectly valid str type, so type checkers would report no errors and the bug would still make it to production.
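To see the difference, here is a sketch of that weaker signature (apply_discount_weak is our illustrative name, not from the lesson). A type checker accepts the "platinum" call because any string satisfies str, so the KeyError survives until runtime:

```python
def apply_discount_weak(price: float, tier: str) -> float:
    # tier is plain str, so "platinum" is a perfectly valid argument
    # as far as a static type checker is concerned.
    discounts = {"gold": 0.2, "silver": 0.1, "bronze": 0.05}
    return price * (1 - discounts[tier])  # KeyError at runtime for unknown tiers

apply_discount_weak(100.0, "gold")  # fine: returns 80.0
# apply_discount_weak(100.0, "platinum")  # passes type checking, raises KeyError
```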
Just adding some type hints is not going to save us if we don\u0026rsquo;t properly restrict the types.\nYou might also be tempted to make some runtime checks like this:\nlatent_bug_literal.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 from typing import Literal def apply_discount( price: float, tier: Literal[\u0026#34;gold\u0026#34;, \u0026#34;silver\u0026#34;, \u0026#34;bronze\u0026#34;], ) -\u0026gt; float: discounts = { \u0026#34;gold\u0026#34;: 0.2, \u0026#34;silver\u0026#34;: 0.1, \u0026#34;bronze\u0026#34;: 0.05, } if tier not in discounts: raise ValueError(f\u0026#34;Invalid tier: {tier}\u0026#34;) discount = discounts[tier] return price * (1 - discount) discounted_price = apply_discount(100.0, \u0026#34;platinum\u0026#34;) print(discounted_price) This looks more robust, but it is a bad idea. We did not solve our problem, we just replaced a KeyError with a ValueError that we still need to catch at runtime. And how would we handle it? Apply no discount? Apply the largest discount? No answer will make sense because this is a business logic error. On top of that, the tier input to this function is not external, it does not come from a user or a config file. It comes from within our own code. A runtime guard for something our own code controls just bloats the function for no benefit. This is an example where static type checking is a better tool than runtime checks: it catches the mistake at the call site, before the code ever runs.\nFully Typed Version # Should we also add type hints to the discounts dictionary and the discount variable inside the function? Yes, it is a good practice to do so. 
Here is the fully typed version:\nlatent_bug_fully_typed.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 from typing import Literal def apply_discount( price: float, tier: Literal[\u0026#34;gold\u0026#34;, \u0026#34;silver\u0026#34;, \u0026#34;bronze\u0026#34;], ) -\u0026gt; float: discounts: dict[Literal[\u0026#34;gold\u0026#34;, \u0026#34;silver\u0026#34;, \u0026#34;bronze\u0026#34;], float] = { \u0026#34;gold\u0026#34;: 0.2, \u0026#34;silver\u0026#34;: 0.1, \u0026#34;bronze\u0026#34;: 0.05, } discount: float = discounts[tier] return price * (1 - discount) The type checker can often infer the types of local variables from their assignments, so these inner annotations are not strictly necessary. But being explicit makes the code self-documenting. Anyone or any language model reading this function can immediately see what every variable is supposed to be without tracing through the logic. It also gives us more ground to catch bugs. This time, even if someone changed the tier parameter back to str in the function signature, the type checker would still catch the problem at the discounts[tier] line:\nerror: Invalid index type \u0026#34;str\u0026#34; for \u0026#34;dict[Literal[\u0026#39;gold\u0026#39;, \u0026#39;silver\u0026#39;, \u0026#39;bronze\u0026#39;], float]\u0026#34;; expected type \u0026#34;Literal[\u0026#39;gold\u0026#39;, \u0026#39;silver\u0026#39;, \u0026#39;bronze\u0026#39;]\u0026#34; [index] Because the dictionary\u0026rsquo;s key type is Literal[\u0026quot;gold\u0026quot;, \u0026quot;silver\u0026quot;, \u0026quot;bronze\u0026quot;], it refuses to be indexed by an arbitrary str. The more thoroughly you type your code, the more layers of protection you get.\nOpen-Closed Principle # Notice there\u0026rsquo;s something deeper going on here. 
The fact that this function breaks whenever a tier it doesn\u0026rsquo;t know about shows up is a violation of the open-closed principle, one of the well-known SOLID software design principles. The open-closed principle says that code should be open for extension but closed for modification. You should be able to add new behavior without changing existing code. Here, every time a new payment tier is added, this function or any function that uses a tier needs to change. If someone forgets to update the relevant functions, the program will crash at runtime. The type checker is not just catching a bug, but it\u0026rsquo;s also pointing at a design problem.\nSo, another benefit of typing is that it can nudge you toward better design. Design patterns can help you satisfy the open-closed principle, but until you implement one, typing will still be there to catch these bugs. In a future lesson, we will talk about design patterns in more detail, but let\u0026rsquo;s do a better fix right now.\nInstead of a dictionary lookup, we can encode the discount values directly into the type system using an enum:\nlatent_bug_enum.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 from enum import Enum class Tier(float, Enum): GOLD = 0.2 SILVER = 0.1 BRONZE = 0.05 def apply_discount(price: float, tier: Tier) -\u0026gt; float: return price * (1 - tier) apply_discount(100.0, Tier.PLATINUM) The function no longer knows or cares about which specific tiers exist. It just takes a Tier and applies it. If someone adds PLATINUM = 0.3 to the enum, the function doesn\u0026rsquo;t need to change; it will just work. That\u0026rsquo;s the open-closed principle in action.
Moreover, the type checker still has your back:\n$ mypy latent_bug_enum.py error: \u0026#34;type[Tier]\u0026#34; has no attribute \u0026#34;PLATINUM\u0026#34; [attr-defined] It tells you that PLATINUM doesn\u0026rsquo;t exist on Tier yet.\nSo let\u0026rsquo;s add it:\nlatent_bug_enum_fixed.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 from enum import Enum class Tier(float, Enum): PLATINUM = 0.3 GOLD = 0.2 SILVER = 0.1 BRONZE = 0.05 def apply_discount(price: float, tier: Tier) -\u0026gt; float: return price * (1 - tier) apply_discount(100.0, Tier.PLATINUM) One line added to the enum. The apply_discount function didn\u0026rsquo;t change, and no other code that uses membership tiers needs to change either. The type checkers are happy. The call returns 70.0, a 30% discount, exactly as expected, and most importantly, the platinum user is happy.\nPractical Tips # How much should you use type hints? If you\u0026rsquo;re new to typing, use them as much as possible. An additional benefit of adding type hints is that you will also need to think about the data structures you use and the operations you perform on them.\nIf you\u0026rsquo;re working with an existing untyped codebase, you don\u0026rsquo;t need to add type hints everywhere at once. Start by typing the most important parts, such as public function signatures, module boundaries, and areas where bugs tend to appear. This is called gradual typing, and it\u0026rsquo;s one of the strengths of Python\u0026rsquo;s approach. Most type checkers support this incremental adoption by allowing you to configure them to only check files that already have annotations.\nIf you are in a job that involves working with data, you will often work with notebooks. Static type checkers may not run natively on notebooks, since notebooks are JSON files rather than plain Python source that type checkers can parse.
But one of the greatest advantages of open source software development is that someone else has probably already solved your problem. For notebooks, you can use tools that link each notebook to a Python file, and then run the type checker on that Python file. Below you will find some links to tools that can do this.\nFinally, AI coding assistants can fully utilize typing to generate better code. This is just as simple as asking them to generate code with type hints and run a type checker on it. The main idea is to let AI know about the best practices for writing production Python code. We will cover more of these best practices in each lesson.\nHistory of Typing in Python # Python was created by Guido van Rossum and first released in 1991. Its core design principles were readability, simplicity, and elegance. \u0026ldquo;Beautiful is better than ugly\u0026rdquo; is the first line of the Zen of Python, and the community has a word for code that follows this philosophy: Pythonic.\nConsider what it takes to write a simple hello world program. In statically typed languages you need includes, main functions, class definitions, or explicit return types just to print something to standard out, whereas in Python you can just write a single line of code.\nhello_python.py print(\u0026#34;Hello, World!\u0026#34;) One part of that simplicity is dynamic typing. Variables don\u0026rsquo;t need type declarations, and functions don\u0026rsquo;t need signatures specifying what goes in and what comes out. This keeps the code concise, but as we already saw, it comes with tradeoffs once codebases grow. Untyped codebases with thousands of lines of code and dozens of contributors became harder to maintain, harder to refactor, and more prone to bugs that only appeared at runtime.\nPython is a community-driven language, and to address this growing need, the community responded in 2014 with PEP 484 and the typing module.
For the first time, Python had an official vocabulary for expressing types, and third-party static type checkers could use that vocabulary to check your code before it runs. Since then, the typing system has continued to evolve, with new features arriving in nearly every Python release.\nIt\u0026rsquo;s worth mentioning how JavaScript solved the same problem. JavaScript, another dynamically typed language, faced identical growing pains. The solution there was TypeScript, a superset of JavaScript. You write .ts files, run them through a transpiler, and out comes .js.\nPython took a different approach. Instead of creating a new language, it embedded type annotations directly into the existing syntax. There is no TypedPython, no separate .tpy extension, no transpiler. A type-annotated Python file is still just a Python file. The annotations are there, but they don\u0026rsquo;t change how the code executes.\nThis separation is an intentional design choice, and there are good reasons for it. First, it preserves backwards compatibility. Existing Python code doesn\u0026rsquo;t break when you add type hints. Second, it keeps the interpreter simple and fast, with no type-checking overhead at runtime. Third, and perhaps most importantly, it lets the type-checking ecosystem evolve independently and faster than the language itself. Python release cycles are long, but type checkers can ship improvements every week. Python\u0026rsquo;s approach turned out to be elegant yet still powerful enough.\nUnder the Hood # If you made it this far, it is a good time to look under the hood to understand what\u0026rsquo;s actually happening when Python encounters type hints.\nIn Python everything is an object, and every object has a type. 99.9% of the Python code you will encounter in the wild is executed by CPython, the reference implementation of Python that is written in C. So, Python is unironically executed by a statically typed language.
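The claim that everything is an object with a type is easy to verify from a REPL; a quick sketch:

```python
x = 2026
print(type(x))                # <class 'int'>
print(isinstance(x, object))  # True: ints are objects
print(type(type(x)))          # <class 'type'>: even types themselves are objects

s = "Python in Production"
print(isinstance(s, object))  # True: so are strings, functions, and modules
```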
Another contrast between Python and C is that Python is a strongly typed language, while C is a weakly typed language, but this is a topic for another lesson.\nIn CPython, every value is represented by a C struct that starts with a PyObject header.\n// Include/object.h (CPython 3.14) typedef struct _object { Py_ssize_t ob_refcnt; PyTypeObject *ob_type; } PyObject; This header contains just two fields: a reference count (ob_refcnt) that tracks how many references point to this object, and a type pointer (*ob_type) that tells Python what kind of object this is. The actual data lives in extended structs like PyLongObject for ints and PyUnicodeObject for strings. These embed PyObject at the top so they can all be treated uniformly through pointer casting. This is the \u0026ldquo;everything is an object\u0026rdquo; part of Python.\nWithout type hints, when you write x = \u0026quot;Python in Production\u0026quot;, Python allocates a PyUnicodeObject on the heap to hold the string, and the name x gets mapped to that object. How that mapping works depends on scope, and is beyond the scope (pun intended) of this lesson. What matters here is that the type information lives inside the object itself, not attached to the name x. This is what makes Python dynamically typed at its core. When you reassign x = 2026, Python allocates a new PyLongObject on the heap for the integer and updates the mapping so x now maps to the new object. The name x doesn\u0026rsquo;t care what type it refers to. The old PyUnicodeObject has its reference count decremented immediately, and if no other references remain, it is deallocated on that same bytecode instruction. This is CPython\u0026rsquo;s primary memory management mechanism: reference counting.\nType hints don\u0026rsquo;t change any of what we just saw. When you write x: int = 2026 with the type hint, Python still allocates the PyLongObject for 2026 and maps the name x to it, exactly as before.
The annotation changes nothing about how the value is stored or how the variable behaves. Python stores the annotations as a dictionary, and you can inspect them yourself:\nannotations.py x: int = 2026 print(__annotate__(1)) $ python3.14 annotations.py {\u0026#39;x\u0026#39;: \u0026lt;class \u0026#39;int\u0026#39;\u0026gt;} So, type hints are metadata, not instructions. Python records them but doesn\u0026rsquo;t act on them, and they don\u0026rsquo;t affect runtime behavior or performance. External tools like type checkers and language servers read this metadata and enforce the types statically.\nConclusion # Today we covered the essential ideas behind typing in Python. Whether you\u0026rsquo;re building an API, a data pipeline, or a machine learning model, static typing will catch bugs that testing alone won\u0026rsquo;t, make your code self-documenting, and save you and your team time during code reviews.\nIf you\u0026rsquo;re not using type hints yet, start today. Pick one file, add annotations to its function signatures, and run a type checker. You might be surprised by what it finds.\nEarlier, we saw that typing catches classes of mistakes your tests forgot to cover. The opposite is also true: testing can catch some of the things that typing alone won\u0026rsquo;t, making them complementary, not substitute, tools.\nlatent_bug_enum_fixed.py 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 from enum import Enum class Tier(float, Enum): DIAMOND = 1.4 PLATINUM = 0.3 GOLD = 0.2 SILVER = 0.1 BRONZE = 0.05 def apply_discount(price: float, tier: Tier) -\u0026gt; float: return price * (1 - tier) discounted_price = apply_discount(100.0, Tier.DIAMOND) print(discounted_price) Here the developer added the diamond tier, but there is a typo: the discount value is larger than 1.
Now, if we call the apply_discount function with the diamond tier, we will have to give items away for free or, even worse, pay the customer for whatever item they bought, since the discounted price will be negative. Typing is not going to save us here, but testing will.\nIn the next lesson, we will look at testing your code. We will understand how typing and testing together form such a powerful pair for production Python code. See you there!\nResources # Tooling # Type Checkers # mypy pyright ty Notebook to Python # jupytext nbqa marimo notebooks Software Design Principles # Open-closed principle SOLID principles Python and Historical Context # CPython Python History BDFL Some Python Type Hints PEPs # PEP 3107 - Function Annotations PEP 484 - Type Hints PEP 526 - Variable Annotations PEP 544 - Structural Subtyping PEP 586 - Literal Types PEP 589 - TypedDict PEP 591 - Final Keyword PEP 604 - Allow writing union types as X | Y PEP 612 - Parameter Specification Variables PEP 613 - Explicit Type Aliases ","date":"18 March 2026","externalUrl":null,"permalink":"/blog/posts/use-static-typing/","section":"Blog","summary":"Type hints are metadata that Python records but doesn’t enforce. Static type checkers read that metadata and catch bugs before your code runs. Here’s why that matters.","title":"Why Static Typing Matters in Python","type":"blog"},{"content":"Lately I stopped trying to resist the pull of the AI coding black hole. By that I mean I do not code by hand anymore. I don\u0026rsquo;t even bother changing variable names, even though I have some meh-to-okay vim skills (shout out to ThePrimeagen for his vim series, but these skills are going to get obsolete pretty quickly1).
The news we hear about engineers at company after company only using AI to code is actually true.\nWhen I first thought about this black hole analogy, I was going to write something like \u0026ldquo;if you have large mass, like Linus Torvalds, you may be able to resist the pull of AI coding.\u0026rdquo; We know that wouldn\u0026rsquo;t have aged well. Yes, it wasn\u0026rsquo;t really production code, but it all started that way. Speaking of ThePrimeagen, I believe he is a very good programmer (relatively big mass) and genuine in the content he creates, in the sense that his primary drive is not generating money (his trajectory not pointing directly towards the black hole). But he too has been leaning into AI coding lately. Seems like we are all headed in the same direction.\nBut why did we resist, or still resist, AI coding? A lot of it comes down to identity. We define our abilities by the quality of the code we can produce. When a language model generates mediocre code on the first, second, or maybe even the fifth try, and we still use it in our sacred codebase because it was faster than writing it ourselves, it reinforces the feeling that relying on it diminishes us. At our core, we associate manual fluency with competence, so letting AI code for us feels like waving the white flag.\nThen there is the skill decay. We have learned and forgotten many skills and tools before, but this feels different. As developers, we build confidence around mastery of our primary language, its syntax and edge cases. Now we forget the basics of how to wield our primary weapon. If we are no longer the ones who remember every corner of the language, what are we? Our identity as developers is slipping away and being replaced with something else, and nobody knows what that something else is yet. I could mention other reasons too, but they mostly stem from identity, just reframed.\nGetting past this is hard, especially for ICs. 
If you became a manager at some point, you already went through this whole dance, despite how hard it was. Yet, you got the promotion (kudos to you by the way), which made it easier to accept the new reality. But an IC opted out of the manager role by choice, and now they are being forced to become the managers of AI agents with no apparent rewards. The people who are most excited and bullish about AI coding tend not to define themselves as developers in the first place.\nNow, all of this has been happening right under our noses, and it was not a single decision but mostly a drift. We used GitHub Copilot with Codex (the model) back in 2021 for code completion, and we remember how insane it felt at the time, something everyone takes for granted now. Then came ChatGPT in 2022 with the back-and-forth copy pasting, having it write tons of lines of docstrings and unit tests, finally doing the boring parts of the job, but hey not the actual job (no offense to QA engineers). Suddenly, in 2024 it slipped into our actual code logic with Cursor2. Finally in 2025, Antigravity, Codex (the agent), and Claude Code, drove the final nails in the coffin.\nSo, depending on your gravity and trajectory, either you have passed the event horizon at one of these milestones or you are about to. Like most of us, I think I had median developer gravity and a moderate trajectory somewhat pointing towards the black hole, and got sucked in within the last month. Linus Torvalds has considerable mass, and kernel development is a different kind of work (on the other hand, the Linux kernel is already making some room for AI-assisted contributions), but I believe he too will eventually get sucked in.\nThe weird part is how distorted the perception becomes. From inside the event horizon, we feel the productivity gains. Output goes up, things ship faster, we feel like 10x developers. But from the outside, time has moved on. People have already gotten used to this speed. 
What feels like a breakthrough to us looks like the new baseline to everyone else. When 10x is the norm from the outside, devs at 2x or even 5x start to look bad, and we wonder if we will make the cut.\nTo put some numbers on it: it took me a couple of hours on a Sunday afternoon to launch this website from scratch. It is not a page on Netlify, but a properly set up static site: Hugo on S3 behind CloudFront, all infrastructure in Terraform with remote state, OIDC federation between GitHub Actions and AWS so there are zero stored credentials, CloudFront access logs queryable through Athena, and CI/CD that runs terraform plan on PRs and deploys on release, bla bla bla. Nothing too fancy though, just done rightish. Writing this blog post took me twice as long.\nWhat about code complexity, maintainability, security, and all the other -ities? I initially thought these would take a while to solve. But right now we can throw some decent SDLC workflows with SDD and TDD into the mix and we are good to go. And even that level of involvement may not last long. Planning mode, hooks, skills, subagents, agent teams with PR agents, security agents, plugins: all of it will be automated. With all these agents, we might as well be in the Matrix.\nAmong those -ities, though, security is the one that gets talked about the most. It must be both a dream and a nightmare to be working in cybersecurity right now. It used to be the case that only managers did not prioritize security; now even half of the engineers let it go completely. Everybody is leaving their newly AI-made doors unlocked; at this point you can find a free API key for most services with a manual GitHub search. OpenClaw followed the infamous PlayStation Network playbook, dialed the security down to almost zero, and people still went crazy for the product. On the other hand, cybersecurity devs must feel the fear too, with all those security and pentesting agents coming for their jobs. 
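As a rough illustration of the zero-stored-credentials setup described above, the core of GitHub-to-AWS OIDC federation is an identity provider plus a role whose trust policy is scoped to one repo. This is only a sketch under assumptions, not the site\u0026rsquo;s actual Terraform, and the repo name is hypothetical:\n```hcl\n# Sketch: let GitHub Actions assume an AWS role via OIDC, no stored keys.\nresource "aws_iam_openid_connect_provider" "github" {\n  url             = "https://token.actions.githubusercontent.com"\n  client_id_list  = ["sts.amazonaws.com"]\n  # GitHub's published root CA thumbprint; verify against current AWS/GitHub docs.\n  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]\n}\n\ndata "aws_iam_policy_document" "github_trust" {\n  statement {\n    actions = ["sts:AssumeRoleWithWebIdentity"]\n    principals {\n      type        = "Federated"\n      identifiers = [aws_iam_openid_connect_provider.github.arn]\n    }\n    condition {\n      test     = "StringEquals"\n      variable = "token.actions.githubusercontent.com:aud"\n      values   = ["sts.amazonaws.com"]\n    }\n    condition {\n      test     = "StringLike"\n      variable = "token.actions.githubusercontent.com:sub"\n      # Hypothetical repo; restricts which repo (and refs) may assume the role.\n      values   = ["repo:example-user/example-site:*"]\n    }\n  }\n}\n\nresource "aws_iam_role" "deploy" {\n  name               = "github-actions-deploy"\n  assume_role_policy = data.aws_iam_policy_document.github_trust.json\n}\n```\nAttach a least-privilege policy to that role (S3 sync, CloudFront invalidation) and the workflow exchanges its short-lived OIDC token for credentials at run time.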
First OpenAI\u0026rsquo;s Aardvark, and now Anthropic\u0026rsquo;s Claude Code Security. Cybersecurity stocks lost 10% in 24 hours after that last one.\nHard to draw conclusions when we are still in the middle of it all. It also feels like an impossible task to keep up with everything happening in this new world. Andrej Karpathy coined the term \u0026ldquo;Vibe Coding\u0026rdquo;, but even he feels behind. So, try to remember that it is okay to feel overwhelmed.\nWhen the dust settles, we will all need to find our new identities. Maybe you will become a product person, or a forward-deployed engineer but for AI agent clients, or go offline and go blue collar. Maybe you will be lucky enough to still practice the lost art of manual coding, at a distant star orbiting the giant black hole of AI coding.\nI used vim motions in Cursor and Claude Code\u0026rsquo;s /vim to edit this post. Together with some occasional ssh sessions, these are probably the last places I will ever use vim. Vim as a daily development skill is losing much of its value fast.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\nCursor came out in 2023, but I use 2024 here since that\u0026rsquo;s when most people started using it seriously.\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"23 February 2026","externalUrl":null,"permalink":"/blog/posts/ai-coding-black-hole/","section":"Blog","summary":"There is no escape.","title":"AI Coding Black Hole","type":"blog"},{"content":"","date":"22 February 2026","externalUrl":null,"permalink":"/tags/meta/","section":"Tags","summary":"","title":"Meta","type":"tags"},{"content":"This site is live. I\u0026rsquo;ve been meaning to put together a personal site for a while, and here it is.\nI plan to write about two things primarily.\nThe first is production-grade Python development. There\u0026rsquo;s a gap between writing Python that works and writing Python that ships. 
Data scientists, ML engineers, and software engineers each approach the language differently, and I think there\u0026rsquo;s a lot of value in bridging those perspectives. I want to write about packaging, testing, typing, CI/CD, project structure, and the tooling choices that make Python codebases maintainable beyond a notebook or a prototype, and that are designed to work with current AI tools from the get-go. I\u0026rsquo;ve been building a production Python template that puts a lot of these ideas into practice.\nThe second is coding with AI. I use AI tools daily in my work, and I have opinions on where they help, where they don\u0026rsquo;t, and how to get the most out of them without losing the ability to reason about your own code.\nThere will probably be other things too: cloud architecture, ML systems, and whatever else I find worth writing about. But those two topics are where I\u0026rsquo;ll start.\n","date":"22 February 2026","externalUrl":null,"permalink":"/blog/posts/hello-world/","section":"Blog","summary":"Production-grade Python, coding with AI, and other things on my mind.","title":"What I'll Be Writing About","type":"blog"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":" oedokumaci/python-production-template AI-native Copier template for production-ready Python projects Jinja 3 1 oedokumaci/gale-shapley-algorithm Python implementation of the Gale-Shapley Algorithm TypeScript 9 2 ","externalUrl":null,"permalink":"/code/","section":"Oral Ersoy Dokumaci","summary":" oedokumaci/python-production-template AI-native Copier template for production-ready Python projects ","title":"Code","type":"page"}]