To open this project log, a brief note on motivation. The initial idea came to me almost immediately: first, using LSTMs again didn't feel worthwhile, and I wanted to try something newer; second, during an early course on nuclear models I came across the ENSDF database and thought it might be a natural testbed for diffusion models. After some investigation, I eventually shifted to a transformer-based model instead.
Overall, I see this as an incremental model: adding LoRA updates along the time axis to mimic how science has historically advanced step by step in measuring nuclear physics parameters. From this perspective, it is reasonable to expect future measurements of new isotopes and their parameters to follow the same trend.
For model setup and loading, I relied on PyTorch packages. It’s still the mainstream choice right now, and at each step I also used some AI-assisted code generation to speed up learning and prototyping.
The entire pipeline includes: data cleaning → data processing → model input → model construction → regression prediction. Every step had to be done manually, and the data cleaning part in particular was painful. The raw files contain not only the values we need, but also a lot of unnecessary comments and low-quality experimental notes.
My first idea was simple row-wise parsing: use the first few characters of each line to separate record categories, and then extract basic values such as energy. As a modeling strategy, the simplest baseline was a multi-output regression model predicting energies across the main record types (L for levels, G for gammas, Q for Q-values). This served as the starting point for the G/Q training.
Example data sample:
112PD G 737.2 3 20 14 A
112PD L 890.3 3 0(+)
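The row-wise parsing idea can be sketched as follows. This is a minimal illustration against the whitespace-separated samples above, not the real fixed-width ENSDF layout; the function name `parse_card` is mine.

```python
# Minimal sketch of row-wise parsing: the nuclide ID and the record-type
# letter (L, G, Q) identify the category, and the first numeric field
# after the type is taken as the energy.

def parse_card(line: str):
    """Return (nuclide, record_type, energy), or None for unusable lines."""
    parts = line.split()
    if len(parts) < 3 or parts[1] not in {"L", "G", "Q"}:
        return None
    try:
        energy = float(parts[2])
    except ValueError:
        return None  # skip cards whose energy field is non-numeric
    return parts[0], parts[1], energy

print(parse_card("112PD  G  737.2  3  20  14  A"))  # ('112PD', 'G', 737.2)
print(parse_card("112PD  L  890.3  3  0(+)"))       # ('112PD', 'L', 890.3)
```

Lines that fail either check are simply dropped, which already implements the "discard records without an energy entry" rule described later.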
At this stage, the workflow still feels overly complex. My thought is to first focus on a single energy regression baseline using only part of the dataset, and then gradually expand to multi-output predictions later. Training should start from a simple base model, then progressively add LoRA updates.
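The "base model plus progressive LoRA updates" idea can be sketched as a frozen linear layer with a trainable low-rank correction. This is only an illustration of the technique, not the layers I actually use; `LoRALinear`, `rank`, and `alpha` are names I chose here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 1), rank=4)
out = layer(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 1])
```

One such update could be trained per "time step" of the data, which is what makes the incremental setup cheap: only A and B receive gradients.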
This also matches the incremental-learning goal: since ENSDF adds new nuclei every year, splitting training and testing sets by year is a natural setup.
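A year-based split is straightforward to sketch. The `year` field and the record values here are illustrative placeholders, not real ENSDF metadata; the point is only that everything up to a cutoff year trains the model and later additions form the test set.

```python
# Hypothetical records tagged with an evaluation year (illustrative values).
records = [
    {"nuclide": "112PD", "energy": 737.2, "year": 2015},
    {"nuclide": "112PD", "energy": 890.3, "year": 2019},
    {"nuclide": "114PD", "energy": 332.6, "year": 2022},
]

cutoff = 2020
train = [r for r in records if r["year"] <= cutoff]  # nuclei known by the cutoff
test  = [r for r in records if r["year"] > cutoff]   # later additions to predict
print(len(train), len(test))  # 2 1
```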
As for future plans, my idea is to extend this into a sequence generation framework, where training is organized by isotope categories. The goal would be to eventually use and predict the full set of L, G, and Q records, instead of only the energy-level differences I am working with now.
Of course, this will likely require significant architectural updates, additional learning on my part, and much stronger computational resources. At the moment, with only an RTX 3090 Ti on hand, it is simply not realistic to complete the full plan.
Data cleaning itself is not particularly difficult, but because the ENSDF source files are both numerous and messy, I ran into many challenges. I have been working on this project for about three months, and nearly half of that time has been spent revisiting and reconsidering the data cleaning stage.
In my version 1.0 cleaning pipeline, I only kept the L-cards and discarded all sub-cards, since our primary focus is on energy. As a result, I removed every record that did not contain an energy entry.
Within each L-card, the position of the fields is fixed and strictly defined. I followed the official ENSDF format manual (BNL Format Manual, p.22), and retained all information available there as candidate input features.
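Since the fields sit at fixed columns, extraction reduces to string slicing. A sketch is below; the column ranges are the ones I recall for L-cards (1-indexed, inclusive: 1–5 NUCID, 8 record type, 10–19 energy E, 20–21 uncertainty DE, 22–39 spin/parity J) and should be double-checked against the format manual before use. `FIELDS` and `parse_l_card` are names I chose here.

```python
# Fixed-width extraction for an L-card. Python slices are 0-indexed and
# end-exclusive, so manual column c maps to index c-1.
FIELDS = {
    "nucid": (0, 5),    # cols 1-5
    "rtype": (7, 8),    # col 8
    "E":     (9, 19),   # cols 10-19
    "DE":    (19, 21),  # cols 20-21
    "J":     (21, 39),  # cols 22-39
}

def parse_l_card(line: str) -> dict:
    line = line.ljust(80)  # pad short lines to the full 80-column card
    return {name: line[a:b].strip() for name, (a, b) in FIELDS.items()}

# Build a correctly aligned card programmatically rather than by hand:
card = "112PD".ljust(7) + "L" + " " + "890.3".ljust(10) + "3 " + "0(+)"
print(parse_l_card(card))
# {'nucid': '112PD', 'rtype': 'L', 'E': '890.3', 'DE': '3', 'J': '0(+)'}
```

Keeping the column map in one dictionary makes it easy to add further fields from the manual later as extra candidate features.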
All of this work was carried out inside a single Jupyter Notebook, clean.ipynb, which currently handles both processing and output. I realize that using a notebook for such a task is not the most professional or efficient approach, and the file itself still contains a lot of redundant steps and quick tests. In the future, I plan to refactor this into a standalone .py script for clarity and maintainability.
Below is a list of all notebook cells, along with their functions and explanations: