To open this project log, a brief note on motivation. The initial idea came to me almost immediately: first, using LSTMs again didn't feel worthwhile, and I wanted to try something newer; second, during an early course on nuclear models I came across the ENSDF database and thought it might be a natural testbed for diffusion models. After some investigation, I eventually shifted to a transformer-based model instead.
Overall, I see this as an incremental model: adding LoRA updates along the time axis to mimic how science has historically advanced step by step in measuring nuclear physics parameters. From this perspective, it is reasonable to expect future measurements of new isotopes and their parameters to follow the same trend.
For model setup and loading, I relied on PyTorch packages. It’s still the mainstream choice right now, and at each step I also used some AI-assisted code generation to speed up learning and prototyping.
The entire pipeline includes: data cleaning → data processing → model input → model construction → regression prediction. Every step had to be done manually, and the data cleaning part in particular was painful. The raw files contain not only the values we need, but also a lot of unnecessary comments and low-quality experimental notes.
My first idea was simple row-wise parsing: use the first few characters of each line to separate record categories, and then extract basic values such as energy. As a modeling strategy, the simplest baseline was a multi-output regression model predicting energies across the main record types (L for levels, G for gammas, Q for Q-values). This served as the starting point for the G/Q training.
Example data sample:
112PD G 737.2 3 20 14 A
112PD L 890.3 3 0(+)
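The row-wise parsing idea can be sketched as follows. This is a minimal illustration against the whitespace-separated samples above, not the real fixed-width ENSDF layout; the function name `parse_card` is mine.

```python
# Minimal sketch of row-wise parsing: the nuclide ID and the record-type
# letter (L, G, Q) identify the category, and the first numeric field
# after the type is taken as the energy.

def parse_card(line: str):
    """Return (nuclide, record_type, energy), or None for unusable lines."""
    parts = line.split()
    if len(parts) < 3 or parts[1] not in {"L", "G", "Q"}:
        return None
    try:
        energy = float(parts[2])
    except ValueError:
        return None  # skip cards whose energy field is non-numeric
    return parts[0], parts[1], energy

print(parse_card("112PD  G  737.2  3  20  14  A"))  # ('112PD', 'G', 737.2)
print(parse_card("112PD  L  890.3  3  0(+)"))       # ('112PD', 'L', 890.3)
```

Lines that fail either check are simply dropped, which already implements the "discard records without an energy entry" rule described later.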
At this stage, the workflow still feels overly complex. My thought is to first focus on a single energy regression baseline using only part of the dataset, and then gradually expand to multi-output predictions later. Training should start from a simple base model, then progressively add LoRA updates.
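The "base model plus progressive LoRA updates" idea can be sketched as a frozen linear layer with a trainable low-rank correction. This is only an illustration of the technique, not the layers I actually use; `LoRALinear`, `rank`, and `alpha` are names I chose here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha/r) B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 1), rank=4)
out = layer(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 1])
```

One such update could be trained per "time step" of the data, which is what makes the incremental setup cheap: only A and B receive gradients.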
This also matches the incremental-learning goal: since ENSDF adds new nuclei every year, splitting training and testing sets by year is a natural setup.
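A year-based split is straightforward to sketch. The `year` field and the record values here are illustrative placeholders, not real ENSDF metadata; the point is only that everything up to a cutoff year trains the model and later additions form the test set.

```python
# Hypothetical records tagged with an evaluation year (illustrative values).
records = [
    {"nuclide": "112PD", "energy": 737.2, "year": 2015},
    {"nuclide": "112PD", "energy": 890.3, "year": 2019},
    {"nuclide": "114PD", "energy": 332.6, "year": 2022},
]

cutoff = 2020
train = [r for r in records if r["year"] <= cutoff]  # nuclei known by the cutoff
test  = [r for r in records if r["year"] > cutoff]   # later additions to predict
print(len(train), len(test))  # 2 1
```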
As for future plans, my idea is to extend this into a sequence generation framework, where training is organized by isotope categories. The goal would be to eventually use and predict the full set of L, G, and Q records, instead of only the energy-level differences I am working with now.
Of course, this will likely require significant architectural updates, additional learning on my part, and much stronger computational resources. At the moment, with only an RTX 3090 Ti on hand, it is simply not realistic to complete the full plan.
Data cleaning itself is not particularly difficult, but because the ENSDF source files are both numerous and messy, I ran into many challenges. I have been working on this project for about three months, and nearly half of that time has been spent revisiting and reconsidering the data cleaning stage.
In my version 1.0 cleaning pipeline, I only kept the L-cards and discarded all sub-cards, since our primary focus is on energy. As a result, I removed every record that did not contain an energy entry.
Within each L-card, the position of the fields is fixed and strictly defined. I followed the official ENSDF format manual (BNL Format Manual, p.22), and retained all information available there as candidate input features.
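Since the fields sit at fixed columns, extraction reduces to string slicing. A sketch is below; the column ranges are the ones I recall for L-cards (1-indexed, inclusive: 1–5 NUCID, 8 record type, 10–19 energy E, 20–21 uncertainty DE, 22–39 spin/parity J) and should be double-checked against the format manual before use. `FIELDS` and `parse_l_card` are names I chose here.

```python
# Fixed-width extraction for an L-card. Python slices are 0-indexed and
# end-exclusive, so manual column c maps to index c-1.
FIELDS = {
    "nucid": (0, 5),    # cols 1-5
    "rtype": (7, 8),    # col 8
    "E":     (9, 19),   # cols 10-19
    "DE":    (19, 21),  # cols 20-21
    "J":     (21, 39),  # cols 22-39
}

def parse_l_card(line: str) -> dict:
    line = line.ljust(80)  # pad short lines to the full 80-column card
    return {name: line[a:b].strip() for name, (a, b) in FIELDS.items()}

# Build a correctly aligned card programmatically rather than by hand:
card = "112PD".ljust(7) + "L" + " " + "890.3".ljust(10) + "3 " + "0(+)"
print(parse_l_card(card))
# {'nucid': '112PD', 'rtype': 'L', 'E': '890.3', 'DE': '3', 'J': '0(+)'}
```

Keeping the column map in one dictionary makes it easy to add further fields from the manual later as extra candidate features.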
All of this work was carried out inside a single Jupyter Notebook, clean.ipynb, which currently handles both processing and output. I realize that using a notebook for such a task is not the most professional or efficient approach, and the file itself still contains a lot of redundant steps and quick tests. In the future, I plan to refactor this into a standalone .py script for clarity and maintainability.
Below is a list of all notebook cells, along with their functions and explanations: