A Real-World Solution to Escape Embedded Double Quotes in JSON

UPDATED ON JAN 30, 2024 : 956 words, 5 minute read — SOFTWARE DEVELOPMENT

The problem: you need to decode a JSON string, but at some point in the process you don’t control, unescaped double quotes are inserted into your string values. How might you sanitize the string to get a valid JSON object for decoding?

⚠️ The first answer I saw on Stack Overflow was “Go make the upstream service give you valid JSON, [duh you stupid idiot]”. Helpful! Sometimes you’re stuck in crappy situations and need a way forward, even if it’s not ideal. Hopefully this article acts as the answer I wish I had found.

Why 🔗︎

Why am I bothering to do this? JSON with unescaped double quotes in it isn’t valid JSON, so shouldn’t it be getting sanitized before reaching me? Unfortunately, no. As a part of building Workflow Buddy, an open-source tool to extend Slack’s no-code workspace automation, I need to handle untrusted variables being inserted into plain-text JSON strings before I have a chance to decode them.

An actual example of this in action looks like:

# Example with variable.
{
  "key": "{{65591853-edfe-4721-856d-ecd157766461==plain_text_user_input}}"
}

then becomes

# Example with variable inserted with double quotes scattered inside.
{
"key": "value with "two quotes" inside"
}

This makes the standard JSON parser go 💥, because how is it supposed to know you didn’t want key: "value with " and just forgot to add another key after?

Why do I let people build their own JSON strings in plain-text input blocks 🔗︎

It provides the most flexibility for end users. I don’t want my arbitrary decisions to block them from the critical last-mile of what they’re building. Whether you think that’s a stupid reason or not, this article is moving on without you.

Potential approaches 🔗︎

As the intrepid developer I am, I checked Stack Overflow first and came up with squat that felt useful. As all heroes do, after crying in the bathroom, I closed all my SO tabs and started frantically coming up with ideas. Gotta dig through some garbage to find the gold, right?

  • 🤮 Randomly change a quote to escaped, then see if decode works (slow and inefficient).
  • 😖 Walk through string chars, keep a STATE to check if we are in a key or a value - (how to actually tell if we are in a key/value vs just finding JSON special characters, e.g. {}",, in a string?).
  • 🤔🥇 Make an educated guess for which quote we could escape, then try to decode again and see what happens (Not the worst idea to explore….).

Escape double quotes in JSON with Python 🔗︎

This Python version follows the principle of “Ask forgiveness, not permission”. When we get a decoding failure, we make an educated guess about which quote we can tweak (e.g. convert to escaped, \"), then attempt to decode again. If the decoder made deeper progress into the string characters, it was a positive change. If not, then we really mucked things up and should bubble the error up.

def sanitize_unescaped_quotes_and_load_json_str(s: str, strict=False) -> dict:  # type: ignore
    # TODO: one thing this doesn't handle, is if the unescaped text includes valid JSON - then you're just out of luck
    js_str = s
    prev_pos = -1
    curr_pos = 0
    while curr_pos > prev_pos:
        # after while check, move marker before we overwrite it
        prev_pos = curr_pos
        try:
            return json.loads(js_str, strict=strict)
        except json.JSONDecodeError as err:
            curr_pos = err.pos
            if curr_pos <= prev_pos:
                # previous change didn't make progress, so error
                raise err

            # find the previous " before e.pos
            prev_quote_index = js_str.rfind('"', 0, curr_pos)
            # escape it to \"
            js_str = js_str[:prev_quote_index] + "\\" + js_str[prev_quote_index:]

Workflow Buddy source file: utils.py

Limitations 🔗︎

Is this a perfect solution? Nope, you’ll have to consider if the limitations screw up your use case. The most important I’ve run into is if you provide valid JSON inside of your string, you’ll end up with different top level keys in your object than you expect. For example:

 # valid unescaped JSON nested inside string - won't error, but won't come out as you want.
 
{
"key": "JSON inside me {"a": "b","c": "d"} "
}

What you’ll get is:

# Woh, now we have a 'c' key!
{
"key": "JSON inside me {\"a\": \"b",
"c": "d\"} "
}

If you are curious about digging deeper, feel free to check out the test cases for the parser in the Workflow Buddy repo. . Function is sut.sanitize_unescaped_quotes_and_load_json_str(test_str).

Escape double quotes in JSON with Javascript (WIP) 🔗︎

⚠️ Not completed yet, but dropping my thoughts from initial exploration here, since it seemed possible to get very similar functionality to the Python sanitizer.

When a Decode error occurs, you get access to e.message, example:

JSON.parse: expected ',' or '}' after property value in object at line 1 column 39 of the JSON data

and and could then parse it for the line 1 col 39 bit. Unfortunately, but that’s not the exact same as pos in the Python example. You should be able to get creative and “hop” lines by looking for next \n, then using the column info. Ideally we would find a way to do it without needing any information from the decoder other than success/failure - then any language would be able to use it.

WIP JSON testing 🔗︎

const json = '{"result":true, "count":42, "str": "a"bc"}';
try {
  const obj = JSON.parse(json);
  console.log(obj);
} catch(e) {
  console.error(e instanceof SyntaxError);
  console.error(e.message);
  console.error(e.name);
  console.error(e.fileName);
  console.error(e.lineNumber); // line number where error happened in code, not in parsing
  console.error(e.columnNumber); // col where error happened in code, not in parsing
  console.error(e.stack);
}

Tell me why I’m wrong 🔗︎

As this is the internet, someone reading this will have an opinion about my approach. Let me know if you have suggestions or alternative approaches through Twitter DMs or my email proxy .


See Also