
CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic

Attack Pattern ID: 80
Abstraction: Detailed
+ Description
This attack is a specific variation on leveraging alternate encodings to bypass validation logic. The attacker encodes potentially harmful input in UTF-8 and submits it to an application that does not expect this encoding or does not validate it effectively, which makes input filtering difficult. UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Legal UTF-8 characters are one to four bytes long. However, early versions of the UTF-8 specification got some entries wrong (in some cases they permitted overlong characters). UTF-8 encoders are supposed to use the "shortest possible" encoding, but naive decoders may accept encodings that are longer than necessary. According to RFC 3629, a particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input but interprets certain illegal octet sequences as characters.
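The difference between a naive and a strict decoder can be sketched in a few lines of Python. This is an illustration only: `naive_decode_two_byte` is a hypothetical, deliberately unsafe helper written for this example, while `bytes.decode("utf-8")` is Python's built-in strict decoder.

```python
def naive_decode_two_byte(seq: bytes) -> str:
    """Decode a 2-byte UTF-8 form WITHOUT enforcing shortest form (unsafe on purpose)."""
    lead, trail = seq  # expects exactly two bytes
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 form")
    # Combine the 5 payload bits of the lead byte with the 6 of the trail byte.
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


def strict_decode(seq: bytes) -> str:
    """Python's built-in decoder raises UnicodeDecodeError on ill-formed input."""
    return seq.decode("utf-8")


overlong_slash = b"\xc0\xaf"
print(repr(naive_decode_two_byte(overlong_slash)))  # '/'
try:
    strict_decode(overlong_slash)
except UnicodeDecodeError as exc:
    print("strict decoder rejected the input:", exc.reason)
```

The naive version maps two bytes to a character that should only ever be encoded as one byte, which is exactly the gap a filter working on the encoded form can miss.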
+ Likelihood Of Attack


+ Typical Severity


+ Relationships
ChildOf: Standard Attack Pattern 267, Leverage Alternate Encoding
PeerOf: Detailed Attack Pattern 64, Using Slashes and URL Encoding Combined to Bypass Validation Logic
PeerOf: Detailed Attack Pattern 71, Using Unicode Encoding to Bypass Validation Logic
+ Execution Flow
  1. Survey the application for user-controllable inputs: Using a browser or an automated tool, an attacker follows all public links and actions on a web site. They record all the links, the forms, the resources accessed and all other potential entry-points for the web application.

    Use a spidering tool to follow and record all links and analyze the web pages to find entry points. Make special note of any links that include parameters in the URL.
    Use a proxy tool to record all user input entry points visited during a manual traversal of the web application.
    Use a browser to manually explore the website and analyze how it is constructed. Many browser plugins are available to facilitate the analysis or automate the discovery.
  2. Probe entry points to locate vulnerabilities: The attacker uses the entry points gathered in the "Explore" phase as a target list and injects various UTF-8 encoded payloads to determine whether an entry point actually represents a vulnerability with insufficient validation logic, and to characterize the extent to which the vulnerability can be exploited.

    Try to use UTF-8 encoding of content in scripts in order to bypass validation routines.
    Try to use UTF-8 encoding of content in HTML in order to bypass validation routines.
    Try to use UTF-8 encoding of content in CSS in order to bypass validation routines.
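When testing your own application for this weakness, it helps to have the non-shortest encodings of a sensitive character on hand. The following is a minimal sketch; `encode_with_length` and `strict_decoder_rejects` are hypothetical helper names chosen for this example, not part of any library.

```python
def encode_with_length(code_point: int, length: int) -> bytes:
    """Encode code_point in exactly `length` bytes (2 to 4), ignoring the shortest-form rule."""
    if length not in (2, 3, 4):
        raise ValueError("length must be 2, 3, or 4")
    payload_bits = {2: 11, 3: 16, 4: 21}[length]
    if not 0 <= code_point < (1 << payload_bits):
        raise ValueError("code point does not fit in the requested length")
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[length]
    trail = []
    for _ in range(length - 1):
        trail.append(0x80 | (code_point & 0x3F))  # low 6 bits go into a trail byte
        code_point >>= 6
    return bytes([lead_marker | code_point] + trail[::-1])


def strict_decoder_rejects(data: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder refuses the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


for n in (2, 3, 4):
    candidate = encode_with_length(ord("/"), n)
    print(n, candidate.hex(), strict_decoder_rejects(candidate))
```

A conforming decoder should reject every one of these forms for an ASCII character. If the component under test accepts any of them, it meets the first prerequisite of this attack pattern.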
+ Prerequisites
The application's UTF-8 decoder accepts and interprets illegal UTF-8 sequences or non-shortest-form UTF-8 encodings.
Input filtering and validation are not done properly, which allows harmful characters to reach the target host.
+ Skills Required
[Level: Low]
An attacker can inject a different representation of a filtered character in UTF-8 format.
[Level: Medium]
An attacker may craft subtle encodings of input data by using the knowledge that they have gathered about the target host.
+ Indicators
A web page that contains overlong UTF-8 sequences constitutes a protocol anomaly and could indicate that an attacker is attempting to exploit a vulnerability on the target host.
An attacker can use a fuzzer to probe for a UTF-8 encoding vulnerability. Fuzzing typically generates suspicious network activity that an intrusion detection system can notice.
An IDS filtering network traffic may be able to detect illegal UTF-8 characters.
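A detection rule of this kind can be approximated with a simple byte scan. The sketch below is a heuristic, not a full validator: it assumes the data is supposed to be UTF-8 text, and `find_suspicious_utf8` is a hypothetical name. It flags bytes that never occur in well-formed UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF) and the lead/trail pairs that begin overlong three- and four-byte forms.

```python
from typing import List, Tuple


def find_suspicious_utf8(data: bytes) -> List[Tuple[int, str]]:
    """Return (offset, reason) pairs for byte patterns that are illegal in UTF-8."""
    findings = []
    for i, b in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        if b in (0xC0, 0xC1):
            # These leads can only introduce an overlong 2-byte form.
            findings.append((i, "overlong 2-byte lead"))
        elif b >= 0xF5:
            findings.append((i, "byte never valid in UTF-8"))
        elif b == 0xE0 and nxt is not None and 0x80 <= nxt <= 0x9F:
            findings.append((i, "overlong 3-byte form"))
        elif b == 0xF0 and nxt is not None and 0x80 <= nxt <= 0x8F:
            findings.append((i, "overlong 4-byte form"))
    return findings


print(find_suspicious_utf8(b"/scripts/..\xc0\xaf../winnt"))
```

On binary payloads that are not meant to be UTF-8, a scan like this will produce false positives, so it is best applied to fields that are expected to carry text, after any percent-decoding.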
+ Consequences
Access Control
Bypass Protection Mechanism
Execute Unauthorized Commands
Modify Data
Unreliable Execution
+ Mitigations
The Unicode Consortium recognized multiple representations to be a problem and has revised the Unicode Standard to make multiple representations of the same code point with UTF-8 illegal. The UTF-8 Corrigendum lists the newly restricted UTF-8 range (see References). Many current applications may not have been revised to follow this rule. Verify that your application conforms to the latest UTF-8 encoding specification. Pay extra attention to the filtering of illegal characters.

The exact response required from a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  • 1. Insert a replacement character (e.g. '?', '').
  • 2. Ignore the bytes.
  • 3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).
  • 4. Not notice and decode as if the bytes were some similar bit of UTF-8.
  • 5. Stop decoding and report an error (possibly giving the caller the option to continue).

It is possible for a decoder to behave in different ways for different types of invalid input.
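
Several of the behaviors listed above can be observed with Python's built-in codecs. This is a sketch for illustration: `decoder_behaviours` is a hypothetical helper, while the `errors="replace"`, `errors="ignore"`, and default strict modes are standard options of `bytes.decode`. They correspond to behaviors 1, 2, and 5, and decoding as `latin-1` stands in for behavior 3.

```python
def decoder_behaviours(data: bytes) -> dict:
    """Show how different error-handling choices treat the same bytes."""
    results = {
        "replace": data.decode("utf-8", errors="replace"),  # behavior 1
        "ignore": data.decode("utf-8", errors="ignore"),    # behavior 2
        "latin-1 fallback": data.decode("latin-1"),         # behavior 3
    }
    try:
        results["strict"] = data.decode("utf-8")            # behavior 5
    except UnicodeDecodeError as exc:
        results["strict"] = f"error at byte {exc.start}: {exc.reason}"
    return results


for mode, outcome in decoder_behaviours(b"a\xc0\xafb").items():
    print(f"{mode:>16}: {outcome!r}")

# Silently dropping bytes can assemble a string the filter never saw:
print(b"<scr\xc0ipt>".decode("utf-8", errors="ignore"))  # <script>
```

The last line shows why ignoring invalid bytes is risky when validation runs on the encoded form: the decoded result may contain a token that was not present in the bytes the filter inspected.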

RFC 3629 only requires that UTF-8 decoders must not decode "overlong sequences" (where a character is encoded in more bytes than needed but otherwise follows the UTF-8 byte patterns). The Unicode Standard requires a Unicode-compliant decoder to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.

To maintain security in the case of invalid input, there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless. Another possibility is to avoid conversion out of UTF-8 altogether but this relies on any other software that the data is passed to safely handling the invalid data.
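
The ordering issue can be made concrete with a small sketch. Both function names are hypothetical and the path check is intentionally simplistic. `flawed_check` inspects only the encoded bytes, while `decode_then_validate` decodes strictly first and validates the resulting text.

```python
def flawed_check(raw: bytes) -> bool:
    """Validates the ENCODED form only; an overlong '/' slips through."""
    return b"../" not in raw and b"/.." not in raw


def decode_then_validate(raw: bytes) -> bool:
    """Decode strictly first, then validate the decoded text."""
    try:
        text = raw.decode("utf-8")  # strict mode: ill-formed input raises
    except UnicodeDecodeError:
        return False                # treat invalid input as an error
    return ".." not in text.split("/")


payload = b"..\xc0\xaf..\xc0\xafwinnt"
print(flawed_check(payload))          # True: the byte-level filter is bypassed
print(decode_then_validate(payload))  # False: rejected during decoding
```

If a later component decodes the payload with a lenient decoder, the byte-level filter has already approved a path that resolves to `../../winnt`. Decoding first removes that mismatch.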

Another consideration is error recovery. To guarantee correct recovery after corrupt or lost bytes, decoders must be able to recognize the difference between lead and trail bytes, rather than just assuming that bytes will be of the type allowed in their position.
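
Telling lead bytes from trail bytes requires only the top bits of each byte. The following is a minimal sketch with hypothetical helper names; it classifies a byte and finds the next position where decoding can safely resume after a lost or corrupt byte.

```python
def classify(byte: int) -> str:
    """Classify a single byte by its role in UTF-8."""
    if byte < 0x80:
        return "ascii"      # 0xxxxxxx
    if byte < 0xC0:
        return "trail"      # 10xxxxxx
    if byte in (0xC0, 0xC1) or byte > 0xF4:
        return "invalid"    # never appears in well-formed UTF-8
    return "lead"           # 110xxxxx, 1110xxxx, 11110xxx


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that is not a trail byte."""
    while pos < len(data) and classify(data[pos]) == "trail":
        pos += 1
    return pos


# The lead byte of a two-byte character was lost; resume at the next boundary.
damaged = b"\xa9 ok"
print(next_boundary(damaged, 0))  # 1
```

A decoder that assumes the byte at a given position is a trail byte without checking would instead consume whatever follows a truncated sequence, which can swallow a delimiter or quote character.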

For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. If you use a parser to decode UTF-8, make sure the parser filters out invalid UTF-8 sequences (invalid forms or overlong forms).
Look for overlong UTF-8 sequences that begin with a malicious pattern. You can also use a UTF-8 decoder stress test to test your UTF-8 parser (see Markus Kuhn's UTF-8 and Unicode FAQ in the References section).
Assume all input is malicious. Create an allowlist that defines all valid input to the software system based on the requirements specifications. Input that does not match against the allowlist should not be permitted to enter into the system. Test your decoding process against malicious input.
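An allowlist is usually expressed as a pattern that the whole, already decoded input must match. The sketch below assumes a hypothetical requirement that file names consist of ASCII letters, digits, underscore, or hyphen with a `.txt` or `.html` extension; the pattern and the name `accept_filename` are illustrative only.

```python
import re

# Hypothetical allowlist: 1 to 64 safe characters plus an approved extension.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}\.(?:txt|html)")


def accept_filename(text: str) -> bool:
    """Accept only input that matches the allowlist in its entirety."""
    return ALLOWED_NAME.fullmatch(text) is not None


# Exercise the check with hostile and benign samples.
for sample in ["report_2024.txt", "../cmd.exe", "a/b.txt", "index.html\n"]:
    print(repr(sample), accept_filename(sample))
```

Using `fullmatch` rather than `search` matters here: anything outside the allowed alphabet, including path separators and trailing newlines, causes rejection instead of being overlooked.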
+ Example Instances

Perhaps the most famous UTF-8 attack was against unpatched Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker made a request that looked like this:

http://servername/scripts/..%c0%af../winnt/system32/cmd.exe

the server didn't correctly handle %c0%af in the URL. What do you think %c0%af means? It's 11000000 10101111 in binary. Broken up using the UTF-8 mapping rules for a two-byte sequence (110xxxxx 10xxxxxx), the payload bits are 00000 and 101111. Therefore, the character is 00000101111, or 0x2F, the slash (/) character! The %c0%af is an invalid UTF-8 representation of the / character. Such an invalid UTF-8 escape is often referred to as an overlong sequence.
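
The bit arithmetic above can be reproduced step by step. This is a sketch for illustration: `explain_two_byte_escape` is a hypothetical helper, and `unquote_to_bytes` is the standard-library function from `urllib.parse` that turns percent-escapes into raw bytes.

```python
from urllib.parse import unquote_to_bytes


def explain_two_byte_escape(escape: str):
    """Return (bit strings, payload bits, character) for a 2-byte percent-escape."""
    raw = unquote_to_bytes(escape)
    if len(raw) != 2 or raw[0] >> 5 != 0b110 or raw[1] >> 6 != 0b10:
        raise ValueError("expected a 2-byte form: 110xxxxx 10xxxxxx")
    bits = [f"{b:08b}" for b in raw]
    # Drop the '110' marker from the lead byte and the '10' marker from the trail byte.
    payload = bits[0][3:] + bits[1][2:]
    return bits, payload, chr(int(payload, 2))


bits, payload, char = explain_two_byte_escape("%c0%af")
print(bits)     # ['11000000', '10101111']
print(payload)  # 00000101111
print(char)     # /
```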

So when the attacker requested the tainted URL, they accessed

http://servername/scripts/../../winnt/system32/cmd.exe

In other words, they walked out of the script's virtual directory, which is marked to allow program execution, up to the root and down into the system32 directory, where they could pass commands to the command shell, Cmd.exe.

See also: CVE-2000-0884
+ Taxonomy Mappings
Relevant to the ATT&CK taxonomy mapping (see parent)
+ References
[REF-1] G. Hoglund and G. McGraw. "Exploiting Software: How to Break Code". Addison-Wesley. 2004-02.
[REF-112] David Wheeler. "Secure Programming for Linux and Unix HOWTO". 5.9. Character Encoding.
[REF-530] Michael Howard and David LeBlanc. "Writing Secure Code". Chapter 12. Microsoft Press.
[REF-531] Bruce Schneier. "Security Risks of Unicode". Crypto-Gram Newsletter. 2000-07-15.
[REF-532] "Wikipedia". UTF-8. The Wikimedia Foundation, Inc.
[REF-533] F. Yergeau. "RFC 3629 - UTF-8, a transformation format of ISO 10646". 2003-11.
[REF-114] Eric Hacker. "IDS Evasion with Unicode". 2001-01-03.
[REF-535] "Corrigendum #1: UTF-8 Shortest Form". The Unicode Standard. Unicode, Inc. 2001-03.
[REF-525] Markus Kuhn. "UTF-8 and Unicode FAQ for Unix/Linux". 1999-06-04.
[REF-537] Markus Kuhn. "UTF-8 decoder capability and stress test". 2003-02-19.
+ Content History
Submission Date | Submitter | Organization
(Version 2.6) | CAPEC Content Team | The MITRE Corporation
Modification Date | Modifier | Organization
(Version 2.12) | CAPEC Content Team | The MITRE Corporation
    Updated References
(Version 3.3) | CAPEC Content Team | The MITRE Corporation
    Updated Example_Instances, Execution_Flow, Mitigations, Skills_Required
(Version 3.5) | CAPEC Content Team | The MITRE Corporation
    Updated Related_Weaknesses
(Version 3.8) | CAPEC Content Team | The MITRE Corporation
    Updated Example_Instances, Mitigations
Page Last Updated or Reviewed: July 31, 2018